Batch OCR & Search

From CURATEcamp
Revision as of 20:57, 26 July 2012 by Chris Adams (talk | contribs) (Import of Brendan's notes)
  • Can an existing piece of text be mapped to the layout from an original image?
    • Tesseract can support this.
    • Very valuable for old manuscripts
  • OCR error rate depends on the quality of the writing and the quality of the scan.
  • Some people are using Hadoop to distribute some of the OCR and analysis.
  • Possibility of author tracking (via handwriting style tracking) being integrated into mainstream tools?
  • Possibility of revision tracking built into the OCR’d content’s metadata?
  • How much OCR metadata to include?
  • Human input vs automation vs hybrid approach.
  • Linking OCR’d content back to catalog records is a straightforward approach.
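
The note above that OCR error rate depends on writing and scan quality presumes some way of measuring that rate. One common measure is the character error rate: the edit distance between a hand-checked transcription and the OCR output, divided by the length of the transcription. The following is a minimal sketch of that idea and not something discussed in the session; the function names and the sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the trusted reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical example: the OCR engine misreads "m" as "rn".
cer = character_error_rate("manuscript", "rnanuscript")
print(f"character error rate: {cer:.2f}")
```

Comparing this figure across batches is one way to decide where human correction is worth the effort and where automation alone is good enough.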


Projects mentioned:

  • ActiveOCR: corrections are fed back into the software, which learns to become more accurate.
  • Tesseract
    • Lots of languages
    • Skew on images can produce poor results
  • OCRopus
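
The note above that skew can produce poor results suggests estimating and correcting the skew angle before running OCR. The following is a minimal sketch of one generic technique, a projection-profile search, run on synthetic data. It is an illustration only and does not describe how Tesseract or OCRopus handle skew internally; all function names, the candidate angle range and the synthetic "page" are assumptions.

```python
import math


def rotate(points, angle_deg):
    """Rotate (x, y) points about the origin by angle_deg degrees."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def make_skewed_lines(angle_deg, n_lines=8, line_len=200, spacing=20):
    """Synthetic page: dark pixels along parallel text baselines, then skewed."""
    flat = [(x, k * spacing) for k in range(n_lines) for x in range(line_len)]
    return rotate(flat, angle_deg)


def profile_score(points, candidate_deg):
    """Undo a candidate skew and score how sharply pixels pile up in rows."""
    counts = {}
    for _, y in rotate(points, -candidate_deg):
        row = round(y)
        counts[row] = counts.get(row, 0) + 1
    # Sum of squared row counts is largest when text lines are horizontal.
    return sum(v * v for v in counts.values())


def estimate_skew(points, max_deg=5.0, step=0.25):
    """Try candidate angles in [-max_deg, max_deg] and keep the best score."""
    n = int(max_deg / step)
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: profile_score(points, a))


page = make_skewed_lines(2.0)  # hypothetical page skewed by 2 degrees
print("estimated skew (degrees):", estimate_skew(page))
```

On a real scan the points would come from the dark pixels of a binarized image, and the page would be rotated by the negative of the estimated angle before being passed to the OCR engine.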