Batch OCR & Search

From CURATEcamp
  • Can an existing piece of text be mapped to the layout from an original image?
    • Tesseract can support this.
    • Very valuable for old manuscripts
  • OCR error rate depends on the quality of the writing and the quality of the scan.
  • Some people are using Hadoop to distribute some of the OCR and analysis.
  • Possibility of author tracking (via handwriting style tracking) being integrated into mainstream tools?
  • Possibility of revision tracking built into the OCR’d content’s metadata?
  • How much OCR metadata to include?
  • Human input vs automation vs hybrid approach.
  • Linking OCR’d content back to catalog records is a straightforward approach.
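
One common way to put a number on the OCR error rate mentioned above is the character error rate (CER): the edit distance between a trusted transcription and the OCR output, divided by the length of the trusted text. The sketch below is a minimal illustration, not something presented in the session; the function names and sample strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the trusted reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: the OCR engine misread "ri" as "n".
    print(character_error_rate("manuscript", "manuscnpt"))
```

Computing CER on a small hand-corrected sample per collection is one way to compare scans of differing quality before deciding how much human correction is worthwhile.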


Projects mentioned:

  • ActiveOCR: corrections are fed back into the software, which learns to become more accurate.
  • Tesseract
    • Lots of languages
    • Skew on images can produce poor results
  • OCRopus
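
On the question of mapping an existing piece of text onto the layout of the original image: OCR engines such as Tesseract can emit word-level bounding boxes (for example in hOCR output). One possible approach, sketched below under that assumption, is to align the trusted text against the noisy OCR words and carry the boxes across. The function name, the box format, and the rule of pairing words positionally inside equal-length mismatched runs are all assumptions made for illustration, not a method described in the session.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom) in pixels


def map_text_to_boxes(
    trusted_words: List[str],
    ocr_words: List[Tuple[str, Box]],
) -> List[Tuple[str, Optional[Box]]]:
    """Give each trusted word the box of its aligned OCR word, or None if unaligned."""
    trusted_tokens = [w.lower() for w in trusted_words]
    ocr_tokens = [w.lower() for w, _ in ocr_words]
    matcher = SequenceMatcher(a=trusted_tokens, b=ocr_tokens, autojunk=False)

    result: List[Tuple[str, Optional[Box]]] = [(w, None) for w in trusted_words]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Identical runs align one to one. For mismatched runs of equal length,
        # assume each OCR word is a misreading of the word in the same position.
        if tag == "equal" or (tag == "replace" and same_length):
            for k in range(i2 - i1):
                result[i1 + k] = (trusted_words[i1 + k], ocr_words[j1 + k][1])
    return result


if __name__ == "__main__":
    # Hypothetical OCR output with one misread word.
    ocr = [("the", (0, 0, 10, 5)), ("qu1ck", (12, 0, 30, 5)),
           ("brown", (32, 0, 50, 5)), ("fox", (52, 0, 60, 5))]
    for word, box in map_text_to_boxes(["the", "quick", "brown", "fox"], ocr):
        print(word, box)
```

Words that the OCR engine dropped or split come back with no box, which leaves a natural hook for the human-versus-automation question: only the unaligned words need manual placement.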