Batch OCR & Search
From CURATEcamp
- Can an existing piece of text be mapped to the layout from an original image?
- Tesseract can support this.
- Very valuable for old manuscripts
- OCR error rate depends on quality of the writing and quality of a scan.
- Some people are using Hadoop to distribute some of the OCR and analysis.
- Possibility of author tracking (via handwriting style tracking) being integrated into mainstream tools?
- Possibility of revision tracking built into the OCR’d contents metadata?
- How much OCR metadata to include?
- Human input vs automation vs hybrid approach.
- Linking OCR’d content back to catalog records is straightforward and easy approach.
- hOCR - Standard output format for OCR’d content. http://en.wikipedia.org/wiki/HOCR
- OMR (optical music recognition) - http://en.wikipedia.org/wiki/Music_OCR
Projects mentioned: