Batch OCR & Search

From CURATEcamp

Revision as of 20:57, 26 July 2012 by Chris Adams (Import of Brendan's notes)



Can an existing piece of text be mapped to the layout from an original image?
- Tesseract can support this.
- Very valuable for old manuscripts
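
One way to approximate this mapping outside of any particular OCR engine is to align the known transcript against the OCR engine's word output and borrow the OCR words' bounding boxes. The sketch below is an illustration only, not Tesseract's own mechanism: the function name `map_text_to_layout` and the `(text, bbox)` input shape are assumptions, and the alignment uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def map_text_to_layout(known_words, ocr_words):
    """Pair each known word with the bounding box of the OCR word it aligns to.

    known_words: list of words from the trusted transcript.
    ocr_words:   list of (text, bbox) tuples, bbox = (x0, y0, x1, y1).
    Returns a list of (known_word, bbox_or_None) in transcript order.
    """
    known_tokens = [word.lower() for word in known_words]
    ocr_tokens = [text.lower() for text, _ in ocr_words]
    matcher = SequenceMatcher(a=known_tokens, b=ocr_tokens, autojunk=False)

    # Start with no layout information for any word.
    mapped = [(word, None) for word in known_words]
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        # Exact matches, and same-length substitutions (likely OCR misreads),
        # can take over the OCR word's box position by position.
        if tag == "equal" or (tag == "replace" and a1 - a0 == b1 - b0):
            for offset in range(a1 - a0):
                word = known_words[a0 + offset]
                mapped[a0 + offset] = (word, ocr_words[b0 + offset][1])
    return mapped
```

Words that the OCR pass missed entirely keep `None` as their box, so a later step (or a person) can place them by hand.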

OCR error rate depends on the quality of the writing and the quality of the scan.
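
To make "error rate" concrete, a common measure is the character error rate: the edit distance between a trusted transcription and the OCR output, divided by the length of the transcription. The notes do not say which measure was discussed, so the sketch below is an assumption, and the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("manuscript", "manuscr1pt")` counts one substitution over ten characters.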

Some people are using Hadoop to distribute some of the OCR and analysis.
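
The notes do not record how those Hadoop jobs were set up. One plausible shape, assuming Hadoop Streaming (where a mapper reads lines from standard input and writes tab-separated key/value lines to standard output) and the `tesseract` command-line tool installed on each worker, is sketched below. The input format (one image path per line) and all function names are assumptions.

```python
import subprocess
import sys


def tesseract_cli(image_path):
    """Run the tesseract command-line tool and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def map_lines(lines, ocr=tesseract_cli):
    """Yield one 'path<TAB>text' record per non-empty input line."""
    for line in lines:
        image_path = line.strip()
        if not image_path:
            continue
        text = ocr(image_path)
        # Collapse whitespace so each record stays on a single output line.
        yield f"{image_path}\t{' '.join(text.split())}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming mapper script would call."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")
```

Passing the OCR step in as a parameter keeps the mapper testable without Tesseract installed; a real mapper script would end by calling `main()`.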

Possibility of author tracking (via handwriting style tracking) being integrated into mainstream tools?

Possibility of revision tracking built into the OCR’d content’s metadata?

How much OCR metadata to include?
Human input vs automation vs hybrid approach.
Linking OCR’d content back to catalog records is a straightforward and easy approach.

hOCR - Standard output format for OCR’d content. http://en.wikipedia.org/wiki/HOCR
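
hOCR embeds layout in ordinary HTML: each recognized word is typically a `span` with class `ocrx_word` whose `title` attribute carries properties such as `bbox x0 y0 x1 y1`. As a minimal sketch of reading that back, the example below uses only Python's standard `html.parser`. The class and helper names are illustrative, the sample markup is made up, and the parser assumes words do not contain nested `span` elements.

```python
from html.parser import HTMLParser

SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 122'>
 <span class='ocrx_word' id='word_1' title='bbox 36 92 96 116; x_wconf 95'>Batch</span>
 <span class='ocrx_word' id='word_2' title='bbox 109 92 175 116; x_wconf 93'>OCR</span>
</span>
"""


def parse_title(title):
    """Split an hOCR title like 'bbox 1 2 3 4; x_wconf 95' into a dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) for every span marked as an hOCR word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            props = parse_title(attributes.get("title") or "")
            if "bbox" in props:
                self._bbox = tuple(int(value) for value in props["bbox"])
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) tuples from hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling `hocr_words(SAMPLE)` yields the two words with their boxes, which is the kind of data a search index could use to highlight hits on the page image.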

OMR (optical music recognition) - http://en.wikipedia.org/wiki/Music_OCR

Projects mentioned:

ActiveOCR - Corrections are fed back into the software, and it learns to become more accurate.
Tesseract
- Lots of languages
- Skew on images can produce poor results
OCRopus
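
On the skew point noted under Tesseract: a common preprocessing step is to estimate the skew angle and rotate the image before OCR. The sketch below illustrates one textbook approach, the projection-profile method, on a plain list of ink-pixel coordinates. It is not how any of the tools above do it internally; the function name and input format are assumptions.

```python
import math


def estimate_skew(points, candidate_angles):
    """Return the candidate angle (in degrees) that best aligns text rows.

    points: iterable of (x, y) coordinates of dark pixels.
    For each candidate angle, every point is rotated back by that angle and
    counted by the row it lands on. Straight text lines pile their pixels
    into few rows, which makes the sum of squared row counts large.
    """
    points = list(points)
    best_angle, best_score = None, -1
    for angle in candidate_angles:
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        row_counts = {}
        for x, y in points:
            # Vertical position after undoing a rotation of `angle` degrees.
            row = round(y * cos_t - x * sin_t)
            row_counts[row] = row_counts.get(row, 0) + 1
        score = sum(count * count for count in row_counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The resolution of the estimate is limited by the candidate list, so a coarse sweep followed by a finer one around the best angle is a reasonable refinement.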

Retrieved from "https://wiki.curatecamp.org/index.php?title=Batch_OCR_%26_Search&oldid=1903"
