Batch OCR & Search: Brendan's notes, imported to the CurateCamp wiki (https://wiki.curatecamp.org/index.php?title=Batch_OCR_%26_Search) by Chris Adams, 2012-07-26<br />
<div>* Can an existing piece of text, such as a transcription, be mapped onto the layout of the original image?<br />
** Tesseract can support this.<br />
** Very valuable for old manuscripts<br />
<br />
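As a rough illustration of mapping an existing text onto an image layout, the sketch below aligns a known-good transcription against OCR output that already carries word bounding boxes, and copies the boxes across. This is not how Tesseract does it; the function name `transfer_boxes`, the `(text, box)` tuple shape, and the choice of Python's `difflib` for alignment are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def transfer_boxes(ocr_words, transcript_words):
    """Attach OCR bounding boxes to the words of a trusted transcription.

    ocr_words: list of (text, (x0, y0, x1, y1)) tuples (hypothetical shape).
    transcript_words: list of known-correct words in reading order.
    Returns a list of (word, box_or_None) pairs.
    """
    ocr_texts = [text.lower() for text, _ in ocr_words]
    ref_texts = [word.lower() for word in transcript_words]
    matcher = SequenceMatcher(a=ocr_texts, b=ref_texts, autojunk=False)

    boxes = [None] * len(transcript_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # "equal": the words match exactly. "replace" with equal lengths:
        # assume a one-to-one misread, so the box is still usable.
        if tag == "equal" or (tag == "replace" and same_length):
            for offset in range(i2 - i1):
                boxes[j1 + offset] = ocr_words[i1 + offset][1]
        # Inserted or deleted words get no box (left as None).
    return list(zip(transcript_words, boxes))


ocr = [("Tbe", (0, 0, 10, 5)), ("quick", (12, 0, 30, 5)), ("fox", (32, 0, 40, 5))]
aligned = transfer_boxes(ocr, ["The", "quick", "brown", "fox"])
```

Words the OCR engine missed entirely (here "brown") come back with no box, so a real workflow would still need interpolation or human review for those gaps.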
* OCR error rate depends on the quality of the original writing and the quality of the scan.<br />
<br />
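The notes do not say how error rate was measured. One common measure is character error rate (CER): the edit distance between the OCR output and a trusted transcription, divided by the length of the transcription. The sketch below assumes that definition; the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum single-character edits between two strings."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the trusted reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Comparing CER across scans of the same page at different resolutions is one way to separate scan quality from the legibility of the writing itself.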
* Some people are using Hadoop to distribute some of the OCR and analysis.<br />
<br />
* Possibility of author tracking (via handwriting style tracking) being integrated into mainstream tools?<br />
<br />
* Possibility of revision tracking built into the OCR’d content's metadata?<br />
<br />
* How much OCR metadata to include?<br />
* Human input vs automation vs hybrid approach.<br />
* Linking OCR’d content back to catalog records is a straightforward approach.<br />
<br />
* hOCR - An open standard output format for OCR’d content that embeds layout and recognition data in HTML. http://en.wikipedia.org/wiki/HOCR<br />
<br />
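hOCR stores layout in ordinary HTML attributes: each element's `title` holds properties such as `bbox x0 y0 x1 y1`, and words are commonly marked with the class `ocrx_word`. A minimal sketch of pulling word boxes out of such a file using only the Python standard library follows. The sample markup and the helper names are made up for the example, and the parser assumes word spans contain no nested `span` elements.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for elements with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._box = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocrx_word" in classes:
            self._box = parse_bbox(attributes.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._box is not None:
            self.words.append(("".join(self._text).strip(), self._box))
            self._box = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = (
    "<span class='ocr_line' title='bbox 36 92 580 122'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 109 92 179 116; x_wconf 90'>world</span>"
    "</span>"
)
words = extract_words(sample)
```

Because the coordinates travel with the text, the same file can feed both a search index and a highlight overlay on the page image.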
* OMR (optical music recognition) - http://en.wikipedia.org/wiki/Music_OCR <br />
<br />
<br />
Projects mentioned:<br />
<br />
* [http://mith.umd.edu/research/project/active-ocr/ ActiveOCR]: corrections are fed back into the software, which learns from them to become more accurate.<br />
* [http://code.google.com/p/tesseract-ocr/ Tesseract]<br />
** Supports many languages<br />
** Skewed images can produce poor results<br />
* [http://code.google.com/p/ocropus/ OCRopus]</div>