Improving identification methods
Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012
There are some systematic issues with the identification tools, outlined here. They are fairly large issues, but might be worth considering.
Many formats require some degree of parsing before their contents can be identified, ranging from DOCX (a specialised ZIP container, which DROID and Tika both handle as such) to media codecs (which are well supported by other tools such as ffprobe).
There are three issues here:
- Whether we try to sync up how the different tools work (e.g. port DROID's container file signatures and turn them into a Tika Detector module).
- Whether we try to formalise the integration of Tika, DROID, etc. into an overall ID workflow, so that ffprobe and similar tools can be reliably called in as needed.
- Whether we can use the MIME type codecs parameter to capture the identification information (see http://tools.ietf.org/html/rfc4281).
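As a rough illustration of the points above, the following Python sketch looks inside a ZIP container to tell DOCX apart from a plain ZIP, and formats a MIME type with a codecs parameter in the style of RFC 4281. It is not how DROID or Tika implement container signatures. The marker table and the function names identify_container and with_codecs are hypothetical, chosen only for this example, and the codec identifiers in the usage line are illustrative.

```python
import io
import zipfile

# Hypothetical marker table: an entry name inside the ZIP mapped to a MIME type.
# Real container signatures (such as DROID's) are considerably richer than this.
OOXML_MARKERS = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def identify_container(data):
    """Return a MIME type for raw bytes, looking inside ZIP containers."""
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        # Not a ZIP at all; a fuller workflow would hand off to other tools here.
        return "application/octet-stream"
    with zipfile.ZipFile(buf) as zf:
        names = set(zf.namelist())
    # OOXML packages carry a [Content_Types].xml entry at the root.
    if "[Content_Types].xml" in names:
        for marker, mime in OOXML_MARKERS.items():
            if marker in names:
                return mime
    return "application/zip"


def with_codecs(mime_type, codecs):
    """Append a codecs parameter to a MIME type; return it unchanged if none are given."""
    if not codecs:
        return mime_type
    joined = ", ".join(codecs)
    return '{}; codecs="{}"'.format(mime_type, joined)


if __name__ == "__main__":
    print(with_codecs("video/3gpp", ["s263", "samr"]))
```

In a fuller workflow, the list of codecs passed to with_codecs would come from a media-aware tool such as ffprobe rather than being hard-coded, and the non-ZIP branch would fall through to signature-based identification.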