Improving identification methods

From CURATEcamp

Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012

The identification tools share some systematic issues, outlined here. They are rather big issues, but might be worth considering.

'Container' Format

Many formats require some degree of parsing before their contents can be understood, from DOCX (which is a special kind of ZIP, and which DROID and Tika handle as such) through to media codecs (which are well supported by other tools like ffprobe).
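As a rough illustration of what "handling DOCX as a special ZIP" involves, the sketch below opens a byte stream as a ZIP and looks for entries that mark it as a word-processing package. It is a simplified heuristic written for this page, not DROID's container signatures or Tika's detector logic; the function names identify_container and make_zip are hypothetical.

```python
import io
import zipfile

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def identify_container(data: bytes) -> str:
    """Guess a MIME type by looking inside a ZIP container (illustrative heuristic only)."""
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        # Not a ZIP at all, so this sketch has nothing more to say.
        return "application/octet-stream"
    with zipfile.ZipFile(buf) as zf:
        names = set(zf.namelist())
        # Assumption: these two entries are treated as sufficient evidence of DOCX.
        if "[Content_Types].xml" in names and "word/document.xml" in names:
            return DOCX_MIME
        # Some ZIP-based formats carry their own type in a 'mimetype' entry.
        if "mimetype" in names:
            return zf.read("mimetype").decode("ascii", "replace").strip()
    return "application/zip"


def make_zip(entries: dict) -> bytes:
    """Build an in-memory ZIP from {entry name: text} for experimenting."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    return out.getvalue()


if __name__ == "__main__":
    sample = make_zip({"[Content_Types].xml": "<Types/>", "word/document.xml": "<w/>"})
    print(identify_container(sample))
```

A plain magic-number check would stop at "application/zip" for the same bytes, which is the gap the container-aware tools are trying to close.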

There are three issues here:

  • Whether we try to sync up how the different tools work (e.g. port DROID's container file signatures and turn them into a Tika Detector module).
  • Whether we try to formalise the integration of Tika/DROID/etc. into an overall ID workflow so that ffprobe/etc. can be reliably called in as needed.
  • Whether we can use MIME Type codec parameters to capture the identification information (See http://tools.ietf.org/html/rfc4281).
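
To make the codecs-parameter idea concrete, here is a minimal sketch of attaching codec information to a MIME type and reading it back, using only the Python standard library. The helper names with_codecs and parse_codecs are hypothetical, and the codec values in the usage example are purely illustrative; how an ID workflow would obtain them (for example from ffprobe output) is left open.

```python
from email.message import Message


def with_codecs(mime_type: str, codecs: list) -> str:
    """Append a codecs parameter to a bare MIME type (no parameter if the list is empty)."""
    if not codecs:
        return mime_type
    joined = ", ".join(codecs)
    return f'{mime_type}; codecs="{joined}"'


def parse_codecs(content_type: str) -> tuple:
    """Split a Content-Type style string into (mime type, list of codec identifiers)."""
    msg = Message()
    msg["Content-Type"] = content_type
    raw = msg.get_param("codecs")  # None when the parameter is absent
    codecs = [c.strip() for c in raw.split(",")] if raw else []
    return msg.get_content_type(), codecs


if __name__ == "__main__":
    extended = with_codecs("video/3gpp2", ["sevc", "s263"])
    print(extended)
    print(parse_codecs(extended))
```

The appeal of this approach is that the container type and the contained codecs travel together in one string that existing MIME-aware software can still parse.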


Text-based formats

All known tools are bad at identifying most text formats. SGML/HTML/XML are fine (at least when well formed), but CSS, JavaScript, compressed JavaScript, CSV, Bash, Python, etc. are all poorly identified by the existing tools. We could do with collecting format ID test files of these types to make testing easier, but it's simply not clear how to do it.
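
For discussion, the sketch below shows the kind of shallow heuristics that are available for text formats, and why they are fragile. It is not how any existing tool works: the function name guess_text_format is hypothetical, the rules are assumptions chosen for brevity, and the returned type strings for shell and Python are common unregistered labels rather than official ones.

```python
import re


def guess_text_format(text: str) -> str:
    """Very rough guess at a text format (illustrative heuristics only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "text/plain"

    first = lines[0]
    # A shebang line names the interpreter explicitly, so it is the easy case.
    if first.startswith("#!"):
        if re.search(r"\b(ba)?sh\b", first):
            return "application/x-sh"
        if re.search(r"\bpython[0-9.]*\b", first):
            return "text/x-python"

    # CSV guess: several lines, all with the same non-zero number of commas.
    # Quoted fields containing commas will defeat this.
    comma_counts = {ln.count(",") for ln in lines}
    if len(lines) > 1 and len(comma_counts) == 1 and comma_counts != {0}:
        return "text/csv"

    return "text/plain"


if __name__ == "__main__":
    print(guess_text_format("#!/usr/bin/env python3\nprint('hi')\n"))
    print(guess_text_format("a,b,c\n1,2,3\n"))
```

Scripts without a shebang, CSS, and minified JavaScript all fall through to "text/plain" here, which is exactly the weakness a shared corpus of test files would help to measure.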

Added an idea for discussion --Gary McGath 15:13, 23 October 2012 (PDT)

See also: DROID can identify text based formats such as source code and scripting languages.