Improving format ID coverage

From CURATEcamp
Revision as of 12:10, 16 November 2012 by Andy Jackson (talk | contribs) (Easier signature generation)

We want to be able to identify a wider range of formats. We could do this simply by improving the existing tools, such as Apache Tika, DROID, and the fine free file command. Some guides on how to do this are referenced here. In particular, the National Archives signature development documentation provides useful information on the issues to consider when developing signatures (IDEA: import the best bits here?).

Note that this does not address formats that are poorly covered by the 'file magic' methodology, such as container formats (e.g. DOCX is identified as a special ZIP), or text formats (which no tool covers well). See also Improving identification methods.

Despite these limitations, this would be very useful, especially if combined with some effort on Collecting format ID test files. However, to make the most of the effort, there are some things we could consider doing beforehand to speed things up.

Easier signature generation

DROID signatures are tricky to develop, and developing signatures for each tool independently seems unnecessarily cumbersome. So, we could consider building some tooling to support a better signature development workflow:

  • Create Tika-compatible signatures.
    • Provide documentation on how to create them, using the Tika-compatible mime-info standard.
    • Note that a tool exists to generate candidate signatures directly from example files. See How to run Percipio.
    • Make a simple tool that makes them easy to test (this is very nearly core Tika functionality). See Using Fidget for an introduction to a tool that simplifies this.
    • There's also a web version of Fidget, if you can't run the tool locally: http://www.opf-labs.org:9081/fidget/
  • Collect them to make submission easier
    • Pool them, e.g. in a GitHub repository like this.
  • Generate signatures for other tools from the Tika ones
    • Fidget has some basic support for generating DROID signatures from Tika/Mime-Info ones.
    • It should be fairly easy to do the same for file magic.
    • The Fido codebase turns DROID signatures into regular expressions, which could then be used by Tika. Note, however, that most tools (including Tika) only support beginning-of-file magic.
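
To make the idea of a Tika-compatible signature and a simple test tool more concrete, here is a minimal Python sketch. It is not Tika or Fidget code. It assumes the element and attribute names of the freedesktop shared-mime-info layout (mime-type, magic, match, glob), and the sample XML, `load_signatures` and `identify` are hypothetical names made up for illustration. It only handles top-level matches of type "string" with plain values, which is a small subset of what real signature files contain.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the mime-info layout (an assumption, not copied
# from any tool's signature file).
SAMPLE_MIME_INFO = """<mime-info>
  <mime-type type="application/pdf">
    <glob pattern="*.pdf"/>
    <magic priority="50">
      <match value="%PDF-" type="string" offset="0"/>
    </magic>
  </mime-type>
</mime-info>
"""


def children(elem, name):
    """Direct children with the given local name, ignoring any XML namespace."""
    return [c for c in elem if c.tag.rsplit("}", 1)[-1] == name]


def parse_offset(text):
    """Return (start, end) for an offset attribute such as '0' or '0:64'."""
    start, _, end = text.partition(":")
    return int(start), int(end or start)


def load_signatures(xml_text):
    """Collect (mime type, magic bytes, first offset, last offset) tuples."""
    root = ET.fromstring(xml_text)
    signatures = []
    for mime_type in children(root, "mime-type"):
        for magic in children(mime_type, "magic"):
            # Only top-level string matches; nested (AND) matches are ignored.
            for match in children(magic, "match"):
                if match.get("type") != "string":
                    continue
                start, end = parse_offset(match.get("offset", "0"))
                value = match.get("value").encode("latin-1")
                signatures.append((mime_type.get("type"), value, start, end))
    return signatures


def identify(data, signatures):
    """Return the first mime type whose magic bytes occur at an allowed offset."""
    for mime, value, start, end in signatures:
        for offset in range(start, end + 1):
            if data.startswith(value, offset):
                return mime
    return None
```

With the sample above, `identify(b"%PDF-1.4\n", load_signatures(SAMPLE_MIME_INFO))` yields `application/pdf`, while a ZIP header such as `b"PK\x03\x04"` yields `None`. Because the check only looks near the start of the data, it reflects the beginning-of-file limitation mentioned in the list above.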

Tracking format ID coverage improvements

In order to encourage contributions and focus efforts, it would be good to be able to expose the coverage of the different tools. At the most basic, this could just track the number of types, file extensions and signatures in the pooled Tika signature file. Ideally, it could also compare different tools, expose differences so that they can be resolved, and refer to broader format lists so that we can more quickly identify gaps across all tools. Some prototype code for parsing Tika and DROID is available here.
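
As a rough illustration of the most basic kind of tracking, the following Python sketch counts types, glob patterns and magic matches in a mime-info style file, and lists types that appear in a broader format list but not in a tool. It is not the prototype code mentioned above. The sample XML, the `application/x-example` type, and the `coverage` and `gaps` functions are hypothetical, and the element names assume the freedesktop shared-mime-info layout.

```python
import xml.etree.ElementTree as ET

# Illustrative pooled signature file (an assumption, not real pooled data).
SAMPLE_POOL = """<mime-info>
  <mime-type type="application/pdf">
    <glob pattern="*.pdf"/>
    <magic priority="50">
      <match value="%PDF-" type="string" offset="0"/>
    </magic>
  </mime-type>
  <mime-type type="application/x-example">
    <glob pattern="*.exm"/>
    <glob pattern="*.example"/>
  </mime-type>
</mime-info>
"""


def local_name(elem):
    """Tag name without any XML namespace prefix."""
    return elem.tag.rsplit("}", 1)[-1]


def coverage(xml_text):
    """Count types, distinct glob patterns and magic matches in a mime-info file."""
    root = ET.fromstring(xml_text)
    types = [e for e in root if local_name(e) == "mime-type"]
    globs = set()
    matches = 0
    types_with_magic = 0
    for mime_type in types:
        # Walk the whole subtree so nested matches are counted as well.
        type_matches = 0
        for elem in mime_type.iter():
            name = local_name(elem)
            if name == "glob":
                globs.add(elem.get("pattern"))
            elif name == "match":
                type_matches += 1
        matches += type_matches
        if type_matches:
            types_with_magic += 1
    return {
        "types": len(types),
        "globs": len(globs),
        "matches": matches,
        "types_with_magic": types_with_magic,
    }


def gaps(known_types, tool_types):
    """Types present in a broader format list but missing from a tool."""
    return sorted(set(known_types) - set(tool_types))
```

Running `coverage(SAMPLE_POOL)` on the sample reports two types, three glob patterns and one magic match, which shows that one of the two types is identified by extension only. The same `gaps` helper could be applied to type lists extracted from any two tools, provided they use a common naming scheme for formats.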

This is quite close to the ideas behind Gary McGath's Registry Browser (source code here), which could be repurposed as a web resource that lets us compare format information held in different 'registries'. It currently includes DBPedia, linked data PRONOM, etc. It could also be extended to interact with FreeBase (example query, FreeBase API info).