Difference between revisions of "Improving format ID coverage"
Gary McGath (talk | contribs) m (→Tracking format ID coverage improvements: Put a needed apostrophe after my name (it just kept bugging me :)) |
Andy Jackson (talk | contribs) (→Easier signature generation: Added Percipio link.) |
Revision as of 23:36, 23 October 2012
Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012
We want to be able to identify a wider range of formats. We could do this simply by improving the existing tools, such as Apache Tika, DROID, and the fine free file command. Some guides on how to do this are referenced here. In particular, the National Archives signature development documentation provides useful information on the issues to consider when developing signatures (IDEA: import the best bits here?).
Note that this does not address formats that are poorly covered by the 'file magic' methodology, such as container formats (e.g. DOCX is identified as a special ZIP), or text formats (which no tool covers well). See also Improving identification methods.
Despite these limitations, this would be very useful, especially if combined with some effort on Collecting format ID test files. However, to make the most of the effort, there are some things we could consider doing beforehand to speed things up.
Easier signature generation
DROID signatures are tricky to develop, and developing signatures for each tool independently seems unnecessarily cumbersome. So, we could consider developing some tooling to support an improved signature development workflow:
- Create Tika-compatible signatures.
  - Provide documentation on how to create them.
  - Note that a tool exists to generate candidate signatures directly from example files. See How to run Percipio.
  - Make a simple tool to make it easy to test them (this is very nearly core Tika functionality, just needs a decent CLI tool - initial experiment here but should probably be moved into the format-corpus codebase).
- Collect them to make submission easier.
  - Pool them, e.g. in a GitHub repository like this.
- Generate signatures for other tools from the Tika ones.
  - Example, almost complete code for doing this for DROID is here, leveraging the utility provided by the National Archives.
  - Should be fairly easy to do this for file magic.
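As a rough illustration of the last point, the following Python sketch converts simple Tika-style signatures into entries for the file command's magic format. It is not the code linked above, just a minimal sketch under several assumptions: the sample XML is hypothetical, the function names are made up for this example, and only plain string matches at a fixed offset are handled (masked, nested, numeric or escaped patterns are skipped rather than translated, and characters other than spaces that are special in magic files are not handled).

```python
import xml.etree.ElementTree as ET

# Hypothetical pooled signature fragment in Tika's mime-info XML style.
SAMPLE = """<mime-info>
  <mime-type type="application/pdf">
    <glob pattern="*.pdf"/>
    <magic priority="50">
      <match value="%PDF-" type="string" offset="0"/>
    </magic>
  </mime-type>
  <mime-type type="image/x-example">
    <magic priority="50">
      <match value="EX AMPLE" type="string" offset="4"/>
    </magic>
  </mime-type>
  <mime-type type="application/x-numeric-example">
    <magic priority="50">
      <match value="0xcafebabe" type="big32" offset="0"/>
    </magic>
  </mime-type>
</mime-info>"""


def tika_to_file_magic(xml_text):
    """Return magic-style entries for simple top-level string matches."""
    entries = []
    root = ET.fromstring(xml_text)
    for mime_type in root.iter("mime-type"):
        name = mime_type.get("type")
        for match in mime_type.findall("magic/match"):
            value = match.get("value", "")
            offset = match.get("offset", "0")
            # Skip anything this sketch cannot translate faithfully:
            # non-string types, offset ranges, masks, nested (AND) matches,
            # and values with escape sequences.
            if match.get("type") != "string" or not offset.isdigit():
                continue
            if match.get("mask") is not None or len(match) > 0 or "\\" in value:
                continue
            # Spaces separate fields in magic files, so escape them.
            test = value.replace(" ", "\\ ")
            # Use the MIME type as the description, then declare it via !:mime.
            entries.append(f"{offset}\tstring\t{test}\t{name}\n!:mime\t{name}")
    return entries


if __name__ == "__main__":
    print("\n".join(tika_to_file_magic(SAMPLE)))
```

The generated entries would still need checking against the file command itself before submission, since the two tools do not necessarily interpret patterns in exactly the same way.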
Tracking format ID coverage improvements
In order to encourage contributions and focus efforts, it would be good to be able to expose the coverage of the different tools. At the most basic, this could just track the number of types, file extensions and signatures in the pooled Tika signature file. Ideally, this could compare different tools, expose differences so they can be resolved, and refer to broader format lists so that we can more quickly identify gaps across all tools. Some prototype code for parsing Tika and DROID is available here.
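The basic counts described above could be gathered with something like the following Python sketch. It is not the prototype code mentioned above; the sample XML, the function names and the second tool's type list are all hypothetical. It also assumes that tools can be compared by MIME type, which is a simplification (DROID, for instance, is organised around PRONOM identifiers), and it counts only top-level match elements as signatures.

```python
import xml.etree.ElementTree as ET

# Hypothetical pooled signature file in Tika's mime-info XML style.
POOLED = """<mime-info>
  <mime-type type="application/pdf">
    <glob pattern="*.pdf"/>
    <magic priority="50">
      <match value="%PDF-" type="string" offset="0"/>
    </magic>
  </mime-type>
  <mime-type type="image/gif">
    <glob pattern="*.gif"/>
    <magic priority="50">
      <match value="GIF8" type="string" offset="0"/>
    </magic>
  </mime-type>
  <mime-type type="text/x-example">
    <glob pattern="*.ex"/>
    <glob pattern="*.exm"/>
  </mime-type>
</mime-info>"""


def pooled_types(xml_text):
    """Return the set of MIME types declared in the signature file."""
    root = ET.fromstring(xml_text)
    return {mt.get("type") for mt in root.iter("mime-type")}


def coverage(xml_text):
    """Count types, extension globs and top-level magic matches."""
    root = ET.fromstring(xml_text)
    mime_types = list(root.iter("mime-type"))
    return {
        "types": len({mt.get("type") for mt in mime_types}),
        "extensions": sum(len(mt.findall("glob")) for mt in mime_types),
        "signatures": sum(len(mt.findall("magic/match")) for mt in mime_types),
    }


def gaps(reference, candidate):
    """Return the types present in reference but absent from candidate."""
    return sorted(set(reference) - set(candidate))


if __name__ == "__main__":
    # Hypothetical list of types known to some other tool.
    other_tool = {"application/pdf", "image/gif", "image/tiff"}
    print(coverage(POOLED))
    print("Missing from pooled file:", gaps(other_tool, pooled_types(POOLED)))
    print("Missing from other tool:", gaps(pooled_types(POOLED), other_tool))
```

Publishing numbers like these after each batch of contributions would give a simple view of how coverage changes over time.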
This is quite close to the ideas behind Gary McGath's Registry Browser, which could be repurposed as a web resource that lets us compare format information held in different 'registries'. It currently includes DBPedia, linked data PRONOM, etc. It could also be extended to interact with FreeBase (example query, FreeBase API info).