Collecting format ID test files

From CURATEcamp
Revision as of 13:26, 29 October 2012 by Andy Jackson (talk | contribs)

Perhaps the best way of improving format ID coverage would be to drive the work with a corpus of test files of known type, each with documentation and declared metadata, so that we can test tools against that information (e.g. a content type expressed as an extended MIME type, such as: application/pdf; version="A-1a").
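To make the idea of testing a tool against a declared extended MIME type concrete, here is a minimal Python sketch. The function names (parse_extended_mime, matches) and the matching rule are assumptions made for illustration, not part of any existing tool: a reported type is accepted when its base type equals the declared one and every parameter the tool reports agrees with the declared value. The parser is deliberately naive and does not handle semicolons inside quoted values.

```python
def parse_extended_mime(value):
    """Split 'type/subtype; key="value"' into (base_type, params).

    Naive on purpose: assumes no ';' appears inside a quoted value.
    """
    base, *rest = [part.strip() for part in value.split(";")]
    params = {}
    for item in rest:
        if not item:
            continue
        key, _, raw = item.partition("=")
        params[key.strip().lower()] = raw.strip().strip('"')
    return base.lower(), params


def matches(declared, reported):
    """Hypothetical rule: base types must be equal, and any parameter
    the tool reports must agree with the declared metadata."""
    declared_base, declared_params = parse_extended_mime(declared)
    reported_base, reported_params = parse_extended_mime(reported)
    if declared_base != reported_base:
        return False
    return all(
        declared_params.get(key) == value
        for key, value in reported_params.items()
    )


declared = 'application/pdf; version="A-1a"'
print(matches(declared, "application/pdf"))                  # less specific, still consistent
print(matches(declared, 'application/pdf; version="1.4"'))   # version disagrees
```

Under this rule a tool that reports only application/pdf is treated as consistent with the declared type but less specific; a stricter harness could instead require every declared parameter to be reported.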

I started doing this on GitHub, in this format-corpus repository. Efforts are underway to make submission easier for non-technical users, but it may make sense to collect files via GitHub too.

For example, we could augment the way we develop signatures (see Improving format ID coverage) by also collecting test files under CC0 or similar licences, and by setting up a monitoring process that runs the new Tika signatures against the test documents and checks the results.
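A monitoring process of this kind could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual setup: check_corpus and the manifest layout (file path mapped to declared type) are hypothetical, and identify_by_magic is a toy stand-in that checks two well-known magic numbers. In practice the identify callable would wrap Tika or another identification tool.

```python
import tempfile
from pathlib import Path


def check_corpus(manifest, identify):
    """Run `identify` on each file and collect disagreements.

    manifest: dict mapping file path -> declared MIME type (assumed layout).
    identify: callable taking a Path and returning a MIME type string.
    Returns a list of (path, declared, reported) tuples for mismatches.
    """
    failures = []
    for path, declared in manifest.items():
        reported = identify(Path(path))
        if reported != declared:
            failures.append((path, declared, reported))
    return failures


def identify_by_magic(path):
    """Toy stand-in for a real identification tool (assumption)."""
    head = path.read_bytes()[:8]
    if head.startswith(b"%PDF-"):
        return "application/pdf"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return "application/octet-stream"


# Demo with two throwaway files: one valid PDF header, one mislabelled file.
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "sample.pdf"
    pdf.write_bytes(b"%PDF-1.4\n")
    png = Path(tmp) / "sample.png"
    png.write_bytes(b"not really a png")
    manifest = {str(pdf): "application/pdf", str(png): "image/png"}
    failures = check_corpus(manifest, identify_by_magic)
    for path, declared, reported in failures:
        print(f"MISMATCH {Path(path).name}: declared {declared}, got {reported}")
```

Run on a schedule or on every signature change, a non-empty failures list would flag either a regression in the signatures or a wrongly declared test file.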

Possible Sources

A number of places might hold suitable source files. Their licensing is not always clear; where it is not, the files could be referenced by URL rather than copied into the corpus.