Collecting format ID test files

From CURATEcamp
Revision as of 22:53, 22 October 2012 by Andy Jackson (talk | contribs) (Initial outline of collection ideas.)

Perhaps the best way to drive the work on Improving format ID coverage would be a corpus of test files of known type, each with documentation and declared metadata, so that we can test tools against that information (e.g. a content type expressed as an extended MIME type, like: application/pdf; version="A-1a").
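To make the extended MIME type idea concrete, here is a minimal sketch of how a declared type could be split into its base type and parameters so that each part can be compared separately. It uses Python's standard email package to do the parsing; the function name parse_extended_mime is only an illustration and is not part of any existing tool.

```python
from email.message import Message


def parse_extended_mime(value):
    """Split an extended MIME type into (base_type, parameters).

    Hypothetical helper for illustration: it relies on the standard
    library's Content-Type header parsing rather than any
    format-identification tool.
    """
    msg = Message()
    msg["Content-Type"] = value
    base_type = msg.get_content_type()
    # get_params() returns the base type first, then the parameters.
    params = dict(msg.get_params()[1:])
    return base_type, params


base_type, params = parse_extended_mime('application/pdf; version="A-1a"')
print(base_type, params)
```

Separating the two parts would let a test report "right format, wrong version" differently from a complete misidentification.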

I started doing this via GitHub, in this format-corpus repository. There are some efforts underway to make submission easier for non-technical users, but it may make sense to collect files via GitHub too.

For example, we could augment the way we develop signatures (see Improving format ID coverage) by also collecting test files under CC0 or similar licences, and setting up a monitoring process to run the new Tika signatures against the test documents and check the results.
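The monitoring process described above could be sketched roughly as follows. This is only an outline under stated assumptions: the declared types are assumed to be available as a mapping from file path to type string, and identify stands in for whatever tool is under test (for instance a wrapper around Tika). Both check_corpus and identify are hypothetical names, and a plain string comparison is used where a real check might compare base type and version separately.

```python
def check_corpus(declared, identify):
    """Compare declared types against what an identification tool reports.

    declared: dict mapping file path -> declared type string (assumed layout).
    identify: callable taking a path and returning a type string
              (hypothetical stand-in for the tool being tested).
    Returns a list of (path, expected, actual) tuples for every mismatch.
    """
    mismatches = []
    for path, expected in sorted(declared.items()):
        actual = identify(path)
        if actual != expected:
            mismatches.append((path, expected, actual))
    return mismatches


# Example run with a stub identifier that labels everything as PDF.
declared_types = {
    "a.pdf": "application/pdf",
    "b.png": "image/png",
}
for path, expected, actual in check_corpus(declared_types, lambda p: "application/pdf"):
    print(f"{path}: expected {expected}, got {actual}")
```

Run on each change to the signatures, a report like this would show at a glance which test files a new signature fixes or breaks.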