Difference between revisions of "Collecting format ID test files"

From CURATEcamp
Revision as of 23:56, 22 October 2012

Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012

Perhaps the best way of Improving format ID coverage would be to drive the work with a corpus of test files of known type, with documentation and metadata declared so that we can test tools against that information (e.g. a content type expressed as an extended MIME type, such as: application/pdf; version="A-1a").
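As a rough illustration of what "testing tools against that info" could mean, here is a minimal Python sketch that parses an extended MIME type and compares a declared type with a tool's answer. The function names (parse_extended_mime, matches) and the matching rule (same base type, and every declared parameter must be reported with the same value) are assumptions made for this example, not something the corpus or any tool prescribes.

```python
from email.message import Message


def parse_extended_mime(value):
    """Split an extended MIME type into (base_type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    base = msg.get_content_type()
    # get_params() returns the base type first, then (name, value) pairs
    # with surrounding quotes already removed from the values.
    params = {name.lower(): val for name, val in msg.get_params()[1:]}
    return base, params


def matches(declared, reported):
    """Assumed rule: base types agree and the tool reports every
    parameter declared for the test file with the same value."""
    d_base, d_params = parse_extended_mime(declared)
    r_base, r_params = parse_extended_mime(reported)
    if d_base != r_base:
        return False
    return all(r_params.get(k) == v for k, v in d_params.items())


if __name__ == "__main__":
    declared = 'application/pdf; version="A-1a"'
    print(parse_extended_mime(declared))
    print(matches(declared, "application/pdf"))
```

Under this rule a tool that only reports application/pdf does not satisfy a test file declared as PDF/A-1a; a looser rule (base type only) would be an equally reasonable choice for a first pass.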

I started doing this via GitHub, in the format-corpus repository. Some efforts are underway to make submission easier for non-technical users, but it may make sense to collect files via GitHub too.

For example, we could augment the way we develop signatures (see Improving format ID coverage) by also collecting test files under CC0 or similar licences, and setting up a monitoring process to run the new Tika signatures against the test documents and check the results.
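The monitoring process could look something like the following Python sketch. It is only an outline under stated assumptions: toy_identify is a stand-in that checks two well-known magic numbers (PDF and PNG) and is not Tika; a real run would call the identification tool there instead. The corpus layout (a mapping from file name to content and declared type) and the name check_corpus are likewise hypothetical.

```python
def toy_identify(data):
    """Stand-in identifier (NOT Tika): checks two well-known magic numbers."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return "application/octet-stream"


def check_corpus(corpus, identify):
    """Run `identify` over every test file and collect disagreements.

    `corpus` maps a file name to (content_bytes, declared_type).
    Returns a list of (name, declared_type, identified_type) tuples.
    """
    failures = []
    for name in sorted(corpus):
        data, declared = corpus[name]
        identified = identify(data)
        if identified != declared:
            failures.append((name, declared, identified))
    return failures


if __name__ == "__main__":
    # Hypothetical in-memory corpus; real test files would be read from disk.
    corpus = {
        "simple.pdf": (b"%PDF-1.4\n...", "application/pdf"),
        "pixel.png": (b"\x89PNG\r\n\x1a\n...", "image/png"),
        "notes.txt": (b"plain text", "text/plain"),
    }
    for name, declared, identified in check_corpus(corpus, toy_identify):
        print(f"{name}: declared {declared}, identified as {identified}")
```

Run after each signature change, a report like this would show which test files the new signatures gain or lose.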