Collecting format ID test files

From CURATEcamp

Revision as of 13:26, 29 October 2012

Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012

Perhaps the best way of Improving format ID coverage would be to drive the work with a corpus of test files of known type, each with documentation and declared metadata, so that we can test tools against that information (e.g. a content type expressed as an extended MIME type, such as: application/pdf; version="A-1a").
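To make the idea of a declared content type concrete, here is a minimal sketch of how an extended MIME type such as the one above could be split into a base type and its parameters, and then compared with what a tool reports. It uses Python's standard `email` package for the parsing; the function names `parse_extended_mime` and `compare` and the three result labels are hypothetical, chosen only for illustration.

```python
from email.message import EmailMessage


def parse_extended_mime(value):
    """Split an extended MIME type into (base type, parameter dict)."""
    msg = EmailMessage()
    msg["Content-Type"] = value
    header = msg["Content-Type"]
    return header.content_type, dict(header.params)


def compare(declared, reported):
    """Compare a declared type with a tool's answer.

    Returns "exact" when base type and parameters agree, "type-only"
    when only the base type agrees, and "mismatch" otherwise.
    (These labels are an assumption, not an established convention.)
    """
    declared_type, declared_params = parse_extended_mime(declared)
    reported_type, reported_params = parse_extended_mime(reported)
    if declared_type != reported_type:
        return "mismatch"
    if declared_params == reported_params:
        return "exact"
    return "type-only"
```

A tool that only reports `application/pdf` for a file declared as `application/pdf; version="A-1a"` would score "type-only" here, which separates "right format, version unknown" from a plain misidentification.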

I started doing this via GitHub, in this format-corpus repository. There are some efforts underway to make submission easier for non-technical users, but it may make sense to collect files via GitHub too.

For example, we could augment the way we develop signatures (see Improving format ID coverage) by also collecting test files under CC0 or similar licences, and setting up a monitoring process to run the new Tika signatures against the test documents and check the results.
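The monitoring process described above could be sketched roughly as follows: walk a manifest of test files with declared types, run an identifier over each file, and report disagreements. This is only an illustration under stated assumptions. The manifest layout (relative path mapped to declared type) is hypothetical, and `toy_identify` is a stand-in that checks two magic numbers; it is not Tika, and a real run would plug in a call to the actual tool instead.

```python
import tempfile
from pathlib import Path


def toy_identify(path):
    """Stand-in identifier (NOT Tika): recognises two magic numbers."""
    head = path.read_bytes()[:8]
    if head.startswith(b"%PDF-"):
        return "application/pdf"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return "application/octet-stream"


def check_corpus(root, manifest, identify):
    """Return (relative path, declared, reported) for every disagreement."""
    failures = []
    for rel_path, declared in manifest.items():
        reported = identify(Path(root) / rel_path)
        if reported != declared:
            failures.append((rel_path, declared, reported))
    return failures


def demo():
    """Build a tiny throwaway corpus and check it with the toy identifier."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "sample.pdf").write_bytes(b"%PDF-1.4\n")
        (root / "sample.png").write_bytes(b"\x89PNG\r\n\x1a\n")
        (root / "mystery.bin").write_bytes(b"\x00\x01\x02")
        # Hypothetical manifest: relative path -> declared content type.
        manifest = {
            "sample.pdf": "application/pdf",
            "sample.png": "image/png",
            "mystery.bin": "image/jp2",
        }
        return check_corpus(root, manifest, toy_identify)
```

Calling `demo()` lists only `mystery.bin`, whose declared type the toy identifier does not recognise. Re-running the same check whenever signatures change would show whether a new signature fixes a known gap or breaks a previously correct result.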

Possible Sources

A number of places might hold suitable source files. Their licensing is not always clear, but in such cases the files could be referenced by URL rather than embedded directly.

* http://samples.multimedia.cx/