Difference between revisions of "Collecting format ID test files"

From CURATEcamp
(Long sketch of submission process.)
Line 1: Line 1:
 
[[Main Page]] > CURATEcamp iPRES 2012 > [[CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012]]
 
  
Perhaps the best way of [[Improving format ID coverage]] would be to drive this with a corpus of test files of known type, with documentation and metadata declared so that we can test tools against that info (e.g. a content-type using extended MIME types, like: application/pdf; version="A-1a").
+
Perhaps the best way of [[Improving format ID coverage]] would be to drive this with a corpus of test files, ideally of known type, with documentation and metadata declared so that we can test tools against that info (e.g. a content-type using extended MIME types, like: application/pdf; version="A-1a"). This project aims to make it easy to submit example files and optional metadata to a shared corpus.
  
I started doing this via GitHub, in this [https://github.com/openplanets/format-corpus format-corpus] repository. There are some efforts underway to make submission easier for non technical users, but it may make sense to collect files via GitHub too.
+
== Introduction to the Format Corpus ==
 +
The basic idea is to collect test files under CC0 licences, and where that licence cannot be applied, to at least record the known URLs of useful example files. This has already started, in the form of the [https://github.com/openplanets/format-corpus format-corpus] repository. The idea of this page is to describe how to add to that collection.
  
For example, we could augment the way we develop signatures (see [[Improving format ID coverage]]) by also collecting test files under CC0 or similar licences, and setting up a monitoring process to run the new Tika signatures against the test documents and check the results.
+
=== Overall Layout ===
 +
The top-level folders are laid out in broad themes, apart from 'tools', which contains software for analysing the corpus. Just pick whichever seems most appropriate, or create a new top-level folder if none suits. Similarly, arrange the files in sub-folders as required. If you're not sure, just add a top-level folder based on your GitHub username, and add files in there. We can always re-arrange them later.
 +
 
 +
When creating a new test file, you can just put it in your chosen folder and commit it. However, we'd also like it if we had more metadata about each file, so we suggest a slightly formalised layout, like this:
 +
 
 +
* my-folder
 +
* my-folder/example-file.ext (the newly created file)
 +
* my-folder/example-file.ext.csv (the metadata in CSV format - see proforma notes below)
 +
* my-folder/example-file.ext.md (a free-text metadata note, in Markdown format, for any general notes)
 +
* my-folder/example-file.ext.screenshot01.png (screenshots of the object, if any)
 +
 
 +
However, the main thing is to have more example files, so don't worry if there's little or no metadata you can add.
 +
 
 +
If the example file is available on the web, but not under a CC0 licence, then you can instead use the CSV metadata to specify the URL of the resource.
 +
 
 +
=== Metadata Conventions ===
 +
At the least, we'd like to see fields along the lines of
 +
* Creating application name
 +
* Format (PUID or extended MIME type)
 +
* Relative or absolute reference to the file that this was generated from, if any.
 +
 
 +
See [Creating an artificial test set using emulation] for proposed metadata that we could turn into a CSV proforma.
 +
 
 +
== How to add files to the corpus ==
 +
Fortunately, thanks to [http://window.github.com GitHub for Windows] and [http://mac.github.com GitHub for Mac], contributing files is actually pretty easy for anyone to do. The steps are as follows.
 +
 
 +
* Sign up for GitHub, [ here].
 +
* Download the relevant Desktop GitHub client ([http://window.github.com GitHub for Windows] or [http://mac.github.com GitHub for Mac])
 +
* Contact XXX with your GitHub username and ask to be added to the list of people who can commit to the repository.
 +
* Go to the [https://github.com/openplanets/format-corpus Format Corpus] repository and hit the 'Clone in Windows' or 'Clone in Mac' button.
 +
* The GitHub client should pop up, ask you where to place the repository, and download it.
 +
* Open up the repository on your local machine ('Show in Finder').
 +
* Add your files to a suitable folder, making a new directory if you want. Don't put them in the tools folder, but anywhere else is fine.
 +
* When you're finished, open up your GitHub client and look at the format-corpus repository.
 +
* It should show you your changes, and provide you with a commit button.
 +
* Please enter some meaningful text, making it clear that it's a CC0 submission.
 +
** e.g. use this in the long message: "Creative Commons CC0: Public Domain Dedication. See http://creativecommons.org/publicdomain/zero/1.0/ - To the extent possible under law, I have waived all copyright and related or neighboring rights to this work."
 +
* Commit your changes locally.
 +
* Then share them, by pushing to GitHub.
 +
 
 +
TODO This really isn't very difficult, but some screenshots would be immensely helpful.
 +
TODO Use pull requests instead? Somewhat more complex for little gain?
  
 
== Possible Sources ==
 
 
There are a number of places that might have possible source files. The licensing of them is not always clear, but in this case they could be referred to by URL rather than embedded directly.
 
 
*  http://samples.multimedia.cx/
 
 +
 +
== Monitoring Progress ==
 +
We could augment the way we develop signatures (see [[Improving format ID coverage]]) by also setting up a monitoring process that runs the new Tika signatures against the test documents and checks the results.

Revision as of 14:14, 14 November 2012


Perhaps the best way of Improving format ID coverage would be to drive this with a corpus of test files, ideally of known type, with documentation and metadata declared so that we can test tools against that info (e.g. a content-type using extended MIME types, like: application/pdf; version="A-1a"). This project aims to make it easy to submit example files and optional metadata to a shared corpus.
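
As a rough illustration of the extended MIME type idea, the sketch below splits such a string into its base type and its parameters using Python's standard `email.message` module. The function name `parse_extended_mime` is made up for this example and is not part of the format-corpus tools.

```python
from email.message import Message


def parse_extended_mime(value):
    """Split an extended MIME type into (base_type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns (name, value) pairs; the first pair is the
    # base type itself, so only the remaining pairs are parameters.
    params = msg.get_params()
    return msg.get_content_type(), dict(params[1:])


base, params = parse_extended_mime('application/pdf; version="A-1a"')
print(base, params)
```

Keeping the version as a separate parameter means a tool's answer can be compared on the base type alone when it does not report a version.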

Introduction to the Format Corpus

The basic idea is to collect test files under CC0 licences, and where that licence cannot be applied, to at least record the known URLs of useful example files. This has already started, in the form of the format-corpus repository. The idea of this page is to describe how to add to that collection.

Overall Layout

The top-level folders are laid out in broad themes, apart from 'tools', which contains software for analysing the corpus. Just pick whichever seems most appropriate, or create a new top-level folder if none suits. Similarly, arrange the files in sub-folders as required. If you're not sure, just add a top-level folder based on your GitHub username, and add files in there. We can always re-arrange them later.

When creating a new test file, you can just put it in your chosen folder and commit it. However, we'd also like it if we had more metadata about each file, so we suggest a slightly formalised layout, like this:

* my-folder
* my-folder/example-file.ext (the newly created file)
* my-folder/example-file.ext.csv (the metadata in CSV format - see proforma notes below)
* my-folder/example-file.ext.md (a free-text metadata note, in Markdown format, for any general notes)
* my-folder/example-file.ext.screenshot01.png (screenshots of the object, if any)

However, the main thing is to have more example files, so don't worry if there's little or no metadata you can add.

If the example file is available on the web, but not under a CC0 licence, then you can instead use the CSV metadata to specify the URL of the resource.
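
The naming pattern in the layout above could be generated along these lines. This is only a sketch of the suggested convention: the helper name `sidecar_paths` and its `screenshots` argument are invented here, and nothing in the repository is known to enforce these names.

```python
from pathlib import PurePosixPath


def sidecar_paths(example_file, screenshots=0):
    """Return the companion file names suggested for one example file."""
    base = PurePosixPath(example_file)
    return {
        "file": str(base),
        "csv": f"{base}.csv",   # metadata in CSV format
        "notes": f"{base}.md",  # free-text notes in Markdown
        # Screenshots are numbered from 01, as in the layout above.
        "screenshots": [
            f"{base}.screenshot{n:02d}.png"
            for n in range(1, screenshots + 1)
        ],
    }


layout = sidecar_paths("my-folder/example-file.ext", screenshots=1)
for name, value in layout.items():
    print(name, value)
```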

Metadata Conventions

At the least, we'd like to see fields along the lines of

  • Creating application name
  • Format (PUID or extended MIME type)
  • Relative or absolute reference to the file that this was generated from, if any.

See [Creating an artificial test set using emulation] for proposed metadata that we could turn into a CSV proforma.
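
Until a proforma is agreed, a metadata file might look something like the output of the sketch below. The column names (`creating_application`, `format`, `derived_from`, `url`) and the application name in the usage line are assumptions made for illustration only; they simply mirror the fields listed above plus the URL case for files that cannot be placed under CC0.

```python
import csv
import io

# Hypothetical column names; no CSV proforma has been fixed yet.
FIELDS = ["creating_application", "format", "derived_from", "url"]


def metadata_csv(record):
    """Render one metadata record as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    # Missing fields are written as empty cells, since metadata is optional.
    writer.writerow({name: record.get(name, "") for name in FIELDS})
    return buf.getvalue()


text = metadata_csv({
    "creating_application": "ExampleWriter 1.0",  # made-up application name
    "format": 'application/pdf; version="A-1a"',
})
print(text)
```

The text would be saved next to the example file, e.g. as my-folder/example-file.ext.csv.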

How to add files to the corpus

Fortunately, thanks to GitHub for Windows and GitHub for Mac, contributing files is actually pretty easy for anyone to do. The steps are as follows.

  • Sign up for GitHub, [ here].
  • Download the relevant Desktop GitHub client (GitHub for Windows or GitHub for Mac)
  • Contact XXX with your GitHub username and ask to be added to the list of people who can commit to the repository.
  • Go to the Format Corpus repository and hit the 'Clone in Windows' or 'Clone in Mac' button.
  • The GitHub client should pop up, ask you where to place the repository, and download it.
  • Open up the repository on your local machine ('Show in Finder').
  • Add your files to a suitable folder, making a new directory if you want. Don't put them in the tools folder, but anywhere else is fine.
  • When you're finished, open up your GitHub client and look at the format-corpus repository.
  • It should show you your changes, and provide you with a commit button.
  • Please enter some meaningful text, making it clear that it's a CC0 submission.
    • e.g. use this in the long message: "Creative Commons CC0: Public Domain Dedication. See http://creativecommons.org/publicdomain/zero/1.0/ - To the extent possible under law, I have waived all copyright and related or neighboring rights to this work."
  • Commit your changes locally.
  • Then share them, by pushing to GitHub.
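
For anyone who prefers the command line to the desktop clients, the add, commit and push steps above correspond roughly to three standard git commands. The sketch below builds them in Python; the function names are invented for this example, and it assumes git is installed and that you already have a clone you are allowed to push from.

```python
import subprocess

# The CC0 wording suggested above for the long commit message.
CC0_NOTE = (
    "Creative Commons CC0: Public Domain Dedication. "
    "See http://creativecommons.org/publicdomain/zero/1.0/ - "
    "To the extent possible under law, I have waived all copyright "
    "and related or neighboring rights to this work."
)


def submission_commands(paths, summary):
    """Build the git commands that stage, commit and push the given files."""
    return [
        ["git", "add", "--", *paths],
        # The second -m becomes the body of the commit message.
        ["git", "commit", "-m", summary, "-m", CC0_NOTE],
        ["git", "push"],
    ]


def submit(repo_dir, paths, summary):
    """Run the commands inside an existing local clone of the repository."""
    for cmd in submission_commands(paths, summary):
        subprocess.run(cmd, cwd=repo_dir, check=True)


commands = submission_commands(["my-folder/example-file.ext"], "Add example file")
for cmd in commands:
    print(cmd)
```

Calling `submit` with the path of your local clone would actually run the three commands; the usage lines above only print them.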

TODO This really isn't very difficult, but some screenshots would be immensely helpful.

TODO Use pull requests instead? Somewhat more complex for little gain?

Possible Sources

A number of sites may hold suitable source files. Their licensing is not always clear; where it is not, the files can be referred to by URL rather than embedded directly.

  • http://samples.multimedia.cx/

Monitoring Progress

We could augment the way we develop signatures (see Improving format ID coverage) by also setting up a monitoring process that runs the new Tika signatures against the test documents and checks the results.
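
One possible shape for such a check is sketched below. It does not call Tika: the identifier is passed in as a plain function, and the stand-in shown (`guess_by_extension`, based on Python's `mimetypes` module) only looks at file extensions, so it is a placeholder for a real signature-based tool. All names here are hypothetical.

```python
import mimetypes


def guess_by_extension(path):
    """Stand-in identifier: guesses from the file extension only."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed


def check_identification(declared, identify):
    """Compare declared MIME types against what an identifier reports.

    declared maps a file path to its declared base MIME type;
    identify is any callable taking a path and returning a MIME type.
    Returns a list of (path, declared_type, detected_type) mismatches.
    """
    mismatches = []
    for path, expected in sorted(declared.items()):
        detected = identify(path)
        if detected != expected:
            mismatches.append((path, expected, detected))
    return mismatches


declared = {
    "my-folder/example-file.pdf": "application/pdf",
    "my-folder/mystery.bin": "image/png",
}
for row in check_identification(declared, guess_by_extension):
    print(row)
```

Run regularly over the corpus, the list of mismatches would show which files a new set of signatures still fails to identify.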