Collecting format ID test files

From CURATEcamp

Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012

Perhaps the best way of Improving format ID coverage would be to drive this with a corpus of test files, ideally of known type, with documentation and metadata declared so that we can test tools against that info (e.g. a content-type using extended MIME types, like: application/pdf; version="A-1a"). This project aims to make it easy to submit example files and optional metadata to a shared corpus.
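As a rough illustration of how a declared extended MIME type might be compared with a tool's output, the sketch below splits a content-type string into its base type and its parameters using Python's standard `email` package. The function name `parse_extended_mime` is hypothetical and is not part of any existing corpus tooling.

```python
from email.message import Message


def parse_extended_mime(value):
    """Split an extended MIME type into (base_type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns [(base_type, ''), (name, value), ...];
    # the first entry is the type itself, so it is skipped here.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


if __name__ == "__main__":
    print(parse_extended_mime('application/pdf; version="A-1a"'))
```

Keeping the version as a separate parameter means a test can report "right format, wrong version" separately from an outright misidentification.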

The basic idea is to collect test files under CC0 licences, and where that licence cannot be applied, to at least record the known URLs of useful example files. This has already started, in the form of the format-corpus repository. The idea of this page is to describe how to add to that collection.

Some test format-corpus stats are available here:


Contact any of:

Overall Layout

The top-level folders are laid out in broad themes, apart from 'tools', which contains software for analysing the corpus. Just pick whichever seems most appropriate, or create a new top-level folder if none suits. Similarly, arrange the files in sub-folders as required. If you're not sure, just add a top-level folder based on your GitHub username, and add files in there. We can always re-arrange them later.

When creating a new test file, you can just put it in your chosen folder and commit it. However, we'd also like it if we had more metadata about each file, so we suggest a slightly formalised layout, like this:

  • my-folder
  • my-folder/example-file.ext (the newly created file)
  • my-folder/example-file.ext.csv (the metadata in CSV format - see proforma notes below)
  • my-folder/ (a free-text metadata note, in Markdown format, for any general notes)
  • my-folder/example-file.ext.screenshot01.png (screenshots of the object, if any)

However, the main thing is to have more example files, so don't worry if there's little or no metadata you can add.
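To make the suggested layout concrete, here is a minimal sketch that groups the names in one folder into example files and their sidecars, assuming only the `.csv` and `.screenshotNN.png` naming shown above. The function names are hypothetical, and the free-text Markdown note is not handled because its file name is not given here.

```python
import re
from pathlib import Path

# Matches names such as "example-file.ext.screenshot01.png".
SCREENSHOT = re.compile(r"^(?P<base>.+)\.screenshot\d+\.png$")


def summarise_names(names):
    """Map each example file name to its CSV flag and screenshot names."""
    present = set(names)
    shots = {}  # example file name -> list of screenshot names
    for name in sorted(names):
        match = SCREENSHOT.match(name)
        if match and match.group("base") in present:
            shots.setdefault(match.group("base"), []).append(name)
    sidecars = {s for group in shots.values() for s in group}
    summary = {}
    for name in sorted(names):
        if name in sidecars:
            continue  # screenshot belonging to another file
        if name.endswith(".csv") and name[:-4] in present:
            continue  # CSV metadata belonging to another file
        summary[name] = {
            "has_csv": name + ".csv" in present,
            "screenshots": shots.get(name, []),
        }
    return summary


def summarise_folder(folder):
    """Apply summarise_names to the regular files directly inside folder."""
    return summarise_names([p.name for p in Path(folder).iterdir() if p.is_file()])
```

A file with no sidecars still appears in the summary, which matches the point that files without metadata are welcome.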

If the example file is available on the web, but not under a CC0 licence, then you can instead use the CSV metadata to specify the URL of the resource.
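A URL-only record could be produced along these lines. This is only a sketch: the column names `filename`, `source_url` and `licence_note` are hypothetical placeholders, not the project's CSV proforma, and should be replaced by whatever fields the template defines.

```python
import csv
import io

# Hypothetical column names; the project's own CSV proforma may differ.
FIELDS = ["filename", "source_url", "licence_note"]


def url_record_csv(filename, source_url, licence_note=""):
    """Return CSV text describing a file that is referenced by URL only."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerow({
        "filename": filename,
        "source_url": source_url,
        "licence_note": licence_note,
    })
    return buf.getvalue()


if __name__ == "__main__":
    # example.org is a placeholder host, not a real source of test files.
    print(url_record_csv("sample.pdf", "http://example.org/sample.pdf",
                         "not CC0; referenced by URL only"))
```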

Metadata Conventions

We are splitting metadata in two: basic format-level metadata, and format-instance-level significant properties.

Each file should have some basic format metadata, as outlined in this template:

See Creating an artificial test set using emulation for detailed 'significant properties' metadata that we could turn into a CSV proforma. Euan is working on a spreadsheet form.

How to add files to the corpus

Via Google Drive

The simplest way to submit example files is via this Google Drive folder - login if necessary, and use the big plus (+) menu to select 'Upload File...'. Once you've uploaded the file, edit the Description and include the following text:

Creative Commons CC0: Public Domain Dedication. See - To the extent possible under law, <YOUR NAME HERE> has waived all copyright and related or neighboring rights to this work.

Others will be monitoring the folder, and will move the files out of there and into the GitHub repository as and when they can. There are some downsides to this, however:

  • You probably won't get any credit for it - the person who spends time channeling the submissions to GitHub gets most of that.
  • Metadata might not make it.
  • Items cannot be removed or updated once submitted this way.

For a richer editing experience and a full audit trail, we recommend using the GitHub client directly. It's not much harder than using the Amazon website or the Dropbox client.

Using GitHub directly

Fortunately, thanks to GitHub for Windows and GitHub for Mac, contributing files is actually pretty easy for anyone to do. The steps are as follows.

  • Sign up for GitHub.
  • Download the relevant desktop GitHub client (GitHub for Windows or GitHub for Mac).
  • Contact XXX with your GitHub username and ask to be added to the list of people who can commit to the repository.
  • Go to the Format Corpus repository and hit the 'Clone in Windows' or 'Clone in Mac' button.
  • The GitHub client should pop up, ask you where to place the repository, and download it.
  • Open up the repository on your local machine ('Show in Finder').
  • Add your files to a suitable folder, making a new directory if you want. Don't put it in the tools folder, but anywhere else is fine.
  • When you're finished, open up your GitHub client and look for the format corpus.
  • It should show you your changes, and provide you with a commit button.
  • Please enter some meaningful text, making it clear that it's a CC0 submission.
    • e.g. use this in the long message: "Creative Commons CC0: Public Domain Dedication. See - To the extent possible under law, I have waived all copyright and related or neighboring rights to this work."
  • Commit your changes locally.
  • Then share them, by pushing to GitHub.
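For those who prefer the command line to the desktop client, the last few steps (add, commit, push) can be sketched as a small Python wrapper around `git`. This assumes the repository has already been cloned, that you have commit access, and that `git` is on your PATH. The function names are hypothetical, and the link to the CC0 deed that belongs in the long message is not reproduced here.

```python
import subprocess

# Long commit message, based on the suggested CC0 wording above.
CC0_NOTE = (
    "Creative Commons CC0: Public Domain Dedication. "
    "To the extent possible under law, I have waived all copyright "
    "and related or neighboring rights to this work."
)


def git_commands(paths, summary):
    """Build the git invocations for adding, committing and pushing paths."""
    return [
        ["git", "add", "--", *paths],
        # The second -m becomes a separate paragraph in the commit message.
        ["git", "commit", "-m", summary, "-m", CC0_NOTE],
        ["git", "push"],
    ]


def submit(repo_dir, paths, summary):
    """Run the commands in repo_dir, stopping at the first failure."""
    for argv in git_commands(paths, summary):
        subprocess.run(argv, cwd=repo_dir, check=True)
```

Calling `submit("format-corpus", ["my-folder/example-file.ext"], "Add example file (CC0)")` would run the three commands in a local clone named `format-corpus`; that directory name is an assumption.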

Possible Sources

There are a number of places that might have possible source files. Their licensing is not always clear; where it is not, the files can be referred to by URL rather than added to the corpus directly.

Monitoring Progress

For example, we could augment the way we develop signatures (see Improving format ID coverage) by also setting up a monitoring process to run the new Tika signatures against the test documents and check the results.
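A monitoring check of this kind could start as small as the sketch below, which compares the declared type of each test file with whatever an identification function reports. It does not call Tika: `guess_by_extension` is a stand-in built on Python's standard `mimetypes` module, which looks only at the file name, and any real identifier could be passed in its place. All names here are hypothetical.

```python
import mimetypes


def guess_by_extension(path):
    """Stand-in identifier: guesses from the file name only, unlike Tika."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime


def check_corpus(expected, identify=guess_by_extension):
    """Return (path, expected, actual) for each file identified incorrectly.

    expected maps a file path to its declared MIME type.
    identify is any callable taking a path and returning a MIME type or None.
    """
    mismatches = []
    for path, want in sorted(expected.items()):
        got = identify(path)
        if got != want:
            mismatches.append((path, want, got))
    return mismatches
```

Running this after each signature change and watching the list of mismatches shrink (or grow) gives a simple regression signal.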