CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012

Revision as of 08:55, 17 November 2012 by Courtney C. Mumma (talk | contribs)

Results

Thanks to participants! @mopennock @WilliamKilbride @GaryM03062 @anjacks0n @carusb @benfinoradin @peshkira @petemay @Britpunk80 @HeatherBowden

  • @GaryM03062 discovered a bug in JHOVE while testing FITS: https://sourceforge.net/tracker/?func=detail&aid=3587890&group_id=221311&atid=1052190
  • File corpus! https://github.com/openplanets/format-corpus/commit/b0971e1c32b2df7a9bceafe1f00d81f49cb45990
  • (@benfinoradin) Kept all #fileidhack tweets today pic.twitter.com/7QI1DfmD
  • (@mopennock, @anjacks0n, @petemay et al) British Library team worked on eBook format identification
  • (@peshkira) OpenFITS compiles
  • (@petemay) Tika signatures for PDB, Kindle AZW and LRF files created, re-testing over sample file set #fileidhack #eBook
  • (@mopennock) Added 7 new eBook signatures to Tika this morning
  • Encouraged pinging PRONOM (@Britpunk80) to create/test/submit: [1]
  • @Britpunk80 handed some DROID signature files to @anjacks0n on rocketbook, epub, and ibooks.
  • @HeatherBowden shared some Quark and InDesign files.
  • @GaryM03062: New commit of OpenFITS allows setting max no. of threads in fits.xml [2]
  • @benfinoradin shared a resource on RIFF/RIFX [3]
  • The Quicktime motherlode! @mistydemeo shared Quicktime videos [4]
  • OpenFITS: FITS#Improving_JHOVE_performance_within_FITS
  • @mistydemeo: Created @machomebrew formula for fidget to make file ID signatures for #fileidhack [5]
  • #openarchives chat /nick artefactualmtgroom #fileidhack pic.twitter.com/1Ffp1v6Y
  • @anjacks0n: new Percipio and Fidget available for dev and feedback
  • @pjvangarderen @archivematica: Artefactual picks up #fileidhack baton. OpenFITS debian package for testing [6]
  • @GaryM03062 uploaded source changes to JHOVE. [7]
  • @jordanheit testing OpenFITS: FITS

Background

One break-out session at the CURATEcamp iPRES 2012 was affectionately branded "file id confessional" where we commiserated on the state of our file id tools and processes. We also talked about:

  • We can do a better job specifying and documenting our file id requirements / use cases
  • We're all hooked on that FITS.xml but FITS needs performance optimization ASAP (also, is Harvard up for extra dev?)
  • Apache Tika is a very actively supported and useful tool for file id and content extraction. How much of our file id requirements can it in fact cover?
  • Archivematica Format Policy Registry use case (see also DAITSS action plans)
  • Jason Scott's "Let's Just Solve the Problem" campaign to boldly catalog as much file format info as possible in the month of November.
  • also, CURATEcamp iPres participant Paul Wheatley has since posted: We Need Better Characterization as well as link to Online Hack Event. This led to Twitter discussion between @pjvangarderen @anjacks0n @prwheatley about this 24 hr hackathon event.

What

A 24hr+ live hackathon event where teams in multiple time zones work on common technical projects related to the CURATEcamp iPres 2012 file id discussions.

Project proposals can be made by anyone.

We will start the day with New Zealand (GMT +12:00) and end with North America West Coast wrapping up project(s), hopefully with one or two solid deliverables by 12 midnight-ish PST (GMT -8:00).

Why

  • Because we'll probably get some useful stuff done
  • Because it's fun to work with CURATEcamp people in a CURATEcamp way
  • Because doing a 24hr+ worldwide hack with real time collaboration tools is cool

Logistics

When: Fri Nov 16

  • Friday, November 16, 2012
    • OPF Emulation Hackathon is Nov 13-15. Freiburg, Germany. Sorry, Nov 16th was chosen somewhat haphazardly. We didn't mean to compete with OPF Hackathon event. But emulation needs file characterization too? Maybe OPF Emulation Hackathon can hand off some "File Id for Emulation" use cases to the Nov 16 24 hr Hackathon...or better yet, extend the Freiburg event to include participation in the Nov 16 24 hr worldwide #fileidhack event. Great way to cap off their Hackathon week! --PeterVG 11:48, 23 Oct 2012 (PDT)
  • Friday, November 23, 2012
    • RT @declan: @pjvangarderen neat idea! You know that date is the day after US Thanksgiving, right? people might be on vacation

How

  • Twitter: #fileidhack (made it shorter)
  • CURATEcamp Mediawiki: Log-in and please help update this page

Let's put together a schedule, tasklist, & volunteers to road-test these tools for Nov 16:


Who (Sign up)

  • GMT +12:00 Digital Preservation Practical Implementers Guild (@DP_PIG)
  •  ?
  • GMT +7:00 Euan Cochrane (@euanc)
  •  ?
  • GMT +2:00 TechMaurice (NANETH)
  • GMT +1:00 Nicholas Clarke (@nclarkedk) - netarkivet.dk
  • GMT +0:00 Andy Jackson (@anjacks0n), Paul Wheatley (@prwheatley), BL digital preservation team - Maureen (@mopennock), PeterM, Lynn, William, and maybe more...; David Underdown (@davidunderdown9) and maybe some more TNA folk
  •  ?
  • GMT -5:00 Kara Van Malssen (@kvanmalssen), Dave Rice (@dericed), Ben Fino-Radin (@benfinoradin), Gary McGath (@Garym03062), @anarchivist
  • GMT -5:00 @lljohnston @blefurgy et al!
  • GMT -5:00 Greg Jansen @gregj, Ben Pennell @pennellben
  • GMT -5:00 Heather Bowden @heatherbowden - will help when/where I can. Happy to help US East Coasters and Artefactual Team, or whomever. Contact me if you need an extra hand.
  •  ?
  • GMT -8:00 Artefactual: peter (@pjvangarderen), courtney (@snarkivist), evelyn, joseph, mikeC (@mcantelon), mikeG, austin, dan...plus any VanCity people wanting to participate from Artefactual office.

Project Proposals

  • Document file id requirements / use cases
  • ArchiveTeam "Just Solve the Problem" wiki scraping -> structured data (CSV?, XML?, RDF?); as an ongoing service?
  • Improving format ID coverage
  • Collecting format ID test files
  • Improving identification methods
    • Develop a Format ID "Emulation Workbench" for format analysis
    • Document software input and output formats to use in limiting the option set for files of a particular time period (if we know all formats that were creatable during a period when a file was created then we can limit results to only those formats), and for use in format intelligence mining.
  • Archivematica Format Policy Registry testing
    • @archivematica team & volunteers
  • @kvanmalssen Improved file id /characterization support for AV files in existing tools like Tika and FITS. An update of Exiftool and inclusion of MediaInfo would be a good start. Or maybe test applicability of ffprobe/avprobe for this task.
    • @dericed This is exactly what ffprobe/avprobe does. Whereas many of the digipres tools do identification by sampling x bytes from the head and tail, ffprobe/avprobe incorporate one of the many extensive demuxing libraries to manage identification of the contents.
    • @kvanmalssen - Yes, so can we get avprobe to output in a structured way? And could it be incorporated into a tool like FITS or Tika so that we can have a file id tool that supports mixed collections?
    • @dericed - Yes ffprobe/avprobe have the -print_format (-of) option so you can get json, xml, csv, or others. There's also an xsd published for the output. I suppose ffprobe could be incorporated into FITS but not sure if this is an efficient idea. The premise of FITS seems to put all preservation metadata considerations on the container (file format) but in AV collections the codecs and contained bitstreams are far more significant to consider.
    • @kvanmalssen - Issue is we need AV support (including track/bitstream support) in these general tools so people can process mixed collections. That's what I'd like to figure out.
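
To make the ffprobe exchange above concrete, here is a minimal Python sketch of what "structured output" could look like in practice. It assumes ffprobe is installed and on the PATH, and it relies on the JSON keys ffprobe normally emits (format.format_name, streams[].codec_type, streams[].codec_name). The function names build_ffprobe_command, summarize and probe are made up for this illustration; nothing here is part of FITS or Tika.

```python
import json
import subprocess


def build_ffprobe_command(path):
    """Build an ffprobe invocation that emits container and stream metadata as JSON."""
    return [
        "ffprobe",
        "-v", "quiet",            # suppress banner and log output
        "-print_format", "json",  # the -print_format (-of) option mentioned above
        "-show_format",           # container-level metadata
        "-show_streams",          # per-stream (codec / bitstream) metadata
        path,
    ]


def summarize(ffprobe_json):
    """Reduce ffprobe's JSON to a container name plus one (type, codec) pair per stream."""
    data = json.loads(ffprobe_json)
    container = data.get("format", {}).get("format_name")
    streams = [
        (s.get("codec_type"), s.get("codec_name"))
        for s in data.get("streams", [])
    ]
    return {"container": container, "streams": streams}


def probe(path):
    """Run ffprobe on a file (assumes ffprobe is on PATH) and summarize the result."""
    result = subprocess.run(
        build_ffprobe_command(path),
        capture_output=True, text=True, check=True,
    )
    return summarize(result.stdout)
```

Keeping the container and the per-stream codecs side by side is one way a wrapper tool could report on the contained bitstreams as well as the file format, which is the gap raised in this thread.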

  • FITS or Tika bugfix marathon (e.g. this one).
    • Perhaps consider refactoring FITS to re-use existing dependency management tools like Maven and apt/yum/etc instead of manual dependency management? Andy Jackson 05:16, 23 October 2012 (PDT)
      • I'm willing to put a fork of FITS on Github if a couple of people say they want it. --Gary McGath 13:27, 11 November 2012 (PST)
  • TechMaurice: Replace container identification function of FIDO using PRONOM container signature.
  • Misty De Meo: Just a thought... it strikes me that the basic functionality of FITS is not super complicated. As well, in my experience, most users are using a fairly minimal set of features. Given some of the problems we're having with FITS, it may be worth doing a minimal rewrite of FITS (in, say, Python or C) with a focus on a) speed, and b) maintainability. This is more than a day's work, but we could make a start if this is something other people would be interested in. Things I'd want to see would include:
    • Don't vendor tools - just recommend versions, but draw from whatever tools the user has installed.
    • Implement better AV support (with all the caveats listed above)
    • Possibly restrict the number of tools?
    • +1 to FITS refactoring --Greg Jansen 11:33, 15 November 2012 (PST)
    • Implement only the configuration options most people use, and let those be specified on the commandline instead of via XML.
      • Gary McGath on IRC points out that the use of external tools means that in FITS scanned files are independently loaded from disk by multiple tools, introducing unneeded IO overhead. Could be fixed in FITS itself.
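
The IO point in the last item can be illustrated with a small Python sketch: read the file once, then hand the same bytes to every identifier. This is only an illustration of the idea, not how FITS is built. The identifier functions (id_pdf, id_riff, id_zip) and identify_once are hypothetical stand-ins for the external tools, and they only check well-known leading magic bytes.

```python
from pathlib import Path

# Hypothetical in-memory identifiers standing in for the external tools.
# Each takes the file's bytes and returns a format label, or None if no match.


def id_pdf(data):
    return "PDF" if data.startswith(b"%PDF-") else None


def id_riff(data):
    # RIFF is the little-endian container; RIFX is its big-endian variant.
    if data.startswith(b"RIFF"):
        return "RIFF"
    if data.startswith(b"RIFX"):
        return "RIFX"
    return None


def id_zip(data):
    # ZIP local file header; also the outer wrapper of container formats such as EPUB.
    return "ZIP" if data.startswith(b"PK\x03\x04") else None


IDENTIFIERS = {"pdf": id_pdf, "riff": id_riff, "zip": id_zip}


def identify_once(path, identifiers=IDENTIFIERS):
    """Read the file from disk a single time and share the bytes with every identifier."""
    data = Path(path).read_bytes()  # the only disk read
    return {name: fn(data) for name, fn in identifiers.items()}
```

Reading the whole file into memory is a simplification. For large AV files a real tool would more likely share a bounded head/tail buffer or a memory-mapped view rather than the full contents.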

Should we take a poll a day in advance to select 2 or 3 projects or should we just let everyone work on whatever proposal they wish?

Preparation TODO

  • GitHub How To
    • Set up temporary FITS and/or Tika forks that we can work on?
  • Set up Archivematica instances to test FPR
  • Easier signature development tools and/or signature contribution tracking, now partially complete, as outlined in Improving format ID coverage
  • Example file contribution How To document, c.f. Collecting format ID test files

Results

Nov 17. 07:30 UTC -- 30 hours later


Peter Van Garderen @pjvangarderen
Proud to lead 24hr real time R&D cycle. Thanks #fileidhack people for your passion RT @jordanheit: testing OpenFITS wiki.curatecamp.org/index.php/FITS

Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @GaryM03062: Testing FITS led me to discover a bug in JHOVE, so #fileidhack is worth something. sourceforge.net/tracker/?func=…


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @anjacks0n: Thanks for the files, @carusb github.com/openplanets/fo… #fileidhack


Peter Van Garderen @pjvangarderen
Thanks #fileidhack live link? RT @benfinoradin: Archiving all #fileidhack tweets today pic.twitter.com/7QI1DfmD


Peter Van Garderen @pjvangarderen
Thanks BL! #fileidhack RT @mopennock: BL team are working on eBook format identification today for #fileidhack - @anjacks0n @petemay et al


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @peshkira: OpenFITS current status: It compiles! #fileidhack


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @petemay: Tika sigs for PDB, Kindle AZW and LRF files created, re-testing over sample file set #fileidhack #eBook


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @mopennock: We've added 7 new eBook signatures to Tika this morning #fileidhack. Great work all!


Peter Van Garderen @pjvangarderen
Thanks #fileidhack Everyone ping PRONOM pls! RT @Britpunk80: #fileidhack if you want to create/test/submit your own: …keddatapronom.nationalarchives.gov.uk/sigdev/index.h…


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @Britpunk80: I've handed some droid sig files to @anjacks0n on rocketbook, epub, and ibooks. #fileidhack


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @HeatherBowden: @euanc @anjacks0n I have some Quark and InDesign files. You interested? #fileidhack


Peter Van Garderen @pjvangarderen
Thanks #fileidhack Nov1614:00UTC RT @pjvangarderen: Wazzup! West Coast in da fileidhacking house! RT @declan: good morning #fileidhack!


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @peshkira: #fileidhack Current status: FITS mavenized. PullRequest/Wiki \w explanation follow. /cc @GaryM03062


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @GaryM03062: New commit of OpenFITS allows setting max no. of threads in fits.xml #fileidhack github.com/gmcgath/openfi…


Peter Van Garderen @pjvangarderen
Thanks #fileidhack Great work on OpenFITS! Lets keep this alive RT @GaryM03062: Calling a day for #fileidhack. Great working with everyone!


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @Snarkivist: #fileidhack team - just catching up on your work today - was internetless - look for summary in the morning


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @pjvangarderen: Nov1601:17UTC @euanc (Perth) #fileidhack IRC - Nov1704:16UTC @archivematica crew still hacking #24hrs+


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @benfinoradin: Good resource on RIFF/RIFX: johnloomis.org/cpe102/asgn/as…


Peter Van Garderen @pjvangarderen
Thanks #fileidhack Holy cow, the Quicktime motherload! RT @mistydemeo: Have some Quicktime videos, #fileidhack github.com/openplanets/fo…


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @GaryM03062: Another update for OpenFITS. Please read the wiki: wiki.curatecamp.org/index.php/FITS… #fileidhack


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @mistydemeo: Created @MacHomebrew formula for fidget to make file ID signatures for #fileidhack github.com/mistydemeo/hom…


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @pjvangarderen: @mistydemeo meet #openarchives /nick artefactualmtgroom #fileidhack pic.twitter.com/1Ffp1v6Y



Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @archivematica: Artefactual picks up #fileidhack baton. OpenFits debian package launchpad.net/~archivematica… test time!


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @anjacks0n: @benfinoradin tweaked your sig, now identifies all test files you sent github.com/openplanets/fo… #fileidhack


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @GaryM03062: As a side effect of #fileidhack, I've been uploading source changes to JHOVE. sourceforge.net/projects/jhove/

Peter Van Garderen @pjvangarderen
Thanks #fileidhack Thanks GMT! RT @mopennock: It's all go this morning for the #fileidhack! wiki.curatecamp.org/index.php/CURA…


Peter Van Garderen @pjvangarderen
Thanks #fileidhack RT @WilliamKilbride: It's #dpc #ff follow friday. look at #fileidhack Better still, get involved wiki.curatecamp.org/index.php/CURA…


Summary