Difference between revisions of "CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012"


Revision as of 19:47, 23 October 2012


Background

One break-out session at CURATEcamp iPRES 2012 was affectionately branded the "file id confessional", where we commiserated about the state of our file id tools and processes. We also talked about:

  • We can do a better job of specifying and documenting our file id requirements / use cases
  • We're all hooked on that FITS.xml but FITS needs performance optimization ASAP (also, is Harvard up for extra dev?)
  • Apache Tika is a very actively supported and useful tool for file id and content extraction. How much of our file id requirements can it in fact cover?
  • Archivematica Format Policy Registry use case
  • Jason Scott's "Let's Just Solve the Problem" campaign to boldly catalog as much file format info as possible in the month of November.
  • Also, CURATEcamp iPres participant Paul Wheatley has since posted "We Need Better Characterization" as well as a link to an Online Hack Event. This led to a Twitter discussion between @pjvangarderen, @anjacks0n and @prwheatley about this 24 hr hackathon event.

What

A 24-hour+ live hackathon event in which teams across multiple time zones work on common technical projects related to the CURATEcamp iPres 2012 file id discussions.

Project proposals can be made by anyone.

We will start the day with New Zealand (GMT +12:00) and end with North America West Coast wrapping up project(s), hopefully with one or two solid deliverables by 12 midnight-ish PST (GMT -8:00).
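To make the "24hour+" part concrete, here is a rough sketch of the arithmetic behind the relay. It assumes (this page does not say so) that New Zealand kicks off at 00:00 local time on Nov 16, and it uses the fixed GMT offsets quoted above rather than any daylight-saving rules.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as quoted on this page (assumption: no daylight-saving adjustment).
NZ = timezone(timedelta(hours=12))
PST = timezone(timedelta(hours=-8))

# Assumed kick-off: midnight at the start of Nov 16 in New Zealand.
start = datetime(2012, 11, 16, 0, 0, tzinfo=NZ)
# Wrap-up: "12 midnight-ish PST", i.e. the end of Nov 16 on the West Coast.
end = datetime(2012, 11, 17, 0, 0, tzinfo=PST)

span = end - start
start_utc = start.astimezone(timezone.utc)
end_utc = end.astimezone(timezone.utc)

print("Kick-off (UTC):", start_utc.isoformat())
print("Wrap-up  (UTC):", end_utc.isoformat())
print("Total hours:", span.total_seconds() / 3600)
```

Under those assumptions, "Nov 16" lasts well over 24 hours worldwide, so hand-offs between time zones can overlap rather than run back to back.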

When

  • Friday, November 16, 2012
    • The OPF Emulation Hackathon is Nov 13-15 in Freiburg, Germany. Sorry, Nov 16th was chosen somewhat haphazardly; we didn't mean to compete with the OPF Hackathon event. But emulation needs file characterization too? Maybe the OPF Emulation Hackathon can hand off some "File Id for Emulation" use cases to the Nov 16 24 hr Hackathon...or better yet, extend the Freiburg event to include participation in the Nov 16 24 hr worldwide hackathon event. Great way to cap off their Hackathon week!
  • Friday, November 23, 2012 (struck out; replaced by November 16)
    • RT @declan: @pjvangarderen neat idea! You know that date is the day after US Thanksgiving, right? people might be on vacation

How

  • Twitter: #fileidhack (made it shorter)
  • CURATEcamp Mediawiki: Log-in and please help update this page

Let's put together a schedule, tasklist, & volunteers to road-test these tools for Nov 16:

  • Google Hangout: fire up a webcam
  • GoogleDocs: we can live edit any docs we feel the urge to produce
  • IRC: use existing channel or create one just for event?
  • GitHub: get those pull requests going

Why

  • Because we'll probably get some useful shit done
  • Because it's fun to work with CURATEcamp people in a CURATEcamp type of way
  • Because doing a 24hr+ worldwide hack with real time collaboration tools is cool

Who (Sign up)

  • GMT +12:00 Euan Cochrane (@euanc)
  •  ?
  • GMT +0:00 Andy Jackson (@anjacks0n), Paul Wheatley (@prwheatley), BL digital preservation team - Maureen (@mopennock) PeteC, PeterM, Lynn, William, and maybe more...; David Underdown (@davidunderdown9) and maybe some more TNA folk
  • GMT -5:00 Kara Van Malssen (@kvanmalssen), Dave Rice (@dericed), Ben Fino-Radin (@benfinoradin), Gary McGath (@Garym03062), @anarchivist
  • GMT -5:00 @lljohnston @blefurgy et al!
  •  ?
  • GMT -8:00 Artefactual: peter (@pjvangarderen), courtney (@snarkivist), evelyn, joseph, mikeC (@mcantelon), mikeG, austin, dan...plus any VanCity people wanting to participate from Artefactual office.

Project Proposals

  • Document file id requirements / use cases
  • ArchiveTeam "Just Solve the Problem" wiki scraping -> structured data (CSV?, XML?, RDF?); as an ongoing service?
  • Improving format ID coverage
  • Collecting format ID test files
  • Improving identification methods
  • Archivematica / Tika integration
    • @archivematica team & volunteers
  • Archivematica Format Policy Registry testing
    • @archivematica team & volunteers
  • @kvanmalssen Improved file id / characterization support for AV files in existing tools like Tika and FITS. An update of Exiftool and inclusion of MediaInfo would be a good start. Or maybe test applicability of ffprobe/avprobe for this task.
    • @dericed This is exactly what ffprobe/avprobe does. Whereas many of the digipres tools do identification by sampling x bytes from the head and tail, ffprobe/avprobe incorporate one of the many extensive demuxing libraries to manage identification of the contents.
    • @kvanmalssen - Yes, so can we get avprobe to output in a structured way? And could it be incorporated into a tool like FITS or Tika so that we can have a file id tool that supports mixed collections?
    • See also Improving identification methods, which could perhaps be split into two or three and one of which merged with the above tweet discussion? Andy Jackson 15:20, 22 October 2012 (PDT)
  • FITS or Tika bugfix marathon (e.g. this one).
    • Perhaps consider refactoring FITS to re-use existing dependency management tools like DROID and apt/yum/etc instead of manual dependency management? Andy Jackson 05:16, 23 October 2012 (PDT)
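
On the structured-output question raised in the ffprobe/avprobe thread above: a minimal sketch, assuming an ffprobe build that supports `-print_format json` with `-show_format` and `-show_streams`, and that `ffprobe` is on the PATH. Whether avprobe accepts the same flags has not been checked here. The helper names `run_ffprobe` and `summarize` and the sample record are hypothetical, for illustration only; this is not how FITS or Tika would necessarily integrate it.

```python
import json
import subprocess


def run_ffprobe(path):
    """Ask ffprobe for container and stream info as JSON (needs ffprobe installed)."""
    cmd = [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format", "-show_streams",
        path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def summarize(probe):
    """Reduce ffprobe's JSON to a small record a file id tool could consume."""
    fmt = probe.get("format", {})
    streams = probe.get("streams", [])
    return {
        "container": fmt.get("format_name"),
        "streams": [(s.get("codec_type"), s.get("codec_name")) for s in streams],
    }


# Hand-written sample in the shape ffprobe emits, so the sketch runs without ffprobe.
sample = {
    "format": {"format_name": "mov,mp4,m4a,3gp,3g2,mj2"},
    "streams": [
        {"codec_type": "video", "codec_name": "h264"},
        {"codec_type": "audio", "codec_name": "aac"},
    ],
}
summary = summarize(sample)
print(summary)
```

With a real file, `summarize(run_ffprobe("some_file.mov"))` would give the same kind of record, which a wrapper could then map into whatever output schema a tool like FITS expects.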

Should we take a poll a day in advance to select 2 or 3 projects or should we just let everyone work on whatever proposal they wish?

Preparation TODO

  • GitHub How To
    • Set up temporary FITS and/or Tika forks that we can work on?
  • Easier signature development tools and/or signature contribution tracking, as outlined in Improving format ID coverage
  • Example file contribution How To document, c.f. Collecting format ID test files
  • Prep Archivematica dev VMs (incl Tika checkout), spin up & grant IPs/SSH to Hackfest participants upon request (Artefactual: Austin)