Association of Moving Image Archivists & Digital Library Federation Hack Day 2013


When, Where, What time?

  • Date: November 6, 2013
  • Time: ~9am-5pm (with the option of continued work on projects throughout the conference in our Developer Lounge at the Richmond Marriott, Apple Boardroom - available all day Thursday and Friday)
  • Location: Salon B at the Crowne Plaza Richmond Downtown in Richmond, VA
  • hashtag: #AVhack13
  • IRC: #curatecamp_avpres_1. If you are using an IRC client, the server is chat.freenode.net; or you can use your browser and connect to webchat.freenode.net. If you are unfamiliar with IRC, take a look at this ☞ brief introduction.
  • Light breakfast, snacks and coffee will be provided throughout the day!

How can I participate?

Sign up! As this will be a highly participatory event, registration is limited to those willing to get their hands dirty, so no onlookers please.

If you are unsure whether you can or want to participate in the hack day itself, you can still see the results by attending the AMIA closing plenary, where hack day projects will be presented, and the audience will have an opportunity to vote on their favorites.

What will be the format of the event?

In advance of the hack day, project ideas will be collected through the registration form and the event wiki, and participants will review and discuss the submitted ideas before the event. We'll then break into groups of technologists and practitioners, each selecting an idea to work on together for the day and (if desired) throughout the rest of the AMIA conference in the developers lounge.

The day itself will be structured something like this. Breakfast, coffee/tea, and snacks will be provided. Lunch is on your own.

9am – Welcome, introductions, and breakfast

9:30am – noon – Hacking. Snacks and coffee will be served.

Noon – 1pm – Lunch on your own.

1pm – 4:30pm – Hacking. Snacks and coffee will be served.

4:30pm – 5pm – Wrap-up.

Closing plenary & prizes

Projects will be presented during the conference closing plenary, Saturday November 9 at 9:30am. Projects will be judged by a panel as well as by conference attendees.

Summary

In association with its annual conference, the Association of Moving Image Archivists will host its first-ever hack day on November 6, 2013 in Richmond, VA. The event will be a unique opportunity for practitioners and managers of digital audiovisual collections to join with developers and engineers for an intense day of collaboration to develop solutions for digital audiovisual preservation and access. It will be fun and practical…and there will be prizes!

This year's hack day is a partnership between AMIA and the Digital Library Federation. A robust and diverse community of practitioners who advance research, teaching and learning through the application of digital library research, technology and services, DLF brings years of experience creating and hosting events designed to foster collaboration and develop shared solutions for common challenges.

What if I’m not a developer?

Content managers and preservation practitioners are as central to the success of the event as keen developers. YOU will be responsible for setting the agenda and the outcomes. The goal is to foster collaboration between audiovisual preservation specialists and technologists, to solve problems together and share expertise.

Background

What is a hack day?

A hack day or hackathon is an event that brings together computer technologists and practitioners for an intense period of problem solving through computer programming. Within digital preservation and curation communities, hack days provide an opportunity for archivists, collection managers, and others to work together with technologists to develop software solutions for digital collections management needs. Hack days have been held independently by groups such as the Open Planets Foundation, as well as in association with preservation- and access-oriented conferences including Open Repositories and Museums and the Web.

The manifesto of a recent event at the Open Repositories conference framed the benefits this way: “Transparent, fun, open collaboration in diversely constituted teams...The creation of new professional networks over the ossification of old ones. Effective engagement of non-developers (researchers, repository managers) in development...Work done at the conference over presentation of something prepared earlier.”

Why an AMIA hack day?

An audiovisual preservation-themed CURATEcamp was held in April 2013, drawing over 120 registrants from at least 3 continents for a day of great conversations and lightning talks. CURATEcamp is a series of unconference-style events focused on connecting practitioners and technologists interested in digital curation. The event generated a lot of documentation and articulated many shared concerns. Topics covered included digitization of video, film scanning, digital storage strategies, proprietary digital video files in collections, and technical metadata for preservation. Participants agreed that more work needed to be done and action taken, and so the idea for an AMIA hack day was born.

Discussions between managers of audiovisual collections and solutions developers provided a fruitful starting point for hack day project ideas, including:

  • Simple fixity tools to use when transferring files from one storage medium to another (see the sketch after this list)
  • Technical metadata extraction and making use of these reports (MediaInfo, ffprobe)
  • Simple cataloging tools for AV, with an eye towards contemporary frameworks/schema
  • Discovery tools/UX for audiovisual collections, access at scale
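As a rough illustration of the first idea above, here is a minimal fixity sketch in Python; the file paths and the choice of MD5 are assumptions for the example, not part of any proposal.

import hashlib

def md5_of(path, chunk_size=8192):
    # Compute the MD5 checksum of a file, reading it in chunks so large
    # video files do not have to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: compare checksums before and after copying to new storage.
source = "/source_volume/tape_0001.mov"
copy = "/destination_volume/tape_0001.mov"
if md5_of(source) == md5_of(copy):
    print("Fixity confirmed: checksums match.")
else:
    print("WARNING: checksums differ; the copy may be corrupt.")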

Our Manifesto

  • Transparent, fun, open collaboration in diversely constituted teams over individual brilliance and/or groups of like individuals in cut-throat competition.
  • The creation of new professional networks over the ossification of old ones.
  • Effective engagement of non-developers (researchers, repository managers) in development over purely developer driven projects.
  • Work done at the conference over presentation of something prepared earlier.

Project proposals

Please register for the hack day (we're currently at capacity, but forming a wait list) and we will start adding your ideas here for voting in advance of the Hack Day!

Possible topics projects could touch on: fixity checking; transcoding; metadata validation; automating file movement; altering fdupes so that it shows the user the MD5 checksum hash; altering Archivematica 1.0 code to bypass zipping the AIP.

Loose metadata project ideas: segmentation and time-based annotation of video segments on the web (maybe leveraging Media Fragments?); XSLT mapping; turning CSV fields into PREMIS XML; using geolocation information to facilitate new access pathways to video; RDFing PBCore, potentially to leverage in Fedora 4.
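As one hedged sketch of the "CSV fields into PREMIS XML" idea, the fragment below maps a couple of columns from a hypothetical inventory CSV into bare-bones PREMIS-style object entries. The column names and the PREMIS 2.x namespace are assumptions, and the output is deliberately simplified (a schema-valid PREMIS object needs more structure than this):

import csv
import xml.etree.ElementTree as ET

PREMIS = "info:lc/xmlns/premis-v2"  # PREMIS 2.x namespace (assumption; adjust to the version in use)
ET.register_namespace("premis", PREMIS)

def csv_to_premis(csv_path):
    # Assumes columns named "identifier" and "md5" in the source CSV.
    root = ET.Element("{%s}premis" % PREMIS)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            obj = ET.SubElement(root, "{%s}object" % PREMIS)
            oid = ET.SubElement(obj, "{%s}objectIdentifier" % PREMIS)
            ET.SubElement(oid, "{%s}objectIdentifierType" % PREMIS).text = "local"
            ET.SubElement(oid, "{%s}objectIdentifierValue" % PREMIS).text = row["identifier"]
            oc = ET.SubElement(obj, "{%s}objectCharacteristics" % PREMIS)
            fixity = ET.SubElement(oc, "{%s}fixity" % PREMIS)
            ET.SubElement(fixity, "{%s}messageDigestAlgorithm" % PREMIS).text = "MD5"
            ET.SubElement(fixity, "{%s}messageDigest" % PREMIS).text = row["md5"]
    return ET.tostring(root, encoding="unicode")

print(csv_to_premis("inventory.csv"))  # "inventory.csv" is a hypothetical file name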

Loose non-code project ideas: editing/adding Wikipedia pages, creating a manual for a tool or a workflow, creating a webpage.

Please submit your project ideas using the format below. Remember, the more specific the better. Have a look at the project descriptions from Open Repositories 2013 for inspiration.

Project Sign Up Sheet

Sign up for projects you are interested in here

Signing up in advance does not mean you are committed to work on that project. And it does not mean these are the only projects. There will still be an opportunity to add additional projects on the day of the event and sign up for those as well.

The projects below were discussed during a Google Hangout on November 1, 2013. For more information, please see the notes from that conversation.

1. The 608ers: Time-based transcript/caption display

Two proposals have merged into one:

Extraction of EIA-608/line 21 closed caption information: Ability to extract and reuse closed caption information from NTSC video.

+

Interactive Video/Transcript Streaming: This project would use the open source Interactive Video/Transcript viewer package as a baseline for streaming video and transcripts. This package has weak support and is becoming increasingly difficult to maintain. The hope is to come up with an approach to build or improve upon the existing system to reliably stream video files with their time coded transcripts across multiple browser and OS types.

The original input is MP4 interviews spoken in Inuktitut with their English transcripts in .doc format. The video and transcript need to be streamed/played simultaneously.

Notes

Notes from the Nov 1 planning call

Possible starting points

Maybe: http://ccextractor.sourceforge.net/
Also: http://dev.w3.org/html5/webvtt/
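As a rough sketch of one step in this pipeline, the snippet below converts an SRT-style caption file (for example, one produced by CCExtractor) into WebVTT, which HTML5 players can display alongside video. The file names are hypothetical, and real caption files have edge cases this does not handle:

def srt_to_webvtt(srt_path, vtt_path):
    # Naive SRT-to-WebVTT conversion: add the WEBVTT header and swap the
    # comma for a period in timestamps (00:00:01,000 -> 00:00:01.000).
    with open(srt_path, encoding="utf-8") as src, open(vtt_path, "w", encoding="utf-8") as dst:
        dst.write("WEBVTT\n\n")
        for line in src:
            if "-->" in line:  # timing line
                line = line.replace(",", ".")
            dst.write(line)

srt_to_webvtt("interview_01.srt", "interview_01.vtt")  # hypothetical file names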

+

The original IVT package is here: IVT.zip

Data set required

Uncompressed video files that contain line 21 closed caption information

Sample Data: CanadaVideoTranscripts.zip

The existing IVT player is running here: Live Site

Submitted by

Steven Villereal

+

Chris McNeave

Interested team members/participant roles

Who wants to work on this project?

2. Team McGruff: Integration of MediaInfo-generated metadata into a forensic imaging workflow

We would like to generate MediaInfo key/value pairs and include them in DFXML for forensic disk images that contain audio or video files. This could be accomplished through the fiwalk utility's DGI interface.

Use case: An archive has acquired hard drives containing mixed file formats, including media formats. In order to prevent any further modifications of the drives they have been forensically imaged. To plan for the work necessary to preserve, process and provide access to the media files in the future, the repository would like the ability to generate a report on the types and extents of media files within disk images.

+

Reconciling filenames with embedded technical metadata/named parameters: I'd like to explore whether it would be possible to compare embedded technical metadata (file/MIME type/external signature) to existing media filenames, to ensure that all files in a given directory are what they are supposed to be according to the extension. A message/report could be produced if any files do not match the named parameters.

Potential User Story: As a CONTENT MANAGER, I need to verify that files with a "mov" extension in a named directory (*.mov) are QuickTime files, so that I can ensure filenames accurately represent embedded technical metadata.

Pre-conditions: specifications of the files are already determined (i.e. all access files are QuickTime-wrapped .mov); associated utilities for reading metadata are available.

Post-conditions: filenames include the accurate extension; the content manager is delivered a report of any/all inaccurately named files in the directory.
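A minimal sketch of the extension check described above, calling the MediaInfo command line from Python to read the container format of each file. The directory path and the expected format strings are assumptions; MediaInfo's wording varies by version, so adjust to local output:

import os
import subprocess

def container_format(path):
    # Ask the MediaInfo command line for the general Format field of a file.
    out = subprocess.check_output(["mediainfo", "--Inform=General;%Format%", path])
    return out.decode("utf-8").strip()

def report_mismatches(directory, extension=".mov", expected=("QuickTime", "MPEG-4")):
    # List files whose extension does not match the container MediaInfo reports.
    # NOTE: the expected strings vary by MediaInfo version; adjust to local output.
    mismatches = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(extension):
            fmt = container_format(os.path.join(directory, name))
            if fmt not in expected:
                mismatches.append((name, fmt))
    return mismatches

for name, fmt in report_mismatches("/path/to/access_files"):  # hypothetical directory
    print("Mismatch: %s reports container format %r" % (name, fmt))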

Tool Name

MediaWalker (Texas Ranger)

GitHub

Here is our GitHub repository: https://github.com/dmmd/AMIA_HACK/

Includes the Python script and sample files.

dmmd, kgrons, walterforsberg, yvonneng, groakus, mistydemeo

Notes

Notes from the Nov 1 planning call

MediaWalker Documentation (public): https://docs.google.com/spreadsheet/ccc?key=0ArMWuWMTUNRgdGNMbDYyRzZPOTBpTDJsU2R6cFZWRnc&usp=sharing

A PDF of what DFXML looks like plus mocked-up MediaInfo output: https://docs.google.com/file/d/0B1hVT_M0h1f_VnVqZnV4R0J1amc/edit

FFprobe output description: http://stackoverflow.com/questions/3199489/meaning-of-ffmpeg-output-tbc-tbn-tbr

Possible starting points

Registries for extension associations (ex. PRONOM: http://www.nationalarchives.gov.uk/PRONOM/Default.aspx)

MediaInfo: http://mediaarea.net/en/MediaInfo

Exiftool: http://www.sno.phy.queensu.ca/~phil/exiftool/

Georgetown University Lib File Analyzer?: https://github.com/Georgetown-University-Libraries/File-Analyzer

http://www.sleuthkit.org/sleuthkit/
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.5362&rep=rep1&type=pdf
https://raw.github.com/dfxml-working-group/dfxml_schema/v1.1.0/dfxml.xsd

FFMPEG: http://www.ffmpeg.org/download.html

Submitted by

Donald Mennerich

+

Kathryn Gronsbell

Data set required

Forensic disk images containing audio and video files

Team members

Misty De Meo
Jason Evans Groth
Walter Forsberg
Kathryn Gronsbell
Donald Mennerich
Yvonne Ng

Output

https://github.com/dmmd/AMIA_HACK/blob/master/fiwalk/amia.xml

How to Run the MediaWalker

https://github.com/dmmd/AMIA_HACK/blob/master/fiwalk/mediainfo.py

https://github.com/dmmd/AMIA_HACK/blob/master/fiwalk/ficonfig.txt

Ensure that the following are installed on your computer:
xcode: https://developer.apple.com/xcode/
homebrew: http://brew.sh/
mediainfo: On the command line enter $ brew install mediainfo
sleuthkit: On the command line enter $ brew install sleuthkit
python: On the command line enter $ brew install python

# I don't have Python installed on my computer!!! Don't freak. pip is a Python package installer. You need it to install lxml, a popular XML parser for Python. Here's the command line:

$ sudo easy_install pip

# Then, type the command below to install lxml (it takes a while):

$ sudo pip install lxml

# While you're waiting, you need to customize the config file!

Open "ficonfig.txt" in a text editor, then update the location of the script on your local drive. The line initially looks like: * dgi python fiwalk/mediainfo.py. Insert your file path before the fiwalk/mediainfo.py portion, so that it becomes (for example): * dgi python /Users/yvonne/Desktop/amia_hack/fiwalk/mediainfo.py. Whew! Save your doc and close.


(Screenshots: pre-config and post-config samples of ficonfig.txt)


# Then you want to run MediaWalker in fiwalk. This command will create your DFXML file with audio/video metadata. Make sure that you redirect the standard output with ">" to a filename ending in ".xml":

$ fiwalk -xc /FILEPATH/TO/ficonfig.txt /FILEPATH/TO/yourDiskImage > /PREFERRED/DESTINATION/FOR/THE/DFXML.xml


3. Metadata Schemers: Metadata schema developing and mapping tool

Original project proposal

Metadata is becoming more and more important for various aspects of video archiving (e.g. conservation, management, access), but there is little help for non-AV-specialist practitioners. An easy-to-use tool with a simple graphical interface could be one valuable element. The project could be to develop a tool for editing existing (or self-developed) metadata schemas/standards, with export functionality producing schemas in useful formats (like XML, stylesheets, etc.) usable in widespread programs used for collection management/description (like FileMaker, Excel, Access, etc.). An additional part of such a tool could be a mapping and data transformation element, allowing users to map one existing schema (in different file formats like XML or CSV) to a target schema (like EBUCore) and transform existing data. An online version of such a tool could collect and disseminate edited schemas, crosswalks, mapping schemas, etc., and serve as an exchange platform.

Possible starting points

  • Any interest in creating mappings to allow DPLA (http://dp.la/info/about/faq/) to expose richer metadata about sound/moving image content? DPLA crosswalks here, or more info as needed...

Submitted by

Yves Niederhäuser

Team Members

  • Esha Datta
  • Meghan Fitzgerald
  • Yves Niederhäuser
  • Lai-Tee Phang
  • Nick Richardson
  • Neale Stokes
  • Pamela Vizner

M.O.D.E.M. (metadata organising, developing, editing and mapping)

Your Path to a Shiny New Schema!

Problem

Many schemas and frameworks exist for AV metadata, but they rarely fully meet the needs of any given organization or collection, and the use of standards is not widespread in AV conservation, leading to problems for sharing and aggregating on access platforms. Additionally, data is often supplied in forms that are not suitable for direct upload into the collection management system/database, requiring time and IT skills to correct manually prior to ingest.


Project Scope

This application will be developed iteratively. Right now, it will create a custom metadata schema derived from an uploaded data set and existing metadata schemas.

In its first phase, this application will not perform data mapping and transformation; those functions might be available in a later phase (see Further Work). It will be a web-based application using JavaScript. In its first iteration, it will only be able to import files in XML or CSV formats, and will only export schemas in XML and CSV formats.
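Purely to illustrate the "derive elements from an uploaded data set" step (the planned application itself is web-based, not Python), here is a hypothetical sketch that reads a CSV export and reports each column header as a candidate schema element, with a rough count of how often it is populated. The file name is an assumption:

import csv

def candidate_elements(csv_path):
    # Read a CSV data set and report, for each column header, how many rows
    # actually contain a value -- a rough way to surface candidate schema elements.
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        counts = dict.fromkeys(reader.fieldnames, 0)
        for row in reader:
            for name in counts:
                if (row.get(name) or "").strip():
                    counts[name] += 1
    return counts

for element, populated in candidate_elements("collection_export.csv").items():
    print("%s: %d populated rows" % (element, populated))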


Intended User Base

Librarians, archivists, and other professionals with limited knowledge of AV metadata standards, who nevertheless may be tasked with the creation, management, or transformation of moving image metadata, and who have limited programming proficiency or lack advanced IT knowledge. The application is source-agnostic and can be used to create custom schemas from any metadata standard.


Benefits

• Provides a generic interface to develop a custom metadata schema based on existing data sets (extracting elements) and (standard) schemas.

• Provides a generic interface to map disparate data sources to various metadata schemas.

• Is scalable to different projects and different types of metadata.

• Creates efficiencies as staff time previously used to manually manage metadata can be used for other tasks.

• Allows for ease of metadata standardization across databases and systems.


High Level Functional Requirements

• Ability to import metadata schemas for AV materials

• Ability to extract fields from an uploaded data set as data elements

• Ability to rearrange, split, merge, rename, compare and map data elements from imported schemas and data set, to create a custom metadata model

• Ability to export custom metadata model as XML and CSV

• Ability to keep track of what is being mapped and the sources of the constituent data elements


User Stories

• As a metadata manager, I work with multiple streams of non-standardized data that need to be collated and made to conform to my organization’s data model. I would like a system where I can integrate the multiple streams of metadata, map them to my model, and create a custom schema that correctly describes the asset to an accepted standard according to that model.

• As a metadata manager, I work with legacy systems that do not all use the same metadata schema. I would like a system that will allow me to easily map the data in those legacy systems to a single standard.

• As a metadata analyst, I have user generated metadata coming in from various projects. I work with books that are digitized, e-books, postcard digitization and digital video projects. I want a system where I can map user generated metadata to various fields from other metadata standards.

• As an archivist, I have to develop a metadata schema suitable for AV media, which can be integrated with existing tools (finding aids etc.).

• As an archivist, I receive information from content generators that does not conform to a format that I can use. It seems unlikely that they can be initially "trained" to provide data in a more standardised format. I would like a tool that will allow me to make my work more efficient while more easily providing feedback to data suppliers on how this data might be improved for my use.


Further Work

Suggestions for future development

• Creation of data entry form based on custom schema and rules determined in data mapping step so that content providers can conform to proper data formatting

• Ability to save custom models previously created as templates

• Ability to collect and disseminate edited schemas, crosswalks, mapping schemas, etc. and serve as a sharing platform

• Create RDF functionality from generated metadata schemas.

• Explore data mapping and transformation functionality: the ability to import sets of data and perform mapping and transformation to conform to the custom metadata model/schema; user-friendly interface to create rules for simple data transformation, e.g. standardise date format, provide default value for fields with no data, etc.
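As one tiny sketch of the "simple data transformation" rules mentioned in the last bullet above, a date-standardisation helper might look something like this; the input layouts listed are assumptions about what legacy data could contain:

from datetime import datetime

# A few date layouts that might appear in legacy AV metadata (assumptions).
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%Y%m%d"]

def standardise_date(value, default="unknown"):
    # Try each known layout and return an ISO 8601 (YYYY-MM-DD) string,
    # or a default value when the field is empty or unparseable.
    value = (value or "").strip()
    if not value:
        return default
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return default

print(standardise_date("03/11/2013"))  # -> 2013-11-03
print(standardise_date(""))            # -> unknown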

4. The Amazing METS: Creating a sample METS document (addressing the METS specification) for a digitization project of analog audiovisual collections

Several sets of specifications are already available for creating a METS schema, but I have not really heard of any complete METS example that is boilerplated to work for a real digitization project. The University of Michigan, after several attempts to find an existing schema we could piggyback on, is currently creating an example METS document for an outsourced digitization project that can be used from end to end. The application programmer in the Digital Library Production Service department has created it out of the existing audio METS XML, VideoMD, and other spreadsheets that U of M has been using as interim means, and several people involved are now discussing and examining that sample section by section. I would like to know if a group can sit down, investigate this current sample, and give comments/feedback about its possible limitations/errors/issues in order to make a better version of it. If the whole sample is too big to work on in a day, I would propose reviewing only the process history/provenance section, since that could be the most challenging section to tackle due to the complicated video digitization process itself. If we can come up with anything that seems to work as a working sample, it can be shared/distributed and used in this standard-less age.

And here again, an online version of such a tool could collect and disseminate edited schemas, crosswalks, mapping schemas, etc., and serve as an exchange platform.

Notes

Notes from the Nov 1 planning call

Day of notes

Possible starting points

Here is the very drafty draft that the UM programmer created. There are many notes and it does not look quite complete, but I believe it can be a starting point. More than anything, we need outsiders who can review this with fresh eyes and a range of different experiences.

Both the video process history schema and example METS are located in the directory: http://www-personal.umich.edu/~grosscol/vprocesshistory/

Related skillsets

Knowledge of video/audio metadata; familiarity with audiovisual digitization projects?

Data set required

Existing metadata sets created from the digitization projects at each institution

Submitted by

JungYun Oh

Team Members

The Amazing METS!
JungYun Oh
Hannah Frost
Kara Van Malssen
Emily Nabasny

Solution

Use Case: Our use case for this project is a single content item, on analog videotape, which is reformatted to a digital file set. The source object, digitization process, and resulting file set should be described in one METS file.

Solution: Our objective for the purposes of the hack day is to articulate a content model within METS for reformatted video content. Our goals include:

  • Identify existing schemas that can be used to express the various components of the METS file
  • Determine how those schemas should be expressed within METS containers
  • Identify minimal fields that should be captured using those schemas within METS
  • Articulate controlled vocabularies when applicable

Deliverables:

  • Create a sample METS file for a U-matic source object that is reformatted into a preservation master and a mezzanine file; document the sample, describing element usage and enumerating recommended controlled vocabularies where appropriate
  • Refine the model with input from the community with the goal of eventually creating a METS profile which can be added to the profile registry maintained by the standards office at the Library of Congress.

The project is maintained in a GitHub repository. Documentation and notes are available here.
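To make the target concrete, here is a hypothetical, heavily simplified METS skeleton for the U-matic use case, generated with Python's ElementTree. The element choices, identifiers, and file names are illustrative assumptions, not the group's agreed model, and the result is not a schema-valid METS document:

import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)

mets = ET.Element("{%s}mets" % METS, {"OBJID": "umatic_0001"})        # hypothetical identifier
ET.SubElement(mets, "{%s}dmdSec" % METS, {"ID": "dmd1"})              # descriptive metadata would go here
amd = ET.SubElement(mets, "{%s}amdSec" % METS, {"ID": "amd1"})
ET.SubElement(amd, "{%s}techMD" % METS, {"ID": "tech1"})              # e.g. VideoMD for the digital files
ET.SubElement(amd, "{%s}digiprovMD" % METS, {"ID": "dp1"})            # process history / provenance of the transfer
filesec = ET.SubElement(mets, "{%s}fileSec" % METS)
for use, href in [("preservation", "pm/umatic_0001_pm.mkv"),          # file names are placeholders
                  ("mezzanine", "mz/umatic_0001_mz.mov")]:
    grp = ET.SubElement(filesec, "{%s}fileGrp" % METS, {"USE": use})
    f = ET.SubElement(grp, "{%s}file" % METS, {"ID": use})
    ET.SubElement(f, "{%s}FLocat" % METS,
                  {"LOCTYPE": "URL", "{%s}href" % XLINK: href})
smap = ET.SubElement(mets, "{%s}structMap" % METS)
div = ET.SubElement(smap, "{%s}div" % METS, {"TYPE": "videoObject"})
ET.SubElement(div, "{%s}fptr" % METS, {"FILEID": "preservation"})

print(ET.tostring(mets, encoding="unicode"))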

5. Fast Forward: Produce easy-to-follow documentation for the installation and use of FFMPEG transcoding software

Specific usage topics might include batch transcoding, metadata extraction, common output profiles, and FFMPEG version upgrades. Evaluation of available GUIs might also be included as a secondary goal.
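For instance, one hedged sketch of the kind of batch-transcoding recipe such documentation might cover: walking a source folder from Python and calling ffmpeg on each file. The paths are placeholders and the output profile shown (H.264/AAC in MP4) is only an example, not a recommendation:

import os
import subprocess

SOURCE_DIR = "/path/to/masters"        # hypothetical paths
OUTPUT_DIR = "/path/to/access_copies"

for name in sorted(os.listdir(SOURCE_DIR)):
    if not name.lower().endswith((".mov", ".avi", ".mkv")):
        continue
    src = os.path.join(SOURCE_DIR, name)
    dst = os.path.join(OUTPUT_DIR, os.path.splitext(name)[0] + ".mp4")
    # -c:v libx264 / -c:a aac pick example codecs; tune flags to local needs.
    subprocess.check_call([
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-crf", "23",
        "-c:a", "aac", "-b:a", "128k",
        dst,
    ])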

Possible starting points

http://www.ffmpeg.org/

http://avanti.arrozcru.com/

http://sourceforge.net/projects/ffmpeg-gui/

Check also: http://www.reto.ch/training/2013/20130503/ (it's in German, but commands are commands...)

Kathryn Gronsbell: Helpful hints for basic FFMPEG from Kelly Haydon https://docs.google.com/document/d/1zbThoqnEl50Yw_fG9prHSptlIjo6tdteieVq4XP4K_E/edit?usp=sharing

Related skillsets

Windows/Mac/Linux Operating Systems, Document Writing, Digital Media Transcoding

Data set required

Sample media files for transcode tests

Submitted by

Nash Bly

Interested team members/participant roles

Software Testers, Media Transcoders, Document Writers - Who wants to work on this project?

Working Documents

ffmpeg hackday notes - https://docs.google.com/document/d/1RFlXJGXChbIwNXs3Ka01sHj-RXNEAt1h9yWPpFvZUJ4/edit?usp=sharing

Merged with Timecoded transcripts and FFMPEG documentation: Moving Image Research Collections Digital Video Repository

Several potential ideas for improving this DVR that can hopefully be integrated into other sites…
- Timecode-based tagging in videos or other ways to allow for user-generated metadata
- A way to connect related video material
- Scripts for transcoding video (modifying an existing script)
- Issues in XACML restrictions / an easy way to make records public/non-public

Possible starting points

DVR: http://mirc.sc.edu
Git: https://github.com/DGI-USC

Related skillsets

Drupal knowledge, Fedora/Islandora, ffmpeg, Python

Data set required

Video files, records, scripts, the DVR itself? (Providable.)

Submitted by

Ashley Blewer

Replaced by Format/Codec selection tool: Digitization workflow development tool

Where do I start once I have decided digitization is the right thing to do for my video collection? How do I decide whether to build up infrastructure/know-how in-house or to outsource digitization? How do I need to prepare analog tapes for best results and minimal risk? What information do I need, and which requirements do I have to ask for in a call for tenders? What do I have to do, and how do I control the quality of digitization? How do I store the new archive masters and access copies? Which codecs/formats are best in my case? A little stand-alone or online tool for video collections/non-specialist practitioners, maybe something like an interactive flow chart or decision path, that helps to ask the right questions and produces an automatic report after running it, could be a big help for lots of non-video-specialist collection managers and serve as a starting point for consultations, evaluation of tenders, convincing decision makers, etc. A possible online version of a tool like this could integrate a "similar projects" functionality, pointing collection managers to other projects/people with experience in similar cases, thereby building up/strengthening a network for exchange. I think there is still big potential in bringing people in this field together!

Possible starting points

There are tons of online survey tools that could maybe be used as a technical starting point; the right set of questions could be collected/prioritized/structured during the hack day.

Related skillsets

Technical/developer skills (exact needs unknown), plus some video digitization and collection management expertise, are needed for this project.

Data set required

None.

Submitted by

Yves Niederhäuser

4. Format/codec evaluation/selection tool

"What format should I use when digitizing my videos?" This is by far the most frequently heard question for video archiving consultants, I guess. But possible answers are complicated and very context-related, which is to say: often frustrating for the asking non-specialist practitioners as well as for consultants. For a possible hack day project, see the description of the submitted idea for a digitization workflow development tool above. A format/codec evaluation/selection tool could be part of, or a first element of, this bigger tool.

Notes

Notes from the Nov 1 planning call

Possible starting points

See idea for a digitization workflow development tool above.

Related skillsets

See idea for a digitization workflow development tool above.

Data set required

See idea for a digitization workflow development tool above.

Submitted by

Yves Niederhäuser

Interested team members/participant roles

Who wants to work on this project?

8. CURATEcamp-style discussion

For those who are more interested in meeting up with other folks for discussion and brainstorming on specific topics, we are setting aside an area for CURATEcamp-style "unconference" breakout groups. Folks interested should come prepared with potential topics for discussion. These will be gathered on the morning of the event and voted on by the registrants in the CURATEcamp stream. For more information, please visit the CURATEcamp website, and see the documentation from CURATEcamp AVpres 2013, held in April 2013.

Please note that while discussion groups are not discouraged, they will not be eligible for awards.

Deprecated topics

3. RDFing PBCore

Let's see if we can come up with an RDF expression for PBCore. It could be useful for things like the up-and-coming Fedora 4.

Notes

Notes from the Nov 1 planning call

Possible starting points

http://pbcore.org/index.php
http://www.w3.org/TR/REC-rdf-syntax/
http://dublincore.org/documents/dc-rdf/
Bawstun app from WGBH...can output PBCore XML from EBUCore RDF...could be reverse engineered? https://github.com/curationexperts/bawstun/tree/master/app/models
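Purely as a thought-starter, a sketch of what a single PBCore-ish description might look like as RDF using the Python rdflib library; the pbcore: namespace URI and property names here are invented placeholders, since settling on real terms is exactly what this project would have to decide:

from rdflib import Graph, Literal, Namespace, URIRef

PBCORE = Namespace("http://example.org/pbcore/")   # placeholder namespace, not an official one
g = Graph()
g.bind("pbcore", PBCORE)

asset = URIRef("http://example.org/assets/interview_01")   # hypothetical asset URI
g.add((asset, PBCORE.pbcoreTitle, Literal("Interview with the mayor")))
g.add((asset, PBCORE.pbcoreAssetType, Literal("Moving Image")))
g.add((asset, PBCORE.pbcoreIdentifier, Literal("interview_01")))

print(g.serialize(format="turtle"))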

Related skillsets

Any of: knowledge of PBCore, XML/RDF, OWL, or metadata schemas in general

Data set required

Sample PBCore (to be provided)

Submitted by

Kara Van Malssen (idea by Karen Cariani)

Interested team members/participant roles

Who wants to work on this project?