Archivematica



Peter Van Garderen

(peter@artefactual.com)

Artefactual Systems, based in British Columbia, 8 employees, most with library science degrees;

creating open source solutions and providing consulting services.


Beta release at the end of this calendar year. They started out as a proof of concept -- 3 years ago, there was no Hydra and no commercial Safety Deposit Box.

There were no solutions; they needed to fill a gap.

The upcoming ArchivesSpace is a great project.

Guiding principle of Archivematica is OAIS.

Software released under a CC license; see archivematica.org


There are plenty of open source utilities floating around -- they mapped the available tools to the OAIS concept slide and started stitching them together and filling in the gaps.

Archivematica is an integrated stack of open source apps. They created Debian packages for the tools, and keep them up to date with each release to support the software.

It's an alternative to the Fedora-based approach, which has high overhead.

"We were doing microservices before they were called microservices."

They get a bunch of files and process them using the watched-directory approach (driven by cron jobs).

As soon as files are dropped into the directory, scripts collect checksums, extract metadata, etc., and run microservices.

Either it works or it doesn't -- errors are output into another watched directory. XML config files chain workflows together. (A sketch of the pattern follows.)
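A minimal sketch of that watched-directory pattern, for illustration only: the paths, service list, and function names here are hypothetical, not Archivematica's actual code, but they show the cron-driven poll-and-process loop with an error directory.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical watched directories; the real layout differs.
INCOMING = Path("/var/watched/incoming")
PROCESSED = Path("/var/watched/processed")
ERRORS = Path("/var/watched/errors")

def sha256sum(path: Path) -> str:
    """Collect a checksum, one of the first microservices run on a file."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def extract_metadata(path: Path) -> dict:
    """Stand-in for a metadata-extraction microservice."""
    return {"name": path.name, "size": path.stat().st_size,
            "sha256": sha256sum(path)}

# Microservices chained in order, like the XML workflow configs.
MICROSERVICES = [extract_metadata]

def process(path: Path) -> None:
    try:
        for service in MICROSERVICES:
            service(path)
        shutil.move(str(path), str(PROCESSED / path.name))
    except Exception:
        # "Either it works or it doesn't": failures land in a watched error dir.
        shutil.move(str(path), str(ERRORS / path.name))

def poll() -> None:
    """Invoked from cron; handles whatever was dropped since the last run."""
    for entry in INCOMING.iterdir():
        if entry.is_file():
            process(entry)

if __name__ == "__main__":
    poll()
```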

It goes from ingest to access, compliant with OAIS.


Managing complexity as simply as possible: provide a simple interface, with no need to be familiar with command-line options or complex metadata.

A 12-minute screencast is available online -- look up archivematica071.avi on YouTube.

The main focus is to handle ingest.

Uses the "foobar"(?) file manager which ships with the Linux distribution.

They work with a limited budget, limited scope, and tight deadlines.

With each iteration, they add functionality and usability.


To use it, install Archivematica on a client and bring the files into that system via system transfer or off hard drives, whatever -- it requires a local file manager.

Runs on Ubuntu, but will also run in a virtual machine (VirtualBox, VMware(?), Xen). GPL license. About 25 open source tools are included at this point.


You can run the whole system from a USB key.


The Master Control Program (MCP) server is what it runs on. The MCP hands off services to other processing clients to speed up work.
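A toy sketch of that fan-out idea, using a local process pool to stand in for distributed processing clients; the MCP's actual job-distribution protocol is not shown here, and the job names are invented.

```python
from multiprocessing import Pool

def run_job(job):
    """Stand-in for a processing client executing one microservice job."""
    service, target = job
    # A real client would run the configured command for this service.
    return f"{service} completed on {target}"

if __name__ == "__main__":
    # The "MCP" queues jobs; three worker processes play processing clients.
    jobs = [("checksum", "file1.tif"),
            ("identify", "file2.pdf"),
            ("normalize", "file3.doc")]
    with Pool(processes=3) as pool:
        for result in pool.map(run_job, jobs):
            print(result)
```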

We don't want to be the storage system. Mostly working with network storage, but looking ahead to cloud storage and LOCKSS as well.

We don't want to be a metadata management or web discovery tool.


Archivematica receives the SIP, creates the AIP and hands it off to storage, and also creates a DIP package for the consumer.

Working on monitoring and syncing with OAI stored content,

hoping to leverage OAI or AtomPub for this. Also working to sync with format registries.

(Get an OAI identifier on submission and then poll.)

Using a REST API.
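A hedged sketch of that submit-then-poll pattern: the endpoint URL, path, and JSON fields below are invented for illustration and are not a real Archivematica API.

```python
import json
import time
import urllib.request

# Hypothetical status endpoint for a stored package.
BASE = "http://repository.example.org/api/packages"

def poll_until_stored(identifier: str, interval: int = 60) -> dict:
    """Given the identifier returned on submission, poll until a final state."""
    url = f"{BASE}/{identifier}/status"
    while True:
        with urllib.request.urlopen(url) as resp:
            status = json.load(resp)
        if status.get("state") in ("stored", "failed"):
            return status
        time.sleep(interval)
```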


Lots of legacy rescue work with archives. Much work must happen pre-ingest (they call it transfer processing).

What's the best method to approach that when there's no nice, neat export?

Perhaps a pre-ingest processing workstation. Looking at Curator's Workbench and digital forensics tools.

Constraint -- they would need to bundle this into the Archivematica stack and include it in the dashboard. A challenge. Cal Lee's project could solve this for them (he is a UNC professor).

The end goal is to put away high-quality preservation packages.


We do apply normalization strategies, including some conversion to open formats. We keep the original, and create PREMIS records for everything.

MODS, EAD, and DC are accepted.

Will create a data entry form for DC, but no others.

Using METS to tie all the content together.
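As a rough illustration of how METS can tie a PREMIS event to a file entry, here is a skeletal (not schema-complete) document built with the standard library. Archivematica's real METS output is far richer; the element choices and values here are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
PREMIS = "http://www.loc.gov/premis/v3"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("premis", PREMIS)
ET.register_namespace("xlink", XLINK)

def minimal_mets(href: str, sha256: str) -> bytes:
    """One original file plus one PREMIS event, wrapped in METS."""
    mets = ET.Element(f"{{{METS}}}mets")

    # PREMIS event recorded in an administrative metadata section.
    amd = ET.SubElement(mets, f"{{{METS}}}amdSec")
    dp = ET.SubElement(amd, f"{{{METS}}}digiprovMD", ID="digiprov-1")
    wrap = ET.SubElement(dp, f"{{{METS}}}mdWrap", MDTYPE="PREMIS:EVENT")
    xdata = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    event = ET.SubElement(xdata, f"{{{PREMIS}}}event")
    ET.SubElement(event, f"{{{PREMIS}}}eventType").text = "message digest calculation"
    ET.SubElement(event, f"{{{PREMIS}}}eventOutcomeInformation").text = sha256

    # File section pointing at the original, kept alongside any normalized copy.
    fsec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    grp = ET.SubElement(fsec, f"{{{METS}}}fileGrp", USE="original")
    f = ET.SubElement(grp, f"{{{METS}}}file", ID="file-1")
    ET.SubElement(f, f"{{{METS}}}FLocat", {f"{{{XLINK}}}href": href})

    return ET.tostring(mets, encoding="utf-8", xml_declaration=True)

print(minimal_mets("objects/photo.tif", "ab12cd34").decode())
```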


The problem is the chaos in content coming in. What do we do with it?

Use external tools? We would then have to support them. We decided to add additional microservices, including some of the same capabilities as these external tools.

Doing requirements analysis on those now. They want feedback on requirements and design.

See the transfer requirements page (linked from the wiki).

To do this, they would add an additional watched directory, to which one would drag external media.

This would also move additional services up the pipeline. Considering: what's the difference between transfer and ingest?

Includes virus checks and quarantine -- everything is optional.

Will capture a snapshot of the original directory structure, because it has meaning in itself.

Extracts zip and tar files. (A sketch of the snapshot-and-extract steps follows.)
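A sketch of those two transfer steps, assuming the virus check and quarantine have already run; the file naming and output layout are illustrative assumptions, not Archivematica's actual behavior.

```python
import tarfile
import zipfile
from pathlib import Path

def snapshot_tree(root: Path, out: Path) -> None:
    """Record the original directory layout, since the structure itself
    carries meaning and later steps may rearrange files."""
    with out.open("w") as f:
        for p in sorted(root.rglob("*")):
            f.write(str(p.relative_to(root)) + "\n")

def extract_packages(root: Path) -> None:
    """Unpack zip and tar files so their contents also enter the pipeline."""
    # Snapshot the lazy iterator first so newly extracted files are not revisited.
    for p in list(root.rglob("*")):
        if not p.is_file():
            continue
        dest = p.parent / (p.stem + "_extracted")
        if zipfile.is_zipfile(p):
            with zipfile.ZipFile(p) as z:
                z.extractall(dest)
        elif tarfile.is_tarfile(p):
            with tarfile.open(p) as t:
                t.extractall(dest)  # assumes content was already scanned/quarantined
```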



Syncing how they create rights records with the access tool.

Doing analysis on rights metadata.

No standards for accession records yet -- using the Archivists' Toolkit version.

There is a rights dialog kit (cool!).


Currently in a pilot at the Rockefeller Archive Center, with Archivists' Toolkit -- importing EAD. With DSpace, they copy it over to Archivematica -- if it's not in preservation form, they transform it.

No existing apps are doing rights metadata well.

The goal is for the rights metadata in the accession record to come along for the ride.



Unstructured transfers: what do we do with them?

The user needs to be able to right-click and create SIPs, but then the archivist is required to make sense of all the content.

No intellectual control -- what do we do? There will be physical file system constraints at some point.

What would be good is to add a microservice providing an index of the selected folder, pulling keywords, pattern-matching for privacy/security-sensitive information, and listing PDFs that have not been OCR'd.

Using Apache Tika and Elasticsearch.

Should be able to feed in institution-specific codes to hunt for (see the sketch below).
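A sketch of that scan using plain regexes as stand-ins; a real pipeline would extract text with Apache Tika and index it in Elasticsearch as mentioned above. The patterns, including the institution-specific code, are invented examples.

```python
import re
from pathlib import Path

# Example patterns only; an institution would feed in its own codes to hunt for.
DEFAULT_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US Social Security style
    "case_code": re.compile(r"\bCASE-\d{6}\b"),    # hypothetical internal code
}

def scan_folder(root: Path, patterns=None) -> dict:
    """Flag files whose (naively read) text matches any sensitive pattern.
    A real version would scan Tika-extracted text; PDFs that yield no text
    at that stage are also good candidates for the not-yet-OCR'd list."""
    patterns = patterns or DEFAULT_PATTERNS
    report = {}
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        try:
            text = p.read_text(errors="ignore")
        except OSError:
            continue
        hits = [name for name, rx in patterns.items() if rx.search(text)]
        if hits:
            report[str(p)] = hits
    return report
```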

Then file visualization analysis -- for example, showing the relative sizes of subdirectories -- to help the archivist determine what should and should not be included in storage.
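A small sketch of the size analysis: total bytes per immediate subdirectory, which is enough to drive a simple relative-size view. The function name is invented.

```python
import os

def subdir_sizes(root: str) -> dict:
    """Total bytes per immediate subdirectory, for a relative-size view."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    fp = os.path.join(dirpath, name)
                    if os.path.isfile(fp):
                        total += os.path.getsize(fp)
            sizes[entry.name] = total
    return sizes
```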

May need to add tools to assist in viewing older files from the file manager.



They believe strongly in making this free and open source.

Suggested use of Autopsy to pull strings, etc., out of files, to give info on older files that can't be opened in their original form without additional software. The Sleuth Kit is the underlying software; Autopsy is its browser-based front end.

When they go into production, they will reduce the current format list to those formats they are confident they can manage well.


Relation between DIP and SIP? There is a one-to-one relationship in the number of files submitted. Identification of duplicates is included -- provide an option to keep them?
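A minimal sketch of checksum-based duplicate identification, so the archivist can decide whether duplicates stay; reading whole files into memory is fine for a sketch, though a real service would stream them.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: Path) -> dict:
    """Group files by SHA-256; any group larger than one is a duplicate set."""
    groups = defaultdict(list)
    for p in root.rglob("*"):
        if p.is_file():
            groups[hashlib.sha256(p.read_bytes()).hexdigest()].append(p)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```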

Heather (Cal Lee's student) is helping them. They want an RDF endpoint for the supported formats.


The interface shows which tools and commands are used, and they can be modified -- that's a local copy.

There will be a way for institutions to sync to new preservation rules and formats in their own registry, and their registry will interoperate with other format registries, piggybacking on others.

Archivematica will put their own out in a structured form and leverage crowdsourcing.

Will index all AIPs and offer a faceted search interface into archival storage.

The AIP contains links in its metadata to the files (using METS).
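A toy illustration of faceting over AIP metadata; the records and field names are invented, and a real deployment would sit on a search engine rather than in-memory dicts.

```python
from collections import Counter

# Invented AIP index records for illustration.
AIPS = [
    {"id": "aip-001", "format": "TIFF", "collection": "photographs"},
    {"id": "aip-002", "format": "PDF/A", "collection": "reports"},
    {"id": "aip-003", "format": "TIFF", "collection": "reports"},
]

def facet_counts(records, field):
    """Count values of one metadata field to drive a faceted search UI."""
    return Counter(r[field] for r in records)

print(facet_counts(AIPS, "format"))  # Counter({'TIFF': 2, 'PDF/A': 1})
```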