CURATEcamp SAA 2013 Notes

Session 1: 11am - 11:40am

Topic 1: Email

Workflow: Records Management > Administrative Assistant > IT
Accept that donors will probably never curate their own e-mail
Focus on security/privacy when weeding/redacting, don't overweed
ePADD Project = NHPRC grant to Stanford to develop a software program to process email archives and make them discoverable:
- http://library.stanford.edu/spc/more-about-us/projects-and-initiatives/epadd-project
Consider processing on-demand e-mail or e-records in general
How do we manage e-mails with restrictions and time seals in the long-term?

Topic 2: Preserving Business Systems

Getting a handle on multiple information systems
Preserving SharePoint (document libraries, linkages, rules and processes)
Preservation of Databases (emulation? extraction?)
Normalization and emulation for outdated software
Business system preservation
Workflows and different combinations of things like:
- ArchivesSpace
- Archivematica
- Islandora
- Fedora
- Hydra etc.

Enforcing records management control over information extracted from different business systems
How much information do you need to extract? Documentation? Decision-making process/selection/appraisal?
Performing functional analysis of businesses to determine what kinds of records exist, where they originate etc.
Screencasts of interaction w/ systems to capture context/use?
SharePoint - grabbing static documents is more straightforward; the next challenge is dynamic content
CMIS (Content Management Interoperability Services) - for exchanging information/content between systems; increasingly documented by vendors
- http://en.wikipedia.org/wiki/Content_Management_Interoperability_Services
How important is the data vs the environment? How will researchers use these business systems in their research?
Emulation as solution?

PREMIS Environment http://www.loc.gov/standards/premis/index.html

What is the record? Is it the data, or the system that makes the data usable?
Is a static snapshot acceptable?
Copying SharePoint to servers: do you lose data? Metadata?
Screencast/YouTube video of how the creator interacts with their systems (British Library)
SharePoint / ethics of trying to preserve proprietary software
Go for static content (documents) first: low-hanging fruit. Challenge: grabbing collaborative projects in SharePoint
Users want to be able to push content to archives.
physical versus legal custody
SERI (CoSA) for moving toward a better governance model.
CMIS (content management interoperability services)
Functional analysis of organizations, doing appraisal at that level, then going to individual users later to understand what their roles are
KS Historical Society was able to have input for IT programs over $250k, getting involved on ground floor.
SIARD - archiving of databases (keep persistent information in XML, or ingest information into an independent database)
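
The extraction idea behind SIARD noted above (keeping database content in XML, independent of the originating system) can be illustrated with a minimal sketch. It does not produce the actual SIARD format: the element names (`database`, `table`, `row`, `column`) and the sample `staff` table are hypothetical, chosen only to show the approach using Python's standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table_name):
    """Dump one table to a simple XML tree (illustrative layout, not SIARD)."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the database's own catalogue before interpolating it.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table_name not in known:
        raise ValueError(f"unknown table: {table_name}")

    cursor = conn.execute(f'SELECT * FROM "{table_name}"')
    columns = [desc[0] for desc in cursor.description]

    table_el = ET.Element("table", name=table_name)
    for record in cursor:
        row_el = ET.SubElement(table_el, "row")
        for column, value in zip(columns, record):
            col_el = ET.SubElement(row_el, "column", name=column)
            # NULL values become empty elements flagged with null="true".
            if value is None:
                col_el.set("null", "true")
            else:
                col_el.text = str(value)
    return table_el


def demo():
    """Build a tiny in-memory database and return its XML export as text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (id INTEGER, name TEXT, office TEXT)")
    conn.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                     [(1, "A. Archivist", "Room 12"), (2, "B. Curator", None)])
    root = ET.Element("database")
    root.append(table_to_xml(conn, "staff"))
    conn.close()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(demo())
```

A static export like this captures the data but none of the system behavior (queries, forms, business rules), which is the data-versus-environment question raised earlier in this topic.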

Topic 3: “Small shop tips, resources, and best practices”

Proposed Topics:

Managing digital collections with limited staff
Curation in a single person shop
Best e-records practices for lone arrangers
Getting started with digital curation: hurdles, tips, resources
Software that needs little or no IT support.

I. Brief introductions from participants, bringing up issues of interest.

II. Discussions

Staffing issues / funding:
- We are funded, but we have staffing issues as far as overworked staff, not enough support from IT.
- Most of us have dealt with downsizing in archives and library, and in IT. That’s been a big challenge for us in terms of level of support for projects in house. Collaboration with IT took three years.
- Do people use students? Sort of - only enough for maybe one. Interns? Some, some volunteers.
- Library consortium was able to work with development office at the university, and the funding office was able to assign some people to work with archives - especially for events, big alumni year, or centennial, etc. It wasn’t a fix but it was helpful in the short term - the development office can be a great ally in a university setting.
- How is digital curation split up, if you have multiple archivists? We do share the work and split it up depending on the time - but I think most of that is because both of us are inclined to that. People wear many hats.

IT support:
- Is it a challenge getting IT support for this kind of stuff? Yes. IT is overworked. One proposed setting up DSpace and they were totally into it (open source), but it took six months to get the program installed.
- Is hosting a solution? Yes, sometimes we can spend money, but not money on staffing. Do you have to get permission from IT to go with hosting? Our IT dept is excited about hosting, because it saves them time. But we have legal forms that are somewhat of a barrier. Is Open Source an issue? Not really, unless we have to pay for it.
- It really depends on the culture of the place, especially the IT culture.

Best practices: in terms of living up to “best practices” - some “best practices” are kind of beyond us right now, like digital forensics. Is anyone having these issues?
- If you have no resources, what is the minimum you can do so you’re not setting yourself up to fail? Is it just creating disk images of everything? Is it not just setting disks on the shelf?
- We tried to set up a DSpace instance for that exact reason. Getting it off of disk and to somewhere where you can do checks on it. Did you see the OCLC “Getting Started”? That might be the best for now.
- We had piles of CDs that we moved with Duke Data Accessioner.
- What kinds of digital collections are people talking about here? Disks in boxes from donation, other types of media, born-digital documents, political collections from congressman, we’re hoping to get more official university documents. Pictures, PDFs, speech drafts (docs), CD with a proprietary database (IQ database system).

Does anyone have any tips?
- I want to be able to tell donors up front the formats that we want to receive materials in, of course having to be delicate.
- Focusing on the front-end, and setting standards to follow.
- We have an “IR” and we’re getting things like ETDs, student work, architectural photograph collection, student pubs, etc. People understand the desire to provide access to this stuff - what they don’t get is the behind-the-scenes work to make it last. Has anyone encountered that in terms of selling a system?
- Our people wanted to outsource it all. Is that a reluctance to hire staff, or have properly trained staff? They didn’t want to pay more money for that kind of expertise? Was there commitment for long-term outsourcing?
- Are you working with purchasing department for contracts? We have to go year-to-year. Others are lucky to get the language in the contract written by the law department to get data back.
- If you can’t control how the data is given to you, and you can’t control who’s going to access it, how do you develop standards? And how do you educate people about preservation?

Back to the idea of “what is the minimum we can do given that the flow of materials is far outstripping tech resources”?
- “Do no harm” is about the best we can do, and we’re holding on to the original media just in case. We’re trying to find a backend that will accommodate all this.
- Are people using OAIS standards? We don’t have a repository right now, but we want to work towards that.
- Has anyone played with Archivematica? Some people have, but it might be difficult to get running? “A little tricky.”
- Has anyone done segmenting on your collection for “least risk” - “most risk”? Not really... The “Digital Value at Risk” calculator analyzes a collection, assigns a DVAR value, and works out the consequences of retaining that material over time.
- Has anyone looked at the NDSA categories for analyzing a collection? It offers another way to look at what you’re doing.

Topic 4: Access and Miscellaneous

Emulation -- can you provide it to users?
- MoMA in NYC is using Citrix to serve emulated software to workstations
- What if software is out of date but still copyrighted?
  - Institutions have to decide what’s an acceptable risk -- a thorny legal and ethical problem
  - Videogames an example of emulation where the hardware emulators exist but the legal right to the software remains proprietary

Problem of privacy/security/access controls when making digital archives available to the public web
- Do you have a single repository with both public and restricted materials, or two separate systems?
  - Consensus is that the combo of a dark archive and a public system is the best way to go right now, although some systems can handle both in one

How are institutions putting multimedia online?
- Avalon (from Hydra) -- a system for access to multimedia collections
- Jazz Project / Rock and Roll Hall of Fame
- Is anyone moving their content to linked open data? How? Is this a workflow question?
  - Is linked open data something that would benefit institutional archives?
    - Possible for positioning yourself and making yourself available for donations, researchers
    - Make institutional info available to digital humanities researchers, complementing the work done by cultural heritage institutions

Strategies for putting content online
- DSpace at Mount Sinai -- dspace.mssm.edu
- Georgia Tech: has a DSpace instance and an Omeka instance which use the same Dublin Core metadata, although upload to both is not automated
- Moving to Hydra (Fedora) -- allows multiple interfaces to a single underlying database
- Problem of rights: oral history recordings from the pre-computer era often have limited or unclear rights
- Do you put a watermark or header on digital objects? In the embedded metadata? Could a repository system do this on the fly?

Digital Humanities
- What is the authentic object being preserved?
- How does digital humanities data differ from science data?
- Metadata specific to oral history
  - What does StoryCorps use?
  - Many different disciplines work with oral history data -- these disciplines need to collaborate
- Working directly with researchers to describe their data
  - They fill out an “information sheet” -- no need to describe it as “metadata”!
- How can you make humanistic data available to researchers for data mining?

Session 2: 1:45-2:25pm

Topic 5: Organizational Culture

Possible Topics:

Organizational cultural collisions
Interdepartmental struggles in your institution and who is supposed to do what
Methods to encourage management to see the value in digitization/access
Barriers to building a business case for digital preservation
Strategies for developing relevant value of digital assets

Dealing with issues of resistance to change - integrating new work, new ideas, etc. into old organizational culture.

Issues like having technical departments coming to special collections for work of digitization. Departments not coming to special collections for other things like preservation. Issues with collaborating within own institution

Communication about new systems; getting buy-in

Working within an existing organization, trying to figure out who in the organization should be responsible for which parts of a program or project

Dealing with existing bureaucracies, working with existing IT, making sure our needs are getting heard

Importance of advocacy - essential tool in the work of the archivist
- Lack of understanding about what archivists do
- Change dynamic - present the positive angle instead of being prescriptive
- Identify high-level supporters

Mission statement - how does our work align with it?
- Archivists tend not to talk in terms of business value - the value of what they provide
- How do archivists measure ROI?
- Articulate how archivists can minimize risk

We can recast mission statement to include “access over time” instead of just preservation

How do you convince people that they need more than storing records on their own computer?

Storyboard for digital preservation - shows people how they fit into their roles (e.g. http://go.preservica.com/how-it-works)

Barrier - IT that sees digital preservation as a non-essential function

We can appeal to people’s interest in their own legacy; potential loss of reputation

Topic 6: Workflows

Possible Topics:

How do you document workflows?
Workflows > archives > digitized > IT storage (working with University)
Intake and processing large collections with lots o’digital
Workflows > embedded metadata in digitized objects > repurpose and preservation
Pre-ingest or SIP formation activities
Acquisition workflows in institutional archives - coordinating the process when material often arrives on an ad hoc basis
Digital preservation workflows
Documentation for manual digital preservation processes
Curation workflows (not tools)

Ad hoc digital material donations -- small, discrete, often one-item transfers or donations -- how to create useful workflows for these types of accessions

Concern that not all of the metadata you might want to have is captured for these ad hoc digital accessions

TRAC is very helpful for describing what makes a repository trustworthy in the long term
- Useful document to show to IT folks to explain that a trusted repository is different from other information systems, whether or not you actually get certified (which is a lengthy and pricey process)
- LYRASIS has a "staying on TRAC" service to help prepare for certification
- Very useful for informal self-audits
- NDSA is a good lighter-weight alternative (or complement) to TRAC

[Paraphrased] Nicholas said that, for him, what is missing between ad hoc accessions and something you could consider a semi-standard workflow is whatever would bring the process up to something like best practices: the PREMIS metadata for information events, doing the checksums as the data go from transfer to transfer, etc. The ideal is to have the transfer standardized (for example, based on a retention schedule that is actually followed), but there is a realization that in many institutions you'll never be able to avoid one-off transfers.
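
The checksum step described above can be sketched as follows: record a digest for every file before a transfer, recompute at the destination, and keep a small event record of the outcome. This is a minimal sketch, not a PREMIS implementation; the function names and the keys of the event dictionary are hypothetical, only loosely modeled on the idea of a fixity-check event.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_transfer(source_manifest, destination_root, algorithm="sha256"):
    """Compare a pre-transfer manifest with the files at the destination."""
    found = build_manifest(destination_root, algorithm)
    missing = sorted(set(source_manifest) - set(found))
    unexpected = sorted(set(found) - set(source_manifest))
    changed = sorted(name for name in source_manifest
                     if name in found and found[name] != source_manifest[name])
    # Hypothetical event record; the keys are illustrative, not PREMIS terms.
    return {
        "event_type": "fixity check",
        "event_datetime": datetime.now(timezone.utc).isoformat(),
        "algorithm": algorithm,
        "outcome": "pass" if not (missing or changed or unexpected) else "fail",
        "missing": missing,
        "changed": changed,
        "unexpected": unexpected,
    }
```

For a one-off donation, the same two calls still apply: `build_manifest` on the donor's media at intake, then `verify_transfer` after each copy, with the returned records saved alongside the accession.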

Microservices and microtools that can be strung together

Connecting to an institutional CMS, e.g. having a system used for managing institutional Policies & Procedures automatically output to the IR

If you don't have PREMIS, save it as custom Dublin Core knowing that you want to do it in a structured way that will allow for arbitrary crosswalks

Topic 7: Web Archiving

Proposed Topics:

Web archiving: tools, policy, access
Web archiving strategy

Discussion:

Tools being used by the group:
- Great list of tools: http://netpreserve.org/web-archiving/tools-and-software
- Archive-It (Internet Archive’s paid crawling and hosting service, uses Heritrix and Wayback)
  - Crawls Facebook well (not sure about Twitter)
  - A pro of using a service is to avoid having to store large crawls on your own server
  - Or could store service-crawled WARC files in your own storage
- HTTrack (crawler)
- Adobe Acrobat (crawler)
- Heritrix (crawler)
- Wayback Machine (access interface)
- Working with website creators to collect the content directly from them
- Memento and SiteStory: http://mementoweb.org/
- California Digital Library’s Web Archiving Service (paid crawling and hosting service, uses Heritrix)
- Web Curator Tool
- Archivematica for WARC processing
- WAIL (Heritrix tool): http://matkelly.com/wail/
- Columbia grant: https://library.columbia.edu/bts/web_resources_collection/call_for_proposals.html
- Social Media tools > Twitter:
  - George Washington University Library open source tool for collecting Twitter content:
    - https://github.com/gwu-libraries/social-feed-manager
  - Gnip: paid service the Library of Congress is using for its Twitter archiving project
  - HootSuite
  - Google Spreadsheet method from Martin Hawksey:
    - http://mashe.hawksey.info/2013/02/twitter-archive-tagsv5/

Challenges:

Access
- Rights to crawl, rights to provide access
- Hard to anticipate use cases and prepare for the researcher
- Only a few options for public access interface at this point (Wayback interface, etc.) → need for more access options
- Assume permission for crawling the public web
Social media
Crawling tricky web content (Drupal sites, javascript, databases, etc.)
Selection criteria:
- Necessity of refining the scope of your crawl (entire university domain, also social media, etc. → gets big fast)
- Consider including selection criteria, naming conventions, errors encountered, etc. in metadata for crawl or in readme file alongside crawl
Linking finding aid to web archive
Frequency of crawling (maybe less often than originally planned is actually sufficient)
Creating interoperable, linked data collections (compatible with aggregation efforts like HathiTrust)
Deduplicating content that hasn’t changed between crawls
Size of crawl content
Weeding content of crawl (how deep do you let the crawl go)
Versioning
How to integrate web archiving collections into larger digital archiving/processing workflow, systems, and access points
Links to external documents (do you capture those documents too?)
How do you define “the site” for the purposes of capture
If Library of Congress is preserving most Twitter content, why should we bother capturing it?
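
Deduplication between crawls, listed among the challenges above, usually comes down to comparing content digests: if a URL's payload hashes to the same value as in the previous crawl, the new copy need not be stored again. A minimal sketch of that comparison follows. It works on plain Python dictionaries of URL to payload bytes rather than on real WARC files, and the function names are hypothetical; production crawlers handle this in their own ways, so this shows only the underlying idea.

```python
import hashlib


def payload_digests(crawl):
    """Return {url: SHA-256 hex digest} for a crawl given as {url: bytes}."""
    return {url: hashlib.sha256(body).hexdigest()
            for url, body in crawl.items()}


def split_new_and_unchanged(previous_crawl, current_crawl):
    """Separate URLs whose content is new or changed from those unchanged."""
    old = payload_digests(previous_crawl)
    new = payload_digests(current_crawl)
    to_store = sorted(url for url, digest in new.items()
                      if old.get(url) != digest)
    unchanged = sorted(url for url, digest in new.items()
                       if old.get(url) == digest)
    return to_store, unchanged
```

Note that pages with embedded timestamps or session tokens will hash differently on every crawl even when the visible content is the same, so digest comparison understates how much is really unchanged.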

Topic 8: Dealing with Difficult Stuff

Preserving blank space: gaps in tape, blank pages in digitization

Digitization of text: do you scan both sides regardless of whether there is text on front and back?
- Example: Biodiversity Heritage Library fieldbooks (http://www.biodiversitylibrary.org/item/123656)
- Example: correspondence with mix of single-page, multi-page letters, sometimes with filing notes written on the back
Large collections of papers could have millions of pages, lots of blanks
Image deduplication tools - would that work for eliminating blank pages for access?
- Auslogics duplicate file finder: http://www.auslogics.com/en/software/duplicate-file-finder/
- Similar Images: http://similarimages.en.softonic.com/
- Visipics: http://www.visipics.info/index.php?title=Main_Page
Can handle potential issues with blanks at the access stage
Partly artificial situation created by difference between digital and paper (i.e. having two sides is simply a property of paper)
Is there value in keeping blank space on A/V materials that justifies extra storage costs?
- Additional example: imaging entire hard drive whether or not data written throughout
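
Handling blanks at the access stage, as suggested above, could mean flagging likely-blank page images and hiding them from the default view while keeping every scan in the preservation copy. The sketch below is one assumed approach, not a tool mentioned in the session: it treats a page as a grid of 8-bit grayscale values and calls it blank when very few pixels are dark. The function name and both thresholds are hypothetical and would need tuning on real scans, since show-through, stains, and faint filing notes all complicate the picture.

```python
def is_probably_blank(pixels, dark_below=128, max_dark_fraction=0.002):
    """Guess whether a page is blank.

    pixels: rows of 8-bit grayscale values (0 = black, 255 = white).
    A page counts as blank when the share of pixels darker than
    `dark_below` does not exceed `max_dark_fraction`.
    """
    total = sum(len(row) for row in pixels)
    if total == 0:
        raise ValueError("page has no pixels")
    dark = sum(1 for row in pixels for value in row if value < dark_below)
    return dark / total <= max_dark_fraction


# Synthetic 100 x 100 pages: one clean, one with a short pencilled note.
blank_page = [[250] * 100 for _ in range(100)]
noted_page = [row[:] for row in blank_page]
for x in range(20, 60):
    noted_page[50][x] = 30  # a 40-pixel dark stroke

print(is_probably_blank(blank_page))  # True
print(is_probably_blank(noted_page))  # False
```

Because the decision only affects display, a wrong guess can be corrected later without rescanning.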

Metadata for born-digital oral history “Oral History Care” - what is essential?

Doesn’t have to be officially sanctioned as a standard
Guidelines (possibly a form/checklist) for what to capture about each oral history
Who was in the room during the conversation? (not always recorded)
What changes may have been made after initial recording?
Oral history brings together wide range of communities, disciplines, contexts

Preserving large scale A/V collections: standards for normalization, management, access to AV

Example: planetarium shows made to be broadcast in 360 degree theater experience (museum archives)
Is it even possible to capture the experience of a planetarium?
Will it be possible to reuse the materials 10 years in the future?

Session 3: 2:35-3:15pm

Topic 9: Digital Forensics

When to image? When not to image? -- determined based on resources/resource allocation, content

CD/DVD project not done through FRED, because formats were still accessible.
Federally funded research can’t image first because of possibly classified information.
Changing policies on imaging would mean changing donor agreement references to hard drives.
BitCurator can be used on directories. It now has a GUI, which makes it more difficult to customize workflows.

Logical disk images vs. forensic disk images:
Logical when media is not on hand, for example if someone comes in with a thumb drive with files to transfer. For collections, forensic disk images are made for preservation rather than access purposes.
Yale and Archivematica digital forensics image ingest workflow: https://www.archivematica.org/wiki/Digital_forensics_image_ingest

Manuscript collections are problematic, because it is easy to do minimal description on a box.

Not doing selection, running batches without assessing.

Logical disk image formats are an issue - FTK Imager produces proprietary file types.
Forensic disk imaging provides the least difficult process for rights-blocked media.
Also using KryoFlux, which does not work with FTK Imager.

How long do you keep logical image, after processing?

Set policy to wait three years to revisit the question.
This question is still up in the air, along with how long to keep original media.

Suggestions for donors who want you to come to their site?

rsync is a useful tool.
With that much data (500 GB), there is no benefit to creating a logical disk image.
A logical disk image can be created over a network, but network performance will be an issue.
Kari's blog links to a field kit (libraries.MIT.edu/digital-archives/)

How important are disk images vs. just the (last version of the) e-content?
Workflow of BitCurator assumes images first, but sometimes you need to know if you have the right to copy
Communicate risk with donors when doing forensic images

FRED

FRED vs FTK Imager
Not a lot of case studies with FRED on prioritizing collections etc
FTK Imager is free

Other tools:

Karen’s Directory Printer - http://www.karenware.com/powertools/ptdirprn.asp
tree for Linux
Then link to AT (Archivists' Toolkit) for accession information.
Curator’s Workbench lets you document accession process more fully.
Duke Data Accessioner - no one in the group is using it.
AVPreserve Fixity - seen demoed.
Documentation and help on tools is lacking, BitCurator for example. Documentation is not geared towards archivists.
Yale- prototyping a new tool to accession metadata of images into a database, on GitHub.
OCLC has a project for outsourcing disk imaging.
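
The directory-listing tools above (Karen's Directory Printer, tree) produce an inventory of what arrived before anything is processed. Where neither is available, a comparable listing can be sketched with Python's standard library. The function name and the CSV column names are hypothetical choices for illustration, not the output format of either tool.

```python
import csv
import os
from datetime import datetime, timezone


def write_inventory(root, csv_path):
    """Write one CSV row per file under root: relative path, size, mtime."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["relative_path", "size_bytes", "modified_utc"])
        for folder, subfolders, files in os.walk(root):
            subfolders.sort()  # make the walk order predictable
            for name in sorted(files):
                full_path = os.path.join(folder, name)
                info = os.stat(full_path)
                modified = datetime.fromtimestamp(info.st_mtime,
                                                  tz=timezone.utc)
                writer.writerow([
                    os.path.relpath(full_path, root),
                    info.st_size,
                    modified.isoformat(),
                ])
                count += 1
    return count
```

The CSV should be written somewhere outside the folder being inventoried, otherwise the listing will include itself. Working from a write-protected copy of the media remains advisable.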

Are photographs of the items being imaged necessary?

Time consuming and not a lot of additional information. Generally transcribing labels is more useful, and less disruptive for workflow.

Topic 10: Description

Describing electronic records in finding aids
Describing paper/digital hybrid collections

Dealing with hybrid - separate series? separate collection?
Tools - what will ArchivesSpace bring?
Describe at the aggregate level vs. item level
- Using scale as the deciding factor
Tension between finding aid and digital repository
Hard to create best practices and standard workflows when processing and description vary so much collection by collection
Penn State is developing ArchiveSphere, a possible solution to integrate ArchivesSpace-generated finding aids with digital repository (Fedora/Hydra):
- http://stewardship.psu.edu/2013/07/introducing-archivesphere.html
- http://stewardship.psu.edu/2013/07/archivesphere-faqs.html
- Daines, J. Gordon III and Cory L. Nimer. “Re-Imagining Archival Display: Creating User-Friendly Finding Aids.” Journal of Archival Organization 9:1 (2011): 4-31. http://www.tandfonline.com/doi/abs/10.1080/15332748.2011.574019#.UgqUgm3EjIU
Is automatically extracted metadata from file headers sufficient to use as descriptive metadata?
- Example of a finding aid that may meet Google-like end user expectation for search and access, while still being a finding aid: http://findingaids.princeton.edu/

Topic 11: Donor Agreements, Confidentiality

Possible Topics:

Donor agreements for electronic records
Developing best practice guides for private paper donations
Workflows for visiting donors to acquire content from their home organizational computers
Documenting authenticity of small/ad hoc accretions to ongoing collections (i.e. you grab a PDF from the web or a departmental secretary emails you a misc. Word doc)
Managing privacy and confidentiality in electronic collections
Digital curation and private collections

Deeds of Gift -- what has changed with e-records?

Should they be more specific than many of our existing DoGs?
What parts of this should be in the Collection Policy vs. in the Deed of Gift?
Some do specifically say that material may be digitized and put online
Effect of the Digital Millennium Copyright Act -- some DoGs do not specify that the copyright is transferred
Some add an electronic-records paragraph to the end of the agreement

For some records (like corporate) lots of issues with donors around confidentiality, litigation, intellectual property, etc.

Agreements often focus on details of access, timing for opening, funding for processing, etc.
In the agreement and discussions, some places say that they will accept material (in trickier formats) but not make any promises that it can be managed perfectly or
Other places specify which formats are and are not acceptable
Suggestion that the material in a donation is easier to grasp for physical material than for electronic -- easier to know what "surprises" are in the new accession
All about managing expectations of donors, whether in an institutional records setting or with donated collections -- A lot of the details about use, formats, and other things come down to the discussions, education, and understanding you've established with the donors
How much can you assure donors that user access will not expose any private information?
- Can you commit to cleansing the data of all PII?

Some folks are encouraged to use Creative Commons licensing

It would be nice to include in an agreement something like "For the material for which the donor holds the copyright, the donor [retains the copyright, turns over the copyright, applies a CC license]"

What are the institution's responsibility and ability to pay for reformatting?

This has to be laid out prior to the completed donation, otherwise the burden falls on the institution...you need to be ready to ask the donor for the money to do the work
It is a disservice to both the donor and the institution to take on a collection that you can't manage
- The cost of a terabyte of material isn't the $100 for the disk; it's the ongoing cost

What is "good enough" when you're unable to follow strict best practices?

Unfortunately, you don't know that something was not good enough until you find out that it wasn't

Discussion about concern that hybrid collections include a lot of duplication between paper and electronic

Suggestion that this might be more of a concern with stuff from the 1990s and earlier, when people still printed out lots of stuff

Any Deeds of Gift that include language on capturing online content (including social media)?

Topic 12: Storage

Is digital preservation in the cloud safe?

Will these cloud servers be the target of “mischief” in the future?
What will the future of these services be?
Is there a psychological perception of risk? “It’s in the cloud” so it’s out of my control?
Is there a psychological perception of benefit? “It’s in the cloud” so someone is totally taking care of it. “No problem.”
Has there been a cultural shift? New professionals having grown up with social acceptance of sharing media - people have already transitioned, and this idea will only grow.
What about “ownership” issues (photos on Facebook)? Cloud storage is different but still worth thinking about.

Storage: (backup) (checksums) (reconciling)

creative solutions for offsite storage
Case study: Significant born-digital documents and 140,000 digital images, etc. It recently came up that very few items are duplicated off-site, so they are now looking for creative solutions for storage and backup.
IT is managing the data - backups, storage, etc.
Better practice to have multiple locations for storage. Checksums: when, how frequently, and what kind of algorithm? (MD5 requires less computational effort but has weaker collision resistance; the SHA series is cryptographically stronger, with fewer collisions, but requires more computational power.) Is there an audit trail?
Cloud storage is becoming more and more viable - storage prices are coming down (Amazon Glacier: Amazon won’t discuss the infrastructure, and it is significantly cheaper but less accessible; Google; etc.)
A presentation on Friday will present a paper on options for storage
Concerns of the cloud: security, location of servers, speed of upload, speed of download.
Benefits: fills geographic separation gap, 6 copies “under the covers” (Amazon), 11 9s of reliability (99.999999999% reliable).
Make sure there is a clear exit strategy from whatever provider.
Usage patterns: if you go with a cloud provider - do you want public access? If so, look carefully for access fees. Do you want to access it frequently? There may be additional fees.
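
To put the "11 9s" figure in perspective, here is a rough worked example. It assumes the figure means an annual durability of 99.999999999% per stored object and that losses are independent, which is a simplification and not a statement of any provider's guarantee. The 140,000-object count is borrowed from the case study above purely as an illustration, and the function names are hypothetical.

```python
def annual_loss_probability(nines):
    """Per-object chance of loss in a year for a durability of N nines.

    Eleven nines means 99.999999999% durability, i.e. a 1e-11 chance of loss.
    """
    return 10.0 ** -nines


def expected_annual_losses(object_count, nines):
    """Expected number of objects lost per year, assuming independent loss."""
    return object_count * annual_loss_probability(nines)


losses = expected_annual_losses(140_000, 11)
print(f"expected objects lost per year: {losses:.1e}")
print(f"about one loss every {1 / losses:,.0f} years")
```

Under these assumptions the expected loss is on the order of one object in several hundred thousand years, which suggests that provider failure, billing lapses, and human error deserve more attention than raw storage durability; hence the exit-strategy point above.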

How far is far enough?

What to do about geographic distance? Some people say “100 miles” but that’s arbitrary.
How to determine the local minimum?
Some schools do “swapping” with another school in a nearby state.

LOCKSS