How do you catch a cloud and pin it down
Notes from the session "How do you catch a cloud and pin it down"
Liz started things off by explaining what is behind the title. The problem: we are asked to report to managers and others how much stuff we've done, i.e. what the product of our work is. We don't have a great unit; it depends on who is asking the question.
If we're talking about a ten-volume book, it is 1 thing to the cataloger and 10 things to the shelvers; to the scanners it's 1000 things; to those making derivatives, four thousand things. All answers are correct. Records and files we can count really well, but the middle numbers are hard. We can report things in TB, number of files, number of descriptive records; we've done all that at LC. In CTS, a tool in use at LC, content custodians can put in a number that is meaningful to them ("500 letters"), but there's a lot out there that doesn't have those numbers.
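The ten-volume example can be written out as a small sketch. The class and field names are made up for illustration, and the 100 pages per volume and 4 derivatives per page are simply the ratios implied by the numbers above (1000 / 10 and 4000 / 1000), not figures anyone stated in the session.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """One catalogued title, e.g. a multi-volume book (hypothetical model)."""
    volumes: int
    pages_per_volume: int      # assumed uniform across volumes for this sketch
    derivatives_per_page: int  # assumed: number of derivative files per scan

    def counts(self):
        """Return the size of the same work as seen by each audience."""
        pages = self.volumes * self.pages_per_volume
        return {
            "cataloger": 1,                 # one title, one record
            "shelver": self.volumes,        # physical volumes
            "scanner": pages,               # page images
            "derivative maker": pages * self.derivatives_per_page,
        }


book = Work(volumes=10, pages_per_volume=100, derivatives_per_page=4)
for audience, n in book.counts().items():
    print(f"{audience}: {n}")
```

Every number it prints is a correct answer to "how many things?"; the only thing that changes is the unit.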
How do people at other institutions deal with this question? Are you getting these questions? What sorts of units do people use to report statistics? What questions have you been asked, and what tools have you used?
At SCOLA, we have 12 different services at the digital archive. We have to search each one individually, and across the website. Now we are asked to count the number of assets we have. If you have four copies of one asset, is that four things?
Cataloger in the room - just ONE :)
Kate: at GPO, we called one package an intellectual entity: METS wrapper, derivatives (but backups didn't count). We counted packages. What is that equivalent to? (Nothing - it's a unit we made up!) We were reporting numbers to Congress and library partners.
Liz: phone calls from the press - they always want to know how much stuff.
Leslie: ongoing series on our blog re: ways people describe the size of the Library of Congress. What is the size? How much have we inventoried in CTS? How many PB (easiest)?
Web archives at LC: TB are easier to talk about than sites or documents.
Mark - Is the site the intellectual entity, or the # of files? The number of sites may not be impressive, but they could be large sites with tons of files.
Discussion: being asked how many items are missing or lost. What is an acceptable amount of loss? What about in the print world?
Mark at Yale: when talking about digital library things, we describe them in number of collections and items. With archival collections, it's squishier: we don't have an adequate level of description. Listing a number of files is a worse metric than the number of items.
It makes sense to count intellectual entities when providing access to researchers, but when talking about budget for infrastructure and technology, the number of files does matter. Both calculations are valuable. A lot of $ goes into maintaining these files.
Also in play: decisions about doing something in-house vs. outsourcing, and other political issues such as saying one collection is more valuable than others.
The physical world had this problem too. Est. size of the collection of gov docs (at GPO) - measured the shelves in linear feet, counted samples. But then stacks of pamphlets threw off the calculations.
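A rough sketch of that kind of estimate, and of why pamphlets cause trouble. The function name is hypothetical and every number is invented for illustration; none of it comes from GPO.

```python
from statistics import mean


def estimate_items(total_linear_feet, sampled_items_per_foot):
    """Extrapolate a collection size from item counts on sampled feet of shelf."""
    return round(total_linear_feet * mean(sampled_items_per_foot))


# Invented sample data: items counted on each sampled foot of shelving.
book_samples = [10, 12, 11]            # bound volumes only
with_pamphlets = book_samples + [90]   # one sampled foot was a stack of pamphlets

books_only_estimate = estimate_items(1000, book_samples)
skewed_estimate = estimate_items(1000, with_pamphlets)
print(books_only_estimate, skewed_estimate)
```

A single pamphlet-heavy sample nearly triples the extrapolated total here, which is the sort of distortion described above.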
Three copyright card records for one publication. Might cost more to go through and dedup or figure something out.
Also a concern: what is publicly accessible? Define what you mean by publicly accessible. Offsite? Onsite?
And "how much stuff is text" - define what you mean by text. Is a scan of a book page (without OCR) text?
Issue of size vs. volume. Both are problems.
SCOLA - cult of personality. The idea of intellectual units is appealing. But we end up counting everything as one thing, because that's the biggest.
That's not a small thing - many want to report out the biggest number.
Liz: how many things are searchable from our website is a useful number. Some work has been done in search to dedupe results.
In describing digital stuff, are we at least all in agreement on talking about the size of data? Will items and collections graduate away? Probably not. In the context of archives, we want to ID intellectual objects. Sometimes files are the intellectual object. Need to have some comfort in how you are describing/counting.
Are we trying to sound impressive? Or provide useful information?
Where do finding aids fit in? You've tried to break it down already. Can describe collection, containers, folders. "The joy of the EAD" (chuckles) is that it can be at any granularity.
No simple answer.
Fantasy: tools that could do some of the counting of the intellectual pieces where file names might give clues ("d" in the file name in LC digitized collections is used in display, but there is no way to quantify those). Is there a way to define rules to automate things? Who defines the rules, and how are they defined? Is it worth it? What is the next use case to plan for? (Programmers in the room thought it possible, but the use cases need thinking through.)
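One way such a rule-based counter might look, as a minimal sketch only. The naming pattern ("<item id>_<page number>.<extension>") is entirely hypothetical and is not LC's convention; a real tool would need rules agreed on with the content custodians.

```python
import re
from collections import Counter

# Hypothetical rule: "<item id>_<page number>.<ext>", e.g. "letter0042_0003.tif".
ITEM_RULE = re.compile(r"^(?P<item>[A-Za-z]+\d+)_\d+\.\w+$")


def count_items(filenames):
    """Group files into intellectual items by name; also return unmatched names."""
    files_per_item = Counter()
    unmatched = []
    for name in filenames:
        match = ITEM_RULE.match(name)
        if match:
            files_per_item[match.group("item")] += 1
        else:
            unmatched.append(name)  # no rule applies; needs a human decision
    return files_per_item, unmatched


files = [
    "letter0001_0001.tif",
    "letter0001_0002.tif",
    "letter0002_0001.tif",
    "README.txt",
]
files_per_item, unmatched = count_items(files)
print(len(files), "files,", len(files_per_item), "items,", len(unmatched), "unmatched")
```

Keeping the unmatched list explicit matters: it is the part of the collection where the rules ran out and somebody still has to decide what counts as one thing.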
Last-minute announcement from Kate: Beware of rabid beavers in Lake Anna, Virginia.