= CURATEcamp iPRES 2012 =<br />
<hr />
'''Preservation Synthesis'''<br />
<br />
There will be a one-day pre-conference CURATEcamp at this year's [https://ipres.ischool.utoronto.ca/ iPRES 2012] conference in Toronto. Registration is open now, and space is limited.<br />
<br />
The Camp will be facilitated by Mark Jordan, Head of Library Systems at the W.A.C. Bennett Library at Simon Fraser University, and Courtney C. Mumma, systems analyst and product manager for the Archivematica project with Artefactual Systems, Inc.<br />
<br />
Lunch will be included so we don't lose any momentum. You will have an opportunity to let us know of any dietary limitations on the registration form.<br />
<br />
For this intense, one-day event, we would like to focus on Preservation Synthesis. We know there is no single, standalone solution for a complete digital curation system. Solving the digital preservation puzzle means integrating systems and services for a total curation solution.<br />
<br />
At this CURATEcamp, let's consider answers to the question "How many systems and services does it take to screw in the curation lightbulb?" Deploying curation strategies and tools of a variety of sizes, from vertical preservation applications to the smallest microservices, poses both challenges and opportunities. Join us for a day-long CURATEcamp where participants will share their experiences and plans for achieving fully integrated digital curation.<br />
<br />
*WHEN: Tuesday, Oct 2, 2012 (9am - 5pm)<br />
*WHERE: University of Toronto Chestnut Conference Centre (Armoury Room)<br />
*COST: $40 <br />
*REGISTRATION: [http://curatecamp2012-toronto.eventbrite.com/ CURATEcamp iPRES 2012]<br />
Space is limited to 40 registrants, so reserve your spot while you can.<br />
<br />
----<br />
<br />
[[CURATEcamp iPRES 2012 Discussion Ideas]]<br />
<br />
[[Letter to Campers - How to Prepare for Preservation Synthesis Camp]]<br />
<br />
[http://www.ariadne.ac.uk/issue70/ipres-curatecamp-2012-rpt Event report in Ariadne]
<hr />
[[Main Page]] > CURATEcamp iPRES 2012 > CURATEcamp and Open Planets Foundation 24 hour file id hackathon Nov 16 2012<br />
<br />
=Summary=<br />
<br />
At the end of the day, we got A LOT done!<br />
<br />
Thanks to participants! @mopennock @WilliamKilbride @GaryM03062 @anjacks0n @carusb @benfinoradin @peshkira @petemay @Britpunk80 @HeatherBowden @pjvangarderen @jordanheit and everyone else (please add who is missing)<br />
<br />
* While testing FITS, GaryM03062 discovered a bug in JHOVE: https://sourceforge.net/tracker/?func=detail&aid=3587890&group_id=221311&atid=1052190 <br />
* File corpus! https://github.com/openplanets/format-corpus/commit/b0971e1c32b2df7a9bceafe1f00d81f49cb45990 <br />
* (@benfinoradin) Archived all #fileidhack tweets today [pic.twitter.com/7QI1DfmD]<br />
* (@mopennock, @anjacks0n, @petemay et al) British Library team worked on eBook format identification <br />
* (@peshkira) OpenFITS compiles<br />
* (@petemay) Tika signatures for PDB, Kindle AZW and LRF files created, re-testing over sample file set #fileidhack #eBook<br />
* (@mopennock) Added 7 new eBook signatures to Tika this morning <br />
* Everyone was encouraged to ping PRONOM; @Britpunk80 shared the link for creating/testing/submitting your own signatures: [http://test.linkeddatapronom.nationalarchives.gov.uk/sigdev/index.htm]<br />
* @Britpunk80 handed some DROID signature files to @anjacks0n for rocketbook, epub, and ibooks.<br />
* @HeatherBowden shared some Quark and InDesign files. <br />
* @GaryM03062: New commit of OpenFITS allows setting max no. of threads in fits.xml [https://github.com/gmcgath/openfits]<br />
* @benfinoradin shared resource on RIFF/RIFX [http://www.johnloomis.org/cpe102/asgn/asgn1/riff.html]<br />
* The QuickTime motherlode! @mistydemeo contributed QuickTime videos: [https://github.com/openplanets/format-corpus/tree/master/video/Quicktime]<br />
* OpenFITS : [[FITS#Improving_JHOVE_performance_within_FITS]]<br />
* @mistydemeo: Created @machomebrew formula for fidget to make file ID signatures for #fileidhack [https://github.com/mistydemeo/homebrew-formulae]<br />
* #openarchives chat /nick artefactualmtgroom #fileidhack pic.twitter.com/1Ffp1v6Y<br />
* @anjacks0n: new [https://github.com/anjackson/percipio/downloads Percipio] and [https://github.com/openplanets/format-corpus/downloads Fidget] downloads available for dev and feedback <br />
* @pjvangarderen @archivematica: Artefactual picks up #fileidhack baton. OpenFits debian package for testing [https://launchpad.net/~archivematica/+archive/externals-dev/+build/3989642]<br />
* @GaryM03062 uploaded source changes to JHOVE. [https://sourceforge.net/projects/jhove/]<br />
* @berwin22 @jordanheit @mcantelon @pjvangarderen +ARTi +epmcellan testing OpenFITS [[FITS#Results_of_optimization_tests|FITS]]<br />
<br />
=Background=<br />
One break-out session at CURATEcamp iPRES 2012 was affectionately branded the "file id confessional", where we commiserated over the state of our file id tools and processes. We also talked about:<br />
<br />
*We can do a better job of specifying and documenting our file id requirements / use cases<br />
*We're all hooked on that FITS.xml, but [[FITS]] needs performance optimization ASAP (also, is Harvard up for extra dev?)<br />
*Apache Tika is a very actively supported and useful tool for file id and content extraction. How many of our file id requirements can it in fact cover?<br />
* Archivematica [https://www.archivematica.org/wiki/Format_policy_registry_requirements Format Policy Registry] use case (see also [http://actionplan.fcla.edu/ DAITSS action plans])<br />
* Jason Scott's "[http://ascii.textfiles.com/archives/3645 Let's Just Solve the Problem]" campaign to boldly catalog as much file format info as possible in the month of November.<br />
* also, CURATEcamp iPres participant Paul Wheatley has since posted [http://www.openplanetsfoundation.org/blogs/2012-10-19-practitioners-have-spoken-we-need-better-characterisation We Need Better Characterization] as well as a link to an [http://willsworld.blogs.edina.ac.uk/2012/10/18/online-hack-event/ Online Hack Event]. This led to a Twitter discussion between @pjvangarderen @anjacks0n @prwheatley about this 24 hr hackathon event.<br />
<br />
==What==<br />
<br />
A 24-hour+ live hackathon event where multi-time-zone teams work on common technical projects related to the CURATEcamp iPRES 2012 file id discussions. <br />
<br />
Project proposals can be made by anyone.<br />
<br />
We will start the day with New Zealand (GMT +12:00) and end with the North American West Coast wrapping up project(s), hopefully with one or two solid deliverables by 12 midnight-ish PST (GMT -8:00).<br />
<br />
==Why==<br />
* Because we'll probably get some useful stuff done<br />
* Because it's fun to work with CURATEcamp people in a CURATEcamp way<br />
* Because doing a 24hr+ worldwide hack with real time collaboration tools is cool<br />
<br />
=Logistics=<br />
<br />
==When: '''Fri Nov 16'''==<br />
<br />
* Friday, November 16, 2012<br />
** [http://wiki.opf-labs.org/display/KB/2012-11-13+OPF+Hackathon+-+Emulation%2C+learn+from+the+experts OPF Emulation Hackathon] is Nov 13-15, Freiburg, Germany. Sorry, Nov 16th was chosen somewhat haphazardly; we didn't mean to compete with the OPF Hackathon event. But emulation needs file characterization too? Maybe the OPF Emulation Hackathon can hand off some "File Id for Emulation" use cases to the Nov 16 24 hr Hackathon...or better yet, extend the Freiburg event to include participation in the Nov 16 24 hr worldwide #fileidhack event. Great way to cap off their Hackathon week! --[[User:PeterVG|PeterVG]] 11:48, 23 Oct 2012 (PDT)<br />
* <strike>Friday, November 23, 2012</strike><br />
** RT @declan: @pjvangarderen neat idea! You know that date is the day after US Thanksgiving, right? people might be on vacation<br />
<br />
==How==<br />
* Twitter: [https://twitter.com/search/realtime?q=%23fileidhack #fileidhack] (made it shorter)<br />
* CURATEcamp Mediawiki: [[Special:UserLogin|Log-in]] and please help update this page<br />
<br />
Let's put together a schedule, tasklist, & volunteers to road-test these tools for Nov 16:<br />
* Google Hangout: [[Google Hangout for CURATEcamp|fire up a webcam]], make it public and share the link<br />
* GoogleDocs: we can live edit any docs we feel the urge to produce<br />
**[[Collecting_format_ID_test_files|Format ID Test Files Project]]'s [[Collecting_format_ID_test_files#Via_Google_Drive|Google Drive]]<br />
* IRC: The chat room is on the irc.OFTC.net server, and the room name is #openarchives ([irc://#openarchives@irc.OFTC.net irc://#openarchives@irc.OFTC.net])<br />
** Chat room help and browser chat option: [https://www.archivematica.org/wiki/Chat_room https://www.archivematica.org/wiki/Chat_room]<br />
* GitHub: get those pull requests going<br />
** [[Collecting_format_ID_test_files|Format ID Test Files Project]]'s [[Collecting_format_ID_test_files#Via_Google_Drive|Git repo]]<br />
<br />
<br />
==Who ([[Special:UserLogin|Sign up]])==<br />
* '''GMT +12:00''' Digital Preservation Practical Implementers Guild (@DP_PIG)<br />
* ?<br />
* '''GMT +7:00''' [[User:Euan_Cochrane|Euan Cochrane]] (@euanc)<br />
* ?<br />
* '''GMT +2:00''' [[User:Maurice_de_Rooij|TechMaurice]] (NANETH)<br />
* '''GMT +1:00''' [[User:Nicholas_Clarke|Nicholas Clarke]] (@nclarkedk) - netarkivet.dk<br />
* '''GMT +0:00''' [[User:Andy_Jackson|Andy Jackson]] (@anjacks0n), Paul Wheatley (@prwheatley), BL digital preservation team - Maureen (@mopennock), PeterM, Lynn, William, and maybe more...; [[User:David Underdown|David Underdown]] (@davidunderdown9) and maybe some more TNA folk<br />
* ?<br />
* '''GMT -5:00''' Kara Van Malssen (@kvanmalssen), Dave Rice (@dericed), Ben Fino-Radin (@benfinoradin), Gary McGath (@Garym03062), @anarchivist<br />
* '''GMT -5:00''' @lljohnston @blefurgy et al!<br />
* '''GMT -5:00''' [[User:Greg Jansen|Greg Jansen]] @gregj, [[User:Ben Pennell|Ben Pennell]] @pennellben<br />
* '''GMT -5:00''' [[User:Heather Bowden|Heather Bowden]] @heatherbowden - will help when/where I can. Happy to help US East Coasters and Artefactual Team, or whomever. Contact me if you need an extra hand.<br />
* ?<br />
* '''GMT -8:00''' [http://artefactual.com/team Artefactual]: peter (@pjvangarderen), courtney (@snarkivist), evelyn, joseph, mikeC (@mcantelon), mikeG, austin, dan...plus any VanCity people wanting to participate from [http://artefactual.com/contact.html Artefactual office].<br />
<br />
=Project Proposals=<br />
* Document file id requirements / use cases<br />
* ArchiveTeam "Just Solve the Problem" wiki scraping -> structured data (CSV?, XML?, RDF?); as an ongoing service?<br />
* [[Improving format ID coverage]]<br />
** Maybe incorporate [http://www.ace.net.nz/tech/TechFileFormat.html "Almost Every file format in the world!"]<br />
* [[Collecting format ID test files]]<br />
** [[Creating an artificial test set using emulation]]<br />
* [[Improving identification methods]]<br />
** Develop a Format ID [http://digitalcontinuity.org/post/7327791836/emulation-workbench-for-digital-object-format-analysis "Emulation Workbench"] for format analysis<br />
** Document software input and output formats so they can be used to limit the option set for files from a particular time period (if we know all the formats that could be created during the period when a file was made, we can limit identification results to only those formats), and for use in [http://digitalcontinuity.org/post/7325561455/mining-application-documentation-for-file-format format intelligence mining].<br />
* Archivematica [https://www.archivematica.org/wiki/Format_policy_registry_requirements Format Policy Registry] testing<br />
** @archivematica team & volunteers<br />
* @kvanmalssen Improved file id /characterization support for AV files in existing tools like Tika and FITS. An update of Exiftool and inclusion of MediaInfo would be a good start. Or maybe test applicability of ffprobe/avprobe for this task.<br />
** @dericed This is exactly what ffprobe/avprobe does. Whereas many of the digipres tools do identification by sampling x bytes from the head and tail, ffprobe/avprobe incorporate one of the many extensive demuxing libraries to manage identification of the contents.<br />
** @kvanmalssen - Yes, so can we get avprobe to output in a structured way? And could it be incorporated into a tool like FITS or Tika so that we can have a file id tool that supports mixed collections?<br />
** @dericed - Yes, ffprobe/avprobe have the -print_format (-of) option, so you can get json, xml, csv, or other formats (see the ffprobe sketch after this list). There's also an XSD published for the output. I suppose ffprobe could be incorporated into FITS, but I'm not sure that's an efficient idea. The premise of FITS seems to put all preservation metadata considerations on the container (file format), but in AV collections the codecs and contained bitstreams are far more significant to consider.<br />
** @kvanmalssen - Issue is we need AV support (including track/bitstream support) in these general tools so people can process mixed collections. That's what I'd like to figure out.<br />
** See also [[Improving identification methods]], which could perhaps be split into two or three pages, one of which could be merged with the above tweet discussion. [[User:Andy Jackson|Andy Jackson]] 15:20, 22 October 2012 (PDT)<br />
* FITS or Tika bugfix marathon (e.g. [https://issues.apache.org/jira/browse/TIKA-539 this one]).<br />
** Perhaps consider refactoring FITS to re-use existing dependency management tools like Maven and apt/yum/etc instead of manual dependency management? [[User:Andy Jackson|Andy Jackson]] 05:16, 23 October 2012 (PDT)<br />
*** I'm willing to put a fork of FITS on Github if a couple of people say they want it. --[[User:Gary McGath|Gary McGath]] 13:27, 11 November 2012 (PST)<br />
* [[User:Maurice_de_Rooij|TechMaurice]]: Replace container identification function of [https://github.com/openplanets/fido FIDO] using PRONOM container signature.<br />
* [[User:Misty De Meo|Misty De Meo]] Just a thought... it strikes me that the basic functionality of FITS is not super complicated. As well, in my experience, most users are using a fairly minimal set of features. Given some of the problems we're having with FITS, it may be worth doing a minimal rewrite of FITS (in, say, Python or C) with a focus on a) speed and b) maintainability. This is more than a day's work, but we could get a start if this is something other people would be interested in. Things I'd want to see would include:<br />
** Don't vendor tools - just recommend versions, but draw from whatever tools the user has installed.<br />
** Implement better AV support (with all the caveats listed above)<br />
** Possibly restrict the number of tools?<br />
** +1 to FITS refactoring --[[User:Greg Jansen|Greg Jansen]] 11:33, 15 November 2012 (PST)<br />
** Implement only the configuration options most people use, and let those be specified on the commandline instead of via XML.<br />
*** [[User:Gary_McGath|Gary McGath]] on IRC points out that the use of external tools means that, in FITS, scanned files are independently loaded from disk by multiple tools, introducing unneeded IO overhead. This could be fixed in FITS itself.<br />
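A minimal sketch of the ffprobe usage discussed in the AV proposal above: asking ffprobe for structured (JSON or XML) output for a single file. The flags are standard ffprobe options mentioned in the thread; the input and output filenames are just placeholders.<br />
 # per-container and per-stream metadata as JSON (quiet log level keeps the output clean)<br />
 ffprobe -v quiet -print_format json -show_format -show_streams sample.mov > sample-ffprobe.json<br />
 # the same data as XML, for which an XSD is published<br />
 ffprobe -v quiet -print_format xml -show_format -show_streams sample.mov > sample-ffprobe.xml<br />
The -show_streams section is what carries the per-track codec and bitstream details that general-purpose tools currently miss for AV collections.<br />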
''Should we take a poll a day in advance to select 2 or 3 projects or should we just let everyone work on whatever proposal they wish?''<br />
<br />
==Preparation TODO==<br />
* GitHub How To (see the example fork-and-pull-request workflow below)<br />
** Set up temporary FITS and/or Tika forks that we can work on?<br />
* Set up Archivematica instances to test FPR<br />
* Easier signature development tools and/or signature contribution tracking, now partially complete, as outlined in [[Improving format ID coverage]]<br />
* Example file contribution How To document, cf. [[Collecting format ID test files]]<br />
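A sketch of the fork-and-pull-request workflow referred to above, using the OpenFITS repository mentioned elsewhere on this wiki as the example target; the fork URL and branch name are placeholders for your own.<br />
 # clone your fork of OpenFITS (create the fork first in the GitHub web UI)<br />
 git clone https://github.com/YOURNAME/openfits.git<br />
 cd openfits<br />
 # do your work on a topic branch<br />
 git checkout -b fileidhack-fixes<br />
 # ...edit files, then commit and push...<br />
 git commit -am "Describe your change"<br />
 git push origin fileidhack-fixes<br />
 # finally, open a pull request against gmcgath/openfits from the GitHub web UI<br />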
<br />
=Results=<br />
'''Nov 17. 07:30 UTC -- 30 hours later'''<br />
<pre style="white-space: pre-wrap; <br />
white-space: -moz-pre-wrap;<br />
white-space: -pre-wrap;<br />
white-space: -o-pre-wrap; <br />
word-wrap: break-word"><br />
<br />
Peter Van Garderen @pjvangarderen<br />
Proud to lead 24hr real time R&D cycle. Thanks #fileidhack people for your passion RT @jordanheit: testing OpenFITS wiki.curatecamp.org/index.php/FITS<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @GaryM03062: Testing FITS led me to discover a bug in JHOVE, so #fileidhack is worth something. sourceforge.net/tracker/?func=…<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @anjacks0n: Thanks for the files, @carusb github.com/openplanets/fo… #fileidhack<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack live link? RT @benfinoradin: Archiving all #fileidhack tweets today pic.twitter.com/7QI1DfmD<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks BL! #fileidhack RT @mopennock: BL team are working on eBook format identification today for #fileidhack - @anjacks0n @petemay et al<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @peshkira: OpenFITS current status: It compiles! #fileidhack<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @petemay: Tika sigs for PDB, Kindle AZW and LRF files created, re-testing over sample file set #fileidhack #eBook<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @mopennock: We've added 7 new eBook signatures to Tika this morning #fileidhack. Great work all!<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack Everyone ping PRONOM pls! RT @Britpunk80: #fileidhack if you want to create/test/submit your own: …keddatapronom.nationalarchives.gov.uk/sigdev/index.h…<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @Britpunk80: I've handed some droid sig files to @anjacks0n on rocketbook, epub, and ibooks. #fileidhack<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @HeatherBowden: @euanc @anjacks0n I have some Quark and InDesign files. You interested? #fileidhack<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack Nov1614:00UTC RT @pjvangarderen: Wazzup! West Coast in da fileidhacking house! RT @declan: good morning #fileidhack!<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @peshkira: #fileidhack Current status: FITS mavenized. PullRequest/Wiki \w explanation follow. /cc @GaryM03062<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @GaryM03062: New commit of OpenFITS allows setting max no. of threads in fits.xml #fileidhack github.com/gmcgath/openfi…<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack Great work on OpenFITS! Lets keep this alive RT @GaryM03062: Calling a day for #fileidhack. Great working with everyone!<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @Snarkivist<br />
#fileidhack team - just catching up on your work today - was internetless - look for summary in the morning<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @pjvangarderen: Nov1601:17UTC @euanc (Perth) #fileidhack IRC - Nov1704:16UTC @archivematica crew still hacking #24hrs+<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @benfinoradin: Good resource on RIFF/RIFX: johnloomis.org/cpe102/asgn/as…<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack Holy cow, the Quicktime motherload! RT @mistydemeo: Have some Quicktime videos, #fileidhack github.com/openplanets/fo…<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @GaryM03062: Another update for OpenFITS. Please read the wiki: wiki.curatecamp.org/index.php/FITS… #fileidhack<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @mistydemeo: Created @MacHomebrew formula for fidget to make file ID signatures for #fileidhack github.com/mistydemeo/hom…<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @pjvangarderen: @mistydemeo meet #openarchives /nick artefactualmtgroom #fileidhack pic.twitter.com/1Ffp1v6Y<br />
<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @archivematica: Artefactual picks up #fileidhack baton. OpenFits debian package launchpad.net/~archivematica… test time!<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @anjacks0n: @benfinoradin tweaked your sig, now identifies all test files you sent github.com/openplanets/fo… #fileidhack<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @GaryM03062: As a side effect of #fileidhack, I've been uploading source changes to JHOVE. sourceforge.net/projects/jhove/<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack Thanks GMT! RT @mopennock: It's all go this morning for the #fileidhack! wiki.curatecamp.org/index.php/CURA…<br />
<br />
<br />
Peter Van Garderen @pjvangarderen<br />
Thanks #fileidhack RT @WilliamKilbride: It's #dpc #ff follow friday. look at #fileidhack Better still, get involved wiki.curatecamp.org/index.php/CURA…<br />
<br />
<br />
</pre>
<hr />
= FITS =<br />
<br />
This page is for notes on how to optimize FITS.<br />
<br />
==Whence OpenFITS?==<br />
I created the OpenFITS fork (and at least one fork has been made from that) for the Hackathon, but have no long term plans for it. Randy Stern at Harvard has said he would rather not have other forks around, thinking they'll create confusion or something. They'll hopefully take what's good in OpenFITS and merge it back to their Google Code repository. <br />
<br />
To keep this in perspective, the Harvard Library regards FITS primarily as an ingest tool for their internal purposes, and makes it available to others as a side issue. Convincing them to incorporate serious changes could be difficult. If someone else wants to create a permanent fork of OpenFITS, that's not my concern, but I'm limiting my own fork to changes that I think they're likely to accept at Harvard. --[[User:Gary McGath|Gary McGath]] 16:59, 16 November 2012 (PST)<br />
<br />
==Thread parallelism and memory consumption==<br />
FITS runs all its tools in parallel threads. This can result in heavy memory consumption, particularly if a big file is being processed. Harvard optimized FITS for DRS ingest, which can afford a lot of memory. In other environments this might cause thrashing. <br />
<br />
One approach might be to add a command line or config parameter to limit the number of simultaneous threads.<br />
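A quick way to observe this on your own machine (a sketch, not a FITS feature): GNU time's verbose mode reports the peak resident set size of a run. This assumes GNU time is installed at /usr/bin/time and uses the same -i/-o options as the test commands later on this page; the file path is a placeholder.<br />
 # report wall-clock time and peak memory for a run over one large file<br />
 /usr/bin/time -v ./fits.sh -i /path/to/big-video.mov -o /tmp/big-video-fits.xml<br />
 # look for "Maximum resident set size (kbytes)" in the output<br />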
<br />
==FITS on Github==<br />
<br />
A [https://github.com/gmcgath/openfits fork of FITS] is now up on Github. Let me know if you want to be added as a contributor. I've changed the repository name from fits to openfits to assuage concerns from Harvard.<br />
<br />
The JHOVE 1.8 jars are now there.<br />
<br />
==Optimization tip==<br />
<br />
The HTML module in JHOVE is very slow and not very useful unless you want to reject the 90% of HTML files that aren't strictly valid. If you don't need it, edit xml/jhove/jhove.conf and remove the module element which refers to edu.harvard.hul.ois.jhove.module.HtmlModule (a quick way to locate it is shown below).<br />
* Similarly, at UNC we had to exclude XML files from being processed by jhove, as moderately long documents could cause the jhove process to hang for long periods of time. (bbpennel)<br />
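A sketch of locating the entry to remove (the exact line number will differ between FITS versions):<br />
 # find the HtmlModule declaration in the JHOVE config shipped with FITS<br />
 grep -n "HtmlModule" xml/jhove/jhove.conf<br />
 # then delete or comment out the enclosing module element<br />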
<br />
==Improving JHOVE performance within FITS==<br />
<br />
Spencer McEwen, who wrote most or all of FITS, tells me that JHOVE is the biggest bottleneck in FITS and that each time it's called, it grabs required schemas from the Web, instead of using local copies. I'll see if that's fixable.<br />
CORRECTION: Never mind. He was talking about validating XML files that are given to JHOVE, not validation of JHOVE's own configuration files. Oh, well, onward...<br />
<br />
But this gives me another idea. JHOVE really should have a way of using local copies of arbitrary schemas. The difficulty is that we don't know what schemas any given installation will find useful. But if the config for the module can have a series of mappings from schema URIs to local files, the user can edit those parameters to locate any schemas that are used locally. Might this be worthwhile?<br />
<br />
I've just checked in experimental JHOVE jars on OpenFITS. These have a feature which can potentially improve performance with XML files. It requires some work to set up, but it could be worth it for high-volume environments. I can't see any way around this, since it involves using local copies of XML schemas, and everyone may be validating XML files that use different schemas. What you have to do is edit xml/jhove/jhove.conf, look for the '''module''' element that declares edu.harvard.hul.ois.jhove.module.XmlModule, and add parameters that tell it where local schemas are. For example:<br />
<nowiki> <module> </nowiki><br />
<br />
<nowiki> <class>edu.harvard.hul.ois.jhove.module.XmlModule</class> </nowiki><br />
<br />
<nowiki> <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/mods/v3;/Users/myself/schemas/mods-3-4.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/standards/mods/v3/mods-3-2.xsd;/Users/myself/schemas/mods-3-2.xsd</param> </nowiki><br />
<br />
<nowiki> </module> </nowiki><br />
<br />
--[[User:Gary McGath|Gary McGath]] 12:39, 16 November 2012 (PST)<br />
<br />
==Mavenizing FITS?==<br />
<br />
Peshkira has some remarks on Mavenizing FITS at [https://github.com/peshkira/openfits/wiki/FITS-&-Maven]. While this may have merit in the abstract, it leads to the question of what to do with it. Harvard LTS isn't currently using Maven for anything I can think of, so unless OPF wants to create its own long-term fork of FITS, this could be a difficult sell with Harvard. --[[User:Gary McGath|Gary McGath]] 15:05, 16 November 2012 (PST)<br />
<br />
==Results of optimization tests==<br />
These are some preliminary tests done by the Archivematica team, using OpenFITS. Machine specs: 1 GB RAM, no swap, one 2000 MHz core, 1 or 2 threads (not sure, it's late). Input is the standard (complete) Archivematica sample data.<br />
<br />
time fits.sh -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp1<br />
real 2m47.007s<br />
user 1m41.178s<br />
sys 0m29.278s<br />
<br />
time openfits -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp2<br />
<br />
with default 20 threads<br />
<br />
real 0m4.355s<br />
user 0m3.420s<br />
sys 0m0.412s<br />
<br />
Still lots of:<br />
edu.harvard.hul.ois.fits.exceptions.FitsToolException: Error parsing Exiftool XML Output (Error on line 31: The content of elements must consist of well-formed character data or markup.)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.createXml(Exiftool.java:197)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.extractInfo(Exiftool.java:118)<br />
at edu.harvard.hul.ois.fits.tools.ToolBase.run(ToolBase.java:141)<br />
at java.lang.Thread.run(Thread.java:679)<br />
<br />
Set to 4 threads<br />
real 3m22.685s<br />
user 2m2.144s<br />
sys 0m40.235s<br />
<br />
Set to 2 threads<br />
real 3m27.194s<br />
user 1m31.018s<br />
sys 0m29.726s<br />
<br />
Set memory to 1 GB<br />
-Xmx1024m<br />
<br />
real 3m51.133s<br />
user 1m37.506s<br />
sys 0m30.574s<br />
<br />
Set to 512MB<br />
real 3m39.913s<br />
user 1m42.002s<br />
sys 0m29.050s<br />
<br />
no changes<br />
real 2m28.219s<br />
user 1m17.677s<br />
sys 0m19.325s<br />
<br />
3 threads<br />
noticed max ram<br />
real 3m50.741s<br />
user 1m38.278s<br />
sys 0m34.466s<br />
<br />
Results after increasing RAM to 4 GB (3 threads, JVM allocated 512MB, as above):<br />
<br />
real 3m39.226s<br />
user 1m52.487s<br />
sys 0m41.527s<br />
<br />
Results after increasing threads in OpenFITS config from 3 to 10:<br />
<br />
real 3m6.449s<br />
user 1m58.223s<br />
sys 0m32.542s<br />
<br />
Results after increasing threads in OpenFITS config from 10 to 40:<br />
<br />
real 2m54.929s<br />
user 1m50.299s<br />
sys 0m30.662s<br />
<br />
These tests don't show an appreciable decrease in processing time from increasing the machine's RAM, the number of threads in the fits.xml config file, or the amount of RAM allocated to the JVM.
<hr />
<div>This page is for notes on how to optimize FITS.<br />
<br />
==Whence OpenFITS?==<br />
I created the OpenFITS fork (and at least one fork has been made from that) for the Hackathon, but have no long term plans for it. Randy Stern at Harvard has said he would rather not have other forks around, thinking they'll create confusion or something. They'll hopefully take what's good in OpenFITS and merge it back to their Google Code repository. <br />
<br />
To keep this in perspective, the Harvard Library regards FITS primarily as an ingest tool for their internal purpose, and makes it available to others as a side issue. Convincing them to incorporate serious changes could be difficult. If someone else wants to create a permanent fork of OpenFITS, that's not my concern, but I'm keeping my own fork to changes that I think they're likely to accept at Harvard. --[[User:Gary McGath|Gary McGath]] 16:59, 16 November 2012 (PST)<br />
<br />
==Thread parallelism and memory consumption==<br />
FITS runs all its tools in parallel threads. This can result in heavy memory consumption, particularly if a big file is being processed. Harvard optimized FITS for DRS ingest, which can afford a lot of memory. In other environments this might cause thrashing. <br />
<br />
One approach might be to add a command line or config parameter to limit the number of simultaneous threads.<br />
<br />
==FITS on Github==<br />
<br />
A [https://github.com/gmcgath/openfits fork of FITS] is now up on Github. Let me know if you want to be added as a contributor. I've changed the repository name from fits to openfits to assuage concerns from Harvard.<br />
<br />
The JHOVE 1.8 jars are now there.<br />
<br />
==Optimization tip==<br />
<br />
The HTML module in JHOVE is very slow and not very useful unless you're rejecting the 90% of the HTML files that aren't strictly valid. If you don't need it, edit xml/jhove/jhove.conf and remove the module element which refers to edu.harvard.hul.ois.jhove.module.HtmlModule .<br />
* Similarly, at UNC we had to exclude XML files from being processed by jhove, as moderately long documents could cause the jhove process to hang for long periods of time. (bbpennel)<br />
<br />
==Improving JHOVE performance within FITS==<br />
<br />
Spencer McEwen, who wrote most or all of FITS, tells me that JHOVE is the biggest bottleneck in FITS and that each time it's called, it grabs required schemas from the Web, instead of using local copies. I'll see if that's fixable.<br />
CORRECTION: Never mind. He was talking about validating XML files that are given to JHOVE, not validation of JHOVE's own configuration files. Oh, well, onward...<br />
<br />
But this gives me another idea. JHOVE really should have a way of using local copies of arbitrary schemas. The difficulty is that we don't know what schemas any given installation will find useful. But if the config for the module can have a series of mappings from schema URIs to local files, the user can edit those parameters to locate any schemas that are used locally. Might this be worthwhile?<br />
<br />
I've just checked in experimental JHOVE jars on OpenFITS. These have a feature which can potentially improve performance with XML files. It requires some work to set up, but it could be worth it for high-volume environments. I can't see any way around this, since it involves using local copies of XML schemas, and everyone may be validating XML files that use different schemas. What you have to do is edit xml/jhove/jhove.conf, look for the '''module''' element that declares edu.harvard.hul.ois.jhove.module.XmlModule, and add parameters that tell it where local schemas are. For example:<br />
<nowiki> <module> </nowiki><br />
<br />
<nowiki> <class>edu.harvard.hul.ois.jhove.module.XmlModule</class> </nowiki><br />
<br />
<nowiki> <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/mods/v3;/Users/myself/schemas/mods-3-4.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/standards/mods/v3/mods-3-2.xsd;/Users/myself/schemas/mods-3-2.xsd</param> </nowiki><br />
<br />
<nowiki> </module> </nowiki><br />
<br />
--[[User:Gary McGath|Gary McGath]] 12:39, 16 November 2012 (PST)<br />
<br />
==Mavenizing FITS?==<br />
<br />
Peshkira has some remarks on Mavenizing FITS at [[https://github.com/peshkira/openfits/wiki/FITS-&-Maven]]. While this may have merit in the abstract, it leads to the question of what to do with it. Harvard LTS isn't currently using Maven for anything I can think of, so it would be a hard sell. Unless OPF wants to create its own long-term fork of FITS, this could be a difficult sell with Harvard. --[[User:Gary McGath|Gary McGath]] 15:05, 16 November 2012 (PST)<br />
<br />
==Results of optimization tests==<br />
These are some preliminary tests done by the Archivematica team, using OpenFITS. Machine specs are 1 GB RAM, no swap, 1 2000Mhz core, 1 or 2 threads (not sure, it's late). Input is the standard (complete) Archivematica sample data.<br />
<br />
time fits.sh -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp1<br />
real 2m47.007s<br />
user 1m41.178s<br />
sys 0m29.278s<br />
<br />
time openfits -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp2<br />
<br />
with default 20 threads<br />
<br />
real 0m4.355s<br />
user 0m3.420s<br />
sys 0m0.412s<br />
<br />
Still lots of:<br />
edu.harvard.hul.ois.fits.exceptions.FitsToolException: Error parsing Exiftool XML Output (Error on line 31: The content of elements must consist of well-formed character data or markup.)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.createXml(Exiftool.java:197)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.extractInfo(Exiftool.java:118)<br />
at edu.harvard.hul.ois.fits.tools.ToolBase.run(ToolBase.java:141)<br />
at java.lang.Thread.run(Thread.java:679)<br />
<br />
Set to 4 threads<br />
real 3m22.685s<br />
user 2m2.144s<br />
sys 0m40.235s<br />
<br />
Set to 2 threads<br />
real 3m27.194s<br />
user 1m31.018s<br />
sys 0m29.726s<br />
<br />
Set memory to 1 GB<br />
-Xmx1024m<br />
<br />
real 3m51.133s<br />
user 1m37.506s<br />
sys 0m30.574s<br />
<br />
Set to 512MB<br />
real 3m39.913s<br />
user 1m42.002s<br />
sys 0m29.050s<br />
<br />
no changes<br />
real 2m28.219s<br />
user 1m17.677s<br />
sys 0m19.325s<br />
<br />
3 threads<br />
noticed max ram<br />
real 3m50.741s<br />
user 1m38.278s<br />
sys 0m34.466s<br />
<br />
Results after increasing RAM to 4 GB (3 threads, JVM allocated 512MB, as above):<br />
<br />
real 3m39.226s<br />
user 1m52.487s<br />
sys 0m41.527s<br />
<br />
Results after increasing threads in OpenFITS config from 3 to 10:<br />
<br />
real 3m6.449s<br />
user 1m58.223s<br />
sys 0m32.542s<br />
<br />
Results after increasing threads in OpenFITS config from 10 to 40:<br />
<br />
real 2m54.929s<br />
user 1m50.299s<br />
sys 0m30.662s</div>Mark Jordanhttps://wiki.curatecamp.org/index.php?title=FITS&diff=2428FITS2012-11-17T04:55:55Z<p>Mark Jordan: </p>
<hr />
<div>This page is for notes on how to optimize FITS.<br />
<br />
==Whence OpenFITS?==<br />
I created the OpenFITS fork (and at least one fork has been made from that) for the Hackathon, but have no long term plans for it. Randy Stern at Harvard has said he would rather not have other forks around, thinking they'll create confusion or something. They'll hopefully take what's good in OpenFITS and merge it back to their Google Code repository. <br />
<br />
To keep this in perspective, the Harvard Library regards FITS primarily as an ingest tool for their internal purpose, and makes it available to others as a side issue. Convincing them to incorporate serious changes could be difficult. If someone else wants to create a permanent fork of OpenFITS, that's not my concern, but I'm keeping my own fork to changes that I think they're likely to accept at Harvard. --[[User:Gary McGath|Gary McGath]] 16:59, 16 November 2012 (PST)<br />
<br />
==Thread parallelism and memory consumption==<br />
FITS runs all its tools in parallel threads. This can result in heavy memory consumption, particularly if a big file is being processed. Harvard optimized FITS for DRS ingest, which can afford a lot of memory. In other environments this might cause thrashing. <br />
<br />
One approach might be to add a command line or config parameter to limit the number of simultaneous threads.<br />
<br />
==FITS on Github==<br />
<br />
A [https://github.com/gmcgath/openfits fork of FITS] is now up on Github. Let me know if you want to be added as a contributor. I've changed the repository name from fits to openfits to assuage concerns from Harvard.<br />
<br />
The JHOVE 1.8 jars are now there.<br />
<br />
==Optimization tip==<br />
<br />
The HTML module in JHOVE is very slow and not very useful unless you're rejecting the 90% of the HTML files that aren't strictly valid. If you don't need it, edit xml/jhove/jhove.conf and remove the module element which refers to edu.harvard.hul.ois.jhove.module.HtmlModule .<br />
* Similarly, at UNC we had to exclude XML files from being processed by jhove, as moderately long documents could cause the jhove process to hang for long periods of time. (bbpennel)<br />
<br />
==Improving JHOVE performance within FITS==<br />
<br />
Spencer McEwen, who wrote most or all of FITS, tells me that JHOVE is the biggest bottleneck in FITS and that each time it's called, it grabs required schemas from the Web, instead of using local copies. I'll see if that's fixable.<br />
CORRECTION: Never mind. He was talking about validating XML files that are given to JHOVE, not validation of JHOVE's own configuration files. Oh, well, onward...<br />
<br />
But this gives me another idea. JHOVE really should have a way of using local copies of arbitrary schemas. The difficulty is that we don't know what schemas any given installation will find useful. But if the config for the module can have a series of mappings from schema URIs to local files, the user can edit those parameters to locate any schemas that are used locally. Might this be worthwhile?<br />
<br />
I've just checked in experimental JHOVE jars on OpenFITS. These have a feature which can potentially improve performance with XML files. It requires some work to set up, but it could be worth it for high-volume environments. I can't see any way around this, since it involves using local copies of XML schemas, and everyone may be validating XML files that use different schemas. What you have to do is edit xml/jhove/jhove.conf, look for the '''module''' element that declares edu.harvard.hul.ois.jhove.module.XmlModule, and add parameters that tell it where local schemas are. For example:<br />
<nowiki> <module> </nowiki><br />
<br />
<nowiki> <class>edu.harvard.hul.ois.jhove.module.XmlModule</class> </nowiki><br />
<br />
<nowiki> <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/mods/v3;/Users/myself/schemas/mods-3-4.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/standards/mods/v3/mods-3-2.xsd;/Users/myself/schemas/mods-3-2.xsd</param> </nowiki><br />
<br />
<nowiki> </module> </nowiki><br />
<br />
--[[User:Gary McGath|Gary McGath]] 12:39, 16 November 2012 (PST)<br />
<br />
==Mavenizing FITS?==<br />
<br />
Peshkira has some remarks on Mavenizing FITS at [[https://github.com/peshkira/openfits/wiki/FITS-&-Maven]]. While this may have merit in the abstract, it leads to the question of what to do with it. Harvard LTS isn't currently using Maven for anything I can think of, so it would be a hard sell. Unless OPF wants to create its own long-term fork of FITS, this could be a difficult sell with Harvard. --[[User:Gary McGath|Gary McGath]] 15:05, 16 November 2012 (PST)<br />
<br />
==Results of optimization tests==<br />
These are some preliminary tests done by the Archivematica team, using OpenFITS. Machine specs are 1 GB RAM, 1 2000Mhz core, 1 or 2 threads (not sure, it's late).<br />
<br />
time fits.sh -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp1<br />
real 2m47.007s<br />
user 1m41.178s<br />
sys 0m29.278s<br />
<br />
time openfits -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp2<br />
<br />
with default 20 threads<br />
<br />
real 0m4.355s<br />
user 0m3.420s<br />
sys 0m0.412s<br />
<br />
Still lots of:<br />
edu.harvard.hul.ois.fits.exceptions.FitsToolException: Error parsing Exiftool XML Output (Error on line 31: The content of elements must consist of well-formed character data or markup.)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.createXml(Exiftool.java:197)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.extractInfo(Exiftool.java:118)<br />
at edu.harvard.hul.ois.fits.tools.ToolBase.run(ToolBase.java:141)<br />
at java.lang.Thread.run(Thread.java:679)<br />
<br />
Set to 4 threads<br />
real 3m22.685s<br />
user 2m2.144s<br />
sys 0m40.235s<br />
<br />
Set to 2 threads<br />
real 3m27.194s<br />
user 1m31.018s<br />
sys 0m29.726s<br />
<br />
Set memory to 1 GB<br />
-Xmx1024m<br />
<br />
real 3m51.133s<br />
user 1m37.506s<br />
sys 0m30.574s<br />
<br />
Set to 512MB<br />
real 3m39.913s<br />
user 1m42.002s<br />
sys 0m29.050s<br />
<br />
no changes<br />
real 2m28.219s<br />
user 1m17.677s<br />
sys 0m19.325s<br />
<br />
3 threads<br />
noticed max ram<br />
real 3m50.741s<br />
user 1m38.278s<br />
sys 0m34.466s</div>Mark Jordanhttps://wiki.curatecamp.org/index.php?title=FITS&diff=2427FITS2012-11-17T04:55:22Z<p>Mark Jordan: </p>
<hr />
<div>This page is for notes on how to optimize FITS.<br />
<br />
==Whence OpenFITS?==<br />
I created the OpenFITS fork (and at least one fork has been made from that) for the Hackathon, but have no long term plans for it. Randy Stern at Harvard has said he would rather not have other forks around, thinking they'll create confusion or something. They'll hopefully take what's good in OpenFITS and merge it back to their Google Code repository. <br />
<br />
To keep this in perspective, the Harvard Library regards FITS primarily as an ingest tool for their internal purpose, and makes it available to others as a side issue. Convincing them to incorporate serious changes could be difficult. If someone else wants to create a permanent fork of OpenFITS, that's not my concern, but I'm keeping my own fork to changes that I think they're likely to accept at Harvard. --[[User:Gary McGath|Gary McGath]] 16:59, 16 November 2012 (PST)<br />
<br />
==Thread parallelism and memory consumption==<br />
FITS runs all its tools in parallel threads. This can result in heavy memory consumption, particularly if a big file is being processed. Harvard optimized FITS for DRS ingest, which can afford a lot of memory. In other environments this might cause thrashing. <br />
<br />
One approach might be to add a command line or config parameter to limit the number of simultaneous threads.<br />
<br />
==FITS on Github==<br />
<br />
A [https://github.com/gmcgath/openfits fork of FITS] is now up on Github. Let me know if you want to be added as a contributor. I've changed the repository name from fits to openfits to assuage concerns from Harvard.<br />
<br />
The JHOVE 1.8 jars are now there.<br />
<br />
==Optimization tip==<br />
<br />
The HTML module in JHOVE is very slow and not very useful unless you're rejecting the 90% of the HTML files that aren't strictly valid. If you don't need it, edit xml/jhove/jhove.conf and remove the module element which refers to edu.harvard.hul.ois.jhove.module.HtmlModule .<br />
* Similarly, at UNC we had to exclude XML files from being processed by jhove, as moderately long documents could cause the jhove process to hang for long periods of time. (bbpennel)<br />
<br />
==Improving JHOVE performance within FITS==<br />
<br />
Spencer McEwen, who wrote most or all of FITS, tells me that JHOVE is the biggest bottleneck in FITS and that each time it's called, it grabs required schemas from the Web, instead of using local copies. I'll see if that's fixable.<br />
CORRECTION: Never mind. He was talking about validating XML files that are given to JHOVE, not validation of JHOVE's own configuration files. Oh, well, onward...<br />
<br />
But this gives me another idea. JHOVE really should have a way of using local copies of arbitrary schemas. The difficulty is that we don't know what schemas any given installation will find useful. But if the config for the module can have a series of mappings from schema URIs to local files, the user can edit those parameters to locate any schemas that are used locally. Might this be worthwhile?<br />
<br />
I've just checked in experimental JHOVE jars on OpenFITS. These have a feature which can potentially improve performance with XML files. It requires some work to set up, but it could be worth it for high-volume environments. I can't see any way around this, since it involves using local copies of XML schemas, and everyone may be validating XML files that use different schemas. What you have to do is edit xml/jhove/jhove.conf, look for the '''module''' element that declares edu.harvard.hul.ois.jhove.module.XmlModule, and add parameters that tell it where local schemas are. For example:<br />
<nowiki> <module> </nowiki><br />
<br />
<nowiki> <class>edu.harvard.hul.ois.jhove.module.XmlModule</class> </nowiki><br />
<br />
<nowiki> <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/mods/v3;/Users/myself/schemas/mods-3-4.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/standards/mods/v3/mods-3-2.xsd;/Users/myself/schemas/mods-3-2.xsd</param> </nowiki><br />
<br />
<nowiki> </module> </nowiki><br />
<br />
--[[User:Gary McGath|Gary McGath]] 12:39, 16 November 2012 (PST)<br />
<br />
==Mavenizing FITS?==<br />
<br />
Peshkira has some remarks on Mavenizing FITS at [[https://github.com/peshkira/openfits/wiki/FITS-&-Maven]]. While this may have merit in the abstract, it leads to the question of what to do with it. Harvard LTS isn't currently using Maven for anything I can think of, so it would be a hard sell. Unless OPF wants to create its own long-term fork of FITS, this could be a difficult sell with Harvard. --[[User:Gary McGath|Gary McGath]] 15:05, 16 November 2012 (PST)<br />
<br />
==Results of optimization tests==<br />
These are some preliminary tests done by the Archivematica team, using OpenFITS. Machine specs are 1 GB RAM, 1 2000Mhz core, 1 or 2 threads (not sure, it's late).<br />
<br />
1 core<br />
<br />
time fits.sh -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp1<br />
real 2m47.007s<br />
user 1m41.178s<br />
sys 0m29.278s<br />
<br />
time openfits -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp2<br />
<br />
with default 20 threads<br />
<br />
real 0m4.355s<br />
user 0m3.420s<br />
sys 0m0.412s<br />
<br />
Still lots of:<br />
edu.harvard.hul.ois.fits.exceptions.FitsToolException: Error parsing Exiftool XML Output (Error on line 31: The content of elements must consist of well-formed character data or markup.)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.createXml(Exiftool.java:197)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.extractInfo(Exiftool.java:118)<br />
at edu.harvard.hul.ois.fits.tools.ToolBase.run(ToolBase.java:141)<br />
at java.lang.Thread.run(Thread.java:679)<br />
<br />
Set to 4 threads<br />
real 3m22.685s<br />
user 2m2.144s<br />
sys 0m40.235s<br />
<br />
Set to 2 threads<br />
real 3m27.194s<br />
user 1m31.018s<br />
sys 0m29.726s<br />
<br />
Set memory to 1 GB<br />
-Xmx1024m<br />
<br />
real 3m51.133s<br />
user 1m37.506s<br />
sys 0m30.574s<br />
<br />
Set to 512MB<br />
real 3m39.913s<br />
user 1m42.002s<br />
sys 0m29.050s<br />
<br />
no changes<br />
real 2m28.219s<br />
user 1m17.677s<br />
sys 0m19.325s<br />
<br />
3 threads<br />
noticed max ram<br />
real 3m50.741s<br />
user 1m38.278s<br />
sys 0m34.466s</div>Mark Jordanhttps://wiki.curatecamp.org/index.php?title=FITS&diff=2426FITS2012-11-17T04:46:36Z<p>Mark Jordan: </p>
<hr />
<div>This page is for notes on how to optimize FITS.<br />
<br />
==Whence OpenFITS?==<br />
I created the OpenFITS fork (and at least one fork has been made from that) for the Hackathon, but have no long term plans for it. Randy Stern at Harvard has said he would rather not have other forks around, thinking they'll create confusion or something. They'll hopefully take what's good in OpenFITS and merge it back to their Google Code repository. <br />
<br />
To keep this in perspective, the Harvard Library regards FITS primarily as an ingest tool for their internal purpose, and makes it available to others as a side issue. Convincing them to incorporate serious changes could be difficult. If someone else wants to create a permanent fork of OpenFITS, that's not my concern, but I'm keeping my own fork to changes that I think they're likely to accept at Harvard. --[[User:Gary McGath|Gary McGath]] 16:59, 16 November 2012 (PST)<br />
<br />
==Thread parallelism and memory consumption==<br />
FITS runs all its tools in parallel threads. This can result in heavy memory consumption, particularly if a big file is being processed. Harvard optimized FITS for DRS ingest, which can afford a lot of memory. In other environments this might cause thrashing. <br />
<br />
One approach might be to add a command line or config parameter to limit the number of simultaneous threads.<br />
<br />
==FITS on Github==<br />
<br />
A [https://github.com/gmcgath/openfits fork of FITS] is now up on Github. Let me know if you want to be added as a contributor. I've changed the repository name from fits to openfits to assuage concerns from Harvard.<br />
<br />
The JHOVE 1.8 jars are now there.<br />
<br />
==Optimization tip==<br />
<br />
The HTML module in JHOVE is very slow and not very useful unless you're rejecting the 90% of the HTML files that aren't strictly valid. If you don't need it, edit xml/jhove/jhove.conf and remove the module element which refers to edu.harvard.hul.ois.jhove.module.HtmlModule .<br />
* Similarly, at UNC we had to exclude XML files from being processed by jhove, as moderately long documents could cause the jhove process to hang for long periods of time. (bbpennel)<br />
<br />
==Improving JHOVE performance within FITS==<br />
<br />
Spencer McEwen, who wrote most or all of FITS, tells me that JHOVE is the biggest bottleneck in FITS and that each time it's called, it grabs required schemas from the Web, instead of using local copies. I'll see if that's fixable.<br />
CORRECTION: Never mind. He was talking about validating XML files that are given to JHOVE, not validation of JHOVE's own configuration files. Oh, well, onward...<br />
<br />
But this gives me another idea. JHOVE really should have a way of using local copies of arbitrary schemas. The difficulty is that we don't know what schemas any given installation will find useful. But if the config for the module can have a series of mappings from schema URIs to local files, the user can edit those parameters to locate any schemas that are used locally. Might this be worthwhile?<br />
<br />
I've just checked in experimental JHOVE jars on OpenFITS. These have a feature which can potentially improve performance with XML files. It requires some work to set up, but it could be worth it for high-volume environments. I can't see any way around this, since it involves using local copies of XML schemas, and everyone may be validating XML files that use different schemas. What you have to do is edit xml/jhove/jhove.conf, look for the '''module''' element that declares edu.harvard.hul.ois.jhove.module.XmlModule, and add parameters that tell it where local schemas are. For example:<br />
<nowiki> <module> </nowiki><br />
<br />
<nowiki> <class>edu.harvard.hul.ois.jhove.module.XmlModule</class> </nowiki><br />
<br />
<nowiki> <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/mods/v3;/Users/myself/schemas/mods-3-4.xsd</param></nowiki><br />
<br />
<nowiki> <param>schema=http://www.loc.gov/standards/mods/v3/mods-3-2.xsd;/Users/myself/schemas/mods-3-2.xsd</param> </nowiki><br />
<br />
<nowiki> </module> </nowiki><br />
<br />
--[[User:Gary McGath|Gary McGath]] 12:39, 16 November 2012 (PST)<br />
<br />
==Mavenizing FITS?==<br />
<br />
Peshkira has some remarks on Mavenizing FITS at [[https://github.com/peshkira/openfits/wiki/FITS-&-Maven]]. While this may have merit in the abstract, it leads to the question of what to do with it. Harvard LTS isn't currently using Maven for anything I can think of, so it would be a hard sell. Unless OPF wants to create its own long-term fork of FITS, this could be a difficult sell with Harvard. --[[User:Gary McGath|Gary McGath]] 15:05, 16 November 2012 (PST)<br />
<br />
==Results of optimization tests==<br />
These are some preliminary tests done by the Archivematica team, using OpenFITS. Machine specs are 1 GB ram 1 core, 1 or 2 threads (not sure, it's late).<br />
<br />
1 core<br />
<br />
time fits.sh -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp1<br />
real 2m47.007s<br />
user 1m41.178s<br />
sys 0m29.278s<br />
<br />
time openfits -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp2<br />
<br />
with default 20 threads<br />
<br />
real 0m4.355s<br />
user 0m3.420s<br />
sys 0m0.412s<br />
<br />
Still lots of:<br />
edu.harvard.hul.ois.fits.exceptions.FitsToolException: Error parsing Exiftool XML Output (Error on line 31: The content of elements must consist of well-formed character data or markup.)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.createXml(Exiftool.java:197)<br />
at edu.harvard.hul.ois.fits.tools.exiftool.Exiftool.extractInfo(Exiftool.java:118)<br />
at edu.harvard.hul.ois.fits.tools.ToolBase.run(ToolBase.java:141)<br />
at java.lang.Thread.run(Thread.java:679)<br />
<br />
Set to 4 threads<br />
real 3m22.685s<br />
user 2m2.144s<br />
sys 0m40.235s<br />
<br />
Set to 2 threads<br />
real 3m27.194s<br />
user 1m31.018s<br />
sys 0m29.726s<br />
<br />
Set memory to 1 GB<br />
-Xmx1024m<br />
<br />
real 3m51.133s<br />
user 1m37.506s<br />
sys 0m30.574s<br />
<br />
Set to 512MB<br />
real 3m39.913s<br />
user 1m42.002s<br />
sys 0m29.050s<br />
<br />
no changes<br />
real 2m28.219s<br />
user 1m17.677s<br />
sys 0m19.325s<br />
<br />
3 threads<br />
noticed max ram<br />
real 3m50.741s<br />
user 1m38.278s<br />
sys 0m34.466s</div>Mark Jordanhttps://wiki.curatecamp.org/index.php?title=CURATEcamp_24_hour_worldwide_file_id_hackathon_Nov_16_2012&diff=2171CURATEcamp 24 hour worldwide file id hackathon Nov 16 20122012-11-02T18:43:02Z<p>Mark Jordan: </p>
<hr />
<div>[[Main Page]] > CURATEcamp iPRES 2012 > CURATEcamp and Open Planets Foundation 24 hour file id hackathon Nov 16 2012<br />
<br />
=Background=<br />
One break-out session at the CURATEcamp iPRES 2012 was affectionately branded "file id confessional" where we commiserated on the state of our file id tools and processes. We also talked about:<br />
<br />
*We can do a better job of specifying and documenting our file id requirements / use cases<br />
*We're all hooked on that FITS.xml, but FITS needs performance optimization ASAP (also, is Harvard up for extra dev?)<br />
*Apache Tika is a very actively supported and useful tool for file id and content extraction. How many of our file id requirements can it in fact cover? (See the sample invocations just after this list.)<br />
* Archivematica [https://www.archivematica.org/wiki/Format_policy_registry_requirements Format Policy Registry] use case (see also [http://actionplan.fcla.edu/ DAITSS action plans])<br />
* Jason Scott's "[http://ascii.textfiles.com/archives/3645 Let's Just Solve the Problem]" campaign to boldly catalog as much file format info as possible in the month of November.<br />
* also, CURATEcamp iPRES participant Paul Wheatley has since posted [http://www.openplanetsfoundation.org/blogs/2012-10-19-practitioners-have-spoken-we-need-better-characterisation We Need Better Characterization] as well as a link to [http://willsworld.blogs.edina.ac.uk/2012/10/18/online-hack-event/ Online Hack Event]. This led to a Twitter discussion between @pjvangarderen @anjacks0n @prwheatley about this 24 hr hackathon event.<br />
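<br />
For anyone who hasn't tried Tika yet, the question above is easy to poke at from the command line; the jar name below assumes a tika-app 1.x download from the Apache site, and the sample file name is just a placeholder:<br />
<br />
<nowiki> java -jar tika-app-1.2.jar --detect mystery-file.bin </nowiki><br />
<br />
<nowiki> java -jar tika-app-1.2.jar --metadata mystery-file.bin </nowiki><br />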
<br />
=What=<br />
<br />
A 24-hour+ live hackathon event where multi-time-zone teams work on common technical projects related to the CURATEcamp iPRES 2012 file id discussions. <br />
<br />
Project proposals can be made by anyone.<br />
<br />
We will start the day with New Zealand (GMT +12:00) and end with the North American West Coast wrapping up project(s), hopefully with one or two solid deliverables by midnight-ish PST (GMT -8:00).<br />
<br />
=When: '''Fri Nov 16'''=<br />
<br />
* Friday, November 16, 2012<br />
** [http://wiki.opf-labs.org/display/KB/2012-11-13+OPF+Hackathon+-+Emulation%2C+learn+from+the+experts OPF Emulation Hackathon] is Nov 13-15. Freiburg, Germany. Sorry, Nov 16th was chosen somewhat haphazardly. We didn't mean to compete with OPF Hackathon event. But emulation needs file characterization too? Maybe OPF Emulation Hackathon can hand off some "File Id for Emulation" use cases to the Nov 16 24 hr Hackathon...or better yet, extend the Freiburg event to include participation in the Nov 16 24 hr worldwide #fileidhack event. Great way to cap off their Hackathon week! --[[User:PeterVG|PeterVG]] 11:48, 23 Oct 2012 (PDT)<br />
* <strike>Friday, November 23, 2012</strike><br />
** RT @declan: @pjvangarderen neat idea! You know that date is the day after US Thanksgiving, right? people might be on vacation<br />
<br />
=How=<br />
* Twitter: [https://twitter.com/search/realtime?q=%23fileidhack #fileidhack] (made it shorter)<br />
* CURATEcamp Mediawiki: [[Special:UserLogin|Log-in]] and please help update this page<br />
<br />
Let's put together a schedule, tasklist, & volunteers to road-test these tools for Nov 16:<br />
* Google Hangout: [[Google Hangout for CURATEcamp|fire up a webcam]] <br />
* GoogleDocs: we can live edit any docs we feel the urge to produce<br />
* IRC: use existing channel or create one just for event?<br />
* GitHub: get those pull requests going<br />
<br />
=Why=<br />
* Because we'll probably get some useful shit done<br />
* Because it's fun to work with CURATEcamp people in a CURATEcamp type of way<br />
* Because doing a 24hr+ worldwide hack with real time collaboration tools is cool<br />
<br />
=Who ([[Special:UserLogin|Sign up]])=<br />
* '''GMT +12:00''' Digital Preservation Practical Implementers Guild (@DP_PIG)<br />
* ?<br />
* '''GMT +7:00''' [[User:Euan_Cochrane|Euan Cochrane]] (@euanc)<br />
* ?<br />
* '''GMT +2:00''' [[User:Maurice_de_Rooij|TechMaurice]] (NANETH)<br />
* '''GMT +1:00''' [[User:Nicholas_Clarke|Nicholas Clarke]] (@nclarkedk) - netarkivet.dk<br />
* '''GMT +0:00''' [[User:Andy_Jackson|Andy Jackson]] (@anjacks0n), Paul Wheatley (@prwheatley), BL digital preservation team - Maureen (@mopennock) PeteC, PeterM, Lynn, William, and maybe more...; [[User:David Underdown|David Underdown]] (@davidunderdown9) and maybe some more TNA folk<br />
* ?<br />
* '''GMT -5:00''' Kara Van Malssen (@kvanmalssen), Dave Rice (@dericed), Ben Fino-Radin (@benfinoradin), Gary McGath (@Garym03062), @anarchivist<br />
* '''GMT -5:00''' @lljohnston @blefurgy et al!<br />
* ?<br />
* '''GMT -8:00''' [http://artefactual.com/team Artefactual]: peter (@pjvangarderen), courtney (@snarkivist), evelyn, joseph, mikeC (@mcantelon), mikeG, austin, dan...plus any VanCity people wanting to participate from [http://artefactual.com/contact.html Artefactual office].<br />
<br />
=Project Proposals=<br />
* Document file id requirements / use cases<br />
* ArchiveTeam "Just Solve the Problem" wiki scraping -> structured data (CSV?, XML?, RDF?); as an ongoing service?<br />
* [[Improving format ID coverage]]<br />
** Maybe incorporate [http://www.ace.net.nz/tech/TechFileFormat.html "Almost Every file format in the world!"]<br />
* [[Collecting format ID test files]]<br />
* [[Improving identification methods]]<br />
** Develop a Format ID [http://digitalcontinuity.org/post/7327791836/emulation-workbench-for-digital-object-format-analysis "Emulation Workbench"] for format analysis<br />
** Document software input and output formats to use in limiting the option set for files of a particular time period (if we know all formats that were creatable during a period when a file was created then we can limit results to only those formats), and for use in [http://digitalcontinuity.org/post/7325561455/mining-application-documentation-for-file-format format intelligence mining].<br />
* Archivematica / Tika integration <br />
** @archivematica team & volunteers<br />
* Archivematica [https://www.archivematica.org/wiki/Format_policy_registry_requirements Format Policy Registry] testing<br />
** @archivematica team & volunteers<br />
* @kvanmalssen Improved file id / characterization support for AV files in existing tools like Tika and FITS. An update of Exiftool and inclusion of MediaInfo would be a good start. Or maybe test applicability of ffprobe/avprobe for this task.<br />
** @dericed This is exactly what ffprobe/avprobe does. Whereas many of the digipres tools do identification by sampling x bytes from the head and tail, ffprobe/avprobe incorporate one of the many extensive demuxing libraries to manage identification of the contents.<br />
** @kvanmalssen - Yes, so can we get avprobe to output in a structured way? And could it be incorporated into a tool like FITS or Tika so that we can have a file id tool that supports mixed collections?<br />
** @dericed - Yes, ffprobe/avprobe have the -print_format (-of) option so you can get json, xml, csv, or others. There's also an xsd published for the output. I suppose ffprobe could be incorporated into FITS, but not sure if this is an efficient idea. The premise of FITS seems to put all preservation metadata considerations on the container (file format), but in AV collections the codecs and contained bitstreams are far more significant to consider. (See the sample invocation below the proposal list.)<br />
** @kvanmalssen - The issue is we need AV support (including track/bitstream support) in these general tools so people can process mixed collections. That's what I'd like to figure out.<br />
** See also [[Improving identification methods]], which could perhaps be split into two or three and one of which merged with the above tweet discussion? [[User:Andy Jackson|Andy Jackson]] 15:20, 22 October 2012 (PDT)<br />
* FITS or Tika bugfix marathon (e.g. [https://issues.apache.org/jira/browse/TIKA-539 this one]).<br />
** Perhaps consider refactoring FITS to re-use existing dependency management tools like Maven and apt/yum/etc instead of manual dependency management? [[User:Andy Jackson|Andy Jackson]] 05:16, 23 October 2012 (PDT)<br />
* [[User:Maurice_de_Rooij|TechMaurice]]: Replace container identification function of [https://github.com/openplanets/fido FIDO] using PRONOM container signature.<br />
''Should we take a poll a day in advance to select 2 or 3 projects or should we just let everyone work on whatever proposal they wish?''<br />
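<br />
Picking up the avprobe/ffprobe thread in the AV proposal above: a structured-output invocation looks roughly like the following (exact flag spellings differ a little between ffprobe and avprobe builds, and the input file name is just a placeholder):<br />
<br />
<nowiki> ffprobe -v quiet -print_format json -show_format -show_streams input.mov </nowiki><br />
<br />
This returns a single JSON document with a "format" block describing the container and a "streams" array with per-track codec details, which is the kind of track/bitstream-level information the discussion above is asking for.<br />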
<br />
=Preparation TODO=<br />
* GitHub How To<br />
** Set up temporary FITS and/or Tika forks that we can work on?<br />
* Easier signature development tools and/or signature contribution tracking, as outlined in [[Improving format ID coverage]]<br />
* Example file contribution How To document, c.f. [[Collecting format ID test files]]<br />
* Prep Archivematica dev VMs (incl Tika checkout), spin up & grant IPs/SSH to Hackfest participants upon request (Artefactual: [http://artefactual.com/austin-trask.html Austin])</div>Mark Jordanhttps://wiki.curatecamp.org/index.php?title=CURATEcamp_iPRES_2012_Discussion_Ideas&diff=2020CURATEcamp iPRES 2012 Discussion Ideas2012-10-02T12:09:29Z<p>Mark Jordan: </p>
<hr />
<div>Feel free to use this space to share ideas for discussion at CURATEcamp 2012 iPRES. Don't forget that the focus of this camp is integrated digital preservation.<br />
<br />
----<br />
<br />
'''Topic you are interested in''' (Your name): A sentence or three about your topic.<br />
<br />
<br />
# '''Micro service requirements''' (Paul Wheatley): How do we design micro services to maximise their re-use and simplify their orchestration? What are our requirements for the generic aspects of their design (and most importantly their interfaces)? Can we come up with micro services that can easily be shared between orchestration systems? (See this discussion about FITS and characterisation tool wrapping http://openplanetsfoundation.org/blogs/2012-07-27-fits-or-not-fits)<br />
# '''Service-oriented architecture for digital preservation''' (Mark Jordan) Some work was started by JISC (http://www.ahds.ac.uk/about/projects/soapi/index.htm) but its current status is unknown. This topic is relevant to integrating digital preservation activities not only within an organization but also to enabling cross-organization collaboration and tool-sharing. The RESTful BagIt Server (https://github.com/acdha/restful-bag-server) is an example of such a service.<br />
# '''Re-use and collaboration''' (Paul Wheatley): There seems to be a lot of repetition and reinvention in DP community tool development, despite limited resources and a small pool of contributing parties. How can we avoid these problems, maximise collaboration, increase re-use of existing solutions, and improve sharing of knowledge about where they can be used effectively?<br />
# '''Requirements on file format registries''' (Paul Wheatley): File format initiatives are suddenly all the rage (http://openplanetsfoundation.org/blogs/2012-07-06-biodiversity-and-registry-ecosystem). I have previously suggested (controversial) that we haven't always clearly articulated what our requirements and use cases are for these systems (http://openplanetsfoundation.org/blogs/2012-07-05-dont-panic-what-we-might-need-format-registries). So what requirements do preservation tool developers (and users of their tools) have on file format registries?<br />
# '''Archivematica requirements for File ID & Format Policy Registry''' (Peter Van Garderen): Following up on PaulW's plea to clearly define these requirements, I can speak on the analysis we have done within the Archivematica project and our current design/development work on the Archivematica file identification microservice and Format Policy Registry (FPR) slated for use with Archivematica 1.0 (Feb 2013).<br />
# '''Cross-organization PREMIS event generation''' (Mark Jordan): How do organizations that implement PREMIS generate and manage PREMIS events across intra-organization document creation / management points?<br />
# '''Curation and the cloud''' (Lisa Snider) What kind of tools and strategies can we use to preserve cloud based services and social media while still preserving authenticity? Do we have to come up with different preservation strategies and use different digital forensics tools (compared to what we use for physical media)? Additionally, what tools do we use to provide user access to electronic material, such as emails, social media, Google docs, etc.?<br />
# '''New Strategies/Tools for New Technology?''' (Lisa Snider) With the popularity of new storage devices, such as SSDs, how can we preserve authenticity when our current forensics tools can’t keep up with them? What kinds of strategies can we employ to preserve these kinds of materials?<br />
# '''Feedback on LoC's Levels of Digital Preservation''' (Mark Jordan) Develop some feedback to the Library of Congress' new Levels of Preservation document, as invited at http://blogs.loc.gov/digitalpreservation/2012/09/help-define-levels-for-digital-preservation-request-for-public-comments/<br />
# '''Practical digital preservation solutions for production entities''' (Kara Van Malssen) Digital preservation solutions have primarily been developed by and for collecting institutions. Applying these within entities that are actively producing content is a challenge, and thus far has not been done consistently and/or effectively. What are digital preservation workflows for institutions with little time or resources to devote to digital preservation, with valuable assets in active use? What simple technologies could support digital preservation and be integrated into existing toolsets, such as DAMs, CMSs, or other ECM tools?</div>Mark Jordan