From CURATEcamp

This page is for notes on how to optimize FITS.

Whence OpenFITS?

I created the OpenFITS fork (and at least one fork has been made from that) for the Hackathon, but have no long-term plans for it. Randy Stern at Harvard has said he would rather not have other forks around, since he thinks they'll create confusion. Hopefully Harvard will take what's good in OpenFITS and merge it back into their Google Code repository.

To keep this in perspective, the Harvard Library regards FITS primarily as an ingest tool for its own internal purposes, and makes it available to others as a secondary matter. Convincing them to incorporate serious changes could be difficult. If someone else wants to create a permanent fork of OpenFITS, that's not my concern, but I'm limiting my own fork to changes that I think Harvard is likely to accept. --Gary McGath 16:59, 16 November 2012 (PST)

Thread parallelism and memory consumption

FITS runs all its tools in parallel threads. This can result in heavy memory consumption, particularly if a big file is being processed. Harvard optimized FITS for DRS ingest, which can afford a lot of memory. In other environments this might cause thrashing.

One approach might be to add a command line or config parameter to limit the number of simultaneous threads.
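The idea of capping simultaneous threads can be sketched as follows. FITS itself is written in Java; this Python sketch only illustrates the bounded-pool pattern. The names `run_tools` and `max_threads`, and the dummy tool callables, are hypothetical and are not part of FITS.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tools(tools, path, max_threads):
    """Run every tool on `path`, with at most `max_threads` running at once.

    `tools` maps a tool name to a callable taking the file path; both are
    hypothetical stand-ins for the tools FITS wraps.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # All jobs are queued right away; the pool runs at most max_threads at a time.
        futures = {name: pool.submit(tool, path) for name, tool in tools.items()}
        # result() blocks until that tool finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Dummy tools that just report what they were asked to process.
    tools = {
        name: (lambda p, n=name: f"{n} processed {p}")
        for name in ("tool_a", "tool_b", "tool_c")
    }
    for name, output in run_tools(tools, "sample.tif", max_threads=2).items():
        print(name, "->", output)
```

With a cap like this, peak memory is bounded by the few tools that run together rather than by all of them at once, at the possible cost of longer wall-clock time per file.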

FITS on GitHub

A fork of FITS is now up on GitHub. Let me know if you want to be added as a contributor. I've changed the repository name from fits to openfits to assuage concerns from Harvard.

The JHOVE 1.8 jars are now there.

Optimization tip

The HTML module in JHOVE is very slow and not very useful unless you're rejecting the 90% of the HTML files that aren't strictly valid. If you don't need it, edit xml/jhove/jhove.conf and remove the module element which refers to edu.harvard.hul.ois.jhove.module.HtmlModule.

  • Similarly, at UNC we had to exclude XML files from being processed by jhove, as moderately long documents could cause the jhove process to hang for long periods of time. (bbpennel)
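For anyone who would rather script that edit than do it by hand, here is a minimal sketch. It assumes only that the config is well-formed XML in which a `module` element contains the class name as text somewhere inside it. The helper names and the sample fragment are made up for illustration and are not the real jhove.conf.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def remove_module(root, class_name):
    """Remove every <module> element whose content mentions class_name.

    Returns the number of elements removed.
    """
    removed = 0
    # Snapshot the tree first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) != "module":
                continue
            texts = [(e.text or "").strip() for e in child.iter()]
            if class_name in texts:
                parent.remove(child)
                removed += 1
    return removed


if __name__ == "__main__":
    # Illustrative fragment only; the real file has more elements.
    sample = """<jhoveConfig>
      <module><class>edu.harvard.hul.ois.jhove.module.HtmlModule</class></module>
      <module><class>edu.harvard.hul.ois.jhove.module.XmlModule</class></module>
    </jhoveConfig>"""
    root = ET.fromstring(sample)
    count = remove_module(root, "edu.harvard.hul.ois.jhove.module.HtmlModule")
    print("modules removed:", count)
```

Note that ElementTree drops comments and may rewrite namespace prefixes when saving, so keep a backup of the original file before writing anything back.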

Improving JHOVE performance within FITS

Spencer McEwen, who wrote most or all of FITS, tells me that JHOVE is the biggest bottleneck in FITS and that each time it's called, it grabs required schemas from the Web, instead of using local copies. I'll see if that's fixable. CORRECTION: Never mind. He was talking about validating XML files that are given to JHOVE, not validation of JHOVE's own configuration files. Oh, well, onward...

But this gives me another idea. JHOVE really should have a way of using local copies of arbitrary schemas. The difficulty is that we don't know what schemas any given installation will find useful. But if the config for the module can have a series of mappings from schema URIs to local files, the user can edit those parameters to locate any schemas that are used locally. Might this be worthwhile?
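The mapping idea might look roughly like this. It is a language-neutral sketch in Python, not JHOVE code. The function name `resolve_schema`, the mapping dictionary, and the example URI and path are all hypothetical.

```python
from pathlib import Path


def resolve_schema(uri, local_schemas):
    """Return a local file path for `uri` if one is configured and exists.

    `local_schemas` maps schema URIs to local file paths. When there is no
    mapping, or the mapped file is missing, the original URI is returned so
    the caller can still fetch the schema from the Web.
    """
    local = local_schemas.get(uri)
    if local is not None and Path(local).is_file():
        return str(local)
    return uri


if __name__ == "__main__":
    # Hypothetical mapping that an installation would edit for its own schemas.
    mapping = {"http://www.example.org/schema.xsd": "/opt/schemas/schema.xsd"}
    for uri in ("http://www.example.org/schema.xsd", "http://www.example.org/other.xsd"):
        print(uri, "->", resolve_schema(uri, mapping))
```

Falling back to the URI keeps validation working for schemas nobody has mapped yet, so each installation only needs to list the schemas it sees most often.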

I've just checked in experimental JHOVE jars on OpenFITS. These have a feature which can potentially improve performance with XML files. It requires some work to set up, but it could be worth it for high-volume environments. I can't see any way around this, since it involves using local copies of XML schemas, and everyone may be validating XML files that use different schemas. What you have to do is edit xml/jhove/jhove.conf, look for the module element that declares edu.harvard.hul.ois.jhove.module.XmlModule, and add parameters that tell it where local schemas are. For example: <module>






--Gary McGath 12:39, 16 November 2012 (PST)

Mavenizing FITS?

Peshkira has some remarks on Mavenizing FITS at [[1]]. While this may have merit in the abstract, it leads to the question of what to do with it. Harvard LTS isn't currently using Maven for anything I can think of, so unless OPF wants to create its own long-term fork of FITS, this would be a hard sell with Harvard. --Gary McGath 15:05, 16 November 2012 (PST)

Results of optimization tests

These are some preliminary tests done by the Archivematica team (thanks @berwin22 @jordanheit @mcantelon @pjvangarderen +ARTi +epmcellan) using OpenFITS. Machine specs are 1 GB RAM, no swap, one 2000 MHz core, 1 or 2 threads (not sure, it's late). Input is the standard (complete) Archivematica sample data.

   time -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp1
   real    2m47.007s
   user    1m41.178s
   sys     0m29.278s
   time openfits -r -i /home/archivematica-sampledata-master/SampleTransfers/ -o /tmp/tmp2
   with default 20 threads
   real    0m4.355s
   user    0m3.420s
   sys     0m0.412s
   Still lots of:
   edu.harvard.hul.ois.fits.exceptions.FitsToolException: Error parsing Exiftool XML Output (Error on line 31: The content of elements must consist of well-formed character data or markup.)
   Set to 4 threads
   real    3m22.685s
   user    2m2.144s
   sys     0m40.235s
   Set to 2 threads
   real    3m27.194s
   user    1m31.018s
   sys     0m29.726s
   Set memory to 1 GB
   real    3m51.133s
   user    1m37.506s
   sys     0m30.574s
   Set to 512MB
   real    3m39.913s
   user    1m42.002s
   sys     0m29.050s
   no changes
   real    2m28.219s
   user    1m17.677s
   sys     0m19.325s
   3 threads
   noticed max ram
   real    3m50.741s
   user    1m38.278s
   sys     0m34.466s
   Results after increasing RAM to 4 GB (3 threads, JVM allocated 512MB, as above):
   real    3m39.226s
   user    1m52.487s
   sys     0m41.527s
   Results after increasing threads in OpenFITS config from 3 to 10:
   real    3m6.449s
   user    1m58.223s
   sys     0m32.542s
   Results after increasing threads in OpenFITS config from 10 to 40:
   real    2m54.929s
   user    1m50.299s
   sys     0m30.662s


These (preliminary) tests don't show an appreciable decrease in processing time from increasing the machine's RAM, the number of threads in the fits.xml config file, or the amount of RAM allocated to the JVM.

Maybe not the result we hoped for, but these are some metrics to inform further discussion, strategy, and development nonetheless.
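To compare the runs on a common scale, the `real` figures from the log above can be converted to seconds. The sketch below copies the times as reported. The run labels are paraphrased from the log, and leaving the 4-second run out of the spread is an assumption made here (the log pairs that run with Exiftool parsing errors), not something the testers stated.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '2m47.007s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (label, real time) pairs in the order they appear in the log.
REAL_TIMES = [
    ("first run (command name missing in the log)", "2m47.007s"),
    ("openfits, default 20 threads", "0m4.355s"),
    ("4 threads", "3m22.685s"),
    ("2 threads", "3m27.194s"),
    ("memory 1 GB", "3m51.133s"),
    ("memory 512MB", "3m39.913s"),
    ("no changes", "2m28.219s"),
    ("3 threads", "3m50.741s"),
    ("RAM 4 GB, 3 threads, JVM 512MB", "3m39.226s"),
    ("10 threads", "3m6.449s"),
    ("40 threads", "2m54.929s"),
]

if __name__ == "__main__":
    seconds = [(label, to_seconds(stamp)) for label, stamp in REAL_TIMES]
    for label, value in seconds:
        print(f"{value:8.3f} s  {label}")
    # Assumption: runs under a minute did not do the full processing.
    full_runs = [value for _, value in seconds if value >= 60]
    print(f"spread of full runs: {min(full_runs):.3f} s to {max(full_runs):.3f} s")
```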