Difference between revisions of "FITS"
Revision as of 21:39, 16 November 2012
This page is for notes on how to optimize FITS.
Thread parallelism and memory consumption
FITS runs all its tools in parallel threads. This can result in heavy memory consumption, particularly if a big file is being processed. Harvard optimized FITS for DRS ingest, which can afford a lot of memory. In other environments this might cause thrashing.
One approach might be to add a command line or config parameter to limit the number of simultaneous threads.
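A minimal sketch of what such a limit could look like, assuming each tool invocation can be wrapped as a Callable. BoundedToolRunner and runAll are hypothetical names for illustration, not existing FITS classes, and this is not how FITS currently schedules its tools:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper (not part of FITS): runs tool tasks with a cap on
// how many execute at the same time.
final class BoundedToolRunner {
    private BoundedToolRunner() {}

    // Runs every task with at most maxThreads running concurrently and
    // returns the results in the same order as the tasks.
    static <T> List<T> runAll(List<Callable<T>> tasks, int maxThreads) {
        if (maxThreads < 1) {
            throw new IllegalArgumentException("maxThreads must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until all tasks have completed.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tools", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a tool task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The value of maxThreads would come from the command line or config parameter suggested above. A fixed-size pool queues the remaining tools instead of starting them all at once, so peak memory is bounded by the few tools that are running rather than by all of them.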
FITS on GitHub
A fork of FITS is now up on GitHub. Let me know if you want to be added as a contributor. I've changed the repository name from fits to openfits to assuage concerns from Harvard.
The JHOVE 1.8 jars are now there.
Optimization tip
The HTML module in JHOVE is very slow and not very useful unless you're rejecting the 90% of the HTML files that aren't strictly valid. If you don't need it, edit xml/jhove/jhove.conf and remove the module element that refers to edu.harvard.hul.ois.jhove.module.HtmlModule.
- Similarly, at UNC we had to exclude XML files from being processed by jhove, as moderately long documents could cause the jhove process to hang for long periods of time. (bbpennel)
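If you'd rather script the edit than do it by hand, here is a rough sketch using the standard Java DOM API. JhoveConfEditor and removeModule are hypothetical names, not part of FITS or JHOVE, and the sketch assumes each module element carries its class name in a child class element, as in the jhove.conf entries quoted on this page:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of FITS or JHOVE).
final class JhoveConfEditor {
    private JhoveConfEditor() {}

    // Returns the configuration text with every <module> whose <class>
    // child equals moduleClass removed.
    static String removeModule(String confXml, String moduleClass) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(confXml)));
            NodeList modules = doc.getElementsByTagName("module");
            // Iterate backwards: the NodeList is live and shrinks on removal.
            for (int i = modules.getLength() - 1; i >= 0; i--) {
                Element module = (Element) modules.item(i);
                NodeList classes = module.getElementsByTagName("class");
                if (classes.getLength() > 0
                        && moduleClass.equals(classes.item(0).getTextContent().trim())) {
                    module.getParentNode().removeChild(module);
                }
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process configuration", e);
        }
    }
}
```

Read xml/jhove/jhove.conf into a string, pass it through removeModule with the class name of the module you want to drop, and write the result back. Keep a backup of the original file, since re-serializing can change whitespace and formatting.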
Improving JHOVE performance within FITS
Spencer McEwen, who wrote most or all of FITS, tells me that JHOVE is the biggest bottleneck in FITS and that each time it's called, it grabs required schemas from the Web, instead of using local copies. I'll see if that's fixable. CORRECTION: Never mind. He was talking about validating XML files that are given to JHOVE, not validation of JHOVE's own configuration files. Oh, well, onward...
But this gives me another idea. JHOVE really should have a way of using local copies of arbitrary schemas. The difficulty is that we don't know what schemas any given installation will find useful. But if the config for the module can have a series of mappings from schema URIs to local files, the user can edit those parameters to locate any schemas that are used locally. Might this be worthwhile?
I've just checked in experimental JHOVE jars on OpenFITS. These have a feature which can potentially improve performance with XML files. It requires some work to set up, but it could be worth it for high-volume environments. I can't see any way around this, since it involves using local copies of XML schemas, and everyone may be validating XML files that use different schemas. What you have to do is edit xml/jhove/jhove.conf, look for the module element that declares edu.harvard.hul.ois.jhove.module.XmlModule, and add parameters that tell it where local schemas are. For example:
 <module>
   <class>edu.harvard.hul.ois.jhove.module.XmlModule</class>
   <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param>
   <param>schema=http://www.loc.gov/mods/v3;/Users/myself/schemas/mods-3-4.xsd</param>
   <param>schema=http://www.loc.gov/standards/mods/v3/mods-3-2.xsd;/Users/myself/schemas/mods-3-2.xsd</param>
 </module>
--Gary McGath 12:39, 16 November 2012 (PST)
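For anyone curious how a mapping like this can be consumed, here is an illustrative sketch built on the standard SAX EntityResolver interface. It is not the code in the experimental jars, which may work differently. LocalSchemaResolver and localPathFor are hypothetical names, and the sketch assumes each parameter has the form schema=URI;path with the first semicolon as the separator:

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver (not the actual JHOVE implementation): maps schema
// URIs to local files based on "schema=URI;path" parameter strings.
final class LocalSchemaResolver implements EntityResolver {
    private static final String PREFIX = "schema=";
    private final Map<String, String> localPaths = new LinkedHashMap<>();

    LocalSchemaResolver(List<String> params) {
        for (String param : params) {
            if (!param.startsWith(PREFIX)) {
                continue; // some other module parameter
            }
            String value = param.substring(PREFIX.length());
            int sep = value.indexOf(';');
            if (sep <= 0 || sep == value.length() - 1) {
                continue; // malformed mapping: missing URI or path
            }
            localPaths.put(value.substring(0, sep), value.substring(sep + 1));
        }
    }

    // Returns the local file path mapped to the URI, or null if none.
    String localPathFor(String uri) {
        return localPaths.get(uri);
    }

    // Returning null tells the parser to resolve the entity the default way,
    // which for an http URI means fetching it over the network.
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String path = localPaths.get(systemId);
        if (path == null) {
            return null;
        }
        return new InputSource(new File(path).toURI().toString());
    }
}
```

A resolver like this would be registered on the parser with setEntityResolver. Entity resolution is keyed by system identifier, so a namespace-keyed entry such as http://www.loc.gov/mods/v3 would need a separate lookup, for example through localPathFor, wherever the namespace is known.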