https://wiki.curatecamp.org/index.php?title=Improving_identification_methods&feed=atom&action=historyImproving identification methods - Revision history2024-03-28T14:53:41ZRevision history for this page on the wikiMediaWiki 1.28.0https://wiki.curatecamp.org/index.php?title=Improving_identification_methods&diff=2139&oldid=prevGary McGath: /* Text-based formats */ link to ISO 8859 proposal2012-10-23T22:13:53Z<p><span dir="auto"><span class="autocomment">Text-based formats: </span> link to ISO 8859 proposal</span></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;' lang='en'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 22:13, 23 October 2012</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l15" >Line 15:</td>
<td colspan="2" class="diff-lineno">Line 15:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>== Text-based formats ==</div></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>== Text-based formats ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>All known tools are bad at identifying most text formats. SGML/HTML/XML are fine (at least when well formed), but CSS, JavaScript, compressed JavaScript, CSV, Bash, Python, etc. are all poorly identified by the existing tools. We could do with [[Collecting format ID test files]] of these types to make testing easier, but it's simply not clear how to do it.</div></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>All known tools are bad at identifying most text formats. SGML/HTML/XML are fine (at least when well formed), but CSS, JavaScript, compressed JavaScript, CSV, Bash, Python, etc. are all poorly identified by the existing tools. We could do with [[Collecting format ID test files]] of these types to make testing easier, but it's simply not clear how to do it.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[Identifying text encodings|Added an idea for discussion]] --[[User:Gary McGath|Gary McGath]] 15:13, 23 October 2012 (PDT)</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>See also [http://droid7.wikispaces.com/requirement0013 DROID can identify text based formats such as source code and scripting languages].</div></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>See also [http://droid7.wikispaces.com/requirement0013 DROID can identify text based formats such as source code and scripting languages].</div></td></tr>
</table>Gary McGathhttps://wiki.curatecamp.org/index.php?title=Improving_identification_methods&diff=2099&oldid=prevAndy Jackson: /* Text-based formats */ Added link to relevant DROID discussion.2012-10-22T22:19:04Z<p><span dir="auto"><span class="autocomment">Text-based formats: </span> Added link to relevant DROID discussion.</span></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;' lang='en'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 22:19, 22 October 2012</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l15" >Line 15:</td>
<td colspan="2" class="diff-lineno">Line 15:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>== Text-based formats ==</div></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>== Text-based formats ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>All known tools are bad at identifying most text formats. SGML/HTML/XML are fine (at least when well formed), but CSS, JavaScript, compressed JavaScript, CSV, Bash, Python, etc. are all poorly identified by the existing tools. We could do with [[Collecting format ID test files]] of these types to make testing easier, but it's simply not clear how to do it.</div></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>All known tools are bad at identifying most text formats. SGML/HTML/XML are fine (at least when well formed), but CSS, JavaScript, compressed JavaScript, CSV, Bash, Python, etc. are all poorly identified by the existing tools. We could do with [[Collecting format ID test files]] of these types to make testing easier, but it's simply not clear how to do it.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">See also [http://droid7.wikispaces.com/requirement0013 DROID can identify text based formats such as source code and scripting languages].</ins></div></td></tr>
</table>Andy Jacksonhttps://wiki.curatecamp.org/index.php?title=Improving_identification_methods&diff=2096&oldid=prevAndy Jackson: Outline of methodology issues.2012-10-22T22:06:35Z<p>Outline of methodology issues.</p>
<p><b>New page</b></p><div>[[Main Page]] > CURATEcamp iPRES 2012 > [[CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012]]<br />
<br />
There are some systematic issues with the tools, outlined here. They are rather big issues, but might be worth considering.<br />
<br />
== 'Container' Format ==<br />
Many formats require some degree of parsing before their contents can be understood, from DOCX (a specialised ZIP package, which DROID and Tika handle as such) through to media codecs (which are well supported by other tools such as ffprobe).<br />
<br />
There are three issues here:<br />
<br />
* Whether we try to sync up how the different tools work (e.g. port DROID's container file signatures and turn them into a Tika Detector module).<br />
* Whether we try to formalise the integration of Tika/DROID/etc. into an overall ID workflow so that ffprobe/etc. can be reliably called in as needed.<br />
* Whether we can use MIME type codec parameters to capture the identification information (see http://tools.ietf.org/html/rfc4281).<br />
<br />
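As a rough illustration of the DOCX case above, the following Python sketch looks inside a ZIP and checks for the parts that mark an OOXML word-processing document. It is not how DROID or Tika implement container signatures; the names <code>identify_container</code> and <code>make_zip</code>, and the fall-back MIME types, are assumptions made for this example.<br />

```python
import io
import zipfile

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def identify_container(data: bytes) -> str:
    """Guess a MIME type for raw bytes, looking inside ZIP containers."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        # Not a ZIP at all: this sketch makes no further attempt.
        return "application/octet-stream"
    with zipfile.ZipFile(buffer) as archive:
        names = set(archive.namelist())
    # OOXML packages carry a [Content_Types].xml manifest, and a
    # word/document.xml part marks a word-processing document.
    if "[Content_Types].xml" in names and "word/document.xml" in names:
        return DOCX_MIME
    return "application/zip"


def make_zip(member_names) -> bytes:
    """Hypothetical helper: build an in-memory ZIP with empty members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "")
    return buffer.getvalue()


if __name__ == "__main__":
    print(identify_container(make_zip(["[Content_Types].xml", "word/document.xml"])))
    print(identify_container(make_zip(["readme.txt"])))
```

A real signature set would check more than two member names, but the shape is the same: identify the outer container first, then decide from its contents.<br />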
<br />
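For the MIME type idea, here is a minimal sketch of reading a codecs parameter in the style of RFC 4281, using only the Python standard library. The function name <code>parse_media_type</code> and the sample value are illustrative assumptions, not output from any of the tools discussed.<br />

```python
from email.message import Message


def parse_media_type(value: str):
    """Split a media type string into (type/subtype, list of codecs)."""
    message = Message()
    message["Content-Type"] = value
    # get_param returns None when the parameter is absent and strips
    # the surrounding quotes when it is present.
    codecs = message.get_param("codecs")
    codec_list = [c.strip() for c in codecs.split(",")] if codecs else []
    return message.get_content_type(), codec_list


if __name__ == "__main__":
    # Illustrative value only; a real one would come from a tool such as ffprobe.
    print(parse_media_type('video/mp4; codecs="avc1.42E01E, mp4a.40.2"'))
```

With the sample value this returns <code>('video/mp4', ['avc1.42E01E', 'mp4a.40.2'])</code>, so a single string can carry both the container type and what is inside it.<br />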
== Text-based formats ==<br />
All known tools are bad at identifying most text formats. SGML/HTML/XML are fine (at least when well formed), but CSS, JavaScript, compressed JavaScript, CSV, Bash, Python, etc. are all poorly identified by the existing tools. We could do with [[Collecting format ID test files]] of these types to make testing easier, but it's simply not clear how to do it.</div>Andy Jackson