Identifying text encodings

Here's a raw idea for a file identification project: Given that a text file is (probably) encoded as ISO 8859, how do we tell which of the 15 encodings is used? A possible approach is to see which encoding yields the most real words.

Putting this into more detail: ISO 8859 encodes only languages that use an alphabet, so any natural-language document will consist of words. Let's take a set of vocabulary lists for all the possible languages (or at least all the ones we consider likely to show up), encoded in UTF-8, and turn each word into a hash value. We now have a vocabulary hash set of all those words. For each document, decode it as each 8859-n encoding in turn (converting to UTF-8), extract the lexical words, and take their hash values. If the document contains character codes that are invalid for a particular encoding, that encoding is ruled out. Otherwise, count how many of the document's word hashes appear in the vocabulary hash set. The encoding that yields the highest vocabulary word count is the one most likely used.
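Here's a minimal Python sketch of that scoring step, just to make the idea concrete. The word list, the choice of candidate encodings, and the helper names (word_hash, score_encodings) are all placeholders; a real run would load full vocabulary files for every language and include every 8859 part.

 import hashlib
 import re
 
 # Placeholder vocabulary; a real version would load full word lists
 # for every candidate language, stored in UTF-8.
 VOCABULARY_WORDS = ["the", "and", "être", "straße", "niño"]
 
 def word_hash(word):
     """Hash one lowercased word (as UTF-8 bytes) to a fixed-size value."""
     return hashlib.md5(word.lower().encode("utf-8")).digest()
 
 # The vocabulary hash set, built once up front.
 VOCAB_HASHES = {word_hash(w) for w in VOCABULARY_WORDS}
 
 # Illustrative subset of the 8859 family.
 CANDIDATE_ENCODINGS = ["iso8859-1", "iso8859-2", "iso8859-5",
                        "iso8859-7", "iso8859-9", "iso8859-15"]
 
 def score_encodings(raw_bytes):
     """Return {encoding: vocabulary-word count} for each encoding that
     decodes cleanly; encodings with invalid character codes are ruled out."""
     scores = {}
     for enc in CANDIDATE_ENCODINGS:
         try:
             text = raw_bytes.decode(enc)      # 8859-n -> Unicode
         except UnicodeDecodeError:
             continue                          # invalid code point: ruled out
         words = re.findall(r"\w+", text)      # extract the lexical words
         scores[enc] = sum(word_hash(w) in VOCAB_HASHES for w in words)
     return scores

Calling score_encodings() on a file's raw bytes would give a count per surviving encoding, and the largest count picks the likely winner.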

There will often be ties, especially between encodings that differ only in little-used characters. 8859-1 (Latin-1), 8859-9 (Latin-5, the Turkish adaptation of Latin-1), and 8859-15 (Latin-9) will have tied scores a lot of the time. Going by frequency of use in the real world, it's reasonable to award ties to Latin-1.
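Continuing the sketch above, the tie-break might look something like this; the PREFERENCE ordering is only illustrative.

 # Break ties by rough real-world frequency of use, Latin-1 first.
 PREFERENCE = ["iso8859-1", "iso8859-15", "iso8859-9", "iso8859-2",
               "iso8859-5", "iso8859-7"]
 
 def best_encoding(scores):
     """Pick the highest-scoring encoding, awarding ties by PREFERENCE."""
     if not scores:
         return None
     top = max(scores.values())
     tied = [enc for enc, count in scores.items() if count == top]
     def rank(enc):
         return PREFERENCE.index(enc) if enc in PREFERENCE else len(PREFERENCE)
     return min(tied, key=rank)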

This approach won't work on documents that aren't in a natural language, such as computer code, on lists of proper names, or on documents that consist mostly of jargon or deliberately mangled spellings (e.g., text-messagese).

The same technique could be used with other encoding families, such as ISO 646.

Has anything like this already been done? Is it an idea worth pursuing? The hardest part could be building the multilingual vocabulary list. --Gary McGath 15:10, 23 October 2012 (PDT)