When n-grams go bad

As a followup to Google n-grams and pre-modern Chinese, other features of the Google n-gram viewer may help shed some light on the issues with the pre-1950 data for Chinese.

One useful feature is wildcard search, which allows various open-ended searches. The simplest of these is a search for “*”, which plots the most frequent 1-grams in a corpus – i.e. the most commonly occurring words. For example, entering a single asterisk as the search query on the English corpus gives the frequencies of the ten most common English words:

The results look plausible at least as far back as 1800, which is what the authors claim to be the reliable part of the data. Earlier than that things get shakier, and before about 1650 things get quite seriously out of hand:

Remember, these are the most common terms in the corpus, i.e. the ones for which the data is going to be the most reliable. Now let’s look at the equivalent figures for Chinese. First, we can get a nice baseline showing what we would like to see by doing the equivalent search on a relatively reliable part of the data, e.g. 1970 to 2000:

This looks good. The top ten 1-grams – i.e. the most frequently occurring terms – are all commonly occurring Chinese words. Now let’s try going back to 1800:

Oh dear. From 1800 to 2000, more than half of the ten most frequent 1-grams are not terms that plausibly occur in pre-modern Chinese texts at all. Note also that the scale of the y-axis has now changed: according to this graph, it would appear that up to 40% of terms in pre-1940 texts may have been detected as being URLs or other non-textual content. Unsurprisingly, these problems continue all the way back to 1500:

It’s unclear what exactly _URL_, ^_URL_, and @_URL_ are supposed to represent, as they don’t seem to be documented. None of them is accepted by the viewer as a valid query term, so we can’t easily check what their values are on the English data. Possibly they are just categorization tags that don’t affect the overall counts and thus the normalized frequencies, but even so they surely point to serious problems with the data that have caused up to 50% of terms to be so interpreted.

Even aside from these suspect “URLs”, the other most frequent terms returned indicate that three terms not plausibly occurring in pre-modern Chinese texts – “0”, “1”, and “I” – account for anything up to 20% or more of all terms in the pre-1900 data:

Since all the n-gram counts are normalized by the total number of terms, these issues (presumably caused primarily by OCR errors) affect all results for Chinese in any year in which they occur. So while 1800 might be a reasonable cut-off for meaningful interpretation of the English data, for Chinese 1970 looks like a better choice, and any results from before around 1940 will be largely meaningless due to the overwhelming amount of noise.
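To make the normalization point concrete, here is a minimal sketch of how spurious tokens in the denominator drag down the relative frequency of every genuine term. The counts are invented purely for illustration and are not taken from the Google data; the function name is likewise hypothetical, and the viewer's actual normalization may differ in detail.

```python
from collections import Counter


def relative_frequencies(counts):
    """Divide each raw count by the total number of tokens for the year."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Hypothetical raw counts for a single year (invented numbers).
clean = Counter({"的": 50, "是": 20, "不": 10, "other": 920})

# The same genuine counts, plus spurious tokens of the kind OCR errors produce.
noisy = clean + Counter({"^": 1500, "0": 300, "1": 200})

clean_freq = relative_frequencies(clean)
noisy_freq = relative_frequencies(noisy)

print(f"clean 的: {clean_freq['的']:.4f}")
print(f"noisy 的: {noisy_freq['的']:.4f}")
print(f"noisy ^ : {noisy_freq['^']:.4f}")
```

In this toy case the raw count of “的” is unchanged, but its normalized frequency falls to a third of its clean value simply because the total has tripled. A year-on-year change in the amount of noise would therefore show up as an apparent change in the frequency of every real term.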


Update April 18, 2015:

It appears that the @_URL_ and ^_URL_ actually correspond to the terms “@” and “^” (both, presumably, tagged with “URL”), and so these do indeed affect the results: for many years pre-1950, anything up to 60% of all terms in the corpus are the term “^”:

It seems that the data used for Chinese fails some fairly basic sanity checks (including “is it in Chinese?”).
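As a rough illustration of what an “is it in Chinese?” check could look like, the sketch below computes the share of tokens in a year that consist entirely of CJK ideographs. This is an assumption about how one might test the data, not a description of anything Google does; the function names and sample counts are hypothetical, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is considered, so rarer characters in extension blocks and legitimate punctuation would be counted as non-Chinese.

```python
def is_cjk(token):
    """True if the token is non-empty and every character lies in U+4E00..U+9FFF."""
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def cjk_share(counts):
    """Fraction of all tokens (weighted by count) that are purely CJK ideographs."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    cjk = sum(n for term, n in counts.items() if is_cjk(term))
    return cjk / total


# Hypothetical per-year counts (invented numbers).
sample = {"的": 50, "是": 20, "^": 60, "0": 10, "I": 10}
share = cjk_share(sample)
print(f"CJK share: {share:.2%}")
```

A year whose share falls well below that of the reliable 1970 to 2000 baseline could then be flagged as suspect before any frequencies are interpreted.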

