Using ctext.org RDF in Python

Resource Description Framework (RDF) support in ctext.org is very new and not yet well documented. The RDF graph produced by the ctext.org Data Wiki consists of all the entities contained in that system – representing historical people, places, written works, dynasties, eras, etc. – and the connections between them, in machine-readable format. The user interface allows searching the data directly in various ways; however, much more flexibility is available by downloading the complete RDF graph and querying or processing it locally. Below is a simple example of getting started with this approach in Python using the ctext.org data. To run the code:
  • Download a copy of the demonstration notebook (contents displayed below).
  • Download a copy of the RDF from the Linked Open Data page on ctext.org, unzip the file, and place the extracted file in the same folder as the Jupyter notebook.
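
Once the file has been extracted, the graph can be queried locally. A dedicated library such as rdflib is the usual choice for this; purely as a dependency-free illustration of the idea, the sketch below pulls subject–predicate–object triples out of N-Triples-style lines using a regular expression. The URIs and values shown are hypothetical placeholders, not actual ctext.org data:

```python
import re

# Illustrative stand-ins for lines of an N-Triples file; in practice you
# would read these from the extracted RDF file instead.
sample_lines = [
    '<http://example.org/person/1> <http://example.org/name> "Laozi" .',
    '<http://example.org/person/1> <http://example.org/type> <http://example.org/Person> .',
]

# Each simple N-Triples line has the form: subject predicate object " ."
triple_pattern = re.compile(r'^(<[^>]+>)\s+(<[^>]+>)\s+(.+)\s+\.$')

triples = []
for line in sample_lines:
    m = triple_pattern.match(line)
    if m:
        triples.append(m.groups())

for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

A real RDF parser is still preferable for serious work, since it also handles escaping, blank nodes, and typed literals correctly.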


Classical Chinese DH: Regular Expressions


By Donald Sturgeon

[View this notebook online] [Download this notebook] [List of tutorials]

Regular expressions

A regular expression (a.k.a. regex or RE) is a pattern to be searched for in some body of text. These are not specific to Python, but by combining simple regular expressions with basic Python statements, we can quickly achieve powerful results.

Commonly used regex syntax

  • .  Matches any one character exactly once
  • [abcdef]  Matches any one of the characters a,b,c,d,e,f exactly once
  • [^abcdef]  Matches any one character other than a,b,c,d,e,f
  • ?  After a character/group, makes that character/group optional (i.e. match zero or 1 times)
  • ?  After +, * or {…}, makes matching ungreedy (i.e. choose shortest match, not longest)
  • *  After a character/group, makes that character/group match zero or more times
  • +  After a character/group, makes that character/group match one or more times
  • {2,5}  After a character/group, makes that character/group match 2, 3, 4, or 5 times
  • {2,}  After a character/group, makes that character/group match 2 or more times
  • \3  Matches whatever was matched into group number 3 (the first group from the left is numbered 1)

To use regexes in Python, we use another module called “re” (this is a very common module and should already be installed).

In [53]:
import re

laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。上仁為之而無以為;上義為之而有以為。上禮為之而莫之應,則攘臂而扔之。故失道而後德,失德而後仁,失仁而後義,失義而後禮。"

for match in re.finditer(r".德", laozi):  # re.finditer returns "match objects", each of which describes one match
    matched_text = match.group(0)        # group(0) gives the full text matched by the whole regex
    print("Found a match: " + matched_text)
Found a match: 上德
Found a match: 不德
Found a match: 有德
Found a match: 下德
Found a match: 失德
Found a match: 無德
Found a match: 上德
Found a match: 下德
Found a match: 後德
Found a match: 失德
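
The other syntax items in the table above can be checked in the same way. The strings below are generic illustrations, not taken from the tutorial text:

```python
import re

# "." matches any one character exactly once
assert re.findall(r"德.", "德者德之") == ["德者", "德之"]

# "[...]" matches any one of the listed characters
assert re.findall(r"[上下]德", "上德下德中德") == ["上德", "下德"]

# "?" makes the preceding character optional
assert re.fullmatch(r"colou?r", "color") is not None
assert re.fullmatch(r"colou?r", "colour") is not None

# "{2,}" repeats the preceding character two or more times
assert re.fullmatch(r"ha{2,}", "haaa") is not None
assert re.fullmatch(r"ha{2,}", "ha") is None

# "+?" is the ungreedy version of "+": shortest match rather than longest
assert re.findall(r"《.+》", "《論語》《孟子》") == ["《論語》《孟子》"]
assert re.findall(r"《.+?》", "《論語》《孟子》") == ["《論語》", "《孟子》"]

# "\1" matches whatever group 1 matched: here, any doubled character
assert re.findall(r"(.)\1", "aabbcd") == ["a", "b"]

print("All matches behaved as described.")
```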

[Aside: in Python, regexes are often written as strings with an "r" in front of them, e.g. r"德" rather than just "德". All this does is tell Python not to try to interpret the contents of the string (e.g. backslashes) as meaning something else. The result of r"德" is still an ordinary string with "德" in it.]
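
To see concretely what the "r" changes, compare the lengths of a normal and a raw string containing a backslash (Chinese characters are unaffected either way):

```python
plain = "\n"   # Python interprets this as a single newline character
raw = r"\n"    # The raw string keeps two characters: a backslash and an "n"

print(len(plain))  # 1
print(len(raw))    # 2
```

This is why patterns containing backslashes, such as backreferences like \1, are safest written as raw strings: r"\1" reaches the regex engine intact.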

Exercise 1 (very easy): Change the above code to verify the results of some of the simple example regexes from the slides. Try these ones:

  • 而無以為
  • 是以.德
  • 失.而後.
  • 上[仁義]為之
  • 後(.),失\1

For the last of these (“後(.),失\1”), see what happens to the output when you change group(0) to group(1). (Change it back to group(0) afterwards though, as we will reuse this code using group(0).)

Exercise 2: Write regular expressions to match the following things (you can keep on modifying the example above to check that they work, but you may want to write down your answers somewhere – remember, you can edit this cell by double-clicking on it).

  • Match any three characters where the middle character is “之” – i.e. “為之而”, “莫之應”, etc. Modify your regex so that it does not match things with punctuation in them, like “扔之。”.
  • Match each “phrase” (i.e. punctuated section) of the text. In other words, the first match should be “上德不德”, the second should be “是以有德”, and so on. You only need to handle the three punctuation marks “。”, “,”, and “;”.
  • Match each phrase which contains the term “之” in it. (Double check that you get 5 matches.)

We can do the same kind of thing on an entire text in one go if we have the whole text in a single string, as in the next example. (If we wanted to know which paragraph or chapter each match appeared in, we would instead run the same regex on each paragraph or chapter in turn.)

In [54]:
from ctext import *
setapikey("demo")

# The gettextasstring function gives us a single string variable with the whole text in it
laozi = gettextasstring("ctp:dao-de-jing")

for match in re.finditer(r"足.", laozi):
    matched_text = match.group(0)
    print(matched_text)
足,
足。
足,
足,
足者
足見
足聞
足既
足以
足;
足不
足;
足之
足,
足矣
足以
足下
足者
足。
足以

Exercise 3

  • Often we don’t want to include matches that have punctuation in them. Modify the regex from the last example so that it excludes all the matches where the character after “足” is “,”, “。”, or “;”. (You should do this by modifying the regex; the rest of the code does not need to change.)

  • Find all the occurrences of X可X – i.e. “道可道” and “名可名” (there is one more item that should be matched too).

  • Modify your regex so you match all occurrences of XYX – i.e. not just “道可道” but also things like “學不學”. You may need to make some changes to avoid matching punctuation – we don’t want to match “三,三” or “、寡、”.

Exercise 4: (Optional) Using what was covered in the previous tutorial, write a program in the cell below to perform one of these searches again, but this time running it once on each paragraph in turn so that you can also print out the number of the passage in which each match occurs.

In [ ]:
passages = gettextasparagraphlist("ctp:dao-de-jing")

# Your code goes here!

Dictionary variables

One of the advantages of using regexes from within a programming language like Python is that as well as simply finding results, we can easily do things to collate our data, such as count up how many times a regex gave various different results. Another type of variable that is useful here is the “dictionary” variable.

A dictionary variable works in a very similar way to a list, except that whereas in a list the items are numbered 0,1,2,… and accessed using these numbers, a dictionary uses other things – in the case we will look at, strings – to identify the items. This lets us "look up" values for different strings, just like looking up the translation of a word in a dictionary. The things we use instead of numbers to "look up" values in a dictionary are called "keys".

Dictionaries can be defined in Python using the following notation:

In [55]:
my_titles = {"論語": "Analects", "孟子": "Mengzi", "荀子": "Xunzi"}

The above example defines one dictionary variable called “my_titles”, and sets values for three keys: “論語”, “孟子”, and “荀子”. Each of these keys is set to have the corresponding value (“Analects”, “Mengzi”, and “Xunzi” respectively). In this simple example, our dictionary gives us a way of translating Chinese-language titles into English-language titles.

We can access the items in a dictionary in a very similar way to accessing items from a list:

In [56]:
print(my_titles["論語"])
Analects
In [57]:
print(my_titles["荀子"])
Xunzi

Unlike in a list, our items don't have numbered positions (and, before Python 3.7, dictionaries did not even preserve the order in which items were added). So one thing we will sometimes need to do is to get a list of all the keys – i.e., a list telling us what things there are in our dictionary.

In [58]:
list_of_titles = list(my_titles.keys())
print(list_of_titles)
['孟子', '論語', '荀子']

Often we will store numbers in our dictionary: the keys will be strings, but the value for each key will be a number. This lets us do things like count how many times we've seen each particular string – for all of the strings we happen to come across, using just one dictionary variable. In cases like this, we will often want to sort the keys by their associated values. One way of doing this is using the "sorted" function:

In [59]:
# In this example, we use a dictionary to record people's year of birth
# Then we sort the keys (i.e. the names) by the values (i.e. year of birth)

year_of_birth = {"胡適": 1891, "梁啟超": 1873, "茅盾": 1896, "王韜": 1828, "魯迅": 1881}
list_of_people = sorted(year_of_birth, key=year_of_birth.get, reverse=False)
for name in list_of_people:
    print(name + " was born in " + str(year_of_birth[name]))
王韜 was born in 1828
梁啟超 was born in 1873
魯迅 was born in 1881
胡適 was born in 1891
茅盾 was born in 1896

Don’t worry about the rather complex looking syntax for sorted() – you can just follow this model whenever you need to sort a dictionary (and change “reverse=False” to “reverse=True” if you want to reverse the list):

list_of_keys = sorted(my_dictionary, key=my_dictionary.get, reverse=False)

Using a dictionary, we can keep track of every regex result we found, and at the same time collate the data. Instead of having a long list with repeated items in it, we build a dictionary in which the keys are the unique regex matches, and the values are the number of times we have seen that particular string.

In [60]:
match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable

for match in re.finditer(r"(.)為", laozi):
    matched_text = match.group(0)  # e.g. "心為"
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

# Our dictionary now contains a frequency count of each different pair we found
print("match_count contains: " + str(match_count))

# The sorted() function gets us a list of the items we matched, starting with the most frequent
unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
match_count contains: {'之為': 3, '不為': 8, '敢為': 5, '淡為': 1, '宜為': 1, '善為': 3, '靜為': 3, '可為': 1, '歙為': 1, '物為': 1, '無為': 11, '德為': 1, '能為': 3, '禮為': 1, '姓為': 1, '名為': 1, '復為': 2, '賤為': 2, '以為': 18, '自為': 1, '一為': 1, '。為': 7, '生為': 1, '下為': 1, '重為': 1, '心為': 1, '身為': 2, '則為': 2, '人為': 2, '有為': 1, '孰為': 1, '義為': 1, '寵為': 1, '仁為': 1, '而為': 4, ',為': 11, '故為': 2, '是為': 1, '強為': 2}
以為 occurred 18 times.
無為 occurred 11 times.
,為 occurred 11 times.
不為 occurred 8 times.
。為 occurred 7 times.
敢為 occurred 5 times.
而為 occurred 4 times.
之為 occurred 3 times.
善為 occurred 3 times.
靜為 occurred 3 times.
能為 occurred 3 times.
復為 occurred 2 times.
賤為 occurred 2 times.
身為 occurred 2 times.
則為 occurred 2 times.
人為 occurred 2 times.
故為 occurred 2 times.
強為 occurred 2 times.
淡為 occurred 1 times.
宜為 occurred 1 times.
可為 occurred 1 times.
歙為 occurred 1 times.
物為 occurred 1 times.
德為 occurred 1 times.
禮為 occurred 1 times.
姓為 occurred 1 times.
名為 occurred 1 times.
自為 occurred 1 times.
一為 occurred 1 times.
生為 occurred 1 times.
下為 occurred 1 times.
重為 occurred 1 times.
心為 occurred 1 times.
有為 occurred 1 times.
孰為 occurred 1 times.
義為 occurred 1 times.
寵為 occurred 1 times.
仁為 occurred 1 times.
是為 occurred 1 times.

We can use this idea and almost exactly the same code to start answering quite complex questions about patterns appearing in texts. This code can tell us which actual phrases matching a certain pattern occurred most frequently.

For example, in poetry we often find various kinds of repetition. We can use part of the 詩經 as an example, and using a regex quickly find out which repeated XYXY patterns are most common:

In [61]:
shijing = gettextasstring("ctp:book-of-poetry/lessons-from-the-states")
In [62]:
match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable

for match in re.finditer(r"(.)(.)\1\2", shijing):
    matched_text = match.group(0)
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
子兮子兮 occurred 3 times.
懷哉懷哉 occurred 3 times.
碩鼠碩鼠 occurred 3 times.
歸哉歸哉 occurred 3 times.
如何如何 occurred 3 times.
委蛇委蛇 occurred 3 times.
舍旃舍旃 occurred 3 times.
蘀兮蘀兮 occurred 2 times.
式微式微 occurred 2 times.
采苓采苓 occurred 1 times.
鴟鴞鴟鴞 occurred 1 times.
悠哉悠哉 occurred 1 times.
采葑采葑 occurred 1 times.
瑳兮瑳兮 occurred 1 times.
其雨其雨 occurred 1 times.
簡兮簡兮 occurred 1 times.
采苦采苦 occurred 1 times.
伐柯伐柯 occurred 1 times.
樂國樂國 occurred 1 times.
樂土樂土 occurred 1 times.
樂郊樂郊 occurred 1 times.
玼兮玼兮 occurred 1 times.

Exercise 5: Write a regex to match paired lines of four-character poetry that both begin with the same two characters (examples: “亦既見止、亦既覯止”, “且以喜樂、且以永日”, etc.). Re-run the program above to verify your answer.

Exercise 6: Create a regex to match book titles that appear in punctuated Chinese texts, e.g. “《呂氏春秋》”. Your regex should extract the title without the punctuation marks into a group – i.e. you must use “(” and “)” in your regex. You can test it using the short program below – your output should look like this:

爾雅
廣雅
尚賢
呂氏春秋·順民
呂氏春秋·不侵
左·襄十一年傳
韓詩外傳
廣雅
In [ ]:
test_input = "昔者文公出走而正天下,畢云:「正,讀如征。」王念孫云「畢讀非也,《爾雅》曰:『正,長也。』晉文為諸侯盟主,故曰『正天下』,與下『霸諸侯』對文。又《廣雅》『正,君也』。《尚賢》篇曰:『堯、舜、禹、湯、文、武之所以王天下正諸侯者』。凡墨子書言正天下正諸侯者,非訓為長,即訓為君,皆非征伐之謂。」案:王說是也。《呂氏春秋·順民》篇云:「湯克夏而正天下」,高誘注云:「正,治也」,亦非。桓公去國而霸諸侯,越王句踐遇吳王之醜,蘇時學云:「醜,猶恥也。」詒讓案:《呂氏春秋·不侵》篇「欲醜之以辭」,高注云:「醜,或作恥。」而尚攝中國之賢君,畢云:「尚與上通。攝,合也,謂合諸侯。郭璞注爾雅云:『聶,合』,攝同聶。」案:畢說未允。攝當與懾通,《左·襄十一年傳》云:「武震以攝威之」,《韓詩外傳》云:「上攝萬乘,下不敢敖乎匹夫」,此義與彼同,謂越王之威足以懾中國賢君也。三子之能達名成功於天下也,皆於其國抑而大醜也。畢云:「猶曰安其大醜。《廣雅》云:『抑,安也』」。俞樾云:「抑之言屈抑也。抑而大醜,與達名成功相對,言於其國則抑而大醜,於天下則達名成功,正見其由屈抑而達,下文所謂敗而有以成也。畢注於文義未得。」案:俞說是也。太上無敗,畢云:「李善文選注云:『河上公注老子云:太上,謂太古無名之君也』。」案:太上,對其次為文,謂等之最居上者,不論時代今古也。畢引老子注義,與此不相當。其次敗而有以成,此之謂用民。言以親士,故能用其民也。"

for match in re.finditer(r"your regex goes here!", test_input):
    print(match.group(1)) # group() extracts the text of a group from a matched regex: so your regex must have a group in it

Now modify your regex so that instead of getting book titles together with chapter titles, your regex only captures the title of the work – i.e., capture “呂氏春秋” instead of “呂氏春秋·順民”, and “左” instead of “左·襄十一年傳”.

Optional: Bonus points if you can also capture the chapter title on its own in a separate regex group at the same time. This is a bit fiddly though, and we don’t need to do it for this exercise.

  • Now modify the example code below (it’s almost identical to one of examples above) so that it lists how often every title was mentioned in the 墨子閒詁 (a commentary on the classic text “墨子” – in this example we only use the first chapter, though the code can also be run on the whole text by changing the URN).
  • Then modify your code so that it only lists the top 10 most frequently mentioned texts. Hint: “unique_items” is a list, and getting part of a list is very similar to getting part of a string.
In [ ]:
test_input = gettextasstring("ctp:mozi-jiangu/qin-shi")

match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable

for match in re.finditer(r"your regex goes here!", test_input):
    matched_text = match.group(1)
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")

Dictionaries also allow us to produce graphs summarizing our data.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
%matplotlib inline

# Unfortunately some software still has difficulty dealing with Chinese.
# Here we may need to tell matplotlib to use a specific font containing Chinese characters.
# If your system doesn't display the Chinese text in the graph below, you may need to specify a different font name.
import platform
if platform.system() == 'Darwin':   # I.e. if we're running on Mac OS X
    mpl.rcParams['font.family'] = "Arial Unicode MS" 
else:
    mpl.rcParams['font.family'] = "SimHei"
    
mpl.rcParams['font.size'] = 20

# The interesting stuff happens here:

s = pd.Series(match_count)
s = s.sort_values(ascending=False)  # Sort the counts, most frequent first
s = s[:10]                          # Keep only the top 10
s.plot(kind='barh')

Now modify your regex so that you only match texts that are cited as pairs of book title and chapter, i.e. you should only match cases like “《呂氏春秋·順民》” (and not 《呂氏春秋》), and capture into a group the full title (“呂氏春秋·順民” in this example). This may be harder than it looks! You will need to be careful that your regex does not sometimes match too much text.

Re-run the above programs to find out (and graph) which chapters of which texts are most frequently cited in this way by this commentary.

Replacing and Splitting with Regexes

As well as finding things, regexes are ideal for other very useful tasks including replacing and splitting textual data.

For example, we saw in the last notebook cases where it would be easier to process a text without any punctuation in it. We can easily match all punctuation using a regex, and once we know how to search and replace, we can just replace each matched piece of punctuation with a blank string to get an unpunctuated text.

We can do a simple search-and-replace using a regex like this:

In [63]:
import re

input_text = "道可道,非常道。"
print(re.sub(r"道", r"名", input_text))
名可名,非常名。

For very simple regexes that don’t use any special regex characters, this gives exactly the same result as replace(). But because we can specify patterns, we can do much more powerful replacements.

In [64]:
input_text = "道可道,非常道。"
print(re.sub(r"[。,]", r"", input_text))
道可道非常道

Of course, as usual the power of this is that we can quickly do it for however much data we like:

In [65]:
laozi = gettextasstring("ctp:dao-de-jing")
print(re.sub(r"[。,;?:!、]", r"", laozi))
道可道非常道名可名非常名無名天地之始有名萬物之母故常無欲以觀其妙常有欲以觀其徼此兩者同出而異名同謂之玄玄之又玄衆妙之門

天下皆知美之為美斯惡已皆知善之為善斯不善已故有無相生難易相成長短相較高下相傾音聲相和前後相隨是以聖人處無為之事行不言之教萬物作焉而不辭生而不有為而不恃功成而弗居夫唯弗居是以不去
...

Another useful aspect is that we can use data from regex groups that we matched within our replacement. This makes it easy to write replacements that do things like add some particular string before or after something we want to match. This example finds any punctuation character, puts it in regex group 1, and then replaces it with regex group 1 followed by a return character – in other words, it adds a line break after every punctuation character.

In [66]:
laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。"
print(re.sub(r"([。,;?:!、])", r"\1\n", laozi))
上德不德,
是以有德;
下德不失德,
是以無德。
上德無為而無以為;
下德為之而有以為。

Regexes and text files

Regular expressions can be very useful when we want to transform text from one format to another, or when we want to read text from a file and it isn’t in the format we want.

In this section, instead of using the ctext.org API, we will experiment with a text from Project Gutenberg. Before starting, download the plain text UTF-8 file from the website and save it on your computer as a file called “mulan.txt”. You should save this in the same folder as this Jupyter notebook (.ipynb) file.

Note: you don’t have to save files in the same folder as your Jupyter notebook, but if you save them somewhere else, when opening the file you will need to tell Python the full path to your file instead of just the filename – e.g. “C:\Users\user\Documents\mulan.txt” instead of just “mulan.txt”.

In [67]:
file = open("mulan.txt", "r", encoding="utf-8")
data_from_file = file.read()
file.close()

One practical issue when dealing with a lot of data in a string is that printing it to the screen so we can see what’s happening in our program may take up a lot of space. One thing we can do is to just print a substring – i.e. only print the first few hundred or so characters:

In [68]:
print(data_from_file[0:700])
The Project Gutenberg EBook of Mu Lan Ji Nu Zhuan, by Anonymous

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Mu Lan Ji Nu Zhuan

Author: Anonymous

Editor: Anonymous

Release Date: December 20, 2007 [EBook #23938]

Language: Chinese

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK MU LAN JI NU ZHUAN ***










序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,
求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,
或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其
難也。性命之理,有不悟性根者,有不知命蒂者,有修性

One thing that will be handy is if we can delete the English blurb at the top of this file automatically. There are several ways we could do this. One way is to use a negative character class – matching everything except some set of characters – to match all characters that are non-Chinese, and delete them.

The re.sub() function takes three parameters:

  1. The regular expression to match
  2. What we want to replace each match with
  3. The string we want to do the matching in
    It returns a new string containing the result after making the substitution.

[The example below also makes use of another kind of special syntax in a character class: we can match a range of characters by their Unicode codepoint. Here we match everything from U+25A1 through U+FFFF, a range which covers the common Chinese characters (along with □, often used as a placeholder for missing characters). Don't worry too much about the contents of this regex – we won't need to write regexes like this most of the time.]

In [69]:
new_data = re.sub(r'[^\n\r\u25A1-\uFFFF]', "", data_from_file)
print(new_data[0:700])































序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,
求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,
或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其
難也。性命之理,有不悟性根者,有不知命蒂者,有修性命而旁歧雜出者,有修性命而
後先倒置者。涵養未得其中,即性命未盡其奧也。乃木蘭一女子耳,擔荷倫常,研求性
命,而獨無所不盡也哉!

  予幼讀《木蘭詩》,觀其代父從軍,可謂孝矣;立功絕塞,可謂忠矣。後閱《唐書
》,言木蘭唐女,西陵人,嫻弓馬,諳韜略,轉戰沙漠,累大功十二,何其勇也。封武
昭將軍,凱旋還里。當時筮者謂致亂必由武姓,讒臣嫁禍武昭,詔徵至京。木蘭具表陳
情,掣劍剜心,示使者,目視而死。死後,位證雷部大神,何其烈也。去冬閱《木蘭奇
女傳》,復知其幼而領悟者性命也,長而行持者性命也。且通部議論極精微,極顯豁,
又無非性命之妙諦也。盡人所當盡,亦盡人所難盡。惟其無所不盡,則亦無所不奇。而
人奇,行奇,事奇,文奇,讀者莫不驚奇叫絕也。此書相傳為奎斗馬祖所演,卷首有武
聖帝序。今序已失,同人集貲付梓。書成,爰敘其緣起如此。

      書於滬江梅花書館南窗之下



第一回朱若虛孝弟全天性 朱天錫聰明識童謠

  古樂府所載《木蘭辭》,乃唐初國師李藥師所作也。藥師名靖,號青蓮,又號三元
道人。先生少日,負經天緯地之才,抱治國安民之志,佐太宗平隋亂,開唐基,官拜太
傅,賜爵趙公。晚年修道,煉性登仙。蓋先生盛代奇人,故能識奇中奇人,

We’ve got rid of the English text, but we’ve now got too many empty lines. Depending on what data is in the text, we might want to remove all the line breaks… but in this case there are some things like chapter titles that are best kept on separate lines so we can tell where the chapters begin and end.

Remember: “\n” means “one line break”, and “{3,}” will match 3 or more of something one after the other (and as many times as possible).

In [70]:
without_spaces = re.sub(r'\n{3,}', "\n\n", new_data)  # This regex matches three or more line breaks, and replaces them with two
print(without_spaces[0:700])


序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,
求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,
或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其
難也。性命之理,有不悟性根者,有不知命蒂者,有修性命而旁歧雜出者,有修性命而
後先倒置者。涵養未得其中,即性命未盡其奧也。乃木蘭一女子耳,擔荷倫常,研求性
命,而獨無所不盡也哉!

  予幼讀《木蘭詩》,觀其代父從軍,可謂孝矣;立功絕塞,可謂忠矣。後閱《唐書
》,言木蘭唐女,西陵人,嫻弓馬,諳韜略,轉戰沙漠,累大功十二,何其勇也。封武
昭將軍,凱旋還里。當時筮者謂致亂必由武姓,讒臣嫁禍武昭,詔徵至京。木蘭具表陳
情,掣劍剜心,示使者,目視而死。死後,位證雷部大神,何其烈也。去冬閱《木蘭奇
女傳》,復知其幼而領悟者性命也,長而行持者性命也。且通部議論極精微,極顯豁,
又無非性命之妙諦也。盡人所當盡,亦盡人所難盡。惟其無所不盡,則亦無所不奇。而
人奇,行奇,事奇,文奇,讀者莫不驚奇叫絕也。此書相傳為奎斗馬祖所演,卷首有武
聖帝序。今序已失,同人集貲付梓。書成,爰敘其緣起如此。

      書於滬江梅花書館南窗之下

第一回朱若虛孝弟全天性 朱天錫聰明識童謠

  古樂府所載《木蘭辭》,乃唐初國師李藥師所作也。藥師名靖,號青蓮,又號三元
道人。先生少日,負經天緯地之才,抱治國安民之志,佐太宗平隋亂,開唐基,官拜太
傅,賜爵趙公。晚年修道,煉性登仙。蓋先生盛代奇人,故能識奇中奇人,保全奇中奇
人。奇中奇人為誰?即朱氏木蘭也。

  木蘭女年十

Exercise 7: (Harder) Make another substitution using a regex which removes only the line breaks within a paragraph (and does not remove linebreaks before and after a chapter title). The output should look like this:

序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其難也。性命之理,有不悟性根者,有不知命蒂者,有修性命而旁歧雜出者,有修性命而後先倒置者。涵養未得其中,即性命未盡其奧也。乃木蘭一女子耳,擔荷倫常,研求性命,而獨無所不盡也哉!

  予幼讀《木蘭詩》,觀其代父從軍,可謂孝矣;立功絕塞,可謂忠矣。後閱《唐書》,言木蘭唐女,西陵人,嫻弓馬,諳韜略,轉戰沙漠,累大功十二,何其勇也。封武昭將軍,凱旋還里。當時筮者謂致亂必由武姓,讒臣嫁禍武昭,詔徵至京。木蘭具表陳情,掣劍剜心,示使者,目視而死。死後,位證雷部大神,何其烈也。去冬閱《木蘭奇女傳》,復知其幼而領悟者性命也,長而行持者性命也。且通部議論極精微,極顯豁,又無非性命之妙諦也。盡人所當盡,亦盡人所難盡。惟其無所不盡,則亦無所不奇。而人奇,行奇,事奇,文奇,讀者莫不驚奇叫絕也。此書相傳為奎斗馬祖所演,卷首有武聖帝序。今序已失,同人集貲付梓。書成,爰敘其緣起如此。

      書於滬江梅花書館南窗之下

Hint: Think about what you need to match to make the change. You may need to include some of the things that your regex matches in the replacement using references (i.e. \1, \2, etc.).

In [ ]:
without_spaces2 = re.sub(r"your regex goes here!", r"", without_spaces)
print(without_spaces2[0:700])

Exercise 8: The text contains comments in it which we might want to delete before doing further processing or calculating any statistics. Create a regex substitution which removes each of these comments.

Example comment: …居於湖廣黃州府西陵縣(今之黃陂縣)雙龍鎮。 => should become …居於湖廣黃州府西陵縣雙龍鎮。

Make sure to check that your regex does not remove too much text!

In [ ]:
without_comments = re.sub(r"your regex goes here!", r"", without_spaces2)
print(without_comments[0:1000])

Exercise 9: Experiment with writing regexes to list things that look like chapter titles in the text. There are several ways this can be done. (There are 32 numbered chapters in this text.)

In [ ]:
for match in re.finditer(r"your regex goes here!", without_spaces2):
    matched_text = match.group(1)
    print(matched_text)
  • Next, use your chapter-detecting regex to add immediately before each chapter the text “CHAPTER_STARTS_HERE”.
In [ ]:
# Your code goes here!

Lastly, we can use a regex to split a string variable into a Python list using the re.split() function. At any point in the string where the specified regex is matched, the data is split into pieces. For instance:

In [71]:
laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。上仁為之而無以為;上義為之而有以為。上禮為之而莫之應,則攘臂而扔之。故失道而後德,失德而後仁,失仁而後義,失義而後禮。"
laozi_phrases = re.split(r"[。,;]", laozi)
for number in range(0, len(laozi_phrases)):
    print(str(number) + ". " + laozi_phrases[number])
0. 上德不德
1. 是以有德
2. 下德不失德
3. 是以無德
4. 上德無為而無以為
5. 下德為之而有以為
6. 上仁為之而無以為
7. 上義為之而有以為
8. 上禮為之而莫之應
9. 則攘臂而扔之
10. 故失道而後德
11. 失德而後仁
12. 失仁而後義
13. 失義而後禮
14. 

Use re.split() to split your full text into a Python list, in which each chapter is one list item. (For simplicity you can ignore things like the preface etc.)

In [ ]:
# Your code goes here!
# Call your list variable "chapters"

Now we have this data in a Python list, we can do things to each chapter individually. We can also put each of the chapters into its own text file – this is something we will sometimes need to do when we want to use other tools that are not in Python.

In [ ]:
for chapternumber in range(0,len(chapters)):
    file = open("mulan-part-" + str(chapternumber) + ".txt", "w", encoding="utf-8")
    file.write(chapters[chapternumber] + "\n")
    file.close()

Further reading:

Creative Commons License

Regular expressions with Text Tools for ctext.org

Along with other functions such as automated text reuse identification, the “Text Tools” plugin for ctext.org can use the ctext API to import textual data from ctext.org directly for analysis with regular expressions. A step-by-step online tutorial describes how to actually use the tool (see also the instructions on the tool’s own help page); here I will give some concrete examples of what the tool can be used to do.

Regular expressions (often shortened to "regexes") are a powerful extension of the kind of simple string search widely available in computer software (e.g. word processors, web browsers, etc.): a regular expression is a specification of a pattern to be matched in some body of text. At their simplest, regular expressions can simply be fixed strings of characters to search for, like "君子" or "巧言令色". At its most basic, you can use Text Tools to search for multiple terms within a text by entering your terms one per line in the "Regex" tab:

Text Tools will highlight each match in a different color, and show only the paragraphs with at least one match. Of course, you can specify as many search terms as you like, for example:

Clicking on any of the matched terms adds it as a “constraint”, meaning that only passages containing that term will be shown (though still highlighting any other matches present). For instance, clicking “君子” will show all the passages with the term “君子” in them, while still highlighting any other matches:

As with the similarity function of the same plugin, if your regular expression query results in relational data, this can be visualized as a network graph. This is done by setting “Group rows by” to either “Paragraph” or “Chapter”, which gives results in the “Summary” tab tabulated by paragraph (or chapter) – each row represents a paragraph which matched a term, and each column corresponds to one of the matched items:

This can be visualized as a network graph in which edges represent co-occurrence of terms within the same paragraph, and edge weights represent the number of times such co-occurrence is repeated in the texts selected:

This makes it clear where the most frequently repeated co-occurrences occur in the selected corpus – in this example, “君子” and “小人”, “君子” and “禮”, etc. Similarly to the way in which similarity graphs created with the Text Tools plugin work, double-clicking on any edge in the graph returns to the “Regex” tab with the two terms joined by that edge chosen as constraints, thus listing all the passages in which those terms co-occur, this being the data explaining the selected edge:
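
The co-occurrence counting behind such a graph can be sketched in a few lines of Python: find which terms occur in each paragraph, then count each pair. The paragraphs below are made-up Analects-style stand-ins, and this is an illustration of the idea only, not the plugin's actual implementation:

```python
import re
from itertools import combinations

# Hypothetical stand-in paragraphs; in Text Tools this data comes from ctext.org
paragraphs = [
    "君子喻於義,小人喻於利。",
    "君子博學於文,約之以禮。",
    "君子義以為質,禮以行之。",
]

terms = ["君子", "小人", "義", "禮"]
pattern = "|".join(terms)

# Count how often each pair of terms co-occurs within the same paragraph
edge_weights = {}
for paragraph in paragraphs:
    found = sorted(set(re.findall(pattern, paragraph)))
    for pair in combinations(found, 2):
        edge_weights[pair] = edge_weights.get(pair, 0) + 1

# Each (term, term) key is an edge; its count is the edge weight
for (a, b), weight in sorted(edge_weights.items(), key=lambda item: -item[1]):
    print(a + " -- " + b + ": " + str(weight))
```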

So far these examples have used fixed lists of search strings. But as the name suggests, the “Regex” tool also supports regular expressions, and so by making use of standard regular expression syntax, it’s possible to make far more sophisticated queries. [If you haven’t come across regular expressions before, some examples are covered in the regex section of the Text Tools tutorial.] For example, we could write a regular expression that matches any one of a specified set of color terms, followed by any other character, and see how these are used in the Quan Tang Shi (my example regex is “[黑白紅]\w”: match any one of “黑”, “白”, or “紅”, followed by one non-punctuation character):

If we use “Group by: None”, we get total counts of each matched value – i.e. counts of how frequently “白雪”, “白水”, “紅葉”, and whatever other combinations there are occurred in our text. We can then use the “Chart” link to chart these results and get an overview of the most frequently used combinations:
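
The totals that "Group by: None" produces can be reproduced in plain Python with the same counting pattern used in the tutorial above. The snippet below runs on a short made-up sample rather than the full Quan Tang Shi:

```python
import re

# A made-up sample line, standing in for the full text of the Quan Tang Shi
sample = "白雲千載空悠悠,紅葉晚蕭蕭,白雲一片去悠悠。"

# Count each color term followed by one non-punctuation character
match_count = {}
for match in re.finditer(r"[黑白紅]\w", sample):
    pair = match.group(0)
    match_count[pair] = match_count.get(pair, 0) + 1

for pair in sorted(match_count, key=match_count.get, reverse=True):
    print(pair + " occurred " + str(match_count[pair]) + " times.")
```

Note that in Python, \w matches any "word" character, which includes Chinese characters but excludes punctuation such as "," and "。".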

If we go back to the Regex tab and set “Group by” to “Paragraph”, we can visualize the relationships just like in the Analects example — except that this time we don’t need to specify a list of terms, rather these terms can be extracted using the pattern we specified as a regular expression (in this graph I have set “Skip edges with weight less than” to “2” to reduce clutter caused by pairs of terms that only ever occur once):

Although overall – as we can see from the bar chart above – combinations with “白” in them are the most common, the relational data shown in the graph above immediately highlights other features of the use of these color pairings: the three most frequent pairings in our data are actually pairings between “白” and “紅”, like “白雲” and “紅葉”, or “白髮” and “紅顏”. As before, our edges are linked to the data, so we can easily go back to the text to see how these are actually being used:

Regular expressions are a hugely powerful way of expressing patterns to search for in text — see the tutorial for more examples and a step-by-step walk-through.

Leave a comment

Exploring text reuse with Text Tools for ctext.org

The plugin system and API for ctext.org make it possible to import textual data from ctext.org directly into other online tools. One such tool is the new “Text Tools” plugin, which provides a set of textual analysis and visualization tools designed to work with texts from ctext.org. There is a step-by-step online tutorial describing how to actually use the tool (as well as the instructions on the tool’s own help page); I won’t repeat those here, but instead will give some examples of what the tool can be used to do.

One of the most interesting features of the tool is its function to identify text reuse within and between texts (via the “Similarity” tab). This takes as input one or more texts, and identifies and visualizes similarities between them. For example, with the text of the Analects:

This uses a heat map effect somewhat similar to the ctext.org parallel passage feature: here n-grams are matched (e.g. 3-grams, i.e. triples of identical characters used in identical sequence), and overlapping matched n-grams are shown in successively brighter shades of red. By default, all paragraphs having any shared n-grams with anything else in the selected text or texts are shown. The visualization is interactive, so clicking on any highlighted section switches the view to show all locations in the chosen corpus containing the selected n-gram (which is then highlighted in blue, like the 6-gram “如己者過則勿” in the following image):
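The core matching idea can be sketched in a few lines of Python. This is just an illustration of the principle (intersecting sets of n-grams), not the tool's actual implementation; the two passages here are Analects quotations:

```python
def ngrams(text, n=3):
    # The set of all n-character substrings of the text
    return {text[i:i + n] for i in range(len(text) - n + 1)}

passage_a = "己所不欲,勿施於人。"
passage_b = "其恕乎!己所不欲,勿施於人。"

# 3-grams occurring in both passages
shared = ngrams(passage_a) & ngrams(passage_b)
print(len(shared), "shared 3-grams")
```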

Since the texts are read in from ctext.org via the API, the program also knows the structure of the text; clicking on “Chapter summary” shows instead a table of calculated total matches aggregated on a chapter-by-chapter basis:

This data is relational: each row expresses strength of similarity of a certain kind between two entities (two chapters of text). It can therefore be visualized as a weighted network graph – the Text Tools plugin can do this for you:
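The aggregation into weighted edges can be sketched along similar lines. Here the chapter "texts" are short made-up stand-ins (not the real chapters), purely to illustrate turning pairwise n-gram overlap into a weighted edge list:

```python
from itertools import combinations

def ngrams(text, n=3):
    # The set of all n-character substrings of the text
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Hypothetical chapter texts (short stand-ins, not the real Analects chapters)
chapters = {"學而": "學而時習之不亦說乎", "雍也": "學而時習之者鮮矣", "先進": "巧言令色鮮矣仁"}

# One weighted edge per pair of chapters: weight = number of shared 3-grams
edges = []
for (name_a, text_a), (name_b, text_b) in combinations(chapters.items(), 2):
    weight = len(ngrams(text_a) & ngrams(text_b))
    if weight > 0:
        edges.append((name_a, name_b, weight))

print(edges)
```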

What’s nice about this type of graph is that every edge has a very concrete meaning: the edge weights are simply a representation of how much reuse there is between the two nodes (i.e. chapters) which it connects. Even better, this visualization is also interactive: double-clicking an edge (e.g. the edge connecting 先進 and 雍也) returns to the passage level visualization and lists all the similarities between those two specified chapters – in other words, it lists precisely the data forming the basis for the creation of that edge:

What this means is that the graph can be used as a map showing where similarities occur, and as a way of navigating the results. It also makes it possible to visualize broader trends in the data which might not be easily visible by looking directly at the raw data. For instance, in the following graph created using the tool from three early texts, several interesting patterns are observable at a glance (key: light green = Mozi; dark green = Zhuangzi; blue = Xunzi):


Some at-a-glance patterns suggested by this graph: chapters of the three texts tend to have stronger relationships within their own text, with a few exceptions. There are several disjoint clusters of chapters, which have text reuse relationships with other members of their own group, but not with the rest of the text they appear in – most striking is the group of eight “military chapters” of the Mozi at the top right of the graph, which have strong internal connections but none to anything else in the graph:

Double-clicking on some edges to view the full data indicates that some of these pairs have quite significant reuse relationships:

The only other entirely disjoint cluster is the group formed by the 大取 and 小取 pair of texts – in this case the edge is formed by one short but highly significant parallel:

Another interesting observation: of those Zhuangzi chapters having text reuse relationships with other chapters in the set considered, only the 天下 chapter lacks any significant reuse relationship with any other part of the Zhuangzi – though it does contain a significant parallel with the Xunzi:

Something similar is seen with the 賦 chapter of the Xunzi:

There is a lot of complex detail in this graph, and interpretation requires care and attention to the actual details of what is being “reused” (as well as the parameters of the comparison and visualization); the Text Tools program makes it possible to easily explore the larger trends while also being able to quickly jump into the detailed instance-level view to examine the underlying text. Text Tools works “out of the box” with texts from ctext.org read in via the API (to do this efficiently you will ideally want an institutional subscription or API key), but it can also use texts from other sources.

Further information:

Leave a comment

Searching ctext.org texts from another website

There are a number of ways to add direct full-text search of a ctext.org text to an external website. One of the most straightforward is to use the API “getlink” function to link to a text using its CTP URN. For example, to make a text box which will search this Harvard-Yenching copy of the 茶香閣遺草, you can first locate the corresponding transcribed text on ctext.org, go to the bottom-right of its contents page to get its URN (you need the contents page for the transcription, not the associated scan), which in this case is “ctp:wb417980” – this step can also be done programmatically by API if you want to repeat it for a large number of texts. Once you have the URN, you can create an HTML form which will send the URN and any user-specified search term to the ctext API, which will redirect the user’s browser to the search results. For example, the following HTML creates a search box for 茶香閣遺草:

<form action="https://api.ctext.org/getlink" method="get">
  <input type="hidden" name="urn" value="ctp:wb417980" />
  <input type="text" name="search" />
  <input type="hidden" name="redirect" value="1" />
  <input type="submit" value="Search" />
</form>

This will display the following type of search box (try entering a search term in Chinese and clicking “Search”):





You can also supply the optional “if” and “remap” parameters if you want users of your form to be directed to the Chinese interface, or to use the simplified Chinese version of the site (the defaults are English and traditional Chinese). For the Chinese interface, add the following line between the <form …> and </form> tags:

  <input type="hidden" name="if" value="zh" />

For simplified Chinese, add this line:

  <input type="hidden" name="remap" value="gb" />

If you want to make a link to the text itself using the URN, you can also directly link to the API endpoint:

<a href="https://api.ctext.org/getlink?urn=ctp:wb417980&amp;redirect=1">茶香閣遺草</a>

Live example: 茶香閣遺草

Again, the “if” and “remap” parameters can also be supplied to choose the interface used, as per the API documentation.
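If you are generating such links programmatically – for instance for a large number of texts – the URL can also be assembled in Python using the standard library (the parameter values here are just the examples from above):

```python
from urllib.parse import urlencode

# Build a getlink URL with the parameters described above;
# "if" and "remap" are the optional interface/character settings
params = {"urn": "ctp:wb417980", "redirect": "1", "if": "zh", "remap": "gb"}
url = "https://api.ctext.org/getlink?" + urlencode(params)
print(url)
```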

Leave a comment

Spammy advertising best reason for switching to HTTPS

While transiting at Schiphol and using the airport wifi, I noticed the sudden appearance of a bunch of adverts on normally advert-free websites. For example:

Some investigation indicated that this time the adverts were not injected via Google Analytics, but instead attached directly into the HTML content of the page. First at the top we have some injected CSS:

Then at the bottom we have the real payload, injected JavaScript code:

It appears this is the same type of advertising afflicting AT&T hotspots – information gleaned from Jonathan Meyer, whose website describing the issue is itself also affected by the Schiphol adverts:

Again it seems that given the large scale involved, someone, somewhere – perhaps including a company called “RaGaPa” who seem to be responsible for the ads – is making quite a bit of money through unsavory and perhaps legally questionable means.

Just in case the adverts on their own are not spammy enough, the icon at the top right of the adverts links to the following explanation, casually noting that in addition to standard user tracking and history-based ad serving, “You may also be redirected to sponsor’s websites or welcome pages at a set frequency”:

Perhaps the real take-home though is that HTTPS sites are, again, not affected by this: content injection of this type is not possible on sites served using HTTPS without defeating the certificate authority chain or sidestepping it with other kinds of trickery. Digital Sinology recently moved to HTTPS, so is not affected by this particular attack.

Leave a comment

Classical Chinese Digital Humanities

By Donald Sturgeon

List of tutorials

1 Getting Started [View online] [Download]
2 Python programming and ctext.org API [View online] [Download]
3 Regular expressions [View online] [Download]
Creative Commons License
Leave a comment

Classical Chinese DH: Python programming and ctext.org API

Classical Chinese DH: Python programming and ctext.org API

By Donald Sturgeon

[View this notebook online] [Download this notebook] [List of tutorials]

Variables

Variables are named entities that contain some kind of data that can be changed at a later date. You can choose (almost) any name for a variable as long as it is not the same as a reserved word (i.e. has some special meaning in the Python language), though typically these names will be composed of letters a-z. The names given to variables have no special meaning to the computer, but giving variables names that describe their function in a particular program is usually very helpful to the programmer – and to anyone else who may look at your code. Spaces cannot be part of a variable name, so sometimes other allowed characters (e.g. “_”) are used instead for clarity. Although it is possible to use non-English characters for variable names, this is generally inadvisable as it may cause compatibility problems when running the same program on another computer.

A value is assigned to a variable using the syntax “variable_name = new_value“.

In [1]:
number_of_people = 5
print(number_of_people)
5

A variable only ever has one value at a time. When we assign a second value to a variable, anything that was in it before is lost.

In [2]:
number_of_people = 5
number_of_people = 15
print(number_of_people)
15

In Python, variable names are case sensitive, so as far as Python is concerned, a variable called “thisone” is completely different from a variable called “ThisOne” or “THISONE”.

In [3]:
test = 1
print(Test)  # This will not work and will give an error, because "test" and "Test" are different variables
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-c8ddc105ba6a> in <module>()
      1 test = 1
----> 2 print(Test)  # This will not work and will give an error, because "test" and "Test" are different variables

NameError: name 'Test' is not defined
In [4]:
thisone = 5
ThisOne = 10
print(thisone)
print(ThisOne)
5
10

We can perform basic arithmetic on variables and numbers using symbols representing arithmetic operators:

+ Add
- Subtract
/ Divide
* Multiply
In [5]:
number_of_people = 5
pages_per_person = 12
print(number_of_people * pages_per_person)
60

Strings

One of the most important units of text in most programming languages is the string: an ordered sequence of zero or more characters. Strings can be “literal” strings – string data typed in to a program – or the contents of a variable.

Literal strings have to be enclosed in special characters so that Python knows exactly which part of what appears in the program belongs to the string being defined. You can use either a pair of double quotation marks (“…”) or single quotation marks (‘…’) for this. (Note: most programming languages including Python will not allow the use of ‘full-width’ Chinese / CJK punctuation characters for this purpose.)

In [6]:
print("學而時習之")
學而時習之
In [7]:
analects_1_1 = "子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」"
print(analects_1_1)
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」

Two strings can be joined together (concatenated) using the “+” operator to give a new string:

In [8]:
analects_1_3 = "子曰:「巧言令色,鮮矣仁!」"
print(analects_1_1 + analects_1_3)
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」子曰:「巧言令色,鮮矣仁!」

In Python, each variable has a particular “type”. The most common types are “string”, “integer” (…,-2,-1,0,1,2,…), and “float” (any real number, e.g. 3.1415, -26, …). When joining a string and a number using “+”, we need to specify that the number should be changed into a string:

In [9]:
print(analects_1_1 + 5) # This will not work
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-0eb38b4432f4> in <module>()
----> 1 print(analects_1_1 + 5) # This will not work

TypeError: Can't convert 'int' object to str implicitly
In [10]:
print(analects_1_1 + str(5))
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」5

Sometimes we may need to include “special” characters in our strings, like the return character (\n), or tab character (\t). Also, if we need to include quotation marks as part of our string, we can do this by putting a backslash before each quotation mark (\”).

In [11]:
output_string = "第一行\n第二行\n\"第三行\""
print(output_string)
第一行
第二行
"第三行"

Doing things with strings

Once we have data in a string, we can manipulate it in various ways. Often a program will be designed to work on arbitrary string input, i.e. the program will not know in advance what strings it will be asked to work with. So we need ways of finding out basic things about the string. First of all, how long (in characters) is the string:

In [12]:
print(len(analects_1_1))
41

N.B. “Characters” here means characters in the technical string sense. It includes, for example, all punctuation symbols, and other “special characters” that may be in the string (such as characters representing line breaks).

We can take a single character from a string and create a new string containing just that character using the notation “string_variable[m]“, where m is a number describing the position of the character we want to copy from the string.

N.B. In Python (and many other languages), the characters in a string are numbered starting from 0. So if a string has a length of 5, its characters are numbered 0, 1, 2, 3, and 4.

In [13]:
print(analects_1_1[0])
子
In [14]:
print(analects_1_1[5])
而

If we use a negative value for m, we can do the same thing but counting backwards from the end of the string:

In [15]:
print(analects_1_1[-3])
乎

Another useful basic function is making a new string from some part of an existing string – this is called a “substring”.
In Python, we get a substring of a string starting at position m and ending just before position n using the notation “string_variable[m:n]”:

In [16]:
print(analects_1_1[0:1])
子
In [17]:
print(analects_1_1[1:2])
曰
In [18]:
print(analects_1_1[0:2])
子曰
In [19]:
print(analects_1_1[4:15])
學而時習之,不亦說乎?
In [20]:
print(len(analects_1_1[4:15]))
11

If we want to count characters from the end of a string, instead of from the beginning, we can use a negative number for m (meaning “start from –m characters before the end of the string”) and either omit n entirely (meaning “up to the end of the string”) or use a negative number for n (meaning “up to –n characters before the end of the string”):

In [21]:
print(analects_1_1[-7:])
不亦君子乎?」
In [22]:
print(analects_1_1[-7:-1])
不亦君子乎?

There are many other functions for doing things with strings – we will see more of these in week 3. In the meantime, here are two useful ones. The first is count(), which returns the number of times one string occurs within another string.

In [23]:
input_text = "道可道,非常道。"
print(input_text.count("道"))
3
In [24]:
print(input_text.count("道可"))
1

Another is replace(), which creates a new string in which all matching occurrences of a substring have been replaced by something else. The general form looks something like this:

string_to_search_in.replace(thing_to_search_for, thing_to_replace_with)

N.B. This function does not change the data in the original variable. It just returns new data with the substitution made.

In [25]:
input_text = "道可道,非常道。"
print(input_text.replace("道", "名"))
名可名,非常名。
In [26]:
print(input_text)  # Note: the input_text variable still contains the same data
道可道,非常道。

Lists

Lists are another kind of variable that work a lot like strings, except that whereas each location within a string is always exactly one character, each location in a list can be any kind of value, such as a number or a string.

We can make a list variable by separating each list element with commas and enclosing the whole lot in square brackets.

In [27]:
days_of_week = ["星期天", "星期一", "星期二", "星期三", "星期四", "星期五", "星期六"]
print(days_of_week)
['星期天', '星期一', '星期二', '星期三', '星期四', '星期五', '星期六']

In Python, the items stored in a list are numbered starting from 0. This means if we have 7 items in our list, they are numbered 0, 1, 2, 3, 4, 5, and 6.

In [28]:
print(days_of_week[0])
星期天
In [29]:
print(days_of_week[6])
星期六

If we try to use an item that isn’t in our list, we will get an error.

In [30]:
print(days_of_week[7])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-30-e6a905455149> in <module>()
----> 1 print(days_of_week[7])

IndexError: list index out of range

We can set the value of items in a list just like regular variables.

In [31]:
days_of_week[3] = "禮拜三"
print(days_of_week)
['星期天', '星期一', '星期二', '禮拜三', '星期四', '星期五', '星期六']

Often when we process lists in a program, we will need to find out how long the list is. This is because our programs will usually be designed to work with any input of a certain type (for example, to work with any text, not just one we’ve chosen in advance), and so we will only find out how many items there are in a particular list when our program is actually run. The len() function tells us how many items are in our list.

In [32]:
print(len(days_of_week))
7

Remember that when we use len() on a string, it will give us the length of the string in numbers of characters. So for our days_of_week example:

In [33]:
print(len(days_of_week[0]))
print(len(days_of_week[1]))
# etc.
3
3

Make sure you understand why we get this answer here.

True and false

Boolean logic – i.e. logic in which things are either true or false – is central to most commonly used programming languages. Typically programs make decisions as to what to do next based on whether some particular expression (e.g. comparison of variables) is true or false. Some basic comparison operators are:

== equals
> greater than
< less than
>= greater than or equal to
<= less than or equal to

N.B. Assignment, e.g. a=1, uses a single “=”, whereas comparison, e.g. a==2, uses a double “==”.

In [34]:
print(5>2)
True
In [35]:
print(5>7)
False
In [36]:
number_of_pages = 5*12
print(number_of_pages == 60)
True
In [37]:
print(analects_1_1)
print(analects_1_3)
print(analects_1_1[0:2] == analects_1_3[0:2])   # If you're not sure why, try using print() on the two sides of the comparison
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」
子曰:「巧言令色,鮮矣仁!」
True
In [38]:
print(analects_1_1[0:7] == analects_1_3[0:7])   # If you're not sure why, try using print() on the two sides of the comparison
False

Often we want to know whether something occurs anywhere within a string. One way to do this is using the in operator. (We will look at more sophisticated types of searching for more complex patterns next week.)

In [39]:
text_to_search_in = "有子曰:「其為人也孝弟,而好犯上者,鮮矣;不好犯上,而好作亂者,未之有也。君子務本,本立而道生。孝弟也者,其為仁之本與!」"
print("孝弟" in text_to_search_in)
True
In [40]:
print("仁義" in text_to_search_in) 
False

Making decisions

Now that we can compare things, we can start to change what we do depending on the outcome of these comparisons.

The simplest type of decision is an “if … then …” decision: if something is true, then do something (otherwise don’t do it).

In [41]:
if(3>2):
    print("3 is greater than 2")
    
if(2>3):
    print("2 is greater than 3")
3 is greater than 2

In Python, indentation (one or more spaces from the left-hand margin) is used to mark blocks of code (sequences of instructions) which are to be followed together. For example:

In [42]:
if(3>2):
    print("3 is greater than 2")
    print("This is also executed if 3>2")
    
if(2>3):
    print("2 is greater than 3")

print("This is *always* executed, because it is outside of both 'if' blocks")
3 is greater than 2
This is also executed if 3>2
This is *always* executed, because it is outside of both 'if' blocks

The “else” keyword can be used after an “if” to do one thing if a condition is true, and some other thing if it is not true:

In [43]:
text1 = "Some text"
text2 = "Some other text"
if(len(text1) > len(text2)):
    print("text 1 is longer than text 2")
else:
    print("text 1 is not longer than text 2")  # It might be exactly the same length though - try changing the text in text1 and text2
text 1 is not longer than text 2

Also useful are logical operators “and“, “or“, and “not“. These allow us to make more complex decisions based on several factors.

Python expression Result
A and B True if A is True and B is also True – otherwise False
A or B True if A is True or B is True, or both are True – otherwise False
not A True if A is False – otherwise False

N.B. Matching pairs of brackets are used in complex expressions to remove ambiguity: the innermost brackets are always evaluated first. For instance:

(a and b) or c   # This will only be true when either: 1) a and b are both true; or 2) c is true
a and (b or c)   # This will only be true when *both*: 1) a is true; and 2) either b is true or c is true
a and b or c     # ???? Don't write this - it's not obvious which of the previous two lines it corresponds to

Suggestion: don’t write things like “a and b or c” without brackets, as these are confusing. (There are rules that mean they are not ambiguous, but instead of worrying about these now, always use brackets when mixing and, or, and not in an expression.)

Experiment with changing the values of the three variables below, making sure you understand why you get different results.

In [44]:
is_raining = False
will_rain_later = True
i_am_going_out = True

print(is_raining or will_rain_later)
print((is_raining or will_rain_later) and i_am_going_out)

if ((not is_raining) and will_rain_later) and i_am_going_out:
    print("Did you see the forecast?")
    
if (is_raining or will_rain_later) and i_am_going_out:
    print("Better take an umbrella!")
True
True
Did you see the forecast?
Better take an umbrella!

Repeating instructions

A lot of the power of digital methods comes from the fact that once we have a program that performs some task on an arbitrary object of some kind, we can easily have the computer perform that same task on large numbers of objects. One of the simplest ways of repeating instructions is the “for” loop. Just like with if, in Python we use indentation to indicate exactly which instructions we want repeated. For loops are often used with the range() function:

In [45]:
for some_variable in range(0,5):
    print(some_variable)
0
1
2
3
4

The range() function takes two parameters that determine what range of numbers we want: the first parameter determines the number we want to begin with, and the second determines the end of the range. Note: the “end” parameter to range() is not inclusive – so range(0,3) will give us a range with the numbers 0, 1, 2 but not 3.
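We can check this by converting a range into a list:

```python
# The "end" value passed to range() is not included in the result
print(list(range(0, 3)))  # Prints: [0, 1, 2]
```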

As well as range(), we can also loop over each character in a string, or over each item in a list (i.e. do some processing for each character or item).

In [46]:
for day in days_of_week:
    print(day)
星期天
星期一
星期二
禮拜三
星期四
星期五
星期六

Working with characters in a string is just as easy – we can use the format “for character_variable in string:”

In [47]:
my_string = "道可道,非常道。"
for my_character in my_string:
    print(my_character)
道
可
道
,
非
常
道
。

However, sometimes we also need to know where (i.e. at what index) we are in a string.

In [48]:
for character_index in range(0,len(my_string)):
    print("Index " + str(character_index) + " in our string is: " + my_string[character_index])
Index 0 in our string is: 道
Index 1 in our string is: 可
Index 2 in our string is: 道
Index 3 in our string is: ,
Index 4 in our string is: 非
Index 5 in our string is: 常
Index 6 in our string is: 道
Index 7 in our string is: 。

Sometimes we will need to have one loop inside another loop. In this case, we use progressively larger indentations to indicate which instructions should be repeated in which loop.

In [49]:
for x in range(1,4):
    print("Starting sums with x = " + str(x))
    for y in range(10,14):
        z = x * y
        print(str(x) + "*" + str(y) + "=" + str(z))
    print("Finished sums with x = " + str(x))
print("Finished all the sums")
Starting sums with x = 1
1*10=10
1*11=11
1*12=12
1*13=13
Finished sums with x = 1
Starting sums with x = 2
2*10=20
2*11=22
2*12=24
2*13=26
Finished sums with x = 2
Starting sums with x = 3
3*10=30
3*11=33
3*12=36
3*13=39
Finished sums with x = 3
Finished all the sums

Getting Chinese texts

If you already have a copy of a text you’d like to process, you can easily read it from a text file into a string variable. However, as the formatting of each text may be different, the exact steps needed to process the text may differ slightly in each case. We will look at how to deal with this in detail next time when we look at regular expressions, since these provide powerful tools to quickly reorganize textual data.

An alternative way of getting textual data for historical texts is to use the ctext.org API, which lets us get the text for many historical texts in a consistent format. Texts are identified using a URN (Uniform Resource Name) – you can see this written at the bottom of the page when you view the corresponding text on ctext.org.

To make it easier to access these texts, we can use a specialized Python module that accesses the API. Before we can use it, we have to install it. Python makes this very easy to do, and it only needs to be done once; if you followed the instructions in the first tutorial, you should already have this module installed.

Once installed, we can use functions from this module to read textual data into Python variables. If we don’t care about the structure of a text, but only its contents, we can read the text into a single list of strings, each containing a single paragraph, like this:

In [50]:
from ctext import *  # This lets us get Chinese texts from http://ctext.org
setapikey("demo")    # This allows us access to the data used in these tutorials

passages = gettextasparagrapharray("ctp:analects")

print("Total number of passages: " + str(len(passages)))
print("First passage is: " + passages[0])
print("Second passage is: " + passages[1])
print("Last passage is: " + passages[-1])
Total number of passages: 503
First passage is: 子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」
Second passage is: 有子曰:「其為人也孝弟,而好犯上者,鮮矣;不好犯上,而好作亂者,未之有也。君子務本,本立而道生。孝弟也者,其為仁之本與!」
Last passage is: 子曰:「不知命,無以為君子也。不知禮,無以立也。不知言,無以知人也。」

Reading and writing files

To read or write to a file, we must first open it. When we open a file, we must specify both the name of the file, and whether we want to read data from it (“r”) or write data to it (“w”). The file.write() function works very much like print(), but writes its output directly to the file instead of to the screen. When writing to a file, however, we need to explicitly include return characters at the end of each line using “\n”. The file.read() function reads all the data from the file, which you can assign to a Python variable.

N.B. Be careful when opening files! When you use “w” to open a file for writing, any file with that name in the same folder as your Python Notebook will immediately be replaced with a new, empty file.

In [51]:
file = open("week2_testfile.txt", "w", encoding="utf-8") # N.B. "w" here means we will open this file and write to it. If the file exists, it will immediately be deleted.
file.write("第一行\n第二行")
file.close()

Now take a look in Windows Explorer or Mac OS Finder and make sure you can see where this file is on your computer. It will be helpful for you to know in which folder Python expects files to be by default.

In [52]:
file = open("week2_testfile.txt", "r", encoding="utf-8") # "r" means we will open this file for reading, and won't be able to modify it
data_from_file = file.read()
file.close()
print(data_from_file)
第一行
第二行

Exercises

1.i) Write a program using a for loop to output all of the substrings of length 2 contained in the variable input_string. Your program should produce output like this:

天命
命之
之謂
謂性
性,
,率
率性
性之
...
...
道也
也。
In [60]:
input_string = "天命之謂性,率性之謂道,修道之謂教。道也者,不可須臾離也,可離非道也。"

# Your code goes here!

1.ii) Now modify your program so that you first define a variable called “substring_length” containing a number determining the length of substring to be listed. Your new program should still give the same output when run with “substring_length=2”, but should also work with substring_length set to 3, 4, etc. Remember, every line that your program outputs should have exactly substring_length characters in it (including punctuation characters).

1.iii) Modify your program again so that on each line it prints the total number of times that the substring occurs in input_string. For instance, each line beginning “之謂” should now read “之謂 3”, since “之謂” occurs three times in this string.

1.iv) Run your program again, but now use this slightly longer text instead:

input_string = "天命之謂性,率性之謂道,修道之謂教。道也者,不可須臾離也,可離非道也。是故君子戒慎乎其所不睹,恐懼乎其所不聞。莫見乎隱,莫顯乎微。故君子慎其獨也。喜怒哀樂之未發,謂之中;發而皆中節,謂之和;中也者,天下之大本也;和也者,天下之達道也。致中和,天地位焉,萬物育焉。"

Look at what the most frequent 2-grams are with this text. As before, some of them will include punctuation. Do any of these frequent 2-grams which include punctuation relate to facts about the language?

2.i) In the cell below, write a program to find and print out all passages in the Analects that include the term “仁”.

Note: if you’ve run the example program under Getting Chinese texts, the data is already stored in the passages variable.

In [54]:
# Your code goes here!

2.ii) Modify your program so it instead lists only passages that mention both terms “仁” and “義”.

2.iii) Modify your program so it instead lists only passages that mention either the term “愛人” or “知人” but not both.

3) Write another program (if you like, you can copy and paste your answer to the previous question and modify it) to determine which passage in the Analects mentions the term “禮” the greatest number of times.

Hint: Use one variable to track the greatest number of times “禮” has appeared, and another to track which passage it appeared in.

In [55]:
# Your code goes here!

4.i) Write a program in the cell below to store the full text of the Analects into a file on your computer called “analects.txt”. Put each paragraph on its own line, and in front of each paragraph put firstly the number of the paragraph, starting at 1, and secondly the length of the paragraph in characters. Separate each of these three pieces of data with a tab character. The beginning of your file should look like this:

1   41   子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」
2   61   有子曰:「其為人也孝弟,而好犯上者,鮮矣;不好犯上,而好作亂者,未之有也。君子務本,本立而道生。孝弟也者,其為仁之本與!」
In [56]:
# Your code goes here!

• Open the file in a text editor (e.g. Notepad for Windows, TextEdit for Mac OS X – usually double-clicking on the file you’ve created will do this) and check that the output looks correct.

4.ii) Modify your program so that the character counts only include Chinese characters, i.e. do not count punctuation characters.

4.iii) (Optional) If you have Excel or another spreadsheet program on your computer, try importing the file into it so that you get separate columns for the paragraph number, length, and content. (This may or may not be easy depending on your operating system and spreadsheet program. If you encounter encoding issues, try copying all of the data from the text editor straight into a blank spreadsheet instead.)

In [57]:
# Your code goes here!

5) [Harder] Which passage in the Analects contains a character repeated more frequently than any other character in any other passage – and what is the character?
It will help if you firstly think carefully about what you need to keep track of between passages in order to answer this question.
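As a building block, the number of times each character repeats within a single passage can be counted with collections.Counter from Python’s standard library (the passage string below is invented; this is not a full solution):

```python
from collections import Counter

passage = "克己復禮為仁。一日克己復禮,天下歸仁焉。"

# Count each character, ignoring punctuation
counts = Counter(ch for ch in passage if ch not in "。,、:;「」?!")
most_common_char, count = counts.most_common(1)[0]
print(most_common_char, count)
```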

In [58]:
# Your code goes here!

Further reading

Bonus question

This section is optional, as it includes several things we haven’t covered yet and will return to later when we look more closely at structured data.

The program below uses a dictionary variable to count all of the 1-grams in the Analects. [A dictionary variable is very similar to a list, except that its items are not numbered 0,1,2,… but instead indexed using arbitrary strings – for instance, my_dictionary["論語"], which might contain a string value such as "analects", or a number like 32. The term “dictionary” here is metaphorical, i.e. a dictionary variable often does not contain translations of words from one language to another – though this is one possible use case.]
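A minimal illustration of how dictionary variables are created and indexed (the keys and values here are arbitrary examples):

```python
my_dictionary = {}                  # Create an empty dictionary
my_dictionary["論語"] = "analects"  # Index with a string instead of a number
my_dictionary["篇數"] = 20          # Values can be of any type, e.g. a number

print(my_dictionary["論語"])    # prints analects
print("論語" in my_dictionary)  # prints True
```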

It then uses the pandas library to select the top ten most frequent 1-grams, and the matplotlib library to draw a bar chart of this data.

Read through the code, and see if you can modify it to find the most frequent 2-grams, 3-grams, etc.
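The key change is that an n-gram is a slice of length n rather than a single character; a string of length L contains L − n + 1 such slices. A minimal sketch for 2-grams on a short made-up string, using the same counting pattern as the program below:

```python
text = "學而時習之"
n = 2

ngram_counts = {}
# A string of this length contains len(text) - n + 1 n-grams
for start in range(0, len(text) - n + 1):
    ngram = text[start:start + n]
    if ngram in ngram_counts:
        ngram_counts[ngram] = ngram_counts[ngram] + 1
    else:
        ngram_counts[ngram] = 1

print(ngram_counts)
```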

In [59]:
import numpy as np
import matplotlib.pyplot as plt

# The next line tells the matplotlib library to display its output in our Jupyter notebook
%matplotlib inline
from ctext import *
import pandas as pd
import matplotlib as mpl

# Unfortunately some software still has difficulty dealing with Chinese.
# Here we may need to tell matplotlib to use a specific font containing Chinese characters.
import platform
if platform.system() == 'Darwin':   # I.e. if we're running on Mac OS X
    mpl.rcParams['font.family'] = "STFangsong" 
else:
    mpl.rcParams['font.family'] = "SimHei"
    
mpl.rcParams['font.size'] = 20

chapterdata = gettextasparagrapharray("ctp:analects")

# Use a dictionary variable to keep track of the count of each character we see
character_count = {}

# For each paragraph of the chapter data that we downloaded, do the following:
for paragraphnumber in range(0, len(chapterdata)):
    for char in range(0,len(chapterdata[paragraphnumber])):
        this_character = chapterdata[paragraphnumber][char:char+1]
        # Don't bother counting punctuation characters
        if this_character not in [",", "。", ":", ";", "「", "」", "?"]:
            if this_character in character_count:
                new_count = character_count[this_character] + 1
            else:
                new_count = 1
            character_count[this_character] = new_count

s = pd.Series(character_count)
s.sort_values(ascending=False, inplace=True)

s[:10].plot(kind='barh')
print(s[:10])
子    973
曰    757
之    613
不    583
也    532
而    343
其    270
者    219
人    219
以    211
dtype: int64
Creative Commons License

Classical Chinese DH: Getting Started

By Donald Sturgeon

This is the first in a series of online tutorials introducing basic digital humanities techniques using the Python programming language and the Chinese Text Project API. These tutorials are based in part on material covered in the course CHNSHIS 202: Digital Methods for Chinese Studies, which I teach at Harvard University’s Department of East Asian Languages and Civilizations.

Intended audience: People with some knowledge of Chinese literature and an interest in digital humanities; no programming experience necessary.

Format: Most of these tutorials will consist of a Jupyter Notebook file. These files contain a mixture of explanations and code that can be modified and run from within your web browser. This makes it very easy to modify, play with, and extend all of the example code. You can also read the tutorials online first (you’ll need to download the files in order to run the code and do the exercises though).

Getting started

To use this series of tutorials, you need to first complete the following steps:

  1. Install Python (programming language) and Jupyter (web browser based interface to Python). The recommended way to do this is by installing the Anaconda distribution, which will automatically install Python, Jupyter, and many other things we need. For these tutorials, you should install the Python 3.x version of Anaconda (not the 2.7 version).
  2. Install the ctext module. To do this, after installing Anaconda, open Command Prompt (Windows) or Terminal (Mac OS X), and then type:
    pip install ctext [return]
  3. Create a folder to contain your Python projects. To follow a tutorial, first download the .ipynb Jupyter Notebook file and save it into this folder.
  4. Start up the Jupyter environment. One way to do this is opening the Command Prompt (Windows) or Terminal (Mac OS X), and then typing:
    jupyter notebook [return]
  5. When you start Jupyter, it should open your web browser and take you to the page http://localhost:8888/tree. This is a web page, but instead of being located somewhere on the internet, it is located on your own computer. The page should show a list of files and folders on your own computer; using this list, navigate to the folder containing the downloaded .ipynb file, and click on the file to open it in your web browser. You can now use the full interactive version of the notebook.
  6. The Jupyter system works by having a server program which runs in the background (if you start Jupyter as described above, you can see it running in the Terminal / Command Prompt window), which is then accessed using a web browser. This means that when you close your web browser, Jupyter is still running until you stop the server process. You can stop the server process by opening the Terminal / Command Prompt window and pressing Control-C twice (i.e. holding down the “Control” key and pressing the C key twice).

Below is the Jupyter notebook for this tutorial. Since the first tutorial focuses on how to use the Jupyter environment, you should download and open this notebook in Jupyter rather than trying to follow it online.


Welcome to our first Jupyter Notebook!

A notebook is a hypertext document containing a mixture of textual content (like the part you’re reading now) and computer programs – lists of instructions written in a programming language (in our case, the Python language) – as well as the output of these programs.

Using the Jupyter environment

Before getting started with Python itself, it’s important to get some basic familiarity with the user interface of the Jupyter environment. Jupyter is fairly intuitive to use, partly because it runs in a web browser and so works a lot like any web page. Basic principles:

  • Each “notebook” displays as a single page. Notebooks are opened and saved using the menus and icons shown within the Jupyter window (i.e. the menus and icons under the Jupyter logo and icon, not the menus / icons belonging to your web browser).

  • Notebooks are made up of “cells”. Each cell is displayed on the page in a long list, one below another. You can see which parts of the notebook belong to which cell by clicking once on the text – when you do this, this will select the cell containing the text, and show its outline with a grey line.

  • Usually a cell contains either text (like this one – in Jupyter this is called a “Markdown” cell), or Python code (like the one below this one).

  • You can click on a program cell to edit it, and double-click on a text cell to edit it. Try double-clicking on this cell.

  • When you start editing a text cell, the way it is displayed changes so that you can see (and edit) any formatting codes in it. To return the cell back to the “normal” prettified display, you need to “Run” it. You can run a cell by either:

    • choosing “Run” from the “Cell” menu above,
    • pressing shift-return when the cell is selected, or
    • clicking the “Run cell” icon.
  • “Run” this cell so that it returns to the original mode of display.
In [1]:
for number in range(1,13):
    print(str(number) + "*" + str(number) + " = " + str(number*number))
1*1 = 1
2*2 = 4
3*3 = 9
4*4 = 16
5*5 = 25
6*6 = 36
7*7 = 49
8*8 = 64
9*9 = 81
10*10 = 100
11*11 = 121
12*12 = 144

The program in a cell doesn’t do anything until you ask Jupyter to run (a.k.a. “execute”) it – in other words, ask the system to start following the instructions in the program. You can execute a cell by clicking somewhere in it so it’s selected, then choosing “Run” from the “Cell” menu (or by pressing shift-return).

When you run a cell containing a Python program, any output that the program generates is displayed directly below that cell. If you modify the program, you’ll need to run it again before you will see the modified result.

A lot of the power of Python and Jupyter comes from the ability to easily make use of modules written by other people. Modules are included using lines like “from … import *”.
A module needs to be installed on your computer before you can use it; many of the most commonly used ones are installed as part of Anaconda.

“Comments” provide a way of explaining to human readers what parts of a program are supposed to do (but are completely ignored by Python itself). Typing the symbol # begins a comment, which continues until the end of the line.

N.B. You must install the “ctext” module before running the code below. If you get the error “ImportError: No module named 'ctext'” when you try to run the code, refer to the instructions for how to install the ctext module.

In [2]:
from ctext import *  # This module gives us direct access to data from ctext.org
setapikey("demo")    # This allows us access to the data used in these tutorials

paragraphs = gettextasparagrapharray("ctp:analects/xue-er")

print("This chapter is made up of " + str(len(paragraphs)) + " paragraphs. These are:")

# For each paragraph of the chapter data that we downloaded, do the following:
for paragraphnumber in range(0, len(paragraphs)):
    print(str(paragraphnumber+1) + ". " + paragraphs[paragraphnumber])
This chapter is made up of 16 paragraphs. These are:
1. 子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」
2. 有子曰:「其為人也孝弟,而好犯上者,鮮矣;不好犯上,而好作亂者,未之有也。君子務本,本立而道生。孝弟也者,其為仁之本與!」
3. 子曰:「巧言令色,鮮矣仁!」
4. 曾子曰:「吾日三省吾身:為人謀而不忠乎?與朋友交而不信乎?傳不習乎?」
5. 子曰:「道千乘之國:敬事而信,節用而愛人,使民以時。」
6. 子曰:「弟子入則孝,出則弟,謹而信,汎愛眾,而親仁。行有餘力,則以學文。」
7. 子夏曰:「賢賢易色,事父母能竭其力,事君能致其身,與朋友交言而有信。雖曰未學,吾必謂之學矣。」
8. 子曰:「君子不重則不威,學則不固。主忠信,無友不如己者,過則勿憚改。」
9. 曾子曰:「慎終追遠,民德歸厚矣。」
10. 子禽問於子貢曰:「夫子至於是邦也,必聞其政,求之與?抑與之與?」子貢曰:「夫子溫、良、恭、儉、讓以得之。夫子之求之也,其諸異乎人之求之與?」
11. 子曰:「父在,觀其志;父沒,觀其行;三年無改於父之道,可謂孝矣。」
12. 有子曰:「禮之用,和為貴。先王之道斯為美,小大由之。有所不行,知和而和,不以禮節之,亦不可行也。」
13. 有子曰:「信近於義,言可復也;恭近於禮,遠恥辱也;因不失其親,亦可宗也。」
14. 子曰:「君子食無求飽,居無求安,敏於事而慎於言,就有道而正焉,可謂好學也已。」
15. 子貢曰:「貧而無諂,富而無驕,何如?」子曰:「可也。未若貧而樂,富而好禮者也。」子貢曰:「《詩》云:『如切如磋,如琢如磨。』其斯之謂與?」子曰:「賜也,始可與言詩已矣!告諸往而知來者。」
16. 子曰:「不患人之不己知,患不知人也。」

‘Variables’ are named entities that contain some kind of data that can be changed at a later date. We will look at these in much more detail over the next few weeks. For now, you can think of them as named boxes which can contain any kind of data.

Once we have data stored in a variable (like the ‘paragraphs’ variable above), we can start processing it in whatever way we want. Often we use other variables to track our progress, like the ‘longest_paragraph’ and ‘longest_length’ variables in the program below.

In [3]:
longest_paragraph = None # We use this variable to record which of the paragraphs we've looked at is longest
longest_length = 0       # We use this one to record how long the longest paragraph we've found so far is

for paragraph_number in range(0, len(paragraphs)):
    paragraph_text = paragraphs[paragraph_number]
    if len(paragraph_text)>longest_length:
        longest_paragraph = paragraph_number
        longest_length = len(paragraph_text)

print("The longest paragraph is paragraph number " + str(longest_paragraph+1) + ", which is " + str(longest_length) + " characters long.")
The longest paragraph is paragraph number 15, which is 93 characters long.

Modules allow us to do powerful things like Principal Component Analysis (PCA) and machine learning without having to write any code to perform the complex mathematics which lies behind these techniques. They also let us easily plot numerical results within the Jupyter notebook environment.

For example, the following code (which we will go through in much more detail in a future tutorial – don’t worry about the contents of it yet) plots the frequencies of the two characters “矣” and “也” in chapters of the Analects versus chapters of the Fengshen Yanyi. (Note: this may take a few seconds to download the data.)

In [5]:
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  

def makevector(string, termlist, normalize = False):
    vector = []
    for term in termlist:
        termcount = len(re.findall(term, string))
        if normalize:
            vector.append(termcount/len(string))
        else:
            vector.append(termcount)
    return vector

text1 = gettextaschapterlist("ctp:fengshen-yanyi")
text2 = gettextaschapterlist("ctp:analects")

vectors1 = []
for chapter in text1:
    vectors1.append(makevector(chapter, ["矣", "也"], True))

vectors2 = []
for chapter in text2:
    vectors2.append(makevector(chapter, ["矣", "也"], True))

df1 = pd.DataFrame(vectors1)
df2 = pd.DataFrame(vectors2)

legend1 = plt.scatter(df1.iloc[:,0], df1.iloc[:,1], color="blue", label="Fengshen Yanyi")
legend2 = plt.scatter(df2.iloc[:,0], df2.iloc[:,1], color="red", label="Analects")
plt.legend(handles = [legend1, legend2])
plt.xlabel("Frequency of 'yi'")
plt.ylabel("Frequency of 'ye'")
Out[5]:
<matplotlib.text.Text at 0x10e4dc940>

You can save changes to your notebook using “File” -> “Save and checkpoint”. Note that Jupyter often saves your changes for you automatically, so if you don’t want to save your changes, you might want to make a copy of your notebook first using “File” -> “Make a Copy”.

You should try to avoid having the same notebook open in two different browser windows or browser tabs at the same time. (If you do this, both pages may try to save changes to the same file, overwriting each other’s work.)

Exercises

Before we start writing programs, we need to get familiar with the Jupyter Notebook programming environment. Check that you can complete the following tasks:

  • Run each of the program cells in this notebook that are above this cell on your computer, checking that each of the short programs produces the expected output.
  • Clear all of the output using “Cell” -> “All output” -> “Clear”, then run one or two of them again.
  • In Jupyter, each cell in a notebook can be run independently. Sometimes the order in which cells are run is important. Try running the following three cells in order, then see what happens when you run them in a different order. Make sure you understand why in some cases you get different results.
In [6]:
number_of_things = 1
In [7]:
print(number_of_things)
1
In [8]:
number_of_things = number_of_things + 1
print(number_of_things)
2
  • Some of the programs in this notebook are very simple. Modify and re-run them to perform the following tasks:

    • Print out the squares of the numbers 3 through 20 (instead of 1 through 12)
    • Print out the cubes of the numbers 3 through 20 (i.e. 3 x 3 x 3 = 27, 4 x 4 x 4 = 64, etc.)
    • Instead of printing passages from the first chapter of the Analects, print passages from the Daodejing, and determine the longest passage in it. The URN for the Daodejing is: ctp:dao-de-jing
  • Often when programming you’ll encounter error messages. The following line contains a bug; try running it, and look at the output. Work out which part of the error message is most relevant, and see if you can find an explanation on the web (e.g. on StackOverflow) and fix the mistake.

In [9]:
print("The answer to life the universe and everything is: " + 42)  # This statement is incorrect and isn't going to work
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-f35205b23c00> in <module>()
----> 1 print("The answer to life the universe and everything is: " + 42)  # This statement is incorrect and isn't going to work

TypeError: Can't convert 'int' object to str implicitly
  • Sometimes a program will take a long time to run – or even run forever – and you’ll need to stop it. Watch what happens to the circle beside the text “Python 3” at the top-right of the screen when you run the cell below.
  • While the cell below is running, try running the cell above. You won’t see any output until the cell below has finished running.
  • Run the cell below again. While it’s running, interrupt its execution by clicking “Kernel” -> “Interrupt”.
In [10]:
import time

for number in range(1,21):
    print(number)
    time.sleep(1)
1
2
3
4
5
6
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-10-c01c67722f36> in <module>()
      3 for number in range(1,21):
      4     print(number)
----> 5     time.sleep(1)

KeyboardInterrupt: 
  • The cell below has been set as a “Markdown” cell, making it a text cell instead of a program (“code”) cell. Work out how to make the cell run as a program.

for number in range(1,11):
    print("1/" + str(number) + " = " + str(1/number)) # In many programming languages, the symbol "/" means "divided by"

  • Experiment with creating new cells below this one. Make some text cells, type something in them, and run them. Copy and paste some code from above into code cells, and run them too. Try playing around with simple modifications to the code.
  • (Optional) You can make your text cells look nicer by including formatting instructions in them. The way of doing this is called “Markdown” – there are many good introductions available online.
  • Lastly, save your modified notebook and close your web browser. Shut down the Python server process, then start it again, and reload your modified notebook. Make sure you can also find the saved notebook file in your computer’s file manager (e.g. “Windows Explorer”/”File Explorer” on Windows, or “Finder” on Mac OS X).

Further reading:

  • Jupyter Notebook Users Manual, Bryn Mawr College Computer Science – This provides a thorough introduction to Jupyter features. This guide introduces many more features than we will need to use, but is a great reference.

When n-grams go bad

As a followup to Google n-grams and pre-modern Chinese, other features of the Google n-gram viewer may help shed some light on the issues with the pre-1950 data for Chinese.

One useful feature is wildcard search, which allows various open-ended searches, the simplest of these being a search for “*”, which plots the most frequent 1-grams in a corpus – i.e. the most commonly occurring words. For example, if we input a single asterisk as our search query on the English corpus, we get the frequencies of the ten most common English words:

The results look plausible at least as far back as 1800, which is what the authors claim to be the reliable part of the data. Earlier than that things get shakier, and before about 1650 things get quite seriously out of hand:

Remember, these are the most common terms in the corpus, i.e. the ones for which the data is going to be the most reliable. Now let’s look at the equivalent figures for Chinese. Firstly, we can get a nice baseline showing what we would like to see by doing the equivalent search on a relatively reliable part of the data, e.g. 1970 to 2000:

This looks good. The top ten 1-grams – i.e. the most frequently occurring terms – are all commonly occurring Chinese words. Now let’s try going back to 1800:

Oh dear. From 1800 to 2000, of the ten most frequent 1-grams, more than half are not terms that plausibly occur in pre-modern Chinese texts at all. Note also that the scale of the y axis has now changed: according to this graph, it would appear that up to 40% of terms in pre-1940 texts may have been detected as being URLs or other non-textual content. Unsurprisingly, these problems continue all the way back to 1500:

It’s unclear what exactly _URL_, ^_URL_, and @_URL_ are supposed to represent as they don’t seem to be documented, and none of them are accepted by the viewer as valid query terms, so we can’t easily check what their values are on the English data. Possibly they are just categorization tags that don’t affect the overall counts and thus normalized frequencies, but even so they surely point to serious problems with the data that have caused up to 50% of terms to be so interpreted.

Even aside from these suspect “URLs”, the other most frequent terms returned indicate that three terms not plausibly occurring in pre-modern Chinese texts – “0”, “1”, and “I” – account for anything up to 20% or more of all terms in the pre-1900 data:

Since all the n-gram counts are normalized by the total number of terms, these issues (presumably caused primarily by OCR errors) affect all results for Chinese in any year in which they occur. So while 1800 might be a reasonable cut-off for meaningful interpretation of the English data, for Chinese 1970 would be a better choice, and any results from before around 1940 will be largely meaningless due to the overwhelming amount of noise.
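A toy calculation illustrates why: because counts are normalized by the total number of terms, noise tokens in the denominator deflate the apparent frequency of every genuine term (all numbers below are invented):

```python
# Invented counts for one year of a toy corpus
genuine_count = 300   # Occurrences of a genuine term
real_tokens = 1000    # Genuine tokens in the corpus
noise_tokens = 1000   # Spurious tokens, e.g. OCR artifacts counted as terms

# Normalized frequency without and with the noise included in the denominator
clean_frequency = genuine_count / real_tokens
noisy_frequency = genuine_count / (real_tokens + noise_tokens)

print(clean_frequency)  # 0.3
print(noisy_frequency)  # 0.15
```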


Update April 18, 2015:

It appears that the @_URL_ and ^_URL_ actually correspond to the terms “@” and “^” (both, presumably, tagged with “URL”), and so these do indeed affect the results: for many years pre-1950, anything up to 60% of all terms in the corpus are the term “^”:

It seems that the data used for Chinese fails some fairly basic sanity checks (including “is it in Chinese?”).
