Classical Chinese DH: Regular Expressions

By Donald Sturgeon

Regular expressions

A regular expression (a.k.a. regex or RE) is a pattern to be searched for in some body of text. These are not specific to Python, but by combining simple regular expressions with basic Python statements, we can quickly achieve powerful results.

Commonly used regex syntax

. Matches any one character exactly once
[abcdef] Matches any one of the characters a,b,c,d,e,f exactly once
[^abcdef] Matches any one character **other than** a,b,c,d,e,f
? After a character/group, makes that character/group optional (i.e. match zero or 1 times)
? After +, * or {…}, makes matching ungreedy (i.e. choose shortest match, not longest)
* After a character/group, makes that character/group match zero or more times
+ After a character/group, makes that character/group match one or more times
{2,5} After a character/group, makes that character/group match 2,3,4, or 5 times
{2,} After a character/group, makes that character/group match 2 or more times
\3 Matches whatever was matched into group number 3 (first group from left is numbered 1)
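As a quick preview of a few of these in action (this uses Python’s re module, which the next section introduces, and an invented Latin-alphabet string):

```python
import re

text = "aabbbcc abc axc"
print(re.findall(r"a.c", text))     # "." matches any one character: ['abc', 'axc']
print(re.findall(r"ab+", text))     # "+" matches one or more times: ['abbb', 'ab']
print(re.findall(r"b{2,3}", text))  # two or three "b"s in a row: ['bbb']
print(re.findall(r"(.)\1", text))   # "\1" repeats group 1: ['a', 'b', 'c']
```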

To use regexes in Python, we use another module called “re” (this is a very common module and should already be installed).

In [53]:
import re

laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。上仁為之而無以為;上義為之而有以為。上禮為之而莫之應,則攘臂而扔之。故失道而後德,失德而後仁,失仁而後義,失義而後禮。"

for match in re.finditer(r".德", laozi):  # re.finditer returns "match objects", each of which describes one match
    matched_text = match.group(0)        # In Python, group(0) gives the full text that was matched
    print("Found a match: " + matched_text)
Found a match: 上德
Found a match: 不德
Found a match: 有德
Found a match: 下德
Found a match: 失德
Found a match: 無德
Found a match: 上德
Found a match: 下德
Found a match: 後德
Found a match: 失德

[Aside: in Python, regexes are often written in strings with an "r" in front of them, e.g. r"德" rather than just "德". All this does is tell Python not to try to interpret the contents of the string (e.g. backslashes) as meaning something else. The result of r"德" is still an ordinary string variable with 德 in it.]
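We can see what the r prefix does (and doesn’t do) directly:

```python
# "\n" is an escape sequence for a single line-break character;
# r"\n" is two literal characters: a backslash and an "n".
print(len("\n"))      # 1
print(len(r"\n"))     # 2
# With no backslashes involved, the r makes no difference at all:
print(r"德" == "德")  # True
```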

Exercise 1 (very easy): Change the above code to verify the results of some of the simple example regexes from the slides. Try these ones:

  • 而無以為
  • 是以.德
  • 失.而後.
  • 上[仁義]為之
  • 後(.),失\1

For the last of these (“後(.),失\1”), see what happens to the output when you change group(0) to group(1). (Change it back to group(0) afterwards though, as we will reuse this code using group(0).)

Exercise 2: Write regular expressions to match the following things (you can keep on modifying the example above to check that they work, but you may want to write down your answers somewhere – remember, you can edit this cell by double-clicking on it).

  • Match any three characters where the middle character is “之” – i.e. “為之而”, “莫之應”, etc. Modify your regex so that it does not match things with punctuation in them, like “扔之。”.
  • Match each “phrase” (i.e. punctuated section) of the text. In other words, the first match should be “上德不德”, the second should be “是以有德”, and so on. You only need to handle the three punctuation marks “。”, “,”, and “;”.
  • Match each phrase which contains the term “之” in it. (Double check that you get 5 matches.)

We can do the same kind of thing on an entire text in one go if we have the whole text in a single string, as in the next example. (If we wanted to know which paragraph or chapter each match appeared in, we would instead run the same regex on each paragraph or chapter in turn.)

In [54]:
from ctext import *
setapikey("demo")

# The gettextasstring function gives us a single string variable with the whole text in it
laozi = gettextasstring("ctp:dao-de-jing")

for match in re.finditer(r"足.", laozi):
    matched_text = match.group(0)
    print(matched_text)
足,
足。
足,
足,
足者
足見
足聞
足既
足以
足;
足不
足;
足之
足,
足矣
足以
足下
足者
足。
足以

Exercise 3

  • Often we don’t want to include matches that have punctuation in them. Modify the regex from the last example so that it excludes all the matches where the character after “足” is “,”, “。”, or “;”. (You should do this by modifying the regex; the rest of the code does not need to change.)

  • Find all the occurrences of X可X – i.e. “道可道” and “名可名” (there is one more item that should be matched too).

  • Modify your regex so you match all occurrences of XYX – i.e. not just “道可道” but also things like “學不學”. You may need to make some changes to avoid matching punctuation – we don’t want to match “三,三” or “、寡、”.

Exercise 4: (Optional) Using what was covered in the previous tutorial, write a program in the cell below to perform one of these searches again, but this time running it once on each paragraph in turn so that you can also print out the number of the passage in which each match occurs.

In [ ]:
passages = gettextasparagraphlist("ctp:dao-de-jing")

# Your code goes here!

Dictionary variables

One of the advantages of using regexes from within a programming language like Python is that as well as simply finding results, we can easily do things to collate our data, such as count up how many times a regex gave various different results. Another type of variable that is useful here is the “dictionary” variable.

A dictionary variable works in a very similar way to a list, except that whereas in a list the items are numbered 0,1,2,… and accessed using these numbers, a dictionary uses other things – in the case we will look at, strings – to identify the items. This lets us “look up” values for different strings, just like looking up the translation of a word in a dictionary. The things we use instead of numbers to “look up” values in a dictionary are called “keys”.

Dictionaries can be defined in Python using the following notation:

In [55]:
my_titles = {"論語": "Analects", "孟子": "Mengzi", "荀子": "Xunzi"}

The above example defines one dictionary variable called “my_titles”, and sets values for three keys: “論語”, “孟子”, and “荀子”. Each of these keys is set to have the corresponding value (“Analects”, “Mengzi”, and “Xunzi” respectively). In this simple example, our dictionary gives us a way of translating Chinese-language titles into English-language titles.

We can access the items in a dictionary in a very similar way to accessing items from a list:

In [56]:
print(my_titles["論語"])
Analects
In [57]:
print(my_titles["荀子"])
Xunzi

Unlike in a list, our items don’t have numbers, and we shouldn’t rely on them coming in any particular order. So one thing we will sometimes need to do is to get a list of all the keys – i.e., a list telling us what things there are in our dictionary.

In [58]:
list_of_titles = list(my_titles.keys())
print(list_of_titles)
['孟子', '論語', '荀子']

Often we will store numbers in our dictionary; the keys will be strings, but the value for each key will be a number. This lets us do things like count how many times we’ve seen some particular string – for all of the strings we happen to come across at the same time, using just one dictionary variable. In cases like this, we will often want to sort the keys of the dictionary by their values. One way of doing this is using the “sorted” function:

In [59]:
# In this example, we use a dictionary to record people's year of birth
# Then we sort the keys (i.e. the names) by the values (i.e. year of birth)

year_of_birth = {"胡適": 1891, "梁啟超": 1873, "茅盾": 1896, "王韜": 1828, "魯迅": 1881}
list_of_people = sorted(year_of_birth, key=year_of_birth.get, reverse=False)
for name in list_of_people:
    print(name + " was born in " + str(year_of_birth[name]))
王韜 was born in 1828
梁啟超 was born in 1873
魯迅 was born in 1881
胡適 was born in 1891
茅盾 was born in 1896

Don’t worry about the rather complex-looking syntax for sorted() – you can just follow this model whenever you need to sort a dictionary (and change “reverse=False” to “reverse=True” if you want to reverse the list):

list_of_keys = sorted(my_dictionary, key=my_dictionary.get, reverse=False)

Using a dictionary, we can keep track of every regex result we found, and at the same time collate the data. Instead of having a long list with repeated items in it, we build a dictionary in which the keys are the unique regex matches, and the values are the number of times we have seen that particular string.

In [60]:
match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable

for match in re.finditer(r"(.)為", laozi):
    matched_text = match.group(0)  # e.g. "心為"
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

# Our dictionary now contains a frequency count of each different pair we found
print("match_count contains: " + str(match_count))

# The sorted() function gets us a list of the items we matched, starting with the most frequent
unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
match_count contains: {'之為': 3, '不為': 8, '敢為': 5, '淡為': 1, '宜為': 1, '善為': 3, '靜為': 3, '可為': 1, '歙為': 1, '物為': 1, '無為': 11, '德為': 1, '能為': 3, '禮為': 1, '姓為': 1, '名為': 1, '復為': 2, '賤為': 2, '以為': 18, '自為': 1, '一為': 1, '。為': 7, '生為': 1, '下為': 1, '重為': 1, '心為': 1, '身為': 2, '則為': 2, '人為': 2, '有為': 1, '孰為': 1, '義為': 1, '寵為': 1, '仁為': 1, '而為': 4, ',為': 11, '故為': 2, '是為': 1, '強為': 2}
以為 occurred 18 times.
無為 occurred 11 times.
,為 occurred 11 times.
不為 occurred 8 times.
。為 occurred 7 times.
敢為 occurred 5 times.
而為 occurred 4 times.
之為 occurred 3 times.
善為 occurred 3 times.
靜為 occurred 3 times.
能為 occurred 3 times.
復為 occurred 2 times.
賤為 occurred 2 times.
身為 occurred 2 times.
則為 occurred 2 times.
人為 occurred 2 times.
故為 occurred 2 times.
強為 occurred 2 times.
淡為 occurred 1 times.
宜為 occurred 1 times.
可為 occurred 1 times.
歙為 occurred 1 times.
物為 occurred 1 times.
德為 occurred 1 times.
禮為 occurred 1 times.
姓為 occurred 1 times.
名為 occurred 1 times.
自為 occurred 1 times.
一為 occurred 1 times.
生為 occurred 1 times.
下為 occurred 1 times.
重為 occurred 1 times.
心為 occurred 1 times.
有為 occurred 1 times.
孰為 occurred 1 times.
義為 occurred 1 times.
寵為 occurred 1 times.
仁為 occurred 1 times.
是為 occurred 1 times.
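Incidentally, Python’s standard library has a collections.Counter type that wraps up exactly this count-and-sort pattern; here is the same idea applied to the short Laozi passage from earlier, so that the example is self-contained:

```python
import re
from collections import Counter

laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。"

# Counter builds the same kind of frequency dictionary in one step
match_count = Counter(match.group(0) for match in re.finditer(r".德", laozi))

for item, count in match_count.most_common():  # most_common() sorts by descending count
    print(item + " occurred " + str(count) + " times.")
```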

We can use this idea and almost exactly the same code to start answering quite complex questions about patterns appearing in texts. This code can tell us which actual phrases matching a certain pattern occurred most frequently.

For example, in poetry we often find various kinds of repetition. We can use part of the 詩經 as an example, and using a regex quickly find out which repeated XYXY patterns are most common:

In [61]:
shijing = gettextasstring("ctp:book-of-poetry/lessons-from-the-states")
In [62]:
match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable

for match in re.finditer(r"(.)(.)\1\2", shijing):
    matched_text = match.group(0)
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
子兮子兮 occurred 3 times.
懷哉懷哉 occurred 3 times.
碩鼠碩鼠 occurred 3 times.
歸哉歸哉 occurred 3 times.
如何如何 occurred 3 times.
委蛇委蛇 occurred 3 times.
舍旃舍旃 occurred 3 times.
蘀兮蘀兮 occurred 2 times.
式微式微 occurred 2 times.
采苓采苓 occurred 1 times.
鴟鴞鴟鴞 occurred 1 times.
悠哉悠哉 occurred 1 times.
采葑采葑 occurred 1 times.
瑳兮瑳兮 occurred 1 times.
其雨其雨 occurred 1 times.
簡兮簡兮 occurred 1 times.
采苦采苦 occurred 1 times.
伐柯伐柯 occurred 1 times.
樂國樂國 occurred 1 times.
樂土樂土 occurred 1 times.
樂郊樂郊 occurred 1 times.
玼兮玼兮 occurred 1 times.

Exercise 5: Write a regex to match paired lines of four-character poetry that both begin with the same two characters (examples: “亦既見止、亦既覯止”, “且以喜樂、且以永日”, etc.). Re-run the program above to verify your answer.

Exercise 6: Create a regex to match book titles that appear in punctuated Chinese texts, e.g. “《呂氏春秋》”. Your regex should extract the title without the punctuation marks into a group – i.e. you must use “(” and “)” in your regex. You can test it using the short program below – your output should look like this:

爾雅
廣雅
尚賢
呂氏春秋·順民
呂氏春秋·不侵
左·襄十一年傳
韓詩外傳
廣雅
In [ ]:
test_input = "昔者文公出走而正天下,畢云:「正,讀如征。」王念孫云「畢讀非也,《爾雅》曰:『正,長也。』晉文為諸侯盟主,故曰『正天下』,與下『霸諸侯』對文。又《廣雅》『正,君也』。《尚賢》篇曰:『堯、舜、禹、湯、文、武之所以王天下正諸侯者』。凡墨子書言正天下正諸侯者,非訓為長,即訓為君,皆非征伐之謂。」案:王說是也。《呂氏春秋·順民》篇云:「湯克夏而正天下」,高誘注云:「正,治也」,亦非。桓公去國而霸諸侯,越王句踐遇吳王之醜,蘇時學云:「醜,猶恥也。」詒讓案:《呂氏春秋·不侵》篇「欲醜之以辭」,高注云:「醜,或作恥。」而尚攝中國之賢君,畢云:「尚與上通。攝,合也,謂合諸侯。郭璞注爾雅云:『聶,合』,攝同聶。」案:畢說未允。攝當與懾通,《左·襄十一年傳》云:「武震以攝威之」,《韓詩外傳》云:「上攝萬乘,下不敢敖乎匹夫」,此義與彼同,謂越王之威足以懾中國賢君也。三子之能達名成功於天下也,皆於其國抑而大醜也。畢云:「猶曰安其大醜。《廣雅》云:『抑,安也』」。俞樾云:「抑之言屈抑也。抑而大醜,與達名成功相對,言於其國則抑而大醜,於天下則達名成功,正見其由屈抑而達,下文所謂敗而有以成也。畢注於文義未得。」案:俞說是也。太上無敗,畢云:「李善文選注云:『河上公注老子云:太上,謂太古無名之君也』。」案:太上,對其次為文,謂等之最居上者,不論時代今古也。畢引老子注義,與此不相當。其次敗而有以成,此之謂用民。言以親士,故能用其民也。"

for match in re.finditer(r"your regex goes here!", test_input):
    print(match.group(1)) # group() extracts the text of a group from a matched regex: so your regex must have a group in it

Now modify your regex so that instead of getting book titles together with chapter titles, your regex only captures the title of the work – i.e., capture “呂氏春秋” instead of “呂氏春秋·順民”, and “左” instead of “左·襄十一年傳”.

Optional: Bonus points if you can also capture the chapter title on its own in a separate regex group at the same time. This is a bit fiddly though, and we don’t need to do it for this exercise.

  • Now modify the example code below (it’s almost identical to one of the examples above) so that it lists how often every title was mentioned in the 墨子閒詁 (a commentary on the classic text “墨子” – in this example we only use the first chapter, though the code can also be run on the whole text by changing the URN).
  • Then modify your code so that it only lists the top 10 most frequently mentioned texts. Hint: “unique_items” is a list, and getting part of a list is very similar to getting part of a string.
In [ ]:
test_input = gettextasstring("ctp:mozi-jiangu/qin-shi")

match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable

for match in re.finditer(r"your regex goes here!", test_input):
    matched_text = match.group(1)
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
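On the slicing hint: taking part of a list uses exactly the same [start:end] notation as taking part of a string. A quick illustration using a made-up list:

```python
# Slicing works the same way for lists as for strings:
unique_items = ["以為", "無為", "不為", "敢為", "而為"]
print(unique_items[:3])   # first three items: ['以為', '無為', '不為']
print(unique_items[0:3])  # exactly the same thing
```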

Dictionaries also allow us to produce graphs summarizing our data.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
%matplotlib inline

# Unfortunately some software still has difficulty dealing with Chinese.
# Here we may need to tell matplotlib to use a specific font containing Chinese characters.
# If your system doesn't display the Chinese text in the graph below, you may need to specify a different font name.
import platform
if platform.system() == 'Darwin':   # I.e. if we're running on Mac OS X
    mpl.rcParams['font.family'] = "Arial Unicode MS"
else:
    mpl.rcParams['font.family'] = "SimHei"

mpl.rcParams['font.size'] = 20

# The interesting stuff happens here:

s = pd.Series(match_count)
s = s.sort_values(ascending=False)
s = s[:10]
s.plot(kind='barh')

Now modify your regex so that you only match texts that are cited as pairs of book title and chapter, i.e. you should only match cases like “《呂氏春秋·順民》” (and not 《呂氏春秋》), and capture into a group the full title (“呂氏春秋·順民” in this example). This may be harder than it looks! You will need to be careful that your regex does not sometimes match too much text.

Re-run the above programs to find out (and graph) which chapters of which texts are most frequently cited in this way by this commentary.
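The warning about matching too much text usually comes down to greedy matching: as the syntax table above notes, * and + grab the longest possible match unless followed by ? to make them ungreedy. A minimal illustration (not itself a solution to the exercise):

```python
import re

text = "《爾雅》曰:『正,長也。』又《廣雅》『正,君也』。"
# Greedy: ".*" runs to the *last* 》, producing one long wrong match
print(re.findall(r"《(.*)》", text))
# Ungreedy: ".*?" stops at the *first* 》 after each 《
print(re.findall(r"《(.*?)》", text))   # ['爾雅', '廣雅']
```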

Replacing and Splitting with Regexes

As well as finding things, regexes are ideal for other very useful tasks including replacing and splitting textual data.

For example, we saw in the last notebook cases where it would be easier to process a text without any punctuation in it. We can easily match all punctuation using a regex, and once we know how to search and replace, we can just replace each matched piece of punctuation with a blank string to get an unpunctuated text.

We can do a simple search-and-replace using a regex like this:

In [63]:
import re

input_text = "道可道,非常道。"
print(re.sub(r"道", r"名", input_text))
名可名,非常名。

For very simple regexes that don’t use any special regex characters, this gives exactly the same result as replace(). But because we can specify patterns, we can do much more powerful replacements.

In [64]:
input_text = "道可道,非常道。"
print(re.sub(r"[。,]", r"", input_text))
道可道非常道

Of course, as usual the power of this is that we can quickly do it for however much data we like:

In [65]:
laozi = gettextasstring("ctp:dao-de-jing")
print(re.sub(r"[。,;?:!、]", r"", laozi))
道可道非常道名可名非常名無名天地之始有名萬物之母故常無欲以觀其妙常有欲以觀其徼此兩者同出而異名同謂之玄玄之又玄衆妙之門

天下皆知美之為美斯惡已皆知善之為善斯不善已故有無相生難易相成長短相較高下相傾音聲相和前後相隨是以聖人處無為之事行不言之教萬物作焉而不辭生而不有為而不恃功成而弗居夫唯弗居是以不去
...

Another useful aspect is that we can use data from regex groups that we matched within our replacement. This makes it easy to write replacements that do things like add some particular string before or after something we want to match. This example finds any punctuation character, puts it in regex group 1, and then replaces it with regex group 1 followed by a return character – in other words, it adds a line break after every punctuation character.

In [66]:
laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。"
print(re.sub(r"([。,;?:!、])", r"\1\n", laozi))
上德不德,
是以有德;
下德不失德,
是以無德。
上德無為而無以為;
下德為之而有以為。

Regexes and text files

Regular expressions can be very useful when we want to transform text from one format to another, or when we want to read text from a file and it isn’t in the format we want.

In this section, instead of using the ctext.org API, we will experiment with a text from Project Gutenberg. Before starting, download the plain text UTF-8 file from the website and save it on your computer as a file called “mulan.txt”. You should save this in the same folder as this Jupyter notebook (.ipynb) file.

Note: you don’t have to save files in the same folder as your Jupyter notebook, but if you save them somewhere else, when opening the file you will need to tell Python the full path to your file instead of just the filename – e.g. “C:\Users\user\Documents\mulan.txt” instead of just “mulan.txt”.

In [67]:
file = open("mulan.txt", "r", encoding="utf-8")
data_from_file = file.read()
file.close()

One practical issue when dealing with a lot of data in a string is that printing it to the screen so we can see what’s happening in our program may take up a lot of space. One thing we can do is to just print a substring – i.e. only print the first few hundred or so characters:

In [68]:
print(data_from_file[0:700])
The Project Gutenberg EBook of Mu Lan Ji Nu Zhuan, by Anonymous

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

Title: Mu Lan Ji Nu Zhuan

Author: Anonymous

Editor: Anonymous

Release Date: December 20, 2007 [EBook #23938]

Language: Chinese

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK MU LAN JI NU ZHUAN ***

序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,
求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,
或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其
難也。性命之理,有不悟性根者,有不知命蒂者,有修性

It will be handy if we can automatically delete the English blurb at the top of this file. There are several ways we could do this. One way is to use a negative character class – matching everything except some set of characters – to match all characters that are non-Chinese, and delete them.

The re.sub() function takes three parameters:

  1. The regular expression to match
  2. What we want to replace each match with
  3. The string we want to do the matching in

It returns a new string containing the result after making the substitutions.

[The example below also makes use of another kind of special syntax in a character class: we can match a range of characters by their Unicode codepoints. Here we match everything from U+25A1 through U+FFFF, a range which covers the Chinese characters in this file (along with many other non-ASCII characters). Don't worry too much about the contents of this regex - we won't need to write regexes like this most of the time.]

In [69]:
new_data = re.sub(r'[^\n\r\u25A1-\uFFFF]', "", data_from_file)
print(new_data[0:700])


序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,
求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,
或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其
難也。性命之理,有不悟性根者,有不知命蒂者,有修性命而旁歧雜出者,有修性命而
後先倒置者。涵養未得其中,即性命未盡其奧也。乃木蘭一女子耳,擔荷倫常,研求性
命,而獨無所不盡也哉!

  予幼讀《木蘭詩》,觀其代父從軍,可謂孝矣;立功絕塞,可謂忠矣。後閱《唐書
》,言木蘭唐女,西陵人,嫻弓馬,諳韜略,轉戰沙漠,累大功十二,何其勇也。封武
昭將軍,凱旋還里。當時筮者謂致亂必由武姓,讒臣嫁禍武昭,詔徵至京。木蘭具表陳
情,掣劍剜心,示使者,目視而死。死後,位證雷部大神,何其烈也。去冬閱《木蘭奇
女傳》,復知其幼而領悟者性命也,長而行持者性命也。且通部議論極精微,極顯豁,
又無非性命之妙諦也。盡人所當盡,亦盡人所難盡。惟其無所不盡,則亦無所不奇。而
人奇,行奇,事奇,文奇,讀者莫不驚奇叫絕也。此書相傳為奎斗馬祖所演,卷首有武
聖帝序。今序已失,同人集貲付梓。書成,爰敘其緣起如此。

      書於滬江梅花書館南窗之下

第一回朱若虛孝弟全天性 朱天錫聰明識童謠

  古樂府所載《木蘭辭》,乃唐初國師李藥師所作也。藥師名靖,號青蓮,又號三元
道人。先生少日,負經天緯地之才,抱治國安民之志,佐太宗平隋亂,開唐基,官拜太
傅,賜爵趙公。晚年修道,煉性登仙。蓋先生盛代奇人,故能識奇中奇人,

We’ve got rid of the English text, but we’ve now got too many empty lines. Depending on what data is in the text, we might want to remove all the line breaks… but in this case there are some things like chapter titles that are best kept on separate lines so we can tell where the chapters begin and end.

Remember: “\n” means “one line break”, and “{3,}” will match 3 or more of something one after the other (and as many times as possible).

In [70]:
without_spaces = re.sub(r'\n{3,}', "\n\n", new_data)  # This regex matches three or more line breaks, and replaces them with two
print(without_spaces[0:700])


序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,
求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,
或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其
難也。性命之理,有不悟性根者,有不知命蒂者,有修性命而旁歧雜出者,有修性命而
後先倒置者。涵養未得其中,即性命未盡其奧也。乃木蘭一女子耳,擔荷倫常,研求性
命,而獨無所不盡也哉!

  予幼讀《木蘭詩》,觀其代父從軍,可謂孝矣;立功絕塞,可謂忠矣。後閱《唐書
》,言木蘭唐女,西陵人,嫻弓馬,諳韜略,轉戰沙漠,累大功十二,何其勇也。封武
昭將軍,凱旋還里。當時筮者謂致亂必由武姓,讒臣嫁禍武昭,詔徵至京。木蘭具表陳
情,掣劍剜心,示使者,目視而死。死後,位證雷部大神,何其烈也。去冬閱《木蘭奇
女傳》,復知其幼而領悟者性命也,長而行持者性命也。且通部議論極精微,極顯豁,
又無非性命之妙諦也。盡人所當盡,亦盡人所難盡。惟其無所不盡,則亦無所不奇。而
人奇,行奇,事奇,文奇,讀者莫不驚奇叫絕也。此書相傳為奎斗馬祖所演,卷首有武
聖帝序。今序已失,同人集貲付梓。書成,爰敘其緣起如此。

      書於滬江梅花書館南窗之下

第一回朱若虛孝弟全天性 朱天錫聰明識童謠

  古樂府所載《木蘭辭》,乃唐初國師李藥師所作也。藥師名靖,號青蓮,又號三元
道人。先生少日,負經天緯地之才,抱治國安民之志,佐太宗平隋亂,開唐基,官拜太
傅,賜爵趙公。晚年修道,煉性登仙。蓋先生盛代奇人,故能識奇中奇人,保全奇中奇
人。奇中奇人為誰?即朱氏木蘭也。

  木蘭女年十

Exercise 7: (Harder) Make another substitution using a regex which removes only the line breaks within a paragraph (and does not remove linebreaks before and after a chapter title). The output should look like this:

序

嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其難也。性命之理,有不悟性根者,有不知命蒂者,有修性命而旁歧雜出者,有修性命而後先倒置者。涵養未得其中,即性命未盡其奧也。乃木蘭一女子耳,擔荷倫常,研求性命,而獨無所不盡也哉!

  予幼讀《木蘭詩》,觀其代父從軍,可謂孝矣;立功絕塞,可謂忠矣。後閱《唐書》,言木蘭唐女,西陵人,嫻弓馬,諳韜略,轉戰沙漠,累大功十二,何其勇也。封武昭將軍,凱旋還里。當時筮者謂致亂必由武姓,讒臣嫁禍武昭,詔徵至京。木蘭具表陳情,掣劍剜心,示使者,目視而死。死後,位證雷部大神,何其烈也。去冬閱《木蘭奇女傳》,復知其幼而領悟者性命也,長而行持者性命也。且通部議論極精微,極顯豁,又無非性命之妙諦也。盡人所當盡,亦盡人所難盡。惟其無所不盡,則亦無所不奇。而人奇,行奇,事奇,文奇,讀者莫不驚奇叫絕也。此書相傳為奎斗馬祖所演,卷首有武聖帝序。今序已失,同人集貲付梓。書成,爰敘其緣起如此。

      書於滬江梅花書館南窗之下

Hint: Think about what you need to match to make the change. You may need to include some of the things that your regex matches in the replacement using references (i.e. \1, \2, etc.).

In [ ]:
without_spaces2 = re.sub(r"your regex goes here!", r"", without_spaces)
print(without_spaces2[0:700])

Exercise 8: The text contains comments in it which we might want to delete before doing further processing or calculating any statistics. Create a regex substitution which removes each of these comments.

Example comment: …居於湖廣黃州府西陵縣(今之黃陂縣)雙龍鎮。 => should become …居於湖廣黃州府西陵縣雙龍鎮。

Make sure to check that your regex does not remove too much text!

In [ ]:
without_comments = re.sub(r"your regex goes here!", r"", without_spaces2)
print(without_comments[0:1000])

Exercise 9: Experiment with writing regexes to list things that look like chapter titles in the text. There are several ways this can be done. (There are 32 numbered chapters in this text.)

In [ ]:
for match in re.finditer(r"your regex goes here!", without_spaces2):
    matched_text = match.group(1)
    print(matched_text)
  • Next, use your chapter-detecting regex to add immediately before each chapter the text “CHAPTER_STARTS_HERE”.
In [ ]:
# Your code goes here!

Lastly, we can use a regex to split a string variable into a Python list using the re.split() function. At any point in the string where the specified regex is matched, the data is split into pieces. For instance:

In [71]:
laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。上仁為之而無以為;上義為之而有以為。上禮為之而莫之應,則攘臂而扔之。故失道而後德,失德而後仁,失仁而後義,失義而後禮。"
laozi_phrases = re.split(r"[。,;]", laozi)
for number in range(0, len(laozi_phrases)):
    print(str(number) + ". " + laozi_phrases[number])
0. 上德不德
1. 是以有德
2. 下德不失德
3. 是以無德
4. 上德無為而無以為
5. 下德為之而有以為
6. 上仁為之而無以為
7. 上義為之而有以為
8. 上禮為之而莫之應
9. 則攘臂而扔之
10. 故失道而後德
11. 失德而後仁
12. 失仁而後義
13. 失義而後禮
14.

Use re.split() to split your full text into a Python list, in which each chapter is one list item. (For simplicity you can ignore things like the preface etc.)

In [ ]:
# Your code goes here!
# Call your list variable "chapters"

Now that we have this data in a Python list, we can do things to each chapter individually. We can also put each of the chapters into its own text file – this is something we will sometimes need to do when we want to use other tools that are not written in Python.

In [ ]:
for chapternumber in range(0,len(chapters)):
    file = open("mulan-part-" + str(chapternumber) + ".txt", "w", encoding="utf-8")
    file.write(chapters[chapternumber] + "\n")
    file.close()


Regular expressions with Text Tools for ctext.org

Along with other functions such as automated text reuse identification, the “Text Tools” plugin for ctext.org can use the ctext API to import textual data from ctext.org directly for analysis with regular expressions. A step-by-step online tutorial describes how to actually use the tool (see also the instructions on the tool’s own help page); here I will give some concrete examples of what the tool can be used to do.

Regular expressions (often shortened to “regexes”) are a powerful extension of the type of simple string search widely available in computer software (e.g. word processors, web browsers, etc.): a regular expression is a specification of something to be matched in some body of text. At their simplest, regular expressions can simply be strings of characters to search for, like “君子” or “巧言令色”. At its most basic, you can use Text Tools to search for multiple terms within a text by entering your terms one per line in the “Regex” tab:

Text Tools will highlight each match in a different color, and show only the paragraphs with at least one match. Of course, you can specify as many search terms as you like, for example:

Clicking on any of the matched terms adds it as a “constraint”, meaning that only passages containing that term will be shown (though still highlighting any other matches present). For instance, clicking “君子” will show all the passages with the term “君子” in them, while still highlighting any other matches:

As with the similarity function of the same plugin, if your regular expression query results in relational data, this can be visualized as a network graph. This is done by setting “Group rows by” to either “Paragraph” or “Chapter”, which gives results in the “Summary” tab tabulated by paragraph (or chapter) – each row represents a paragraph which matched a term, and each column corresponds to one of the matched items:

This can be visualized as a network graph in which edges represent co-occurrence of terms within the same paragraph, and edge weights represent the number of times such co-occurrence is repeated in the texts selected:

This makes it clear where the most frequently repeated co-occurrences occur in the selected corpus – in this example, “君子” and “小人”, “君子” and “禮”, etc. Similarly to the way in which similarity graphs created with the Text Tools plugin work, double-clicking on any edge in the graph returns to the “Regex” tab with the two terms joined by that edge chosen as constraints, thus listing all the passages in which those terms co-occur, this being the data explaining the selected edge:

So far these examples have used fixed lists of search strings. But as the name suggests, the “Regex” tool also supports regular expressions, and so by making use of standard regular expression syntax, it’s possible to make far more sophisticated queries. [If you haven't come across regular expressions before, some examples are covered in the regex section of the Text Tools tutorial.] For example, we could write a regular expression that matches any one of a specified set of color terms, followed by any other character, and see how these are used in the Quan Tang Shi (my example regex is “[黑白紅]\w”: match any one of “黑”, “白”, or “紅”, followed by one non-punctuation character):

If we use “Group by: None”, we get total counts of each matched value – i.e. counts of how frequently “白雪”, “白水”, “紅葉”, and whatever other combinations there are occurred in our text. We can then use the “Chart” link to chart these results and get an overview of the most frequently used combinations:
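The same pattern can be tried in plain Python as well; the sample line below is invented purely for illustration (the analysis described above runs on the full Quan Tang Shi inside Text Tools):

```python
import re
from collections import Counter

# An invented sample line standing in for the full text:
sample = "白雲千載空悠悠,紅葉晚蕭蕭,白雪歌送武判官歸京。"

# \w matches "word" characters, which for Python 3 strings includes
# Chinese characters but excludes punctuation like , and 。
counts = Counter(m.group(0) for m in re.finditer(r"[黑白紅]\w", sample))
print(counts)  # counts of 白雲, 紅葉, 白雪
```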

If we go back to the Regex tab and set “Group by” to “Paragraph”, we can visualize the relationships just like in the Analects example — except that this time we don’t need to specify a list of terms, rather these terms can be extracted using the pattern we specified as a regular expression (in this graph I have set “Skip edges with weight less than” to “2” to reduce clutter caused by pairs of terms that only ever occur once):

Although overall – as we can see from the bar chart above – combinations with “白” in them are the most common, the relational data shown in the graph above immediately highlights other features of the use of these color pairings: the three most frequent pairings in our data are actually pairings between “白” and “紅”, like “白雲” and “紅葉”, or “白髮” and “紅顏”. As before, our edges are linked to the data, so we can easily go back to the text to see how these are actually being used:

Regular expressions are a hugely powerful way of expressing patterns to search for in text — see the tutorial for more examples and a step-by-step walk-through.


Exploring text reuse with Text Tools for ctext.org

The plugin system and API for ctext.org make it possible to import textual data from ctext.org directly into other online tools. One such tool is the new “Text Tools” plugin, which provides a set of textual analysis and visualization tools designed to work with texts from ctext.org. There is a step-by-step online tutorial describing how to actually use the tool (as well as the instructions on the tool’s own help page); I won’t repeat those here, but instead will give some examples of what the tool can be used to do.

One of the most interesting features of the tool is its function to identify text reuse within and between texts (via the “Similarity” tab). This takes as input one or more texts, and identifies and visualizes similarities between them. For example, with the text of the Analects:

This uses a heat map effect somewhat similar to the ctext.org parallel passage feature: here n-grams are matched (e.g. 3-grams, i.e. triples of identical characters used in identical sequence), and overlapping matched n-grams are shown in successively brighter shades of red. By default, all paragraphs having any shared n-grams with anything else in the selected text or texts are shown. The visualization is interactive, so clicking on any highlighted section switches the view to show all locations in the chosen corpus containing the selected n-gram (which is then highlighted in blue, like the 6-gram “如己者過則勿” in the following image):
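The core of this kind of matching can be sketched in a few lines of Python: treating each passage as a plain string, the shared n-grams between two passages are simply the intersection of their n-gram sets. This is a minimal illustration of the idea, not the Text Tools implementation (the two sample strings are invented):

```python
# Sketch of n-gram overlap detection between two passages
def ngrams(text, n=3):
    """Return the set of all substrings of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

a = "學而時習之不亦說乎"
b = "吾亦時習之不亦樂乎"

shared = ngrams(a) & ngrams(b)  # 3-grams appearing verbatim in both passages
print(shared)
```

Overlapping shared n-grams (here 時習之, 習之不, 之不亦 together cover the 5-gram 時習之不亦) are what produce the successively brighter shades of red in the heat map view.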

Since the texts are read in from ctext.org via the API, the program also knows the structure of the text; clicking on “Chapter summary” shows instead a table of calculated total matches aggregated on a chapter-by-chapter basis:

This data is relational: each row expresses strength of similarity of a certain kind between two entities (two chapters of text). It can therefore be visualized as a weighted network graph – the Text Tools plugin can do this for you:

What’s nice about this type of graph is that every edge has a very concrete meaning: the edge weights are simply a representation of how much reuse there is between the two nodes (i.e. chapters) which it connects. Even better, this visualization is also interactive: double-clicking an edge (e.g. the edge connecting 先進 and 雍也) returns to the passage level visualization and lists all the similarities between those two specified chapters – in other words, it lists precisely the data forming the basis for the creation of that edge:
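The chapter-level table and graph can be understood as an aggregation step over the passage-level matches: for every pair of chapters, sum the n-gram overlaps across all their paragraph pairs, and use the totals as edge weights. A minimal sketch under that assumption (the toy “chapters” below are invented):

```python
from itertools import combinations

def ngrams(text, n=3):
    """Return the set of all substrings of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Invented toy "chapters", each a list of paragraph strings
chapters = {
    "學而": ["學而時習之不亦說乎", "其為人也孝弟"],
    "雍也": ["賢哉回也", "學而時習之"],
}

# Edge weight between two chapters = total shared n-grams over all paragraph pairs
edge_weights = {}
for (c1, ps1), (c2, ps2) in combinations(chapters.items(), 2):
    w = sum(len(ngrams(a) & ngrams(b)) for a in ps1 for b in ps2)
    if w:
        edge_weights[(c1, c2)] = w

print(edge_weights)
```

Because each weight is just a sum over concrete paragraph-level matches, “double-clicking an edge” amounts to listing exactly the paragraph pairs whose overlaps were summed to produce it.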

What this means is that the graph can be used as a map to see where similarities occur and with which to navigate the results. It also makes it possible to visualize broader trends in the data which might not be easily visible by looking directly at the raw data. For instance, in the following graph created using the tool from three early texts, several interesting patterns are observable at a glance (key: light green = Mozi; dark green = Zhuangzi; blue = Xunzi):

Some at-a-glance patterns suggested by this graph: chapters of the three texts tend to have stronger relationships within their own text, with a few exceptions. There are several disjoint clusters of chapters, which have text reuse relationships with other members of their own group, but not with the rest of the text they appear in – most striking is the group of eight “military chapters” of the Mozi at the top right of the graph, which have strong internal connections but none to anything else in the graph:

Double-clicking on some edges to view the full data indicates that some of these pairs have quite significant reuse relationships:

The only other entirely disjoint cluster is the group formed by the 大取 and 小取 pair of texts – in this case the edge is formed by one short but highly significant parallel:

Another interesting observation: of those Zhuangzi chapters having text reuse relationships with other chapters in the set considered, only the 天下 chapter lacks any significant reuse relationship with any other part of the Zhuangzi – though it does contain a significant parallel with the Xunzi:

Something similar is seen with the 賦 chapter of the Xunzi:

There is a lot of complex detail in this graph, and interpretation requires care and attention to the actual details of what is being “reused” (as well as the parameters of the comparison and visualization); the Text Tools program makes it possible to easily explore the larger trends while also being able to quickly jump into the detailed instance-level view to examine the underlying text. Text Tools works “out of the box” with texts from ctext.org read in via the API (in practice you will need an institutional subscription or API key to do this efficiently), but it can also use texts from other sources.

Further information:

Posted in Digital humanities

Searching ctext.org texts from another website

There are a number of ways to add direct full-text search of a ctext.org text to an external website. One of the most straightforward is to use the API “getlink” function to link to a text using its CTP URN. For example, to make a text box which will search this Harvard-Yenching copy of the 茶香閣遺草, first locate the corresponding transcribed text on ctext.org and go to the bottom-right of its contents page to get its URN (you need the contents page for the transcription, not the associated scan), which in this case is “ctp:wb417980” – this step can also be done programmatically via the API if you want to repeat it for a large number of texts. Once you have the URN, you can create an HTML form which sends the URN and any user-specified search term to the ctext API, which will redirect the user’s browser to the search results. For example, the following HTML creates a search box for 茶香閣遺草:

<form action="https://api.ctext.org/getlink" method="get">
  <input type="hidden" name="urn" value="ctp:wb417980" />
  <input type="text" name="search" />
  <input type="hidden" name="redirect" value="1" />
  <input type="submit" value="Search" />
</form>

This will display the following type of search box (try entering a search term in Chinese and clicking “Search”):

You can also supply the optional “if” and “remap” parameters if you want users of your form to be directed to the Chinese interface, or to use the simplified Chinese version of the site (the defaults are English and traditional Chinese). For the Chinese interface, add the following line between the <form …> and </form> tags:

  <input type="hidden" name="if" value="zh" />

For simplified Chinese, add this line:

  <input type="hidden" name="remap" value="gb" />

If you want to make a link to the text itself using the URN, you can also directly link to the API endpoint:

<a href="https://api.ctext.org/getlink?urn=ctp:wb417980&amp;redirect=1">茶香閣遺草</a>

Live example: 茶香閣遺草

Again, the “if” and “remap” parameters can also be supplied to choose the interface used, as per the API documentation.
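If you are generating links for many texts, the same getlink URLs can be constructed programmatically. A minimal sketch in Python (the helper function name is my own invention; the parameter names follow the form fields shown above):

```python
from urllib.parse import urlencode

def ctext_search_url(urn, search, interface=None, remap=None):
    """Build a getlink URL that redirects straight to the search results."""
    params = {"urn": urn, "search": search, "redirect": "1"}
    if interface:              # e.g. "zh" for the Chinese interface
        params["if"] = interface
    if remap:                  # e.g. "gb" for simplified Chinese
        params["remap"] = remap
    return "https://api.ctext.org/getlink?" + urlencode(params)

url = ctext_search_url("ctp:wb417980", "詩", interface="zh")
print(url)
```

`urlencode` takes care of percent-encoding the URN and any Chinese search term, so the resulting URL can be dropped into an `<a href="…">` as-is.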

Posted in Digital humanities

Spammy advertising best reason for switching to HTTPS

While transiting at Schiphol and using the airport wifi, I noticed the sudden appearance of a bunch of adverts on normally advert-free websites. For example:

Some investigation indicated that this time the adverts were not injected via Google Analytics, but instead attached directly into the HTML content of the page. First at the top we have some injected CSS:

Then at the bottom we have the real payload, injected JavaScript code:

It appears this is the same type of advertising afflicting AT&T hotspots – information gleaned from Jonathan Mayer, whose website describing the issue is itself also affected by the Schiphol adverts:

Again it seems that given the large scale involved, someone, somewhere – perhaps including a company called “RaGaPa” who seem to be responsible for the ads – is making quite a bit of money through unsavory and perhaps legally questionable means.

Just in case the adverts on their own are not spammy enough, the icon at the top right of the adverts links to the following explanation, casually noting that in addition to standard user tracking and serving ads based on user history, “You may also be redirected to sponsor’s websites or welcome pages at a set frequency”:

Perhaps the real take-home, though, is that HTTPS sites are, again, not affected by this: content injection of this type is not possible on sites served over HTTPS without defeating the certificate authority chain or sidestepping it with other kinds of trickery. Digital Sinology recently moved to HTTPS, so is not affected by this particular attack.

Posted in Off topic