Classical Chinese DH: Regular expressions
Regular expressions
A regular expression (a.k.a. regex or RE) is a pattern to be searched for in some body of text. These are not specific to Python, but by combining simple regular expressions with basic Python statements, we can quickly achieve powerful results.
Commonly used regex syntax
. | Matches any one character exactly once |
[abcdef] | Matches any one of the characters a,b,c,d,e,f exactly once |
[^abcdef] | Matches any one character **other than** a,b,c,d,e,f |
? | After a character/group, makes that character/group optional (i.e. match zero or 1 times) |
? | After +, * or {…}, makes matching ungreedy (i.e. choose shortest match, not longest) |
* | After a character/group, makes that character/group match zero or more times |
+ | After a character/group, makes that character/group match one or more times |
{2,5} | After a character/group, makes that character/group match 2,3,4, or 5 times |
{2,} | After a character/group, makes that character/group match 2 or more times |
\3 | Matches whatever was matched into group number 3 (first group from left is numbered 1) |
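A few of the entries above can be checked directly. The following is a small sketch using Python's built-in re module (introduced in the next paragraph); the sample string and variable names are chosen only for illustration.

```python
import re

sample = "失道而後德,失德而後仁"

# Greedy: ".+" takes as much as it can, so the match runs to the LAST "而".
greedy = re.findall(r"失.+而", sample)

# Ungreedy: ".+?" takes as little as it can, so each match stops at the NEXT "而".
ungreedy = re.findall(r"失.+?而", sample)

# Character class: "失" followed by either 道 or 德.
after_shi = re.findall(r"失[道德]", sample)

# Negated class: "德" followed by any one character other than ",".
not_punct = re.findall(r"德[^,]", sample)

print(greedy)
print(ungreedy)
print(after_shi)
print(not_punct)
```

The greedy pattern returns a single long match, while the ungreedy one returns two short ones; the negated class skips the "德" that is directly followed by a comma.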
To use regexes in Python, we use another module called “re” (this is a very common module and should already be installed).
import re
laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。上仁為之而無以為;上義為之而有以為。上禮為之而莫之應,則攘臂而扔之。故失道而後德,失德而後仁,失仁而後義,失義而後禮。"
for match in re.finditer(r".德", laozi):  # re.finditer returns "match objects", each of which describes one match
    matched_text = match.group(0)  # In Python, group(0) returns the full text that was matched
    print("Found a match: " + matched_text)
[Aside: in Python, regexes are often written as strings with an “r” in front of them, e.g. r"德" rather than just "德". All this does is tell Python not to interpret the contents of the string (e.g. backslashes) as meaning something else. The result of r"德" is still an ordinary string variable with 德 in it.]
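The difference only shows up when the string contains a backslash. A minimal sketch (the test string "青青子衿" is just an illustrative choice):

```python
import re

# A raw string keeps backslashes exactly as typed.
print(len("\n"))    # an ordinary string: one line-break character
print(len(r"\n"))   # a raw string: a backslash followed by the letter n

# Without backslashes, raw and ordinary strings are identical.
same = (r"德" == "德")

# Backreferences show why this matters: r"\1" reaches the regex engine as
# backslash + 1, whereas Python turns "\1" into the control character U+0001
# before the regex engine ever sees it.
doubled_raw = re.findall(r"(.)\1", "青青子衿")
doubled_plain = re.findall("(.)\1", "青青子衿")

print(doubled_raw)
print(doubled_plain)
```

Only the raw-string version finds the repeated character, which is why it is a safe habit to write every regex with the r prefix.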
Exercise 1 (very easy): Change the above code to verify the results of some of the simple example regexes from the slides. Try these ones:
- 而無以為
- 是以.德
- 失.而後.
- 上[仁義]為之
- 後(.),失\1
For the last of these (“後(.),失\1”), see what happens to the output when you change group(0) to group(1). (Change it back to group(0) afterwards though, as we will reuse this code using group(0).)
Exercise 2: Write regular expressions to match the following things (you can keep on modifying the example above to check that they work, but you may want to write down your answers somewhere – remember, you can edit this cell by double-clicking on it).
- Match any three characters where the middle character is “之” – i.e. “為之而”, “莫之應”, etc. Modify your regex so that it does not match things with punctuation in them, like “扔之。”.
- Match each “phrase” (i.e. punctuated section) of the text. In other words, the first match should be “上德不德”, the second should be “是以有德”, and so on. You only need to handle the three punctuation marks “。”, “,”, and “;”.
- Match each phrase which contains the term “之” in it. (Double check that you get 5 matches.)
We can do the same kind of thing on an entire text in one go if we have the whole text in a single string, as in the next example. (If we wanted to know which paragraph or chapter each match appeared in, we would instead run the same regex on each paragraph or chapter in turn.)
from ctext import *
setapikey("demo")
# The gettextasstring function gives us a single string variable with the whole text in it
laozi = gettextasstring("ctp:dao-de-jing")
for match in re.finditer(r"足.", laozi):
    matched_text = match.group(0)
    print(matched_text)
Exercise 3
- Often we don’t want to include matches that have punctuation in them. Modify the regex from the last example so that it excludes all the matches where the character after “足” is “,”, “。”, or “;”. (You should do this by modifying the regex; the rest of the code does not need to change.)
- Find all the occurrences of X可X – i.e. “道可道” and “名可名” (there is one more item that should be matched too).
- Modify your regex so you match all occurrences of XYX – i.e. not just “道可道” but also things like “學不學”. You may need to make some changes to avoid matching punctuation – we don’t want to match “三,三” or “、寡、”.
Exercise 4: (Optional) Using what was covered in the previous tutorial, write a program in the cell below to perform one of these searches again, but this time running it once on each paragraph in turn so that you can also print out the number of the passage in which each match occurs.
passages = gettextasparagraphlist("ctp:dao-de-jing")
# Your code goes here!
Dictionary variables
One of the advantages of using regexes from within a programming language like Python is that as well as simply finding results, we can easily do things to collate our data, such as count up how many times a regex gave various different results. Another type of variable that is useful here is the “dictionary” variable.
A dictionary variable works in a very similar way to a list, except that whereas in a list the items are numbered 0,1,2,… and accessed using these numbers, a dictionary uses other things – in the case we will look at, strings – to identify the items. This lets us “look up” values for different strings, just like looking up the translation of a word in a dictionary. The things we use instead of numbers to “look up” values in a dictionary are called “keys”.
Dictionaries can be defined in Python using the following notation:
my_titles = {"論語": "Analects", "孟子": "Mengzi", "荀子": "Xunzi"}
The above example defines one dictionary variable called “my_titles”, and sets values for three keys: “論語”, “孟子”, and “荀子”. Each of these keys is set to have the corresponding value (“Analects”, “Mengzi”, and “Xunzi” respectively). In this simple example, our dictionary gives us a way of translating Chinese-language titles into English-language titles.
We can access the items in a dictionary in a very similar way to accessing items from a list:
print(my_titles["論語"])
print(my_titles["荀子"])
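Unlike in a list, our items don’t have numbers, so we cannot ask for “item 0” or “item 1” (recent versions of Python do remember the order in which keys were added, but items are still looked up by key, not by position). So one thing we will sometimes need to do is to get a list of all the keys – i.e., a list telling us what things there are in our dictionary.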
list_of_titles = list(my_titles.keys())
print(list_of_titles)
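A few other everyday dictionary operations are sketched below, reusing the my_titles example. The extra entry and the fallback text are made up for illustration.

```python
my_titles = {"論語": "Analects", "孟子": "Mengzi", "荀子": "Xunzi"}

# Add a new key (or overwrite the value of an existing one) by assignment.
my_titles["墨子"] = "Mozi"

# "in" tests whether a key is present.
has_mozi = "墨子" in my_titles
has_zhuangzi = "莊子" in my_titles

# Looking up a missing key with [] raises a KeyError;
# get() returns a default value of our choosing instead.
fallback = my_titles.get("莊子", "(no translation recorded)")

# items() lets a loop see each key together with its value.
for chinese, english in my_titles.items():
    print(chinese + " = " + english)
```

The "in" test is the same check used later in this notebook before counting a match for the first time.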
Often we will store numbers in our dictionary; the keys will be strings, but the value for each key will be a number. This lets us do things like count how many times we’ve seen some particular string – for all of the strings we happen to come across at the same time, using just one dictionary variable. In cases like this, we will often want to sort the keys of the dictionary by their values. One way of doing this is using the “sorted” function:
# In this example, we use a dictionary to record people's year of birth
# Then we sort the keys (i.e. the names) by the values (i.e. year of birth)
year_of_birth = {"胡適": 1891, "梁啟超": 1873, "茅盾": 1896, "王韜": 1828, "魯迅": 1881}
list_of_people = sorted(year_of_birth, key=year_of_birth.get, reverse=False)
for name in list_of_people:
    print(name + " was born in " + str(year_of_birth[name]))
Don’t worry about the rather complex looking syntax for sorted() – you can just follow this model whenever you need to sort a dictionary (and change “reverse=False” to “reverse=True” if you want to reverse the list):
list_of_keys = sorted(my_dictionary, key=my_dictionary.get, reverse=False)
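For readers who do want to know what the model is doing, here is a short sketch using the year_of_birth data from above. The variable names oldest_first, youngest_first and by_name are only illustrative.

```python
year_of_birth = {"胡適": 1891, "梁啟超": 1873, "茅盾": 1896, "王韜": 1828, "魯迅": 1881}

# key=year_of_birth.get means: for each name, sorted() calls
# year_of_birth.get(name) and orders the names by the number that comes back.
oldest_first = sorted(year_of_birth, key=year_of_birth.get)
youngest_first = sorted(year_of_birth, key=year_of_birth.get, reverse=True)

# Without key=..., the names themselves are compared rather than the years.
by_name = sorted(year_of_birth)

print(oldest_first)
print(youngest_first)
```

In every case sorted() returns a new list of keys; the dictionary itself is left unchanged.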
Using a dictionary, we can keep track of every regex result we found, and at the same time collate the data. Instead of having a long list with repeated items in it, we build a dictionary in which the keys are the unique regex matches, and the values are the number of times we have seen that particular string.
match_count = {} # This tells Python that we're going to use match_count as a dictionary variable
for match in re.finditer(r"(.)為", laozi):
    matched_text = match.group(0) # e.g. "心為"
    if matched_text not in match_count:
        match_count[matched_text] = 0 # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1
# Our dictionary now contains a frequency count of each different pair we found
print("match_count contains: " + str(match_count))
# The sorted() function gets us a list of the items we matched, starting with the most frequent
unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
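As an aside, Python's standard library has a ready-made dictionary for counting, collections.Counter. The sketch below is an alternative to the pattern above, not a replacement for it; it uses a short hard-coded sample string so that it runs without the ctext API.

```python
import re
from collections import Counter

sample = "上德無為而無以為;下德為之而有以為。上仁為之而無以為;上義為之而有以為。"

# A Counter treats every unseen key as zero, so no "if ... not in ..." check is needed.
match_count = Counter()
for match in re.finditer(r"(.)為", sample):
    match_count[match.group(0)] += 1

# most_common() returns (item, count) pairs, most frequent first.
for item, count in match_count.most_common():
    print(item + " occurred " + str(count) + " times.")
```

The explicit version used in this notebook has the advantage of showing exactly what is happening at each step, which is why the exercises continue to use it.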
We can use this idea and almost exactly the same code to start answering quite complex questions about patterns appearing in texts. This code can tell us which actual phrases matching a certain pattern occurred most frequently.
For example, in poetry we often find various kinds of repetition. We can use part of the 詩經 as an example, and using a regex quickly find out which repeated XYXY patterns are most common:
shijing = gettextasstring("ctp:book-of-poetry/lessons-from-the-states")
match_count = {} # This tells Python that we're going to use match_count as a dictionary variable
for match in re.finditer(r"(.)(.)\1\2", shijing):
    matched_text = match.group(0)
    if matched_text not in match_count:
        match_count[matched_text] = 0 # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1
unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
Exercise 5: Write a regex to match paired lines of four-character poetry that both begin with the same two characters (examples: “亦既見止、亦既覯止”, “且以喜樂、且以永日”, etc.). Re-run the program above to verify your answer.
Exercise 6: Create a regex to match book titles that appear in punctuated Chinese texts, e.g. “《呂氏春秋》”. Your regex should extract the title without the punctuation marks into a group – i.e. you must use “(” and “)” in your regex. You can test it using the short program below – your output should look like this:
爾雅
廣雅
尚賢
呂氏春秋·順民
呂氏春秋·不侵
左·襄十一年傳
韓詩外傳
廣雅
test_input = "昔者文公出走而正天下,畢云:「正,讀如征。」王念孫云「畢讀非也,《爾雅》曰:『正,長也。』晉文為諸侯盟主,故曰『正天下』,與下『霸諸侯』對文。又《廣雅》『正,君也』。《尚賢》篇曰:『堯、舜、禹、湯、文、武之所以王天下正諸侯者』。凡墨子書言正天下正諸侯者,非訓為長,即訓為君,皆非征伐之謂。」案:王說是也。《呂氏春秋·順民》篇云:「湯克夏而正天下」,高誘注云:「正,治也」,亦非。桓公去國而霸諸侯,越王句踐遇吳王之醜,蘇時學云:「醜,猶恥也。」詒讓案:《呂氏春秋·不侵》篇「欲醜之以辭」,高注云:「醜,或作恥。」而尚攝中國之賢君,畢云:「尚與上通。攝,合也,謂合諸侯。郭璞注爾雅云:『聶,合』,攝同聶。」案:畢說未允。攝當與懾通,《左·襄十一年傳》云:「武震以攝威之」,《韓詩外傳》云:「上攝萬乘,下不敢敖乎匹夫」,此義與彼同,謂越王之威足以懾中國賢君也。三子之能達名成功於天下也,皆於其國抑而大醜也。畢云:「猶曰安其大醜。《廣雅》云:『抑,安也』」。俞樾云:「抑之言屈抑也。抑而大醜,與達名成功相對,言於其國則抑而大醜,於天下則達名成功,正見其由屈抑而達,下文所謂敗而有以成也。畢注於文義未得。」案:俞說是也。太上無敗,畢云:「李善文選注云:『河上公注老子云:太上,謂太古無名之君也』。」案:太上,對其次為文,謂等之最居上者,不論時代今古也。畢引老子注義,與此不相當。其次敗而有以成,此之謂用民。言以親士,故能用其民也。"
for match in re.finditer(r"your regex goes here!", test_input):
    print(match.group(1)) # group() extracts the text of a group from a matched regex: so your regex must have a group in it
Now modify your regex so that instead of getting book titles together with chapter titles, your regex only captures the title of the work – i.e., capture “呂氏春秋” instead of “呂氏春秋·順民”, and “左” instead of “左·襄十一年傳”.
Optional: Bonus points if you can also capture the chapter title on its own in a separate regex group at the same time. This is a bit fiddly though, and we don’t need to do it for this exercise.
- Now modify the example code below (it’s almost identical to one of examples above) so that it lists how often every title was mentioned in the 墨子閒詁 (a commentary on the classic text “墨子” – in this example we only use the first chapter, though the code can also be run on the whole text by changing the URN).
- Then modify your code so that it only lists the top 10 most frequently mentioned texts. Hint: “unique_items” is a list, and getting part of a list is very similar to getting part of a string.
test_input = gettextasstring("ctp:mozi-jiangu/qin-shi")
match_count = {} # This tells Python that we're going to use match_count as a dictionary variable
for match in re.finditer(r"your regex goes here!", test_input):
    matched_text = match.group(1)
    if matched_text not in match_count:
        match_count[matched_text] = 0 # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1
unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")
Dictionaries also allow us to produce graphs summarizing our data.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
%matplotlib inline
# Unfortunately some software still has difficulty dealing with Chinese.
# Here we may need to tell matplotlib to use a specific font containing Chinese characters.
# If your system doesn't display the Chinese text in the graph below, you may need to specify a different font name.
import platform
if platform.system() == 'Darwin': # I.e. if we're running on Mac OS X
    mpl.rcParams['font.family'] = "Arial Unicode MS"
else:
    mpl.rcParams['font.family'] = "SimHei"
mpl.rcParams['font.size'] = 20
# The interesting stuff happens here:
s = pd.Series(match_count)
s.sort_values(ascending=False, inplace=True) # Most frequent first; recent versions of pandas only accept these options by name
s = s[:10]
s.plot(kind='barh')
Now modify your regex so that you only match texts that are cited as pairs of book title and chapter, i.e. you should only match cases like “《呂氏春秋·順民》” (and not 《呂氏春秋》), and capture into a group the full title (“呂氏春秋·順民” in this example). This may be harder than it looks! You will need to be careful that your regex does not sometimes match too much text.
Re-run the above programs to find out (and graph) which chapters of which texts are most frequently cited in this way by this commentary.
Replacing and Splitting with Regexes
As well as finding things, regexes are ideal for other very useful tasks including replacing and splitting textual data.
For example, we saw in the last notebook cases where it would be easier to process a text without any punctuation in it. We can easily match all punctuation using a regex, and once we know how to search and replace, we can just replace each matched piece of punctuation with a blank string to get an unpunctuated text.
We can do a simple search-and-replace using a regex like this:
import re
input_text = "道可道,非常道。"
print(re.sub(r"道", r"名", input_text))
For very simple regexes that don’t use any special regex characters, this gives exactly the same result as replace(). But because we can specify patterns, we can do much more powerful replacements.
input_text = "道可道,非常道。"
print(re.sub(r"[。,]", r"", input_text))
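The contrast with replace() can be made explicit. This is a small sketch; the variable names are only illustrative.

```python
import re

input_text = "道可道,非常道。"

# For a plain search string, the two approaches agree.
with_replace = input_text.replace("道", "名")
with_sub = re.sub(r"道", r"名", input_text)

# replace() treats its argument literally: it looks for the exact text "[。,]",
# brackets included, finds nothing, and returns the string unchanged.
literal_attempt = input_text.replace("[。,]", "")

# re.sub() reads the same text as a character class and removes both marks.
pattern_attempt = re.sub(r"[。,]", "", input_text)

print(with_replace, with_sub)
print(literal_attempt)
print(pattern_attempt)
```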
Of course, as usual the power of this is that we can quickly do it for however much data we like:
laozi = gettextasstring("ctp:dao-de-jing")
print(re.sub(r"[。,;?:!、]", r"", laozi))
Another useful aspect is that we can use data from regex groups that we matched within our replacement. This makes it easy to write replacements that do things like add some particular string before or after something we want to match. This example finds any punctuation character, puts it in regex group 1, and then replaces it with regex group 1 followed by a return character – in other words, it adds a line break after every punctuation character.
laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。"
print(re.sub(r"([。,;?:!、])", r"\1\n", laozi))
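The same group references work for other kinds of edits as well. The sketch below wraps each match in brackets and, with two groups, swaps their order; the sample line and the patterns are chosen only for illustration.

```python
import re

line = "道可道,非常道。名可名,非常名。"

# \1 in the replacement stands for whatever group 1 matched, so this puts
# brackets around every 道 or 名 without deleting the character itself.
bracketed = re.sub(r"([道名])", r"[\1]", line)

# With two groups, the replacement can also reorder what was matched.
swapped = re.sub(r"(非)(常)", r"\2\1", line)

print(bracketed)
print(swapped)
```

Note that the replacement is also written as a raw string, so that \1 and \2 reach the re module as group references.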
Regexes and text files
Regular expressions can be very useful when we want to transform text from one format to another, or when we want to read text from a file and it isn’t in the format we want.
In this section, instead of using the ctext.org API, we will experiment with a text from Project Gutenberg. Before starting, download the plain text UTF-8 file from the website and save it on your computer as a file called “mulan.txt”. You should save this in the same folder as this Jupyter notebook (.ipynb) file.
Note: you don’t have to save files in the same folder as your Jupyter notebook, but if you save them somewhere else, when opening the file you will need to tell Python the full path to your file instead of just the filename – e.g. “C:\Users\user\Documents\mulan.txt” instead of just “mulan.txt”.
file = open("mulan.txt", "r", encoding="utf-8")
data_from_file = file.read()
file.close()
One practical issue when dealing with a lot of data in a string is that printing it to the screen so we can see what’s happening in our program may take up a lot of space. One thing we can do is to just print a substring – i.e. only print the first few hundred or so characters:
print(data_from_file[0:700])
One thing that will be handy is if we can delete the English blurb at the top of this file automatically. There are several ways we could do this. One way is to use a negative character class – matching everything except some set of characters – to match all characters that are non-Chinese, and delete them.
The re.sub() function takes three parameters:
- The regular expression to match
- What we want to replace each match with
- The string we want to do the matching in
It returns a new string containing the result after making the substitution.
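A quick sketch of those points: the original string is not modified, and re.sub() also accepts an optional count argument that limits how many matches are replaced.

```python
import re

original = "道可道,非常道。"

# re.sub(pattern, replacement, string) returns a new string...
changed = re.sub(r"道", "名", original)

# ...and the string that was passed in is left untouched.
print(original)
print(changed)

# count=1 replaces only the first match.
first_only = re.sub(r"道", "名", original, count=1)
print(first_only)
```

Because the original is unchanged, the result has to be stored in a variable (or printed) to be of any use.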
[The example below also makes use of another kind of special syntax in a character class: we can match a range of characters by their Unicode codepoint. Here we match everything from U+25A1 through U+FFFF, a range that includes Chinese characters and full-width punctuation (along with some other symbols and scripts, which does no harm as long as they do not occur in the text we are cleaning). Don’t worry too much about the contents of this regex – we won’t need to write regexes like this most of the time.]
new_data = re.sub(r'[^\n\r\u25A1-\uFFFF]', "", data_from_file)
print(new_data[0:700])
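To see what this character class keeps and removes, it can help to run it on a tiny made-up string first. The sample below is invented for illustration and is not taken from the Gutenberg file.

```python
import re

sample = "Title: 木蘭奇女傳\r\nBy 無名氏 ★ 한\n"

# Keep line breaks (\n, \r) and anything from U+25A1 to U+FFFF; delete the rest.
kept = re.sub(r'[^\n\r\u25A1-\uFFFF]', "", sample)

# repr() makes the line-break characters visible in the output.
print(repr(kept))
```

Latin letters, the colon and the spaces are removed, while the Chinese characters and line breaks survive. The star and the Korean syllable survive too, since they also fall inside the range, so this approach relies on such characters not appearing in the parts of the file we want to discard.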
We’ve got rid of the English text, but we’ve now got too many empty lines. Depending on what data is in the text, we might want to remove all the line breaks… but in this case there are some things like chapter titles that are best kept on separate lines so we can tell where the chapters begin and end.
Remember: “\n” means “one line break”, and “{3,}” will match 3 or more of something one after the other (and as many times as possible).
without_spaces = re.sub(r'\n{3,}', "\n\n", new_data) # This regex matches three or more line breaks, and replaces them with two
print(without_spaces[0:700])
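The effect of {3,} is easiest to see on a short string with runs of different lengths. A minimal sketch with an invented sample:

```python
import re

# Runs of 4, 1, 2 and 3 line breaks respectively.
sample = "序\n\n\n\n第一段\n第二段\n\n第三段\n\n\n"

collapsed = re.sub(r'\n{3,}', "\n\n", sample)

print(repr(sample))
print(repr(collapsed))
```

Runs of three or more line breaks are cut down to two, while single and double line breaks are left exactly as they were.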
Exercise 7: (Harder) Make another substitution using a regex which removes only the line breaks within a paragraph (and does not remove linebreaks before and after a chapter title). The output should look like this:
序
嘗思人道之大,莫大於倫常;學問之精,莫精於性命。自有書籍以來,所載傳人不少,求其交盡乎倫常者鮮矣,求其交盡乎性命者益鮮矣。蓋倫常之地,或盡孝而不必兼忠,或盡忠而不必兼孝,或盡忠孝而安常處順,不必兼勇烈。遭際未極其變,即倫常未盡其難也。性命之理,有不悟性根者,有不知命蒂者,有修性命而旁歧雜出者,有修性命而後先倒置者。涵養未得其中,即性命未盡其奧也。乃木蘭一女子耳,擔荷倫常,研求性命,而獨無所不盡也哉!
予幼讀《木蘭詩》,觀其代父從軍,可謂孝矣;立功絕塞,可謂忠矣。後閱《唐書》,言木蘭唐女,西陵人,嫻弓馬,諳韜略,轉戰沙漠,累大功十二,何其勇也。封武昭將軍,凱旋還里。當時筮者謂致亂必由武姓,讒臣嫁禍武昭,詔徵至京。木蘭具表陳情,掣劍剜心,示使者,目視而死。死後,位證雷部大神,何其烈也。去冬閱《木蘭奇女傳》,復知其幼而領悟者性命也,長而行持者性命也。且通部議論極精微,極顯豁,又無非性命之妙諦也。盡人所當盡,亦盡人所難盡。惟其無所不盡,則亦無所不奇。而人奇,行奇,事奇,文奇,讀者莫不驚奇叫絕也。此書相傳為奎斗馬祖所演,卷首有武聖帝序。今序已失,同人集貲付梓。書成,爰敘其緣起如此。
書於滬江梅花書館南窗之下
Hint: Think about what you need to match to make the change. You may need to include some of the things that your regex matches in the replacement using references (i.e. \1, \2, etc.).
without_spaces2 = re.sub(r"your regex goes here!", r"", without_spaces)
print(without_spaces2[0:700])
Exercise 8: The text contains comments in it which we might want to delete before doing further processing or calculating any statistics. Create a regex substitution which removes each of these comments.
Example comment: …居於湖廣黃州府西陵縣(今之黃陂縣)雙龍鎮。 => should become …居於湖廣黃州府西陵縣雙龍鎮。
Make sure to check that your regex does not remove too much text!
without_comments = re.sub(r"your regex goes here!", r"", without_spaces2)
print(without_comments[0:1000])
Exercise 9: Experiment with writing regexes to list things that look like chapter titles in the text. There are several ways this can be done. (There are 32 numbered chapters in this text.)
for match in re.finditer(r"your regex goes here!", without_spaces2):
    matched_text = match.group(1)
    print(matched_text)
- Next, use your chapter-detecting regex to add immediately before each chapter the text “CHAPTER_STARTS_HERE”.
# Your code goes here!
Lastly, we can use a regex to split a string variable into a Python list using the re.split() function. At any point in the string where the specified regex is matched, the data is split into pieces. For instance:
laozi = "上德不德,是以有德;下德不失德,是以無德。上德無為而無以為;下德為之而有以為。上仁為之而無以為;上義為之而有以為。上禮為之而莫之應,則攘臂而扔之。故失道而後德,失德而後仁,失仁而後義,失義而後禮。"
laozi_phrases = re.split(r"[。,;]", laozi)
for number in range(0, len(laozi_phrases)):
    print(str(number) + ". " + laozi_phrases[number])
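One detail worth knowing: when the text ends with one of the delimiters, re.split() returns an empty string as the last item. The sketch below shows this on a shorter string and one possible way of dropping the empty items; the variable names are illustrative.

```python
import re

text = "道可道,非常道。"

# The text ends with "。", so the final piece is an empty string.
pieces = re.split(r"[。,]", text)
print(pieces)

# Keep only the pieces that actually contain something.
non_empty = [piece for piece in pieces if piece != ""]

# enumerate() is an alternative to range(len(...)) that hands us
# the number and the item together.
numbered = [str(number) + ". " + phrase for number, phrase in enumerate(non_empty)]
for line in numbered:
    print(line)
```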
Use re.split() to split your full text into a Python list, in which each chapter is one list item. (For simplicity you can ignore things like the preface etc.)
# Your code goes here!
# Call your list variable "chapters"
Now we have this data in a Python list, we can do things to each chapter individually. We can also put each of the chapters into its own text file – this is something we will sometimes need to do when we want to use other tools that are not in Python.
for chapternumber in range(0,len(chapters)):
    file = open("mulan-part-" + str(chapternumber) + ".txt", "w", encoding="utf-8")
    file.write(chapters[chapternumber] + "\n")
    file.close()
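The same loop can also be written with Python's "with" statement, which closes each file automatically. In this sketch the chapters list is a hypothetical stand-in for the one built in the exercise, and the files are written to a temporary folder (an assumption made so that the example leaves nothing behind) rather than next to the notebook.

```python
import os
import tempfile

# Hypothetical stand-in for the "chapters" list from the exercise above.
chapters = ["第一回 sample text", "第二回 sample text"]

with tempfile.TemporaryDirectory() as folder:
    for chapternumber, chapter_text in enumerate(chapters):
        path = os.path.join(folder, "mulan-part-" + str(chapternumber) + ".txt")
        # "with" closes the file for us, even if an error occurs while writing.
        with open(path, "w", encoding="utf-8") as file:
            file.write(chapter_text + "\n")

    # Check what was written by listing the folder and reading one file back.
    written_files = sorted(os.listdir(folder))
    with open(os.path.join(folder, written_files[0]), "r", encoding="utf-8") as file:
        first_chapter = file.read()

print(written_files)
print(first_chapter)
```

To keep the files, replace the temporary folder with an ordinary path, as in the loop above.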
Further reading:
- The browser-based Text Tools plugin for ctext.org supports regular expressions – an online tutorial for the plugin is available, which describes how to use it to investigate patterns with regexes.