Classical Chinese DH: Python programming and ctext.org API

By Donald Sturgeon

Variables

Variables are named entities that contain some kind of data that can be changed at a later date. You can choose (almost) any name for a variable as long as it is not the same as a reserved word (i.e. has some special meaning in the Python language), though typically these names will be composed of letters a-z. The names given to variables have no special meaning to the computer, but giving variables names that describe their function in a particular program is usually very helpful to the programmer – and to anyone else who may look at your code. Spaces cannot be part of a variable name, so sometimes other allowed characters (e.g. “_”) are used instead for clarity. Although it is possible to use non-English characters for variable names, this is generally inadvisable as it may cause compatibility problems when running the same program on another computer.

A value is assigned to a variable using the syntax “variable_name = new_value”.

In [1]:
number_of_people = 5
print(number_of_people)
5

A variable only ever has one value at a time. When we assign a second value to a variable, anything that was in it before is lost.

In [2]:
number_of_people = 5
number_of_people = 15
print(number_of_people)
15

In Python, variable names are case sensitive, so as far as Python is concerned, a variable called “thisone” is completely different from a variable called “ThisOne” or “THISONE”.

In [3]:
test = 1
print(Test)  # This will not work and will give an error, because "test" and "Test" are different variables
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-c8ddc105ba6a> in <module>()
      1 test = 1
----> 2 print(Test)  # This will not work and will give an error, because "test" and "Test" are different variables

NameError: name 'Test' is not defined
In [4]:
thisone = 5
ThisOne = 10
print(thisone)
print(ThisOne)
5
10

We can perform basic arithmetic on variables and numbers using symbols representing arithmetic operators:

+ Add
- Subtract
/ Divide
* Multiply
In [5]:
number_of_people = 5
pages_per_person = 12
print(number_of_people * pages_per_person)
60
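
As a small sketch of the other operators (the variable names total_pages and readers are made up for illustration), note that in Python 3 the "/" operator always gives a float, even when the division is exact:

```python
total_pages = 60
readers = 5

added = total_pages + readers        # 65
subtracted = total_pages - readers   # 55
multiplied = total_pages * readers   # 300
divided = total_pages / readers      # 12.0 (a float, not the integer 12)

print(added, subtracted, multiplied, divided)
```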

Strings

One of the most important units of text in most programming languages is the string: an ordered sequence of zero or more characters. Strings can be “literal” strings – string data typed in to a program – or the contents of a variable.

Literal strings have to be enclosed in special characters so that Python knows exactly which part of what appears in the program belongs to the string being defined. You can use either a pair of double quotation marks (“…”) or single quotation marks (‘…’) for this. (Note: most programming languages including Python will not allow the use of ‘full-width’ Chinese / CJK punctuation characters for this purpose.)

In [6]:
print("學而時習之")
學而時習之
In [7]:
analects_1_1 = "子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」"
print(analects_1_1)
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」

Two strings can be joined together (concatenated) using the “+” operator to give a new string:

In [8]:
analects_1_3 = "子曰:「巧言令色,鮮矣仁!」"
print(analects_1_1 + analects_1_3)
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」子曰:「巧言令色,鮮矣仁!」

In Python, each variable has a particular “type”. The most common types are “string”, “integer” (…, -2, -1, 0, 1, 2, …), and “float” (a number that may have a fractional part, e.g. 3.1415 or -26.5). When joining a string and a number using “+”, we need to specify that the number should be changed into a string:

In [9]:
print(analects_1_1 + 5) # This will not work
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-0eb38b4432f4> in <module>()
----> 1 print(analects_1_1 + 5) # This will not work

TypeError: Can't convert 'int' object to str implicitly
In [10]:
print(analects_1_1 + str(5))
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」5
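
If you are unsure what type a variable currently has, the built-in type() function will tell you. The following is a minimal sketch; the variable names passage, count and ratio are made up for illustration:

```python
passage = "學而時習之"
count = 5
ratio = 3.1415

print(type(passage))  # <class 'str'>
print(type(count))    # <class 'int'>
print(type(ratio))    # <class 'float'>

# str() turns a number into a string so that it can be joined using "+"
label = passage + str(count)
print(label)
```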

Sometimes we may need to include “special” characters in our strings, like the return character (\n), or tab character (\t). Also, if we need to include quotation marks as part of our string, we can do this by putting a backslash before each quotation mark (\").

In [11]:
output_string = "第一行\n第二行\n\"第三行\""
print(output_string)
第一行
第二行
"第三行"

Doing things with strings

Once we have data in a string, we can manipulate it in various ways. Often a program will be designed to work on arbitrary string input, i.e. the program will not know in advance what strings it will be asked to work with. So we need ways of finding out basic things about the string. First of all, how long (in characters) is the string:

In [12]:
print(len(analects_1_1))
41

N.B. “Characters” here means characters in the technical string sense. It includes, for example, all punctuation symbols, and other “special characters” that may be in the string (such as characters representing line breaks).

We can take a single character from a string and create a new string containing just that character using the notation “string_variable[m]”, where m is a number describing the position of the character we want to copy from the string.

N.B. In Python (and many other languages), the characters in a string are numbered starting from 0. So if a string has a length of 5, its characters are numbered 0, 1, 2, 3, and 4.

In [13]:
print(analects_1_1[0])
子
In [14]:
print(analects_1_1[5])
而

If we use a negative value for m, we can do the same thing but counting backwards from the end of the string:

In [15]:
print(analects_1_1[-3])
乎

Another useful basic function is making a new string from some part of an existing string – this is called a “substring”.
In Python, we get a substring of a string starting at position m and ending just before position n using the notation “string_variable[m:n]”:

In [16]:
print(analects_1_1[0:1])
子
In [17]:
print(analects_1_1[1:2])
曰
In [18]:
print(analects_1_1[0:2])
子曰
In [19]:
print(analects_1_1[4:15])
學而時習之,不亦說乎?
In [20]:
print(len(analects_1_1[4:15]))
11

If we want to count characters from the end of a string, instead of from the beginning, we can use a negative number for m (meaning “start from –m characters before the end of the string”) and either omit n entirely (meaning “up to the end of the string”) or use a negative number for n (meaning “up to –n characters before the end of the string”):

In [21]:
print(analects_1_1[-7:])
不亦君子乎?」
In [22]:
print(analects_1_1[-7:-1])
不亦君子乎?

There are many other functions for doing things with strings, and we will see more of these in week 3. In the meantime, two useful functions are count() and replace(). The first, count(), returns the number of times one string occurs within another string.

In [23]:
input_text = "道可道,非常道。"
print(input_text.count("道"))
3
In [24]:
print(input_text.count("道可"))
1
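
One detail worth knowing is that count() counts non-overlapping occurrences: once part of the string has been matched, it is not reused for the next match. A small sketch with a made-up string:

```python
text = "子子子"

single_count = text.count("子")   # each character is matched once
pair_count = text.count("子子")   # after matching positions 0-1, only one character is left

print(single_count)
print(pair_count)
```

Here the first line printed is 3 and the second is 1, even though "子子" can be found starting at both position 0 and position 1.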

Another is replace(), which creates a new string in which all matching occurrences of a substring have been replaced by something else. The general form looks something like this:

string_to_search_in.replace(thing_to_search_for, thing_to_replace_with)

N.B. This function does not change the data in the original variable. It just returns new data with the substitution made.

In [25]:
input_text = "道可道,非常道。"
print(input_text.replace("道", "名"))
名可名,非常名。
In [26]:
print(input_text)  # Note: the input_text variable still contains the same data
道可道,非常道。
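
Because replace() returns a new string, the usual way to keep the result is to assign it to a variable. A minimal sketch (the names replaced_text and twice_replaced are made up for illustration):

```python
input_text = "道可道,非常道。"

# Store the new string under its own name; input_text itself is left untouched
replaced_text = input_text.replace("道", "名")
print(replaced_text)
print(input_text)

# The result of replace() is itself a string, so a second replace() can follow directly
twice_replaced = input_text.replace("道", "名").replace("常", "恆")
print(twice_replaced)
```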

Lists

Lists are another kind of variable that works a lot like a string, except that whereas each location within a string is always exactly one character, each location in a list can be any kind of value, such as a number or a string.

We can make a list variable by separating each list element with commas and enclosing the whole lot in square brackets.

In [27]:
days_of_week = ["星期天", "星期一", "星期二", "星期三", "星期四", "星期五", "星期六"]
print(days_of_week)
['星期天', '星期一', '星期二', '星期三', '星期四', '星期五', '星期六']

In Python, the items stored in a list are numbered starting from 0. This means if we have 7 items in our list, they are numbered 0, 1, 2, 3, 4, 5, and 6.

In [28]:
print(days_of_week[0])
星期天
In [29]:
print(days_of_week[6])
星期六

If we try to use an item that isn’t in our list, we will get an error.

In [30]:
print(days_of_week[7])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-30-e6a905455149> in <module>()
----> 1 print(days_of_week[7])

IndexError: list index out of range

We can set the value of items in a list just like regular variables.

In [31]:
days_of_week[3] = "禮拜三"
print(days_of_week)
['星期天', '星期一', '星期二', '禮拜三', '星期四', '星期五', '星期六']
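
Since lists work a lot like strings, the negative positions and the [m:n] notation shown earlier for strings can also be used with lists. A short sketch, using made-up lists:

```python
mixed_list = ["論語", 20, 3.5]   # a string, an integer, and a float in one list
print(mixed_list[1])

days = ["星期天", "星期一", "星期二", "星期三"]
last_day = days[-1]      # counting backwards from the end of the list
middle_days = days[1:3]  # items at positions 1 and 2 (position 3 is not included)

print(last_day)
print(middle_days)
```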

Often when we process lists in a program, we will need to find out how long the list is. This is because our programs will usually be designed to work with any input of a certain type (for example, to work with any text, not just one we’ve chosen in advance), and so we will only find out how many items there are in a particular list when our program is actually run. The len() function tells us how many items are in our list.

In [32]:
print(len(days_of_week))
7

Remember that when we use len() on a string, it will give us the length of the string in numbers of characters. So for our days_of_week example:

In [33]:
print(len(days_of_week[0]))
print(len(days_of_week[1]))
# etc.
3
3

Make sure you understand why we get this answer here.

True and false

Boolean logic – i.e. logic in which things are either true or false – is central to most commonly used programming languages. Typically programs make decisions as to what to do next based on whether some particular expression (e.g. comparison of variables) is true or false. Some basic comparison operators are:

== equals
> greater than
< less than
>= greater than or equal to
<= less than or equal to

N.B. Assignment, e.g. a=1, uses a single “=”, whereas comparison, e.g. a==2, uses a double “==”.

In [34]:
print(5>2)
True
In [35]:
print(5>7)
False
In [36]:
number_of_pages = 5*12
print(number_of_pages == 60)
True
In [37]:
print(analects_1_1)
print(analects_1_3)
print(analects_1_1[0:2] == analects_1_3[0:2])   # If you're not sure why, try using print() on the two sides of the equation
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」
子曰:「巧言令色,鮮矣仁!」
True
In [38]:
print(analects_1_1[0:7] == analects_1_3[0:7])   # If you're not sure why, try using print() on the two sides of the equation
False

Often we want to know whether something occurs anywhere within a string. One way to do this is using the in operator. (We will look at more sophisticated types of searching for more complex patterns next week.)

In [39]:
text_to_search_in = "有子曰:「其為人也孝弟,而好犯上者,鮮矣;不好犯上,而好作亂者,未之有也。君子務本,本立而道生。孝弟也者,其為仁之本與!」"
print("孝弟" in text_to_search_in)
True
In [40]:
print("仁義" in text_to_search_in) 
False

Making decisions

Now that we can compare things, we can start to change what we do depending on the outcome of these comparisons.

The simplest type of decision is an “if … then …” decision: if something is true, then do something (otherwise don’t do it).

In [41]:
if(3>2):
    print("3 is greater than 2")
    
if(2>3):
    print("2 is greater than 3")
3 is greater than 2

In Python, indentation (one or more spaces from the left-hand margin) is used to mark blocks of code (sequences of instructions) which are to be followed together. For example:

In [42]:
if(3>2):
    print("3 is greater than 2")
    print("This is also executed if 3>2")
    
if(2>3):
    print("2 is greater than 3")

print("This is *always* executed, because it is outside of both 'if' blocks")
3 is greater than 2
This is also executed if 3>2
This is *always* executed, because it is outside of both 'if' blocks

The “else” keyword can be used after an “if” to do one thing if a condition is true, and some other thing if it is not true:

In [43]:
text1 = "Some text"
text2 = "Some other text"
if(len(text1) > len(text2)):
    print("text 1 is longer than text 2")
else:
    print("text 1 is not longer than text 2")  # It might be exactly the same length though - try changing the text in text1 and text2
text 1 is not longer than text 2
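
If the "exactly the same length" case matters, one option is Python's “elif” keyword (short for “else if”), which this tutorial does not otherwise cover. The following is only a sketch of how the example above might be extended:

```python
text1 = "Some text"
text2 = "Some other text"

if len(text1) > len(text2):
    result = "text 1 is longer than text 2"
elif len(text1) == len(text2):   # only checked if the first condition was False
    result = "text 1 and text 2 are the same length"
else:
    result = "text 1 is shorter than text 2"

print(result)
```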

Also useful are logical operators “and”, “or”, and “not”. These allow us to make more complex decisions based on several factors.

Python expression    Result
A and B              True if A is True and B is also True; otherwise False
A or B               True if A is True or B is True, or both are True; otherwise False
not A                True if A is False; otherwise False

N.B. Matching pairs of brackets are used in complex expressions to remove ambiguity: the innermost brackets are always evaluated first. For instance:

(a and b) or c   # This will only be true when either: 1) a and b are both true; or 2) c is true
a and (b or c)   # This will only be true when *both*: 1) a is true; and 2) either b is true or c is true
a and b or c     # ???? Don't write this - it's not obvious which of the previous two lines it corresponds to

Suggestion: don’t write things like “a and b or c” without brackets, as these are confusing. (There are rules that mean they are not ambiguous, but instead of worrying about these now, always use brackets when mixing and, or, and not in an expression.)
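
For the curious, the rule in question is that Python evaluates “and” before “or”. The sketch below simply tries all eight combinations of three True/False values and compares the unbracketed expression with the two bracketed ones (the list name rows is made up for illustration):

```python
rows = []
for a in [True, False]:
    for b in [True, False]:
        for c in [True, False]:
            unbracketed = a and b or c
            left_grouped = (a and b) or c
            right_grouped = a and (b or c)
            rows.append((a, b, c, unbracketed, left_grouped, right_grouped))
            print(a, b, c, unbracketed, left_grouped, right_grouped)
```

In every row the fourth and fifth values agree, while the sixth sometimes differs (for example when a is False and c is True). Brackets are still the clearer choice.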

Experiment with changing the values of the three variables below, making sure you understand why you get different results.

In [44]:
is_raining = False
will_rain_later = True
i_am_going_out = True

print(is_raining or will_rain_later)
print((is_raining or will_rain_later) and i_am_going_out)

if ((not is_raining) and will_rain_later) and i_am_going_out:
    print("Did you see the forecast?")
    
if (is_raining or will_rain_later) and i_am_going_out:
    print("Better take an umbrella!")
True
True
Did you see the forecast?
Better take an umbrella!

Repeating instructions

A lot of the power of digital methods comes from the fact that once we have a program that performs some task on an arbitrary object of some kind, we can easily have the computer perform that same task on large numbers of objects. One of the simplest ways of repeating instructions is the “for” loop. Just like with if, in Python we use indentation to indicate exactly which instructions we want repeated. For loops are often used with the range() function:

In [45]:
for some_variable in range(0,5):
    print(some_variable)
0
1
2
3
4

The range() function takes two parameters that determine what range of numbers we want: the first parameter determines the number we want to begin with, and the second determines the end of the range. Note: the “end” parameter to range() is not inclusive, so range(0,3) will give us a range with the numbers 0,1,2 but not 3.
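
To see exactly which numbers a range contains without writing a loop, one option is to convert it to a list with list(). This sketch also shows that range() can be called with a single parameter, in which case counting starts from 0:

```python
first_range = list(range(0, 3))   # 0, 1, 2 (3 itself is not included)
short_form = list(range(3))       # the same numbers: the start defaults to 0
later_start = list(range(2, 5))   # 2, 3, 4

print(first_range)
print(short_form)
print(later_start)
```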

As well as range(), we can also loop over each character in a string, or over each item in a list (i.e. do some processing for each character or item).

In [46]:
for day in days_of_week:
    print(day)
星期天
星期一
星期二
禮拜三
星期四
星期五
星期六

Working with characters in a string is just as easy – we can use the format “for character_variable in string:”

In [47]:
my_string = "道可道,非常道。"
for my_character in my_string:
    print(my_character)
道
可
道
,
非
常
道
。

However, sometimes we also need to know where (i.e. at what index) we are in a string.

In [48]:
for character_index in range(0,len(my_string)):
    print("Index " + str(character_index) + " in our string is: " + my_string[character_index])
Index 0 in our string is: 道
Index 1 in our string is: 可
Index 2 in our string is: 道
Index 3 in our string is: ,
Index 4 in our string is: 非
Index 5 in our string is: 常
Index 6 in our string is: 道
Index 7 in our string is: 。
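
An alternative way of getting both the index and the character is Python's built-in enumerate() function, which hands us the two together on each pass through the loop. This is only a sketch of the same idea using a shorter, made-up string:

```python
my_string = "道可道"

for character_index, my_character in enumerate(my_string):
    print("Index " + str(character_index) + " in our string is: " + my_character)

# The same pairs collected into a list, to show what enumerate() produces
pairs = list(enumerate(my_string))
print(pairs)
```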

Sometimes we will need to have one loop inside another loop. In this case, we use progressively larger indentations to indicate which instructions should be repeated in which loop.

In [49]:
for x in range(1,4):
    print("Starting sums with x = " + str(x))
    for y in range(10,14):
        z = x * y
        print(str(x) + "*" + str(y) + "=" + str(z))
    print("Finished sums with x = " + str(x))
print("Finished all the sums")
Starting sums with x = 1
1*10=10
1*11=11
1*12=12
1*13=13
Finished sums with x = 1
Starting sums with x = 2
2*10=20
2*11=22
2*12=24
2*13=26
Finished sums with x = 2
Starting sums with x = 3
3*10=30
3*11=33
3*12=36
3*13=39
Finished sums with x = 3
Finished all the sums

Getting Chinese texts

If you already have a copy of a text you’d like to process, you can easily read it from a text file into a string variable. However, as the formatting of each text may be different, the exact steps needed to process the text may differ slightly in each case. We will look at how to deal with this in detail next time when we look at regular expressions, since these provide powerful tools to quickly reorganize textual data.

An alternative way of getting textual data for historical texts is to use the ctext.org API, which lets us get the text for many historical texts in a consistent format. Texts are identified using a URN (Uniform Resource Name) – you can see this written at the bottom of the page when you view the corresponding text on ctext.org.

To make it easier to access these texts, we can use a specialized Python module for the API. Before we can use it, we have to install it. Python makes this very easy to do, and it only needs to be done once; if you followed the instructions in the first tutorial, you should already have this module installed.

Once installed, we can use functions from this module to read textual data into Python variables. If we don’t care about the structure of a text, but only its contents, we can read the text into a single list of strings, each containing a single paragraph, like this:

In [50]:
from ctext import *  # This lets us get Chinese texts from http://ctext.org
setapikey("demo")    # This allows us access to the data used in these tutorials

passages = gettextasparagrapharray("ctp:analects")

print("Total number of passages: " + str(len(passages)))
print("First passage is: " + passages[0])
print("Second passage is: " + passages[1])
print("Last passage is: " + passages[-1])
Total number of passages: 503
First passage is: 子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」
Second passage is: 有子曰:「其為人也孝弟,而好犯上者,鮮矣;不好犯上,而好作亂者,未之有也。君子務本,本立而道生。孝弟也者,其為仁之本與!」
Last passage is: 子曰:「不知命,無以為君子也。不知禮,無以立也。不知言,無以知人也。」

Reading and writing files

To read from or write to a file, we must first open it. When we open a file, we must specify both the name of the file, and whether we want to read data from it (“r”), or write data to it (“w”). The file.write() function works very much like print(), but writes its output directly to the file instead of to the screen. When writing to a file, however, we need to explicitly include return characters at the end of each line using “\n”. The file.read() function reads all the data from the file, which you can assign to a Python variable.

N.B. Be careful when opening files! When you use “w” to open a file for writing, any file with that name in the same folder as your Python Notebook will immediately be replaced with a new, empty file.

In [51]:
file = open("week2_testfile.txt", "w", encoding="utf-8") # N.B. "w" here means we will open this file and write to it. If the file exists, it will immediately be deleted.
file.write("第一行\n第二行")
file.close()

Now take a look in Windows Explorer or Mac OS Finder and make sure you can see where this file is on your computer. It will be helpful for you to know in which folder Python expects files to be by default.

In [52]:
file = open("week2_testfile.txt", "r", encoding="utf-8") # "r" means we will open this file for reading, and won't be able to modify it
data_from_file = file.read()
file.close()
print(data_from_file)
第一行
第二行
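
A common alternative to calling file.close() yourself is Python's “with” statement, which closes the file automatically when the indented block ends. The sketch below writes to a temporary folder so that nothing on your computer is overwritten; the file name with_example.txt is made up, and split() is a string function not covered above that cuts a string into a list at each "\n":

```python
import os
import tempfile

# Use a temporary folder so that no existing file is replaced
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "with_example.txt")  # hypothetical file name

with open(filename, "w", encoding="utf-8") as file:
    file.write("第一行\n第二行")
# The file has been closed automatically at this point

with open(filename, "r", encoding="utf-8") as file:
    data_from_file = file.read()

lines = data_from_file.split("\n")  # one list item per line of the file
print(lines)
```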

Exercises

1.i) Write a program using a for loop to output all of the substrings of length 2 contained in the variable input_string. Your program should produce output like this:

天命
命之
之謂
謂性
性,
,率
率性
性之
...
...
道也
也。
In [60]:
input_string = "天命之謂性,率性之謂道,修道之謂教。道也者,不可須臾離也,可離非道也。"

# Your code goes here!

1.ii) Now modify your program so that you first define a variable called “substring_length” containing a number determining the length of substring to be listed. Your new program should still give the same output when run with “substring_length=2”, but should also work with substring_length set to 3, 4, etc. Remember, every line that your program outputs should have exactly substring_length characters in it (including punctuation characters).

1.iii) Modify your program again so that on each line it prints the total number of times that the substring occurs in input_string. For instance, each line beginning “之謂” should now read “之謂 3”, since “之謂” occurs three times in this string.

1.iv) Run your program again, but now use this slightly longer text instead:

input_string = "天命之謂性,率性之謂道,修道之謂教。道也者,不可須臾離也,可離非道也。是故君子戒慎乎其所不睹,恐懼乎其所不聞。莫見乎隱,莫顯乎微。故君子慎其獨也。喜怒哀樂之未發,謂之中;發而皆中節,謂之和;中也者,天下之大本也;和也者,天下之達道也。致中和,天地位焉,萬物育焉。"

Look at what the most frequent 2-grams are with this text. As before, some of them will include punctuation. Do any of these frequent 2-grams which include punctuation relate to facts about the language?

2.i) In the cell below, write a program to find and print out all passages in the Analects that include the term “仁”.

Note: if you’ve run the example program under Getting Chinese texts, the data is already stored in the passages variable.

In [54]:
# Your code goes here!

2.ii) Modify your program so it instead lists only passages that mention both terms “仁” and “義”.

2.iii) Modify your program so it instead lists only passages that mention either the term “愛人” or “知人” but not both.

3) Write another program (if you like, you can copy and paste your answer to the previous question and modify it) to determine which passage in the Analects mentions the term “禮” the greatest number of times.

Hint: Use one variable to track the greatest number of times “禮” has appeared, and another to track which passage it appeared in.

In [55]:
# Your code goes here!

4.i) Write a program in the cell below to store the full text of the Analects into a file on your computer called “analects.txt”. Put each paragraph on its own line, and in front of each paragraph put firstly the number of the paragraph, starting at 1, and secondly the length of the paragraph in characters. Separate each of these three pieces of data with a tab character. The beginning of your file should look like this:

1   41   子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?」
2   61   有子曰:「其為人也孝弟,而好犯上者,鮮矣;不好犯上,而好作亂者,未之有也。君子務本,本立而道生。孝弟也者,其為仁之本與!」
In [56]:
# Your code goes here!
  • Open the file in a text editor (e.g. Notepad for Windows, TextEdit for Mac OS X – usually double-clicking on the file you’ve created will do this) and check that the output looks correct.

4.ii) Modify your program so that the character counts only include Chinese characters, i.e. do not count punctuation characters.

4.iii) (Optional) If you have Excel or another spreadsheet program on your computer, try importing the file into it so that you get separate columns for the paragraph number, length, and content. (This may or may not be easy depending on your operating system and spreadsheet program. If you encounter encoding issues, try copying all of the data from the text editor straight into a blank spreadsheet instead.)

In [57]:
# Your code goes here!

5) [Harder] Which passage in the Analects contains a character repeated more frequently than any other character in any other passage – and what is the character?
It will help if you firstly think carefully about what you need to keep track of between passages in order to answer this question.

In [58]:
# Your code goes here!

Bonus question

This section is optional as it includes several things we haven’t covered yet and will look at later on when we look more closely at structured data.

The program below uses a dictionary variable to count all of the 1-grams in the Analects. [A dictionary variable is very similar to a list, except that its items are not numbered 0,1,2,… but instead indexed using arbitrary strings, for instance my_dictionary["論語"], which might contain a string value such as “analects”, or a number like 32. The term “dictionary” here is metaphorical, i.e. a dictionary variable often does not contain translations of words from one language to another, though this is one possible use case.]
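
Before reading the full program, it may help to see a dictionary on its own. This is a minimal sketch with made-up contents:

```python
my_dictionary = {}                  # an empty dictionary
my_dictionary["論語"] = "analects"   # store a value under the key "論語"
my_dictionary["孟子"] = "mengzi"

print(my_dictionary["論語"])

# "in" checks whether a key is present in the dictionary
has_analects = "論語" in my_dictionary
has_zhuangzi = "莊子" in my_dictionary
print(has_analects, has_zhuangzi)

# Assigning to an existing key replaces its value, just like a variable
my_dictionary["論語"] = 32
print(my_dictionary["論語"])
print(len(my_dictionary))           # the number of keys
```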

It then uses the pandas library to select the top ten most frequent 1-grams, and the matplotlib library to draw a bar chart of this data.

Read through the code, and see if you can modify it to find the most frequent 2-grams, 3-grams, etc.

In [59]:
import numpy as np
import matplotlib.pyplot as plt

# The next line tells the matplotlib library to display its output in our Jupyter notebook
%matplotlib inline
from ctext import *
import pandas as pd
import matplotlib as mpl

# Unfortunately some software still has difficulty dealing with Chinese.
# Here we may need to tell matplotlib to use a specific font containing Chinese characters.
import platform
if platform.system() == 'Darwin':   # I.e. if we're running on Mac OS X
    mpl.rcParams['font.family'] = "STFangsong" 
else:
    mpl.rcParams['font.family'] = "SimHei"
    
mpl.rcParams['font.size'] = 20

chapterdata = gettextasparagrapharray("ctp:analects")

# Use a dictionary variable to keep track of the count of each character we see
character_count = {}

# For each paragraph of the chapter data that we downloaded, do the following:
for paragraphnumber in range(0, len(chapterdata)):
    for char in range(0,len(chapterdata[paragraphnumber])):
        this_character = chapterdata[paragraphnumber][char:char+1]
        # Don't bother counting punctuation characters
        if this_character not in [",", "。", ":", ";", "「", "」", "?"]:
            if this_character in character_count:
                new_count = character_count[this_character] + 1
            else:
                new_count = 1
            character_count[this_character] = new_count

s = pd.Series(character_count)
s.sort_values(ascending=False, inplace=True)  # Sort from most frequent to least frequent

s[:10].plot(kind='barh')
print(s[:10])
子    973
曰    757
之    613
不    583
也    532
而    343
其    270
者    219
人    219
以    211
dtype: int64
Creative Commons License