This is the first in a series of online tutorials introducing basic digital humanities techniques using the Python programming language and the Chinese Text Project API. These tutorials are based in part on material covered in the course CHNSHIS 202: Digital Methods for Chinese Studies, which I teach at Harvard University’s Department of East Asian Languages and Civilizations.
Intended audience: People with some knowledge of Chinese literature and an interest in digital humanities; no programming experience necessary.
Format: Most of these tutorials will consist of a Jupyter Notebook file. These files contain a mixture of explanations and code that can be modified and run from within your web browser. This makes it very easy to modify, play with, and extend all of the example code. You can also read the tutorials online first (you’ll need to download the files in order to run the code and do the exercises though).
To use this series of tutorials, you need to first complete the following steps:
- Install Python (programming language) and Jupyter (web browser based interface to Python). The recommended way to do this is by installing the Anaconda distribution, which will automatically install Python, Jupyter, and many other things we need. For these tutorials, you should install the Python 3.x version of Anaconda (not the 2.7 version).
- Install the ctext module. To do this, after installing Anaconda, open Command Prompt (Windows) or Terminal (Mac OS X), and then type:
pip install ctext [return]
- Create a folder to contain your Python projects. To follow a tutorial, first download the .ipynb Jupyter Notebook file and save it into this folder.
- Start up the Jupyter environment. One way to do this is to open the Command Prompt (Windows) or Terminal (Mac OS X), and then type:
jupyter notebook [return]
- When you start Jupyter, it should open your web browser and take you to the page http://localhost:8888/tree. This is a web page, but instead of being located somewhere on the internet, it is located on your own computer. The page should show a list of files and folders on your own computer; using this list, navigate to the folder containing the downloaded .ipynb file, and click on the file to open it in your web browser. You can now use the full interactive version of the notebook.
- The Jupyter system works by having a server program which runs in the background (if you start Jupyter as described above, you can see it running in the Terminal / Command Prompt window), which is then accessed using a web browser. This means that when you close your web browser, Jupyter is still running until you stop the server process. You can stop the server process by opening the Terminal / Command Prompt window and pressing Control-C twice (i.e. holding down the “Control” key and pressing the C key twice).
Below is the Jupyter notebook for this tutorial. Since the first tutorial focuses on how to use the Jupyter environment, you should download and open this notebook in Jupyter rather than trying to follow it online.
Welcome to our first Jupyter Notebook!
A notebook is a hypertext document containing a mixture of textual content (like the part you’re reading now) and computer programs – lists of instructions written in a programming language (in our case, the Python language) – as well as the output of these programs.
Using the Jupyter environment
Before getting started with Python itself, it’s important to get some basic familiarity with the user interface of the Jupyter environment. Jupyter is fairly intuitive to use, partly because it runs in a web browser and so works a lot like any web page. Basic principles:
Each “notebook” displays as a single page. Notebooks are opened and saved using the menus and icons shown within the Jupyter window (i.e. the menus and icons under the Jupyter logo and icon, not the menus / icons belonging to your web browser).
Notebooks are made up of “cells”. Each cell is displayed on the page in a long list, one below another. You can see which parts of the notebook belong to which cell by clicking once on the text: doing so selects the cell containing the text and outlines it with a grey line.
Usually a cell contains either text (like this one – in Jupyter this is called a “Markdown” cell), or Python code (like the one below this one).
You can click on a program cell to edit it, and double-click on a text cell to edit it. Try double-clicking on this cell.
When you start editing a text cell, the way it is displayed changes so that you can see (and edit) any formatting codes in it. To return the cell to the “normal” prettified display, you need to “Run” it. You can run a cell by either:
- choosing “Run” from the “Cell” menu above,
- pressing shift-return when the cell is selected, or
- clicking the “Run cell” icon.
- “Run” this cell so that it returns to the original mode of display.
for number in range(1,13):
    print(str(number) + "*" + str(number) + " = " + str(number*number))

1*1 = 1
2*2 = 4
3*3 = 9
4*4 = 16
5*5 = 25
6*6 = 36
7*7 = 49
8*8 = 64
9*9 = 81
10*10 = 100
11*11 = 121
12*12 = 144
The program in a cell doesn’t do anything until you ask Jupyter to run (a.k.a. “execute”) it – in other words, ask the system to start following the instructions in the program. You can execute a cell by clicking somewhere in it so it’s selected, then choosing “Run” from the “Cell” menu (or by pressing shift-return).
When you run a cell containing a Python program, any output that the program generates is displayed directly below that cell. If you modify the program, you’ll need to run it again before you will see the modified result.
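One detail of the squares program above that is easy to miss: `range(1,13)` starts at 1 and stops just before 13, so the loop covers the numbers 1 through 12. The following is a minimal sketch of that behaviour; the variable names `numbers` and `lines` are chosen here only for illustration and are not part of the original notebook.

```python
# range(start, stop) yields start, start+1, ..., stop-1; the stop value itself is excluded.
numbers = list(range(1, 13))
print(numbers[0], "to", numbers[-1])   # first and last values produced by the range

# The same squares as in the cell above, collected in a list before printing.
lines = []
for number in numbers:
    lines.append(str(number) + "*" + str(number) + " = " + str(number * number))

print(lines[0])    # the first line of output
print(lines[-1])   # the last line of output
```

Changing either of the two numbers passed to `range` and running the cell again is a quick way to see which values are included.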
A lot of the power of Python and Jupyter comes from the ability to easily make use of modules written by other people. Modules are included using lines like “from … import *”.
A module needs to be installed on your computer before you can use it; many of the most commonly used ones are installed as part of Anaconda.
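As a small illustration of importing, the sketch below uses `math`, a module that ships with Python itself, so nothing extra needs to be installed. It is chosen here purely as an example and is not used elsewhere in this notebook; the variable names are likewise only illustrative.

```python
# Import one named function from the math module so it can be used directly.
# (Writing "from math import *" instead would make every public name in the
# module available in the same way, including sqrt.)
from math import sqrt
root = sqrt(16)

# "import math" keeps the module's contents behind the module name instead.
import math
same_root = math.sqrt(16)

print(root, same_root)
```

Both styles call exactly the same function; the difference is only in how the name is written in your program.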
“Comments” provide a way of explaining to human readers what parts of a program are supposed to do (but are completely ignored by Python itself). Typing the symbol # begins a comment, which continues until the end of the line.
N.B. You must install the “ctext” module before running the code below. If you get the error “ImportError: No module named ‘ctext’” when you try to run the code, refer to the instructions for how to install the ctext module.
from ctext import *  # This module gives us direct access to data from ctext.org

paragraphs = gettextasparagrapharray("ctp:analects/xue-er")

print("This chapter is made up of " + str(len(paragraphs)) + " paragraphs. These are:")

# For each paragraph of the chapter data that we downloaded, do the following:
for paragraphnumber in range(0, len(paragraphs)):
    print(str(paragraphnumber+1) + ". " + paragraphs[paragraphnumber])
This chapter is made up of 16 paragraphs. These are:
1. 子曰：「學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？人不知而不慍，不亦君子乎？」
2. 有子曰：「其為人也孝弟，而好犯上者，鮮矣；不好犯上，而好作亂者，未之有也。君子務本，本立而道生。孝弟也者，其為仁之本與！」
3. 子曰：「巧言令色，鮮矣仁！」
4. 曾子曰：「吾日三省吾身：為人謀而不忠乎？與朋友交而不信乎？傳不習乎？」
5. 子曰：「道千乘之國：敬事而信，節用而愛人，使民以時。」
6. 子曰：「弟子入則孝，出則弟，謹而信，汎愛眾，而親仁。行有餘力，則以學文。」
7. 子夏曰：「賢賢易色，事父母能竭其力，事君能致其身，與朋友交言而有信。雖曰未學，吾必謂之學矣。」
8. 子曰：「君子不重則不威，學則不固。主忠信，無友不如己者，過則勿憚改。」
9. 曾子曰：「慎終追遠，民德歸厚矣。」
10. 子禽問於子貢曰：「夫子至於是邦也，必聞其政，求之與？抑與之與？」子貢曰：「夫子溫、良、恭、儉、讓以得之。夫子之求之也，其諸異乎人之求之與？」
11. 子曰：「父在，觀其志；父沒，觀其行；三年無改於父之道，可謂孝矣。」
12. 有子曰：「禮之用，和為貴。先王之道斯為美，小大由之。有所不行，知和而和，不以禮節之，亦不可行也。」
13. 有子曰：「信近於義，言可復也；恭近於禮，遠恥辱也；因不失其親，亦可宗也。」
14. 子曰：「君子食無求飽，居無求安，敏於事而慎於言，就有道而正焉，可謂好學也已。」
15. 子貢曰：「貧而無諂，富而無驕，何如？」子曰：「可也。未若貧而樂，富而好禮者也。」子貢曰：「《詩》云：『如切如磋，如琢如磨。』其斯之謂與？」子曰：「賜也，始可與言詩已矣！告諸往而知來者。」
16. 子曰：「不患人之不己知，患不知人也。」
‘Variables’ are named entities that contain some kind of data that can be changed at a later date. We will look at these in much more detail over the next few weeks. For now, you can think of them as named boxes which can contain any kind of data.
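A minimal sketch of the “named box” idea follows. The variable names `chapter_title` and `paragraph_count` are made up for this illustration and do not appear elsewhere in the notebook.

```python
# Create a variable called chapter_title and put a piece of text in it.
chapter_title = "學而"
print(chapter_title)

# A variable can hold other kinds of data too, such as a number.
paragraph_count = 16
print(paragraph_count)

# The contents of a variable can be changed later; its name stays the same.
paragraph_count = paragraph_count + 1
print(paragraph_count)
```

The last assignment reads the current contents of the box, adds one, and puts the result back in the same box.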
Once we have data stored in a variable (like the ‘paragraphs’ variable above), we can start processing it in whatever way we want. Often we use other variables to track our progress, like the ‘longest_paragraph’ and ‘longest_length’ variables in the program below.
longest_paragraph = None  # We use this variable to record which of the paragraphs we've looked at is longest
longest_length = 0        # We use this one to record how long the longest paragraph we've found so far is

for paragraph_number in range(0, len(paragraphs)):
    paragraph_text = paragraphs[paragraph_number]
    if len(paragraph_text)>longest_length:
        longest_paragraph = paragraph_number
        longest_length = len(paragraph_text)

print("The longest paragraph is paragraph number " + str(longest_paragraph+1) + ", which is " + str(longest_length) + " characters long.")
The longest paragraph is paragraph number 15, which is 93 characters long.
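For comparison, the same search can be written more compactly with Python's built-in `max` and `enumerate` functions. This is only an alternative sketch, not the approach the notebook itself takes. So that it runs without downloading anything, it uses a small stand-in list, `sample_paragraphs`, holding three of the short paragraphs printed earlier; that list and the other variable names are assumptions made for this example.

```python
# A small stand-in list so the sketch runs without downloading anything.
sample_paragraphs = [
    "子曰：「巧言令色，鮮矣仁！」",
    "曾子曰：「慎終追遠，民德歸厚矣。」",
    "子曰：「不患人之不己知，患不知人也。」",
]

# enumerate() pairs each paragraph with its position (0, 1, 2, ...);
# max() then picks the pair whose text has the most characters.
longest_index, longest_text = max(enumerate(sample_paragraphs), key=lambda pair: len(pair[1]))

print("The longest paragraph is paragraph number " + str(longest_index + 1)
      + ", which is " + str(len(longest_text)) + " characters long.")
```

If two paragraphs tie for longest, `max` returns the first one it meets, which matches the behaviour of the loop above (it only replaces the current record when a strictly longer paragraph turns up).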
Modules allow us to do powerful things like Principal Component Analysis (PCA) and machine learning without having to write any code to perform any of the complex mathematics which lies behind these techniques. They also let us easily plot numerical results within the Jupyter notebook environment.
For example, the following code (which we will go through in much more detail in a future tutorial – don’t worry about the contents of it yet) plots the frequencies of the two characters “矣” and “也” in chapters of the Analects versus chapters of the Fengshen Yanyi. (Note: this may take a few seconds to download the data.)
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

def makevector(string, termlist, normalize = False):
    vector = []
    for term in termlist:
        termcount = len(re.findall(term, string))
        if normalize:
            vector.append(termcount/len(string))
        else:
            vector.append(termcount)
    return vector

text1 = gettextaschapterlist("ctp:fengshen-yanyi")
text2 = gettextaschapterlist("ctp:analects")

vectors1 = []
for chapter in text1:
    vectors1.append(makevector(chapter, ["矣", "也"], True))

vectors2 = []
for chapter in text2:
    vectors2.append(makevector(chapter, ["矣", "也"], True))

df1 = pd.DataFrame(vectors1)
df2 = pd.DataFrame(vectors2)

legend1 = plt.scatter(df1.iloc[:,0], df1.iloc[:,1], color="blue", label="Fengshen Yanyi")
legend2 = plt.scatter(df2.iloc[:,0], df2.iloc[:,1], color="red", label="Analects")
plt.legend(handles = [legend1, legend2])
plt.xlabel("Frequency of 'yi'")
plt.ylabel("Frequency of 'ye'")
<matplotlib.text.Text at 0x10e4dc940>
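To get a feel for what the `makevector` function in the plotting cell computes, it can help to try it on a single short string. The sketch below repeats the function definition so that it runs on its own, and applies it to one paragraph from the chapter printed earlier. The variable names `sample`, `raw_counts` and `frequencies` are assumptions made for this example.

```python
import re

def makevector(string, termlist, normalize=False):
    # Build a list with one number per search term.
    vector = []
    for term in termlist:
        # Count how many times this term occurs in the string.
        termcount = len(re.findall(term, string))
        if normalize:
            # Divide by the length of the string to get a relative frequency.
            vector.append(termcount / len(string))
        else:
            vector.append(termcount)
    return vector

sample = "子曰：「巧言令色，鮮矣仁！」"   # paragraph 3 of the chapter printed earlier

raw_counts = makevector(sample, ["矣", "也"])          # plain counts
frequencies = makevector(sample, ["矣", "也"], True)   # counts divided by string length

print(raw_counts)
print(frequencies)
```

Normalizing matters when comparing texts of very different lengths: a long chapter will contain more of almost every character, so dividing by length makes chapters comparable.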
You can save changes to your notebook using “File” -> “Save and checkpoint”. Note that Jupyter often saves your changes for you automatically, so if you don’t want to save your changes, you might want to make a copy of your notebook first using “File” -> “Make a Copy”.
You should try to avoid having the same notebook open in two different browser windows or browser tabs at the same time. (If you do this, both pages may try to save changes to the same file, overwriting each other’s work.)
Before we start writing programs, we need to get familiar with the Jupyter Notebook programming environment. Check that you can complete the following tasks:
- Run each of the program cells in this notebook that are above this cell on your computer, checking that each of the short programs produces the expected output.
- Clear all of the output using “Cell” -> “All output” -> “Clear”, then run one or two of them again.
- In Jupyter, each cell in a notebook can be run independently. Sometimes the order in which cells are run is important. Try running the following three cells in order, then see what happens when you run them in a different order. Make sure you understand why in some cases you get different results.
number_of_things = 1
number_of_things = number_of_things + 1
print(number_of_things)
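The reason order matters is that all the cells in a notebook share the same variables: a value set by one cell is still there when the next cell runs. The following sketch imitates this in a single block of ordinary Python by writing out the statements in the order they would be executed. The names `after_running_in_order` and `after_running_again` are made up for this illustration.

```python
# Running the cells in the order given: set the value once, then add 1.
number_of_things = 1                       # the cell that sets the value
number_of_things = number_of_things + 1    # the cell that adds 1
after_running_in_order = number_of_things

# Running the adding cell again WITHOUT re-running the cell that sets the value:
# the variable still holds its previous contents, so the count keeps growing.
number_of_things = number_of_things + 1    # the adding cell, run a second time
after_running_again = number_of_things

print(after_running_in_order, after_running_again)
```

Re-running the cell that sets `number_of_things = 1` resets the count, which is why the same cell can print different numbers depending on what was run before it.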
Some of the programs in this notebook are very simple. Modify and re-run them to perform the following tasks:
- Print out the squares of the numbers 3 through 20 (instead of 1 through 12)
- Print out the cubes of the numbers 3 through 20 (i.e. 3 x 3 x 3 = 27, 4 x 4 x 4 = 64, etc.)
- Instead of printing passages from the first chapter of the Analects, print passages from the Daodejing, and determine the longest passage in it. The URN for the Daodejing is: ctp:dao-de-jing
Often when programming you’ll encounter error messages. The following line contains a bug; try running it, and look at the output. Work out which part of the error message is most relevant, and see if you can find an explanation on the web (e.g. on StackOverflow) and fix the mistake.
print("The answer to life the universe and everything is: " + 42) # This statement is incorrect and isn't going to work
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-f35205b23c00> in <module>()
----> 1 print("The answer to life the universe and everything is: " + 42) # This statement is incorrect and isn't going to work

TypeError: Can't convert 'int' object to str implicitly
- Sometimes a program will take a long time to run – or even run forever – and you’ll need to stop it. Watch what happens to the circle beside the text “Python 3” at the top-right of the screen when you run the cell below.
- While the cell below is running, try running the cell above. You won’t see any output until the cell below has finished running.
- Run the cell below again. While it’s running, interrupt its execution by clicking “Kernel” -> “Interrupt”.
import time

for number in range(1,21):
    print(number)
    time.sleep(1)

1
2
3
4
5
6
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-10-c01c67722f36> in <module>()
      3 for number in range(1,21):
      4     print(number)
----> 5     time.sleep(1)

KeyboardInterrupt:
- The cell below has been set as a “Markdown” cell, making it a text cell instead of a program (“code”) cell. Work out how to make the cell run as a program.
for number in range(1,11):
    print("1/" + str(number) + " = " + str(1/number)) # In many programming languages, the symbol "/" means "divided by"
- Experiment with creating new cells below this one. Make some text cells, type something in them, and run them. Copy and paste some code from above into code cells, and run them too. Try playing around with simple modifications to the code.
- (Optional) You can make your text cells look nicer by including formatting instructions in them. The way of doing this is called “Markdown” – there are many good introductions available online.
- Lastly, save your modified notebook and close your web browser. Shut down the Jupyter server process, then start it again, and reload your modified notebook. Make sure you can also find the saved notebook file in your computer’s file manager (e.g. “Windows Explorer”/“File Explorer” on Windows, or “Finder” on Mac OS X).
- Jupyter Notebook Users Manual, Bryn Mawr College Computer Science – This provides a thorough introduction to Jupyter features. This guide introduces many more features than we will need to use, but is a great reference.