This is the first in a series of online tutorials introducing basic digital humanities techniques using the Python programming language and the Chinese Text Project API. These tutorials are based in part on material covered in the course CHNSHIS 202: Digital Methods for Chinese Studies, which I teach at Harvard University’s Department of East Asian Languages and Civilizations.
Intended audience: People with some knowledge of Chinese literature and an interest in digital humanities; no programming experience necessary.
Format: Most of these tutorials will consist of a Jupyter Notebook file. These files contain a mixture of explanations and code that can be modified and run from within your web browser. This makes it very easy to modify, play with, and extend all of the example code. You can also read the tutorials online first, though you’ll need to download the files in order to run the code and do the exercises.
Getting started
To use this series of tutorials, you need to first complete the following steps:
- Install Python (programming language) and Jupyter (web browser based interface to Python). The recommended way to do this is by installing the Anaconda distribution, which will automatically install Python, Jupyter, and many other things we need. For these tutorials, you should install the Python 3.x version of Anaconda (not the 2.7 version).
- Install the ctext module. To do this, after installing Anaconda, open Command Prompt (Windows) or Terminal (Mac OS X), and then type:
pip install ctext [return]
- Create a folder to contain your Python projects. To follow a tutorial, first download the .ipynb Jupyter Notebook file and save it into this folder.
- Start up the Jupyter environment. One way to do this is opening the Command Prompt (Windows) or Terminal (Mac OS X), and then typing:
jupyter notebook [return]
- When you start Jupyter, it should open your web browser and take you to the page http://localhost:8888/tree. This is a web page, but instead of being located somewhere on the internet, it is located on your own computer. The page should show a list of files and folders on your own computer; using this list, navigate to the folder containing the downloaded .ipynb file, and click on the file to open it in your web browser. You can now use the full interactive version of the notebook.
- The Jupyter system works by having a server program which runs in the background (if you start Jupyter as described above, you can see it running in the Terminal / Command Prompt window), which is then accessed using a web browser. This means that when you close your web browser, Jupyter is still running until you stop the server process. You can stop the server process by opening the Terminal / Command Prompt window and pressing Control-C twice (i.e. holding down the “Control” key and pressing the C key twice).
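Once Jupyter is running, one way to confirm that the installation steps above worked is to run a short check in a notebook cell. The following is only a sketch: the helper name `module_is_installed` is made up for this illustration, and the list of modules checked is simply the set used later in this tutorial.

```python
import importlib.util
import sys

def module_is_installed(name):
    """Return True if Python can find a module with this name (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None

# Show which version of Python this notebook is using (it should start with 3).
print("Python version:", sys.version.split()[0])

# Report whether each module used in this tutorial can be found.
for name in ["ctext", "pandas", "matplotlib"]:
    if module_is_installed(name):
        print(name + ": installed")
    else:
        print(name + ": NOT installed")
```

If “ctext” is reported as not installed, repeat the “pip install ctext” step before continuing.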
Below is the Jupyter notebook for this tutorial. Since the first tutorial focuses on how to use the Jupyter environment, you should download and open this notebook in Jupyter rather than trying to follow it online.
Classical Chinese DH: Getting started
Welcome to our first Jupyter Notebook!
A notebook is a hypertext document containing a mixture of textual content (like the part you’re reading now) and computer programs – lists of instructions written in a programming language (in our case, the Python language) – as well as the output of these programs.
Using the Jupyter environment
Before getting started with Python itself, it’s important to get some basic familiarity with the user interface of the Jupyter environment. Jupyter is fairly intuitive to use, partly because it runs in a web browser and so works a lot like any web page. Basic principles:
- Each “notebook” displays as a single page. Notebooks are opened and saved using the menus and icons shown within the Jupyter window (i.e. the menus and icons under the Jupyter logo and icon, not the menus / icons belonging to your web browser).
- Notebooks are made up of “cells”. Each cell is displayed on the page in a long list, one below another. You can see which parts of the notebook belong to which cell by clicking once on the text; doing so selects the cell containing the text and shows its outline with a grey line.
- Usually a cell contains either text (like this one – in Jupyter this is called a “Markdown” cell), or Python code (like the one below this one).
- You can click once on a program cell to edit it, or double-click on a text cell to edit it. Try double-clicking on this cell.
- When you start editing a text cell, the way it is displayed changes so that you can see (and edit) any formatting codes in it. To return the cell to the “normal” prettified display, you need to “Run” it. You can run a cell by either:
  - choosing “Run” from the “Cell” menu above,
  - pressing shift-return when the cell is selected, or
  - clicking the “Run cell” icon.
- “Run” this cell so that it returns to the original mode of display.
for number in range(1,13):
    print(str(number) + "*" + str(number) + " = " + str(number*number))
The program in a cell doesn’t do anything until you ask Jupyter to run (a.k.a. “execute”) it – in other words, ask the system to start following the instructions in the program. You can execute a cell by clicking somewhere in it so it’s selected, then choosing “Run” from the “Cell” menu (or by pressing shift-return).
When you run a cell containing a Python program, any output that the program generates is displayed directly below that cell. If you modify the program, you’ll need to run it again before you will see the modified result.
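The program above relies on `range(1,13)` to decide which numbers to print. As a small illustration (not part of the original notebook), the following cell shows that `range` counts from the first number up to, but not including, the second one:

```python
# range(start, stop) counts from start up to, but not including, stop.
numbers = list(range(1, 13))
print(numbers)        # the numbers 1 through 12
print(len(numbers))   # how many numbers that is

# The same pattern as the program above, for a shorter range:
for number in range(1, 4):
    print(str(number) + "*" + str(number) + " = " + str(number*number))
```

Changing either number inside `range(...)` and running the cell again is a quick way to see this behaviour for yourself.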
A lot of the power of Python and Jupyter comes from the ability to easily make use of modules written by other people. Modules are included using lines like “from … import *”.
A module needs to be installed on your computer before you can use it; many of the most commonly used ones are installed as part of Anaconda.
“Comments” provide a way of explaining to human readers what parts of a program are supposed to do (but are completely ignored by Python itself). Typing the symbol # begins a comment, which continues until the end of the line.
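As a brief illustration of both ideas, the cell below imports from `math`, a module that comes with Python itself (chosen here only because it needs no installation), and uses comments to explain each step. It is a sketch of two common import styles rather than a complete account of how imports work.

```python
# Style 1: import the whole module, then use its contents with the "math." prefix.
import math
print(math.sqrt(16))   # the square root of 16

# Style 2: import particular names from the module, so no prefix is needed.
# (Writing "from math import *" would import every name the module provides.)
from math import sqrt, pi
print(sqrt(16))        # the same function, called without the prefix
print(pi)              # a constant provided by the module
```

Everything after each `#` is ignored by Python, so removing the comments would not change what the program does.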
N.B. You must install the “ctext” module before running the code below. If you get an error such as “ImportError: No module named 'ctext'” (recent versions of Python 3 report this as “ModuleNotFoundError”) when you try to run the code, refer to the instructions for how to install the ctext module.
from ctext import * # This module gives us direct access to data from ctext.org
setapikey("demo") # This allows us access to the data used in these tutorials
paragraphs = gettextasparagrapharray("ctp:analects/xue-er")
print("This chapter is made up of " + str(len(paragraphs)) + " paragraphs. These are:")
# For each paragraph of the chapter data that we downloaded, do the following:
for paragraphnumber in range(0, len(paragraphs)):
    print(str(paragraphnumber+1) + ". " + paragraphs[paragraphnumber])
‘Variables’ are named entities that contain some kind of data that can be changed at a later date. We will look at these in much more detail over the next few weeks. For now, you can think of them as named boxes which can contain any kind of data.
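A minimal sketch of the “named boxes” idea follows. The variable names and values here are invented purely for illustration and are not used elsewhere in the tutorial.

```python
text_name = "Analects"   # a variable containing some text
paragraphs_read = 0      # a variable containing a number

# The contents of a variable can be replaced at any time.
paragraphs_read = paragraphs_read + 1
paragraphs_read = paragraphs_read + 1

# Numbers must be converted with str() before being joined to text.
print("Paragraphs of the " + text_name + " read so far: " + str(paragraphs_read))
```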
Once we have data stored in a variable (like the ‘paragraphs’ variable above), we can start processing it in whatever way we want. Often we use other variables to track our progress, like the ‘longest_paragraph’ and ‘longest_length’ variables in the program below.
longest_paragraph = None # We use this variable to record which of the paragraphs we've looked at is longest
longest_length = 0 # We use this one to record how long the longest paragraph we've found so far is
for paragraph_number in range(0, len(paragraphs)):
    paragraph_text = paragraphs[paragraph_number]
    if len(paragraph_text)>longest_length:
        longest_paragraph = paragraph_number
        longest_length = len(paragraph_text)
print("The longest paragraph is paragraph number " + str(longest_paragraph+1) + ", which is " + str(longest_length) + " characters long.")
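For comparison, the same search can be written more compactly with Python’s built-in `max` function. This is only an alternative sketch, not the approach the tutorial relies on, and it uses a short made-up list (`sample_paragraphs`) in place of the downloaded data so that it can be run without an internet connection.

```python
# Made-up sample data standing in for the downloaded paragraphs.
sample_paragraphs = ["學而時習之", "不亦說乎", "有朋自遠方來不亦樂乎"]

# max() returns the item for which the "key" function gives the largest value.
# Here the items are positions in the list, and the key is the length of the
# paragraph at that position.
longest_index = max(range(len(sample_paragraphs)),
                    key=lambda i: len(sample_paragraphs[i]))
longest_text = sample_paragraphs[longest_index]

print("The longest paragraph is paragraph number " + str(longest_index + 1)
      + ", which is " + str(len(longest_text)) + " characters long.")
```

If two paragraphs are equally long, `max` returns the first of them, which matches the behaviour of the loop above.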
Modules allow us to do powerful things like Principal Component Analysis (PCA) and machine learning without having to write any code to perform any of the complex mathematics which lies behind these techniques. They also let us easily plot numerical results within the Jupyter notebook environment.
For example, the following code (which we will go through in much more detail in a future tutorial – don’t worry about the contents of it yet) plots the frequencies of the two characters “矣” and “也” in chapters of the Analects versus chapters of the Fengshen Yanyi. (Note: this may take a few seconds to download the data.)
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
def makevector(string, termlist, normalize = False):
    vector = []
    for term in termlist:
        termcount = len(re.findall(term, string))
        if normalize:
            vector.append(termcount/len(string))
        else:
            vector.append(termcount)
    return vector
text1 = gettextaschapterlist("ctp:fengshen-yanyi")
text2 = gettextaschapterlist("ctp:analects")
vectors1 = []
for chapter in text1:
    vectors1.append(makevector(chapter, ["矣", "也"], True))
vectors2 = []
for chapter in text2:
    vectors2.append(makevector(chapter, ["矣", "也"], True))
df1 = pd.DataFrame(vectors1)
df2 = pd.DataFrame(vectors2)
legend1 = plt.scatter(df1.iloc[:,0], df1.iloc[:,1], color="blue", label="Fengshen Yanyi")
legend2 = plt.scatter(df2.iloc[:,0], df2.iloc[:,1], color="red", label="Analects")
plt.legend(handles = [legend1, legend2])
plt.xlabel("Frequency of 'yi'")
plt.ylabel("Frequency of 'ye'")
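To get a rough sense of what the plotted numbers are, the cell below runs the `makevector` function from the code above on a short artificial string (an invented sequence of characters, not a quotation from either text). It is only a sketch and needs no download.

```python
import re

def makevector(string, termlist, normalize=False):
    # Count how often each term occurs in the string; if normalize is True,
    # divide each count by the length of the string to get a frequency.
    vector = []
    for term in termlist:
        termcount = len(re.findall(term, string))
        if normalize:
            vector.append(termcount / len(string))
        else:
            vector.append(termcount)
    return vector

sample = "學而時習之也矣也"   # artificial 8-character example string

counts = makevector(sample, ["矣", "也"])
frequencies = makevector(sample, ["矣", "也"], True)

print(counts)        # raw counts of 矣 and 也
print(frequencies)   # the same counts divided by the length of the string
```

Each chapter in the plot above is reduced to one such pair of frequencies, which then serves as the x and y position of a single dot.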
You can save changes to your notebook using “File” -> “Save and checkpoint”. Note that Jupyter often saves your changes for you automatically, so if you don’t want to save your changes, you might want to make a copy of your notebook first using “File” -> “Make a Copy”.
You should try to avoid having the same notebook open in two different browser windows or browser tabs at the same time. (If you do this, both pages may try to save changes to the same file, overwriting each other’s work.)
Exercises
Before we start writing programs, we need to get familiar with the Jupyter Notebook programming environment. Check that you can complete the following tasks:
- On your computer, run each of the program cells above this one, checking that each of the short programs produces the expected output.
- Clear all of the output using “Cell” -> “All output” -> “Clear”, then run one or two of them again.
- In Jupyter, each cell in a notebook can be run independently. Sometimes the order in which cells are run is important. Try running the following three cells in order, then see what happens when you run them in a different order. Make sure you understand why in some cases you get different results.
number_of_things = 1
print(number_of_things)
number_of_things = number_of_things + 1
print(number_of_things)
- Some of the programs in this notebook are very simple. Modify and re-run them to perform the following tasks:
  - Print out the squares of the numbers 3 through 20 (instead of 1 through 12)
  - Print out the cubes of the numbers 3 through 20 (i.e. 3 x 3 x 3 = 27, 4 x 4 x 4 = 64, etc.)
  - Instead of printing passages from the first chapter of the Analects, print passages from the Daodejing, and determine the longest passage in it. The URN for the Daodejing is: ctp:dao-de-jing
- Often when programming you’ll encounter error messages. The following line contains a bug; try running it, and look at the output. Work out which part of the error message is most relevant, and see if you can find an explanation on the web (e.g. on StackOverflow) and fix the mistake.
print("The answer to life the universe and everything is: " + 42) # This statement is incorrect and isn't going to work
- Sometimes a program will take a long time to run – or even run forever – and you’ll need to stop it. Watch what happens to the circle beside the text “Python 3” at the top-right of the screen when you run the cell below.
- While the cell below is running, try running the cell above. You won’t see any output until the cell below has finished running.
- Run the cell below again. While it’s running, interrupt its execution by clicking “Kernel” -> “Interrupt”.
import time
for number in range(1,21):
    print(number)
    time.sleep(1)
- The cell below has been set as a “Markdown” cell, making it a text cell instead of a program (“code”) cell. Work out how to make the cell run as a program.
for number in range(1,11):
    print("1/" + str(number) + " = " + str(1/number)) # In many programming languages, the symbol "/" means "divided by"
- Experiment with creating new cells below this one. Make some text cells, type something in them, and run them. Copy and paste some code from above into code cells, and run them too. Try playing around with simple modifications to the code.
- (Optional) You can make your text cells look nicer by including formatting instructions in them. The way of doing this is called “Markdown” – there are many good introductions available online.
- Lastly, save your modified notebook and close your web browser. Shut down the Python server process, then start it again, and reload your modified notebook. Make sure you can also find the saved notebook file in your computer’s file manager (e.g. “Windows Explorer”/“File Explorer” on Windows, or “Finder” on Mac OS X).
Further reading:
- Jupyter Notebook Users Manual, Bryn Mawr College Computer Science – This provides a thorough introduction to Jupyter features. This guide introduces many more features than we will need to use, but is a great reference.