Classical Chinese DH: Getting started
=====
*By [Donald Sturgeon](http://dsturgeon.net/about)*

\[[View this notebook online](http://digitalsinology.org/classical-chinese-dh-getting-started)\] \[[Download this notebook](http://digitalsinology.org/notebooks/classical-chinese-dh-1.ipynb)\] \[[List of tutorials](http://digitalsinology.org/classical-chinese-digital-humanities/)\]

### Welcome to our first Jupyter Notebook!

A [notebook](http://jupyter-notebook-beginner-guide.readthedocs.org/en/latest/what_is_jupyter.html) is a [hypertext](https://en.wikipedia.org/wiki/Hypertext) document containing a mixture of textual content (like the part you're reading now) and computer programs - lists of instructions written in a programming language (in our case, the [Python](https://en.wikipedia.org/wiki/Python_%28programming_language%29) language) - as well as the output of these programs.

### Using the Jupyter environment

Before getting started with Python itself, it's important to get some basic familiarity with the user interface of the Jupyter environment. Jupyter is fairly intuitive to use, partly because it runs in a web browser and so works a lot like any web page. Basic principles:

* Each "notebook" displays as a single page. Notebooks are opened and saved using the menus and icons shown **within** the Jupyter window (i.e. the menus and icons under the Jupyter logo and icon, **not** the menus / icons belonging to your web browser).

* Notebooks are made up of "cells". Each cell is displayed on the page in a long list, one below another. You can see which parts of the notebook belong to which cell by clicking once on the text: this selects the cell containing that text and outlines it with a grey line.

* Usually a cell contains either text (like this one - in Jupyter this is called a "Markdown" cell), or Python code (like the one below this one).

* Click once on a program cell to edit it; double-click on a text cell to edit it. Try double-clicking on this cell.

* When you start editing a text cell, the way it is displayed changes so that you can see (and edit) any formatting codes in it. To return the cell to the "normal" prettified display, you need to "Run" it. You can run a cell by either:
 * choosing "Run" from the "Cell" menu above,
 * pressing shift-return when the cell is selected, or
 * clicking the "Run cell" icon.
* "Run" this cell so that it returns to the original mode of display.


In [None]:
for number in range(1,13):
    print(str(number) + "*" + str(number) + " = " + str(number*number))

The program in a cell doesn't do anything until you ask Jupyter to run (a.k.a. "execute") it - in other words, ask the system to start following the instructions in the program. You can execute a cell by clicking somewhere in it so it's selected, then choosing "Run" from the "Cell" menu (or by pressing shift-return).

When you run a cell containing a Python program, any output that the program generates is displayed directly below that cell. If you modify the program, you'll need to run it again before you will see the modified result.

A lot of the power of Python and Jupyter comes from the ability to easily make use of modules written by other people. Modules are included using lines like "from ... import \*".
A module needs to be installed on your computer before you can use it; many of the most commonly used ones are installed as part of Anaconda.
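
As a minimal sketch of the import syntax, the example below uses Python's built-in `math` module. This module is chosen purely for illustration (it is not used elsewhere in this notebook) because it needs no installation. Instead of `*`, which imports every name a module provides, the sketch lists just the two names it needs.

```python
# Import two specific names from Python's standard "math" module.
# (Writing "from math import *" would import every name instead.)
from math import sqrt, pi

# sqrt() and pi can now be used directly, without a "math." prefix
root = sqrt(16)
print(root)
print(pi)
```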

"Comments" provide a way of explaining to human readers what parts of a program are supposed to do (but are completely ignored by Python itself). Typing the symbol # begins a comment, which continues until the end of the line.

**N.B.** You must install the "ctext" module before running the code below. If you get the error "ImportError: No module named 'ctext'" (reported as "ModuleNotFoundError" in newer versions of Python) when you try to run the code, [refer to the instructions](http://digitalsinology.org/classical-chinese-dh-getting-started/) for how to install the ctext module.

In [None]:
from ctext import *  # This module gives us direct access to data from ctext.org
setapikey("demo")    # This allows us access to the data used in these tutorials

paragraphs = gettextasparagrapharray("ctp:analects/xue-er")

print("This chapter is made up of " + str(len(paragraphs)) + " paragraphs. These are:")

# For each paragraph of the chapter data that we downloaded, do the following:
for paragraphnumber in range(0, len(paragraphs)):
    print(str(paragraphnumber+1) + ". " + paragraphs[paragraphnumber])
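
The cell above uses comments both on lines of their own and at the end of lines of code. The short sketch below shows the same two forms in a program that needs no downloaded data; the variable name `total` is made up for this illustration.

```python
# This whole line is a comment, so Python ignores it
total = 2 + 3  # A comment can also follow code on the same line
print(total)
```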

'Variables' are named entities that hold some kind of data, and that data can be changed later on. We will look at these in much more detail over the next few weeks. For now, you can think of them as named boxes which can contain any kind of data.
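
The "named box" idea can be sketched in a few lines. The variable name `chapter_title` and the values stored in it are made up for this illustration.

```python
# Put some text into a "box" named chapter_title
chapter_title = "Xue Er"
print(chapter_title)

# The same box can later be given different contents
chapter_title = "Wei Zheng"
print(chapter_title)

# A box can hold other kinds of data too, such as a number
chapter_count = 20
print(chapter_count)
```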

Once we have data stored in a variable (like the 'paragraphs' variable above), we can start processing it in whatever way we want. Often we use other variables to track our progress, like the 'longest_paragraph' and 'longest_length' variables in the program below.

In [None]:
longest_paragraph = None # We use this variable to record which of the paragraphs we've looked at is longest
longest_length = 0       # We use this one to record how long the longest paragraph we've found so far is

for paragraph_number in range(0, len(paragraphs)):
    paragraph_text = paragraphs[paragraph_number]
    if len(paragraph_text)>longest_length:
        longest_paragraph = paragraph_number
        longest_length = len(paragraph_text)

print("The longest paragraph is paragraph number " + str(longest_paragraph+1) + ", which is " + str(longest_length) + " characters long.")
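
The same result can also be reached more compactly with Python's built-in `max()` function. The sketch below is an alternative to the loop above, not a replacement for it. So that it can run without downloading anything, it uses a small made-up list called `sample_paragraphs` in place of the real chapter data.

```python
# A stand-in for the downloaded chapter data
sample_paragraphs = ["short", "a much longer paragraph", "medium one"]

# max() with key=len returns the longest string in the list
longest_text = max(sample_paragraphs, key=len)

# index() tells us where that string sits in the list (counting from 0)
longest_index = sample_paragraphs.index(longest_text)

print("The longest paragraph is paragraph number " + str(longest_index + 1) + ", which is " + str(len(longest_text)) + " characters long.")
```

If two paragraphs tie for longest, both this sketch and the loop above report the first of them.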

Modules allow us to do powerful things like Principal Component Analysis (PCA) and machine learning without having to write any code to perform any of the complex mathematics which lies behind these techniques. They also let us easily plot numerical results within the Jupyter notebook environment.

For example, the following code (which we will go through in much more detail in a future tutorial - don't worry about the contents of it yet) plots the frequencies of the two characters "矣" and "也" in chapters of the Analects versus chapters of the Fengshen Yanyi. (Note: this may take a few seconds to download the data.)

In [None]:
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  

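def makevector(string, termlist, normalize = False):
    # Count how often each term in termlist occurs in string
    vector = []
    for term in termlist:
        # Each term is treated as a regular expression pattern
        termcount = len(re.findall(term, string))
        if normalize:
            # Divide by the text length so texts of different sizes are comparable
            vector.append(termcount/len(string))
        else:
            vector.append(termcount)
    return vector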

text1 = gettextaschapterlist("ctp:fengshen-yanyi")
text2 = gettextaschapterlist("ctp:analects")

vectors1 = []
for chapter in text1:
    vectors1.append(makevector(chapter, ["矣", "也"], True))

vectors2 = []
for chapter in text2:
    vectors2.append(makevector(chapter, ["矣", "也"], True))

df1 = pd.DataFrame(vectors1)
df2 = pd.DataFrame(vectors2)

legend1 = plt.scatter(df1.iloc[:,0], df1.iloc[:,1], color="blue", label="Fengshen Yanyi")
legend2 = plt.scatter(df2.iloc[:,0], df2.iloc[:,1], color="red", label="Analects")
plt.legend(handles = [legend1, legend2])
plt.xlabel("Frequency of 'yi'")
plt.ylabel("Frequency of 'ye'")

You can save changes to your notebook using "File" -> "Save and checkpoint". Note that Jupyter often saves your changes for you automatically, so if you *don't* want to save your changes, you might want to make a copy of your notebook first using "File" -> "Make a Copy".

You should try to avoid having the same notebook open in two different browser windows or browser tabs at the same time. (If you do this, both pages may try to save changes to the same file, overwriting each other's work.)

Exercises
----
Before we start writing programs, we need to get familiar with the Jupyter Notebook programming environment. Check that you can complete the following tasks:

* On your computer, run each of the program cells above this one, checking that each short program produces the expected output.
* Clear all of the output using "Cell" -> "All output" -> "Clear", then run one or two of them again.
* In Jupyter, each cell in a notebook can be run independently. Sometimes the _order_ in which cells are run is important. Try running the following three cells in order, then see what happens when you run them in a different order. Make sure you understand why in some cases you get different results.

In [None]:
number_of_things = 1

In [None]:
print(number_of_things)

In [None]:
number_of_things = number_of_things + 1
print(number_of_things)

* Some of the programs in this notebook are very simple. Modify and re-run them to perform the following tasks:
 * Print out the squares of the numbers 3 through 20 (instead of 1 through 12)
 * Print out the cubes of the numbers 3 through 20 (i.e. 3 x 3 x 3 = 27, 4 x 4 x 4 = 64, etc.)
 * Instead of printing passages from the first chapter of the Analects, print passages from the Daodejing, and determine the longest passage in it. The URN for the Daodejing is: ctp:dao-de-jing
 
* Often when programming you'll encounter error messages. The following line contains a bug; try running it, and look at the output. Work out which part of the error message is most relevant, and see if you can find an explanation on the web (e.g. on StackOverflow) and fix the mistake.

In [None]:
print("The answer to life the universe and everything is: " + 42)  # This statement is incorrect and isn't going to work

* Sometimes a program will take a long time to run - or even run forever - and you'll need to stop it. Watch what happens to the circle beside the text "Python 3" at the top-right of the screen when you run the cell below.
* While the cell below is running, try running the cell above. You won't see any output until the cell below has finished running.
* Run the cell below again. While it's running, interrupt its execution by clicking "Kernel" -> "Interrupt".

In [None]:
import time

for number in range(1,21):
    print(number)
    time.sleep(1)


* The cell below has been set as a "Markdown" cell, making it a text cell instead of a program ("code") cell. Work out how to make the cell run as a program.

for number in range(1,11):
    print("1/" + str(number) + " = " + str(1/number))  # In many programming languages, the symbol "/" means "divided by"

* Experiment with creating new cells below this one. Make some text cells, type something in them, and run them. Copy and paste some code from above into code cells, and run them too. Try playing around with simple modifications to the code.
* (Optional) You can make your text cells look nicer by including formatting instructions in them. The way of doing this is called "Markdown" - there are many [good introductions](https://athena.brynmawr.edu/jupyter/hub/dblank/public/Jupyter%20Notebook%20Users%20Manual.ipynb#4.-Using-Markdown-Cells-for-Writing) available online. 
* Lastly, save your modified notebook and close your web browser. Shut down the Python server process, then start it again, and reload your modified notebook. Make sure you can also find the saved notebook file in your computer's file manager (e.g. "Windows Explorer"/"File Explorer" on Windows, or "Finder" on Mac OS X).

**Further reading:**
* [Jupyter Notebook Users Manual](https://athena.brynmawr.edu/jupyter/hub/dblank/public/Jupyter%20Notebook%20Users%20Manual.ipynb), Bryn Mawr College Computer Science - _This provides a thorough introduction to Jupyter features. This guide introduces many more features than we will need to use, but is a great reference._

<div style="float: right;"><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a></div>