China Biographical Database

The China Biographical Database (CBDB) is a relational database of Chinese historical figures from the 7th to 19th centuries. It provides biographical information (names, dates of birth and death, ancestral place, degrees and offices held, kinship and social associations, etc.) on approximately 417,000 individuals as of August 2017.

History
CBDB has its origins in the work of the late social historian Robert M. Hartwell (1932-1996). Hartwell first conceived of using a relational database to study the social and family networks of Song Dynasty officials. Aware of the lack of large datasets for research on the social history of middle-period China, he began collecting data himself, with the aim of answering questions about historical change through data analysis. Hartwell structured his data around persons, places, the bureaucratic system, kinship and social associations.

Professor Hartwell bequeathed the program, which by then consisted of more than 25,000 individuals, a bibliographic database of over 4,500 titles, and his work on a historical GIS, to the Harvard-Yenching Institute, which later gave up its interest. Professor Peter K. Bol at Harvard organized the effort to make Hartwell's work publicly available beginning in 2005. Michael A. Fuller, Professor of Chinese Literature at UC Irvine, started to redesign the application; Professor Deng Xiaonan of Peking University led graduate students at the Center for Research on Ancient Chinese History (北京大學中國古代史研究中心) in revising the contents of the database; and Professor Lau Nap-yin of the Institute of History and Philology at Academia Sinica (中研院歷史語言研究所) arranged to make digital reference works available for the project. Thanks to the efforts of many contributors, the database has been greatly expanded in both temporal scope and coverage. CBDB is now owned and administered by the Fairbank Center for Chinese Studies at Harvard, the Institute of History and Philology at Academia Sinica, and the Center for Research on Ancient Chinese History at Peking University. More information about its history, funders and contributors can be found at the CBDB website.

Limitations and Strengths
CBDB extracts data from extant sources using computational data mining techniques. By preference it uses systematically structured sources, because these can be mined systematically. This means that it does not undertake in-depth research on individuals, although it is possible for qualified researchers to add data to CBDB based on their own research. The aim is to extract and code the data accurately as given in the sources rather than to check the accuracy of the sources. Thus factual errors in a source and contradictory information from different sources may well be included in the entries; CBDB does not rank one source above another, although it does differentiate between a primary biographical source and biographical information mentioned in passing in another source. It follows that CBDB at best represents what has survived over time, which is ever less the further into the past we proceed. Currently CBDB persons are for the most part from the seventh through the early twentieth century (from the Tang through the Qing dynasty). It is a sampling of the past. For example, grave biographies (epitaphs 墓誌銘) are an important source for kinship associations, but only a few tens of thousands have survived. Similarly, only a portion of literary collections have survived, and these have yet to be mined systematically. Because of the nature of the sources, career data will be biased toward officials.

Although CBDB can be used for biographical information on an individual, it is not meant to serve as a biographical dictionary. Rather, it is a large and growing assemblage of data about persons, careers, modes of entry into office, kinship, social associations and writings that can be queried to see larger trends as they change over time and vary across space. When large amounts of data are taken into consideration, a small percentage of errors, whether from historical sources or from mistakes in coding, has little effect. A relational database such as this offers much that biographical dictionaries cannot, by giving the user the ability to launch queries and set the parameters of the variables.
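The idea of launching a query and setting its parameters can be illustrated with a minimal sketch. The table names, column names and rows below are hypothetical and are not taken from the CBDB schema; the sketch only shows how a unique person ID lets a biographical table be joined to a category table, and how the user supplies the mode of entry and a span of years as parameters.

```python
import sqlite3

# Hypothetical miniature schema and placeholder rows (NOT the real CBDB tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, index_year INTEGER);
CREATE TABLE entry (person_id INTEGER, entry_mode TEXT);
INSERT INTO person VALUES (1, 'Person A', 1050), (2, 'Person B', 1120), (3, 'Person C', 1300);
INSERT INTO entry VALUES (1, 'examination'), (2, 'yin privilege'), (3, 'examination');
""")

def persons_by_entry(conn, mode, start_year, end_year):
    """Return (name, year) for persons with the given entry mode in a year span."""
    rows = conn.execute(
        """
        SELECT p.name, p.index_year
        FROM person AS p
        JOIN entry AS e ON e.person_id = p.person_id
        WHERE e.entry_mode = ? AND p.index_year BETWEEN ? AND ?
        ORDER BY p.index_year
        """,
        (mode, start_year, end_year),
    ).fetchall()
    return rows

# The user sets the parameters: mode of entry and a span of years.
print(persons_by_entry(conn, "examination", 1000, 1200))
```

Changing the three arguments changes the question being asked without changing the query itself, which is what a printed biographical dictionary cannot offer.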

Over the long term, CBDB aims to mine the available sources comprehensively and to represent accurately the biographical data in China's historical record.

CBDB Contents
The figure on the right shows persons in CBDB distributed across dynastic periods as of January 2018. The variation across dynastic periods has much to do with the sources used. For example, the high number of persons for the Ming period is the result of mining the nearly complete record of Ming jinshi degree holders, which includes not only the names of M(other), F(ather), FF, and FFF, but also the names of B+ (older brother) and B- (younger brother).

By rule, CBDB assigns a person to a single dynastic period based on the date of death, although much of the person's career may have taken place during the previous dynasty. The date of death is lacking for a majority of figures; in these cases CBDB relies on the index year. The index year is a heuristic that represents the surmised year in which a person was in the sixtieth year of life (60 sui in Chinese terms or 59 years old in Western terms), or the year of death if the person died before reaching that age. The index year is estimated using a variety of rules, based on averages of all CBDB data. For example, on average men attain the jinshi degree in their thirtieth year, a wife is two and a half years younger than her husband, the first surviving son is born in his father's thirtieth year, and so on. Thus if one date is certain within a family, index years can be estimated for other family members. Generally this works well, but if it is extended across more than two generations up or down, the reliability of the index year decreases greatly. The index year is essential for queries with temporal parameters.
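The arithmetic behind these rules can be sketched as follows. This is an illustration under stated assumptions, not CBDB's actual procedure: the function names are hypothetical, only the three averages quoted above are used, and rounding the half year to a whole year is an assumption made here.

```python
import math

# Averages quoted in the text; ages are counted in sui (year of birth = 1 sui).
INDEX_SUI = 60                 # index year = year of the sixtieth year of life
JINSHI_SUI = 30                # jinshi degree in the thirtieth year of life
FATHER_SUI_AT_FIRST_SON = 30   # first surviving son born in father's thirtieth year
WIFE_YEARS_YOUNGER = 2.5       # wife is two and a half years younger

def index_from_birth(birth_year, death_year=None):
    """Year of the sixtieth year of life, or the death year if that comes first."""
    index = birth_year + INDEX_SUI - 1
    if death_year is not None and death_year < index:
        return death_year
    return index

def index_from_jinshi(jinshi_year):
    """Estimate the index year from the year the jinshi degree was attained."""
    birth_year = jinshi_year - (JINSHI_SUI - 1)
    return index_from_birth(birth_year)

def son_index_from_father(father_index):
    """Estimate a first son's index year; assumes father_index is a 60-sui year."""
    return father_index + (FATHER_SUI_AT_FIRST_SON - 1)

def wife_index_from_husband(husband_index):
    """Estimate a wife's index year; rounding half up is an assumption."""
    return math.floor(husband_index + WIFE_YEARS_YOUNGER + 0.5)

# Hypothetical example: a man known only to have attained the jinshi in 1057.
husband = index_from_jinshi(1057)
print(husband, son_index_from_father(husband), wife_index_from_husband(husband))
```

Each step from one family member to another adds the uncertainty of another average, which is why estimates extended over several generations become unreliable.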

CBDB collects many kinds of data on individuals; the number of data points by category is given in the figure on the left. For each category there is a code table in the database. The main biographical data table assigns each person a unique ID that can be used in various data tables. It codes 235 kinds of Social Associations, which are further categorized by type, the main ones being Writings, Politics and Scholarship. There are 20 Biographical Address codes, including: place of birth, death and burial; basic affiliation (jiguan 籍貫); ancestral address; membership in the Eight Banner system of the Qing dynasty; former address; etc. The seventeen Alternate Name codes include: courtesy name (zi 字), studio names, posthumous name, dharma name, birth order name, childhood names, etc. Every possible kinship relationship in the sources is coded. However, the goal is to reduce these relations to the shortest distance (e.g. F-S(on), H(usband)-W(ife)) and rely on computation to generate family trees on demand. Entry into office codes a wide variety of modes of entry, including: many types of examination, recommendation, yin privilege, purchase, etc. Office postings include all office titles and ranks in a dynasty, which in turn can be accessed through a hierarchical tree (allowing one to query all holders of positions within a part of the bureaucratic structure), and places of service for local officials. Social distinction is used in particular to identify the reputation of persons irrespective of office (e.g. poet, artist, monk, merchant). Texts include the titles of both extant and lost works of a person; when possible the bibliographic class is included.
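How longer kinship relations can be computed from shortest-distance relations may be sketched as follows. This is a minimal illustration and not CBDB's implementation: the person IDs, the tuple layout and the function name are hypothetical, and only the kinship codes F and M from the text are used. Starting from one person, the sketch walks the stored one-step relations and concatenates their codes, so that FF and FFF are derived rather than stored.

```python
from collections import deque

# Hypothetical one-step relations: (person_id, kinship_code, relative_id).
# Read as "the F(ather) of person 1 is person 2", and so on.
DIRECT = [
    (1, "F", 2),
    (2, "F", 3),
    (3, "F", 4),
    (1, "M", 5),
    (4, "F", 7),
]

def derive_kin(direct, start, max_steps=3):
    """Return {relative_id: code path} for relatives within max_steps of start."""
    by_person = {}
    for person, code, relative in direct:
        by_person.setdefault(person, []).append((code, relative))

    found = {}
    queue = deque([(start, ())])  # breadth-first, so shortest paths come first
    while queue:
        person, path = queue.popleft()
        if len(path) == max_steps:
            continue
        for code, relative in by_person.get(person, []):
            if relative == start or relative in found:
                continue
            new_path = path + (code,)
            found[relative] = "".join(new_path)
            queue.append((relative, new_path))
    return found

print(derive_kin(DIRECT, start=1))
```

Person 7 is four steps from person 1 and is therefore left out when max_steps is 3; raising the limit extends the derived tree.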

Visualizations
CBDB serves as a data resource for prosopographical research. The data can be queried and then copied into a tool for statistical analysis and visualization. This is illustrated by the two figures contrasting the median age at death for all persons in CBDB with the median age at death for CBDB women. The difference, obscured when gender is not differentiated, is due to the higher mortality of women of child-bearing age. About ten percent of CBDB persons are women.

The results of querying CBDB data can be exported for use in two other kinds of analytic visualization: geographic information systems and social network analysis. To illustrate the value of both, consider three visualizations of the same dataset. Song Lian 宋濂 (1310-81) has 452 social associations in CBDB. This visualization of his social network includes only his literary associations (e.g. letter exchanges, inscriptions written for others) and scholarly associations (e.g. student-teacher relationships), filtered to include only those persons who have at least one connection to Song Lian and at least one connection to another person in the network. Pajek, free network analysis software, was used for the graph on the left. Gephi was used for the graph in the center; in this case colors identify subgroups within the network. CBDB also supports export to UCInet. Mapping Song's associations shows that his network is national in reach but overwhelmingly local. This visualization used Quantum GIS, freeware for PCs and Macs, along with prefectural boundaries, rivers, and a digital elevation model freely available from the China Historical GIS (CHGIS) v. 6. CBDB can also export in KML to Google Earth. CBDB makes it possible to generate data based on simple and complex queries. One could, for example, find all those who came from a certain place and then discover the social and kinship connections among those from that place who entered government through the civil service examination within a certain span of years.
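The network filter described above can be sketched in a few lines. This is an illustration only: the function names and the placeholder labels ("Ego", "A" to "D") are hypothetical, associations are treated as undirected, and the output is a minimal Pajek-style vertex and edge list rather than CBDB's own export.

```python
def filter_ego_network(edges, ego):
    """Keep persons tied to ego AND to at least one other person tied to ego."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    alters = neighbours.get(ego, set())
    kept = {p for p in alters if neighbours[p] & alters}
    nodes = kept | {ego}
    kept_edges = sorted(
        {tuple(sorted((a, b))) for a, b in edges
         if a != b and a in nodes and b in nodes}
    )
    return sorted(nodes), kept_edges

def to_pajek(nodes, edges):
    """Write a minimal Pajek .net text: numbered vertices, then undirected edges."""
    number = {name: i for i, name in enumerate(nodes, start=1)}
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{number[name]} "{name}"' for name in nodes]
    lines.append("*Edges")
    lines += [f"{number[a]} {number[b]}" for a, b in edges]
    return "\n".join(lines) + "\n"

# Placeholder associations: C is tied to the ego but to no one else in the
# ego's circle, and D is not tied to the ego at all, so both are dropped.
EDGES = [("Ego", "A"), ("Ego", "B"), ("Ego", "C"), ("A", "B"), ("C", "D")]
nodes, kept_edges = filter_ego_network(EDGES, "Ego")
print(to_pajek(nodes, kept_edges))
```

Because associations are treated as undirected, one pass is enough: any person kept because of a tie to a second person guarantees that the second person is kept as well.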