Difference between revisions of "Chinese Text Project"
Latest revision as of 13:57, 9 May 2018
The Chinese Text Project (CTP; 中國哲學書電子化計劃) is an open-access digital library project that offers a wide range of functionality for transcribing, navigating, and searching early Chinese texts. It aims to provide accessible and accurate versions of a wide range of texts,[1] particularly those relating to Chinese philosophy. The site is credited with hosting one of the most comprehensive and accurate collections of classical Chinese texts on the Internet,[2][3] and is regarded as one of the most useful textual databases for scholars of early Chinese texts.[4][5]
Through integrated functionality as well as external tools connected through an application programming interface (API), the system facilitates a wide range of digital analyses of pre-modern textual works. Together with the CTP API, a plugin system provides direct connections from within the user interface to other projects including Text Tools, TextRef, and MARKUS.
Site contents
The core content of the site consists of digitized copies of published editions and extant manuscripts of pre-modern Chinese works, together with digital transcriptions created on the basis of these editions. These two types of representation, image and transcription, are closely linked to one another. An advantage of this two-sided interface is that a user can work with the text in what appears to be an ordinary full-text view while still having immediate access to any part of the historical source images from which the full text was transcribed, which makes it efficient to confirm the accuracy of a transcription. The page-by-page view also gives direct access to an open crowdsourcing interface for making corrections and annotations to the text, for example to correct text transcribed using OCR or to add modern punctuation to unpunctuated texts.
Texts are divided into pre-Qin and Han texts, and post-Han texts, with the former categorized by school of thought and the latter by dynasty. The ancient (pre-Qin and Han) section of the database contains over 5 million Chinese characters, the post-Han database over 20 million characters, and the publicly editable wiki section over 5 billion characters.[6] Many texts also have English and Chinese translations, which are paired with the original text paragraph by paragraph as well as phrase by phrase for ease of comparison; this makes it possible for the system to be used as a scholarly research tool even by students with little or no knowledge of Chinese.[7] Many works are available in multiple versions, with each transcription following (and often linked to images of) a particular historical edition of the text.
As well as providing customized search functionality suited to Chinese texts,[8][9] the site also attempts to make use of the unique format of the web to offer a range of features relevant to sinologists, including an integrated dictionary, word lists, parallel passage information,[10] scanned source texts, concordance and index data,[11] a metadata system, Chinese commentary display,[12] a published resources database, and a discussion forum in which threads can be linked to specific data on the site.[13][14] The "Library" section of the site also includes scanned copies of over 25 million pages of early Chinese texts,[15][16] linked line by line to transcriptions in the full-text database, many created using optical character recognition (OCR),[17] and edited and maintained using an online crowdsourcing wiki system.[18][19] Textual data and metadata can also be exported using an application programming interface, allowing integration with other online tools as well as use in text mining and digital humanities projects.[18][20]
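As a rough illustration of what "parallel passage information" involves, the following minimal sketch finds character n-grams shared between two texts. It is not the method used by the site or described in the cited paper; it is a generic n-gram overlap approach. The function names, the text labels, and the second sample string (an invented variant sentence) are all hypothetical and exist only for this example.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shared_ngrams(texts, n=4):
    """Map each n-gram to the set of text names containing it,
    keeping only n-grams that occur in at least two different texts."""
    index = defaultdict(set)
    for name, text in texts.items():
        for gram in char_ngrams(text, n):
            index[gram].add(name)
    return {gram: names for gram, names in index.items() if len(names) > 1}


# Illustrative input only: the opening of Analects 1.1 without punctuation,
# and an invented sentence that reuses part of it.
samples = {
    "analects_1_1": "學而時習之不亦說乎",
    "invented_variant": "子曰學而時習之可也",
}

result = shared_ngrams(samples, n=4)
for gram in sorted(result):
    print(gram, sorted(result[gram]))
```

A real system would typically merge overlapping n-grams into longer passages and tolerate small variations between witnesses; this sketch only reports exact matches.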
Functionality
A number of functions are integrated directly into the system itself, with many others accessible through external plugins and APIs. Core functionality includes an integrated dictionary, which summarizes available information about words and characters from knowledge encoded throughout the system itself, such as citations from historical dictionaries, phonetic annotations from various historical sources, and attested usage of the term, together with translations (where available). The dictionary also allows lookup of non-Unicode characters where these have been attested in specific locations within the database.
Specially adapted optical character recognition (OCR), developed for the project and achieving greatly reduced error rates compared with alternative methods, is used extensively within the system to provide transcriptions of many texts and editions not previously available in digital form.[21] Transcriptions created through OCR are used to enable full-text search of scanned images of early editions, including those provided by university libraries and other large-scale scanning projects such as the Harvard Yenching Chinese Rare Books Digitization Project. Users of the system collaboratively edit the resulting transcriptions to correct OCR errors as well as to add modern punctuation and other annotations to the texts.
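OCR error rates of the kind mentioned above are commonly expressed as a character error rate: the edit distance between the OCR output and a trusted transcription, divided by the length of the trusted transcription. The sketch below shows that general calculation. It is an assumption that this matches how the project measures its own error rates, and the sample strings (including the simulated misreading) are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b: the minimum number of
    single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance from reference to hypothesis, per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings: a 9-character reference and a simulated OCR output
# in which one character (不) has been misread as a similar one (下).
reference = "學而時習之不亦說乎"
ocr_output = "學而時習之下亦說乎"

cer = character_error_rate(reference, ocr_output)
print(f"character error rate: {cer:.3f}")
```

The same function can be applied page by page to compare different OCR systems against a hand-corrected transcription.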
As well as providing a mechanism for close integration with external tools and projects, the CTP API and plugin system also provide a powerful means for programmatic access to textual data for use in text mining research and digital humanities teaching. External tools such as Text Tools facilitate browser-based analyses of word usage, text reuse, document similarity, and other aspects of texts contained in the system as well as interactive visualization of results. A Python module interfacing with the same API allows for more specialized data mining research.
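The following sketch illustrates the general shape of programmatic access over an HTTP JSON API. The base URL `https://api.ctext.org`, the `gettext` endpoint name, the `urn` parameter, and the `title`, `fulltext`, and `error` response fields are assumptions made for this example and should be checked against the CTP API documentation. The response body is a hand-written mock, so the code runs without network access, and it does not use the Python module mentioned above.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.ctext.org"  # assumed base URL, not confirmed by this article


def build_gettext_url(urn, api_base=API_BASE):
    """Build a request URL for one textual item identified by a URN.
    The endpoint name 'gettext' and the 'urn' parameter are assumptions."""
    return f"{api_base}/gettext?{urlencode({'urn': urn})}"


def paragraphs_from_response(payload):
    """Extract the list of paragraphs from a JSON response body.
    Assumes paragraphs arrive in a 'fulltext' list and failures in an 'error' field."""
    data = json.loads(payload)
    if "error" in data:
        raise ValueError(f"API returned an error: {data['error']}")
    return data.get("fulltext", [])


# Hand-written mock response used in place of a real network call.
mock_payload = json.dumps({
    "title": "學而",
    "fulltext": ["子曰：「學而時習之，不亦說乎？」"],
})

url = build_gettext_url("ctp:analects/xue-er")
paragraphs = paragraphs_from_response(mock_payload)
print(url)
print(len(paragraphs), "paragraph(s)")
```

In practice the URL would be fetched with an HTTP client and the returned body passed to `paragraphs_from_response`; keeping URL construction and parsing separate from the network call makes both easy to test offline.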
References
1. Elman, Benjamin A. "Classical Historiography for Chinese History: Databases & electronic texts". Princeton University. Retrieved June 3, 2016.
2. Association of Chinese Philosophers in North America (北美中国哲学学者协会)
3. Chris Fraser, Department of Philosophy, University of Hong Kong
4. http://warpweftandway.com/support-the-chinese-text-project/
5. http://languagehat.com/chinese-text-project/
6. http://ctext.org/system-statistics
7. Connolly, Tim (2012). "Learning Chinese Philosophy with Commentaries". Teaching Philosophy. Philosophy Documentation Center. 35 (1): 1–18. Retrieved March 19, 2017.
8. http://ctext.org/instructions/advanced-search
9. http://ctext.org/faq/normalization
10. Sturgeon, Donald (2017). "Unsupervised identification of text reuse in early Chinese literature". Digital Scholarship in the Humanities. Oxford University Press. Retrieved November 21, 2017.
11. Xu, Jiajin (2015). "Corpus-based Chinese studies: A historical review from the 1920s to the present". Chinese Language & Discourse. John Benjamins Publishing Company. 6 (2): 218–244. Retrieved June 3, 2016.
12. Adkins, Martha A. (2016). "Web Review: Online Resources for the Study of Chinese Religion and Philosophy". Theological Librarianship. American Theological Library Association. 9 (2): 5–8. Retrieved November 7, 2016.
13. Holger Schneider and Jeff Tharsen, http://dissertationreviews.org/archives/9213
14. http://ctext.org/introduction
15. http://ctext.org/library.pl?if=en
16. http://ctext.org/system-statistics
17. Sturgeon, Donald (2017). "Unsupervised Extraction of Training Data for Pre-Modern Chinese OCR". The Thirtieth International Flairs Conference. AAAI. Retrieved November 21, 2017.
18. https://cpianalysis.org/2016/06/08/crowdsourcing-apis-and-a-digital-library-of-chinese/, China Policy Institute, University of Nottingham
19. http://ctext.org/instructions/ocr
20. http://ctext.org/tools/api
21. Sturgeon, Donald (2017). "Unsupervised Extraction of Training Data for Pre-Modern Chinese OCR". The Thirtieth International Flairs Conference. AAAI. Retrieved November 21, 2017.