Corpora for Written Chinese:
An Investigation into Their Availability
Abstract
This report investigates the availability of corpora for the Chinese language. The first part is a brief introduction to the history and development of Chinese corpora. The second part describes the current situation of corpora for Mandarin Chinese, including a list of the corpora that exist today. The third part is a closer investigation of three of these corpora, covering their purposes, contents, makers, availability, formatting, annotation, etc. The fourth part describes what kinds of corpora are in the making, and the conclusion considers what kinds of corpora still need to be built in the future.
Key words: corpus, Mandarin Chinese, availability, People's Daily corpus, HSK corpus, LIVAC corpus
1. Introduction: Chinese Corpora and Their History
When talking about corpora, one readily thinks of the Brown corpus, the LOB corpus, the London-Lund corpus, etc., most of which are English corpora. If we turn to corpora of "other languages", we may know the Swedish SUC corpus or the RWC Japanese corpus, but what about Chinese corpora? It seems that little investigation has been made into corpora of this language in Western countries. How has corpus linguistics developed in China? Is it well developed? Are there many Chinese corpora available? Which is the largest and most famous one? And what about the details of these Chinese corpora? Many such questions come to mind. This assignment tries to answer them.
The first Chinese corpus may be the "Applied Glossary of Modern Chinese", which was created in the 1920s and is not machine-readable. Chen Heqin, its maker, collected about five hundred thousand Chinese words, aiming to use them in designing Chinese-language textbooks for primary schools (Feng Zhiwei, 2002, p.3-4). Work on machine-readable corpora in China began in 1979, and corpus linguistics has since become a significant research field in linguistic studies (Feng Zhiwei, 2002, p.5). It is not only widely applied in lexicography, language teaching and machine translation, but is also a main subject of study in colleges and various academic institutions (Journal of Chinese Language and Computing, 11(2) 125).
The existing corpora in China can be divided into three types: Chinese corpora, English corpora, and parallel corpora. This report focuses on corpora for Mandarin Chinese.
2. Corpora for Mandarin Chinese: Current Situation
There are at least 12 relatively large Chinese corpora in existence today; their basic information is given in the following table. Among them, the "National Balanced Corpus" is the largest tagged corpus. In Part 3, I will look more closely at the "People's Daily Corpus" (a corpus for common use), the "LIVAC Corpus" (a corpus for comparative studies), and the "HSK Open Corpus" (a corpus for educational use).
Table 1: Existing Corpora for Mandarin Chinese
Corpus | Maker | Size (million)
Corpus for Contemporary Literature | Wuhan University | 5.27
Corpus for Modern Chinese | Beihang University |
Chinese Corpus for Middle-School Textbooks | Beijing Normal University | 1.068
Word Frequency Counting Corpus | Beijing Language & Culture University | 1.82
HSK Open Corpus | Beijing Language & Culture University | Open corpus
National Balanced Corpus | China Language Committee | unfinished
Corpus of People's Daily | Peking University |
China News Corpus | Shanxi University | 2.5
Untagged Corpus | Shanghai Normal University | Open corpus
Corpus of Writers Digest | Shanghai Normal University | Open corpus
LIVAC (Linguistic Variety in Chinese Communities) | City University of Hong Kong |