Department of Computer Science and Technology

Education background

Bachelor of Computer Science, Tsinghua University, Beijing, China, 1986;

Master of Computer Science, Tsinghua University, Beijing, China, 1988;

Ph.D. in Computational Linguistics, City University of Hong Kong, Hong Kong SAR, 2004.

Social service

Department of Computer Science and Technology, Tsinghua University: Dean (2007-);

Chinese Information Processing Society of China: Vice President (2006-);

Journal of Chinese Information Processing: Editor-in-Chief (2007-);

National Natural Science Foundation of China: Member of Evaluation Group (2007-);

"Research on Chinese-Centric Multilingual Information Processing" Project, National 863 High-Tech Program: Director of General Experts Group (2007-);

Expert Committee of Language Affairs Commission, City of Beijing: Deputy Director (2008-);

Evaluation Group of Academic Degrees Committee, State Council: Member of Computer Science and Technology Sub-Group (2009-);

ACM China Council: Member (2010-).

Areas of Research Interest / Research Projects

Natural Language Processing, Chinese Information Processing

Information Retrieval, Web Intelligence, Social Computing

National Natural Science Foundation of China: Lexicon, Syntax and Semantics: Study on Language Process based on Cognitive Experiments (2001-2003);

National Natural Science Foundation of China: Study on Key Techniques in Chinese Text Categorization (2006-2008);

National Natural Science Foundation of China: Properties, Structures, Evolution and Typical Applications of Chinese Language Networks (2009-2011);

National 863 High-Tech Program: Key Technologies in Semantic Classification and Shallow Understanding of Large-Scale Web Texts and Images (2007-2009);

National 863 High-Tech Program: A Chinese Text Categorization System with High Accuracy (2004-2005);

International Organization for Standardization (ISO): Language Resource Management -- Word Segmentation of Written Texts -- Part 1: Basic Concepts and General Principles (2005-);

The National Basic Research Program of China (The 973 Program): Theories, Methodologies and Tools for Computation of Large-scale Chinese Texts (2000-2003).

Research Status

My research interests are computational linguistics, statistical and corpus-based natural language processing (NLP), Chinese language computing (computational morphology, bilingual terminology extraction), information retrieval (Chinese text categorization, graphical model-based keyword extraction), collective intelligence (tag generation, Web trend analysis) and social computing (query log analysis, community discovery). I have participated as project leader or principal researcher in over 20 projects funded by the National Natural Science Foundation of China, the National Social Science Foundation of China, the National 863 High-Tech Program and the National 973 Basic Research Program, as well as in projects funded by a number of international IT companies. Together with my students, I have published about 130 papers in academic journals and international conferences in the above fields. These papers have roughly 1,400 citations in total on Google Scholar. I have served as a program committee member for numerous national and international conferences, and many times as conference chair or program committee chair.

One of my research focuses is Chinese word segmentation, the most fundamental issue in Chinese information processing. I have proposed some key concepts in word segmentation (such as maximal overlapping segmentation ambiguities, true and pseudo segmentation ambiguities, and local and global statistics), and have developed an integrated Chinese word segmentation and part-of-speech tagging system that exploits several kinds of knowledge: bigrams of words, parts-of-speech and characters, statistical and structural information about named entities, and local statistics of character strings. I have also tried to extend my experience in Chinese word segmentation to other languages in which word segmentation is needed, resulting in an international standard, "ISO/FDIS 24614-1: Language Resource Management -- Word Segmentation of Written Texts -- Part 1: Basic Concepts and General Principles". I am the sole project leader for this ISO standard.

Recently, I have presented an original viewpoint in NLP: NLP based on huge-scale naturally annotated corpora. The basic idea is that, with Web-scale corpora, natural annotation may help machines perform some NLP tasks better. There are two types of natural annotation: explicit (such as punctuation, anchor text, query logs, Wikipedia and blog tags) and implicit (such as language usage patterns). I further put forward a fundamental problem: if we could integrate all the information drawn from naturally annotated corpora from different perspectives, would we be able to achieve some degree of deep understanding of languages? A preliminary study by my students and me, published in Computational Linguistics in 2009, showed the usefulness of punctuation in Chinese word segmentation, suggesting that this idea deserves further study.

Honors And Awards

State Commission of Language Affairs: Nationwide Distinguished Practitioner (2007).

Academic Achievement

[1] Wei Qiao, Maosong Sun, Wolfgang Menzel. Chinese word frequency approximation based on multitype corpora. Journal of Quantitative Linguistics, vol. 17, no. 2, pp. 142-166, 2010.

[2] Zhongguo Li, Maosong Sun. Punctuation as implicit annotations for Chinese word segmentation. Computational Linguistics, vol. 35, no. 4, pp. 505-512, 2009.

[3] Xinghua Fan, Maosong Sun. Knowledge representation and reasoning based on entity and relation propagation diagram/tree. Intelligent Data Analysis, vol. 10, no. 1, pp. 81-102, 2006.

[4] Maosong Sun. LFG for Chinese: Issues of representation and computation. Journal of Chinese Linguistics, Monograph 19, pp. 129-151, UC Berkeley Publisher, 2006.

[5] Yabin Zheng, Zhiyuan Liu, Maosong Sun, Liyun Ru, Yang Zhang. Incorporating user behaviors in new word detection. Proc. 21st International Joint Conference on Artificial Intelligence (IJCAI-09), Pasadena, USA, 2009, pp. 2101-2106.

[6] Zhiyuan Liu, Peng Li, Yabin Zheng, Maosong Sun. Clustering to find exemplar terms for keyphrase extraction. Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP-09), Singapore, 2009, pp. 257-266.

[7] Jingyang Li, Maosong Sun. Scalable term selection for text categorization. Proc. 2007 Empirical Methods in Natural Language Processing (EMNLP-07), Czech Republic, 2007, pp. 774-782.

[8] Jingyang Li, Maosong Sun, Xian Zhang. A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization. Proc. 44th Annual Meeting of the Association for Computational Linguistics and 21st International Conference on Computational Linguistics (44th ACL and 21st COLING), Sydney, Australia, 2006, pp. 17-21.

[9] Dejun Xue, Maosong Sun. Eliminating high-degree biased character bigrams for dimensionality reduction in Chinese text categorization. Proc. European Conference on Information Retrieval (ECIR-04), Sunderland, UK, 2004, pp. 197-208.

[10] Xiao Luo, Maosong Sun, Benjamin K. Tsou. Covering ambiguity resolution in Chinese word segmentation based on contextual information. Proc. 19th International Conference on Computational Linguistics (19th COLING), Taipei, China, 2002, pp. 598-604.

[11] Maosong Sun, Dayang Shen, Benjamin K. Tsou. Chinese word segmentation without using lexicon and hand-crafted training data. Proc. 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (36th ACL and 17th COLING), Montreal, Canada, 1998, pp. 1265-1271.

[12] Maosong Sun, Dayang Shen, Changning Huang. CSeg&Tag1.0: A practical word segmenter and POS tagger for Chinese texts. Proc. 5th Applied Natural Language Processing Conference (ANLP-97), Washington, USA, 1997, pp. 119-126.