
Education background
Bachelor of Computer, Tsinghua University, Beijing, China, 1970;
Master of Computer Science, University of Toronto, Toronto, Canada, 1983
Social service
Academic Committee of School of Information Science and Technology, Tsinghua University: Chair (2007-);
Department of Computer Technology and Applications, Qinghai University: Dean (2007-);
National Examination Center of Ministry of Education: Deputy Director of Steering Committee for Examinations (2006-);
DASFAA: Member of Steering Committee (2008-);
Beijing Computer Federation: President (2005-2010);
APWeb 2010: General Co-Chair (2010);
SIGMOD 2007: General Co-Chair (2007);
DASFAA 2005: Program Co-Chair (2005);
Department of Computer Science and Technology, Tsinghua University: Dean (1997-2003).
Areas of Research Interests/ Research Projects
Database Systems
Web Search, Web Information Extraction and Integration
National Natural Science Foundation of China: Key Technologies of the Infrastructure Supporting Research and Application of Chinese Web (2009-2012);
National Basic Research Program of China (The 973 Program): Verification and Management of Requirement Modeling (2007-2010);
Cooperative Research with Sohu Company: Bilingual Dictionaries from the Chinese Web (2008-2009);
Cooperative Research with Microsoft: Web Text Based Sentiment Analysis and Mining (2008);
National Natural Science Foundation of China (jointly supported by the EU 6th FP Program): ALVIS - A Superpeer Semantic Search Engine (2005-2007).
Research Status
My recent research interests are all Web-related, focusing on Web vertical search, Web information extraction, and knowledge discovery.
In Web vertical search, we treat the World Wide Web as a major information resource for diverse application domains, such as book sales and scientific publications. To collect the data for such a domain from the Web, we proposed a stepwise methodology that covers domain specification, crawling, extraction, and database building. The process is fully supported by a domain-independent system called SESQ, which provides a graph language, GQML, as a tool for the process tasks. The built-in query processor of SESQ supports queries on the extracted databases in the form of keywords, SQL-like statements, and a GUI. SESQ was a highlight of the ALVIS project, jointly supported by the National Natural Science Foundation of China and the EU 6th FP Program.
As the Chinese Web, that is, the Web consisting of all webpages from China, becomes more and more important to Chinese people and to the World Wide Web, I have turned much of my research attention to knowledge discovery on it.
Building a bilingual dictionary from bilingual HTML webpages has been a topic in natural language processing (NLP) for years. Existing work in NLP uses complicated machine learning and Chinese segmentation algorithms and relies on preexisting language resources such as corpora, and thus suffers from scalability problems. Based on hundreds of millions of Chinese-English HTML pages crawled by an industrial search engine, our group has developed a self-contained method for building a Chinese-English bilingual dictionary. The method uses a self-built corpus and an algorithm based on the I-Tree data structure to obtain term translations. By avoiding complicated machine learning and segmentation algorithms, the implemented method scales well and is portable to various languages. Comparative testing against a popular bilingual dictionary shows a performance advantage for our method. A patent application has been filed for the method.
Another interesting study, also based on many millions of HTML pages crawled by an industrial search engine, is an investigation of the Web databases on the Chinese Web. Given these pages as input, our group designed algorithms for webpage filtering, feature extraction, and classification to obtain an overview of the Chinese Deep Web. Our investigation, which is closer to the real status of the Chinese Web than traditional sampling methods, shows that there are currently over 600,000 web databases on the Chinese Web. Half of them fall into the category of e-business and web-based trading, and the rest are distributed over domains such as science and technology, education, and entertainment. Another noticeable fact is that half of the Chinese Web databases have only a single input box as their interface. These facts provide an initial basis for further research on the Chinese Deep Web.
Web community mining is another research topic that our group focuses on. We construct large-scale graphs with millions of nodes and hundreds of millions of edges from the Web by using various associations between entities, and study algorithms for mining communities from these graphs, with emphasis on accuracy and scalability. We propose an algorithm that exploits a dynamic process contrasting the network topology with the topology-based propinquity, where propinquity is a measure of the probability that a pair of nodes is involved in a coherent community structure. Through several rounds of mutual reinforcement between topology and propinquity, the community structures are expected to emerge naturally. To achieve better efficiency, the propinquity is calculated incrementally. We obtained interesting experimental results on several real network datasets. This work was published at the ACM SIGKDD '09 conference. Our other contributions in Web community mining can be found in the top international conference VLDB 2009 and the ACM journal TODS, June 2007.
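The mutual reinforcement between topology and propinquity can be illustrated with a minimal Python sketch. It is not the published implementation. It assumes a simple propinquity score (one for a direct edge, plus the number of common neighbours, plus the number of edges among those common neighbours) and two hypothetical thresholds, `cut` and `emerge`: edges whose propinquity falls to `cut` or below are removed, and missing edges whose propinquity reaches `emerge` are added. The sketch recomputes propinquity for all node pairs in every round, so it does not reproduce the incremental calculation or the parallelism mentioned above, and it is only suitable for small graphs.

```python
from itertools import combinations


def propinquity(adj, u, v):
    """Illustrative score (an assumption): direct edge + common neighbours + edges among them."""
    common = adj[u] & adj[v]
    direct = 1 if v in adj[u] else 0
    inner = sum(1 for a, b in combinations(common, 2) if b in adj[a])
    return direct + len(common) + inner


def connected_components(adj):
    """Return the connected components of an adjacency-set graph as sorted lists."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(comp))
    return components


def propinquity_dynamics(edges, cut=1, emerge=3, max_rounds=20):
    """Alternate between scoring pairs and updating the topology until nothing changes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    for _ in range(max_rounds):
        to_add, to_remove = [], []
        # Score every pair against the current topology (naive, quadratic in the node count).
        for u, v in combinations(nodes, 2):
            p = propinquity(adj, u, v)
            linked = v in adj[u]
            if linked and p <= cut:
                to_remove.append((u, v))
            elif not linked and p >= emerge:
                to_add.append((u, v))
        if not to_add and not to_remove:
            break  # topology and propinquity agree; communities have stabilised
        # Apply all updates at once so each round uses a consistent snapshot.
        for u, v in to_remove:
            adj[u].discard(v)
            adj[v].discard(u)
        for u, v in to_add:
            adj[u].add(v)
            adj[v].add(u)
    return connected_components(adj)


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (3, 4).
    demo_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
    print(propinquity_dynamics(demo_edges))  # [[1, 2, 3], [4, 5, 6]]
```

In this toy graph the bridge edge has no common neighbours, so its score stays at the cut threshold and it is removed in the first round, leaving the two triangles as separate communities.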
Honors And Awards
Excellent Teaching Achievement Award by Ministry of Education, Second Class: A Study on University Computer Science and Technology Program - International Trends and its Development and Promotion in China (2009);
Outstanding Education Practitioner Award by City of Beijing (2009);
National Award for Science and Technology Progress, Third Class: Trading System for Beijing Commodity Exchange (1997).
Academic Achievement
[1] Ju Fan, Hao Wu, Guoliang Li, Lizhu Zhou. Suggesting Topic-Based Query Terms as You Type. Proc. the 12th Asia-Pacific Web Conference (APWeb 2010), Busan, Korea, April 6-8, 2010, pp. 61-67
[2] Yuzhou Zhang, Jianyong Wang, Yi Wang, Lizhu Zhou. Parallel Community Detection on Large Networks with Propinquity Dynamics. Proc. the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (ACM SIGKDD'09), Paris, France, June 28 - July 1, 2009, pp. 997-1005
[3] Zhiping Zeng, Anthony K.H. Tung, Jianyong Wang, Jianhua Feng, Lizhu Zhou. Comparing Stars: On Approximating Graph Edit Distance. Proc. the 35th Int. Conf. on Very Large Data Bases (VLDB 2009), Lyon, France, Aug. 24-28, 2009, pp. 25-36
[4] Hang Guo, Lizhu Zhou, Ling Feng. Self-Switching Classification Framework for Titled Documents. Journal of Computer Science and Technology 24(4), pp. 615-625, July 2009
[5] Ling Lin, Lizhu Zhou, Web Database Schema Identification through Simple Query Interface. VLDB 2009 Workshop (RED 2009)
[6] Ling Lin, Yukai He, Hang Guo, Ju Fan, Lizhu Zhou, Qi Guo, Gang Li. SESQ: A Model-Driven Method for Building Object Level Vertical Search Engines (Demo). In: Proceedings of the 27th International Conference on Conceptual Modeling, Barcelona, Spain, 2008, pp. 516-517
[7] Zhiping Zeng, Jianyong Wang, Lizhu Zhou, George Karypis. Out-of-Core Coherent Closed Quasi-Clique Mining from Large Dense Graph Databases, ACM TODS, June 2007
[8] Hang Guo, Jun Zhang, Lizhu Zhou. Classifying and Ranking: The First Step towards Mining Inside Vertical Search Engines. Proceedings of International Conference on Database and Expert System Applications (DEXA), 2007, Germany, pp. 223-232
[9] Ling Lin, Lizhu Zhou. Leveraging Webpage Classification for Data Object Recognition. The 2007 IEEE WIC/ACM International Conference on Web Intelligence (WI 07), USA, pp. 667-670
[10] Hang Guo, Lizhu Zhou, Segmented Document Classification: Problem and Solution, Proceedings of 18th International Conference on Database and Expert Systems Applications (DEXA 2006), pp. 171-181