The Research on Several Key Techniques in Text Information Processing

Original title: 文本信息处理的若干关键技术研究
Author: Xiong Yunbo (熊云波)
Thesis level: Doctoral
Discipline: Computer Software and Theory
Keywords: text information retrieval model; text categorization; text clustering; query processing; confusion matrix; genre categorization; wavelet transformation
Degree year: 2006
Supervisor: Hu Yunfa (胡运发)
Discipline code: 081202
Degree-granting institution: Fudan University
Submission date: 2006-09-30

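Abstract

With the arrival of the information age and the growing ubiquity of the Internet, the volume of text information is expanding rapidly. The Internet holds billions of web pages and many thousands of terabytes of data. Moreover, hundreds of thousands of pages are updated and millions of new pages are added every day, which makes the information on the Internet both rich and complex. How to organize and manage this information effectively, and how to find what users need quickly, accurately, and comprehensively, is a major challenge for information science today.
     Text is the most basic and most common carrier of information. Taking the text information retrieval model as its foundation, this thesis studies several key techniques of text information processing, including text categorization, text clustering, and approximate query processing. Text categorization and text clustering are the core techniques for organizing and managing data. Approximate query processing aims to retrieve the required information quickly and is an important technique for handling large-scale data sets.
     The main research content of this thesis is as follows: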
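     (1) Technical foundations of text information processing. These include document representation models, word segmentation, feature selection, text categorization, and text clustering. The thesis briefly introduces four document representation models (the set, algebraic, probabilistic, and concept models); analyzes the main problems and methods of Chinese word segmentation; describes document features and feature selection algorithms; and presents text categorization and text clustering in detail, with an emphasis on summarizing several important algorithms for each.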
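     (2) Hierarchy construction based on the confusion matrix. In the information age, the sheer volume and complexity of documents make hierarchical categorization necessary. Starting from the confusion matrix, which describes the errors of a flat classifier, the thesis proposes two methods for constructing a hierarchy: a hierarchical clustering method and a confusion class method. The hierarchical clustering method uses an agglomerative strategy: each sample initially forms its own class, and classes are merged step by step according to their similarity or distance until a single large class remains. The confusion class method forms confusion classes from those classes whose probability of being confused with one another exceeds a threshold t, and builds the hierarchy from them. Detailed algorithms are given for both methods, and experiments compare the two. The results show that the confusion class method outperforms the hierarchical clustering method, mainly because hierarchical clustering assumes that the confusion relation among the child classes of a parent is symmetric, whereas real text does not exhibit this symmetry.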
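     (3) Document genre classification. A document's genre describes its style rather than its content. Genre and topic are orthogonal: documents on the same topic can belong to different genres, and documents of the same genre can cover different topics. Genre classification plays an increasingly important role in information retrieval, information filtering, the interception of reactionary content, and online public opinion surveys. To classify documents as positive or negative, the thesis proposes a genre classification method based on the sentiment of features (sentiment classification). Sentiment classification does not differ in essence from topic-based classification in its method; positive/negative sentiment classification of documents can be regarded as an ordinary binary classification problem. The selection of sentiment features and the determination of their sentiment orientation are therefore especially important. Accordingly, the thesis focuses on the selection of sentiment feature words, the determination of sentiment orientation, and the computation of sentiment orientation weights, and studies several typical methods. Finally, with support from the National Natural Science Foundation of China (grant 60173027), a prototype sentiment system was developed, and the sentiment classification method was compared with traditional text categorization and with a method based on semantic patterns. The results show that the sentiment classification method performs worst, the semantic-pattern method best, and traditional text categorization in between. However, sentiment classification requires neither manually labeled training samples nor a separate classifier for each topic, so it is more general and also classifies much faster.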
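     (4) Approximate query processing based on the wavelet transform. A traditional decision support system (DSS) is expected to return an exact result for each submitted query, which takes a long response time; this is a typical "black box" model. In today's DSS, online analytical processing (OLAP), online aggregation, and similar applications, however, an exact result is often unnecessary, while fast response is essential. Approximate query processing arose to meet this need. Wavelets have proven highly efficient for hierarchical decomposition (compression). By compressing gigabytes or terabytes of data down to megabytes, the wavelet transform meets the response-time requirements of approximate querying. Using this compression mechanism, and building on the selection (Select), projection (Project), and join (Join) algorithms proposed by earlier researchers, the thesis proposes algorithms for the union (Union), difference (Difference), and update (Update) operations. All of these operate at the level of the wavelet synopsis, which is a compressed representation of the source data. Experiments are reported at the end. They show that, for the union and difference operations, the wavelet-based method outperforms random sampling, and that, when the amount of updated data is not large, applying the update algorithm directly to the wavelet coefficients performs almost as well as recomputing the optimal wavelet coefficients.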
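
The confusion class idea in item (2) can be illustrated with a minimal Python sketch. It assumes a confusion matrix of raw counts whose rows are true classes and whose columns are predicted classes, estimates P(predicted = j | true = i) by row normalization, and groups with class i every class j whose estimate exceeds the threshold t. The function name `confusion_classes`, the row normalization, and the toy matrix are illustrative assumptions; the abstract does not spell out the thesis's actual algorithm.

```python
def confusion_classes(counts, t):
    """Group each true class with the classes it is confused with.

    counts[i][j] is the number of samples of true class i that a flat
    classifier labeled as class j (assumed layout). Class j joins the
    group of class i when the estimated P(predicted=j | true=i) > t.
    """
    groups = {}
    for i, row in enumerate(counts):
        total = sum(row)
        members = [i]  # a class always belongs to its own group
        for j, n in enumerate(row):
            if j != i and total > 0 and n / total > t:
                members.append(j)
        groups[i] = members
    return groups


# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
counts = [
    [80, 15, 5],
    [2, 90, 8],
    [1, 30, 69],
]
groups = confusion_classes(counts, t=0.10)
```

In this toy matrix class 0 is often mistaken for class 1, but class 1 is rarely mistaken for class 0, so the resulting groups are not symmetric. This is the kind of asymmetry that, according to the abstract, a hierarchy built by hierarchical clustering does not capture.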
With the arrival of the information era and the steady spread of the Internet, text information is expanding rapidly. The Internet holds billions of web pages and thousands upon thousands of terabytes of data, and millions of pages are updated on it every day. This makes the available information abundant but tangled. Organizing and managing this information efficiently, and retrieving what users need quickly, completely, and accurately, is a major challenge.
    Text is a basic and common carrier of information. Based on the text information retrieval model, this thesis investigates key techniques of text information processing, including text categorization, text clustering, and approximate query processing. Text categorization and text clustering are two core techniques for organizing and managing text data. Approximate query processing is used to retrieve the needed information quickly and is an important technique for handling large-scale data sets.
    The main investigations of the thesis are as follows:
    (1) Technical Basis of Text Information Processing. This covers document models, word segmentation, feature selection, text categorization, and text clustering. The thesis briefly introduces the set, algebraic, probabilistic, and concept models; analyzes the main problems and methods of Chinese word segmentation; describes document features and feature selection concretely; and describes text categorization and text clustering in detail, summarizing some important and typical algorithms for each.
    (2) Construction of Hierarchical Structures Based on the Confusion Matrix. In the information era, the scale and complexity of document collections make it necessary to categorize documents hierarchically. The thesis presents two strategies for constructing a hierarchy from the confusion matrix, which records the error statistics of a flat classifier: hierarchical clustering and confusion classification. Hierarchical clustering uses an agglomerative algorithm: every sample starts as its own class, and classes are then merged pairwise according to their similarity or distance until only one large class remains. Confusion classification builds the hierarchy according to whether the confusion probability between classes exceeds a threshold t. Detailed algorithms are given for both techniques. Finally, experiments compare the performance of the two techniques for hierarchical categorization. The results show that confusion classification outperforms hierarchical clustering and that it can improve the precision and recall of a flat document classifier.
    (3) Document Genre Classification Based on Feature Sentiment. Document genre describes the style of a document rather than its concrete content. Genre is orthogonal to topic: documents on the same topic can differ in writing style, and documents of the same genre can describe different topics. Document genre classification is becoming more and more important in information retrieval, information filtering, the interception of reactionary content, and surveys of public opinion on the Internet. To classify documents as positive or negative, the thesis presents a technique called sentiment categorization, which is based on the sentiment of document features. Sentiment categorization does not differ in essence from topic-based categorization and can be regarded as an ordinary two-class document categorization problem. It is therefore vital to select sentiment features and to determine their sentiment orientation. The thesis mainly investigates the selection of sentiment features, the determination of feature sentiment orientation, and the computation of feature sentiment weights, and examines several typical methods. Finally, a prototype system is developed and compared with traditional text categorization and with categorization based on semantic patterns. The experimental results show that sentiment categorization is inferior to both and that categorization based on semantic patterns performs best. However, sentiment categorization needs neither labeled training samples nor a separate classifier for each topic, so it is more general, and it classifies much faster than the other two methods.
    (4) Approximate Query Processing Based on the Wavelet Transform. A conventional decision support system (DSS) returns an exact answer to each query submitted by the user, and executing the query takes a long time. This is a typical black-box pattern. Today's DSS applications, online analytical processing (OLAP), and online aggregation, however, often do not need an exact result but demand fast response. Approximate query processing is a solution to this problem. Wavelets have proven highly efficient for hierarchical decomposition, and the wavelet transform can compress gigabyte- or terabyte-scale data to the megabyte scale. Using this compression mechanism and building on previous work, the thesis presents algorithms for the Union, Difference, and Update operations. These operations are carried out at the level of the wavelet synopsis, which is a compressed representation of the original data. Finally, experiments show that wavelets give better accuracy than random sampling for the union and difference operations, and that, when the amount of updated data is not too large, directly updating the wavelet coefficients is almost as good as using the optimally selected wavelet synopsis.
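
To make the notion of a wavelet synopsis in item (4) concrete, here is a minimal Python sketch, not the thesis's algorithm. It assumes a one-dimensional frequency vector whose length is a power of two, uses the unnormalized Haar transform (pairwise averages and half-differences), and keeps only the largest coefficients as the synopsis. The union shown is a multiset union of two frequency vectors, which reduces to adding coefficients because the Haar transform is linear. All function names (`haar_decompose`, `haar_reconstruct`, `synopsis`, `expand`, `union_synopses`) are hypothetical.

```python
def haar_decompose(values):
    """Unnormalized Haar transform of a power-of-two-length sequence.

    Returns [overall average, detail coefficients from coarse to fine].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser levels go in front
        current = averages
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        expanded = []
        for avg, d in zip(current, diffs):
            expanded.extend([avg + d, avg - d])
        pos += len(current)
        current = expanded
    return current


def synopsis(coeffs, budget):
    """Keep the `budget` largest-magnitude coefficients as {index: value}.

    Ranking by raw magnitude is a simplification made here; synopsis
    methods usually rank normalized coefficients.
    """
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return {i: coeffs[i] for i in ranked[:budget] if coeffs[i] != 0.0}


def expand(syn, n):
    """Turn a sparse synopsis back into a dense coefficient list."""
    return [syn.get(i, 0.0) for i in range(n)]


def union_synopses(a, b):
    """Multiset union of two frequency vectors, done on coefficients only."""
    merged = dict(a)
    for i, v in b.items():
        merged[i] = merged.get(i, 0.0) + v
    return merged


original = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(original)
top2 = synopsis(coeffs, 2)                        # 8 values compressed to 2
approx = haar_reconstruct(expand(top2, len(original)))

left = synopsis(haar_decompose([1, 0, 2, 1]), 4)
right = synopsis(haar_decompose([0, 3, 0, 1]), 4)
merged = haar_reconstruct(expand(union_synopses(left, right), 4))
```

With a budget of two coefficients, `approx` is a piecewise-constant approximation of `original`. With the full budget, `merged` equals the element-wise sum of the two input vectors, which illustrates how an operation can be answered at the synopsis level without touching the source data.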

引文

[Acharya99] S.Acharya, P.B.Gibbons, V.Poosala. Aqua: A Fast Decision Support System Using Approximate Query Answers In: Proceedings of the 25th VLDB Conference,Edinburgh, Scotland, 1999: 754—757.
    [Acharya99+] Acharya S., Gibbons P.B., PoosalaV., Ramaswamy S. Join Synopses for Approximate Query Answering. In: Proc. 1999 ACM SIGMOD International Conference on Management of Data, 1999: 275-286, Philadelphia, Pa.
    [Aggarwal99] C.C.Aggarwal, S.C.Gates and P.S.Yu. On the merits of building categorization systems by supervised clustering. Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999:352-356.
    [Alexander00] Alexander Strehl and J. K. Aggarwal. A new Bayesian relaxation framework for the estimation and segmentation of multiple motions. In Proc. IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2000), Austin, IEEE, April 2000:21-25.
    [AIexander02] Alexander H. Mining for High-Dimensional Clusters using Projections and Visualizations.http://citeseer.nj.nec.com/413852.html http://citeseer.ist.psu.edu/397364.html
    [Anick90] P.Anick, J.Brennan, R. Flynn, D. Hanssen, B. Alvey, and J. Robbins. A direct manipulation interface for Boolean information retrieval via natural language query. In Proc. of the 13th Annual International ACM/SIGIR Conference,1990:135-150, Brussels, Belgium.
    [Ankerst99] Ankerst M., Breunig M. M., Kriegel H-P., Sander J. OPTICS: Ordering Points To Identify the Clustering Structure. In: Alex Delis, Christos Faloutsos, Shahram Ghandeharizadeh (Eds.), Proceedings ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania, USA: ACM Press,1999:49-60.
    [AnnaOl] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. Wavelets on Streams: One-pass Summaries for Approximate Aggregate Queries. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 2001.
    [Attardi98] G.Attardi, S.D.Marco and D.Salvi. Categorization by context. Journal Universal Computer Science 1998.4, 9, pages 719-736.
    [Barbara97] Barbara D., DuMouchelW., Faloutsos C, Haas P.J., Hellerstein J.M., IoannidisY., Jagadish H.V., Johnson T., Ng R., PoosalaV.,Ross K.A., Sevcik K.C. The New Jersey Data Reduction Report. IEEE Data Engineering Bulletin, 1997, 20(4):3-45.
    [Beineke04] P. Beineke, T. Hastie and S. Vaithyanathan. The sentimental factor: improving review classification via human-provided information. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004.
    [Ber99] A.Berger. Error-correcting output coding for text classification. In International Joint Conference on Artificial Intelligence: Workshop on Machine Learning for Information Filtering, 1999.
    [Berry99] BERRY, M.W. and BROWNE, M. 1999. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM.
    [Berthier96] Berthier A. Robeiro-Neto and Richard Muntz. A belief network model for IR. In Proc. of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 1996: 253-260.
    [Bjorn94] Bjorn Jawerth and Wim Sweldens. An Overview of Wavelet Based Multi-resolution Analyses. SIAM Review, 1994, 36(3):377-412.
    [Boley98] BOLEY, D.L. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 1998, 2(4): 325-344.
    [Bookstein85] A.Bookstein. Implication of Boolean structure for probabilistic retrieval. In Proc. of the 8th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Montreal, Canada, 1985:11-17.
    [Bradley98] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Knowledge Discovery and Data Mining, 1998:9-15.
    [Branko01] Branko Kavsek, Nada Lavrac, and Anuska Ferligoj. Consensus decision trees: Using consensus hierarchical clustering for data relabelling and reduction. In Proceedings of ECML 2001, volume 2167 of LNAI, Springer, 2001:251-262.
    [Breiman84] L.Breiman, J.Friedman, R.Olshen, and C. Stone. Classification and Regression Trees. Monterey, CA: Wadsworth International Group, 1984.
    [Brian92] Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Latent semantic indexing is an optimal special case of multidimensional scaling. In Proc. of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhangen, Denmark, 1992: 161-167.
    [Carlos99] Carlos Ordonez, Edward Omiecinski, FREM: Fast and Robust EM Clustering for Large Data Sets [Ph.D. Thesis]. Georgia Institute of Technology Atlanta, GA 30332, USA, 1999.
    [Cavnar94] W. Cavnar and J. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR-94, 1994.
    [Chakra97] S.Chakrabarti, B.E.Dom, R.Agrawal and P.Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proc. 23rd International Conference on Very Large Data Bases (VLDB'97), pages 446-455, Athens, GR, 1997.
    [Charu00] Charu C.A. and Philip S. Yu. Finding generalized projected clusters in high dimensional spaces. Sigmod Record, 2000, 29(2):70-92.
    [Cheeseman96] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R.Uthurusamy, editors, Advanced in Knowledge Discovery and Data Mining, Cambridge, MA: AAAI/MIT Press, 1996:153-180.
    [Chen02] CHEN Ning 1, CHEN An, ZHOU Long-xiang. An Incremental Grid Density-Based Clustering Algorithm. Journal of Software, 2002, 13(1): 1-7.
    [Church89] CHURCH, K.W., AND HANKS, P. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Conference of the Association of Computational Linguistics. Association for Computational Linguistics, New Brunswick, NJ, 1989: 76-83.
    [Cutting92] D.R.Cutting, D.R.Karger, J.O.Pedersen and J.W.Tukey. Scatter/Gather: A Cluster- based Approach to Browsing Large Document Collections. SIGIR'92,1992:318-329.
    [Dave03] K. Dave, S. Lawrence, and D. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the 22th International World Wide Web Conference, Budapest, Hungary, 2003.
    [David93] David Haines and W. Bruce Croft. Relevance feedback and inference networks. In Proc. of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, USA, 1993: 2-11.
    [Dong99] G.Dong. X.Zhang, L.Wong and J.Li. CAEP: Classification by aggregating emerging patterns. In Proceedings of the Second International Conference on Discovery Science, Tokyo, Japan, pages 1999:30-42.
    [Dunham05] Margaret H.Dunham著。郭崇慧,田凤占等译。数据挖掘教程。清华大学出版社. 2005:107-138.
    [Eric96] Eric J. Stollnitz, Tony D. DeRose, and David H. Salesin. Wavelets for Computer Graphics: Theory and Applications. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
    [Ester97] Ester M., et al. Density-Connected Sets and their Application for Trend Detectionin Spatial Databases. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining,AAAI Press, 1997.
    [Ester98] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Michael Wimmer, Xiaowei Xu. Incremental Clustering for Mining in a Data Warehousing Environment. Proceedings of the 24th VLDB Conference New York (VLDB 1998), USA, 1998:323-333.
    [Escudero00] G.Escudero, L.Marquez and G.Rigau. Boosting applied to word sense disambiguation. In Proceedings of ECML-00, 11th European Conference on Machine Learning. Barcelona, Spain, 2000:129-141.
    [Eyheramendy03] S. Eyheramendy, D. D. Lewis and and D. Madigan. On the naive bayes model for text categorization. Artificial Intelligence & Statistics 2003.
    [Favata91] Favata. F. & R. Walker. A study of the application of Kohonen-type neural networks to the travelling Salesman Problem. Biological Cybernetics, 1991, 64:463-468.
    [Fellbaum95] Fellbaum C. Cooccurrence and antonymy. International Journal of Lexicography,1995,8(4):281-303.
    [Fellbaum95+] Fellbaum C, Miller.GA, Curtiss.S, et al. An auditory processing deficit as a possible source of SLI. Proceedings of the 19th Boston University Conference on Language Development. Ithaca, NY. Cascadilla Press, 1995:204-215.
    [Finn03] A. Finn, N. Kushmerick. Learning to Classify Documents according to Genre. In Processings of UCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis, 2003.
    [Fisher87] D. Fisher. Improving inference through conceptual clutering. In Proc. 1987 AAAI Conf., Seattle, WA, July, 1987: 461-465.
    [Fredic94] Fredic C. Gey. Inferring probability of relevance using the model of logistic regression. In Proc. of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994: 222-231.
    [Frommholz01] I. Frommholz. Categorizing web documents in hierarchical catalogues. In Proceedings of the 23rd European Colloquium on Information Retrieval Research (ECIR01). Darmstand, DE, 2001.
    [Fuhr89] N. Fuhr. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 1989, 7(3): 183- 204.
    [Furnkranz99] J.Furnkranz. Exploiting structural information for text classification on the WWW. In Proceedings of IDA-99, 3rd Symposium on Intelligent Data Analysis. Amsterdam, The Netherlands, 1999: 487-497.
    [Gale93] W.A.Gale, K.W.Church and D.Yarowsky. A method for disambiguating wordsenses in a large corpus. Comput. Human. 1993 26(5):415-439.
    [Gelman95] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, London, 1995.
    [Gennari89] J. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation.Artificial Intelligence, 1989, 40: 11-61.
    [George99] George Karypis, Eui-Hong (Sam), Han Vipin Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Technical Report, Department of Computer Science and Engineering, University of Minnesota,USA, 1999:99-007.
    [Gha00] R. Ghani. Using error-correcting codes for text classification. In Proceedings of 17th International Conference on Machine Learning, 2000.
    [Gibbons98] Gibbons P.B., MatiasY. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. In: Proc. 1998 ACM SIGMOD International Conference on Management of Data, 1998: 331-342, Seattle, Wash.
    [Godbole02] Godbole.S.Exploiting confusion matrices for automatic generation of topic hierarchies and scaling up multi-way classifiers.Technical Report,Indian Institute of Technology,Bombay, 2002,Available online at http://citeseer.nj.nec.com/godbole02exploiting.html
    [Godbole02+] Godbole,Sarawagi.S,Chakrabarti.S.Scaling multi-class support vector machines using inter-class confusion.In:Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining.ACM Press,NewYork,NY,USA,2002,513-518.
    [Golub96] GOLUB, G.H., AND VAN LOAN, C.F. Matrix Computations. Third edition. Johns Hopkins University Press, Baltimore, MD. 1996.
    [Griffiths84] A. Griffiths, L. A. Robinson and P. Willett. Hierarchic agglomerative clustering methods for automatic document classification. Journal of Document, 1984, 40: 175-205.
    [Griffiths84+] A. Griffiths, H. C. Luckhurst and P. Willett. Using inter-document similarity information in document retrieval systems. Journal of the American Society for Information Science, 1984, 37: 3-11.
    [Gross90] Gross D ,Miller K. Adjectives in WordNet. International Journal of Lexicography, 1990, 3(4):265-277.
    [Gunjan99] Gunjan K. Gupta, Alexander Strehl, and Joydeep Ghosh. Distance based clustering of association rules. In Proc. ANNIE 1999, St. Louis, volume 9, ASME, November 1999:759-764.
    [Haas99] Haas P.J., Hellerstein J.M. Ripple Joins for Online Aggregation. In: Proc. 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia,Pa, 1999:287-298.
    [Han97] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar, and B. Mobasher. Clustering in a highdimensional space using hypergraph models. Technical Report, University of Minnesota, Department of Computer Science, 1997:97-019
    [Han04] Han Hua, Wang Xueling, Peng Silong. Image Restoration Based on Wavelet-Domain Local Gaussian Model. Journal of Software, 2004,15 (3):443-450
    [Harman92] D. Harman. Relevance feedback and other query modification techniques. In W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data structure & Algorithms. Prentice Hall, Englewood Cliffs, NJ, USA, 1992: 241-263.
    [HATZ97] HATZIVASSILOGLOU.V, AND MCKEOWN, K.R. Predicting the semantic orientation of adjectives. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8~(th) Conference of the European Chapter of the ACL. Association for Computational Linguistics, New Brunswick, NJ, 1997:174-181.
    [Hayes91] P. Hayes and S. Weinstein. Construe/tis: a system for content-based indexing of a database of news stories. In Proceedings of Annual Conference on Innovative Applications of AI, 1991.
    [Hellerstein97] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. In Proc. of the 1997 ACM SIGMOD Intl. Conf. on Management of Data, 1997: 171-182..
    [Hinneburg98] Hinneburg A., Keim D. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In: Rakesh Agrawal, Paul E. Stolorz, Gregory Piatetsky-Shapiro (Eds.), Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, New York, USA: AAAI Press, 1998:58-65.
    [Hsu02] C. Hsu, C. Lin. A comparison on methods for multi-class support vector machines, IEEE Transactions on Neural Networks. 2002, 13: 415～425.
    [Huang00] X. Huang and S. Roberton, A Probabilistic Approach to Chinese Information Retrieval: Theory and Experiments.In Proceedings of the 22nd Annual BCSIRSG Colloquium on Information Retrieval Research, Cambridge, England, April 2000:178-193.
    [Ioannidis99] Ioannidis Y.E., Poosala V. Histogram-Based Approximation of Set-Valued Query Answers. In: Proc. 25th International Conference onVery Large Data Bases, Edinburgh, Scotland, September 1999.
    [Iwayama95] M. wayama and T.Tokunaga. Cluster-based text categorization: a comparison of category search strategies. In Proceedings of the 18~(th) Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), 1995:273-281.
    [Jagadish03] H.V. Jagadish, Laks V. S. Lakshmanany and Divesh Srivastava. Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. In Proceedings ACM SIGMOD International Conference on Management of Data, Philadephia, Pennsylvania,USA. ACM Press, 1999:37-48.

    [Jain88] A.K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

    [Jay02] Jay Magidson and Jeroen K. Latent class models for clustering: A comparison with K-means. Canadian Journal of Marketing Research, 2002, Volume 20:37-44.
    [Jeffrey99] Jeffrey Scott Vitter and Min Wang. Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data,Philadelphia, Pennsylvania, May 1999.
    [Jin05] Jin Cheqing, Qian Weining, Zhou Aoying. Analysis and Management of Streaming Data: A Survey. Journal of Software, 2004,15 (8):1172-1181.
    [Joa98] T. Joachims. Text categorization with support vector machines: Learning With Many Relevant Features. In Proceedings of 10th European Conference on Machine Learning, 1998:137-142.
    [Johnson99] E. Johnson and H. Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In M. Zaki and C. Ho, editors, Large-Scale Parallel KDD Systems, volume 1759 of Lecture Notes in Computer Science, Springer-Verlag, 1999:221-244.
    [Joon94] Joon Ho Lee. Properties of extended Boolean models in information retrieval. In Proc. of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994: 182-190.
    [Kas80] G.V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 1980, 29:119-127.
    [Kaushik01] Kaushik Chakrabarti, Minos Garofalakis, Rajeev Rastogi, and Kyuseok Shim. Approximate Query Processing Using Wavelets. The VLDB Journal, 2001,10(3):199-223.
    [Kenji94] Kenji Ono, Kazuo Sumita, and Seiji Miike, Abstract generation based on rhetorical structure extraction. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), Kyoto, Japan, Association for Computational Linguistics ,August, 1994, 1:344-348.
    [Kohonen82] Kohonen. T. SelfOrganized Formation of Topologically Correct Feature Maps. Biological Cybernetics, 1982, 43:59-69.
    [Koller97] D.Koller and M.Sahami. Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning (ML), Nashville, Tennessee, July 1997: 170-178.
    [Kumar99] S. Kumar and J. Ghosh. GAMLS: A generalized framework for associative modular learning systems. In Proceedings of the Applications and Science of Computational Intelligence II, Orlando, Florida, 1999: 24-34.
    [Lai02] Yushen Lai, Chunghsien Wu. Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology, ACM Transactions on Asian Language Information Processing, Vol. 1, No. 1, March 2002: 34-64.
    [Lee91] J.J Lee and P. Kantor. A study of probabilistic information retrieval systems in the case of inconsistent expert judgements. Journal of the American Society for Information Science, 1991, 42(3): 166-172.
    [Lee93] J.H.Lee, W. Y. Kim, and Y.H. Lee. Ranking documents in the thesaurus -based Boolean retrieval systems. Information Processing & Management, 1993, 30(1): 79-91.
    [Lee99] Lee J.H., Kim D.H., Chung C.W. Multi-dimensional Selectivity Estimation Using Compressed Histogram Information. In: Proc. 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, Pa., 1999: 205-214.
    [Lewis94] D.Lewis and W.Gale. A Comparison of Two Learning Algorithms Categorization. In Proceedings of Symposium on Document Analysis and Information (SDAIR'94), 1994.
    [Lewis94+] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In SIGIR94, 1994:3-12.
    [Lewis96] D. Lewis, R. E. Schapire, J. P. Callan and R. Papka. Training algorithms for linear text classifiers. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996: 298-306.
    [Lewis98] D.D.Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Machine Learning: ECML-98, the 10th European Conference on Machine Learning, 1998: 4-15.
    [Li99] Li Z D, Fei X L, Wang H Z. A Concept Based Information Retrieval Modal. Proceedings of the International Symposium on Future Software Technology(ISFST299). 1999:296-300.
    [Li00a] J.Li, G.Dong and K. Ramamohanarao. Making use of the most expressive jumping emerging patterns for classification. In Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, Japan, pages 220-232, 2000.
    [Li00b] J.Li, G.Dong and K. Ramamohanarao. DeEPs: Instance-based classification by emerging patterns. Technical Report, Dept of CSSE, University of Melbourne,2000.
    [Li00c] J.Li, K.Ramamohanarao and G. Dong. The space of jumping emerging patterns and its incremental maintenance algorithms. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA June, 2000.
    [Li01] W.Li, J.Han and J.pei. CMAR: Accurate and efficient classification based on multiple classification rules. In IEEE International Conference on Data Mining (ICDM'01) San Jose, California, November 29-December 2001.
    [Lim90] Y.W. Lim and S.U. Lee, On the color image segmentation algorithm based on the thresholding and the fuzzy c-means techniques, Pattern Recognition, vol.23, 1990:935-952.
    [Lipton90j Lipton R.J., Naughton J.F., Schneider D.A. Practical Selectivity Estimation through Adaptive Sampling. In: Proc. 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, N.J., 1990: 1-12.
    [Liu98] B.Liu, W.Hsu and Y.Ma. Integrating classication and association rule mining. In Proceedings 4th International Conference on Knowledge Discovery and Data Mining (KDD'98), New York, 1998: 80-86.
    [Liu02] Xin Liu, Yihong Gong, Wei Xu, and Shenghuo Zhu. Document Clustering with Cluster Refinement and Model Selection Capabilities. SIGIR'02, Tampere, Finland, August 11-15, 2002:191-198.
    [Loh88] W.Y. Loh and N. Vanichsetakul. Tree-structured classification via generalized discriminant analysis. Journals of the American Statistical Association, 1988, 83: 715-728.
    [Loh97] W.Y.Loh and Y.S. Shih. Split selection methods for classification trees. Statistica Sinica, 1997, 7: 815-840.
    [Losee88] R.M. Losee and A. Bookstein. Integrating Boolean queries in conjunctive normal form with probabilistic retrieval models. Information Processing & Management, 1988, 14(3):315-321.
    [Mag94] J. Magidson. The CHAID approach to segmentation modeling: CHI-squared automatic interaction detection. In R.P. Bagozzi, editor, Advanced Methods of Marketing Research, Cambridge, MA: Blackwell Business,1994:118-159.
    [McCallum98] McCallum.A,Rosenfeld.R,Mitchell.T,Ng.A. Improving text classification by shrinkage in a hierarchy of classes.In:Proceedings of the 15th International Conference on Machine Learning (ICML98).Morgan Kaufmann Publishers Inc, San Francisco,CA,USA,1998:359-367.
    [Michael00] Michael Steinbach, George Karypis, & Vipin Kumar, A Comparison of Document Clustering Techniques , Department of Computer Science and Egineering, University of Minnesota, Technical Report, 2000: 00-034.

    [Miller90] Miller G A. An online lexical database. International Journal of Lexicography, 1990,3 (4):235-244.
    [Miller95] Miller.GA. WordNet:A Lexical Database for English. Comm ACM ,1995 :39～41.
    [Mur98] S.K Murthy. Automatic construction of decision tree from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 1998, 2:345-389.
    [Natsev99] Apostol Natsev, Rajeev Rastogi, and Kyuseok Shim. WALRUS: A Similarity Retrieval Algorithm for Image Databases. In: Proceedings of the 1999 ACM SIGMOD , Philadelphia, Pennsylvania, 1999.
    [Nigam98] k. Nigam, S.Thrun and T.Michell. Learning to Classify Text Labeled and Unlabeled Documents. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI 98), 1998:792-799.
    [Nigam99] K. Nigam, J. Lafferty and A. McCallum. Using maximum entropy for text classification. In Proceedings of the IJCAI-99 Workshop on Information Filtering, Stockholm, Sweden, 1999.
    [Ogawa91] Y. Ogawa, T. Morita, and K. Kobayashi. A fuzzy document retrieval system using the keyword connection matrix and a learning method. Fuzzy Sets and Systems, 1991,39:163-179.
    [Oh00] H.-J.Oh, S.H.Myaeng and M.-H.Lee. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, Athens, Greece, 2000:264-271.
    [Ordonez00] C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In ACM SIGMOD Conference, 2000.
    [Osgood57] Charles E. Osgood, George J. Succi, and Percy H.Tannenbaum. 1957. The Measurement of Meaning.University of Illinois Press, Urbana IL.
    [Pang02] B. Pang, L. Lee, and S Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.
    [Peng01] Fuchun Peng and Dale Schuurmans. Self-Supervised Chinese Word Segmentation. The 4th International Symposium on Intelligent Data Analysis (IDA2001), Lisbon, Portugal, September, 2001:238-247.
    [Peng02] Fuchun Peng, et al., Using self-supervised word segmentation in Chinese information retrieval, SIGIR'02, August 11-15, 2002, Tampere, Finland, ACM 1-58113-561-0/02/0008: 345-350.
    [Peng03] F. Peng and D. Schuurmans. Combining naive bayes and n-gram language models for text classification. Proceedings of The 25th European Conference on Information Retrieval Research (ECIR03). April 14-16, 2003, Pisa, Italy.
    [Platt00] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems, 2000, 12: 547-553.
    [Poosala99] Poosala V., Ganti V. Fast Approximate Answers to Aggregate Queries on a Data Cube. In: Proc. Eleventh International Conference on Scientific and Statistical Database Management, Cleveland, Ohio, July 1999.
    [Qui86] J. R. Quinlan. Induction of decision trees. Machine Learning, 1986, 1:81-106.
    [Qui93] J. R. Quinlan. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
    [Raghavan86] V.V. Raghavan and S.K.M. Wong. A critical analysis of vector space models for information retrieval. Journal of the American Society for Information Science, 1986, 37(5):279-287.
    [Rakesh99] Rakesh A., Johannes G., Dimitrios G., Prabhakar R. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In: Richard T. Snodgrass, Marianne Winslett (Eds.), Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota: ACM Press, 1994:94-105.
    [Rastogi98] R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. In Proceedings of 24th International Conference on Very Large Data Bases (VLDB98), New York, 1998.
    [Ricardo99] B.Y.Ricardo and R.N.Berthier. Modern Information Retrieval. ACM Press, New York, 1999.
    [Ritter89] Ritter, H. J. & T. Kohonen. Self-Organizing Semantic Maps. Biological Cybernetics, 1989, 61:241-254.
    [Rumelhart85] D.E. Rumelhart and D. Zipser. Feature discovery by competitive learning. Cognitive Science, 1985, 9:75-112.
    [Salton83] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York. 1983.
    [Salton88] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988, 24(5): 513-523.
    [Sander98] SANDER, J., ESTER, M., KRIEGEL, H.-P., and XU, X. Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. In Data Mining and Knowledge Discovery, 1998, 2(2): 169-194.
    [Savaresi02] SAVARESI, S.M., BOLEY, D.L., BITTANTI, S., and GAZZANIGA, G. Cluster Selection in divisive clustering algorithms. In Proceedings of the 2nd SIAM ICDM, Arlington, VA, 2002:299-314.
    [Schlimmer86] J.C. Schlimmer and D. Fisher. A case study of incremental concept induction. In Proceedings of the 5th International Conference on Artificial Intelligence (AAAI86), San Mateo: Morgan Kaufmann, 1986.
    [Sebastiani02] F.Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 2002, 34(1):1-47.
    [Sheikholeslami98] Sheikholeslami G., Chatterjee S., Zhang Aidong. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. In: Ashish Gupta, Oded Shmueli, Jennifer Widom (Eds.), Proceedings of 24th International Conference on Very Large Data Bases, New York City, New York, USA: Morgan Kaufmann, 1998:428-439.
    [Spertus97] E. Spertus. Smokey: automatic recognition of hostile messages. In Proceedings of the Conference on Innovative Applications of Artificial Intelligence, 1997.
    [Sudipto98] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In: Laura M. Haas, Ashutosh Tiwary (Eds.), Proceedings of ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA: ACM Press, 1998:73-84.
    [Sudipto99] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: a robust clustering algorithm for categorical attributes. In Proc. of the 15th Int'l Conf. on Data Eng., 1999.
    [Timo97] Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen. WEBSOM: self-organizing maps of document collections. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland, 1997:310-315.
    [Thomas00] Thomas Emerson, Segmenting Chinese in Unicode, 16th International Unicode Conference, 2000.
    [Turney02] P. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
    [Turney03] Turney, P., Littman, M.L. Measuring Praise and Criticism: Inference of Semantic Orientation from Association. In ACM Transactions on Information Systems (TOIS), Vol. 21, No. 4, October 2003:315-346.
    [Utg88] P.E. Utgoff. An incremental ID3. In Proceedings of the Fifth International Conference on Machine Learning, San Mateo, CA, 1988:107-120.
    [Wang99] WANG, W., YANG, J., and MUNTZ, R.R. STING+: An approach to active spatial data mining. In Proceedings 15th ICDE, Sydney, Australia, 1999:116-125.
    [Wang01] K. Wang, S. Zhou and Y. He. Hierarchical classification of real life documents. In Proceedings of the First Siam International Conference on Data Mining. Chicago, 2001.
    [Wartick92] S. Wartick. Boolean operations. In W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data structure & Algorithms. Prentice Hall, Englewood Cliffs, NJ, USA, 1992: 264-292.
    [Weston99] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN-99), Bruges, Belgium, 1999:219-224.
    [Wiebe00] J. Wiebe. Learning subjective adjectives from corpora. In Proceedings of 17th National Conference on Artificial Intelligence, Austin, Texas, 2000.
    [Wilkinson91] R. Wilkinson and P. Hingston. Using the cosine measure in a neural network for document retrieval. In Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, USA, Oct. 1991: 202-210.
    [William95] William A W. Conceptual Indexing: A Better Way To Organize Knowledge. Forthcoming Technical Report. Sun Microsystems Lab., http://www.sunlabs.com/research/knowledge. 1995.
    [Wu04] Wu Shaoquan, Huang Jiwu, Huang Daren. DWT-Based Audio Watermarking with Self-Synchronization. Chinese Journal of Computers, 2004, 27(3): 365-370.
    [Xu98] Xiaowei Xu, Martin Ester, Hans-Peter Kriegel, Jorg Sander. A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. The Proceedings of 14th International Conference on Data Engineering (ICDE'98), Orlando, FL, 1998:324-331.
    [Yang92] Y. Yang and C.G. Chute. A linear least squares fit mapping method for information retrieval from natural language texts. In Proceedings of the 14th Conference on Computational Linguistics (COLING92), 1992.
    [Yang97] Y.Yang and J.O.Pedersen. A comparative study on feature selection in text categorization. In Proceedings of ICML97, 14th International Conference on Machine Learning, pages 412-420, Nashville, US, 1997. Morgan Kaufmann Publishers, San Francisco, US.
    [Yang99A] Y.Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, vol.1, nos. 1/2, 1999:67-88.
    [Yang02] Y.Yang, S.Slattery and R.Ghani. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, Vol.18, 2002:219-241.
    [Yi03] J. Yi, T. Nasukawa, R. Bunescu, W. Niblack. Sentiment analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques. Proceedings of the Third IEEE International Conference on Data Mining, November, 2003:19-22.
    [Yossi98] Yossi Matias, Jeffrey Scott Vitter, and Min Wang. Wavelet-Based Histograms for Selectivity Estimation. In Proceedings of the 1998 ACM SIGMOD, Seattle, Washington, June 1998, 448-459.
    [Zaiane02] O.R.Zaiane and M.L.Antonie. Classifying text documents by associating terms with text categories. In Thirteenth Australasian Database Conference (ADC'02), Melbourne, Australia, January 2002: 215-222.
    [Zamir97] O. Zamir, O. Etzioni, O. Madani and R.M. Karp. Fast and Intuitive Clustering of Web Documents. KDD'97, 1997:287-290.
    [Zhang96] Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases. In: H. V. Jagadish, Inderpal Singh Mumick (Eds.), Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada: ACM Press, 1996:103-114.
    [Zhao04] Zhao Hui, Hou Jianrong, Shi Baile. Research on Similarity of Stochastic Non-Stationary Time Series Based on Wavelet-Fractal. Journal of Software, 2004,15 (5):633-640.
    [边01] 边肇祺等．模式识别．清华大学出版社，2001：9-43．
    [卜02] 卜东波，白硕，李国杰．聚类／分类中的粒度原理．计算机学报，2002， 25(8)：810-816．
    [陈02] 陈宁，陈安，周龙骧，贾维嘉，罗三定．基于模糊概念图的文档聚类及其在Web中的应用．软件学报，2002，13(8)：1598-1605．
    [陈05] 陈小云．文本挖掘若干关键技术研究[博士论文]．复旦大学，上海，2005．
    [刁02] 刁力力，胡可云，陆玉昌，石纯一．用Boosting方法组合增强Stumps进行文本分类．软件学报，2002，13(8)：1363-1367．
    [冯05] 冯玉才，张鹏程．基于近似查询的在线分组聚集及其应用．计算机工程．2005，31(16)：97-99．
    [韩05] 韩恺，岳丽华，龚育昌．利用关系数据库系统对半结构化数据进行近似查询．中国科学技术大学学报．2005，35(5)：674-682．
    [解02] 解冲锋，李星．基于序列的文本自动分类算法．软件学报，2002，13(4)：783-789．
    [李05] 李荣陆．文本分类及其相关技术研究[博士论文]．复旦大学，上海，2005．
    [刘04] 刘永丹，曾海泉，李荣陆，胡运发．基于语义分析的倾向性过滤．通信学报，2004，25(7)：78-85．
    [刘04+] 刘永丹．文档数据库若干关键技术研究[博士论文]．复旦大学，上海，2004．
    [鲁00] 鲁松，李晓黎，白硕，王实，文档中词语权重计算方法的改进，中文信息学报，2000，14(6)：8-13．
    [毛05] 毛国君，段立娟，王实，石云．数据挖掘原理与算法．清华大学出版社，2005：109-181．
    [宋02] 宋擒豹等，基于关联规则的Web文档聚类算法，软件学报，2002，13(3)：417-423．
    [苏02] 苏中，马少平，杨强，张宏江．基于Web-Log Mining的Web文档聚类．软件学报，2002，13(1)：99-104．
    [孙02] 孙即祥．现代模式识别．国防科技大学出版社．2002：13-45．
    [唐03] 唐春生，金以慧．基于全信息矩阵的多分类器集成方法．软件学报，2003， 14(6)：1103-1109．
    [万05] 万昊，任勇，山秀明．基于混淆矩阵的全方位角雷达目标识别．微电子与计算机，2005，22(3)：136-143．
    [王04] 王建会．中文信息处理中若干关键技术的研究[博士论文]．复旦大学，上海，2004．
    [吴03] 吴雅倩等．基于最大熵方法的中英文基本名词短语识别，计算机研究与发展，2003，40(3)：440-446．
    [袁04] 袁时金，李荣陆，周水庚，胡运发．层次化中文文档分类．通信学报，2004，25(11)：55-63．
    [战99] 战学刚，林鸿飞，姚天顺．中文文献的层次分类方法．中文信息学报，1999，13(6)：20-25．
    [张98] 张家騄，齐士钤，俞舸．汉语语音合成系统评价方法．声学学报，1998，23(1)：19-30．
    [张00] 张学工．关于统计学习理论与支持向量机．自动化学报，2000， 26(1)：32-42．
    [张03] 张俐，李晶皎，胡明涵，姚天顺．中文WordNet的研究及实现．东北大学学报(自然科学版)，2003，24(4)：327-329．
    [赵01] 赵一唯，王和珍，李振东．WWW信息检索综述．南京大学学报(自然科学)，2001，37(2)：192-198．
    [周00] 周水庚．中文文本数据库若干关键技术研究[博士论文]．复旦大学，上海， 2000．
    [周01a] 周水庚，关佶红，俞红奇，胡运发．基于Ngram信息的中文文档分类研究．中文信息学报，2001，15(1)：34-39．
    [周01b] 周水庚，关佶红，胡运发．无需词典支持和切词处理的中文文档分类．高技术通讯，2001，03：31-35．
    [周01c] 周水庚，关佶红，胡运发，周傲英．一个无需词典支持和切词处理的中文文档分类系统．计算机研究与发展，2001，38(7)：839-844．