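Text Subtopic-field Segmentation and Unsupervised Feature Extraction

Chinese title: 文本主题域划分与无监督特征提取
Author: Wang Xiaofang (王小芳)
Degree level: Doctoral
Discipline: Computational Mathematics
Keywords (translated from Chinese): Subtopic-field segmentation ; Feature extraction ; Latent quantum ; Obvious quantum
English keywords: Text clustering ; Chinese text ; Sub-topic field ; Feature extraction ; Weighting
Degree year: 2009
Advisors: Zhang Shugong (张树功) ; Zhang Wenyi (张文燚)
Discipline code: 070102
Degree-granting institution: Jilin University
Submission date: 2009-12-01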

Abstract

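This thesis addresses several related problems in text clustering, namely subtopic-field segmentation and unsupervised feature extraction. Few methods for text subtopic-field segmentation exist at present, and those that do are constrained by their reliance on domain knowledge bases and by non-global optimization, which severely limits their generality and segmentation quality. The thesis establishes a new, globally optimal subtopic-field segmentation model that is independent of any specific application domain. The construction of the model focuses on the within-subtopic-field distance, the between-subtopic-field distance, the within-subtopic-field angle, and the between-subtopic-field angle, and the optimal segmentation pattern is obtained by solving the optimization model.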
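The abstract does not give the objective function itself, so the following Python sketch is only an illustration of the stated criterion (small distances and angles within a subtopic field, large ones between fields). Everything specific in it is an assumption: the function names `cost` and `best_segmentation`, the use of segment centroids, Euclidean distance, equal weighting of the four terms, comparison of adjacent segments only, and the brute-force search that stands in for solving the optimization model.

```python
import itertools
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(u, v):
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle(u, v):
    """Angle in radians between two vectors; zero vectors count as orthogonal."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return math.pi / 2
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))

def cost(units, boundaries):
    """Hypothetical objective: within-field spread minus between-field spread.

    `units` is the text as an ordered list of vectors (e.g. one per sentence);
    `boundaries` are the indices at which a new subtopic field starts.
    """
    cuts = [0, *boundaries, len(units)]
    segments = [units[a:b] for a, b in zip(cuts, cuts[1:])]
    centers = [centroid(seg) for seg in segments]
    # Mean distance and angle of each unit to the centroid of its own field.
    within = sum(
        sum(distance(u, c) + angle(u, c) for u in seg) / len(seg)
        for seg, c in zip(segments, centers)
    ) / len(segments)
    # Mean distance and angle between centroids of adjacent fields.
    pairs = list(zip(centers, centers[1:]))
    between = (
        sum(distance(a, b) + angle(a, b) for a, b in pairs) / len(pairs)
        if pairs else 0.0
    )
    return within - between

def best_segmentation(units, n_segments):
    """Exhaustively try every boundary placement and keep the cheapest one."""
    candidates = itertools.combinations(range(1, len(units)), n_segments - 1)
    return min(candidates, key=lambda b: cost(units, b))

# Six toy "sentence vectors": three about one topic, then three about another.
units = [[1, 0], [1, 0.1], [0.9, 0], [0, 1], [0.1, 1], [0, 0.9]]
print(best_segmentation(units, 2))
```

Exhaustive search is globally optimal by construction but only feasible for short texts; it is used here solely to keep the sketch small.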
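     Feature extraction and weight calculation are the most important steps in text clustering, and this thesis proposes a new method for both. First, the semantic quantum is defined, and semantic quanta are divided into latent quanta and obvious quanta according to how much each type contributes to expressing the topic of the text. Next, obvious quanta are given a structured representation by means of an improved vector space model, and latent quanta by means of an improved word-series model, which yields a new text representation model based on a topic-concept model. Finally, the weights of obvious quanta are computed with an obvious-quantum distribution model, and the weights of latent quanta with a co-occurrence model of latent quanta within an effective area. The algorithm needs no domain knowledge base and supports subsequent incremental text clustering, which lays a foundation for applying text clustering on the Internet.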
With the rapid development and popularization of the Internet, online information resources keep growing, and society has moved from the information age into an age of abundant digital information. Faced with this deluge of online resources, users find it difficult to locate the information they actually need quickly and efficiently. How to organize, manage, and use such information rationally and effectively has therefore gradually become an important topic in information processing research. Traditionally, information processing relied mainly on manual classification and selection: professionals analyzed the content of each web page and assigned it to one or more appropriate categories. With the rapid growth in the volume of Web information, this manual approach has clearly become unrealistic.
     Text clustering is a powerful tool for organizing and managing information. It can help bring order to the currently chaotic state of information on the Internet and make it easier for users to locate the information they need accurately. Continued study of text clustering is therefore necessary. Text clustering has accordingly become an increasingly important research area, and it is gradually being combined with search engines and information filtering technologies as an important means of obtaining web-based information.
     Text clustering is a classic problem in natural language processing. To turn text clustering into a general pattern recognition problem, several problems must be solved. First, a multi-topic text should be divided into a number of single-topic subtopic fields. Next, appropriate feature units are selected according to the characteristics of natural language, and their weights are calculated and sorted. Finally, the feature units are clustered using a suitable clustering strategy. To address the current problems in subtopic-field segmentation and feature extraction, the main work of this thesis is as follows:
     1. The text representation model was studied. The semantic quantum was defined from the key feature elements and divided into obvious quanta and latent quanta according to their contribution to expressing the topic and the concept. An obvious quantum directly indicates the topic of a text, while a latent quantum expresses the details of the text through co-occurrence within an effective area. An improved vector space model gives the structured representation of obvious quanta, and an improved word-series model gives the structured representation of latent quanta; together they establish a new text representation model based on the topic and the concept.
     2. A subtopic-field segmentation technique based on an optimal control model was proposed. It rests on the basic assumption that the best segmentation pattern is the one in which distances and angles within a subtopic field are small and distances and angles between subtopic fields are large. The objective function of the optimal control model was constructed from the within-subtopic-field distance, the between-subtopic-field distance, the within-subtopic-field angle, and the between-subtopic-field angle. Solving the optimal control model yields the optimal subtopic-field segmentation. The method is globally optimal and independent of any specific application, so it can be applied not only to specific applications but also to Internet information retrieval and processing.
     3. A new unsupervised feature extraction model based on the text conceptual model was presented. First, the weight of each obvious quantum is computed from the obvious-quantum entanglement intensity. Next, the weight of each latent quantum is computed from the window function of the latent quantum. Finally, the obvious-quantum feature sequence and the latent-quantum feature sequence are obtained by sorting the respective weights. If the goal is only to categorize the text set, the categories can be obtained by clustering the obvious-quantum features. To reflect the details of the categories, the latent-quantum features must additionally be clustered on the basis of the obvious-quantum clustering results. In practice, different features can be selected for different needs, which greatly reduces both the computational complexity and the redundancy between features.
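The idea of weighting latent quanta by co-occurrence inside a window can be pictured with a small Python sketch. It is not the thesis's weighting formula, which the abstract does not state. The function names `latent_weights` and `ranked`, the rectangular window, the window size, the normalization to a sum of 1, and the assumption that the obvious quanta are already known are all hypothetical choices made for illustration.

```python
from collections import Counter

def latent_weights(tokens, obvious_terms, window=2):
    """Weight each non-obvious term by how often it falls within `window`
    positions of an obvious term (a rectangular window, assumed here)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in obvious_terms:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        counts[tok] += sum(
            1 for j in range(lo, hi) if j != i and tokens[j] in obvious_terms
        )
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize so the weights sum to 1; terms with no co-occurrence are dropped.
    return {t: c / total for t, c in counts.items() if c > 0}

def ranked(weights):
    """Terms sorted by descending weight, ties broken alphabetically."""
    return sorted(weights, key=lambda t: (-weights[t], t))

tokens = ["cluster", "text", "cluster", "web", "page"]
weights = latent_weights(tokens, obvious_terms={"cluster"}, window=1)
print(ranked(weights))
```

Sorting the weights gives a feature sequence of the kind described in item 3; a term that never appears near an obvious quantum, such as "page" in the toy input, receives no weight.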

