Research on a Cross-Language Text Categorization Model Based on Interlingua Semantics
Abstract
With the growth of the Internet, the Web has become a major source of information, and the volume of information from government, academic, and commercial sources is increasing rapidly. Together these sources form a multilingual knowledge base, yet most people search only in their native language, so the information they can actually understand is just the tip of the iceberg. The multilingual nature of online information, combined with the limited number of languages any one person commands, has made language one of the main barriers to acquiring and understanding information.
Cross-language text categorization (CLTC) has emerged in response and is attracting growing attention as an effective means of organizing and managing multilingual text from government bodies, academia, business, and international organizations. It overcomes the language barrier and lets users manage and locate the information they need more effectively.
Dictionary-based and machine-translation-based approaches were for a time the most actively studied CLTC techniques. The dictionary-based approach translates with a bilingual dictionary. Its main problem is lexical ambiguity: a word may have several senses, and without context this raises the same word-selection problem faced by general machine translation systems. A second problem is insufficient dictionary coverage: proper nouns such as the names of people, places, and organizations change constantly and may well be missing from the dictionary at translation time. Machine translation, for its part, is mainly applied to translating whole documents, which is inefficient and too costly for large text collections.
The typical translation-free technique for CLTC today is Latent Semantic Indexing (LSI) [1], a method based on content and concepts. Although LSI needs no translation, its singular value decomposition (SVD) is computationally expensive, and the value of K can be determined only by repeated trial.
To address these problems, we propose a CLTC model based on interlingua semantics. The model treats the parallel documents of a bilingual corpus within a unified framework and extracts the semantic correspondences between the two languages. This thesis describes the principle of the model in some detail and studies the stability of its classification performance as the feature dimension and the number of latent variable pairs vary. It also compares cross-language text categorization with monolingual text categorization; the experimental results show that the proposed model achieves good classification stability and accuracy.
The main contributions of this thesis are: first, a new CLTC model based on interlingua semantics, built on an improved partial least squares (PLS) technique; second, a Chinese-English parallel corpus of modest size, which lays a foundation for extending such corpora in future work.
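
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of extracting one latent semantic pair from a parallel corpus with standard first-component PLS (power iteration on the cross-covariance of the two term matrices). It is not the thesis's improved PLS variant. The toy term counts, the labels, and all function names (`first_pls_pair`, `score`, `classify_zh`) are assumptions made for illustration.

```python
import math

def center(rows):
    """Subtract each column's mean; return (centered_rows, column_means)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(r, means)] for r in rows], means

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def first_pls_pair(X, Y, iters=50):
    """Dominant singular vectors (w, c) of Xc^T Yc, found by power iteration.

    X and Y hold the same documents in two languages (rows are aligned).
    """
    Xc, x_means = center(X)
    Yc, y_means = center(Y)
    XcT, YcT = transpose(Xc), transpose(Yc)
    c = [1.0] + [0.0] * (len(Y[0]) - 1)            # arbitrary unit start vector
    for _ in range(iters):
        w = normalize(matvec(XcT, matvec(Yc, c)))  # w proportional to Xc^T Yc c
        c = normalize(matvec(YcT, matvec(Xc, w)))  # c proportional to Yc^T Xc w
    return w, c, x_means, y_means

def score(doc, weights, means):
    """Project one centered document onto a latent direction."""
    return sum((v - m) * wt for v, m, wt in zip(doc, means, weights))

# Hypothetical parallel corpus: 4 aligned documents, 4 terms per language.
# Columns 0-1 are sports terms, columns 2-3 are finance terms.
X_en = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
Y_zh = [[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels = ["sports", "sports", "finance", "finance"]

w, c, x_means, y_means = first_pls_pair(X_en, Y_zh)

# Class centroids in the latent space, learned from the English side only.
centroids = {}
for lab in set(labels):
    scores = [score(d, w, x_means) for d, l in zip(X_en, labels) if l == lab]
    centroids[lab] = sum(scores) / len(scores)

def classify_zh(doc):
    """Assign a Chinese document to the nearest English-trained centroid."""
    s = score(doc, c, y_means)
    return min(centroids, key=lambda lab: abs(centroids[lab] - s))

print(classify_zh([3, 1, 0, 0]))
print(classify_zh([0, 0, 1, 3]))
```

In this sketch a classifier trained only on English latent scores is applied to Chinese documents without any translation step. Comparing the two languages' scores directly is a simplification that works here because the toy data are symmetric; a fuller PLS treatment would extract several latent pairs and relate the two score spaces by regression.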
References
[1] V. M. Orengo, C. Huyck. Portuguese-English experiments using Latent Semantic Indexing. In: Advances in Cross-Language Information Retrieval: Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002, Rome, Italy, September 19-20, 2002, Revised Papers. Berlin, New York: Springer-Verlag, 2003: 147-154.
[2] Lan Nie, Brian D. Davison, Xiaoguang Qi. Topical Link Analysis for Web Search. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 91-98, Aug. 2006.
[3] Donald Metzler, W. Bruce Croft. Latent Concept Expansion Using Markov Random Fields. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 311-318, July 2007.
[4] Ralitsa Angelova, Gerhard Weikum. Graph-based Text Classification: Learn from Your Neighbors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 485-492, Aug. 2006.
[5] Q. Lu, L. Getoor. Link-based classification. ICML, 2003.
[6] T.-F. Wu, C.-J. Lin, R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5: 975-1005, 2004.
[7] Cong Li, Ji-Rong Wen, Hang Li. Text Classification Using Stochastic Keyword Generation. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
[8] Douglas W. Oard, Bonnie J. Dorr. A Survey of Multilingual Text Retrieval [A]. University of Maryland, Tech. Rep. UMIACS-TR-96-19, CS-TR-3615, 1996.
[9] Min Jinming, Sun Le, Zhang Junlin. Revisiting cross-language information retrieval [J]. Journal of Chinese Information Processing, 2006, 20(4): 33-40. (in Chinese)
[10] F. Gey, A. Chen. TREC-9 Cross-Language Information Retrieval (English-Chinese) Overview [A]. TREC-9 [C]. Gaithersburg, Maryland, 2000.
[11] Lisa Ballesteros, W. Bruce Croft. Resolving ambiguity for cross-language retrieval [A]. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition [C]. Melbourne, Australia, 1998.
[12] Jian-Yun Nie, Michel Simard, Pierre Isabelle, Richard Durand. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web [A]. ACM SIGIR 1999 [C]. Berkeley, California, United States, 1999.
[13] Yi Liu, Rong Jin, Joyce Y. A maximum coherence model for dictionary-based cross-language information retrieval. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 485-492, Aug. 2005.
[14] Jianfeng Gao, Jian-Yun Nie. A Study of Statistical Models for Query Translation: Finding a Good Unit of Translation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 194-201, Aug. 2006.
[15] Wang Jin, Chen Enhong, Zhang Zhenya. An ontology-based cross-language information retrieval model [J]. Journal of Chinese Information Processing, 2004, 18(3): 1-8. (in Chinese)
[16] Michael L. Littman, Susan T. Dumais, Thomas K. Landauer. Automatic Cross-Language Information Retrieval using Latent Semantic Indexing. Working Notes of the AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, 1997.
[17] Lisa Ballesteros, W. Bruce Croft. Dictionary methods for cross-lingual information retrieval. In Proceedings of the 7th International DEXA Conference on Database and Expert Systems Applications, pages 791-801, 1996.
[18] Lide Wu, Xuanjing Huang, et al. FDU at TREC-9: CLIR, QA and Filtering Tasks. In: The Ninth Text REtrieval Conference (TREC 9), 2000.
[19] Jianfeng Gao, Jian-Yun Nie, Endong Xun, Jian Zhang, Ming Zhou, Changning Huang. Improving Query Translation for Cross-Language Information Retrieval using Statistical Models. SIGIR, 2001.
[20] Lin Hongfei, Wang Jianfeng. Design and implementation of a bilingual cross-classification model [J]. Journal of Chinese Information Processing, 2001, 16(6): 27-32. (in Chinese)
[21] Michael L. Littman, Susan T. Dumais, Thomas K. Landauer. Automatic Cross-Language Information Retrieval using Latent Semantic Indexing. Working Notes of the AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, 1997.
[22] Bob Rehder, Michael L. Littman, Susan Dumais, Thomas K. Landauer. Automatic 3-language cross-language information retrieval with latent semantic indexing. In The Sixth Text Retrieval Conference Notebook Papers (TREC6), 1997.
[23] Susan T. Dumais, Thomas K. Landauer, Michael L. Littman. Automatic cross-linguistic information retrieval using latent semantic indexing. SIGIR96 Workshop on Cross-Linguistic Information Retrieval, 1996.
[24] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
[25] Qianli Jin, Jun Zhao, Bo Xu. Weakly-supervised probabilistic latent semantic analysis and its applications in multilingual information retrieval. JSCL, pages 527-533, 2003. (in Chinese)
[26] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 2002, 34(1): 1-47.
[27] Xue Dejun. Research on Key Problems in Automatic Chinese Text Categorization: [PhD dissertation]. Beijing: Department of Computer Science and Technology, Tsinghua University, 2004. (in Chinese)
[28] M. F. Porter. An Algorithm for Suffix Stripping. Program, 1980, 14(3): 130-137.
[29] Li Ronglu. Research on Text Categorization and Related Techniques: [PhD dissertation]. Shanghai: Department of Computer Science and Technology, Fudan University, 2005. (in Chinese)
[30] V. Vapnik. The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[31] C. Buckley, G. Salton, J. Allan, A. Singhal. Automatic Query Expansion Using SMART: TREC 3. In Proc. 3rd Text Retrieval Conference, NIST, 1994.
[32] G. Salton, M. J. McGill. An Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[33] G. Salton, C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 1988, 24(5): 513-523.
[34] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the 10th European Conference on Machine Learning, Lecture Notes in Computer Science, 1998, 1398: 137-142.
[35] T. Joachims. Making Large-Scale SVM Learning Practical. In: Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, A. Smola, eds. Cambridge, MA, USA: MIT Press, 1999.
[36] J. Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, A. Smola, eds. Cambridge, MA, USA: MIT Press, 1998.
[37] Wang Hao. Implementation methods and key technologies of cross-language information retrieval. Journal of Information, 2005, No. 7: 46-49. (in Chinese)
[38] R. H. Creecy, B. M. Masand, S. J. Smith, D. L. Waltz. Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine. Comm. ACM, 35: 48-63, 1992.
[39] Y. Yang. Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '94), pages 13-22, 1994.
[40] Yiming Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, pp. 67-88, 1999.
[41] D. D. Lewis, M. Ringuette. Comparison of two learning algorithms for text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR '94), 1994.
[42] Moulinier. Is learning bias an issue on the text categorization problem? Technical report, LAFORIN-LIP, University Paris VI, 1997.
[43] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[44] T. M. Mitchell. Machine Learning. Translated by Zeng Huajun, Zhang Yinkui, et al. Beijing: China Machine Press, 2003. (in Chinese)
[45] Alfio Gliozzo, Carlo Strapparava. Cross language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora. Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 9-16.
[46] Hou Yanfei. Research on Cross-Language Information Retrieval. Peking University thesis, 2003(6). (in Chinese)
[47] Wang Zhijin, Jia Fuxin, Zheng Hongjun, et al. (translators). Modern Information Retrieval. China Machine Press. (in Chinese)
[48] Wu Dan. Applications and research progress of cross-language information retrieval technology. Information Science, 2006, Vol. 24, No. 9. (in Chinese)
[49] http://trec.nist.gov, 2006-01-05.
[50] http://clef.isti.cnr.it, 2006-01-05.
[51] Chung-Hong Lee, Hsin-Chang Yang, Sheng-Min Ma. A Novel Multilingual Text Categorization System using Latent Semantic Indexing. ICICIC, Vol. 2: 503-506, 2006.
[52] Nuria Bel, Cornelis H. A. Koster, Marta Villegas. Cross-lingual text categorization. In ECDL, pages 126-139, 2003.
[53] Yaoyong Li, John Shawe-Taylor. Journal of Intelligent Information Systems, Volume 27, Number 2, September 2006, pages 117-133.
[54] http://wordnet.princeton.edu/
[55] http://research.microsoft.com/nlp/Projects/MindNet.aspx
[56] http://www.illc.uva.nl/EuroWordNet/
[57] http://www.keenage.com/
[58] Nie Jianyun, Chen Jiang. Building a Chinese-English statistical translation model from parallel web pages. Journal of Chinese Information Processing, 2001, 15(1): 1-10. (in Chinese)
[59] H. Wold. Partial least squares [M]. In: S. Kotz, N. L. Johnson, eds. Encyclopedia of Statistical Sciences. New York: Wiley, 1985.
[60] Wang Huiwen. Partial Least Squares Methods and Their Applications. Beijing: National Defense Industry Press, 1999. (in Chinese)
[61] Gao Huixuan. Applied Multivariate Statistical Analysis. Beijing: Peking University Press, 2005. (in Chinese)
[62] Amine, Bentaalah Mohamed Mimoun, Malki. Computer Systems and Applications, 2007 (AICCSA '07), IEEE/ACS International Conference on, 13-16 May 2007, pages 848-855.
[63] Hsin-Chang Yang, Ding-Wen Chen, Chung-Hong Lee. Mining Multilingual Texts Using Growing Hierarchical Self-Organizing Maps. Accepted in The International Conference on Machine Learning and Cybernetics (ICML2007), Hong Kong, China, Aug. 19-22, 2007.
