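     (1) Multiclass text categorization. To address the weaknesses of Error Correcting Output Code (ECOC) in decoding, this dissertation proposes a multiclass text categorization algorithm based on support vector machines and probabilistic ECOC. Multiple binary classifiers are trained from a suitably constructed coding matrix, and a sigmoid function converts their decision values into probabilities. Two decoding methods are proposed for determining the class of a test document: class-sequence probability computation and the pseudo-inverse of the coding matrix. Experiments on standard Chinese and English datasets show that the proposed method outperforms traditional ECOC decoding and other classic classification algorithms. The algorithm also maintains fairly stable accuracy when the class distribution of samples is imbalanced.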
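     (3) Web entity-oriented search. Centered on the Entity Track task of the Text REtrieval Conference (TREC) evaluation in which the author participated, this dissertation proposes a series of algorithms for mining and retrieving entities in web pages. Entity extraction combines manual assistance with automation and rules with statistics to build entity lexicons covering multiple types. For entity ranking, a Document-Centered Model (DCM) and an Entity-Centered Model (ECM) are proposed, and semantic category labels are introduced on top of them to improve retrieval precision. In addition, on the assumption that each entity in a web page should have a unique identifier, a rule-based homepage allocation algorithm is proposed. A first-place result in the evaluation confirms the effectiveness of the algorithms. Furthermore, in tests on the semi-structured English Wikipedia dataset, introducing semantic category labels raises the NDCG of the two original models by 12.1% and 25.6%, respectively.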
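     (4) Complex network modeling and applications based on activation force and affinity. Taking natural language text as an example, this dissertation uses statistics such as word frequency, co-occurrence, and distance to simulate the word activation effect known from biology and psychology, and computes the Word Activation Force (WAF). Word affinity is computed from WAF to build an undirected word network for studying the semantic similarity of words. On this basis, WAF and affinity are applied to text representation, feature selection, and text categorization. The algorithm can also be used to model protein-protein interaction networks and analyze associations between proteins. In addition, entity affinity helps improve ranking in entity retrieval. Experimental results show that complex network modeling based on activation force and affinity is of great significance for web text mining.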
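As a rough illustration of how frequency, co-occurrence, and distance statistics might be combined into a directed word activation force, the sketch below assumes the form waf(i, j) = (f_ij / f_i) * (f_ij / f_j) / d_ij^2, where f_i and f_j are word frequencies, f_ij counts how often word i precedes word j within a window, and d_ij is their mean distance. This abstract does not state the formula, so the formula, the function name `word_activation_force`, the `window` parameter, and the toy corpus are all assumptions for illustration, not the dissertation's implementation.

```python
from collections import Counter, defaultdict


def word_activation_force(sentences, window=2):
    """Compute a directed activation force for ordered word pairs.

    Assumed form (not given in the abstract):
        waf(i, j) = (f_ij / f_i) * (f_ij / f_j) / d_ij ** 2
    """
    freq = Counter()              # f_i: how often each word occurs
    pair_count = Counter()        # f_ij: how often i precedes j within the window
    pair_dist = defaultdict(int)  # summed distances, used for the mean d_ij

    for tokens in sentences:
        freq.update(tokens)
        for i, w in enumerate(tokens):
            for d in range(1, window + 1):
                if i + d >= len(tokens):
                    break
                v = tokens[i + d]
                pair_count[(w, v)] += 1
                pair_dist[(w, v)] += d

    waf = {}
    for (w, v), f_ij in pair_count.items():
        mean_d = pair_dist[(w, v)] / f_ij
        waf[(w, v)] = (f_ij / freq[w]) * (f_ij / freq[v]) / mean_d ** 2
    return waf


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["web", "text", "mining"],
    ["web", "text", "search"],
]
waf = word_activation_force(corpus, window=2)
for pair, value in sorted(waf.items()):
    print(pair, round(value, 3))
```

In the toy corpus, "web" always precedes "text" at distance 1, so that pair receives the largest force. The reverse pair ("text", "web") never occurs and has no entry, which reflects that the force in this sketch is directed.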
With the rapid development of the Internet and telecommunication networks, web text has become an important carrier and an indispensable source of information. Web text mining draws on theories from data mining, pattern recognition, information retrieval, natural language processing, and related fields. It aims to extract comprehensible, easy-to-use knowledge from large and complex collections of text. This dissertation focuses on several key problems in web text mining, including text categorization, SMS filtering, information retrieval, and complex networks.
     (1) Multiclass text categorization. To address the weaknesses of Error Correcting Output Code (ECOC) in decoding, this dissertation proposes a multiclass text categorization method based on Support Vector Machine (SVM) and probabilistic ECOC. Several binary classifiers are trained according to a suitable encoding matrix, and their decision function values are transformed into probabilities by a sigmoid function. Two decoding algorithms are introduced for classifying samples: one calculates the probability of each class, and the other uses the pseudo-inverse of the encoding matrix. Experiments on standard Chinese and English datasets show that the methods are superior to traditional ECOC decoding and other classic algorithms. Moreover, the methods maintain stable accuracy when samples are unevenly distributed across classes.
     (2) Evolutionary SMS filtering. This dissertation proposes a series of algorithms and systems for evolutionary SMS filtering that address three difficulties: fast updates, personalization, and the lack of training samples. First, a basic evolutionary system based on a Naive Bayes classifier is introduced. Its innovations lie in flexible user feedback, adaptive learning, and evolutionary learning. Three types of personalized feedback are put forward according to users' habits. Evolutionary learning and adaptive learning are used to update features and their weights. Moreover, this dissertation proposes an interlayer mapping-based SMS filtering algorithm for cases that demand high precision but offer few training samples. Experimental results show that the proposed method can effectively process a stream of short messages and update the filter automatically. The interlayer mapping-based filtering algorithm reaches the required accuracy with rapid convergence. It can be combined with traditional methods to boost performance when enough training samples are available.
     (3) Web entity-oriented search. This dissertation proposes a set of algorithms and systems for entity mining and retrieval based on the Entity Track at the Text REtrieval Conference (TREC). Entity lexicons covering dozens of types are built for entity extraction through semi-automatic methods that combine rules and statistics. A Document-Centered Model (DCM) and an Entity-Centered Model (ECM) are proposed for entity ranking, and semantic category labels are introduced to improve accuracy. On the assumption that each entity in a web page should be uniquely identified, a rule-based homepage allocation algorithm is presented. A first-place ranking in the official assessment attests to the effectiveness of the proposed methods. In addition, tests on the semi-structured English Wikipedia dataset indicate that semantic category labels improve the NDCG of DCM and ECM by 12.1% and 25.6%, respectively.
     (4) Modeling and applications of complex networks based on activation force and affinity measure. Taking natural language text as an example, Word Activation Force (WAF), which mimics activation effects in biology and psychology, is computed by combining statistics such as word frequency, co-occurrence, and distance. A word affinity measure and an undirected word network for studying the semantic similarity between words are then derived from WAF. On this basis, WAF and the word affinity measure are applied to text representation, feature selection, and text categorization. These methods are also suitable for Protein-Protein Interaction (PPI) network modeling and protein association analysis. In addition, the entity affinity measure contributes to re-ranking in entity retrieval. Experimental results demonstrate that complex network modeling based on activation force and affinity measure is of great significance for web text mining.
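The per-class probability decoding in item (1) can be sketched as follows. This is a minimal illustration, not the dissertation's algorithm: it assumes a binary {+1, -1} code matrix, treats the binary classifiers as independent, and uses a Platt-style sigmoid with placeholder parameters `a` and `b` that would normally be fitted on held-out data. The matrix `CODE`, the decision values in `scores`, and the function names are hypothetical.

```python
import math


def sigmoid(score, a=-1.0, b=0.0):
    """Map a decision value to P(bit = +1) as 1 / (1 + exp(a * score + b)).

    a and b are placeholders; with a < 0, larger scores give higher probability.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))


def decode(code_matrix, scores):
    """Return (index of the most probable class, normalized class probabilities)."""
    bit_probs = [sigmoid(s) for s in scores]
    class_scores = []
    for row in code_matrix:
        p = 1.0
        for bit, q in zip(row, bit_probs):
            # Use P(bit = +1) where the codeword expects +1, else its complement.
            p *= q if bit == 1 else (1.0 - q)
        class_scores.append(p)
    total = sum(class_scores)
    probs = [p / total for p in class_scores]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical 4-class, 6-bit code matrix: rows are classes,
# columns are the binary classifiers trained from the matrix.
CODE = [
    [1, 1, 1, -1, -1, -1],
    [1, -1, -1, 1, 1, -1],
    [-1, 1, -1, 1, -1, 1],
    [-1, -1, 1, -1, 1, 1],
]

# Hypothetical decision values from the six binary classifiers for one document.
scores = [-0.9, 1.4, -0.3, 2.0, -1.1, 0.7]
best, probs = decode(CODE, scores)
print(best, [round(p, 3) for p in probs])
```

Unlike Hamming-distance decoding, which only compares the signs of the outputs, this style of decoding keeps the confidence of each binary classifier, so a weakly wrong bit costs less than a strongly wrong one.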
