Research on Knowledge Extraction and Management of the Big Data in Wikipedia

Original title (Chinese): 维基百科大数据的知识挖掘与管理方法研究
Author: Xiao Kui (肖奎)
Degree level: Doctoral
Discipline: Computer Software and Theory
Keywords: Wikipedia; big data; mass collaboration; knowledge extraction; knowledge management
Degree year: 2013
Advisor: Li Bing (李兵)
Discipline code: 081202
Degree-granting institution: Wuhan University
Thesis submission date: 2013-11-01

Abstract

Humanity has entered the era of big data, which is changing production, daily life, scientific research, and services. At the same time, the traditional process of knowledge formation and decision making, "data→information→knowledge→wisdom→decision", is severely challenged by the characteristics of big data: enormous volume, diverse modalities, uncertain veracity, and rapid change. Only by transforming large and complex data into information and knowledge can it support sound decisions. Practice has shown that collective-intelligence methods, which rely on large-scale mass collaboration and are nonlinear, decentralized, and bottom-up, are an effective way to separate valuable knowledge from the noise in big data.
     Wikipedia is the most typical platform for producing knowledge through mass collaboration, and it is also a typical example of big data. Because it is built by mass collaboration, the quality of its content is uneven. The main goal of this thesis is to extract high-quality domain knowledge from Wikipedia and to manage that knowledge effectively. The main contributions are as follows:
     (1) The characteristics of Wikipedia's mass collaboration environment are summarized, including how articles are collaboratively edited, the article quality rating scale, and the rules for selecting high-quality articles.
     (2) The influence of editors' collaborative behavior on article quality is studied. Editor networks are built from user talk pages, and the effects of two attributes on the speed of article quality promotion are analyzed: the proportion of conversational editors among an article's editors, and the clustering coefficient of the editor network. This lays the groundwork for the subsequent article quality assessment.
     (3) A method for knowledge quality management in Wikipedia is proposed. It uses both article attributes and editor attributes, and it can assess articles at every quality level. All of these attributes can be obtained from the Wikipedia database, and the database structure is the same across language editions, so the method can readily be applied to articles in any language edition.
     (4) Using this quality management method, high-quality articles in a specified domain are selected from Wikipedia, and their relevance to the domain is analyzed. High-quality articles that are closely related to the domain are extracted as ontology concepts, and the relations among these articles are extracted as ontology relations, yielding a high-quality domain ontology. To validate the approach, the resulting domain ontologies are applied in O-RGPS, a domain modeling tool, to annotate the Role, Goal, Process, and Service domain models. They are also applied in S2R2, a Web service registry and management platform, to support semantic annotation and semantic search of Web services.
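
The two network attributes named in contribution (2) can be illustrated with a minimal sketch. The abstract does not give the thesis's exact definitions, so the following are assumptions: the editor network is undirected, two editors are linked when one posts on the other's user talk page, a "conversational editor" is an editor with at least one such link, and the clustering coefficient is the mean local coefficient in the Watts-Strogatz sense. All function names and the sample data are hypothetical.

```python
from itertools import combinations


def build_editor_network(messages):
    """Build an undirected graph from (sender, talk_page_owner) pairs."""
    graph = {}
    for sender, owner in messages:
        if sender == owner:
            continue  # ignore editors writing on their own talk page
        graph.setdefault(sender, set()).add(owner)
        graph.setdefault(owner, set()).add(sender)
    return graph


def conversational_ratio(graph, editors):
    """Share of an article's editors who appear in the talk-page network."""
    editors = set(editors)
    if not editors:
        return 0.0
    return sum(1 for e in editors if graph.get(e)) / len(editors)


def average_clustering(graph):
    """Mean local clustering coefficient over all nodes in the graph."""
    if not graph:
        return 0.0
    total = 0.0
    for node, neighbours in graph.items():
        k = len(neighbours)
        if k < 2:
            continue  # coefficient is 0 for nodes with fewer than two neighbours
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


# Hypothetical data: editors of one article and their talk-page messages.
editors = ["ann", "bob", "cat", "dan", "eve"]
messages = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("dan", "ann")]

network = build_editor_network(messages)
ratio = conversational_ratio(network, editors)
clustering = average_clustering(network)
```

In this toy example four of the five editors take part in talk-page exchanges, and three of them form a triangle, which is what raises the clustering coefficient. Under the assumptions above, both values could be computed per article and compared against how quickly the article moved up the quality scale.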

