Research on Web Mining Technology and Its Applications on the Internet

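English title: Research on Web Mining Technologies and Its Application on Internet
Author: Wang Wei (王伟)
Degree level: Master's
Discipline: Communication and Information Systems
Keywords (Chinese): Web挖掘 ; 机器学习 ; 话题检测与追踪 ; 用户行为分析
Keywords (English): Web Mining ; Machine Learning ; Topic Detection and Tracking ; User Behavior Analysis
Degree year: 2013
Advisor: Jiang Mingyan (江铭炎)
Discipline code: 081001
Degree-granting institution: Shandong University
Submission date: 2013-03-20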

Abstract

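As information technology continues to advance, computing and communication technologies not only drive the informatization of modern society but also influence and reshape everyday life. At the same time, information technology has brought explosive growth in data, and there is an urgent need for solutions that can use and process massive data effectively. Data mining emerged against this big-data background. As a branch of the field, Web mining targets the effective organization and use of the massive data on the World Wide Web. Because Internet technology changes rapidly while Web mining developed comparatively late, this thesis takes Web mining as its core subject and analyzes its applications on the Internet in depth.
     This thesis first introduces the research background, current status, technical difficulties, and future directions of Web mining technology, and explains related concepts such as data mining and machine learning in depth. It then turns to the implementation process and application scenarios of Web mining, introducing the core steps of text preprocessing and the technical background of two applications: topic detection and tracking, and user behavior analysis.
     As an important application of Web content mining, topic detection and dynamic tracking aims to detect previously unknown topics and to track the subsequent development of existing ones.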
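     For topic detection and dynamic tracking over news reports published in online media, this thesis implements a hybrid clustering solution. The solution adjusts the topic model hierarchically on the basis of a "contribution degree", which makes it better suited to building Internet news topics and substantially improves efficiency. Experiments on real Internet news data show that, compared with the K-Means algorithm, the solution achieves markedly higher precision and recall, and that the resulting topic tree is clearly hierarchical.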
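The abstract does not state how the comparison against K-Means was scored. As a minimal sketch, assuming the rate reported alongside recall is the usual precision measure, one detected topic cluster can be scored against one manually labeled topic as follows. The function name and the document ids are hypothetical.

```python
def precision_recall(detected, reference):
    """Score one detected topic cluster against one labeled topic.

    Both arguments are collections of document ids (hypothetical here).
    """
    detected, reference = set(detected), set(reference)
    hits = len(detected & reference)          # documents placed in the right topic
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# A cluster of four stories, three of which belong to a five-story topic.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d4", "d5", "d6"})
print(p, r)
```

Precision penalizes unrelated stories pulled into a topic, while recall penalizes stories of the topic that were missed, so the two are normally reported together.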
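     For topic detection and dynamic tracking over Chinese microblog text, this thesis proposes an incremental fuzzy clustering solution based on topic words. The solution first proposes a spam-filtering scheme tailored to the textual characteristics of microblogs. It then uses two factors, timeliness and term frequency, to assign topic words weights suited to microblogs. Finally, it uses incremental fuzzy clustering to detect bursty topics. Experiments on real microblog data show that the solution effectively detects breaking events and hot topics, with satisfactory time efficiency.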
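The abstract gives neither the weighting formula nor the clustering details, so the following is only a simplified stand-in and not the thesis's algorithm. It assumes that timeliness is modeled as an exponential decay with post age, that similarity is cosine similarity over weighted keywords, and that fuzzy memberships are normalized similarities. The constants `HALF_LIFE_HOURS` and `NEW_TOPIC_THRESHOLD` and all function names are hypothetical.

```python
import math
from collections import Counter

HALF_LIFE_HOURS = 6.0       # hypothetical decay constant for timeliness
NEW_TOPIC_THRESHOLD = 0.3   # hypothetical similarity cutoff for opening a topic


def keyword_weights(tokens, age_hours):
    """Weight each keyword by its frequency, discounted by the post's age."""
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return {word: count * decay for word, count in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def add_post(topics, weights):
    """Assign one post incrementally; return its membership per topic index."""
    sims = [cosine(weights, centroid) for centroid in topics]
    if not sims or max(sims) < NEW_TOPIC_THRESHOLD:
        topics.append(dict(weights))          # unseen content starts a new topic
        return {len(topics) - 1: 1.0}
    total = sum(sims)
    memberships = {i: s / total for i, s in enumerate(sims) if s > 0}
    for i, m in memberships.items():          # update centroids by membership
        for word, w in weights.items():
            topics[i][word] = topics[i].get(word, 0.0) + m * w
    return memberships


topics = []
posts = [
    (["earthquake", "sichuan", "rescue"], 0.5),
    (["earthquake", "rescue", "donation"], 1.0),
    (["concert", "tickets"], 2.0),
]
results = [add_post(topics, keyword_weights(tokens, age)) for tokens, age in posts]
print(len(topics), results)
```

Each post is processed once and only the topic centroids are kept, which is what makes the scheme incremental. A post may contribute to several topics at once, in proportion to its membership in each.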
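     As an important application of Web usage mining, user behavior analysis aims to understand users' habits and interests and to assess their satisfaction with a product, so that the product and the user experience can be improved.
     For evaluating user satisfaction with search engines, this thesis describes an automated solution based on user behavior. The solution first describes the preprocessing of raw Web logs, that is, deriving concrete user actions from the log data and extracting features from them. It then proposes a recommendation technique based on the CURE algorithm to select samples, which are labeled manually. Finally, it uses dynamic modeling to build the user satisfaction model. Experiments on real search engine data show that the machine-learning-based automated evaluation approaches the level of manual evaluation and meets the requirements of practical use, and that the dynamic model, through multi-model construction, automatic updating, and feedback correction, effectively extends the model's life cycle and improves the continuity of learning.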
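The log preprocessing step, turning raw log records into per-user behavior features, can be illustrated with a minimal sketch. The log layout, the action names, and the four features below are assumptions made for illustration only, since the abstract does not list the features the thesis actually extracts.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, seconds_since_start, action, value)
LOG = [
    ("u1", 0, "query", "web mining"),
    ("u1", 4, "click", "result_1"),
    ("u1", 50, "click", "result_3"),
    ("u2", 10, "query", "cure algorthm"),
    ("u2", 18, "query", "cure algorithm"),
    ("u2", 25, "click", "result_1"),
]


def extract_features(log):
    """Group records by user and derive simple behavior features."""
    by_user = defaultdict(list)
    for user, ts, action, _value in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, action))
    features = {}
    for user, events in by_user.items():
        queries = [ts for ts, action in events if action == "query"]
        clicks = [ts for ts, action in events if action == "click"]
        last_query = queries[-1] if queries else None
        # clicks that happen at or after the user's last query
        later = [c for c in clicks if last_query is not None and c >= last_query]
        features[user] = {
            "queries": len(queries),
            "reformulations": max(len(queries) - 1, 0),
            "clicks": len(clicks),
            "time_to_first_click": (min(later) - last_query) if later else None,
        }
    return features


feats = extract_features(LOG)
print(feats)
```

Feature vectors of this kind could then be clustered to choose representative samples for manual labeling and fed to a satisfaction classifier, as the abstract outlines. Those later stages are not sketched here.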
With the arrival of the information age, computer and communication technologies are not only promoting the informatization of modern society but also influencing and even changing modern life. However, information technologies have also brought explosive growth in the amount of data, and people urgently need a technical solution for the effective use and processing of massive data. Under these big-data circumstances, data mining technologies arose. As a branch of data mining, Web mining specifically targets massive Internet data. Because the Internet changes quickly and Web mining technologies started late, this thesis mainly researches Web mining technologies and their application on the Internet.
     The thesis first introduces the research background, current state, technical difficulties, and future directions of Web mining, and explains data mining, machine learning, and other related concepts. It then focuses on the implementation process and application scenarios of Web mining, briefly introducing Web text preprocessing and two related applications: topic detection and tracking, and user behavior analysis.
     As one of the most important applications of Web content mining, topic detection and dynamic tracking aims to detect unknown topics and to track the latest development of known topics.
     For Internet news topic detection and tracking, the thesis proposes a solution based on a hybrid clustering algorithm. The solution applies the concept of contribution to build a hierarchical topic model with better efficiency, and the model adapts particularly well to Internet news. Experiments on real Internet data show that the solution achieves better accuracy and recall than the traditional K-Means method, and that the generated topic tree has a clearer hierarchy.
     For Chinese microblog topic detection and tracking, the thesis proposes a solution based on an incremental fuzzy clustering algorithm. The solution first introduces a set of anti-spam filtering rules based on the characteristics of microblog text. Then, considering the timeliness and frequency of keywords, it proposes a keyword weighting method. Finally, the core incremental fuzzy algorithm completes the detection process. Experiments on real microblog data show that the solution detects sudden incidents effectively, handles large volumes of data, and has low time complexity.
     As one of the most important applications of Web usage mining, user behavior analysis aims to understand users' habits and interests and to evaluate user satisfaction, in order to further improve the user experience.
     For evaluating user satisfaction with search engines, the thesis proposes an automatic solution based on user behavior analysis. The solution first introduces a log preprocessing method that includes user behavior transformation and feature extraction. It then proposes a sample recommendation method based on the CURE algorithm for labeling. Finally, it generates a dynamic model of user satisfaction. Experiments on real search engine data show that the proposed automatic evaluation method, based on machine learning, is close to the level of manual evaluation and meets the requirements of practical application. In addition, through multiple model construction, automatic updating, and error feedback, the dynamic modeling method effectively extends the model's life cycle and supports continuous learning.
