Design and Implementation of a Microblog-Oriented Data Collection and Analysis System
Abstract
With the rise of social networks, microblogs have become one of the most important places for people to communicate. On a microblog, anyone can voice an opinion and hear the opinions of others, which produces an enormous volume of information that is also highly fragmented. Addressing these characteristics, this thesis designs and implements a data collection and analysis system for microblogs. Building on the collected microblog data, the main work is to simulate and analyze the structure of the microblog network, determine the authority of microblog users, and mine hot microblogs and hot words. Specifically, the thesis makes the following contributions:
     Ⅰ. We studied current web crawler design and application techniques and, on that basis, designed and implemented a data collection system that can create different kinds of crawlers for different kinds of data, so researchers can crawl whichever kinds of microblog data their work requires. During crawling, multithreading substantially improves crawler efficiency, and a multi-AppKey reuse mechanism works around Sina's limit on API call frequency so that the crawler can run continuously without interruption. In practice, the system collected 3,000,000 microblog user relationships in three days of continuous crawling;
     Ⅱ. We analyzed in depth the characteristics of the user relationship network in microblogs and, combining this analysis with traditional network node evaluation algorithms, proposed two new concepts, "relative authority" and "user vitality", which we use to evaluate the importance of microblog users. Experiments show that the new algorithm improves on traditional algorithms by more than 20%, and that its results are more reasonable and closer to reality;
     Ⅲ. We proposed a method that computes the hotness of a microblog along two dimensions, reposts and comments, to keep the evaluation accurate. We also proposed using the number of layers in the propagation tree to correct user authority, which brings the evaluation closer to reality. After mining the hot microblogs, we extract hot words from them using text processing methods.
     In summary, the system is an integrated software package that combines microblog data collection, microblog user authority evaluation, and hot microblog content discovery. Its data is updated in real time: researchers can use it to query microblog data and user authority, and general users can use it to view the content that is currently popular on microblogs.
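The abstract does not describe how the AppKey reuse in item Ⅰ works internally, so the following is only a minimal sketch of one plausible reading: several application keys are rotated round-robin, and a key is skipped once it has used up its quota for the current time window. The class name `AppKeyPool`, its parameters, and the quota numbers are all hypothetical and are not taken from the thesis or from Sina's API documentation; the sketch makes no network calls. A lock guards the counters because the crawler is described as multithreaded.

```python
import threading
import time


class AppKeyPool:
    """Hands out AppKeys round-robin, skipping keys whose quota is used up.

    Hypothetical illustration only: the quota model (a fixed number of
    calls per time window for each key) is an assumption.
    """

    def __init__(self, app_keys, calls_per_window,
                 window_seconds=3600.0, clock=time.monotonic):
        if not app_keys:
            raise ValueError("at least one AppKey is required")
        self._keys = list(app_keys)
        self._limit = calls_per_window
        self._window = window_seconds
        self._clock = clock
        self._lock = threading.Lock()
        # Per key: [start time of the current window, calls made in it].
        self._usage = {key: [clock(), 0] for key in self._keys}
        self._next = 0  # index of the key to try first on the next call

    def acquire(self):
        """Return a key with quota left, or None if every key is exhausted."""
        with self._lock:
            now = self._clock()
            for offset in range(len(self._keys)):
                index = (self._next + offset) % len(self._keys)
                key = self._keys[index]
                usage = self._usage[key]
                # Start a fresh window for this key once the old one expires.
                if now - usage[0] >= self._window:
                    usage[0], usage[1] = now, 0
                if usage[1] < self._limit:
                    usage[1] += 1
                    self._next = (index + 1) % len(self._keys)
                    return key
            return None


if __name__ == "__main__":
    pool = AppKeyPool(["key-a", "key-b"], calls_per_window=2)
    # Two keys with two calls each allow four requests; the fifth gets None.
    print([pool.acquire() for _ in range(5)])
```

A crawler thread would call `acquire()` before each API request and pause when it returns `None`. With N keys the crawler can issue roughly N times as many calls per window as a single key allows, which is the effect the abstract attributes to the mechanism.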
