Research on Methods for Acquiring and Analyzing Microblog Data
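Abstract
Microblogging is a new form of social networking that developed rapidly after blogging and has gained considerable influence in the information media field. For traditional forms of social networks, data acquisition and analysis techniques are increasingly mature, but research on acquiring microblog network data and on the characteristics of microblog networks remains incomplete. This thesis studies the characteristics and role of microblogging and two techniques for acquiring microblog data. Taking Sina Weibo as an example, it designs and implements a microblog data acquisition and analysis system, and simulates and analyzes the characteristics of the microblog network. The main goal is to acquire microblog data, analyze it, and from that derive the characteristics of the microblog network. The specific work is as follows:
     1. Studied the techniques for acquiring data with web page crawlers, including the basic principles and workflows of general-purpose web crawlers, focused web crawlers, web page preprocessing, and text classification.
     2. Studied in depth the workflow for acquiring data through the microblog system's SDK. This technique obtains user data by calling the API provided by the microblog platform. API calls require authentication of the user's identity, and OAuth is currently the main authentication scheme in use. The method has simple steps and retrieves data with high accuracy and efficiency, and this thesis uses it to acquire microblog data.
     3. Improved the SDK program with a view to simplifying the authentication steps, raising acquisition efficiency, and avoiding repeated crawling. Repeated experiments showed that the improved program can acquire microblog data effectively over long periods. The data acquired this way serves as the dataset for studying microblog network characteristics.
     4. Designed the overall framework of the microblog data acquisition and analysis system, along with its database, functional modules, and interface, and implemented the basic functions of microblog data acquisition and analysis. The system supports more in-depth study of microblog networks.
     5. Analyzed the microblog network topology and the in-degree and out-degree distributions of its nodes. The analysis indicates that the microblog network has small-world, scale-free, and high-clustering properties.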
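The goal in item 3 of avoiding repeated crawling can be illustrated with a breadth-first crawl that remembers which user IDs it has already queued. This is only a minimal sketch under stated assumptions, not the thesis's improved SDK program: `crawl_users`, `fetch_followees`, and `max_users` are hypothetical names, and `fetch_followees` stands in for whatever authenticated API call returns the accounts a user follows.

```python
from collections import deque


def crawl_users(seed, fetch_followees, max_users=1000):
    """Breadth-first crawl that fetches each user ID at most once.

    fetch_followees is a hypothetical callable (user_id -> iterable of
    user IDs); in a real crawler it would wrap an authenticated API call.
    Returns the set of crawled user IDs and the list of follow edges seen.
    """
    seen = {seed}            # IDs already queued, so they are never re-fetched
    queue = deque([seed])
    edges = []               # (follower, followee) pairs
    while queue:
        user = queue.popleft()
        for followee in fetch_followees(user):
            edges.append((user, followee))
            # Queue a user only the first time it appears, up to max_users.
            if followee not in seen and len(seen) < max_users:
                seen.add(followee)
                queue.append(followee)
    return seen, edges


if __name__ == "__main__":
    # Toy in-memory follow graph used in place of a live API (assumption).
    toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
    users, follow_edges = crawl_users("a", lambda u: toy_graph.get(u, []))
    print(sorted(users), len(follow_edges))
```

Keeping the `seen` set separate from the queue means a user reached through several followers is still fetched only once, which matters when API calls are rate-limited.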
Microblogging has developed rapidly into a new form of social network following the blog, and it has great influence in the field of information media. For traditional forms of social networks, data acquisition and analysis technology has matured, but research on acquiring microblog network data and on the characteristics of microblog networks is still incomplete. This paper studies the characteristics and role of microblogging and two techniques for acquiring microblog data. Taking Sina Weibo as an example, a microblog data acquisition and analysis system was designed and implemented, and the network characteristics of the microblog network were simulated and analyzed. The main purpose is to analyze the characteristics of the microblog network on the basis of the acquired microblog data. The specific work is as follows:
     1. Studied the technologies for acquiring data with web page crawlers, including the basic principles and workflows of general web crawlers, focused crawlers, web page preprocessing, and text classification.
     2. Studied the workflow for acquiring data with the microblog system's SDK. This technique obtains user data by calling the API provided by the microblog platform, and calling the API requires user identity authentication. Currently the main authentication scheme is OAuth, which is described in detail in this paper. The method has simple steps and acquires microblog data accurately and efficiently.
     3. Improved the SDK program to simplify the authentication procedure, improve crawling efficiency, and avoid duplicate crawling. Repeated experiments showed that the improved program can acquire data continuously. The microblog data fetched by this method is the dataset used for studying microblog network characteristics.
     4. Designed the framework of the data acquisition and analysis system, together with its database, functional modules, and interface. The basic functions of microblog data acquisition and analysis were implemented. With this system, microblog networks can be studied in greater depth.
     5. Analyzed the microblog network topology and the in-degree and out-degree distributions. The conclusion is that the microblog network has small-world, scale-free, and high-clustering properties.
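The measurements behind item 5 can be sketched from a list of follow edges. The abstract does not say which tools or exact metrics the thesis used, so the following is an illustrative sketch with hypothetical function names. It computes the in-degree and out-degree distributions (a scale-free network shows a heavy-tailed, roughly power-law distribution), and the average clustering coefficient and average shortest path length on the undirected version of the graph (a small-world network combines high clustering with short paths).

```python
from collections import Counter, deque


def degree_distributions(edges):
    """Return (in_dist, out_dist), each mapping degree k -> number of nodes."""
    in_deg, out_deg, nodes = Counter(), Counter(), set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    # Counter returns 0 for nodes that never appear as a source or a target.
    return (Counter(in_deg[n] for n in nodes),
            Counter(out_deg[n] for n in nodes))


def undirected_neighbors(edges):
    """Build an undirected adjacency map, ignoring direction and self-loops."""
    nbrs = {}
    for src, dst in edges:
        if src != dst:
            nbrs.setdefault(src, set()).add(dst)
            nbrs.setdefault(dst, set()).add(src)
    return nbrs


def average_clustering(edges):
    """Mean local clustering coefficient over all nodes (undirected view)."""
    nbrs = undirected_neighbors(edges)
    total = 0.0
    for ns in nbrs.values():
        k = len(ns)
        if k < 2:
            continue  # a node with fewer than two neighbors contributes 0
        ns_list = list(ns)
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if ns_list[j] in nbrs[ns_list[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nbrs) if nbrs else 0.0


def average_path_length(edges):
    """Mean shortest-path length over reachable ordered pairs (undirected view)."""
    nbrs = undirected_neighbors(edges)
    total, pairs = 0, 0
    for start in nbrs:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # plain BFS from each start node
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]  # toy data
    print(degree_distributions(toy_edges))
    print(average_clustering(toy_edges), average_path_length(toy_edges))
```

The all-pairs BFS is quadratic in the worst case, so for a dataset of realistic size one would typically sample start nodes or use a graph library instead.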
