Research on Methods for Internet Public Opinion Information Mining
Abstract
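Keeping abreast of public opinion and actively guiding it is an important measure for maintaining social stability and the governing security of the ruling party. With the rapid development of the Internet, its user base keeps growing, and it has gradually become the main medium through which the public publishes, obtains, and passes on information. Internet-based public opinion mining has therefore attracted increasingly wide attention. Public opinion refers to the subjective response of social groups, within a given period and scope, to certain social phenomena and realities. As an effective means of mining public opinion information, Internet public opinion mining has become a research focus. However, existing research has exposed problems in handling massive volumes of information, in processing timeliness, and in early-warning accuracy, so breakthroughs are urgently needed in both the theoretical framework and the mining methods.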
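     This thesis studies Internet public opinion information mining. After clarifying public opinion and related concepts, it focuses on the architecture of Internet public opinion information mining and on the different mining methods used at the different stages in which Internet public opinion information forms. The main research contents are as follows: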
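     The architecture of Internet public opinion information mining is an important research topic. This thesis proposes a four-layer architecture consisting of an attribute layer, an information collection layer, a mining layer, and a disposal layer. The attribute layer covers the general regularities in where public opinion information exists, when it arises, how it evolves, and how it is transformed; the information collection layer covers the topic categories of interest, the collection space, the collected content, and related issues; the mining layer covers the mining methods adopted at different mining times and for different mining purposes; the disposal layer covers the means of evaluating, analyzing, and handling Internet public opinion information. This four-layer architecture is the foundation of Internet public opinion information mining.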
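     In the generation stage of Internet public opinion information, this thesis proposes a public opinion monitoring method for content-sensitive web pages, which monitors sensitive information and filters harmful information. For this method, the thesis introduces the concept of user interest focusing degree: a user's filtering requirement is treated as an informal continuous space centered on the things the user is interested in, with different user interest focusing degrees as radii, which expresses the user's requirement regarding filtering bias. Based on the user interest focusing degree, the thesis proposes a Chinese sensitive web page filtering algorithm. On the one hand, it combines URL analysis, topic-sentence analysis, and body-text analysis of the page structure; on the other hand, it quantifies the user interest focusing degree and introduces it into the training phase of the machine learning algorithm used for body-text analysis. Experimental results show that the algorithm effectively improves filtering precision and processing speed, addressing the problem of discovering public opinion in the generation stage.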
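     In the propagation stage of Internet public opinion information, this thesis proposes a public opinion monitoring method that mines the news topics read by most users, so as to learn promptly which hot topics the public cares about and to keep certain issues from escalating into emergencies. For monitoring frequently accessed topics, the thesis proposes Frequent Sketch (FS), a deterministic algorithm for monitoring frequent items in data streams based on a delta-encoded doubly linked list. FS has space complexity O(log(εn)/ε) and amortized per-item processing time O(1), and the global synopsis S it produces is an ε-deficient synopsis. Based on FS and FS-Win, its extension to windowed data streams, the thesis proposes an algorithm for mining frequently accessed topics on the Internet. Experimental analysis shows that the algorithm can mine users' frequently accessed topics in real time, addressing the monitoring problem in the propagation and reading stage.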
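     In the reposting stage of Internet public opinion information, this thesis proposes a public opinion measurement method that mines the news topics reposted by most web pages, in order to understand the current state of Internet public opinion topics and to detect hot public opinion events and the public's attitude toward them. For measuring the public opinion situation, the thesis proposes the NISAC index method. The NISAC index draws on the compilation methods of economic and social indexes and is compiled from the number of pages in Internet space that contain specific words. Data analysis shows that the NISAC index can monitor, assess, and give early warning of the social security situation reflected on the Internet, addressing the problem of keeping track of public opinion in the reposting stage.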
Guiding and leading public opinion is an important measure for maintaining social stability and the security of the ruling party. With the rapid expansion of information technology, the Internet has become the main platform for releasing, exchanging, and acquiring information, with a huge number of users. As an alternative to public opinion surveys, public opinion mining on the Internet is becoming more and more important. Public opinion is the aggregate of individual attitudes or beliefs held by the adult population in some area during some period. As a method of collecting public opinion, public opinion mining on the Internet has become a research focus. However, existing public opinion mining techniques have problems with huge-volume processing, high-speed mining, and high-accuracy early warning, which call for improvements in both the mining architecture and the mining algorithms.
     This thesis focuses on Internet public opinion mining techniques. After clarifying the notion of public opinion and related concepts, it mainly studies the architecture of Internet public opinion mining and the mining algorithms for the different phases in which public opinion information forms. The main contents are as follows:
     Research on the architecture of Internet public opinion information mining is important. This thesis proposes a four-level architecture consisting of the Attribute Level, Information Collecting Level, Mining Level, and Disposing Level. The Attribute Level covers the basic rules of collecting, catching, tracking, and leading public opinion; the Information Collecting Level covers what is collected, where to collect it, and how to collect it; the Mining Level covers a three-phase model of public opinion formation (Releasing, Acquiring, and Citation) and the mining algorithms for each phase; the Disposing Level covers methods for evaluating, analyzing, and disposing of public opinion. The four-level architecture is the basis of Internet public opinion mining.
     During the Releasing phase, we monitor content-suspicious pages in order to filter harmful information and monitor suspicious information. This thesis proposes the notion of User Interest Focusing Degree (UIFD), which measures user interest by how the set of interests is constituted. User interest is thus regarded as an informal continuum with different UIFDs around the objects the user is interested in. This thesis implements a UIFD-based approach to filtering Chinese web pages for public opinion, which combines structural analysis of a page's URL, title, and body with a machine learning algorithm whose training procedure incorporates the UIFD. The UIFD-based filtering algorithm achieves high efficiency in filtering Chinese content-suspicious web pages.
     During the Acquiring phase, we maintain an up-to-date list of frequently accessed news topics on the Internet, in order to identify hot topics in time and keep them from turning into emergencies. This thesis puts forward Frequent Sketch (FS), an algorithm for maintaining frequent items. FS keeps an ε-deficient synopsis by maintaining a sorted doubly-linked list of groups that stores the frequency deltas between adjacent groups and by pruning the counters periodically. Compared with existing algorithms, FS performs better in accuracy, processing speed, and memory use. An approach to mining frequently accessed news topics, built on the FS-Win algorithm (FS extended to windowed streams) and a topic similarity algorithm, can obtain frequently accessed news topics in a timely manner.
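The abstract does not give enough detail to reproduce the FS data structure itself. As a rough illustration of what a deficient synopsis with periodic pruning looks like, the sketch below uses Lossy Counting (Manku and Motwani), a different and simpler published algorithm, not FS. It stores counters in a plain dictionary rather than a delta-encoded linked list; the class name `LossyCounter` and its method names are hypothetical.

```python
import math


class LossyCounter:
    """Lossy Counting: an epsilon-deficient synopsis of item frequencies.

    Illustrative stand-in only; this is NOT the thesis's FS algorithm.
    Each estimated count undercounts the true count by at most epsilon * n.
    """

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be in (0, 1)")
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # items per bucket
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        entry = self.entries.get(item)
        if entry is None:
            # A new item may have been seen and pruned up to bucket - 1 times.
            self.entries[item] = [1, bucket - 1]
        else:
            entry[0] += 1
        if self.n % self.width == 0:
            # Periodic pruning at each bucket boundary.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Items whose estimated count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }


if __name__ == "__main__":
    counter = LossyCounter(epsilon=0.1)
    for i in range(20):
        # Hypothetical topic IDs: two popular topics plus one-off topics.
        for topic in ("a", "b", "a", f"rare-{i}", "a"):
            counter.add(topic)
    print(counter.frequent(support=0.2))
```

In this toy stream the one-off topics are discarded at each bucket boundary, so memory stays small while the two popular topics survive; FS pursues the same kind of guarantee with a different structure and an O(1) amortized update.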
     During the Citation phase, we measure how widely news topics spread, to help users understand the current state of public opinion dissemination and identify hot topics and people's attitudes toward them. This thesis introduces a measurement model of Internet public opinion: the NISAC indexes. Following the compilation methods of economic and social indexes, the NISAC indexes are compiled from the number of web pages that contain certain keywords. The NISAC indexes help describe the public opinion situation quantitatively and show how widely a hot topic has spread. Emergencies with an abnormal degree of spread can be detected by monitoring the indexes of keywords contained in pages related to the events. In short, the NISAC indexes are used to monitor, evaluate, and give early warning of the social security situation reflected on the Internet.
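The abstract does not state the NISAC formula. The sketch below only illustrates the general idea of a base-period index built from keyword page counts, in the style of a simple economic index, and a naive way to flag abnormal spread. The function names, the base value of 100, the trailing window, and the spike ratio are all assumptions made for illustration, and the page counts are made-up sample numbers.

```python
from statistics import mean


def base_period_index(page_counts, base_count):
    """Scale each period's page count so the base period equals 100.

    Hypothetical formula: index_t = 100 * count_t / count_base.
    """
    if base_count <= 0:
        raise ValueError("base_count must be positive")
    return [100.0 * count / base_count for count in page_counts]


def flag_spikes(index_values, window=3, ratio=2.0):
    """Return positions whose index exceeds `ratio` times the trailing mean.

    The trailing mean is taken over the previous `window` periods.
    """
    flagged = []
    for t in range(window, len(index_values)):
        baseline = mean(index_values[t - window:t])
        if baseline > 0 and index_values[t] > ratio * baseline:
            flagged.append(t)
    return flagged


if __name__ == "__main__":
    # Made-up daily counts of pages containing one keyword.
    counts = [200, 220, 210, 205, 900, 950]
    index = base_period_index(counts, base_count=counts[0])
    print(index)
    print(flag_spikes(index))
```

A real index would presumably combine several keywords with weights and a more careful baseline; the point here is only that page counts can be normalized against a base period and watched for abnormal jumps.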