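Research on Web Community Marketing Based on User Personality Mining

English title: Web Community Marketing Research Based on User Characteristic and Interest Mining
Author: Yu Wei (余伟)
Degree level: Doctoral
Discipline: Computer Application Technology
Chinese keywords (translated): community ranking ; fuzzy search ; user characteristic mining ; user interest mining ; community user interaction ; time consistency ; time awareness
English keywords: Community order ; Fuzzy search ; User characteristic mining ; User interest mining ; Community user interaction ; Time consistency ; Time perception
Degree year: 2011
Advisor: Li Shijun (李石君)
Discipline code: 081203
Degree-granting institution: Wuhan University (武汉大学)
Thesis submission date: 2011-05-01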

Abstract

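With the rapid growth of Web communities, community-based online marketing is drawing increasing attention from businesses. Survey statistics show that by the end of 2010 the number of community users in China had reached 294 million, or 70.3% of the country's Internet users, and that China's Internet advertising market was worth 32.12 billion yuan in 2010. Consumer purchasing behavior, however, is changing as society develops: traditional Internet advertising no longer wins people's trust, and users tend to make decisions by searching the Internet for relevant information and reviews. Community marketing has emerged from this shift in online promotion, and word-of-mouth communication in Web communities plays an extremely important role in consumer decisions. Consumers trust information from other people far more than they trust advertising. Web community marketing has therefore become a low-cost, high-efficiency way for brands to spread information.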
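     Web community marketing has a short history and has not yet developed an effective theory or a unified methodology. Its core is interaction and precision marketing. This thesis studies four questions: how to choose suitable communities for community marketing; how to let users retrieve suitable topics within a community; how to mine the real characteristic attributes and interests of virtual users; and how to detect outdated topics in a community. It solves several basic technical problems of Web community marketing and establishes a basic theory. The main research contents are as follows: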
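     (1) For choosing Web communities, a Web community ranking theory based on data quality assessment and sampling is proposed. By defining data quality, it gives a quantitative standard for judging how good a community data source is, so that the standard can be measured and extended; this addresses the problem that the criteria of traditional ranking algorithms do not fully reflect real evaluations. With a suitable sampling method, samples are drawn at random from the huge set of community topics so that they reflect the characteristics of the whole, which addresses the difficulty of measuring a very large number of topics.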
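     (2) For fuzzy search over community resources, a new fuzzy search algorithm based on the Trie tree is proposed. When a user remembers only part of a word, entering that part is enough for the system to find the desired results. The system is also interactive: with each letter the user types, it suggests possible target words in real time. To be efficient enough not to hurt user satisfaction, the thesis proposes a Trie-based algorithm. Experiments show that the algorithm implements the system efficiently.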
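     (3) For mining users' characteristic attributes and interests, a method based on ontology semantic analysis is proposed. It builds behavior and characteristic models of users, an attribute set and an inference rule set for characteristic attributes, and an uncertainty inference method, and then infers users' characteristic attributes and interests from their behavior and posts. Experimental results show that the method has good scalability and accuracy, and that it solves the problem of precisely locating targets in Web community marketing.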
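     (4) To mine user characteristic attributes and interests more efficiently, a user characteristic mining method based on interaction relationships is proposed. Through statistics and analysis of a large amount of community user data, the thesis studies the interaction behavior and interest similarity among users in Web communities, establishes a theoretical evaluation method based on hypothesis testing, and demonstrates that the sociological view that "close friends share more similar interests" also holds in virtual Web communities. On this basis it constructs an algorithm for quickly mining sets of users with similar interests in Web communities; confidence measurement and algorithm verification show that the algorithm is effective for this task.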
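     (5) For the problem of outdated topics in communities, methods for modeling, measuring, reasoning about and discovering the time consistency of topic pages are proposed. The time consistency of a web page means that the time stated in the page agrees with the actual time; it is an important indicator of the quality of online information and bears on the timeliness and accuracy of page content. Many highly time-sensitive pages contain time inconsistencies, which seriously affect users' understanding of page content and their decisions. The thesis first models the time dimension of topic pages, including time-sensitivity analysis of page information, time-series-based page classification and extraction of the time dimension of pages. It then measures and reasons about page time consistency, including classification of time inconsistencies of page events, modeling of time inconsistencies of page events, and discovery of inconsistencies in topic pages. The method can automatically filter out time-inconsistent topics in Web communities and improve the user experience.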
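     This research provides theoretical and technical support for Web community marketing: it addresses how to screen and rank the many Web communities, implements a fuzzy query method for community topics, addresses how to mine users' characteristics and attributes precisely, and implements methods for modeling and discovering outdated topic information in online communities.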
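
The interactive lookup described in item (2) can be illustrated with a small sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption: the class names `TrieNode` and `PartialWordIndex`, the method `suggest`, and the choice to index every suffix of each word so that any remembered fragment becomes a prefix lookup in the trie.

```python
class TrieNode:
    """One trie node; `words` holds every indexed word reachable through it."""

    def __init__(self):
        self.children = {}
        self.words = set()


class PartialWordIndex:
    """Hypothetical index: a trie over all suffixes of each word.

    Any fragment of a word is a prefix of one of its suffixes, so a
    fragment lookup is a plain walk down the trie.
    """

    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            for start in range(len(word)):
                self._insert(word[start:], word)

    def _insert(self, suffix, word):
        node = self.root
        for ch in suffix:
            node = node.children.setdefault(ch, TrieNode())
            node.words.add(word)

    def suggest(self, fragment, limit=10):
        """Return up to `limit` words containing `fragment`, in sorted order."""
        node = self.root
        for ch in fragment:
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(node.words)[:limit]


if __name__ == "__main__":
    index = PartialWordIndex(["marketing", "market", "community", "mining"])
    typed = ""
    for letter in "ark":  # simulate a user typing one letter at a time
        typed += letter
        print(typed, index.suggest(typed))
```

Indexing all suffixes trades memory for lookup speed: each keystroke costs only one step down the trie, but the index grows roughly with the square of the word length. A production system would likely need a more compact structure.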

English Abstract

With the rapid development of Web communities, Web community marketing is receiving more and more attention from businesses. Survey data show that by the end of 2010 the number of community users had reached 294 million, accounting for 70.3% of all Internet users in China, and that the national Internet advertising market reached 32.12 billion yuan in 2010. However, as purchasing behavior changes in a developing society, people tend to search the Internet for related information when making decisions instead of relying on traditional Internet advertising. As a product of network marketing and promotion, community marketing plays a very important role in consumer decision-making through word-of-mouth communication. Consumers trust information from other people more than they trust advertisers. Web community marketing has therefore become a low-cost, high-efficiency way of promoting information.
     Because it has developed for only a short time, Web community marketing has not yet formed an effective theory or a unified approach. Since the core of Web community marketing is interaction and precision marketing, this thesis studies four questions: how to choose appropriate communities for community marketing; how to let users find appropriate topics in a community; how to mine the true characteristics and interests of virtual users; and how to find outdated topics in a community. On this basis, the thesis solves some basic technical problems of Web community marketing and builds a basic theory. The main contents are as follows:
     (1) On how to choose Web communities, this thesis proposes a Web community ranking theory based on data quality assessment and sampling methods. Establishing data quality gives a quantitative criterion for evaluating Web community data sources, so that the evaluation criterion can be measured and extended. This approach addresses the problem that the criteria of traditional ranking algorithms cannot completely reflect real evaluations. An appropriate sampling method, which randomly draws samples from the large set of community topics so that the samples reflect the overall characteristics of the community, addresses the difficulty of measuring a huge number of topics.
     (2) For fuzzy search of community resources, this thesis proposes a new fuzzy search algorithm based on the Trie tree. When a user remembers only part of a word, the user just needs to enter the remembered part and the system can still find the desired results. The system is also interactive: each time the user enters a letter, the system promptly suggests possible target words. Experiments show that the algorithm implements the system efficiently.
     (3) For mining users' characteristics and interests, this thesis presents a method based on ontology semantic analysis. It builds a behavior model and a characteristic model of users, establishes an attribute set and an inference rule set for characteristic attributes, and creates an uncertainty inference method, in order to infer a user's characteristics and interests from the user's behavior and posts. Experimental results show that the method has good scalability and accuracy and solves the problem of precisely locating targets in Web community marketing.
     (4) To improve the efficiency of mining user characteristic attributes and interests, this thesis puts forward a user characteristic mining method based on interaction relationships. Based on statistics and analysis of a large amount of data, it presents an evaluation method based on hypothesis testing and demonstrates that the sociological view that "intimate friends have more similar interests" also applies in virtual Web communities. Building on these statistical regularities, the thesis then constructs a user group discovery algorithm. The final results show that this is a fast and effective method for mining groups of users who share similar interests.
     (5) For the problem of outdated topics in Web communities, this thesis presents methods for modeling, measuring, reasoning about and discovering the time consistency of topic pages in Web communities. The time consistency of a web page means that the time the page refers to matches the actual time; it bears on the timeliness and accuracy of content and is an important indicator for evaluating the quality of online information. Many time-sensitive pages contain time inconsistencies, which seriously affect users' understanding of content and their decision-making. The thesis first constructs a model of the time dimension of topic pages, including time-sensitivity analysis of web information, time-series-based classification of web pages and time dimension extraction from web pages. It then measures and reasons about web time consistency, including classification of time inconsistencies of web events, modeling of time inconsistencies of web events and discovery of time inconsistencies in topic pages. The method can automatically filter time-inconsistent topics in Web communities to improve the user experience.
     This study provides theoretical and technical support for Web community marketing: it addresses how to identify and rank Web communities among many candidates, implements a fuzzy query method for community topics, addresses how to precisely mine users' characteristics and attributes, and provides methods for modeling and discovering outdated topic information in Web communities.
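
The hypothesis-testing step in item (4) can be made concrete with a toy example. The abstract does not say which test or similarity measure the thesis uses, so this sketch substitutes a one-sided permutation test on Jaccard similarity; the function names `jaccard` and `permutation_test`, the parameters, and the sample data are all hypothetical. The null hypothesis is that interacting user pairs are no more similar in interests than other pairs.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two interest sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def permutation_test(interests, interacting_pairs, rounds=2000, seed=0):
    """Return (observed gap, p-value) for 'interacting pairs are more similar'.

    The gap is mean similarity of interacting pairs minus mean similarity
    of all other pairs; the p-value comes from shuffling the pair labels.
    """
    users = sorted(interests)
    pairs = list(combinations(users, 2))
    sims = [jaccard(interests[u], interests[v]) for u, v in pairs]
    linked = {frozenset(p) for p in interacting_pairs}
    labels = [frozenset(p) in linked for p in pairs]
    if not any(labels) or all(labels):
        raise ValueError("need both interacting and non-interacting pairs")

    def gap(flags):
        inside = [s for s, f in zip(sims, flags) if f]
        outside = [s for s, f in zip(sims, flags) if not f]
        return sum(inside) / len(inside) - sum(outside) / len(outside)

    observed = gap(labels)
    rng = random.Random(seed)
    hits = 0
    for _ in range(rounds):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if gap(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (rounds + 1)


if __name__ == "__main__":
    toy_interests = {
        "a": {"cars", "phones"},
        "b": {"cars", "phones"},
        "c": {"books"},
        "d": {"books", "travel"},
    }
    print(permutation_test(toy_interests, [("a", "b"), ("c", "d")]))
```

Shuffling pair labels treats user pairs as exchangeable even though pairs share users, which is a simplification. A small p-value on real community data would support the claim that interaction and interest similarity go together.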

