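Research on Online Retailers' Reputation Dimensions Based on Text Clustering

Original title: 基于文本聚类的在线零售商信誉维度研究
Author: Chen Huofan (陈获帆)
Degree level: Master's
Discipline: Management Science and Engineering
Keywords: reputation dimension; text clustering; text comments; text-number conversion
Degree year: 2009
Advisor: Zhao Xuefeng (赵学锋)
Discipline code: 1201
Degree-granting institution: Huazhong University of Science and Technology (华中科技大学)
Submission date: 2009-05-01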

Abstract

With the rapid development of retail e-commerce, research on online reputation management systems has attracted growing attention from scholars. Some simple online reputation management systems have already been applied successfully on many C2C e-commerce sites and on some B2C shopping-agent sites. However, the dimension design of current systems is still incomplete. To address this problem, this study applies text mining to customer comments in order to investigate the reputation dimensions of B2C online retailers, with the aim of optimizing the reputation dimensions used by mainstream B2C e-commerce sites.

The study applies text clustering to customers' text comments and consists of two main parts. The first part is text conversion, which has three steps: (1) generation of the text collection; (2) generation of the feature term collection; (3) generation and optimization of the VSM (vector space model) numerical matrix. These three steps convert a large number of complex documents into a numerical matrix that a computer can process directly, which lays the foundation for cluster analysis. The second and third steps are the focus of the research and cover the feature selection algorithm and the choice of weighting function. The second part is cluster analysis and application, which has two steps: (1) cluster analysis of the generated data matrix to obtain clustering results; (2) evaluation and testing of the clustering results and their application to related fields. The clustering combines hierarchical clustering with k-means clustering: hierarchical clustering serves as the main clustering method, and k-means is used for iterative verification. Once the clustering results are obtained, knowledge is extracted from them and applied to related fields.

The study shows that applying cluster analysis in e-commerce is feasible and of considerable significance. It offers a new method for establishing reputation dimensions, one with a reasonable scientific basis. Beyond establishing the reputation dimensions of online retailers, the clustering process can also reveal the typical characteristics of different customer groups and different retailer groups, which supports the design of differentiated customer service plans. As statistical techniques become more tightly integrated with computer and artificial intelligence technology, new flexible, application-specific cluster analysis techniques and software are expected to keep emerging, and the breadth and depth of the problems they can solve will increase further.
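The two-part pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state which weighting function, similarity measure, or linkage rule was finally chosen, so TF-IDF weights, cosine similarity, and average linkage are assumptions used here for illustration. The toy comments are hypothetical and assumed to be already segmented into words, and all function names (`tfidf_matrix`, `hierarchical`, `kmeans_check`) are invented for this example.

```python
import math
from collections import Counter

# Toy, pre-tokenized customer comments (hypothetical data, not from the thesis).
comments = [
    ["fast", "delivery", "good", "packaging"],
    ["delivery", "fast", "packaging", "intact"],
    ["price", "cheap", "discount"],
    ["cheap", "price", "good", "discount"],
]

def tfidf_matrix(docs):
    """Part 1: build a VSM matrix; entry [i][j] is the TF-IDF weight of term j in doc i."""
    terms = sorted({t for d in docs for t in d})      # feature term collection
    df = Counter(t for d in docs for t in set(d))     # document frequency per term
    n = len(docs)
    rows = []
    for d in docs:
        tf = Counter(d)
        row = [tf[t] / len(d) * math.log(n / df[t]) for t in terms]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])          # scale each row to unit length
    return terms, rows

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hierarchical(rows, k):
    """Part 2a: average-link agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(cosine(rows[i], rows[j])
                            for i in clusters[a] for j in clusters[b])
                sim = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        merged = clusters.pop(b)                      # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters

def centroid(rows, members):
    """Mean vector of the rows listed in members."""
    return [sum(rows[i][j] for i in members) / len(members)
            for j in range(len(rows[0]))]

def kmeans_check(rows, clusters, iters=10):
    """Part 2b: k-means seeded with the hierarchical clusters; returns a label per doc."""
    centers = [centroid(rows, c) for c in clusters]
    labels = None
    for _ in range(iters):
        new = [max(range(len(centers)), key=lambda c: cosine(r, centers[c]))
               for r in rows]
        if new == labels:                             # assignments stable: converged
            break
        labels = new
        for c in range(len(centers)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = centroid(rows, members)
    return labels

terms, rows = tfidf_matrix(comments)
clusters = hierarchical(rows, k=2)
labels = kmeans_check(rows, clusters)
print(clusters, labels)
```

If the k-means labels agree with the hierarchical grouping, the partition is stable under reassignment, which is one simple reading of the "iterative verification" role the abstract gives to k-means. Real comments would first need word segmentation and feature selection, which this sketch skips.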
