The Best Answers? Think Twice: Identifying Commercial Campaigns in the CQA Forums

Authors: Cheng Chen; Kui Wu; Venkatesh Srinivasan…
Keywords: CQA forum; anomaly detection; paid poster; online detection system
Journal: Journal of Computer Science and Technology
Publication year: 2015
Publication date: July 2015
Volume: 30
Issue: 4
Pages: 810-828
Full-text size: 1,114 KB
Author affiliations: Cheng Chen (1)
Kui Wu (1)
Venkatesh Srinivasan (1)
R. Kesav Bharadwaj (1)

1. Department of Computer Science, University of Victoria, Victoria, V8P 5C2, Canada
Journal category: Computer Science
Journal topics: Computer Science, general
Software Engineering
Theory of Computation
Data Structures, Cryptology and Information Theory
Artificial Intelligence and Robotics
Information Systems Applications and The Internet
Chinese Library of Science
Publisher: Springer Boston
ISSN: 1860-4749

Abstract

In an emerging trend, more and more Internet users search for information on Community Question and Answer (CQA) websites, as interactive communication on such websites gives users a rare feeling of trust. More often than not, end users look for instant help when they browse CQA websites for the best answers. Hence, it is imperative that they be warned of any potential commercial campaigns hidden behind the answers. Existing research focuses more on the quality of answers and does not meet this need. Textual similarity between questions and answers is widely used in previous research; however, this feature is no longer effective against commercial paid posters. More context information, such as writing templates and a user's reputation track, needs to be combined into a new model to detect potential campaign answers. In this paper, we develop a system that automatically analyzes the hidden patterns of commercial spam and raises alarms instantaneously to end users whenever a potential commercial campaign is detected. Our detection method integrates semantic analysis and posters' track records, and utilizes the special features of CQA websites, which differ largely from those of other types of forums such as microblogs or news reports. Our system is adaptive and accommodates new evidence uncovered by the detection algorithms over time. Validated with real-world trace data from a popular Chinese CQA website over a period of three months, our system shows great potential for adaptive detection of CQA spam.
