Research on Online Communities Oriented User Information Mining and Its Applications (面向在线社区的用户信息挖掘及应用研究)

English title: Research on Online Communities Oriented User Information Mining and Its Applications
Author: 刘璟
Degree level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords: 在线社区 ; 用户信息 ; 用户名抽取 ; 用户链指 ; 用户专业水平估计 ; 众包任务难度估计
English keywords: online communities ; user information ; username extraction ; user linking ; user expertise estimation ; crowdsourcing task difficulty estimation
Degree year: 2014
Advisors: 洪小文 ; 刘挺
Discipline code: 0812
Degree-granting institution: Harbin Institute of Technology (哈尔滨工业大学)

Abstract

In recent years, with the development of various online communities, a huge amount of user information has accumulated on the web, including user account information (e.g., usernames), user demographic information (e.g., gender, age and location), user social relations (e.g., friend and reply relations) and user-generated content. On the one hand, this information can help enterprises better understand their clients and target new clients more accurately. On the other hand, it can be used to build better personalized information systems. Additionally, it can help sociologists better understand human behavior. Hence, technologies for mining user information from online communities are key to building new social applications and to understanding human behavior.
     However, mining user information from online communities faces several challenges: the unstructured data challenge, the cross-community challenge and the no-measurement challenge. The unstructured data challenge is that user information in online communities is presented on web pages in an unstructured way; the diversity and dynamics of web page layouts make it hard to automatically extract the user information as structured data. The cross-community challenge is that different aspects of a user's information are scattered across different online communities, which makes it difficult to understand a user fully. The no-measurement challenge is that user characteristics (e.g., influence and expertise levels) have no explicit measurement, which makes it difficult to apply the user information directly. This paper mainly addresses these three challenges and explores applications of user information. Specifically, its main contents can be summarized as follows:
     (1) To address the unstructured data challenge, this paper studies the problem of extracting usernames from web pages containing user-generated content. It proposes a weakly supervised learning approach that uses a small number of statistically rare usernames to automatically collect and label large-scale training data, removing the need for the manually labeled training data that previous supervised work requires. The approach relies only on single-page features, whereas previous work depends on multi-page features. The experimental results show that the proposed approach significantly outperforms the state-of-the-art approach with single-page features and performs comparably to the state-of-the-art approach with multi-page features.
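The automatic labeling step in (1) can be illustrated with a minimal sketch. It assumes that a seed username is rare enough that any text node matching it exactly is a username occurrence; the function name `auto_label`, the seed strings and the page nodes are all hypothetical, and the thesis's actual features and learner are not shown.

```python
from typing import Iterable, List, Set, Tuple


def auto_label(nodes: Iterable[str], seed_usernames: Set[str]) -> List[Tuple[str, int]]:
    """Label each text node 1 if it exactly matches a seed username, else 0."""
    labeled = []
    for text in nodes:
        token = text.strip()
        labeled.append((token, 1 if token in seed_usernames else 0))
    return labeled


# Hypothetical seeds: rare strings unlikely to occur on a page except as a username.
seeds = {"xq7_zebrafox", "kl0ud9walker"}
# Hypothetical text nodes taken from one forum page.
page_nodes = ["Posted by", " xq7_zebrafox ", "Great trip report!", "Reply"]

labeled = auto_label(page_nodes, seeds)
for token, label in labeled:
    print(label, token)
```

The labeled nodes could then serve as training examples for a classifier over single-page features, with no manual annotation involved.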
     (2) To address the cross-community challenge, this paper studies the problem of linking users across multiple online communities. We divide the problem into two tasks: (a) alias disambiguation, which is to differentiate users who share the same username; and (b) alias conflation, which is to find all the different usernames used by one natural person. This paper focuses on the alias-disambiguation task. We start by quantitatively analyzing the importance of alias disambiguation through a survey and an experimental analysis of an About.me dataset. To the best of our knowledge, this is the first study to quantify how people use usernames. We then present an approach that automatically creates a training data set from the n-gram probability of a username, and we verify the soundness of the assumption behind this approach on a Yahoo! Answers dataset. This approach removes the need for the manually labeled training data that previous work requires. Additionally, we verify the effectiveness of the classifiers trained on the automatically generated training data.
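As a rough illustration of scoring a username by its n-gram probability, the sketch below trains a character bigram model with add-one smoothing on a toy list of names. The model order, the smoothing, the toy corpus and the function names are assumptions made for this example; the abstract does not specify them, nor how the scores are turned into training labels.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character bigrams and their left contexts; ^ and $ mark word boundaries."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for word in corpus:
        chars = ["^"] + list(word) + ["$"]
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, vocab


def log_prob(name, model):
    """Add-one smoothed log-probability of a username under the bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen characters
    chars = ["^"] + list(name) + ["$"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(chars, chars[1:])
    )


# Hypothetical toy corpus standing in for a large collection of names.
model = train_bigram(["john", "johnny", "anna", "joanna", "jon"])
common_score = log_prob("john", model)
rare_score = log_prob("xq7zk", model)
print(common_score, rare_score)
```

A username with a very low score is unusual under the model, while a high-scoring one resembles many other names. How such scores are thresholded to select training pairs is left open here.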
     (3) To address the no-measurement challenge, this paper studies the problem of estimating user expertise scores as an example of measuring user characteristics. Specifically, it considers the problem of estimating the relative expertise scores of users in community question answering services. It proposes a competition-based method that casts the estimation of user expertise scores as the problem of estimating the relative skill levels of players from the outcomes of a series of two-player games. Compared with link-analysis-based approaches, the proposed method models the question-answer relation and answer quality information simultaneously in a unified way. Compared with answer-quality-based approaches, it considers the difficulty level of each competition rather than weighting all questions equally. The experimental results show that the proposed competition-based model significantly outperforms the link-analysis-based and answer-quality-based approaches on the dataset of active users.
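The idea of reading a question thread as a series of two-player games can be sketched as follows. Two things here are assumptions rather than statements from the abstract: an Elo-style update is used as a stand-in rating model, and each thread is taken to yield games in which the best answerer beats the asker and every other answerer. The user names and thread data are hypothetical.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Elo expected score of a player rated r_a against a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(ratings, winner, loser, k=32.0):
    """Apply one Elo update for a game that `winner` won against `loser`."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


def estimate_expertise(threads, k=32.0):
    """Each thread is (asker, best_answerer, other_answerers)."""
    ratings = defaultdict(lambda: 1500.0)
    for asker, best, others in threads:
        update(ratings, best, asker, k)  # assumed game: best answerer beats asker
        for other in others:
            update(ratings, best, other, k)  # assumed game: best beats other answerers
    return dict(ratings)


# Hypothetical question threads.
threads = [
    ("alice", "bob", ["carol"]),
    ("carol", "bob", []),
    ("alice", "carol", []),
]
ratings = estimate_expertise(threads)
print(sorted(ratings.items(), key=lambda item: -item[1]))
```

In this stand-in model, beating a higher-rated opponent yields a larger gain than beating a lower-rated one, which is one simple way to account for the difficulty of each competition.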
     (4) Taking an application viewpoint, this paper studies the problem of estimating the difficulty levels of crowdsourcing tasks based on structured, linked and measured user information. Specifically, it studies the problem of estimating question (i.e., crowdsourcing task) difficulty levels in community question answering services. It proposes a user-competition-based approach that estimates question difficulty levels by leveraging the measurement of user expertise levels. This measurement helps address a limitation of previous work, which cannot deal with observations that are partial orders. The experimental results show the effectiveness of the proposed model. Finally, this paper studies the problem of calibrating question difficulty scores across communities by leveraging linked user information.
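One way to picture how measured user expertise can guide difficulty estimation is to treat the question as a pseudo-player. The sketch below assumes that the question "beats" its asker and "loses" to its best answerer, so the only observation is the partial order asker < question < best answerer. That modeling choice, the Elo-style update and all numbers are assumptions for illustration, not details taken from the abstract.

```python
def expected_score(r_a, r_b):
    """Elo expected score of a player rated r_a against a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def estimate_difficulty(asker_skill, best_answerer_skill, k=32.0, rounds=200, init=1500.0):
    """Place a question on the users' skill scale while holding user skills fixed."""
    difficulty = init
    for _ in range(rounds):
        # Assumed game 1: the question wins against the asker.
        difficulty += k * (1.0 - expected_score(difficulty, asker_skill))
        # Assumed game 2: the question loses to the best answerer.
        difficulty -= k * expected_score(difficulty, best_answerer_skill)
    return difficulty


# Hypothetical expertise scores, e.g. produced by a prior expertise estimation step.
easy = estimate_difficulty(asker_skill=1300.0, best_answerer_skill=1500.0)
hard = estimate_difficulty(asker_skill=1600.0, best_answerer_skill=1900.0)
print(round(easy), round(hard))
```

Because user skills stay fixed, repeated updates pull the question's score to a point between the asker's and the best answerer's skill, so questions asked and answered by stronger users end up rated as harder.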
     In conclusion, this paper not only addresses the unstructured data, cross-community and no-measurement challenges, but also studies an application of structured, linked and measured user information, namely estimating the difficulty levels of crowdsourcing tasks. This research has achieved some preliminary results, and we hope they will be helpful to other researchers in this area. We believe that the development of user information mining technologies will help build new social applications and support social science research.

引文

1http://www.facebook.com
    2http://www.tripadvisor.com/ForumHome/
    3http://www.frommers.com/community/
    4http://www.gogobot.com/
    5http://www.weibo.com/
    6http://www.douban.com/
    7http://www.mturk.com
    8http://www.stackoverflow.com/
    9http://www.guru.com
    10http://www.elance.com
    11http://www.topcoder.com
    12http://www.zhubajjie.com
    13http://www.taskcn.com/
    1http://www.google.com/
    1http://www.myspace.com/
    2http://www.facebook.com/
    3http://about.me/
    4http://answers.yahoo.com
    5http://www.weibo.com/
    6http://www.renren.com/
    7http://www.newsmth.com/
    8http://www.tianya.cn/
    9http://blog.sina.com.cn/
    10http://qzone.qq.com/
    11http://zhidao.baidu.com/
    12http://wenwen.soso.com/
    13http://twitter.com
    14http://linkedin.com
    15http://flickr.com
    16http://answers.yahoo.com/
    17http://code.google.com/apis/maps/
    1http://zhidao.baidu.com/
    2http://answers.yahoo.com
    3http://help.yahoo.com/l/us/yahoo/answers/network/contributor.html
    4www.turbotax.com
    1http://www.mturk.com
    2http://www.stackoverflow.com/
    3http://www.topcoder.com
    4http://www.zhubajjie.com
    5http://stackoverflow.com
    6http://blog.stackoverflow.com/category/cc-wiki-dump/
    7http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
    8http://mathoverflow.net/
    9http://dumps.mathoverflow.net/
    10http://math.stackexchange.com
    11http://mathoverflow.net/faq
    12http://area51.stackexchange.com/proposals/3355/mathematics
    [1] Linden G, Smith B, York J. Amazon.com recommendations: Item-to-item collab-orative filtering[J]. IEEE on Internet Computing,2003,7(1):76–80.
    [2] Pang B, Lee L, Vaithyanathan S. Thumbs Up?: Sentiment Classification UsingMachine Learning Techniques[C]. Proceedings of the ACL-02Conference on Em-pirical Methods in Natural Language Processing (EMNLP).2002:79–86.
    [3] Kosinski M, Stillwell D, Graepel T. Private traits and attributes are predictablefrom digital records of human behavior[J]. Proceedings of the National Academyof Sciences (PNAS),2013,110(15):5802–5805.
    [4] Hong L, Doumith A S, Davison B D. Co-factorization Machines: Modeling Us-er Interests and Predicting Individual Decisions in Twitter[C]. Proceedings of theSixth ACM International Conference on Web Search and Data Mining (WSDM).2013:557–566.
    [5] Danescu-Niculescu-Mizil C, Lee L, Pang B, et al. Echoes of Power: LanguageEfects and Power Diferences in Social Interaction[C]. Proceedings of the21stInternational Conference on World Wide Web (WWW).2012:699–708.
    [6] Danescu-Niculescu-Mizil C, West R, Jurafsky D, et al. No Country for OldMembers: User Lifecycle and Linguistic Change in Online Communities[C]. Pro-ceedings of the22Nd International Conference on World Wide Web (WWW).2013:307–318.
    [7] Newman M. Communities, modules and large-scale structure in networks[J]. Na-ture Physics,2012,8(1):25–31.
    [8] Zhang J, Ackerman M S, Adamic L. Expertise Networks in Online Communities:Structure and Algorithms[C]. Proceedings of the16th International Conference onWorld Wide Web (WWW).2007:221–230.
    [9] Liu J, Song Y I, Lin C Y. Competition-based User Expertise Score Estimation[C].Proceedings of the34th International ACM SIGIR Conference on Research andDevelopment in Information Retrieval (SIGIR).2011:425–434.
    [10] Gruhl D, Guha R, Liben-Nowell D, et al. Information Difusion ThroughBlogspace[C]. Proceedings of the13th International Conference on World WideWeb (WWW).2004:491–501.
    [11]吴信东,李毅,李磊.在线社交网络影响力分析[J].计算机学报,2014,4:735.
    [12] Leskovec J, Kleinberg J, Faloutsos C. Graphs over Time: Densification Laws,Shrinking Diameters and Possible Explanations[C]. Proceedings of the EleventhACM SIGKDD International Conference on Knowledge Discovery in Data Mining(KDD).2005:177–187.
    [13] Zafarani R, Liu H. Connecting Corresponding Identities across Communities[C].Proceedings of the Third International Conference on Weblogs and Social Media(ICWSM).2009.
    [14] Abel F, Henze N, Herder E, et al. Interweaving Public User Profiles on the Web[C].Proceedings of the18th International Conference on User Modeling, Adaptation,and Personalization (UMAP).2010:16–27.
    [15] Malhotra A, Totti L, Meira W, et al. Studying User Footprints in Diferent OnlineSocial Networks[C]. IEEE/ACM International Conference on Advances in SocialNetworks Analysis and Mining (ASONAM).2012:1065–1070.
    [16]孙韬.社会化媒体中提升用户参与度的关键因素研究[D].北京大学,2013.
    [17]李栋,徐志明,李生, et al.在线社会网络中信息扩散[J].计算机学报,2014,1:015.
    [18] Hopcroft J, Lou T, Tang J. Who Will Follow You Back?: Reciprocal RelationshipPrediction[C]. Proceedings of the20th ACM International Conference on Informa-tion and Knowledge Management (CIKM).2011:1137–1146.
    [19] Lee K, Caverlee J, Webb S. Uncovering Social Spammers: Social Honeypots+Machine Learning[C]. Proceedings of the33rd International ACM SIGIR Confer-ence on Research and Development in Information Retrieval (SIGIR).2010:435–442.
    [20] Liu B. Structured data extraction: Wrapper generation[M]. Web Data Mining.Springer,2011:363–423.
    [21] Cai D, Yu S, Wen J R, et al. Extracting Content Structure for Web Pages Based onVisual Representation[C]. Proceedings of the5th Asia-Pacific Web Conference onWeb Technologies and Applications (APWeb).2003:406–417.
    [22] Zheng S, Song R, Wen J R. Template-independent News Extraction Based onVisual Consistency[C]. Proceedings of the22nd National Conference on ArtificialIntelligence (AAAI).2007:1507–1512.
    [23] Song R, Liu H, Wen J R, et al. Learning Block Importance Models for WebPages[C]. Proceedings of the13th International Conference on World Wide Web(WWW).2004:203–211.
    [24] Weninger T, Hsu W H, Han J. CETR: Content Extraction via Tag Ratios[C].Proceedings of the19th International Conference on World Wide Web (WWW).2010:971–980.
    [25] Sun F, Song D, Liao L. DOM Based Content Extraction via Text Density[C].Proceedings of the34th International ACM SIGIR Conference on Research andDevelopment in Information Retrieval (SIGIR).2011:245–254.
    [26] Kushmerick N. Wrapper Induction for Information Extraction[D].[S.l.]:[s.n.],1997. AAI9819266.
    [27] Muslea I, Minton S, Knoblock C. A Hierarchical Approach to Wrapper Induc-tion[C]. Proceedings of the Third Annual Conference on Autonomous Agents.1999:190–197.
    [28] Soderland S. Learning Information Extraction Rules for Semi-Structured and FreeText[J]. Machine Learning,1999:233–272.
    [29] Zheng S, Song R, Wen J R, et al. Joint Optimization of Wrapper Generationand Template Detection[C]. Proceedings of the13th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDD).2007:894–902.
    [30] Zhu J, Nie Z, Wen J R, et al.2D Conditional Random Fields for Web Informa-tion Extraction[C]. Proceedings of the22Nd International Conference on MachineLearning (ICML).2005:1044–1051.
    [31] Zhu J, Nie Z, Wen J R, et al. Simultaneous Record Detection and Attribute Labelingin Web Data Extraction[C]. Proceedings of the12th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDD).2006:494–503.
    [32] Xin X, Li J, Tang J, et al. Academic Conference Homepage UnderstandingUsing Constrained Hierarchical Conditional Random Fields[C]. Proceedings ofthe17th ACM Conference on Information and Knowledge Management (CIKM).2008:1301–1310.
    [33] Hao Q, Cai R, Pang Y, et al. From One Tree to a Forest: A Unified Solution forStructured Web Data Extraction[C]. Proceedings of the34th International ACM SI-GIR Conference on Research and Development in Information Retrieval (SIGIR).2011:775–784.
    [34] Wong T L, Lam W. Learning to Adapt Web Information Extraction Knowledgeand Discovering New Attributes via a Bayesian Approach[J]. IEEE Transaction onKnowledge and Data Engineering (TKDE),2010:523–536.
    [35] Laferty J D, McCallum A, Pereira F C N. Conditional Random Fields: Prob-abilistic Models for Segmenting and Labeling Sequence Data[C]. Proceedings ofthe Eighteenth International Conference on Machine Learning (ICML).2001:282–289.
    [36] Wang J, Lochovsky F H. Data Extraction and Label Assignment for Web Databas-es[C]. Proceedings of the12th International Conference on World Wide Web(WWW).2003:187–196.
    [37] Lu Y, He H, Zhao H, et al. Annotating Structured Data of the Deep Web[C]. Pro-ceedings of the23rd IEEE International Conference on Data Engineering (ICDE).2007:376–385.
    [38] Yang J M, Cai R, Wang Y, et al. Incorporating Site-level Knowledge to ExtractStructured Data from Web Forums[C]. Proceedings of the18th International Con-ference on World Wide Web (WWW).2009:181–190.
    [39] Song X, Liu J, Cao Y, et al. Automatic Extraction of Web Data Records Contain-ing User-generated Content[C]. Proceedings of the19th ACM International Con-ference on Information and Knowledge Management (CIKM).2010:39–48.
    [40] Liu B, Grossman R, Zhai Y. Mining Data Records in Web Pages[C]. Proceedingsof the Ninth ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining (KDD).2003:601–606.
    [41] Debnath S, Mitra P, Pal N, et al. Automatic Identification of Informative Sec-tions of Web Pages[J]. IEEE Transaction on Knowledge and Data Engineering(TKDE):1233–1246.
    [42]宋鑫莹.网络信息自动化高效抽取技术研究[D].哈尔滨工业大学,2013.
    [43]王允,李弼程,林琛.基于网页布局相似度的Web论坛数据抽取[J].中文信息学报,2010,24(2):68–75.
    [44] Iofciu T, Fankhauser P, Abel F, et al. Identifying Users Across Social TaggingSystems.[C]. Proceedings of the Third International Conference on Weblogs andSocial Media (ICWSM).2011.
    [45] Vosecky J, Hong D, Shen V. User identification across multiple social network-s[C]. First International Conference on Networked Digital Technologies (NDT).2009:360–365.
    [46] Nunes A, Calado P, Martins B. Resolving User Identities over Social NetworksThrough Supervised Learning and Rich Similarity Features[C]. Proceedings of the27th Annual ACM Symposium on Applied Computing (SAC).2012:728–729.
    [47] Narayanan A, Shmatikov V. Myths and Fallacies of”Personally Identifiable Infor-mation”[J]. Communication of ACM,2010,53(6):24–26.
    [48] Yuan N J, Zhang F, Lian D, et al. We Know How You Live: Exploring the Spectrumof Urban Lifestyles[C]. Proceedings of the First ACM Conference on Online SocialNetworks (COSN).2013:3–14.
    [49] Backstrom L, Dwork C, Kleinberg J. Wherefore Art Thou R3579x?: AnonymizedSocial Networks, Hidden Patterns, and Structural Steganography[C]. Proceedingsof the16th International Conference on World Wide Web (WWW).2007:181–190.
    [50] Frankowski D, Cosley D, Sen S, et al. You Are What You Say: Privacy Risksof Public Mentions[C]. Proceedings of the29th Annual International ACM SI-GIR Conference on Research and Development in Information Retrieval (SIGIR).2006:565–572.
    [51] Narayanan A, Shmatikov V. Robust De-anonymization of Large Sparse Dataset-s[C]. Proceedings of the2008IEEE Symposium on Security and Privacy (S&P).2008:111–125.
    [52] Narayanan A, Shmatikov V. De-anonymizing Social Networks[C]. Proceedings ofthe2009IEEE Symposium on Security and Privacy (S&P).2009:173–187.
    [53] Labitzke S, Taranu I, Hartenstein H. What your friends tell others about you: Lowcost linkability of social network profiles[C]. Proceedings of the5th InternationalACM Workshop on Social Network Mining and Analysis (SNA-KDD).2011.
    [54] Rao J R, Rohatgi P. Can Pseudonymity Really Guarantee Privacy?[C]. Proceedingsof the9th Conference on USENIX Security Symposium (USENIX).2000:7–7.
    [55] Novak J, Raghavan P, Tomkins A. Anti-aliasing on the Web[C]. Proceedings ofthe13th International Conference on World Wide Web (WWW).2004:30–39.
    [56] Sanderson C, Guenter S. Short Text Authorship Attribution via Sequence Ker-nels, Markov Chains and Author Unmasking: An Investigation[C]. Proceedingsof the2006Conference on Empirical Methods in Natural Language Processing(EMNLP).2006:482–491.
    [57] Gamon M. Linguistic Correlates of Style: Authorship Classification with DeepLinguistic Analysis Features[C]. Proceedings of the20th International Conferenceon Computational Linguistics (COLING).2004.
    [58] Graham N, Hirst G, Marthi B. Segmenting Documents by Stylistic Character[J].Natural Language Engineering (NLE),2005,11(4):397–415.
    [59] Soon W M, Ng H T, Lim D C Y. A Machine Learning Approach to CoreferenceResolution of Noun Phrases[J]. Computational Linguistic (CL),2001,27(4):521–544.
    [60] Bengtson E, Roth D. Understanding the Value of Features for Coreference Resolu-tion[C]. Proceedings of the Conference on Empirical Methods in Natural LanguageProcessing (EMNLP).2008:294–303.
    [61] Cai J, Strube M. End-to-end Coreference Resolution via Hypergraph Partition-ing[C]. Proceedings of the23rd International Conference on Computational Lin-guistics (COLING).2010:143–151.
    [62] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate Record Detection: ASurvey[J]. IEEE Transaction on Knowledge and Data Engineering (TKDE),2007,19(1):1–16.
    [63] Bhattacharya I, Getoor L. Collective Entity Resolution in Relational Data[J]. ACMTransaction Knowledge Discovery Data (TKDD),2007,1(1).
    [64] Kalashnikov D V, Chen Z, Mehrotra S, et al. Web People Search via Connec-tion Analysis[J]. IEEE Transaction on Knowledge and Data Engineering (TKDE),2008,20(11):1550–1565.
    [65] Elmacioglu E, Tan Y F, Yan S, et al. PSNUS: Web People Name Disambiguationby Simple Clustering with Rich Features[C]. Proceedings of the4th InternationalWorkshop on Semantic Evaluations (SemEval).2007:268–271.
    [66] Yoshida M, Ikeda M, Ono S, et al. Person Name Disambiguation by Bootstrap-ping[C]. Proceedings of the33rd International ACM SIGIR Conference on Re-search and Development in Information Retrieval (SIGIR).2010:10–17.
    [67] Mann G S, Yarowsky D. Unsupervised Personal Name Disambiguation[C].Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL (CONLL).2003:33–40.
    [68] Brin S, Page L. The Anatomy of a Large-scale Hypertextual Web Search En-gine[C]. Proceedings of the Seventh International Conference on World Wide Web(WWW).1998:107–117.
    [69] Kleinberg J M. Authoritative Sources in a Hyperlinked Environment[J]. Journal ofACM,1999,46(5):604–632.
    [70] Campbell C S, Maglio P P, Cozzi A, et al. Expertise Identification Using EmailCommunications[C]. Proceedings of the Twelfth International Conference on In-formation and Knowledge Management (CIKM).2003:528–531.
    [71] Zhou D, Orshanskiy S A, Zha H, et al. Co-ranking Authors and Documents in aHeterogeneous Network[C]. Proceedings of the2007Seventh IEEE InternationalConference on Data Mining (ICDM).2007:739–744.
    [72] Jurczyk P, Agichtein E. Hits on Question Answer Portals: Exploration of LinkAnalysis for Author Ranking[C]. Proceedings of the30th Annual InternationalACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR).2007:845–846.
    [73] Jurczyk P, Agichtein E. Discovering Authorities in Question Answer Communitiesby Using Link Analysis[C]. Proceedings of the Sixteenth ACM Conference onConference on Information and Knowledge Management (CIKM).2007:919–922.
    [74] Bouguessa M, Dumoulin B, Wang S. Identifying Authoritative Actors in Question-answering Forums: The Case of Yahoo! Answers[C]. Proceedings of the14thACM SIGKDD International Conference on Knowledge Discovery and Data Min-ing (SIGKDD).2008:866–874.
    [75] Bian J, Liu Y, Zhou D, et al. Learning to Recognize Reliable Users and Contentin Social Media with Coupled Mutual Reinforcement[C]. Proceedings of the18thInternational Conference on World Wide Web (WWW).2009:51–60.
    [76] Pal A, Konstan J A. Expert Identification in Community Question Answering:Exploring Question Selection Bias[C]. Proceedings of the19th ACM InternationalConference on Information and Knowledge Management (CIKM).2010:1505–1508.
    [77]曹云波.关于网络社区问答知识重用的研究[D].上海交通大学,2011.
    [78]王宝勋.面向网络社区问答对的语义挖掘研究[D].哈尔滨工业大学,2013.
    [79] Jeon J, Croft W B, Lee J H, et al. A Framework to Predict the Quality of An-swers with Non-textual Features[C]. Proceedings of the29th Annual InternationalACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR).2006:228–235.
    [80] Agichtein E, Castillo C, Donato D, et al. Finding High-quality Content in SocialMedia[C]. Proceedings of the2008International Conference on Web Search andData Mining (WSDM).2008:183–194.
    [81] Liu Y, Bian J, Agichtein E. Predicting Information Seeker Satisfaction in Commu-nity Question Answering[C]. Proceedings of the31st Annual International ACMSIGIR Conference on Research and Development in Information Retrieval (SI-GIR).2008:483–490.
    [82] Cong G, Wang L, Lin C Y, et al. Finding Question-answer Pairs from OnlineForums[C]. Proceedings of the31st Annual International ACM SIGIR Conferenceon Research and Development in Information Retrieval (SIGIR).2008:467–474.
    [83] Suryanto M A, Lim E P, Sun A, et al. Quality-aware Collaborative QuestionAnswering: Methods and Evaluation[C]. Proceedings of the Second ACM Interna-tional Conference on Web Search and Data Mining (WSDM).2009:142–151.
    [84] Balog K, Azzopardi L, de Rijke M. Formal Models for Expert Finding in EnterpriseCorpora[C]. Proceedings of the29th Annual International ACM SIGIR Conferenceon Research and Development in Information Retrieval (SIGIR).2006:43–50.
    [85] Balog K, Azzopardi L, de Rijke M. A Language Modeling Framework for ExpertFinding[J]. Information Process&Management (IPM),2009,45(1):1–19.
    [86]包胜华.基于Web的实体信息搜索与挖掘研究[D].上海交通大学,2008.
    [87] Zhou Y, Cong G, Cui B, et al. Routing Questions to the Right Users in OnlineCommunities[C]. Proceedings of IEEE25th International Conference on Data En-gineering (ICDE).2009:700–711.
    [88] Li B, King I. Routing Questions to Appropriate Answerers in Community QuestionAnswering Services[C]. Proceedings of the19th ACM International Conference onInformation and Knowledge Management (CIKM).2010:1585–1588.
    [89] Xu F, Ji Z, Wang B. Dual Role Model for Question Recommendation in Com-munity Question Answering[C]. Proceedings of the35th International ACM SI-GIR Conference on Research and Development in Information Retrieval (SIGIR).2012:771–780.
    [90] Yang L, Qiu M, Gottipati S, et al. CQArank: Jointly Model Topics and Expertise inCommunity Question Answering[C]. Proceedings of the22Nd ACM InternationalConference on Information&Knowledge Management (CIKM).2013:99–108.
    [91] Si X, Chang E Y, Gyo¨ngyi Z, et al. Confucius and Its Intelligent Disciples: Inte-grating Social with Search[J]. Proceedings of the Very Large Database Endowment(VLDB),2010,3(1-2):1505–1516.
    [92] Chawla S, Hartline J D, Sivan B. Optimal Crowdsourcing Contests[C]. Proceed-ings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms(SODA).2012:856–868.
    [93] Yang J, Adamic L A, Ackerman M S. Competing to Share Expertise: The TaskcnKnowledge Sharing Community.[C]. Proceedings of the Second International Con-ference on Weblogs and Social Media (ICWSM).2008.
    [94] Whitehill J, Ruvolo P, Wu T, et al. Whose Vote Should Count More: OptimalIntegration of Labels from Labelers of Unknown Expertise.[C]. Proceedings ofAdvances in Neural Information Processing Systems (NIPS).2009，22:2035–2043.
    [95] Welinder P, Branson S, Belongie S, et al. The Multidimensional Wisdom of Crowd-s.[C]. Proceedings of Advances in Neural Information Processing Systems (NIPS).2010，10:2424–2432.
    [96] Zhou D, Platt J C, Basu S, et al. Learning from the Wisdom of Crowds by MinimaxEntropy.[C]. Proceedings of Advances in Neural Information Processing Systems(NIPS).2012:2204–2212.
    [97] Bachrach Y, Graepel T, Minka T, et al. How To Grade a Test Without Knowingthe Answers-A Bayesian Graphical Model for Adaptive Crowdsourcing and Ap-titude Testing[C]. Proceedings of the22Nd International Conference on MachineLearning (ICML).2012.
    [98] Baker F B. The basics of item response theory[M]. ERIC,2001.
    [99] Rasch G. Probabilistic Models For Some Intelligence And Attainment Tests[J].1981.
    [100] Engelhard Jr G. The measurement of writing ability with a many-faceted Raschmodel[J]. Applied Measurement in Education,1992,5(3):171–191.
    [101] Bond T G, Fox C M. Applying the Rasch model: Fundamental measurement inthe human sciences[M]. Psychology Press,2013.
    [102] Lange R, Moran J, Greif W R, et al. A Probabilistic Rasch Analysis of Ques-tion Answering Evaluations.[C]. Proceedings of the North American Chapter ofthe Association for Computational Linguistics: Human Language Technologies(NAACL-HLT).2004:65–72.
    [103] Liu K, Terzi E. A framework for computing the privacy scores of users in on-line social networks[J]. ACM Transactions on Knowledge Discovery from Data(TKDD),2010,5(1):6.
    [104] Ackerman M S, McDonald D W. Answer Garden2: Merging OrganizationalMemory with Collaborative Help[C]. Proceedings of the1996ACM Conferenceon Computer Supported Cooperative Work (CSCW).1996:97–105.
    [105] Yang J, Adamic L A, Ackerman M S. Crowdsourcing and Knowledge Sharing:Strategic User Behavior on Taskcn[C]. Proceedings of the9th ACM Conference onElectronic Commerce (EC).2008:246–255.
    [106] Archak N. Money, Glory and Cheap Talk: Analyzing Strategic Behavior of Contes-tants in Simultaneous Crowdsourcing Contests on TopCoder.Com[C]. Proceedingsof the19th International Conference on World Wide Web (WWW).2010:21–30.
    [107] Wang K, Thrasher C, Hsu B J P. Web Scale NLP: A Case Study on Url WordBreaking[C]. Proceedings of the20th International Conference on World WideWeb (WWW).2011:357–366.
    [108] Wang K, Thrasher C, Viegas E, et al. An Overview of Microsoft Web N-gram Cor-pus and Applications[C]. Proceedings of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Language Technologies (NAACL-HLT).2010:45–48.
    [109] Yang W. Identifying Syntactic Diferences Between Two Programs[J]. SoftwarePractice Expert,1991:739–755.
    [110] Chang C C, Lin C J. LIBSVM: A Library for Support Vector Machines[J]. ACMTransaction. Intelligence System Technology (TIST),2011,2(3):27–54.
    [111] Zheng S, Zhou D, Li J, et al. Extracting Author Meta-Data from Web Using VisualFeatures[C]. Proceedings of the Seventh IEEE International Conference on DataMining Workshops.2007:33–40.
    [112] Peters M E, Lecocq D. Content Extraction Using Diverse Feature Sets[C]. Proceed-ings of the22Nd International Conference on World Wide Web (WWW).2013:89–90.
    [113] Kumar S, Zafarani R, Liu H. Understanding User Migration Patterns in SocialMedia[C]. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelli-gence (AAAI).2011.
    [114] Liu K, Terzi E. A Framework for Computing the Privacy Scores of Users in OnlineSocial Networks[J]. ACM Transation Knowledge Discovery Data (TKDD),2010,5(1):6:1–6:30.
    [115] Gonzalez R C, Woods R E, Eddins S L. Digital image processing using MAT-LAB[M]. Pearson Education India,2004.
    [116] Cover T M, Thomas J A. Elements of information theory[M]. John Wiley&Sons,2012.
    [117] Chang C C, Lin C J. LIBSVM: a library for support vector machines[J]. ACMTransactions on Intelligent Systems and Technology (TIST),2011,2(3):27.
    [118] Platt J C. Probabilistic Outputs for Support Vector Machines and Comparisonsto Regularized Likelihood Methods[C]. Proceedings of Advances in Large MarginClassifiers.1999:61–74.
    [119] Acuna E, Rodriguez C. The treatment of missing values and its effect on classifier accuracy[C]. Proceedings of Classification, Clustering, and Data Mining Applications. 2004:639–647.
    [120] Harper F M, Moy D, Konstan J A. Facts or Friends?: Distinguishing Informational and Conversational Questions in Social Q&A Sites[C]. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). 2009:759–768.
    [121] Liu Y, Li S, Cao Y, et al. Understanding and Summarizing Answers in Community-based Question Answering Services[C]. Proceedings of the 22nd International Conference on Computational Linguistics (COLING). 2008:497–504.
    [122] Mendes Rodrigues E, Milic-Frayling N. Socializing or Knowledge Sharing?: Characterizing Social Intent in Community Question Answering[C]. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). 2009:1127–1136.
    [123] Elo A E. The rating of chessplayers, past and present[M]. Vol.3. Batsford London,1978.
    [124] Herbrich R, Minka T, Graepel T. TrueSkill: A Bayesian skill rating system[C]. Proceedings of Advances in Neural Information Processing Systems (NIPS). 2006:569–576.
    [125] Mease D. A penalized maximum likelihood approach for the ranking of college football teams independent of victory margins[J]. The American Statistician,2003,57(4):241–248.
    [126] Callaghan T, Mucha P J, Porter M A. Random walker ranking for NCAA division I-A football[J]. American Mathematical Monthly,2007,114(9):761–777.
    [127] Park J, Newman M E. A network-based ranking system for US college football[J]. Journal of Statistical Mechanics: Theory and Experiment,2005,2005(10):P10014.
    [128] Shah C, Pomerantz J. Evaluating and Predicting Answer Quality in Community QA[C]. Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2010:411–418.
    [129] Wang X J, Tu X, Feng D, et al. Ranking Community Answers by Modeling Question-answer Relationships via Analogical Reasoning[C]. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2009:179–186.
    [130] Sakai T, Ishikawa D, Kando N. Overview of the NTCIR-8 Community QA Pilot Task (Part II): System Evaluation[J]. Proceedings of NTCIR-8,2010:433–457.
    [131] Sakai T, Ishikawa D, Kando N, et al. Using Graded-relevance Metrics for Evaluating Community QA Answer Selection[C]. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM). 2011:187–196.
    [132] Yang J, Wei X, Ackerman M S, et al. Activity Lifespan: An Analysis of User Survival Patterns in Online Knowledge Sharing Communities[C]. Proceedings of the Third International Conference on Weblogs and Social Media (ICWSM). 2010.
    [133] Nam K K, Ackerman M S, Adamic L A. Questions in, Knowledge in?: A Study of Naver’s Question Answering Community[C]. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). 2009:779–788.
    [134] Church K. How Many Multiword Expressions Do People Know?[C]. Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World (MWE). 2011:137–144.
    [135] Schein A I, Popescul A, Ungar L H, et al. Methods and Metrics for Cold-start Recommendations[C]. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2002:253–260.
    [136] Sugiyama K, Hatano K, Yoshikawa M. Adaptive Web Search Based on User Profile Constructed Without Any Effort from Users[C]. Proceedings of the 13th International Conference on World Wide Web (WWW). 2004:675–684.
    [137] Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering[C]. Proceedings of Advances in Neural Information Processing Systems (NIPS). 2001,14:585–591.
    [138] Belkin M, Niyogi P, Sindhwani V. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples[J]. Journal of Machine Learning Research (JMLR),2006,7:2399–2434.
    [139] Cohen J. A Coefficient of Agreement for Nominal Scales[J]. Educational and Psychological Measurement,1960,20(1):37.
