Design and Implementation of an Email Management System Based on Entities and Information Network Knowledge Extraction

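Author: Wen Jie (文捷)
Thesis level: Master's
Discipline: Computer Software and Theory
Keywords (Chinese): 电子邮件 (email); 实体识别 (entity recognition); 信息网络知识提取 (information network knowledge extraction); 评分&聚类 (ranking & clustering)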
Keywords (English): Email; entity identification; knowledge extraction from information networks; ranking & clustering
Degree year: 2010
Advisor: Wang Wei (汪卫)
Discipline code: 081202
Degree-granting institution: Fudan University
Thesis submission date: 2010-09-30

Chinese Abstract

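As the data and information held by individual email users grow exponentially, personal information management (PIM) has become an active research topic. Email, as an important carrier of personal information, occupies a central place in personal information services. As personal information accumulates, however, users find it increasingly difficult to manage and use their mail. For example, when searching they often cannot recall the right keywords, or the results returned are unsatisfactory. Ordinary email tools offer little help with organizing and managing personal information in these cases. Meanwhile, research on entities has also grown increasingly popular: entity recognition, classification, and application have drawn attention from industry, and this work is closely tied to the documents that carry entities. Email is the most common such carrier in daily life. This thesis therefore uses the properties of entities and their close connection with email as the basis for data mining and analysis, with the aim of helping individual email users organize and manage the personal information in their mail.
In addition, because of the nature of email systems, an individual email user and the users he or she corresponds with naturally form an information network. As information networks become ubiquitous, extracting knowledge from them has become an important task. As the research progressed, this thesis therefore attempts to extract knowledge and useful information from the information network formed by an individual email user and his or her correspondents, to help the user better organize mail data and manage personal information.
In knowledge extraction from information networks, the two most popular approaches at present are ranking and clustering. Both give users an overview of the data in an information network, and each is an active research direction in its own right. It should be noted, however, that ranking and clustering cannot be treated in isolation: ranking without clustering often produces a large amount of meaningless data, and, similarly, placing a large amount of data into a single cluster without distinguishing among its members is in most cases also meaningless. After studying the latest research on ranking and clustering in information networks, this thesis proposes several improved methods based on the characteristics of email systems. These methods help users manage the importance of contacts and mail data in the email information network and improve the precision of query results in everyday work.
This thesis first proposes an email management system based on entity discovery, lookup, and management. After further study and research, it proposes an improved scheme that builds on information network knowledge extraction, draws on prior work and the characteristics of the present research, and applies improved clustering and ranking methods, effectively alleviating the problems described above. The thesis also presents its own ideas and processing mechanisms for implementing the key techniques (Chinese word segmentation, entity mining, entity association management, graphical presentation of query results and of the information network structure, and ranking and clustering), thereby improving the efficiency with which users manage their email.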

English Abstract

With the growth of personal users' information and data, research on personal information management (PIM) is becoming increasingly popular and intense. As an important information repository, email plays an important role in PIM research. But as personal information increases, users face more and more difficulties in using and managing it. For example, they often forget the query keywords when they try to search for important information in their email, or they are not satisfied with the query results. This means that conventional email tools offer little help to personal users in organizing and managing their information.
At the same time, research on entities has become a hot topic, with more and more work focusing on the identification, categorization, and application of entities, all of which are closely tied to the media that carry entities. Email is the most common such medium. Based on the features of entities and the connection between entities and email, this thesis helps personal users organize and manage the personal information in their email more effectively.
Furthermore, because of the nature of email, an information network forms naturally between an email user and the users who communicate with him or her. With the spread of information networks, extracting knowledge from them has become an important task. As the research progressed, this thesis studies email data more deeply in order to extract useful knowledge and information from the network between an email user and his or her contacts.
In knowledge extraction from information networks, the two most popular methods are ranking and clustering. Both clustering and ranking can provide overall views of information network data, and each has been a hot topic. However, ranking objects globally without considering which clusters they belong to often leads to dumb results. Similarly, clustering a huge number of objects into one huge cluster without distinction is dull as well. After studying the latest research on ranking and clustering, this thesis proposes some new methods for managing the importance of contacts and mails.
In this thesis, we propose an email management system based on entity mining, querying, and processing, and we improve the system after further study and research. Drawing on the features of the information network and on prior work, we improve the methods of ranking and clustering. We also present our ideas for implementing the key techniques: Chinese word segmentation, entity mining, entity association management, graphical display of query results and of the network, and ranking and clustering. Finally, our system helps users work more efficiently with email.
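
The abstract's observation that ranking and clustering should not be treated in isolation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the thesis's method: the toy mailbox `EMAILS`, the score (the number of emails a contact appears in), and the clustering (connected components of the contact co-occurrence graph) are simple stand-ins chosen only to contrast a global ranking with a ranking inside each cluster.

```python
from collections import Counter, defaultdict

# Hypothetical toy mailbox: each email is the set of contacts appearing on it
# (the mailbox owner is left out). Names and data are invented for illustration.
EMAILS = [
    {"alice", "bob"}, {"alice", "bob"}, {"alice", "carol"},
    {"alice", "bob", "carol"},
    {"dave", "erin"}, {"dave", "frank"},
]


def score_contacts(emails):
    """Score each contact by the number of emails it appears in (assumed score)."""
    return Counter(name for participants in emails for name in participants)


def cluster_contacts(emails):
    """Cluster contacts as connected components of the co-occurrence graph.

    This is a stand-in for a real clustering method: two contacts are linked
    if they appear on the same email.
    """
    neighbours = defaultdict(set)
    for participants in emails:
        for name in participants:
            neighbours[name] |= participants - {name}

    clusters, seen = [], set()
    for start in sorted(neighbours):  # sorted for a deterministic cluster order
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def rank(names, scores):
    """Order names by descending score, breaking ties alphabetically."""
    return sorted(names, key=lambda n: (-scores[n], n))


def rank_within_clusters(emails):
    """Rank contacts separately inside each cluster."""
    scores = score_contacts(emails)
    return [rank(cluster, scores) for cluster in cluster_contacts(emails)]
```

On this toy data, the global ranking `rank(score_contacts(EMAILS).keys(), score_contacts(EMAILS))` fills its first three places with one group of contacts and leaves `dave` in fourth place, whereas `rank_within_clusters(EMAILS)` reports `dave` as the leading contact of his own group. That is the kind of distinction the abstract argues a global ranking alone would miss.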

