Probabilistic Latent Semantic Analysis and Its Applications
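Abstract
Many information retrieval applications need to uncover the meaning hidden behind characters and words. Because synonymy and polysemy are widespread, simple literal matching often fails to return results that match the meaning of the query precisely. Probabilistic Latent Semantic Analysis (PLSA) is a probabilistic model that links latent variables to co-occurrence data pairs (such as terms and documents). It uses statistical methods to establish the probability distributions relating "document-latent semantics-term" and applies them to statistical semantic analysis, yielding the distribution of terms within each topic and the distribution of topics within each document. Documents can therefore be represented and understood at the semantic level rather than purely literally. In the semantic space, operations such as matching, ranking, and relevance queries can be performed more accurately. This thesis studies a sparse representation framework for PLSA and its parallel extensions. The main contributions are:
     ●We propose an efficient method for introducing sparse representation into the PLSA framework. By imposing sparsity control on the two model parameters, it addresses the overfitting of conventional PLSA and its inability to extract local features. Experiments confirm that the method is more accurate than existing PLSA algorithms and also delivers outstanding performance.
     ●We propose methods for efficiently training PLSA models under distributed processing frameworks. We design and implement a multithreaded PLSA algorithm for multi-core processors, as well as parallel PLSA algorithms based on Hadoop and on MPI, discuss the details and issues that arise in practice, and finally run experiments and performance evaluations on a cluster.
     ●We explore using PLSA for personalized ranking of RSS articles, estimating a user's interest in an article from the time the user spends reading it.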
Many information retrieval applications rely on discovering the meaning hidden behind the text itself. However, because of polysemy and synonymy, literal term matching may not return results that accurately match a query. Probabilistic Latent Semantic Analysis (PLSA) is a topic modeling technique that discovers hidden structure by relating observed data to assumed hidden variables; for a text corpus this relation is "document-topic-term". It uses statistical learning to estimate the model parameters, namely the multinomial distribution of terms within each topic and the multinomial distribution of topics within each document. Documents are represented in a semantic space instead of the term space, so that matching, ranking, and relevance queries can be done more accurately. This thesis makes the following contributions:
     ●We present an efficient approach that provides direct control over sparsity during the expectation-maximization (EM) process. It resolves the overfitting problem of PLSA and its inability to produce local features. Experiments on face databases give visual evidence of the local features obtained, and clustering experiments show detailed improvements over the original process.
     ●We design parallel PLSA training processes: a multithreaded version, and distributed versions under the MPI and MapReduce frameworks. Implementation details are discussed, and the evaluations analyze the pros and cons of each.
     ●We propose a method for the RSS document ranking problem that uses implicit feedback, in the form of reading time, to model user preferences.
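The parameter estimation described above can be illustrated with a minimal sketch of the standard EM procedure for plain PLSA. This is not the sparse or parallel variant contributed by the thesis, whose details are not given in this abstract. The function name `plsa`, its parameters, and the toy count matrix are all hypothetical and chosen only for illustration.

```python
import math
import random


def plsa(counts, num_topics, iters=30, seed=0):
    """Fit plain PLSA by EM on a document-term count matrix (list of lists).

    Returns (p_w_z, p_z_d, log_likelihoods) where
      p_w_z[k][w] = P(term w | topic k)
      p_z_d[d][k] = P(topic k | document d)
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total == 0.0:  # degenerate row: fall back to a uniform distribution
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Random positive initialisation; every row is a probability distribution.
    p_w_z = [normalize([rng.random() + 0.01 for _ in range(n_terms)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.01 for _ in range(num_topics)])
             for _ in range(n_docs)]

    log_likelihoods = []
    for _ in range(iters):
        acc_w_z = [[0.0] * n_terms for _ in range(num_topics)]
        acc_z_d = [[0.0] * num_topics for _ in range(n_docs)]
        ll = 0.0
        for d in range(n_docs):
            for w in range(n_terms):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(w | z) * P(z | d).
                joint = [p_w_z[k][w] * p_z_d[d][k] for k in range(num_topics)]
                p_w_d = sum(joint)  # P(w | d) under the current parameters
                ll += n * math.log(p_w_d)
                for k in range(num_topics):
                    expected = n * joint[k] / p_w_d  # expected count for topic k
                    acc_w_z[k][w] += expected
                    acc_z_d[d][k] += expected
        # M-step: renormalise the expected counts into distributions.
        p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
        log_likelihoods.append(ll)
    return p_w_z, p_z_d, log_likelihoods


# Toy corpus: rows are documents, columns are term counts (made-up data).
counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 0, 3, 4]]
p_w_z, p_z_d, lls = plsa(counts, num_topics=2)
print(p_z_d)
```

Each row of `p_z_d` is the representation of one document in the semantic (topic) space, which is what matching and ranking would operate on. The recorded log-likelihood should not decrease between iterations, which is a convenient sanity check for an EM implementation.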
