Research on Models and Algorithms for Information Filtering and Information Retrieval Based on Machine Learning
Abstract
With the rapid development of Internet technologies, information networks play an increasingly important role in people's work and daily life. Quickly and efficiently obtaining the information resources people actually need from the massive amount of information on the network has become a key problem of the information society. Information filtering (IF) and information retrieval (IR) are effective techniques for this problem and carry significant academic and practical value. Based on statistical machine learning methods, this thesis focuses on models for information filtering and information retrieval and on the algorithms that solve them. The main research contents are as follows:
     First, the concepts and significance of information filtering and information retrieval are introduced, and their origins and development are summarized. Several statistics-based machine learning methods, their characteristics, and their applications in IF and IR are then outlined, forming the theoretical basis of the thesis.
     Second, after reviewing several common approaches to collaborative filtering, a probabilistic model for collaborative filtering is proposed, called the real preference Gaussian mixture model. The model introduces two latent variables that describe user classes and item classes respectively, so each user and each item can belong to several classes simultaneously with certain probabilities. The model also captures the combined effect of a user's rating habits and an item's public reputation on the final user-item rating. Compared with traditional collaborative filtering models, the new model better reflects how users actually rate.
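     The abstract gives no equations for this model. Purely as an illustration of the description above, one plausible reading is a mixture over a latent user class z_u and a latent item class z_v, with the observed rating shifted by a per-user bias b_u (rating habit) and a per-item bias c_v (public reputation); the notation mu_kl, sigma_kl, b_u, c_v is assumed here, not the author's:

        p(r | u, v) = \sum_{k=1}^{K} \sum_{l=1}^{L} p(z_u = k | u) \, p(z_v = l | v) \, \mathcal{N}(r;\, \mu_{kl} + b_u + c_v,\, \sigma_{kl}^2)

     where the Gaussian mean mu_kl plays the role of the "true preference" of user class k for item class l, and the soft memberships p(z_u = k | u) and p(z_v = l | v) let a user or an item belong to several classes at once.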
     Third, the application of finite mixture models to clustering large-scale text data is studied, and a principled generalized method for unsupervised text clustering with finite mixture models is proposed. The method integrates model selection, feature selection, and parameter estimation of the mixture model into a unified framework. A modified "feature saliency" formulation is defined, in which each feature's relevance to each mixture component enters the mixture model as a latent variable, so feature selection is carried out while the model parameters are estimated. As a concrete instance of the generalized method, a multinomial mixture model with feature selection is developed and described in detail.
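     For context, the standard feature-saliency mixture of Law, Figueiredo and Jain (IEEE TPAMI, 2004), on which such component-relative variants build, treats each feature j as relevant with probability rho_j and writes the density of a D-dimensional point x as

        p(x | \Theta) = \sum_{k=1}^{K} \alpha_k \prod_{j=1}^{D} \left[ \rho_j \, p(x_j | \theta_{kj}) + (1 - \rho_j) \, q(x_j | \lambda_j) \right]

     where p(x_j | theta_kj) models a relevant feature inside component k and q(x_j | lambda_j) is a common density shared by the irrelevant ones, with all parameters estimated jointly by EM. Reading the paragraph above, the component-relative modification presumably replaces the global saliency rho_j with a per-component rho_kj; that reading is an inference from the abstract, not the author's stated formula.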
     Fourth, semi-supervised learning is studied with graph-based methods. The main idea is to define a density-based distance between samples as a similarity measure, extract the intrinsic structure of the data set from it, and feed that structure into the learner. For semi-supervised classification, a density-based distance computed over a kNN graph is defined to reflect the similarity between data points and to re-weight the graph, and a Laplacian kernel method then builds a separating surface over the whole feature space. For semi-supervised clustering, a density-based constraint-expansion method is proposed: the given constraint set is expanded according to the density-based distances and similarities between sample points, so the expanded constraint set carries the intrinsic structure of the data and can be used with any semi-supervised clustering algorithm.
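     The abstract does not spell out the distance itself. The sketch below shows one common density-sensitive construction in the same spirit (the kNN graph, the exponential stretching factor rho, and the function name are illustrative assumptions, not the thesis's definition): edges of a kNN graph are re-weighted so that long jumps are penalized exponentially, which makes a chain of short hops through a dense region cheaper than a single hop across a sparse gap, and the distance between two points is the shortest path in the re-weighted graph.

        import numpy as np
        from scipy.spatial.distance import cdist
        from scipy.sparse.csgraph import shortest_path

        def density_based_distances(X, k=10, rho=2.0):
            """All-pairs density-sensitive distances for the rows of X."""
            d = cdist(X, X)              # pairwise Euclidean distances
            w = np.exp(rho * d) - 1.0    # long edges grow exponentially expensive
            n = X.shape[0]
            graph = np.zeros((n, n))     # 0 means "no edge" for csgraph routines
            for i in range(n):
                nn = np.argsort(d[i])[1:k + 1]  # k nearest neighbours, self skipped
                graph[i, nn] = w[i, nn]
                graph[nn, i] = w[nn, i]         # keep the graph symmetric
            # Dijkstra shortest paths over the re-weighted kNN graph; pairs in
            # different connected components come back as inf.
            return shortest_path(graph, method="D", directed=False)

     The resulting pairwise distances could then feed a kernel for semi-supervised classification or drive the transitive expansion of must-link/cannot-link constraints, matching the two uses described above.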
     Finally, the main research work of the thesis is summarized, and prospects for future research are discussed.