基于潜在语义索引和免疫学习的BIRCH聚类算法研究

英文题名：Research on Birch Clustering Algorithm Based on Latent Semantic Indexing and Immunological Learning
作者：岳淑珍
论文级别：硕士
学科专业名称：计算机科学与技术
中文关键词：隐式兴趣挖掘 ; BIRCH聚类 ; 潜在语义索引 ; 人工免疫网络
英文关键词：Implicit Interest Mining ; BIRCH Clustering ; Latent Semantic Indexing ; Artificial Immune Network
学位年度：2009
导师：关毅
学科代码：081203
学位授予单位：哈尔滨工业大学
论文提交日期：2009-06-01

摘要

网络已经发展成为人们生活的重要部分,网络上存储的信息是海量的,而且处于不断变化中。网络用户期望得到个性化的服务,网络服务端需要为其推出个性化服务提供决策参考,用户兴趣挖掘技术也就应运而生了。
     用户兴趣挖掘技术对用户的兴趣进行有效地记录、分析,并围绕着描述用户兴趣的计算模型开发应用。考虑建立用户兴趣模型的可用性及准确度,我们选择隐式建模方式,即不需要用户中断网络浏览过程,通过收集反映用户兴趣的信息来建立用户模型,推断用户的兴趣。本文采用记录了用户的搜索和访问等信息的日志文件。处理过程主要分为三个阶段:预处理、用户兴趣建模、应用。
     为了更好地处理大量的,并且增量式加入的网络文档,系统的主要建模技术采用了处理时间为线性的BIRCH聚类。经过日志过滤、正文抽取等预处理之后,采用传统的向量空间模型的网络文档的文本表示特征往往呈现出高维而且稀疏的特点,本文提出了加入改进的潜在语义索引处理,对比实验证明,处理时间明显地缩短,恰当地选择BIRCH参数以及LSI中的k值,能够得到适应所用数据和应用领域的更好的聚类结果。验证了潜在语义索引技术可以在保留主要语义结构的基础上降低文本表示的维数,在形成的潜在语义空间中提取最有意义的维度作为特征表示。
     有效性度量是评价结果的关键,其中有效性函数的选择是一个关系到判定效果的关键。针对不同的数据,BIRCH聚类需要找到优化的参数才能得到更好的结果。本文研究了人工免疫网络算法,探索将其自适应机制引入BIRCH聚类的参数调节优化过程中,根据调节得到的参数设定最适合应用领域和数据特点的有效性函数。
     本文利用上述技术建立用户模型,以模型为基础,开发了用户聚类和好友推荐应用,人工校对证明,可以认为模型能够对不同用户的不同兴趣领域较好地描述和计算。
Network has become an important part of people's lives, network information presented to the users is vast and constantly changing. Internet users expect personalized services, and the network servers who want to supply personalized service need accurate description of the users’interest to do decision-making, users interest mining technology come into existence as the situation requires.
     Users interest mining technology records and analyses the user's interest effectively, then models it and developes applications based on the model. Considering the availability and accuracy of the model, we choose implicit modeling approach to predict users’interest, which will not alter normal patterns of browing and reading, and mining log files which recored searches and accesses. Implicit users interest mining systerm consists of three steps: preprocessing, modeling users interest, and applications.
     In order to better handle the massive and incremental network documents, implicit users interest system adopts BIRCH clustering method, whose processing time is linear, as the mining technique. After preprocessing step, characterized vector of network documents that adopts Vector Space Model is high-dimensional and sparse, so we combine latent semantic indexing with BIRCH, comparison experiments show that the method can get more effective clustering result, if we can select parameters of BIRCH and K in LSI suitably, and validate that latent semantic indexing technology can retain the main structure of semantic basis, and select the most efficient dimension.
     Measurement of the effectiveness evaluation of clustering results is one key to the clustering. Aiming at different data set, we need explore the most suitable parameters to get more better performance. this paper researchs artificial immune network algorithm, and explore to introduce the adaptive mechanism into BIRCH Clustering Optimization of the tune parameters to get the optimized effectiveness functions.
     We model the user interest based on above techniques, and develope the user clustering and friends recommendding applications, human check proves that the model is a promising way to describe different interest areas of different users.

引文

1 Etzioni O. The World Wide Web: Quagmire or Gold Mine. Commun. ACM. 1996, 39(11):65-68
    2 T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. From user access patterns to dynamic hypertext linking. Fifth International World Wide Web Conference, Paris, France, 1996
    3 Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, ACM SIGKDD, Jan 2000
    4 Bamshad Mobasher. Web Usage Mining and Personalization Draft Chapter in Practical Handbook of Internet Computing, Munindar P.Singh(ed.), CRC Press. To appear in 2004. http://maya.cs.depaul.edu/~mobasher/- pubs-subject. html#usage_mining
    5 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
    6 T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Cluster- ing Method for very Large Databases, In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data(SIGMOD’96), 1996:103-114
    7 G. Salton, A. Wong, C. S. Yang. On the specification of term values in automatic indexing. 1973
    8 S. Dumais, G. Furnas, T. Landauer, Using latent semantic analysis to improve access to textual information. In Proceedings of the Conference on Human Factors in Computing Systems, CHI:281-286
    9 S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. arshman. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 1990, 41(6):391-407
    10 M BERRY, S DUMAIS, G O’BRIEN. Using linear algebra for intelligent information retrieval SIAM Review, 1995, 37(4):573-595
    11冯项云. LSI潜在语义标引方法在情报检索中的应用.现代图书情报技术,1998(4):19-21
    12郑亚非.潜在语义分析与篇章理解.浙江王业大学学报(社会科学版), 2006,5(1):70-75
    13桂诗春.潜伏语义分析的理论及其应用.现代外语, 2003, 26(1):76-84
    14杨守捷,刘曼华.应用潜在语义分析.探析认知科学天津大学学报(社会科学版), 2001, 3(3)
    15 N. Jerne, Towards a network theory of the immune system. Annals of Immunology (Inst.Pasteur). 125C:373-389
    16 H. Bersini, F. Varela. Hints for Adaptive Problem Solving Gleaned from Immune Networks. Parallel Problem Solving from Nature, 1st Workshop. PPSW1, Dortmund, Germany. Pub. Springer-Verlag, 1990:343-354
    17 S. Forrest, A. Perelson, L. Allen, R. Cherukuri. Self, Non-self Discriminat- ion in a Computer. Proceedings of the 1994 IEEE Symposium on Research in Security and Privacy. 1994:202-212
    18 J. E. Hunt, D. E. Cooke. Learning Using an Artificial Immune System. Journal of Network and Computer Applications. 1996, (19):189-212
    19 J. Timmis, M. Neal, J. Hunt. Data Analysis with Artificial Immune Systems and Cluster Analysis and Kohonen Networks: Some Comparisons. Proceedings of the IEEE SMC. Tokyo, Japan, 1999:922-927
    20 J. Timmis, M. Neal, J. Hunt. An Artificial Immune System for Data Analysis. Biosystems. 2000, 55(I-3):143-150
    21 J. Timmis, M. Neal. A Resource Limited Artificial Immune System for Data Analysis. Knowledge Based Systems. 2001, 14(3-4):121-130
    22 A. Watkins. A Resource Limited Artificial Immune Classifier. Dissertation for the Master’s Degree of Mississippi State University, USA, 2001
    23 O. Nasraoui, D. Dasgupta, F. Gonzalez. The Promise and Challenges of Artificial Immune System Based Web Usage Mining: Preliminary Results. Proceedings of the SIAM Workshop on Web Analytics. Arlington, VA, 2002:29-39
    24 J. Timmis, T. Knight, L. N. Decastro. An Overview of Artificial Immune Systems. Computation in Cells and Tissues: Perspectives and Tools for Thought. Springer. 2004:51-86
    25 Secker, Andrew, A. Freitas, J. Timmis. Towards a Danger Theory Inspired Artificial Immune System for Web Mining. Web Mining: Applications and Techniques. Idea Group. 2005:145-68
    26 D. W. Scott. Multivariate Density Estimation. John Wiley&Sons, 1992
    27 R. Kohave, G. H. John. Wrappers for feature subset selection.Artificial Intelligence. 1997, 97(1-2):273-324
    28 M. Devaney, A. Ram. Efficient feature selection in conceptual clustering. Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc. San Francisco, CA, USA, 1997:92-97
    29 Agrawal, R, et al. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conf. On Management of Data, ACM Press, 1998:94-105
    30 M. A. Hall, G. Holmes. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. IEEE Transactions on Knowledge and Data Engineering, Nov 2003, 15(6):1437-1447
    31 K. Chakrabarti, et al. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans on Database Systems, 2002, 27:188-228
    32 C. Faloutsos, M. Ranganathan, Y. Manolopoulos. Fastmap:A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Proc. of ACM DIGMOD International Conference on Management of Data, ACM Press, New York, NY, USA, 1995:163-174
    33 E. Bingham, H. Mannila, Random projection in dimensionality reduction: Applications to image and text data. Proceedings of the Seventh ACM SIGKDD International Conf on Knowledge Discovery and Data Mining, ACM Press, 2001:245-250
    34 A. L. Blum, P. Langley. Selection of relevant features and examples in machine learning, Artificial Intelligence ,1997, 97:245-271
    35 K. Wettschereck, T. Mohori, D. W. Aha. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, Feb 1997, 11(1-5):273-314
    36 S. K. Pal, R. K. De, J. Basak. Unsupervised feature evaluation: A Neuro-Fuzzy Approach, IEEE Trans. NN, 2000, 11:266-376
    37 P. Mitra, C. A. Murthy, S. K. Pal. Unsupervised feature selection using feature similarity, IEEE Trans on PAMI, Mar 2002, 24(3):301-312
    38 M. Halkidi, Y. Batistakis, M. Vazirgiannis. On Clustering ValidationTechniques. Intelligent Information Systems ,2001, 17(2-3):107-145
    39 S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6):391-407
    40 M. W. Berry, Z. Drmac, E. R. Jessup. Matrices, Vector Spaces, and Information Retrieval. SIAM Review, 1999, 41(2):335-362
    41 M. W. Berry, S. T. Dumais, G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 1995, 37(4):573-595
    42 P. Wiemer-Hastings, How Latent is Latent Semantic Analysis? Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, San Francisco: Morgan Kaufmann. 1999:932-937
    43 G. H. GOLUB, C. REINSCH. Singular value decomposition and least squares solutions. Numerische Mathematik, 1970, 14(5):403-420
    44 J. H. WILKINSON. The algebraic eigen value problem. Oxford: Clarendon Press, 1965
    45 O’Leary, D. P. Peleg. Digital Image Compression by Outer Product Expansion. IEEE Trans. on Communications. 1983, 31:441–444
    46 T. G. Kolda, D. P. O’Leary. Algorithm 805: Computation and Uses of the emidiscrete Matrix Decomposition. ACM Transactions on Mathematical Software. 2000, 26(3):415–435
    47 M. W. Berry, Theresa Do, Gavin O'Brien, Vijay Krishna, Sowmini Varadhan SVDPACKC: Version 1.0 User's Guide [M]. Tech. Report CS-93-194, University of Tennessee, Knoxville, TN, 1993
    48 C. D. Manning, H. Schütze.统计自然语言处理基础.苑春法,李庆中.电子工业出版社,2005:344-350
    49 L. N. Castro, F. J. Zuben. Artificial Immune Systems: Part I-Basic Theory and Applications. Technical Report DCA–RT 01/99. 1999, (01):13
    50 F. M. Burnet. The Clone Selection Theory of Acquired Immunity. Cambridge University Press. 1959
    51 N. Jerne. Towards a Network Theory of the Immune System. Annals of Immunology (Inst.Pasteur). 1974:373-389
    52 J. D. Farmer, N. H. Packard, A. S. Perelson. The Immune System, Adaptation, and Machine Learning. Physical 22D. 1986:187-204
    53 De Castro, J. Femando. An Evolutionary Immune Network for Data Clustering. Proc 6th Brazilian Symposium on Neural Networks. Riode Janeiro, Brazil, 2000:84-89
    54 Y. H. Wu, Y. C. Chen, A. L. P. Chen. Enabling Personalized Recommenda- tion on the Web Based on User Interests and Behaviors. Proceedings of the 11th International Workshop on Research Issues in Data Engineering. 2001:17-24
    55 A. Scime. Web Mining: Applications and Techniques. Idea Group Publish- ing, 2004
    56 S. Loh, L. K. Wives, J. P. M. de Oliveira. Concept-Based Knowledge. Discovery in Texts Extracted from the Web, ACM SIGKDD Explorations Newsletter, 2000, 2(1)
    57 T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: a new data clustering algorithm and its applications. Journal of Data Mining and Knowledge Discovery. 1997, 1(2):141—182

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700