Research on Several Problems in Text Clustering Analysis
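Abstract
Faced with large-scale, high-dimensional text data, how to build effective and scalable text clustering algorithms is a research hotspot in the field of data mining. To address these problems, this dissertation studies several issues involved in text clustering analysis in some depth, mainly covering the following aspects:
     A new text clustering algorithm based on projection pursuit is proposed. The method uses a genetic algorithm to find the optimal projection direction and projects the text feature space onto a one-dimensional space, so that the structural features of the data are displayed intuitively and text clustering analysis can be visualized.
     To address the high dimensionality of text feature vectors and the need of methods such as k-means to fix the number of clusters in advance, RPCL text clustering algorithms based on LSA, CI, RP, and NMF are proposed. LSA or one of the other methods is first applied to reduce the dimensionality of the text feature matrix, and the RPCL algorithm is then used to cluster the texts. These new methods not only reduce dimensionality effectively but also overcome the difficulty that methods such as k-means must determine the number of clusters beforehand.
     Based on the vector space model, a new text feature selection model based on double-word association is proposed. On top of the vector space model, this model adds the double-word association information of the text, so that the text feature information contained in the vector space model is richer and more accurate. When combined with latent semantic analysis for dimensionality reduction, it not only reduces the dimensionality effectively but also further reduces noise and highlights the semantic features of the text, thereby improving the quality of text mining.
     Based on the Document Index Graph feature model, a new phrase-based similarity calculation method is proposed, and a transformation function is used to adjust the document similarity values so that they become more distinguishable, which benefits text clustering analysis, classification, and other processing.
     The suffix tree based clustering method is applied to Chinese text clustering. This method regards a text as a collection of phrases and expresses the similarity relations between texts through suffixes, thereby clustering the texts. It can handle the clustering of multi-topic texts and overcomes the strict partitioning of texts imposed by hard clustering algorithms such as k-means, realizing soft clustering of texts.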
Facing massive, high-dimensional text data, how to build effective and scalable text clustering algorithms is a research focus in data mining. To address these issues, this dissertation studies several basic problems of text clustering in depth, as follows.
     A new projection pursuit based text clustering algorithm is proposed. It uses a genetic algorithm to search for the optimal projection direction and projects the high-dimensional text feature vectors into a low-dimensional space. The structural features of the texts can then be shown intuitively, and the results of text clustering can be visualized.
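As a rough illustration of the idea in the preceding paragraph, the sketch below searches for a one-dimensional projection direction with a very small genetic algorithm. It is not the dissertation's algorithm: the abstract does not state the projection index, so plain variance of the projected values is used here as a stand-in, and the names `unit`, `project`, `projection_index`, and `ga_search`, as well as the population size, mutation scale, and toy data, are all assumptions made for this example.

```python
import math
import random


def unit(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def project(data, direction):
    """Project every row of `data` onto `direction`, giving 1-D values."""
    return [sum(x * a for x, a in zip(row, direction)) for row in data]


def projection_index(data, direction):
    """Stand-in projection index (assumption): variance of the projections."""
    z = project(data, direction)
    mean = sum(z) / len(z)
    return sum((v - mean) ** 2 for v in z) / len(z)


def ga_search(data, pop_size=20, generations=40, seed=0):
    """Tiny elitist GA over unit directions; returns the best direction found."""
    rng = random.Random(seed)
    dim = len(data[0])
    pop = [unit([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population unchanged.
        pop.sort(key=lambda a: projection_index(data, a), reverse=True)
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            p, q = rng.sample(elite, 2)
            # Crossover (average of two parents) plus Gaussian mutation.
            child = [(x + y) / 2 + rng.gauss(0, 0.1) for x, y in zip(p, q)]
            children.append(unit(child))
        pop = elite + children
    return max(pop, key=lambda a: projection_index(data, a))


# Toy data: two groups separated along the first coordinate.
data = [[-3.0, 0.1], [-2.9, -0.1], [3.0, 0.1], [3.1, -0.1]]
best = ga_search(data)
print(sorted(project(data, best)))  # 1-D values; groups show up as gaps
```

Sorting or plotting the projected values is what makes the cluster structure visible in one dimension; a density-aware index would normally replace the variance used here.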
     To address the problems of high dimensionality and of having to fix the number of clusters in advance, several RPCL text clustering algorithms based on LSA, CI, RP, and NMF are also proposed. They first reduce dimensionality with LSA or one of the other methods and then cluster the texts with RPCL. These methods not only reduce dimensionality effectively, but also overcome the need to specify the number of clusters in advance.
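To make the RPCL step in the preceding paragraph more concrete, here is a minimal sketch of one rival penalized competitive learning update as it is usually described: the closest unit (the winner) moves toward the input, and the second closest (the rival) is pushed slightly away. The function name `rpcl_step`, the learning rates, and the toy vectors are assumptions for this example; in the dissertation's setting the inputs would be the dimension-reduced document vectors.

```python
def rpcl_step(centers, wins, x, lr_win=0.1, lr_rival=0.01):
    """Apply one RPCL update in place and return (winner, rival) indices.

    centers: list of unit weight vectors (lists of floats)
    wins:    how many times each unit has won so far (start at 1)
    x:       one input vector
    """
    total = sum(wins)

    def cost(j):
        # Squared distance weighted by relative winning frequency, so that
        # units which have won often are handicapped.
        dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centers[j]))
        return wins[j] / total * dist2

    order = sorted(range(len(centers)), key=cost)
    winner, rival = order[0], order[1]
    # The winner learns: move toward the input.
    centers[winner] = [c + lr_win * (xi - c) for c, xi in zip(centers[winner], x)]
    # The rival is penalized: move away from the input, at a much smaller rate.
    centers[rival] = [c - lr_rival * (xi - c) for c, xi in zip(centers[rival], x)]
    wins[winner] += 1
    return winner, rival


centers = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
wins = [1, 1, 1]
print(rpcl_step(centers, wins, [0.2, 0.0]), centers)
```

When this step is repeated over the data with more units than true clusters, the surplus units tend to be driven away from the data, which is why the number of clusters does not have to be fixed beforehand.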
     Based on the Vector Space Model, a new text feature selection model based on double-word relations is proposed in this dissertation. The model adds double-word relation information of the texts to the Vector Space Model so that it contains richer and more exact text feature information. Combined with Latent Semantic Analysis, it not only reduces dimensionality effectively, but also suppresses noise and makes the semantic features of the text stand out, thereby improving the quality of text mining.
     Based on the Document Index Graph feature representation model, a new phrase-based text similarity calculation method is proposed, in which the similarity values are adjusted with a suitable transformation function to make them more distinguishable, which benefits text clustering analysis and classification.
     Suffix Tree Clustering is applied to Chinese text clustering. It treats a text as a set of phrases and expresses the similarity between texts through a suffix tree. This approach can handle multi-topic text clustering, overcomes the problem of a predefined number of clusters, and realizes soft text clustering.
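The phrase-sharing idea behind Suffix Tree Clustering described above can be sketched as follows. This is a simplification under stated assumptions, not the dissertation's implementation: shared phrases are found by enumerating word n-grams rather than by building a real suffix tree, texts are assumed to be already tokenized with spaces (Chinese text would first need word segmentation), and the names `phrase_index` and `soft_clusters` and the 0.5 overlap threshold are choices made for this example.

```python
from itertools import combinations


def phrase_index(docs, max_len=3):
    """Map every word n-gram (up to max_len words) to the set of docs containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index.setdefault(" ".join(words[i:i + n]), set()).add(doc_id)
    return index


def soft_clusters(docs, min_docs=2, overlap=0.5, max_len=3):
    """Merge phrase-based base clusters whose document sets overlap strongly."""
    # Base clusters: document sets of phrases shared by at least min_docs docs.
    base = [ids for ids in phrase_index(docs, max_len).values() if len(ids) >= min_docs]
    parent = list(range(len(base)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link two base clusters when the shared docs exceed `overlap` of both.
    for a, b in combinations(range(len(base)), 2):
        shared = len(base[a] & base[b])
        if shared > overlap * len(base[a]) and shared > overlap * len(base[b]):
            parent[find(a)] = find(b)

    merged = {}
    for i, ids in enumerate(base):
        merged.setdefault(find(i), set()).update(ids)
    return sorted(sorted(ids) for ids in merged.values())


docs = [
    "stock market crash",
    "stock market rally",
    "suffix tree clustering",
    "suffix tree index",
    "stock market suffix tree",
]
print(soft_clusters(docs))  # [[0, 1, 4], [2, 3, 4]]
```

Document 4 appears in both clusters, which is the soft, multi-topic behavior the paragraph refers to; the number of clusters also falls out of the merging step instead of being set in advance.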
