Applications of Data Mining Techniques to Text Classification and Bioinformatics

English title: Applications of Data Mining Techniques to Text Classification and Bioinformatics
Author: Pei Zhili (裴志利)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords: data mining; text classification; bioinformatics; feature selection; feature weight; rough set; gene function annotation; population evolution
Degree year: 2008
Advisor: Liang Yanchun (梁艳春)
Discipline code: 081203
Degree-granting institution: Jilin University
Submission date: 2008-04-01
Chair of the defense committee: Sun Tieli (孙铁利)

Abstract

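Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large, incomplete, noisy, fuzzy, and random databases. It is a broad interdisciplinary field that involves machine learning, mathematical statistics, artificial intelligence, neural networks, databases, pattern recognition, rough sets, fuzzy mathematics, and related techniques. Building on several data mining techniques, this thesis makes the following contributions: (1) improved methods are proposed to address the shortcomings of the standard mutual information and tf.idf feature weighting formulas, and simulation experiments show that the improved methods clearly raise macro precision, macro recall, and macro F1; (2) to address the blindness of the standard tf.idf method in estimating feature weights, a feature frequency importance weighting method based on rough set theory over the real domain is proposed, and simulation experiments show that this weighting improves the distribution of the sample space, making samples of the same class more compact and samples of different classes more dispersed, and clearly improves text classification performance; (3) to address the high-dimensional feature space and high feature redundancy in text classification, a feature selection method based on mutual information and information entropy pairs is proposed, and simulation experiments show that text classification based on this method is more effective than with the MI and CHI methods, with classification performance close to that of the support vector machine, which represents the state of the art in classification; (4) because computational function annotation of newly sequenced biological sequences currently performs poorly, a method based on variable precision rough set theory, using the GO database and the BLAST program, is proposed for annotating the function of new biological sequences, and simulation experiments show that the method achieves relatively high precision, recall, and harmonic mean; (5) to address the limitations of current methods for studying human population evolution, a new method is proposed for establishing evolutionary relationships among human populations from Y-chromosome SNP genotype frequency data, and simulation experiments show that the method supports the "Out of Africa" hypothesis and offers a new line of thought for research on human population evolution.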
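The macro-averaged measures named above can be illustrated with a minimal sketch. It assumes the common definitions: precision, recall, and F1 are computed per class and then averaged with equal weight per class. The function name `macro_scores` and the toy labels are illustrative and do not come from the thesis, which may define macro F1 differently (for example, as the harmonic mean of macro precision and macro recall).

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over the classes in y_true.

    Assumption: single-label classification; macro F1 is the mean of per-class F1.
    """
    classes = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example (made-up labels): two classes, one misclassified document.
p, r, f = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(round(p, 4), round(r, 4), round(f, 4))
```

Because every class counts equally, macro averages are sensitive to performance on small classes, which is one reason they are often reported on skewed collections such as Reuters-21578.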
Data mining is the process of extracting hidden, potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. It is an interdisciplinary subject that draws on machine learning, statistics, artificial intelligence, artificial neural networks, databases, pattern recognition, rough sets, fuzzy mathematics, and related techniques. This thesis studies applications of data mining techniques to text classification and bioinformatics. For text classification, it makes three main contributions: a method that integrates feature selection with feature weight evaluation; a feature selection method that accounts for redundant features; and a feature frequency weighting method based on rough set theory over the real domain. For bioinformatics, it makes two main contributions: a gene annotation method based on the Variable Precision Rough Set; and a method for constructing the evolutionary tree of human populations from Y-chromosome SNP frequency data. The details are as follows:
     (1) Because most low-frequency words are noise, a method for filtering low-frequency words is proposed. Experimental results show that this method improves the effectiveness of text classification. Improved versions of the mutual information (MI) feature selection method and the tf.idf feature weighting method are then developed. The improved methods are applied to the benchmark text collection Reuters-21578 Top10 using the Rocchio, kNN, and SVM classifiers. Numerical results show that the combination of the two improved methods is effective: macro precision, macro recall, and macro F1 are all superior to those of the other methods.
     (2) An important concept, the importance degree of feature frequency, is defined on the basis of rough set theory over the real domain. Based on this concept, a novel feature frequency weighting method is proposed. It takes decision information into account when evaluating the contribution of feature frequency and therefore yields more objective evaluations. Experimental results show that the proposed method improves the distribution of the sample space, making samples of the same class more compact and samples of different classes more dispersed; macro precision, macro recall, and macro F1 are all significantly improved.
     (3) To address the high dimensionality of the feature space and the high feature redundancy in text classification, a feature selection method based on mutual information and information entropy pairs is developed. Using an information-based relationship constructed between features and classes, redundant features are greatly reduced according to the mutual entropy of feature pairs. Two machine learning methods, naive Bayes and kNN, are applied to the benchmark data sets Reuters-21578 Top10 and WebKB. Experimental results show that the proposed method is more effective than the MI and CHI methods.
     (4) Determining sequence functions experimentally is too expensive to be used for large-scale annotation. TOP BLAST is a simple and commonly used computational method. Compared with other computational methods, its precision, recall, and harmonic mean are all higher, but their absolute values are still low. This thesis proposes a sequence function annotation method that applies variable precision rough set theory, based on the GO database and the BLAST software. Numerical results show that the proposed method achieves higher macro precision than TOP BLAST, with macro recall and macro F1 similar to those of TOP BLAST.
     (5) Differences in the order of genome nucleotides reflect the evolutionary distance between populations. A phylogenetic tree constructed from the degree of difference between DNA molecules can corroborate the evolutionary relationships between populations established by traditional taxonomy. Because single nucleotide polymorphism (SNP) data preserve most of the information in the DNA molecule, and because most of the Y chromosome is a non-recombining region with a low mutation rate, Y-chromosome SNPs record evolutionary events faithfully. A new method is therefore developed to construct the evolutionary tree of human populations from Y-chromosome SNP frequency data. Numerical results show that the proposed method supports the "Out of Africa" hypothesis. The method offers a new idea for research on human evolution.
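The abstract does not say which tree-building algorithm the thesis applies to the SNP frequency data. As a hedged illustration of how a tree can be built from frequency vectors in general, the sketch below uses UPGMA (average-linkage clustering) with Euclidean distance; both choices are assumptions swapped in for illustration. The population names and frequency values are made up, and `upgma` is a hypothetical helper, not the author's method.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def upgma(freqs):
    """Build a UPGMA tree from {population: [allele frequencies]}.

    Returns the tree topology as nested tuples (branch lengths are omitted).
    """
    names = list(freqs)
    # Each cluster id maps to (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {
        (i, j): euclidean(freqs[names[i]], freqs[names[j]])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        tree_a, n_a = clusters.pop(a)
        tree_b, n_b = clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        merged = {}
        for k in clusters:
            d_ak = dist[tuple(sorted((a, k)))]
            d_bk = dist[tuple(sorted((b, k)))]
            merged[(k, next_id)] = (n_a * d_ak + n_b * d_bk) / (n_a + n_b)
        dist = {pair: d for pair, d in dist.items()
                if a not in pair and b not in pair}
        dist.update(merged)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Made-up frequencies of two SNP alleles in four hypothetical populations.
toy = {
    "pop_A": [0.9, 0.1],
    "pop_B": [0.8, 0.2],
    "pop_C": [0.2, 0.7],
    "pop_D": [0.1, 0.9],
}
print(upgma(toy))  # (('pop_A', 'pop_B'), ('pop_C', 'pop_D'))
```

UPGMA assumes a constant rate of change across lineages; neighbor joining relaxes that assumption and is a common alternative for the same kind of distance matrix.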
     To sum up, this thesis develops a method that integrates improved MI feature selection with improved feature weighting, a feature selection method that reduces feature redundancy, and a novel feature frequency weighting method based on rough set theory over the real domain. This work enriches the available methods for feature selection and feature weight evaluation and brings new ideas to the key techniques of text classification. The thesis also proposes a gene annotation method based on the variable precision rough set, which performs better on noisy data and helps advance automatic annotation. Finally, a new method is developed to construct the evolutionary tree of human populations from Y-chromosome SNP frequency data; it supports the well-known "Out of Africa" hypothesis and offers a novel idea for research on human evolution.
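For orientation, the standard forms of the two baselines that the thesis improves on can be sketched as follows. This is a minimal sketch of the textbook tf.idf weight, tf × log(N/df), and of the document-count estimate of mutual information between a term and a class, log(A·N / ((A+C)(A+B))); it is not the improved method proposed in the thesis, whose formulas the abstract does not give. The function names and the toy documents are illustrative.

```python
import math


def tfidf(docs):
    """Standard tf.idf: raw term frequency times log(N / document frequency).

    docs is a list of token lists; returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weights


def mutual_information(docs, labels, term, cls):
    """Estimate MI(term, cls) = log(A * N / ((A + C) * (A + B))).

    A: documents of cls containing term; B: other-class documents containing
    term; C: documents of cls without term; N: total number of documents.
    """
    n = len(docs)
    a = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)
    b = sum(1 for d, l in zip(docs, labels) if term in d and l != cls)
    c = sum(1 for d, l in zip(docs, labels) if term not in d and l == cls)
    if a == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log(a * n / ((a + c) * (a + b)))


# Made-up toy corpus with two classes.
docs = [["oil", "price"], ["oil", "oil", "export"],
        ["wheat", "price"], ["wheat", "harvest"]]
labels = ["crude", "crude", "grain", "grain"]

print(tfidf(docs)[1]["oil"])                              # 2 * log(2)
print(mutual_information(docs, labels, "oil", "crude"))   # log(2)
print(mutual_information(docs, labels, "price", "crude")) # 0.0
```

In this estimate a rare term that happens to occur only in one class receives a high MI score, which is consistent with the abstract's motivation for filtering low-frequency words before feature selection.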

