Research on Key Technologies of Term Definition Extraction in the Aviation Domain and Their Application

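English title: Research on Definition Extraction in Aviation Domain and its Application
Author: Pan Xu (潘湑)
Thesis level: Doctoral
Discipline: Vehicle Operation Engineering
Chinese keywords (translated): definition extraction ; information extraction ; corpus ; imbalanced data classification ; over-sampling ; feature selection ; multi-level features ; combined features ; automatic item generation
English keywords: definition extraction ; information extraction ; corpus ; unbalanced data classification ; over-sampling ; feature selection ; multi-level feature ; combined feature ; automatic item generation
Degree year: 2011
Advisor: Gu Hongbin (顾宏斌)
Discipline code: 082304
Degree-granting institution: Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
Thesis submission date: 2011-10-01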

Abstract

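As an important component of advanced training technology, CBT (Computer Based Training) systems play an important role in pilot training and maintenance training in the civil aviation industry. Flight CBT is already widely used by airlines in China and abroad, and deploying a maintenance CBT system is also a prerequisite for second-level (intermediate) maintenance organizations in China. This dissertation addresses the key technologies needed, during CBT system development, to acquire domain knowledge from professional literature by means of term definition extraction, and explores how definition knowledge can be applied in intelligent CBT systems. The main research content is as follows:
     (1) Building an experimental corpus for term definition extraction. A corpus is a problem that every natural language processing study must solve, yet no ready-made corpus dedicated to Chinese term definition extraction in the aviation domain exists in China or abroad, so the first task of this dissertation is to build an experimental corpus. Based on the experimental requirements, the scale of the first-stage corpus was fixed, a development specification for the corpus was established, and supporting software was developed. Detailed statistics on the corpus were also compiled as the basis for the subsequent research.
     (2) Determining the basic method for term definition extraction. Because the research goals differ, earlier methods designed for automatic question answering and search-engine ranking are not applicable here. Since term definitions are distributed extremely unevenly in the corpus, the balanced random forest method is proposed for term definition extraction. Because randomly generating synthetic samples when building a balanced training set cannot effectively consolidate the boundaries of regions where the minority class is densely distributed, a resampling strategy defined on the distance distribution of instances is proposed. Compared with random resampling, it improves the F1-measure and F2-measure of definition extraction.
     (3) Improving the feature selection method for term definition extraction. To address two problems in term definition extraction corpora, imbalanced data distribution and small disjuncts within definition sentences, a feature selection method based on between-class and within-class distribution differences is proposed. The method overcomes a shortcoming of traditional feature selection functions, which rely on word-frequency statistics and mainly measure the between-class distribution difference of features. Experiments show that, when applied to the balanced random forest method, it reaches the same F1-measure and F2-measure as traditional filter methods with fewer features.
     (4) Extracting definitions with multi-level linguistic features. The dissertation surveys how multi-level linguistic features are used in different subtasks of information extraction. In definition extraction, the lack of a quantitative method has prevented linguistic features from being fully exploited. To address this, a combination entropy, based on information entropy and computed over combinations of features from different levels, is proposed to measure the effect of such feature combinations on definition extraction, and it is combined with the feature selection framework above to screen multi-level features. The method offers a computable way to study the role of linguistic features at different levels in definition extraction and to use them for extraction. Experiments confirm the correctness and effectiveness of the method.
     (5) Designing and implementing an intelligent CBT assessment system. Existing AIG (Automatic Item Generation) technology is poorly suited to generating items for specialized domains, and its distractors are only weakly confusing. Building on several knowledge representations obtained by processing definition knowledge, this dissertation designs an automatic item generation and evaluation system that generates item stems from a sentence-template library and a knowledge-point library and generates distractors from a domain ontology. The approach effectively meets the need of CBT systems for automatic assessment and evaluation of professional knowledge, and can substantially reduce the work required to develop item banks and assemble test papers.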
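As an illustration of item (2), the sketch below shows the per-tree sampling step commonly associated with balanced random forests (an equal number of bootstrap draws from the minority and majority classes) and the F-beta formula behind the F1-measure and F2-measure. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the toy labels, and the confusion counts are all hypothetical, and the distance-based resampling strategy proposed in the dissertation is not reproduced here.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, minority_label, rng):
    """Return sample indices for one tree: equal draws (with replacement) per class."""
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n = len(minority)
    # Bootstrap the minority class, then draw the same number of majority cases.
    return ([rng.choice(minority) for _ in range(n)]
            + [rng.choice(majority) for _ in range(n)])


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from counts; beta > 1 weights recall more heavily than precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up labels: 1 = definition sentence (rare), 0 = any other sentence.
labels = [1] * 5 + [0] * 95
sample = balanced_bootstrap(labels, minority_label=1, rng=random.Random(0))
class_counts = Counter(labels[i] for i in sample)

# Hypothetical confusion counts, used only to show how F1 and F2 differ.
f1 = f_beta(tp=40, fp=20, fn=10, beta=1)
f2 = f_beta(tp=40, fp=20, fn=10, beta=2)
```

With these counts recall (0.8) exceeds precision (about 0.67), so F2 comes out higher than F1. Reporting both measures shows how a classifier trades missed definitions against false alarms.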
CBT (Computer Based Training) systems, as part of advanced training technology, play an important role in pilot training and maintenance training in civil aviation. Flight CBT products are widely used by airlines at home and abroad, and deploying a maintenance CBT system is a prerequisite for intermediate maintenance units. The work in this paper centers on the key technologies for obtaining professional knowledge from professional literature using term definition extraction techniques. It also explores how knowledge extracted from professional literature can be applied in the development of intelligent CBT systems. The contributions of this dissertation are summarized as follows:
     Firstly, a corpus is a basic resource for all natural language processing research, but no ready-made corpus is available at home or abroad for the study of term definition extraction. The primary task of this paper is therefore to construct a corpus for experiments. According to the experimental requirements, this paper establishes the construction scale and standard of the first-stage corpus and develops the corresponding software. It also compiles detailed statistics on the corpus as the basis for further study.
     Secondly, definition extraction is treated as an imbalanced data classification problem. Because the research purposes differ, solutions that retrieve definitions for question answering or rank them for search engines do not apply here. In view of the imbalanced distribution of term definitions in the corpus, a method based on balanced random forests is proposed to extract definitions from the corpus. A novel over-sampling strategy based on the distance distribution information of instances is proposed to address the problem that, when building a balanced training set, randomly generated synthetic instances cannot effectively consolidate the borders of regions where minority-class instances are dense. Experiments show that it improves the F1-measure and F2-measure of definition extraction.
     Thirdly, the feature selection method for definition extraction is improved. To address the imbalanced distribution of data and the small disjuncts in definition sentences, a new feature selection method is defined based on the between-class and within-class distribution differences of features. The new method remedies the shortcoming of traditional methods, whose evaluation relies on word frequency statistics. Experiments show that a BRF (balanced random forest) classifier using the new method achieves the same results with fewer features in extracting definitions.
     Fourthly, definitions are extracted using multi-level linguistic features. The use of multi-level linguistic features in different subtopics of information extraction is first summarized. For lack of a quantitative method, multi-level linguistic features could not be fully used in extracting definitions. In this paper, a method based on the entropy of feature combinations is proposed to calculate the impact of different combinations on definition extraction. The method provides a computable way to evaluate linguistic features of different levels in extracting definitions. Experiments show the correctness and validity of this method.
     Finally, an intelligent assessment system for CBT is designed and implemented. Existing AIG (Automatic Item Generation) technology is not conducive to generating questions for professional fields, and its distractors are not confusing enough. In this paper, a novel AIG system is designed to solve this problem. The system generates items using a variety of knowledge representations and sentence templates, and generates distractors using a domain ontology. These resources are derived from the extracted definitions. The new design effectively meets the demand of CBT systems for automatic assessment and evaluation of professional knowledge, and eases the workload of developing item banks and examination papers.
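The item generation idea in the last contribution can be illustrated roughly as follows: an item stem is filled from a sentence template and a stored definition, and distractors are drawn from sibling concepts in a domain ontology so that they are plausible but wrong. This is only a sketch under stated assumptions. The template, the knowledge-point entry, the toy ontology, and the function names are all hypothetical, and the abstract does not describe the actual resources or selection rules of the system.

```python
import random

# Hypothetical toy resources standing in for the template library,
# knowledge-point library, and domain ontology mentioned in the abstract.
TEMPLATES = ["Which of the following terms is defined as: {definition}?"]
KNOWLEDGE_POINTS = {
    "aileron": "a hinged surface on the trailing edge of a wing used to control roll",
}
ONTOLOGY_PARENT = {  # term -> parent concept (made-up fragment)
    "aileron": "flight control surface",
    "elevator": "flight control surface",
    "rudder": "flight control surface",
    "flap": "flight control surface",
    "altimeter": "flight instrument",
}


def siblings(term):
    """Terms sharing the same parent concept, excluding the term itself."""
    parent = ONTOLOGY_PARENT[term]
    return sorted(t for t, p in ONTOLOGY_PARENT.items() if p == parent and t != term)


def generate_item(term, n_distractors=3, rng=None):
    """Build one multiple-choice item whose correct answer is `term`."""
    rng = rng or random.Random()
    stem = rng.choice(TEMPLATES).format(definition=KNOWLEDGE_POINTS[term])
    # Sibling concepts are close in meaning, which makes them confusable.
    distractors = rng.sample(siblings(term), n_distractors)
    options = distractors + [term]
    rng.shuffle(options)
    return {"stem": stem, "options": options, "answer": term}


item = generate_item("aileron", rng=random.Random(0))
```

Drawing distractors from siblings rather than from arbitrary terms keeps "altimeter" out of the options here, since it belongs to a different branch of the toy ontology.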

