Text Difficulty Measurement for English Learning
Abstract
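Measuring the difficulty of English texts is an important topic in applied linguistics and information processing, and it is widely applied in teaching, publishing, search engines, and other fields. Online resources are now abundant, and the greatest challenge for text difficulty measurement is how to efficiently and accurately provide English learners at different levels with reading material suited to their level.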
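     This paper first introduces a text difficulty measurement method that is widely used internationally: readability formulas. A readability formula typically judges the difficulty of a text from its lexical difficulty and its syntactic difficulty, where lexical difficulty is measured by word frequency and word length, and sentence difficulty is measured by sentence length. More than a hundred readability formulas now exist. This paper selects three typical ones, the Flesch Reading Ease, the Gunning Fog Index, and the Automated Readability Index, and validates them on a set of texts. Although readability formulas are easy to apply, the computed values are too concentrated to support a division into difficulty levels.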
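To make the three formulas concrete, the following minimal Python sketch computes them with their commonly published coefficients. The tokenization and the vowel-group syllable counter are rough assumptions introduced here for illustration (the source does not say how words, sentences, or syllables were counted), and the function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption): count vowel groups,
    then drop one for a trailing silent 'e' that is not part of 'le'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability_scores(text: str) -> dict:
    """Return Flesch Reading Ease, Gunning Fog Index, and ARI for a text."""
    # Sentences are split on runs of '.', '!' or '?'; words are letter runs
    # with an optional apostrophe part (digits are ignored in this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    n_sent = len(sentences)
    n_words = len(words)
    n_syll = sum(count_syllables(w) for w in words)
    n_letters = sum(len(w.replace("'", "")) for w in words)
    # "Complex" words for the Fog index: three or more estimated syllables.
    n_complex = sum(1 for w in words if count_syllables(w) >= 3)

    words_per_sentence = n_words / n_sent
    return {
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * (n_syll / n_words),
        "gunning_fog": 0.4 * (words_per_sentence + 100.0 * n_complex / n_words),
        "ari": 4.71 * (n_letters / n_words) + 0.5 * words_per_sentence - 21.43,
    }
```

A higher Flesch Reading Ease means an easier text, while higher Gunning Fog and ARI values mean a harder one. All three depend only on average word and sentence length statistics.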
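     This paper then attempts to build a broadly applicable model for measuring text difficulty. The vector space model is a typical text representation: it ignores word order and represents a text as a vector in a vector space, so the similarity between texts can be computed from the inner product or the cosine of the angle between their vectors, which is convenient to implement. This paper measures text difficulty on the basis of the vector space model and treats the task as a classification problem. This approach has several advantages: first, its result is not a binary value but a probability over the whole training set; second, it provides additional information. Several commonly used feature selection methods are analyzed and verified experimentally, namely document frequency, information gain, mutual information, the χ² statistic, expected cross entropy, weight of evidence for text, and odds ratio. The results show that odds ratio performs best and mutual information worst. The paper also analyzes the shortcomings of the TF-IDF weighting algorithm and proposes an improved weighting algorithm that combines TF-IDF with between-class and within-class distribution information. Experimental results show that the improved weighting raises the classification F1 score.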
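The vector space representation described above can be sketched as follows. This is a minimal illustration using plain TF-IDF (raw term frequency times log inverse document frequency), not the paper's improved weighting, whose exact form is not given here. The function names and the sparse-dictionary layout are assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors.

    docs: list of token lists. Returns one {term: weight} dict per document.
    Word order is ignored: only term counts matter.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this representation, two texts that share distinctive vocabulary get a cosine close to 1, and texts with no terms in common get 0.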
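     Finally, three classification algorithms are examined: Rocchio's algorithm, k-nearest neighbors, and naive Bayes. Their performance is tested experimentally, and the results show that the multinomial naive Bayes method achieves the highest classification F1, above 80%.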
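As an illustration of the best-performing classifier family, here is a minimal multinomial naive Bayes sketch with Laplace smoothing. It follows the textbook formulation and is not the paper's implementation. The class name, the `alpha` smoothing parameter, and the toy difficulty labels in the usage note are assumptions.

```python
import math
from collections import Counter


class MultinomialNB:
    """Textbook multinomial naive Bayes over tokenized documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing constant

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {t for doc in docs for t in doc}
        self.term_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        class_docs = Counter(labels)
        self.log_prior = {
            c: math.log(class_docs[c] / len(docs)) for c in self.classes
        }
        self.total = {c: sum(self.term_counts[c].values()) for c in self.classes}
        return self

    def _log_likelihood(self, term, c):
        # Smoothed estimate of P(term | class).
        num = self.term_counts[c][term] + self.alpha
        den = self.total[c] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict_proba(self, doc):
        """Posterior probability of each class; unseen terms are skipped."""
        scores = {
            c: self.log_prior[c]
            + sum(self._log_likelihood(t, c) for t in doc if t in self.vocab)
            for c in self.classes
        }
        # Normalize in a numerically stable way (subtract the max log score).
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}

    def predict(self, doc):
        proba = self.predict_proba(doc)
        return max(proba, key=proba.get)
```

For example, after `MultinomialNB().fit(docs, labels)` with labels such as `"easy"` and `"hard"`, `predict_proba(tokens)` returns a probability for every difficulty level rather than a single yes/no decision, which matches the advantage noted for the classification approach.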
Measuring the difficulty of English text is an important topic in applied linguistics and information processing, and it is widely used in teaching, publishing, search engines, and other fields. Because reading material on the web is so abundant, efficiently finding material at the right level for each reader is a challenge for text difficulty measurement.
     This paper first introduces an internationally widely used approach that measures text difficulty with readability formulas. The widely used readability formulas usually have only two variables: word length or word frequency, and average sentence length. We chose three formulas, the Flesch Reading Ease, the Gunning Fog Index, and the Automated Readability Index, and tested them on data of different levels. The results of this method were very poor, so it cannot be used to measure text difficulty.
     Therefore, we focus on building a broadly applicable model for measuring text difficulty. The vector space model is a typical text representation that ignores term order and expresses a text as a vector in a vector space. A text is scored by computing its similarity to the samples with the cosine of the angle between vectors, which is easy to implement. This paper builds on the vector space model and treats text difficulty measurement as a classification problem. This method has several advantages. One is that its result is not a binary value but a probability over the entire training set. Another is that it provides additional information, such as the distribution of terms. For feature selection, this paper analyzes several commonly used methods: document frequency, information gain, mutual information, the CHI (χ²) statistic, expected cross entropy, weight of evidence for text, and odds ratio. The results show that odds ratio performs best and mutual information worst. This paper also discusses the traditional term weighting algorithm, TF-IDF, and introduces between-class and within-class factors into the term weighting. Experimental results show that the improved algorithm outperforms the traditional method in F1.
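Since odds ratio is reported as the best feature selection method, a small sketch of one common definition may help: for a term t and a target class c, the score is log[P(t|c)(1 − P(t|¬c)) / ((1 − P(t|c)) P(t|¬c))], with the probabilities estimated from document counts. The add-0.5 smoothing, the function names, and the tie-breaking rule are assumptions made here to keep the example well defined. The paper's exact variant is not stated.

```python
import math


def odds_ratio_scores(docs, labels, target):
    """Score every term by its log odds ratio for the target class.

    docs: list of token lists; labels: one class label per document.
    P(t|c) is the smoothed fraction of class-c documents containing t.
    """
    pos = [set(d) for d, y in zip(docs, labels) if y == target]
    neg = [set(d) for d, y in zip(docs, labels) if y != target]
    if not pos or not neg:
        raise ValueError("need documents both inside and outside the target class")

    vocab = set().union(*pos, *neg)
    scores = {}
    for term in vocab:
        # Add-0.5 smoothing keeps both probabilities strictly inside (0, 1).
        p_pos = (sum(term in d for d in pos) + 0.5) / (len(pos) + 1.0)
        p_neg = (sum(term in d for d in neg) + 0.5) / (len(neg) + 1.0)
        scores[term] = math.log((p_pos * (1.0 - p_neg)) / ((1.0 - p_pos) * p_neg))
    return scores


def top_features(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that appear mostly in the target class get large positive scores, and terms spread evenly across classes score near zero, so keeping only the top-scoring terms shrinks the vector space before classification.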
     Finally, this paper examines three classification algorithms: Rocchio's algorithm, K-Nearest-Neighbor, and Naive Bayes. Experimental results indicate that the multinomial naive Bayes method achieves the highest classification F1, above 80%.
