An Optimized Equating Design for the Test for English Majors Band 4 (TEM Band 4)
Abstract
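A recurring difficulty for large-scale language tests is ensuring that scores from different test forms are comparable, so that candidates who take different forms are not treated unfairly because the forms differ in difficulty, reliability or score distribution. Equating safeguards test fairness. It is also needed to overcome the limitations of test measurement, to make scores on different forms comparable and interchangeable, to keep test results consistent and stable, to keep test-based decisions objective, and to be fair to examinees in high-stakes testing. Nevertheless, equating methods have not yet been adopted as widely as they should be in many of China's large-scale tests, and related research remains very limited.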
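     This thesis first introduces the test equating theories that currently prevail in the international testing community, the main equating designs and the principal equating methods. It also briefly describes representative equating practices in large-scale language tests in China and other countries, such as the equating designs that TOEFL, GRE, the Chinese Proficiency Test (HSK), the English Proficiency Test and the College English Test Bands 4 and 6 have used for many years. Particular attention is given to the main equating methods built on classical test theory and item response theory (IRT), including mean equating, linear equating, equipercentile equating and IRT equating.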
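As a rough illustration of the two simplest methods named above, the sketch below applies mean equating and linear equating to raw scores. It assumes two randomly equivalent groups, which is a simplification and not the design used in the thesis; the function names and the score lists are hypothetical.

```python
from statistics import mean, pstdev

def mean_equate(x, scores_x, scores_y):
    """Mean equating: shift a Form X score by the difference in form means."""
    return x - mean(scores_x) + mean(scores_y)

def linear_equate(x, scores_x, scores_y):
    """Linear equating: match both the mean and the standard deviation."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return slope * (x - mean(scores_x)) + mean(scores_y)

# Hypothetical raw scores, invented for illustration only.
form_x = [10, 12, 14, 16, 18]
form_y = [13, 16, 19, 22, 25]

print(mean_equate(16, form_x, form_y))
print(linear_equate(16, form_x, form_y))
```

Mean equating assumes the two forms differ by a constant amount along the whole score scale, while linear equating also lets the spread of scores differ.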
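     Focusing on the usefulness, feasibility and applicability of test equating in the design of TEM Band 4, the thesis uses an experimental study to look for equating methods that TEM Band 4 could draw on. In a small-scale trial, the author selected a number of common items as the shared part of two experimental tests and administered two different tests to two groups of students under a common-item non-equivalent groups design. The raw data were computed and analyzed, and separate equating models were built on the mean equating method, the Tucker and Levine observed-score methods for non-equivalent groups, the equipercentile method, and the one-parameter Rasch method from IRT. The resulting equating data are instructive for the equating design of TEM Band 4 and lay a foundation for identifying an equating method suited to its actual conditions.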
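The Tucker observed-score method mentioned above can be sketched as follows for a common-item non-equivalent groups design. This is a minimal illustration using the standard textbook form of the method, not the author's actual computation. Group 1 takes Form X plus the anchor items, group 2 takes Form Y plus the same anchor items; all names, the default synthetic-population weights and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance

def pcov(a, b):
    """Population covariance of two equally long score lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def tucker_equate(x, x1, v1, y2, v2, w1=None):
    """Tucker linear equating of Form X score x onto the Form Y scale.

    x1, v1: total and anchor scores of group 1 (took Form X).
    y2, v2: total and anchor scores of group 2 (took Form Y).
    w1: weight of group 1 in the synthetic population
        (assumed proportional to sample size when omitted).
    """
    if w1 is None:
        w1 = len(x1) / (len(x1) + len(y2))
    w2 = 1 - w1
    g1 = pcov(x1, v1) / pvariance(v1)      # regression slope of X on V
    g2 = pcov(y2, v2) / pvariance(v2)      # regression slope of Y on V
    dmu = mean(v1) - mean(v2)              # anchor mean difference
    dvar = pvariance(v1) - pvariance(v2)   # anchor variance difference
    # Synthetic-population moments for both forms.
    mu_x = mean(x1) - w2 * g1 * dmu
    mu_y = mean(y2) + w1 * g2 * dmu
    var_x = pvariance(x1) - w2 * g1**2 * dvar + w1 * w2 * g1**2 * dmu**2
    var_y = pvariance(y2) + w1 * g2**2 * dvar + w1 * w2 * g2**2 * dmu**2
    return sqrt(var_y / var_x) * (x - mu_x) + mu_y

# Hypothetical scores, invented for illustration only.
x1, v1 = [20, 25, 30, 35, 40], [4, 6, 7, 8, 10]
y2, v2 = [22, 26, 31, 37, 44], [5, 6, 8, 9, 12]
print(tucker_equate(30, x1, v1, y2, v2))
```

When both groups perform identically on the anchor items, the adjustment terms vanish and the result reduces to ordinary linear equating.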
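     On the basis of this experimental study, the thesis draws the following conclusions from data analysis and comparison. To improve the objectivity, consistency and reliability of TEM Band 4 and the comparability of scores across forms, test equating should be introduced into TEM Band 4 so that scores on different forms are comparable and interchangeable, reliability stays stable and results stay consistent. The experiment shows that TEM Band 4 can achieve better reliability and validity through equating and can draw on the equating methods already used in large-scale language tests in China and abroad. Although identifying the best equating method for TEM Band 4 requires further research, the necessity and feasibility of introducing equating are beyond doubt. The results show that the Tucker observed-score linear method and the equipercentile method, both grounded in classical test theory, performed no worse in many respects than the IRT-based one-parameter Rasch method, and that the common-item non-equivalent groups design proved highly reliable, so it deserves consideration in the equating design of TEM Band 4. Even so, classical test theory alone cannot be the sole basis for judging whether an item is suitable for a TEM Band 4 paper. The thesis recommends that the equating design of TEM Band 4 combine classical methods with IRT methods so as to exploit the strengths of each and offset their weaknesses.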
Large-scale language tests constantly face the difficulty of guaranteeing that scores on different test forms are comparable or interchangeable. Test equating, the statistical process of adjusting scores on different test forms so that scores from the two forms are directly equivalent after conversion, is therefore deemed necessary to overcome measurement limitations, to make different test forms interchangeable, to ensure test consistency and objective decision-making, and to be fair to examinees in high-stakes testing.
     This thesis starts with an introduction to prevalent test equating theories, typical equating designs and representative equating practices applied in large-scale language tests in China and other countries. Special emphasis is laid upon the application of CTT (Classical Test Theory) and IRT (Item Response Theory) in the major equating approaches, including mean equating, linear equating, the equipercentile method and the IRT equating method.
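The equipercentile method listed above maps a Form X score to the Form Y score that has the same percentile rank. The sketch below is a simplified illustration for integer raw scores under assumed conventions (mid-point percentile ranks, linear interpolation, no smoothing); operational implementations usually add smoothing. The function names and parameters are hypothetical.

```python
from collections import Counter

def percentile_ranks(scores, max_score):
    """Mid-point percentile rank (0-100) of each integer score 0..max_score."""
    n = len(scores)
    counts = Counter(scores)
    ranks, below = [], 0
    for s in range(max_score + 1):
        # Everyone below s, plus half of those scoring exactly s.
        ranks.append(100 * (below + counts[s] / 2) / n)
        below += counts[s]
    return ranks

def equipercentile_equate(x, scores_x, scores_y, max_x, max_y):
    """Form Y score whose percentile rank matches integer Form X score x."""
    target = percentile_ranks(scores_x, max_x)[x]
    pr_y = percentile_ranks(scores_y, max_y)
    if target <= pr_y[0]:
        return 0.0
    for y in range(1, max_y + 1):
        if pr_y[y] >= target:
            lo, hi = pr_y[y - 1], pr_y[y]  # lo < target <= hi at this point
            return (y - 1) + (target - lo) / (hi - lo)
    return float(max_y)
```

Unlike linear equating, this conversion can be curved, so it can reflect forms whose difficulty differs more at some score levels than at others.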
     With an eye on the usefulness, feasibility and applicability of test equating approaches in TEM Band 4 (Test for English Majors Band 4), the thesis explores, through a tentative empirical study, which equating design can be recommended for TEM Band 4. Two relatively small groups of students took two separate experimental tests that shared common items, and the scores were computed, analyzed and discussed using statistical models built on the mean equating approach, the Tucker and Levine observed-score methods for non-equivalent groups, the equipercentile equating method and the one-parameter Rasch equating approach from IRT. The equating results thus obtained shed light on an appropriate equating design that suits the realities of the TEM Band 4 test.
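One way a Rasch-based equating with common items can work is sketched below. The abstract does not describe the linking procedure the thesis used, so the mean-mean shift and the true-score equating step are assumptions chosen for illustration, and every name and number is hypothetical.

```python
from math import exp

def rasch_p(theta, b):
    """Rasch probability of a correct answer for ability theta, difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def expected_score(theta, difficulties):
    """Expected raw score (true score) on a form at ability theta."""
    return sum(rasch_p(theta, b) for b in difficulties)

def link_shift(common_on_x, common_on_y):
    """Mean-mean shift that moves Form X calibrations onto the Form Y scale."""
    return (sum(common_on_y) / len(common_on_y)
            - sum(common_on_x) / len(common_on_x))

def theta_for_score(score, difficulties, lo=-10.0, hi=10.0):
    """Bisection for the ability whose expected score equals `score`.

    Assumes 0 < score < number of items.
    """
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def rasch_true_score_equate(x, b_x, b_y, shift):
    """Form Y true score equivalent to Form X true score x."""
    b_x_linked = [b + shift for b in b_x]  # Form X items on the Form Y scale
    theta = theta_for_score(x, b_x_linked)
    return expected_score(theta, b_y)

# Hypothetical difficulties of the common items from two separate calibrations.
shift = link_shift([-0.5, 0.5], [0.0, 1.0])
print(rasch_true_score_equate(2, [-1.0, 0.0, 1.0], [0.2, 0.9, 1.6], shift))
```

Because the Rasch model gives every item the same discrimination, a single additive constant is enough to align two separately calibrated scales.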
     On the basis of the empirical study, the thesis concludes that, to improve the validity, interchangeability, objectivity and consistency of the TEM Band 4 test, efforts to make the TEM Band 4 test forms interchangeable are worthwhile and long overdue. Although the optimum approach still merits further empirical study, the TEM Band 4 test can be equated well by borrowing from existing equating practices widely accepted by the measurement community. The thesis also recommends that the common-item non-equivalent groups design be considered when designing the equating method for the TEM Band 4 test because of its reliability, since the experimental data show that the CTT-based Tucker observed-score linear equating method and the equipercentile method are both as effective as, if not better than, the one-parameter Rasch equating method in a number of respects. The thesis further contends that CTT alone cannot be the sole basis for judging whether a particular item is suitable for a formal TEM Band 4 test, and that both CTT and IRT equating approaches should be considered in the equating design. It is therefore essential for the TEM Band 4 test to combine the two approaches effectively, minimizing the shortcomings of each while maximizing their respective strengths.
