Research on Severity Drift during the Process of Essay Ratings for Large-scale Educational Examinations
  • Original title (Chinese): 大规模教育考试作文评分中的严厉度漂移研究
  • Authors: ZHAO Haiyan (赵海燕); XIN Tao (辛涛); TIAN Wei (田伟)
  • Keywords: rating of constructed-response items; essay rating; rater effects; rater DRIFT; severity drift
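  • Journal code (Chinese title field): KSYJ
  • Journal (English title): China Examinations
  • Affiliations: Beijing Education Examinations Authority; Beijing Normal University
  • Publication date: 2019-02-01
  • Publisher: China Examinations (中国考试)
  • Year: 2019
  • Issue: No.322
  • Language: Chinese
  • Pages: KSYJ201902001
  • Page count: 8
  • CN: 02
  • ISSN: 11-3303/G4
  • Classification number: 4-11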
Abstract
Severity drift refers to fluctuation in a rater's severity across time, occasions, or tasks during the scoring of constructed-response items. Using operational data collected from the live essay rating of a high-stakes, large-scale educational examination, this study applied many-facet Rasch measurement (MFRM) and traditional difference-testing methods to detect severity drift, and compared the results obtained from different model variants and effect indices. The results show that, on this rating task, the rater group as a whole did not drift noticeably in severity, but a considerable proportion of individual raters fluctuated over the course of rating. The detection rate of the Separation Model was markedly higher than that of the Interaction Model. Static and dynamic severity effects are not related in a simple additive or one-to-one way: whether a rater's severity drifts does not depend on the strength of that rater's static severity effect.
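To make the idea of testing an individual rater for severity drift concrete, the following is a minimal sketch, not the authors' procedure. It assumes that each rater's severity (in logits) and its standard error have already been estimated separately for an early and a late rating period, and it flags raters whose standardized difference exceeds a cutoff. The function names, the cutoff of 2.0, and all numbers are hypothetical; whether this matches the Separation Model used in the paper is an assumption.

```python
import math


def severity_drift_z(sev_early, se_early, sev_late, se_late):
    """Standardized difference between two severity estimates (logits).

    Positive values mean the rater became more severe in the late period.
    """
    return (sev_late - sev_early) / math.sqrt(se_early ** 2 + se_late ** 2)


def flag_drift(estimates, critical=2.0):
    """Return {rater_id: z} for raters whose |z| reaches the cutoff.

    estimates maps rater_id -> (sev_early, se_early, sev_late, se_late).
    The cutoff is an illustrative assumption, not a value from the paper.
    """
    flagged = {}
    for rater, (s1, e1, s2, e2) in estimates.items():
        z = severity_drift_z(s1, e1, s2, e2)
        if abs(z) >= critical:
            flagged[rater] = z
    return flagged


# Hypothetical per-period severity estimates for three raters.
estimates = {
    "R01": (0.10, 0.08, 0.15, 0.08),   # small change
    "R02": (-0.30, 0.10, 0.20, 0.10),  # lenient early, more severe later
    "R03": (0.50, 0.09, 0.10, 0.09),   # severe early, more lenient later
}

flagged = flag_drift(estimates)
for rater, z in sorted(flagged.items()):
    print(f"{rater}: z = {z:.2f}")
```

In this made-up data, the flagged raters start from different static severity levels, which mirrors the kind of question the study asks about static versus dynamic effects without reproducing its results.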
References
[1] CONGDON P J, McQUEEN J. The Stability of Rater Severity in Large-Scale Assessment Programs[J]. Journal of Educational Measurement, 2000, 37(2):163-178.
[2] WOLFE E W. Identifying Rater Effects Using Latent Trait Models[J]. Psychology Science, 2004, 46(1):35-51.
[3] WOLFE E W, MOULDER B C. Examining Differential Reader Functioning over Time in Rating Data: An Application of the Multifaceted Rasch Rating Scale Model[C]. Montreal: The Annual Meeting of the American Educational Research Association, 1999.
[4] MYFORD C M, WOLFE E W, ENGELHARD G, et al. Monitoring Reader Performance and DRIFT in the AP English Literature and Composition Examination Using Benchmark Essays[R]. College Board, 2007.
[5] ZHAO Haiyan, XIN Tao, TIAN Wei. Rater Drift in the Scoring of Subjective Items and an Analysis of Its Traditional Detection Methods (主观题评分中的评分者漂移及其传统检测方法析)[J]. China Examinations, 2018(8):20-27.
[6] MYFORD C M, WOLFE E W. Monitoring Rater Performance over Time: A Framework for Detecting Differential Accuracy and Differential Scale Category Use[J]. Journal of Educational Measurement, 2009, 46:371-389.
[7] TIAN Qingyuan. A Rasch Experimental Analysis of Scoring in HSK Subjective Tests (HSK主观考试评分的Rasch实验分析)[J]. Psychological Exploration (心理学探新), 2007, 27:65-68.
[8] LUNZ M E, STAHL J A. Judge Consistency and Severity Across Grading Periods[J]. Evaluation and the Health Professions, 1990, 13:425-444.
[9] LIM G S. The Development and Maintenance of Rating Quality in Performance Writing Assessment: A Longitudinal Study of New and Experienced Raters[J]. Language Testing, 2011, 28(4):543-560.
[10] National Education Examinations Authority, Ministry of Education (教育部考试中心). Interim Statistical Specifications for Online Marking of National Education Examinations (国家教育考试网上阅卷统计暂行规范)[R]. Beijing, 2010.
[11] LINACRE J M. Many-facet Rasch Measurement[M]. Chicago, IL: MESA Press, 1989.
[12] LINACRE J M. Facets: Rasch Measurement Computer Program[M]. Chicago: MESA Press, 2003.
[13] GARNER M, ENGELHARD G. Gender Differences in Performance on Multiple-choice and Constructed Response Mathematics Items[J]. Applied Measurement in Education, 1999, 12(1):29-51.
[14] SWAMINATHAN H, ROGERS H J. Detecting Differential Item Functioning Using Logistic Regression Procedures[J]. Journal of Educational Measurement, 1990, 27:361-370.
[15] TRISTAN A. An Adjustment for Sample Size in DIF Analysis[J]. Rasch Measurement Transactions, 2006, 20:1070-1071.