An Analysis of Gender-Related Differential Item Functioning in the College English Placement Test of Hunan University
Abstract
This thesis consists of two parts: a theoretical survey and an empirical study. The theoretical survey introduces the concepts of item bias, item impact, item response theory (IRT), and differential item functioning (DIF), reviews the current state of domestic and overseas research on gender DIF in language testing, and describes several commonly used DIF detection methods. In the empirical study, a gender DIF analysis was conducted on the 50 reading comprehension items (all dichotomously scored) of the two forms of the 2007 College English Placement Test of Hunan University, using the one-parameter logistic model under IRT as implemented in the BILOG-MG software. Male students served as the reference group and female students as the focal group. The results are as follows. In Form 1, seven items showed significant gender DIF: three clearly favored male students and four clearly favored female students; another three items showed moderate gender DIF. In Form 2, seven items showed significant gender DIF: five clearly favored male students and two clearly favored female students; another two items showed moderate gender DIF. An analysis of the characteristics of all the DIF items indicates that most of them exhibit benign DIF; only item 25 of both forms and item 30 of Form 2 exhibit adverse DIF and need to be revised or deleted. The study then explores possible sources of the gender DIF and draws the following conclusions: (1) reasoning items related to the social sciences favor male students; (2) items about grammar or closely related to social life favor female students; (3) differences between male and female students on items about daily life, natural science, or environmental issues are not significant; (4) differences in interests, language communication ability, and learning motivation may also contribute to gender DIF. Finally, based on these results, suggestions are offered for future item development and item bank construction for the College English Placement Test of Hunan University.
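To make the analysis concrete, the following is a minimal sketch of the one-parameter logistic (Rasch) model referred to above and of the between-group difficulty contrast that signals gender DIF under that model. The notation is standard IRT usage rather than an excerpt from the thesis: θ denotes examinee ability, b_i the difficulty of item i, and the superscripts F and R the focal (female) and reference (male) groups; the exact significance criterion applied in BILOG-MG is not reproduced here.

P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}, \qquad \Delta b_i = \hat{b}_i^{F} - \hat{b}_i^{R}, \qquad z_i = \frac{\Delta b_i}{\sqrt{SE(\hat{b}_i^{F})^{2} + SE(\hat{b}_i^{R})^{2}}}

Because the one-parameter model absorbs all group differences in item functioning into the difficulty parameter, an item whose Δb_i is significantly greater than zero is harder for female examinees of equal ability and therefore favors male students, while a significantly negative Δb_i favors female students.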
References
    [1] Alberta Education. Public information [EB/OL]. http://www.education.gov.ab.ca, 2007: 2-18
    [2] American Educational Research Association (AERA), American Psychological Association (APA) & National Council on Measurement in Education (NCME). Standards for Educational and Psychological Testing [M]. Washington DC: American Psychological Association, 1999
    [3] Angoff W H. Perspectives on differential item functioning methodology [A]. In P W Holland & H Wainer (Eds). Differential Item Functioning [C]. Hillsdale NJ: Lawrence Erlbaum, 1993: 3-23
    [4] Ayala C C, et al. Reasoning dimensions underlying science achievement: The case of performance assessment [J]. Educational Assessment: Issues and Practice, 2002, 8(2): 101-121
    [5] Bachman L F, Davidson F, Ryan K & Choi I C. An investigation into the comparability of two tests of English as a foreign language: the Cambridge-TOEFL comparability study [M]. Cambridge: CUP, 1993: 222
    [6] Baker F B. A criticism of Scheuneman's item bias technique [J]. Educational Measurement, 1981, 18(1): 59-62
    [7] Baker F B. Some observations on the metric of PC-BILOG results [J]. Applied Psychological Measurement, 1990, 14(2): 139-150
    [8] Beller M & Gafni N. Can item format (multiple choice vs. open-ended) account for gender differences in mathematics achievement? [J]. Sex Roles, 2000, 42 (1/2): 1-21
    [9] Bennett R E, Rock D A & Wang M. Equivalence of free-response and multiple-choice items [J]. Educational Measurement, 1991, 28(1): 77-92
    [10]Bock R D & Aitkin M. Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm [J]. Psychometrika, 1981, 46(4): 443-459
    [11]Boyle J. Sex differences in listening vocabulary [J]. Language Learning, 1987, 37(2): 273-284
    [12]Bridgeman B & Rock D A. Relationships among multiple-choice and open-ended analytical questions [J]. Educational Measurement, 1993, 30(4): 313-329
    [13]Brimer M A. Sex differences in listening comprehension [J]. Research and Development in Education, 1969, 9(3): 171-179
    [14]Camilli G & Shepard L A. Methods for identifying biased test items [M]. Thousand Oaks CA: Sage, 1994: 1
    [15]Chen Z. & Henning G. Linguistic and cultural bias in language proficiency tests [J]. Language Testing, 1985, 2(2): 155-163
    [16]Childs R A & Oppler S H. Implications of test dimensionality for unidimensional IRT scoring: An investigation of a high-stakes testing program [J]. Educational and Psychological Measurement, 2000, 60(6): 939-955
    [17]Clauser B E & Mazor K M. Using statistical procedures to identify differentially functioning test items [J]. Educational Measurement: Issues and Practice, 1998, 17(1): 31-44
    [18]Cohen A S & Kim S. A comparison of Lord's χ2 and Raju's area measures in detection of DIF [J]. Applied Psychological Measurement, 1993, 17(1): 39-52
    [19]Cole N S & Moss P A. Bias in Test Use [A]. In R L Linn (Ed). Educational Measurement [C]. New York: Macmillan, 1993: 201-219
    [20]Cole N S. The ETS gender study: how females and males perform in educational setting [M]. Princeton NJ: Educational Testing Service, 1997: 201-219
    [21]Crocker L M & Algina J. Introduction to classical and modern test theory [M]. Orlando FL: Harcourt Brace Jovanovich, 1986: 1-149
    [22]Denno D. Sex differences in cognition: A review and critique of the longitudinal evidence [J]. Adolescence, 1982, 17(1): 779-788
    [23]Dorans N J & Holland P W. DIF detection and description: Mantel-Haenszel and standardization [A]. In P W Holland & H Wainer (Eds). Differential Item Functioning [C]. Hillsdale NJ: Lawrence Erlbaum, 1993: 35-66
    [24]Elder C. The effect of language background on "foreign" language test performance: The case of Chinese, Italian, and Modern Greek [J]. Language Learning, 1996, 46(2): 233-282
    [25]Ercikan K, et al. Calibration and scoring of tests with multiple choice and constructed response item types [J]. Educational Measurement, 1998, 35(2): 137-154
    [26]Frenette E & Bertrand R. Assessing dimensionality with TESTFACT and DIMTEST using large-scale assessment data sets [R]. New Orleans LA: American Educational Research Association, 2000: 1-8
    [27]Gierl M J, et al. Illustrating the utility of differential bundle functioning analyses to identify and interpret group differences on achievement tests [J]. Educational Measurement: Issues and Practice, 2001, 20(2): 26-36
    [28]Haladyna T M & Downing S M. Construct-irrelevant variance in high-stakes testing [J]. Educational Measurement: Issues and Practice, 2004, 23(1): 17-27
    [29]Hambleton R K. Principles and selected applications of item response theory [A]. In R L Linn (Ed). Educational Measurement [C]. New York: Macmillan, 1989:147-200
    [30]Hambleton R K, Robin F & Xing D. Item response models for the analysis of educational and psychological test data [A]. In H Tinsley & S Brown (Eds). Handbook of applied multivariate statistics and modelling [C]. San Diego CA: Academic Press,2000: 553-578
    [31]Hambleton R K & Rogers J H. Using item response models in educational assessments [A]. In W Schreiber & K Ingenkamp (Eds). International developments in large-scale assessment [C]. England: NFER-Nelson, 1990:155-184
    [32]Hambleton R K & Swaminathan H. Item Response Theory: Principles and applications (Ed) [M]. Boston MA: Kluwer-Nijhoff, 1985: 1-304
    [33]Hambleton R K, Swaminathan H & Rogers J H. Fundamentals of Item Response Theory [M]. New York: Sage publications, 1991: 1-156
    [34]Hamilton L, et al. Enhancing the validity and usefulness of large-scale educational assessments: II. NELS: 88 science achievement [J]. American Educational Research Journal, 1995, 32(3): 555-581
    [35]Hamilton L S. Gender differences on high school science achievement tests: Do format and content matter? [J]. Educational Evaluation and Policy Analysis, 1998, 20(3): 179-195
    [36]Hamilton L, Stecher B & Klein S (Eds). Making sense of test-based accountability in education [M]. Santa Monica CA: RAND, 2002: 14-48
    [37]Harris A M & Carlton S T. Patterns of gender differences on mathematics items on the SAT [J]. Applied Measurement in Education, 1993, 6(2): 137-151
    [38]Hedges L V & Nowell A. Sex differences in mental test scores, variability, and numbers of high-scoring individuals [J]. Science, 1995, 269(5220): 41-45
    [39]Henderson D. Investigation of DIF across item format [D]. Alberta: University of Alberta at Edmonton, 1999: 1-20
    [40]Hyde J & Linn M. Gender differences in verbal ability: a meta-analysis [J]. Psychological Bulletin, 1988, 104(1): 53-69
    [41]Kennedy P & Walstad W B. Combining multiple choice and constructed response test scores: An economist’s view [J]. Applied Measurement in Education, 1997, 10(4): 359-375
    [42]Kim M. Detecting DIF across the different language groups in a speaking test [J]. Language Testing, 2001, 18(1): 89-114
    [43]Klein S P, et al. Gender and racial/ethnic differences in performance assessments in science [J]. Educational Evaluation and Policy Analysis, 1997, 19(2): 83-97
    [44]Kunnan A J. DIF in native language and gender groups in an ESL placement test [J]. TESOL Quarterly, 1990, 24: 741-746
    [45]Kunnan A J. Fairness and justice for all [A]. In A J Kunnan (Ed). Fairness and validation in language assessment [C]. Cambridge: CUP, 2000: 1-14
    [46]Lane S, et al. Examination of the assumptions and properties of the graded item response model: An example using a mathematics performance [J]. Applied Measurement in Education, 1995, 8(4): 313-340
    [47]Lane S, Wang N & Magone M. Gender-related differential item functioning on a middle school mathematics performance assessment [J]. Educational Measurement: Issues and Practice, 1996, 15(4): 21-27
    [48]Lee Y W, Breland H & Muraki E. Comparability of TOEFL CBT writing prompts for different native language groups (TOEFL RR-77) [R]. Princeton NJ: ETS, 2004
    [49]Lim R G & Drasgow F. Evaluation of two methods for estimating item response theory parameters when assessing differential item functioning [J]. Journal of Applied Psychology, 1990, 75: 164-174
    [50]Lord F M. Application of item response theory to practical testing problems [M]. Hillsdale NJ: Erlbaum, 1980: 1-274
    [51]Lukhele R, Thissen D & Wainer H. On the relative value of multiple choice, constructed response, and examinee-selected items on two achievement tests [J]. Educational Measurement, 1994, 31(3): 234-250
    [52]Lu S M. An overview of procedures for identifying Differential Item Functioning [J]. Taipei Municipal Teachers College Academic Journal, 1999, 30: 149-166
    [53]Maccoby E E & Jacklin C N. The psychology of sex differences [M]. Stanford CA: Stanford University Press, 1974: 1-416
    [54]McGehee J J & Griffith L K. Large-scale assessments combined with curriculum alignment: agents of change [J]. Theory into Practice, 2001, 40(2): 137-144
    [55]Mellenbergh G J. Contingency table models for assessing item bias [J]. Educational Statistics, 1982, 7: 105-118
    [56]Messick S. Validity [A]. In R L Linn (Ed). Educational Measurement (3rd edition) [C]. New York: American Council on Education, 1989: 13-103
    [57]Millsap R E & Everson H T. Methodology review: statistical approaches for assessing Measurement Bias [J]. Applied Psychological Measurement, 1993, 17: 297-334
    [58]Moss P A. The role of consequences in validity theory [J]. Educational Measurement: Issues and Practice, 1998, 17(2): 6-12
    [59]Nandakumar R. Simultaneous DIF amplification and cancellation: Shealy-Stout's test for DIF [J]. Educational Measurement, 1993, 16: 159-176
    [60]Nussbaum E M, Hamilton L S & Snow R E. Enhancing the validity and usefulness of large-scale educational assessments: IV. NELS: 88 science achievement to 12th grade [J]. American Educational Research Journal, 1997, 34(1): 151-173
    [61]O’Neill K A & McPeek W M. Item and test characteristics that are associated with differential item functioning [A]. In P W Holland & H Wainer (Eds). Differential Item Functioning [C]. Hillsdale NJ: Erlbaum, 1993: 255-276
    [62]Raju N S. The area between two item characteristic curves [J]. Psychometrika, 1988, 54: 459-502
    [63]Raju N S. Determining the significance of estimated sign and unsigned areas between two item response functions [J]. Applied Psychological Measurement, 1990, 14: 197-207
    [64]Raju N S, van der Linden W J & Fleer P F. IRT-based internal measures of differential functioning of items and tests [J]. Applied Psychological Measurement, 1995, 19(4): 353-368
    [65]Zwick R, Donoghue J R & Grima A. Assessment of Differential Item Functioning for Performance Tasks [J]. Educational Measurement, 1993, 30(3): 233-251
    [66]Reckase M D. The difficulty of test items that measure more than one ability [J]. Applied Psychological Measurement, 1985, 9(4): 401-412
    [67]Roussos L & Stout W. Simulation studies of effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance [J]. Educational Measurement, 1996, 33(2): 215-230
    [68]Rudner L M. An approach to biased item identification using latent trait measurement theory [R]. New York: American Educational Research Association, 1977: 1-6
    [69]Ryan J M & Demark S. Variation in achievement scores related to gender, item format, and content area tested [A]. In G Tindal & T M Haladyna (Eds). Large-scale assessment programs for all students: Validity, technical adequacy, and implementation [C]. Mahwah NJ: Lawrence Erlbaum Associates, 2002: 67-88
    [70]Ryan K & Bachman L. Differential item functioning on two tests of EFL proficiency [J]. Language testing, 1992, 9(1): 12-29
    [71]Sasaki M. A comparison of two methods for detecting differential item functioning in an ESL placement test [J]. Language Testing, 1991, 8 (2): 95-111
    [72]Scheuneman J D. A method of assessing bias in test items [J]. Educational Measurement, 1979, 16(3): 143-152
    [73]Schmitt A P, Holland P W & Dorans N J. Evaluating hypotheses about differential item functioning [A]. In P W Holland & H Wainer (Eds). Differential item functioning [C]. Hillsdale NJ: Erlbaum, 1993: 281-315
    [74]Shealy R & Stout W F. A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF [J]. Psychometrika, 1993, 58(3): 159-194
    [75]Shepard L A, Camilli G & Averill M. Comparison of procedures for detecting test-item bias with both internal and external ability criteria [J]. Educational Statistics, 1981, 6(4): 317-375
    [76]Stocking M L & Lord F M. Developing a common metric in item response theory [J]. Applied Psychological Measurement, 1983, 7(2): 201
    [77]Stumpf H & Stanley J C. Gender-related differences on the College Board’s advanced placement and achievement tests, 1982-1992 [J]. Journal of Educational Psychology, 1996, 88(2): 353-364
    [78]Tannen D. You just don’t understand: women and men in conversation [M]. New York: William Morrow, 1990: 23-288
    [79]Tate R. Test dimensionality [A]. In G Tindal & T M Haladyna (Eds). Large-scale assessment programs for all students: Validity, technical adequacy, and implementation [C]. Mahwah New Jersey: Lawrence Erlbaum Associates, 2002: 181-212
    [80]Thissen D & Steinberg L. A taxonomy of item response models [J]. Psychometrika, 1986, 51(4): 567-577
    [81]Thissen D, Steinberg L & Wainer H. Use of item response theory in the study of group differences in tracelines [A]. In H Wainer & H I Braun (Eds). Test Validity [C]. Hillsdale NJ: Lawrence Erlbaum, 1988: 147-169
    [82]Thissen D, Steinberg L & Wainer H. Detection of differential item functioning using the parameters of item response models [A]. In P W Holland & H Wainer (Eds). Differential Item Functioning [C]. Hillsdale NJ: Lawrence Erlbaum, 1993: 67-113
    [83]Thissen D, Wainer H & Wang X. Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests [J]. Educational Measurement, 1994, 31(2): 113-123
    [84]Thorne B, Kramarae C & Henley N (Eds). Language, gender and society [M]. Rowley MA: Newbury House, 1983: 12-125
    [85]Wainer H & Lukhele R. How reliable are TOEFL scores? [J]. Educational and Psychological Measurement, 1997, 57(5): 741-759
    [86]Wainer H & Thissen D. Combining multiple choice and constructed-response test scores: Toward a Marxist theory of test construction [J]. Applied Measurement in Education, 1993, 6(2): 103-118
    [87]Wang N & Lane S. Detection of gender-related differential item functioning in a mathematics performance assessment [J]. Applied Measurement in Education, 1996, 9(2): 175-199
    [88]Welch C & Hoover H D. Procedures for extending item bias detection techniques to polytomously scored items [J]. Applied Measurement in Education, 1993, 6(1): 1-19
    [89]Welch C J & Miller T R. Assessing differential item functioning in direct writing assessments: Problems and an example [J]. Educational Measurement, 1995, 32(2): 163-178
    [90]Willingham W W & Cole N S. Gender and fair assessment [M]. Mahwah NJ: Lawrence Erlbaum Associates, 1997: 185-197
    [91]Yu M L. Introduction of Item Response Theory [J]. Inservice Education Bulletin, 1993, 10(4): 9-13
    [92]Yu M L. The development trend of measurement theory [A]. In Chinese Testing Institution (Ed). The development and application of psychometrics—Thesis collection for 60 anniversary of Chinese testing institution [C]. Taipei: Psychology, 1997: 23-62
    [93]Zimowski M F, et al. BILOG-MG: Multiple-group IRT analysis and test maintenance for binary items [M]. Chicago: Scientific Software, 1996: 29-478
    [94]Zwick R. When do item response function and Mantel-Haenszel definitions of differential item functioning coincide [J]. Educational Statistics, 1990, 15(3): 185-197
    [95]Zwick R. Fair game? The use of standardized admissions tests in higher education[M]. New York: Routledge Falmer, 2002: 143-158
    [96]Zeng X Q. Two types of methods for detecting differential item functioning: a comparison of CFA and IRT [J]. 心理学动态, 1999, 7(2): 41-47
