On the Argument-based Approach to Language Test Validation
Abstract
This thesis studies the argument-based approach to language test validation. The study covers two general aspects, the logic of argument and the interpretation of validity, and is organized around five specific topics.
     First, the thesis examines the argument logic of several representative and influential argument frameworks, focusing on their logical flaws and the sources of those flaws. Every framework studied claims as its logical structure the Toulmin model of argument from philosophy, yet each modifies the model's basic structure in applying it. In the modified models, the reasoning process may fall into an endless loop, the argument may degenerate into a self-contradictory process of arguing with oneself, and the Claim is no longer a claim but in fact a hypothesis. A model without a claim is, in essence, no longer an argument model. Nor do the modified versions amount to hypothesis-testing models, for although they contain hypotheses, they lack a conditional mechanism for deciding whether to accept or reject them. Further study shows that the logical errors arise mainly from a misunderstanding and misuse of the Rebuttal in the Toulmin model.
     Second, the study proposes a Progressive Argument Structure and advocates interpreting test validity through progressive argumentation. The progressive structure not only corrects the logical errors of the existing frameworks but also brings data analysis, a means of scientific inquiry, into the logical reasoning of rational argument. Validation often involves complex data of many kinds, and in most cases subjective logical reasoning alone cannot yield sound conclusions. With a conditional element and a data-analysis element added to the model, whether the Warrant is sufficient can be judged before reasoning begins. If there are sufficient warrants, reasoning proceeds along the Toulmin structure; otherwise data analysis is performed to generate new and more convincing evidential data. This design gives the model a recursion mechanism. The recursion yields a series of claims that form a hierarchy: each claim is built on the one before it, and the final claim results from the progressive accumulation of all the earlier claims. This is the true sense in which the argument is "progressive".
     Third, the study proposes a progressive view of validity that is centered on the target construct and grounded in stage validity. On this view, the data produced at every stage of a test should fully represent the target construct, and validity is the degree to which the data accurately represent the construct. Validity is by nature a matter of degree, yet the distinction between "valid" and "invalid" still stands: a degree high enough to be acceptable means valid, and a degree too low to be acceptable means invalid. "Valid" and "invalid" are qualitative evaluations of, and basic attitudes toward, a test; that validity is a matter of degree is no excuse for equivocating about whether a test is in fact valid. Progressive validity is the stage-by-stage progression of the validity of all stages: each stage rests on its predecessor, and if one stage fails the whole test is invalid. Validity progression, however, differs from percentage accumulation: progressive validity is no greater than the validity of the weakest stage. Moreover, progressive validity argumentation need not be confined to the interpretation and use of scores. Validity exists from the moment of design: before administration there is expected validity, and after administration there is actual validity. To ensure desirable actual validity, every stage before the test should have desirable expected validity, and every stage should be validated accordingly, with plausible interpretations of its validity, appropriate decisions, and anticipated consequences.
     Fourth, the study proposes a view of the ability construct as the cognitive processing of discourse information. It argues that examining the ability construct only through macro-level classifications of ability components or cognitive processes is not enough; the examination should go down to the discourse and into the semantics, investigating at a more micro level the candidates' ability to generate and comprehend discourse information. Furthermore, to assess language ability in terms of the accuracy and speed of semantic comprehension and the quality and quantity of semantic generation, the problem of cognitively quantifying and computing semantics must first be solved. To this end, the thesis constructs, under the guidance of systems theory, information theory and cybernetics, a system framework and an ability model for the cognitive processing of discourse information; then, guided by object-oriented theory in computing and drawing on the way computers represent real-world objects, it analyzes the structural forms and computing units of semantics so that semantics can be cognitively quantified and statistically computed; finally, on this basis, it proposes an information-maximization item-writing method which, through maximization computation, weighted sampling, categorization and item writing, provides test-content evidence for item-writing validity arguments.
     Fifth, two examples illustrate the application of the information-maximization item-writing method and progressive validity argumentation in item-writing practice. The item-writing example develops four multiple-choice reading comprehension items from a short 150-word passage. The argument example targets option guessability, a rival explanation of test validity; its main purpose is to show how rational argumentation can be combined with scientific inquiry to develop falsification arguments about item-writing validity, while also surveying how well option guessability is controlled in the item writing of China's National Matriculation English Test. The example examines 3 test papers, with a total of 74 multiple-choice items and 259 options. The results show that option guessability in the papers investigated is rather serious, and the test developers need to take more effective measures to bring it under control.
     Because of the wide range of issues involved, the study has not explored each testing stage in depth; the information-ability construct and the information-maximization item-writing method also await further testing in practice.
The argument-based approach to validating language tests can be traced back at least to the 1970s and 1980s, when attention began to be drawn to the importance of both verifying positive explanations of test validity and falsifying rival hypotheses. In recent years, the argument-based approach has been widely accepted and used in validation practice. However, there is no consensus on how the arguing should be carried out; on the contrary, heated debates can be found in recent publications. Following these applications and debates, two aspects of validity arguments have attracted increasing concern: the logic of argument and the interpretation of validity.
     Firstly, the present thesis analyzes the logical errors of three of the most influential argument-based validation frameworks: the Assessment Use Argument (AUA; Bachman, 2005; Bachman & Palmer, 2010), Evidence-Centered Design (ECD; Mislevy et al., 2003) and the Interpretive Argument (IA; Kane, 1990, 1992, 2004). Because all three frameworks claim the Toulmin structure of argument as their argument structure, a comparative study between these frameworks and the Toulmin model (Toulmin, 2003) is carried out. The results show that all three modified the basic structure of the Toulmin model before applying it, and that the modifications have caused serious logical problems: 1) the reasoning process is an endless loop; 2) the argument is a typical paradox; 3) the claim is in fact a hypothesis. When there is no claim, the model is no longer an argument model. Yet even though there is a hypothesis, the model is not a hypothesis-testing model either, because there is no conditional mechanism for deciding whether to accept or reject the hypothesis.
     Further study shows that the causes of the logical errors are similar too. Owing to a misunderstanding and misuse of the Toulmin Rebuttal, all counterclaims, including counter-explanations and rival hypotheses, are treated as Toulmin rebuttals. As a matter of fact, the Toulmin Rebuttal refers to “the sorts of exceptional circumstance [that] may in particular cases rebut the presumptions the warrant creates” (Toulmin, 2003, p. 99, emphases added), which functions much like the significance level (α) in hypothesis testing. By nature, rebuttals are low-probability events that can be and have to be set aside, but the modified versions insist that rebuttals be either verified or falsified before a claim is made. This is exactly what causes the logical problems.
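     To make the structural point concrete, the following is a minimal Python sketch (ours, not taken from any of the reviewed frameworks) of the six Toulmin elements, populated with Toulmin's own Bermuda example; note that the rebuttals are recorded as exceptional circumstances to be set aside, not as counterclaims to be verified or falsified before the claim is made.

    # A minimal sketch of the Toulmin structure of argument (illustrative only;
    # the element names follow Toulmin, 2003, not the reviewed frameworks).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ToulminArgument:
        data: List[str]        # grounds: the facts appealed to
        warrant: str           # licenses the step from data to claim
        backing: str           # support for the warrant itself
        qualifier: str         # e.g. "presumably", "probably"
        rebuttals: List[str]   # exceptional, low-probability circumstances that
                               # would defeat the warrant; like the significance
                               # level, they are set aside, not tested first
        claim: str             # the conclusion asserted under the qualifier

        def conclude(self) -> str:
            return f"{self.qualifier}, {self.claim}"

    arg = ToulminArgument(
        data=["Harry was born in Bermuda"],
        warrant="A man born in Bermuda will generally be a British subject",
        backing="the relevant statutes and legal provisions",
        qualifier="presumably",
        rebuttals=["both his parents were aliens"],
        claim="Harry is a British subject",
    )
    print(arg.conclude())  # presumably, Harry is a British subject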
     Secondly, the thesis proposes a new argument model called the Progressive Argument, which not only possesses a logical reasoning mechanism but also incorporates scientific inquiry into rational reasoning. As is often the case in rational reasoning, the data must be simple and the warrant self-evident for the claim to be plausible and easily accepted. But test data are often complicated, and hardly any conclusion can be drawn from them without scientific inquiry. In the face of complicated test data, data analysis has to be carried out so that more evidential data can be generated to authorize the logical reasoning process.
     To that end, the progressive argument embeds in its base structure two more elements in the Toulmin model, a Conditional to direct the reasoning procedure and an Analysis to carry out data analysis. Every time before starting the rational reasoning, the Conditional is invoked to decide whether there are sufficient warrants to authorize the reasoning step. If the condition is satisfied, the process is led into a Toulmin reasoning procedure; and if not, the process is directed into a data analysis procedure to generate new evidence. By including an Analysis element, the model possesses a recursion mechanism, which means that the justification of a claim may involve a recursive use of the Progressive Argument and the claim is based on the progression of all the sub-claims of the recursion steps. This is the reason why the argument is given the name Progressive Argument.
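     The control flow of the model can be sketched as follows (a minimal, self-contained illustration; the data representation, the threshold and both stub functions are our own stand-ins, and only the Conditional/Analysis recursion follows the model described above).

    # Illustrative sketch of the Progressive Argument's control flow.
    def sufficient_warrants(warrants, threshold=2):
        # Stand-in Conditional: decides whether the warrants suffice
        # to authorize an ordinary Toulmin reasoning step.
        return len(warrants) >= threshold

    def analyze_data(data):
        # Stand-in Analysis: scientific inquiry that generates new
        # evidential data (here just a labelled summary of the input).
        return f"evidence from analysis of {len(data)} data source(s)"

    def progressive_argument(data, warrants, sub_claims=()):
        if sufficient_warrants(warrants):
            # Toulmin step: the final claim rests on the whole hierarchy
            # of sub-claims produced by the earlier recursion steps.
            return list(sub_claims) + [f"claim based on {len(warrants)} warrants"]
        new_evidence = analyze_data(data)              # Analysis element
        return progressive_argument(data + [new_evidence],
                                    warrants + [new_evidence],
                                    list(sub_claims) + [new_evidence])

    print(progressive_argument(data=["raw test scores"], warrants=[]))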
     Thirdly, this thesis proposes a construct-centered, stage-based progressive view of test validity, Progressive Validity for short. According to this view, test validity is the progression of the validity of all the stages of a test; stage validity is the extent to which the data produced at a stage accurately represent the target construct of the test; and validation is the process of providing evidence to justify claims about stage validity or test validity.
     The progressive view stresses that the data produced at every stage should be representative of the target construct and that all stages should be centered on the same construct. That is to say, when collecting data to validate a stage or the test, the evidence must be construct-centered. It also stresses that test validity rests on stage validity. For a test to be valid, every stage has to be valid in the first place; if one stage is invalid, the whole test is invalid. Validity progression, however, is not like percentage accumulation: the validity of a test is no greater than its lowest stage validity. Validity is by nature a matter of degree, but a stage or test can still be either “valid” or “invalid”. If the degree is high enough to be acceptable, the stage or test is valid; conversely, if the degree is too low to be acceptable, the stage or test is invalid. In saying that a stage or test is valid or invalid, we do not propose an absolute assertion but a qualitative evaluation that conveys our fundamental attitude towards the test.
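     In symbols (our notation, not the thesis's): writing V_i for the validity of stage i and τ for the acceptability threshold, the two principles above amount to

    \[
    V_{\text{test}} \;\le\; \min_{1 \le i \le n} V_i,
    \qquad
    \text{valid} \iff V \ge \tau,
    \qquad
    \text{invalid} \iff V < \tau ,
    \]

so a single stage with V_i < τ renders the whole test invalid, however strong the other stages are.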
     Another point of critical importance in the progressive view is that validation should not be limited to score interpretation and use. Validity begins to emerge at the onset of test design. Before the test is administered, there exists expected validity; after the administration, actual validity comes into being. To guarantee that actual validity is desirable, expected validity has to be justified: plausible interpretations need to be achieved, appropriate decisions made, and intended and unintended consequences anticipated.
     Fourthly, this thesis advocates, from the perspective of the cognitive processing of discourse information, an information-processing view of language use ability. It is stressed that macro-level classifications of language ability components or cognitive processes are not enough; micro-level discourse and semantic analyses play a far more substantial role in language use. To attain a more accurate measure of language use ability, item writers need to consider to what degree candidates can process the information contained in the test with the expected accuracy and speed, and raters need to carry out in-depth analyses of the quality and quantity of the discourse generated by candidates.
     Information processing requires a practical solution to quantifying and computing the semantic items in specific discourses. Inspired by systems theory, information theory and cybernetics, a system framework and an ability model of the cognitive processing of discourse information are constructed; under the guidance of object-oriented knowledge representation theory, a semantic structure and a semantic unit for computing semantic items are proposed; and on this basis, an algorithm for the cognitive quantification of discourse information and an item-writing method called Information-Maximization Item Development (IMID) are developed.
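     As a rough illustration of the information-maximization idea only (the "content word" proxy for a semantic unit and every name below are our own assumptions; the thesis defines its own semantic structures, units and weighting), items would be written against the passage segments that carry the greatest semantic load.

    # Hypothetical sketch: rank sentences by a crude semantic load and
    # keep the top k as targets for item writing. The stopword-based
    # count is a stand-in for the thesis's semantic-unit quantification.
    def semantic_load(sentence,
                      stopwords=frozenset({"the", "a", "an", "of", "to",
                                           "and", "in", "is", "are"})):
        words = sentence.lower().replace(",", "").split()
        return sum(1 for w in words if w not in stopwords)

    def select_item_targets(passage, k=4):
        sentences = [s.strip() for s in passage.split(".") if s.strip()]
        # Maximization step: the k most information-dense sentences become
        # the content evidence that the items are written against.
        return sorted(sentences, key=semantic_load, reverse=True)[:k]

    passage = ("The tide rises twice a day. The pull of the moon draws the "
               "ocean water toward it. The sun exerts a smaller pull. "
               "Fishermen plan their work around the tide tables.")
    for target in select_item_targets(passage):
        print(target)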
     Fifthly, this thesis includes two application examples to illustrate how to apply IMID and the Progressive Argument at the test development stage. In the first example, four multiple-choice items were created with the IMID method; each item has four options, and all the items are based on the same 150-word passage. The second example is an empirical study designed to develop falsification arguments against option guessability, with a view to controlling multiple-choice item-writing quality. The example investigates the listening and reading comprehension parts of 3 NME (National Matriculation English) papers, with a total of 74 items and 259 options. The findings show that more effective measures need to be taken to better control option guessability.
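     For reference, the chance baseline implied by the four-option format is simple arithmetic (ours, not a figure from the study):

    \[
    P(\text{correct by blind guessing}) = \tfrac{1}{4},
    \qquad
    E[\text{correct on 74 items}] = 74 \times 0.25 = 18.5 ,
    \]

and options are guessable to the extent that test-wise candidates can beat this baseline without engaging the target construct.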
     Owing to the wide range of issues covered in this thesis, the present study refrains from digging deeper into the individual stages of language testing. Meanwhile, the Progressive Argument model, the information ability model and the IMID method all await further research and feasibility testing.
References
Alderson, C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press.
    Alderson, C. (1991). Dis-sporting life. Response to Alistair Pollitt’s paper. In Alderson and North (Eds.), 60-67.
    Allan, A. (1992). Development and validation of a scale to measure test-wiseness in EFL/ESL reading test takers. Language Testing, 9, 101-122.
    American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
    American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
    American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
    American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. Washington, DC: Author.
    American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1966). Standards for educational and psychological tests and manuals. Washington, DC: American Psychological Association.
    Anastasi, A. (1954). Psychological testing. New York: Macmillan.
    Anastasi, A. (1982). Psychological testing. New York: Macmillan.
    Angoff, W. (1988). Validity: an evolving concept. In H. Wainer & I. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
    Bachman, F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
    Bachman, F. (1999). Fundamental Considerations in Language Testing. Shanghai: Shanghai Foreign Language Education Press.
    Bachman, F. (2005). Building and Supporting a Case for Test Use. Language assessment quarterly, 2(1), 1–34.
    Bachman, F., F. Davidson, & M. Milanovic. (1996). The use of test method characteristics in the content analysis and design of EFL proficiency tests. Language Testing, 13(2), 125-150.
    Bachman, F., F. Davidson, K. Ryan, & I. Choi. (1995). An investigation into the comparability of two tests of English as a foreign language—the Cambridge-TOEFL comparability study. Cambridge: Cambridge University Press.
    Bachman, F. & S. Palmer. (1996). Language testing in practice. Oxford: Oxford University Press.
    Bachman, F., & S. Palmer. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford: Oxford University Press.
    Bertalanffy, V. (1968). General System Theory: Foundations, Development, Applications. New York: George Braziller.
    Bingham, V., (1937). Aptitudes and aptitude testing. New York: Harper.
    Bourbakis, N. (1992). Artificial Intelligence: Methods and Applications. Singapore: World Scientific Publishing Co. Ltd.
    Brennan, L. (2001). Generalizability Theory. New York: Springer-Verlag New York, Inc.
    Brown, A. (1987). Metacognition, executive control, self control, and other mysterious mechanisms. In F. Weinert and R. Kluwe (Eds.), Metacognition, Motivation, and Understanding (pp. 65–116). Hillsdale, NJ: Erlbaum.
    Buck, G., K. Tatsuoka, I. Kostin & M. Phelps. (1997). The sub-skills of listening: Rule-space analysis of a multiple-choice test of second language listening comprehension. In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), Current developments and alternatives in language assessment. Tampere: University of Jyväskylä.
    Burton, J., R. Sudweeks, F. Merrill & B. Wood. (1991). How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty. Brigham Young University Testing Services and the Department of Instructional Science.
    Campbell, T. & W. Fiske. (1959). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. Psychological Bulletin, 56, 81-105.
    Chapelle, A., K. Enright, & J. Jamieson. (2008). Building a validity argument for the Test of English as a Foreign Language. London: Routledge.
    Chapelle, A., K. Enright & J. Jamieson. (2010). Does an Argument-Based Approach to Validity Make a Difference? Educational Measurements: Issues and Practice, 29 (1), 3–13.
    Cheng, L. (2008). Washback, impact and consequences. In E. Shohamy & H. Hornberger (Eds.), Encyclopedia of language and education. Volume 7: Language testing and assessment (2nd ed., pp. 349–364). New York: Springer Science and Business Media LLC.
    Cronbach, J. (1949). Essentials of psychological testing. New York: Harper.
    Cronbach, J. (1971). Test validation. In R. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
    Cronbach, J. (1980). Validity on parole: how can we go straight? In New directions for testing and measurement (Vol. 5, pp. 99–108). San Francisco: Jossey-Bass.
    Cronbach, J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
    Cronbach, J., & E. Meehl. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
    Cronbach, J. (1989). Construct validation after thirty years. In R. Linn (Ed.), Intelligence: Measurement theory and public policy. Urbana: University of Illinois Press.
    Cureton, E. (1950). Validity. In E. Lindquist (Ed.). Educational measurement (pp. 621-694). Washington. DC: American Council on Education.
    Diamond, J. & W. Evans. (1972). An investigation of the cognitive correlates of test-wiseness. Journal of Educational Measurement, 9, 145-150.
    Dwyer, C. (2000). Excerpt from validity: Theory into practice. The Score, 22(A), 6-7.
    Ebel, R. (1961). Must all tests be valid? American Psychologist, 16, 640-647.
    Ebel, L. & A. Frisbie. (1991). Essentials of educational measurement. (5th ed.). Englewood Cliffs, NJ: Prentice-Hall
    Færch, C. & G. Kasper. (1983). Strategies in Interlanguage Communication. London: Longman.
    Field, J. (2004). Psycholinguistics: The Key Concepts. London: Routledge.
    Firbas, J. (1992). Functional Sentence Perspective in Written and Spoken Communication. Cambridge: Cambridge University Press.
    Flavell, H. (1976). Metacognitive aspects of problem solving. In L. Resnick (Ed.), The nature of intelligence (pp. 231–236). Hillsdale, NJ: Erlbaum.
    Freedle, R. & I. Kostin. (1999). Does the text matter in a multiple-choice test of comprehension? the case for the construct validity of TOEFL minitalks. Language Testing, 16(1), 2-32.
    Garrett, E. (1947). Statistics in psychology and education. New York: Longmans, Green.
    Goodwin, D. (1999). The role of factor analysis in the estimation of construct validity. Measurement in Physical Education and Exercise Science, 3, 85-100.
    Goodwin, D. (2002a). Changing conceptions of measurement validity: An update on the new standards. Journal of Nursing Education, 41, 100-106.
    Goodwin, D. (2002b). The meaning of validity. Journal of Pediatric Gastroenterology and Nutrition, 35, 6-7.
    Goodwin, D., & L. Leech. (2003). The meaning of validity in the new standards for educational and psychological testing: implications for measurement courses. Measurement and Evaluation in Counseling and Development, 36.
    Green, A. (2003). Test impact and English for academic purposes: a comparative study in backwash between IELTS preparation and university processional courses. Unpublished PhD thesis. University of Surrey, Roehampton.
    Guilford, P. (1946). New standards for test evaluation. Educational & Psychological Measurement, 6, 427-438.
    Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
    Haladyna, M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
    Haladyna, M., M. Downing & C. Rodriguez. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 3.
    Halliday, K. (1985). An Introduction to Functional Grammar. London: Edward Arnold.
    Halliday, K. (1994) An Introduction to Functional Grammar (2nd Ed.). London: Edward Arnold. (FG)
    Halliday, K. & R. Hasan. (1976). Cohesion in English. London: Longman.
    Hamp-Lyons, L. (1997). Washback, impact and validity: ethical concerns. Language Testing, 14, 295-303.
    Hayati, M. & N. Ghojohg. (2008). Investigating the influence of proficiency and gender on the use of selected test-wiseness strategies in higher education. English Language Teaching, 2, 169-181.
    Hempel, G. (1965). Aspects of scientific explanation and other essays in the philosophy of science. Glencoe, IL: Free Press.
    Houston, E. (2005). Test-wiseness training: An investigation of the impact of test-wiseness in an employment setting. The Graduate Faculty of the University of Akron.
    Hughes, A. (2003). Testing for language teachers. Cambridge: Cambridge University Press.
    Johnson, M. (2001). The art of non-conversation. New Haven, CT.: Yale University Press.
    Kane, T. (1990). An argument-based approach to validation. ACT Research Report Series, 90-13.
    Kane, T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
    Kane, T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342.
    Kane, T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21, 31–41.
    Kane, T. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2, 135–170.
    Kane, T. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed.) (pp. 17–64), Westport, CT: American Council on Education and Praeger.
    Kane, T. (2010). Validity and fairness. Language Testing, 27, 177-182.
    Kane, T., T. Crooks & A. Cohen. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
    Khalifa, H. & C. Weir. (2009). Examining Reading: Research and practice in assessing second language reading. Cambridge: Cambridge University Press.
    Klein-Braley, C. (1981). Empirical investigation of cloze tests: an examination of the validity of cloze tests as tests of General Language Proficiency in English for German University Students. Unpublished Ph.D. dissertation. University of Duisburg.
    Krista, V. (1998). Quantitative and qualitative methods of construction in language assessment: conflict or synergy. Journal of Quantitative Linguistics, 5(1-2), 105-114.
    Kunnan, J. (2000). Fairness and justice for all. In A. Kunnan (Ed.), Fairness and validation in language assessment (pp. 1–14). Cambridge, UK: Cambridge University Press.
    Kunnan, J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context: Proceedings of the ALTE Barcelona Conference (pp. 27–48). Cambridge, UK: Cambridge University Press.
    Kunnan, J. (2005). Language assessment from a wider context. In E. Hinkel (Ed.), Handbook of research in second language learning (pp. 779–794). Mahwah, NJ: Lawrence Erlbaum.
    Kunnan, J. (2008). Towards a model of test evaluation: Using the Test Fairness and Wider Context frameworks. In L. Taylor & C. Weir (Eds.), Multilingualism and assessment: Achieving transparency, assuring quality, sustaining diversity (pp. 229–251). Cambridge, UK: Cambridge University Press.
    Kunnan, J. (2009a). Testing for citizenship: The U.S. Naturalization Test. Language Assessment Quarterly, 6, 89–97.
    Kunnan, J. (2009b). Politics and legislation in citizenship testing in the U.S. Annual Review of Applied Linguistics, 23, 37–48.
    Kunnan, J. (2010). Test fairness and Toulmin's argument structure. Language Testing, 27, 183-189.
    Lamb, S. (1999). Pathways of the Brain: The Neurocognitive Basis of Language. Amsterdam: John Benjamins Publishing Co.
    Langenfeld, E. & M. Crocker. (1994). The evolution of validity theory: Public school testing, the courts, and incompatible interpretations. Educational Assessment, 2, 149-165.
    Linn, L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16(2), 14-16.
    Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694 (Monograph Suppl. 9).
    Lynch, K. (2001). Rethinking assessment from a critical perspective. Language Testing, 18, 351–372.
    Marzano, R. & D. Jesse. (1987). A Study of General Cognitive Operations in Two Achievement Test Batteries and Their Relationship to Item Difficulty. Unpublished paper, Mid-Continent Regional Educational Lab., Aurora, CO.
    Marzano, R. & A. Costa. (1988). Question: Do standardized tests measure general cognitive skills? Answer: No. Unpublished paper. http://www.ascd.org/ASCD/pdf/journals/ed_lead/el_198805_marzano.pdf
    Mathesius, V. (1939). Functional Sentence Perspective. Prague: Academia.
    McNamara, T. (1997). “Interaction” in second language performance assessment: Whose performance? Applied Linguistics, 18(4), 446-465.
    McNamara, T. (1998). Policy and social considerations in language assessment. In W. Grabe (Ed.), Annual Review of Applied Linguistics (Vol. 18, pp. 304–319). New York: Cambridge University Press.
    McNamara, T. & C. Roever. (2006). Language Testing: The Social Dimension. London: Blackwell.
    Mehrens, W. (1997). The consequences of consequential validity. Educational Measurement: Issues and Practice, 16(2), 16-18.
    Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
    Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027
    Messick, S. (1988). The once and future issues of validity: assessing the meaning and consequences of measurement. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
    Messick, S. (1989a). Meaning and value in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
    Messick, S. (1989b). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.
    Messick, S. (1992). Validity of test interpretation and use. In M.C. Alkin (ed.), Encyclopedia of Educational Research (16th ed.). New York: Macmillan.
    Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
    Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749
    Messick, S. (1996). Validity and washback in language testing. Language Testing, 13, 241-256.
    Mislevy, R. (1995). Probability-based inference in cognitive diagnosis. In P. Nichols, S. Chipman & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 43-71). Hillsdale, NJ: Erlbaum.
    Mislevy, J., S. Steinberg & G. Almond. (2002). Design and analysis in task-based language assessment. Language Testing, 19(4), 477–496.
    Mislevy, J., S. Steinberg & G. Almond. (2003). On the structure of assessment arguments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.
    Morse, T. (1998). The relative difficulty of selected test-wiseness skills among college students. Educational and Psychological Measurement, 58, 299-409.
    Mosier, C. (1947). A critical examination of the concepts of face validity. Educational and Psychological Measurement, 7, 191-205.
    Moss, A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229-258.
    Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22, 287-293.
    Nitko, J. (2001). Educational assessment of students (3rd ed.). Columbus, OH: Merrill Prentice Hall.
    O’Malley, M. & A. Chamot. (1990). Learning Strategies in Second Language Acquisition. Cambridge: Cambridge University Press.
    Osterlind, J. (1998). Constructing test items: Multiple choice, constructed-response, performance and other formats (2nd ed.). Boston: Kluwer Academic.
    Pearson, D. & D. Johnson. (1978). Teaching Reading Comprehension. New York: Holt, Rinehart and Winston.
    Phakiti, A. (2008). Construct validation of Bachman and Palmer’s (1996) strategic competence model over time in EFL reading tests. Language Testing, 25(2), 237–272.
    Pidwirny, M. (2008). Fundamentals of Physical Geography, (2nd Ed.). Physical Geography.net.
    Popham. J. (1997). Consequential validity: Right concern—wrong concept. Educational Measurement: Issues and Practice, 16(2). 9-13.
    Saussure, F. de. (2006). Writings in General Linguistics (English translation). Oxford: Oxford University Press.
    Schoonen, R. (2005). Generalizability of writing scores: an application of structural equation modeling. Language Testing, 22(1), 1–30
    Scruggs, E. & M. Mastropieri. (1992). Assessing Test-Taking Skills. Cambridge: Brookline Books.
    Searle, R. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge: Cambridge University Press.
    Shannon, E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27.
    Shohamy, E. (2001). The power of tests: a critical perspective on the uses of language tests. Harlow: Pearson Education.
    Shepard, A. (1993). Evaluating test validity. In Darling-Hammond (Ed.), Review of research in education (vol. 19, pp. 405-450). Washington, DC: American Educational Research Association.
    Shepard, A. (1997). The centrality of test use and consequences for test validity. Educational. Measurement: Issues and Practice, 16(2), 5-8, 13, 24.
    Song, M. (2008) Do divisible subskills exist in second language (L2) comprehension? A structural equation modeling approach. Language Testing, 25(4), 435–464.
    Suppe, P. (1977). The structure of scientific theories. Urbana, IL: University of Illinois Press.
    Tarski, A. (1956). Logic, Semantics, Metamathematics: Papers From 1923 to 1938. Oxford: Oxford University Press.
    Toulmin, E. (2003). The uses of argument (Updated ed.). Cambridge, England: Cambridge University Press.
    Torre, J. (2009). A Cognitive Diagnosis Model for Cognitively Based Multiple-Choice Options. Applied Psychological Measurement, 33, 163-183.
    Thorndike, M. (1997). Measurement and evaluation in psychology and education (6th ed.). Upper Saddle River, NJ: Merrill.
    Tsuchihira, T. (2008). The relationships between test-wiseness and the English listening test scores. http://cicero.u-bunkyo.ac.jp/lib/kiyo/fsell2008/index.html
    Wainer, H. & I. Braun. (1988). Test validity (Eds.). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
    Weir, C. (1993). Understanding and developing language tests. New York: Prentice Hall.
    Weir, C. (2005). Language testing and validation: an evidence-based approach. Basingstoke: Palgrave Macmillan.
    Widdowson, G. (1978). Teaching Language as Communication. Oxford: Oxford University Press.
    Wiener, N. (1948). Cybernetics: Or Control and Communication in the Animal and the Machine. Paris: Hermann & Cie; Cambridge, MA: MIT Press.
    Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27, 147-170.
    Chen, X. (2004). Principles of Marxist Philosophy (2nd ed.). Beijing: China Renmin University Press.
    Cheng, Q. (1999). An Introduction to Cognitive Linguistics: The Neurocognitive Basis of Language. Beijing: Foreign Language Teaching and Research Press.
    Dai, M. (1996). Design principles and distractor patterns of multiple-choice items. Foreign Language World, (2).
    Deng, J. (2009). On the cognitive processing of discourse information. Foreign Languages and Their Teaching, (3), 4-8.
    Deng, J. (2010). Research and practice of a web-based autonomous learning model. Computer-Assisted Foreign Language Education, (2), 58-63.
    Gui, S. (1991). An Outline of Experimental Psycholinguistics: The Perception, Comprehension and Production of Language. Changsha: Hunan Education Press.
    Kong, W. (2009). A Multi-perspective Analysis of the Validity of the TEM4 Reading Tasks. Shanghai: Shanghai International Studies University.
    Wiener, N. (2007). Cybernetics: Or Control and Communication in the Animal and the Machine (Hao Jiren, Trans.). Beijing: Peking University Press.
    Hu, Z. (1994). Cohesion and Coherence in Discourse. Shanghai: Shanghai Foreign Language Education Press.
    Huang, C. & Xia, Y. (1995). Topics in Language Information Processing. Beijing: Tsinghua University Press.
    Jin, Y. & Guo, J. (2002). A validity study of the non-face-to-face oral test of CET-4 and CET-6. Foreign Language World, (5), 72-79.
    Jin, Y. & Wu, J. (1998). Examining the validity of the CET reading comprehension test through introspection. Foreign Language World, (2), 47-52.
    Ju, Y. (2003). A functional linguistic perspective on the study of information structure. Foreign Languages and Their Teaching, (4).
    Li, X. (1997). The Science and Art of Language Testing. Changsha: Hunan Education Press.
    Liu, R. (2001). UML Object Design and Programming. Beijing: Beijing Hope Electronic Press.
    Pan, Z. (2001). The multiple-choice format in language testing. Foreign Language World, (4), 67-74.
    Peng, K. (2010). Validity study on listening comprehension tasks: From the perspective of Assessment Use Argument. Unpublished doctoral dissertation.
    TEM Testing Center, Shanghai International Studies University. (1998). A Validity Study of TEM. Shanghai: Shanghai Foreign Language Education Press.
    Wan, Y. & Ding, H. (2001). Object-Oriented Analysis and Design. Beijing: Tsinghua University Press.
    Wang, G. (1988). Foundations of General Linguistics. Changsha: Hunan Education Press.
    Xie, X. (2004). A comparative analysis of the 1999 and 1985 editions of the Standards for Educational and Psychological Testing. China Examinations, 4, 16-19.
    Yang, H. & C. Weir. (1998). A Validity Study of CET-4 and CET-6. Shanghai: Shanghai Foreign Language Education Press.
    Yang, Z. & Zhang, L. (2003). Generalizability Theory and Its Applications in Assessment. Beijing: Educational Science Press.
    Yu, S. (2003). An Introduction to Computational Linguistics. Beijing: The Commercial Press.
    Zhang, J. & Zhang, K. (1998). A Contrastive Study of Information Structure in English and Chinese. Kaifeng: Henan University Press.
    Zhou, H. & Liu, S. (2001). Metacognition and second language acquisition. In Chinese Linguistics: Research and Application. Shanghai: Shanghai Foreign Language Education Press.
    Zou, S. (1998). A summary report of the test validity research project. In S. Zou, English Language Testing: Theory and Practice. Shanghai: Shanghai Foreign Language Education Press.
