A Comparative Study of the Kernel Method of Test Equating and Other Equating Methods Based on HSK Data
Abstract
Equating scores across different forms of a test is of real practical importance. It allows test scores to be reported and interpreted more accurately and keeps the evaluation standard stable across administrations, thereby safeguarding test quality.
     The kernel method of test equating (KE) is a relatively new approach. It brings the linear and equipercentile equating methods of classical test theory (CTT) into a single unified framework. KE transforms the observed-score distribution of a given examinee population on form X into the observed-score distribution on form Y, so it is in essence an observed-score equating method. KE proceeds in five steps: presmoothing, estimation of score probabilities, continuization, equating, and computation of the standard error of equating. The method has been in operational use at Educational Testing Service (ETS) for some time.
     Under the assumption that the two forms are similar in difficulty and the examinee groups are similar in ability, how do the new methods in the KE framework differ from the traditional CTT methods, and to what extent? Do methods within the KE framework differ from one another? Can KE be used to equate the HSK? To answer these questions, this study constructed pseudo-test forms from real HSK data so as to remove error as far as possible, and compared the KE methods with the traditional CTT methods against a set of equating criteria.
     Sixteen methods under the common-item nonequivalent-groups (NEAT, anchor-test) design were compared: eight CTT methods — Tucker, Levine, Braun-Holland, chained linear, chained equipercentile with and without presmoothing, and frequency-estimation equipercentile with and without presmoothing — and eight KE methods — chained equating with an optimal bandwidth (CE optimal), chained equating with a large bandwidth (CE linear), poststratification with an optimal bandwidth (PSE optimal), and poststratification with a large bandwidth (PSE linear), each applied both with and without presmoothing.
     The comparison leads to the following conclusions. When the forms differ in difficulty and the groups differ in ability, and the random-groups equipercentile method is taken as the criterion, the equipercentile methods in both frameworks perform well, but the chained methods perform poorly with small samples. The KE methods correspond one-to-one with certain CTT methods, and the KE linear methods reproduce the results of the corresponding traditional linear methods even without presmoothing. With large samples, substantial differences appear between the kernel chained and kernel poststratification methods, between kernel chained equipercentile and kernel chained linear, and between kernel poststratification equipercentile and kernel poststratification linear. With small samples, the corresponding pairs mostly differ little, but presmoothing can enlarge the differences; presmoothing thus plays an important role in equating smaller samples.
     Because the present HSK forms are more difficult than the 1989 forms and today's examinees are higher-achieving, the study makes the following recommendations. With small samples, use the presmoothed frequency-estimation equipercentile method (CTT) or the presmoothed poststratification KE method with an optimal bandwidth, and avoid the chained methods. With large samples, the frequency-estimation and chained equipercentile methods (CTT) and the poststratification and chained KE methods with optimal bandwidths all work well.
     Different equating criteria and statistical indices are also discussed; comparisons based on different criteria can lead to different conclusions.
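The core of the KE procedure — continuizing each discrete score distribution with a Gaussian kernel and then mapping scores through e_Y(x) = F_Y⁻¹(F_X(x)) — can be sketched as follows. This is only an illustrative simplification, not the operational ETS implementation: all names are my own, a single-group design is assumed, and the full method's log-linear presmoothing and mean/variance-preserving kernel rescaling are omitted.

```python
# Minimal sketch of Gaussian-kernel equipercentile equating (single-group
# design). Illustrative only: the full kernel method (von Davier, Holland &
# Thayer, 2004) also presmooths the score probabilities with a log-linear
# model and rescales the kernel so the continuized distribution keeps the
# discrete mean and variance; both refinements are omitted here.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def continuized_cdf(scores, probs, h):
    """Continuize a discrete score distribution with a Gaussian kernel of
    bandwidth h: F(x) = sum_j p_j * Phi((x - s_j) / h)."""
    scores = np.asarray(scores, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return lambda x: float(np.sum(probs * norm.cdf((x - scores) / h)))

def kernel_equate(x, scores_x, probs_x, scores_y, probs_y, h=0.6):
    """Map a score x on form X to its equated score on form Y via
    e(x) = F_Y^{-1}(F_X(x)), with both CDFs kernel-continuized."""
    Fx = continuized_cdf(scores_x, probs_x, h)
    Fy = continuized_cdf(scores_y, probs_y, h)
    p = Fx(x)
    lo = min(scores_y) - 10 * h   # bracket for the root of F_Y(y) = p
    hi = max(scores_y) + 10 * h
    return brentq(lambda y: Fy(y) - p, lo, hi)

# Toy example: form Y is uniformly 2 points harder than form X, so every
# X score should equate to roughly x + 2 on the Y scale.
sx = np.arange(0, 11)
px = np.full(11, 1 / 11)
sy = sx + 2
py = px.copy()
eq = kernel_equate(5.0, sx, px, sy, py)   # expected to be close to 7.0
```

A large bandwidth h makes the continuized CDFs nearly Gaussian, which is why the "large h" KE variants in the study reproduce linear equating, while an optimal (small) h preserves the shape of the discrete distribution and yields the equipercentile-type results.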
