Comparison of Classical Test Theory vs. Multi-Facet Rasch Theory in writing assessment
DOI: https://doi.org/10.47750/pegegog.12.02.21
Keywords: Writing assessment, CTT, MFRM, IRT, criterion validity
Abstract
Testing English writing skills can be multi-dimensional; accordingly, this study compared students' writing scores calculated according to Classical Test Theory (CTT) and the Multi-Facet Rasch Model (MFRM). The research was carried out in 2019 with 100 university students enrolled in a foreign language preparatory class and four experienced instructors who participated as raters. Data were collected with a writing rubric consisting of four components (content, organization, grammar and vocabulary), and the participants' writing scores were analysed under both CTT and the MFRM. First, each participant's CTT score was calculated as the mean of the writing scores awarded by the raters. Then, the MFRM was applied to the data through a three-facet design treating rater, student and rubric component as the facets. Ability estimates obtained from the Rasch analysis, reported on the logit scale, were converted to the score scale of the analytic rubric used throughout the scoring procedure. Finally, the two sets of writing scores produced by the two measurement models were compared. The findings showed a high positive correlation between the ability estimates obtained under CTT and the MFRM; nevertheless, the difference between the mean scores calculated under the two models was significant. Moreover, the analyses showed that the criterion validity of the writing scores obtained via the MFRM was higher than that of the scores obtained via CTT.
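For reference, a standard three-facet formulation of the MFRM with rating-scale categories, matching the student, rater and rubric-component facets named above, can be written as follows; this is the generic many-facet Rasch model, and the exact parameterization used in the study is not reported in the abstract:

\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

where P_{nijk} is the probability of student n receiving category k rather than k-1 from rater j on rubric component i, B_n is the ability of student n, D_i is the difficulty of component i, C_j is the severity of rater j, and F_k is the step difficulty of category k.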
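The comparison procedure described in the abstract can also be illustrated with a minimal sketch. The ratings, the logit ability estimates, the linear logit-to-rubric conversion, and the external criterion below are all simulated assumptions for illustration only; in the study the MFRM estimates would come from a three-facet analysis in dedicated software, not from this simulation.

```python
# Hypothetical sketch: comparing CTT mean scores with MFRM-based estimates
# converted back to the rubric scale. All data here are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_students, n_raters, n_components = 100, 4, 4  # design described in the study
# Simulated ratings: each rater scores each student on four rubric components.
ratings = rng.integers(10, 26, size=(n_students, n_raters, n_components))

# CTT score: the mean of the four raters' total scores for each student.
ctt_scores = ratings.sum(axis=2).mean(axis=1)

# MFRM ability estimates would normally come from a three-facet analysis
# (student x rater x rubric component); simulated on the logit scale here.
logit_abilities = rng.normal(0.0, 1.0, size=n_students)

# Assumed linear mapping from logits back to a 0-100 rubric score scale.
lo, hi = logit_abilities.min(), logit_abilities.max()
mfrm_scores = (logit_abilities - lo) / (hi - lo) * 100

# Agreement between the two score sets (the study reports a high positive correlation).
r, _ = stats.pearsonr(ctt_scores, mfrm_scores)

# Test whether the mean difference between the two score sets is significant.
t, p = stats.ttest_rel(ctt_scores, mfrm_scores)

# Criterion validity: correlate each score set with an external criterion
# (simulated here; in the study this would be an independent measure of writing ability).
criterion = 0.7 * stats.zscore(ctt_scores) + rng.normal(0, 0.7, n_students)
r_ctt, _ = stats.pearsonr(ctt_scores, criterion)
r_mfrm, _ = stats.pearsonr(mfrm_scores, criterion)

print(f"r(CTT, MFRM) = {r:.2f}, paired t = {t:.2f} (p = {p:.3f})")
print(f"criterion validity: CTT r = {r_ctt:.2f}, MFRM r = {r_mfrm:.2f}")
```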
License
Copyright (c) 2022 Pegem Journal of Education and Instruction

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.