Comparison of Classical Test Theory vs. Multi-Facet Rasch Theory in writing assessment

Authors

  • Murat Polat, Anadolu University
  • Nihan Sölpük Turhan, Fatih Sultan Mehmet University
  • Çetin Toraman, Çanakkale Onsekiz Mart University

DOI:

https://doi.org/10.47750/pegegog.12.02.21

Keywords:

Writing assessment, CTT, MFRM, IRT, criterion validity

Abstract

Testing English writing skills is a multi-dimensional task; thus, this study aimed to compare students’ writing scores calculated according to Classical Test Theory (CTT) and the Multi-Facet Rasch Model (MFRM). The research was carried out in 2019 with 100 university students enrolled in a foreign language preparatory class and four experienced instructors who participated as raters. Data were collected with a writing rubric consisting of four components (content, organization, grammar and vocabulary), and the participants’ writing scores were analysed under both CTT and MFRM. First, each participant’s CTT writing score was calculated by averaging the points awarded by the raters. Then, the MFRM was applied to the data through a three-facet design, treating rater, student and rubric component as the facets. Ability estimates obtained from the Rasch analysis and reported on the logit scale were converted back to the analytic rubric’s component score scale used throughout the scoring procedure, yielding two sets of writing scores that were then compared across the two measurement models. The findings showed a positive and high correlation between the ability estimates obtained under CTT and under the MFRM; however, the difference between the mean scores calculated under the two theories was still significant. Moreover, the analyses showed that the criterion validity of the writing scores obtained via the MFRM was higher than that of the scores obtained via CTT.
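
To make the comparison described above concrete, the following is a minimal sketch, not the authors' code: the data are simulated, the rater columns (rater_1 … rater_4) and the 0–100 score range are illustrative assumptions, and the MFRM ability estimates, which the study would obtain from a three-facet Rasch analysis (e.g., with the FACETS program), are mocked by a simple standardisation before being rescaled to the rubric's score range.

# A hedged sketch of the CTT vs. MFRM score comparison (simulated data).
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
true_ability = rng.normal(70, 10, size=100)          # 100 hypothetical students
ratings = pd.DataFrame(
    {f"rater_{i}": true_ability + rng.normal(0, 5, size=100) for i in range(1, 5)}
)

# Step 1 (CTT): each student's score is the mean of the four raters' points.
ctt_scores = ratings.mean(axis=1)

# Step 2 (MFRM): in the study, ability estimates in logits come from a
# three-facet Rasch analysis (student, rater, rubric component); here they
# are mocked by standardising the simulated abilities.
mfrm_logits = (true_ability - true_ability.mean()) / true_ability.std()

# Convert logits back to the rubric's score range (a simple linear rescaling;
# the paper's exact conversion procedure may differ).
mfrm_scores = np.interp(
    mfrm_logits,
    (mfrm_logits.min(), mfrm_logits.max()),
    (ctt_scores.min(), ctt_scores.max()),
)

# Step 3: compare the two sets of scores, as the abstract describes.
r, p_corr = stats.pearsonr(ctt_scores, mfrm_scores)    # strength of agreement
t, p_diff = stats.ttest_rel(ctt_scores, mfrm_scores)   # paired mean difference
print(f"Pearson r = {r:.2f} (p = {p_corr:.3f}); paired t = {t:.2f} (p = {p_diff:.3f})")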

Published

2022-04-01

How to Cite

Polat, M., Sölpük Turhan, N., & Toraman, Ç. (2022). Comparison of Classical Test Theory vs. Multi-Facet Rasch Theory in writing assessment. Pegem Journal of Education and Instruction, 12(2), 213–225. https://doi.org/10.47750/pegegog.12.02.21

Issue

Vol. 12 No. 2 (2022)

Section

Article
