The Korean Association for the Study of English Language and Linguistics
[ Article ]
Korea Journal of English Language and Linguistics - Vol. 26, No. 0, pp. 580-605
ISSN: 1598-1398 (Print) 2586-7474 (Online)
Print publication date 30 Apr 2026
Received 16 Mar 2026; Revised 09 Apr 2026; Accepted 09 Apr 2026
DOI: https://doi.org/10.15738/kjell.26..202604.580

Developing and Validating a Curriculum-Based Rating Scale for Korean Middle School EFL Writing: An Argument-Based Validation Approach

Haeyun Jin ; Wooyeon Kim ; Sun-Young Oh
(First author) Assistant Professor, Department of English Language and Literature, Korea National Open University, haeyunj@knou.ac.kr
Doctoral Student, Department of English Language Education, Seoul National University, hallokatze@snu.ac.kr
(Corresponding author) Professor, Department of English Language Education & Learning Sciences Research Institute, College of Education, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, Korea, Tel: +82-2-880-7675, sunoh@snu.ac.kr


© 2026 KASELL All rights reserved
This is an open-access article distributed under the terms of the Creative Commons License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Advances in artificial intelligence have led to the rapid expansion of automated writing evaluation tools for second language writing. However, the effectiveness of such systems depends critically on the assessment frameworks used to generate training data and interpret scores. In particular, little research has examined the development and empirical functioning of curriculum-based rating scales designed for secondary-level learners. Situated within a larger project developing an AI-assisted writing feedback tool for middle school English learners, this study reports the development and validation of a rating scale for assessing middle school EFL writing. Guided by an argument-based validation framework (Knoch and Chapelle 2018), the study examines the domain definition and evaluation inferences relevant to rater-mediated assessment. The scale was developed based on analyses of the national curriculum, middle school textbooks, and learner writing data. Its functioning was examined using Many-Facet Rasch Measurement (MFRM). The results indicate that the rating scale reflects key dimensions of middle school EFL writing and that its criteria functioned consistently across raters. Category analyses further showed that the five score levels distinguished meaningful differences in student performance. The findings provide empirical support for the rating scale and offer methodological insights for developing rating frameworks for future AI-assisted writing assessment systems.
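Note on the measurement model: the MFRM analysis referred to above follows the general many-facet Rasch formulation for rater-mediated assessment (cf. Eckes 2015). In its standard rating-scale form, with facets for examinees, rating criteria, and raters, the model can be written as

$$\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k,$$

where $\theta_n$ is the ability of examinee $n$, $\delta_i$ the difficulty of criterion $i$, $\alpha_j$ the severity of rater $j$, and $\tau_k$ the threshold between adjacent score categories $k-1$ and $k$. This is the textbook formulation rather than the exact parameterization reported in the study itself.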

Keywords:


Acknowledgments

This work was supported by the Learning Sciences Research Institute at Seoul National University (0767-20240007).

References

  • Aluthman, E. S. 2016. The effect of using automated essay evaluation on ESL undergraduate students’ writing skill. International Journal of English Linguistics 6(5), 54-70. [https://doi.org/10.5539/ijel.v6n5p54]
  • Attali, Y. and J. Burstein. 2006. Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment 4(3).
  • Bae, J. and L. F. Bachman. 2010. An investigation of four writing traits and two tasks across two languages. Language Testing 27(2), 213-234. [https://doi.org/10.1177/0265532209349470]
  • Bae, J. and Y. S. Lee. 2012. Evaluating the development of children’s writing ability in an EFL context. Language Assessment Quarterly 9(4), 348-374. [https://doi.org/10.1080/15434303.2012.721424]
  • Cotos, E. 2014. Genre-Based Automated Writing Evaluation for L2 Research Writing: From Design to Evaluation and Enhancement. London: Palgrave Macmillan. [https://doi.org/10.1057/9781137333377]
  • Eckes, T. 2012. Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly 9, 270-292. [https://doi.org/10.1080/15434303.2011.649381]
  • Eckes, T. 2015. Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2nd rev. and updated ed.). Bern, Switzerland: Peter Lang Verlag.
  • Fulcher, G. 1987. Tests of oral performance: The need for data-based criteria. ELT Journal 41(4), 287-291. [https://doi.org/10.1093/elt/41.4.287]
  • Fulcher, G. 2003. Testing Second Language Speaking. London: Routledge.
  • Fulcher, G. 2012. Scoring performance tests. In G. Fulcher and F. Davidson, eds., The Routledge Handbook of Language Testing, 378-392. London: Routledge. [https://doi.org/10.4324/9780203181287]
  • Hayes, A. F. and K. Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1(1), 77-89. [https://doi.org/10.1080/19312450709336664]
  • Huang, S. and W. A. Renandya. 2020. Exploring the integration of automated feedback among lower-proficiency EFL learners. Innovation in Language Learning and Teaching 14(1), 15-26. [https://doi.org/10.1080/17501229.2018.1471083]
  • Hudson, J. A. and L. R. Shapiro. 1991. From knowing to telling: The development of children’s scripts, stories, and personal narratives. In A. McCabe and C. Peterson, eds., Developing Narrative Structure, 89-135. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
  • Janssen, G., V. Meier and J. Trace. 2015. Building a better rubric: Mixed methods rubric revision. Assessing Writing 26, 51-66. [https://doi.org/10.1016/j.asw.2015.07.002]
  • Kane, M. T. 2013. Validating the interpretations and uses of test scores. Journal of Educational Measurement 50, 1-73. [https://doi.org/10.1111/jedm.12000]
  • Kane, M. T., B. E. Clauser and J. Kane. 2017. A validation framework for credentialing tests. In C. W. Buckendahl and S. Davis-Becker, eds., Testing in the Professions: Credentialing Policies and Practice, 20-41. London: Routledge. [https://doi.org/10.4324/9781315751672-2]
  • Knoch, U. 2009. Diagnostic assessment of writing: A comparison of two rating scales. Language Testing 26(2), 275-304. [https://doi.org/10.1177/0265532208101008]
  • Knoch, U. and C. A. Chapelle. 2018. Validation of rating processes within an argument-based framework. Language Testing 35(4), 477-499. [https://doi.org/10.1177/0265532217710049]
  • Knoch, U., B. Deygers and A. Khamboonruang. 2021. Revisiting rating scale development for rater-mediated language performance assessments: Modelling construct and contextual choices made by scale developers. Language Testing 38(4), 602-626. [https://doi.org/10.1177/0265532221994052]
  • Koltovskaia, S. 2020. Student engagement with automated written corrective feedback: A case study of two language learners’ experiences with Grammarly. Computer Assisted Language Learning 33(5-6), 510-527.
  • Korean Ministry of Education. 2022. 2022 revised national curriculum: English. Available online at https://ncic.re.kr/
  • Link, S., M. Mehrzad and M. Rahimi. 2022. Impact of automated writing evaluation on teacher feedback, student revision, and writing improvement. Computers & Education 181, 104458.
  • McNamara, T. 1996. Measuring Second Language Performance. Oxford, UK: Blackwell.
  • McNamara, T. 2002. Discourse and assessment. Annual Review of Applied Linguistics 22, 221-242. [https://doi.org/10.1017/S0267190502000120]
  • Mendoza, A. and U. Knoch. 2018. Examining the validity of an analytic rating scale for a Spanish test for academic purposes using the argument-based approach to validation. Assessing Writing 35, 41-55. [https://doi.org/10.1016/j.asw.2017.12.003]
  • Messick, S. 1989. Validity. In R. L. Linn, ed., Educational Measurement, 3rd ed., 13-103. Washington, DC & New York: American Council on Education & Macmillan.
  • Montee, M. and M. Malone. 2014. Writing scoring criteria and score reports. In A. Kunnan, ed., The Companion to Language Assessment, Vol. 2, 847-859. Oxford, UK: Wiley-Blackwell. [https://doi.org/10.1002/9781118411360.wbcla112]
  • Peterson, S., R. Childs and K. Kennedy. 2004. Written feedback and scoring of sixth-grade girls’ and boys’ narrative and persuasive writing. Assessing Writing 9(2), 160-180. [https://doi.org/10.1016/j.asw.2004.07.002]
  • Shermis, M. D. and J. Burstein. 2003. Automated Essay Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ: Lawrence Erlbaum Associates Publishers. [https://doi.org/10.4324/9781410606860]
  • Stevenson, M. and A. Phakiti. 2019. Automated writing evaluation: Investigating the feedback–revision relationship and learner engagement. Language Learning & Technology 23(3), 66-96.
  • Tian, L. and Y. Zhou. 2020. Learner engagement with automated feedback, peer feedback, and teacher feedback in an online EFL writing context. System 91, 102247. [https://doi.org/10.1016/j.system.2020.102247]
  • Whang, S. E., Y. Roh and H. Song. 2023. Data collection and quality challenges in deep learning: A data-centric AI perspective. The VLDB Journal 32, 791-813. [https://doi.org/10.1007/s00778-022-00775-9]
  • Wilson, J. and R. D. Roscoe. 2020. Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research 58(1), 87-125. [https://doi.org/10.1177/0735633119830764]
  • Woodworth, J. and K. Barkaoui. 2020. Perspectives on using automated writing evaluation systems to provide written corrective feedback in the ESL classroom. TESL Canada Journal 37(2), 234-247. [https://doi.org/10.18806/tesl.v37i2.1340]
  • Wright, B. D. and J. M. Linacre. 1994. Reasonable mean-square fit values. Rasch Measurement Transactions 8, 370.
  • Youn, S. Y. 2015. Validity argument for assessing L2 pragmatics in interaction using mixed methods. Language Testing 32(2), 199-225. [https://doi.org/10.1177/0265532214557113]