
Classifying TED-Ed Texts by CEFR-Based Analytic Scale: Human Judgment, LLM Prompting, and Stability Analysis
© 2026 KASELL All rights reserved
This is an open-access article distributed under the terms of the Creative Commons License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
This study examines how closely large language model (LLM) classifications produced under different prompting strategies align with human Common European Framework of Reference for Languages (CEFR) classifications of TED-Ed transcripts, educational video materials from the TED-Ed platform, for Korean learners of English as a foreign language (EFL). Two trained raters evaluated 321 texts using a CEFR-based analytic framework tailored to the Korean EFL context, incorporating vocabulary, syntax, discourse organization, cognitive demands, and cultural concepts. Most texts were rated at B2–C1, indicating upper-intermediate to advanced reading demands. In contrast, traditional readability indices, including Lexile and Oxford 3000/5000-based measures, placed most texts at A2–B1, reflecting their reliance on lexico-syntactic features rather than discourse, conceptual, and cultural demands. A stratified subset of 71 texts was then classified by GPT-5 under three prompting conditions: zero-shot (no exemplars), fixed few-shot (constant exemplars), and randomized few-shot (varying exemplars across runs). Agreement with human ratings was κ = .45 for zero-shot prompting and κ = .48 for fixed few-shot prompting, while randomized few-shot prompting yielded κ values between .60 and .64 with higher variability across runs. Human inter-rater reliability was κ = .76, indicating that LLM classifications did not reach the consistency of trained raters. Differences between the human and LLM evaluation procedures, including the simplified implementation of the human rating framework in the prompt, may partly explain this gap. The findings suggest that LLM-assisted CEFR classification can support, but not replace, human judgment, and that prompting design affects both agreement and stability.
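The agreement figures above are kappa statistics. As a rough illustration of how such a value can be computed for two sets of CEFR labels, the following Python sketch implements Cohen's kappa in its disagreement form. The abstract does not say whether the reported κ values were weighted, so the optional quadratic weighting is an assumption. The function name `cohen_kappa` and the toy labels are hypothetical and are not data from the study.

```python
# Minimal sketch of Cohen's kappa for ordinal CEFR labels.
# Assumptions: six CEFR levels, optional quadratic weights, invented toy data.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def cohen_kappa(rater_a, rater_b, levels=LEVELS, weighted=False):
    """Return Cohen's kappa for two equal-length label sequences.

    With weighted=True, disagreements are penalized by the squared
    distance between levels (quadratic weighting).
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of texts.")

    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(rater_a)

    # Joint proportions: observed[i][j] is the share of texts that
    # rater A put at level i and rater B put at level j.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weighted:
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    # Observed disagreement versus disagreement expected by chance.
    d_obs = sum(disagreement(i, j) * observed[i][j]
                for i in range(k) for j in range(k))
    d_exp = sum(disagreement(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k))

    if d_exp == 0:
        return float("nan")  # undefined when both raters use a single level
    return 1.0 - d_obs / d_exp


# Toy labels, invented for illustration only.
human = ["A2", "B1", "B2", "C1"]
llm = ["A2", "B1", "B2", "B2"]

kappa_unweighted = cohen_kappa(human, llm)
kappa_weighted = cohen_kappa(human, llm, weighted=True)
print(kappa_unweighted, kappa_weighted)
```

In this toy case the single disagreement is between adjacent levels, so the weighted value comes out higher than the unweighted one. That is the general reason weighting is often preferred for ordinal scales such as the CEFR.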
Keywords:
CEFR, TED-Ed, LLM prompting, text classification, automated assessment