The Korean Association for the Study of English Language and Linguistics

[ Article ]

Korea Journal of English Language and Linguistics - Vol. 25, No. 0, pp. 1468-1495

ISSN: 1598-1398 (Print) 2586-7474 (Online)

Print publication date 31 Jan 2025

Received 11 Sep 2025 Revised 15 Oct 2025 Accepted 25 Oct 2025

DOI: https://doi.org/10.15738/kjell.25..202511.1468

Can GPT-4o Reason about Language? A Syntax Challenge

Hye-Won Choi ; Soo-Yeon Kim ; Sanghoun Song

(First author) Professor, Department of English Language and Literature, Ewha Womans University, 52, Ewhayeodae-gil, Seodaemun-gu, Seoul, 03760, Republic of Korea, hwchoi@ewha.ac.kr
(Co-author) Professor, English Data Convergence Major, Sejong University, 209, Neungdong-ro, Gwangjin-gu, Seoul, 05006, Republic of Korea, kimsy@sejong.ac.kr
(Corresponding author) Associate Professor, Department of Linguistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea, sanghoun@korea.ac.kr

© 2025 KASELL All rights reserved
This is an open-access article distributed under the terms of the Creative Commons License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This study investigates the capacity of GPT-4o, a multimodal large language model, to engage in linguistic analysis through a syntax exam designed to probe foundational concepts such as constituency, ambiguity, and recursion in both English and Korean. While often providing accurate definitions and fluent responses, the model struggles to apply syntactic principles consistently, especially in tasks requiring structural reasoning and tree diagram generation. The model’s frequent misinterpretations and incoherent analyses within and across tasks reveal a reliance on pattern recognition and heuristics rather than a systematic grasp of hierarchical structures and fundamental linguistic reasoning. These findings point to the limitations of current large language models in performing metalinguistic analyses, exposing a gap between surface-level performance and genuine metalinguistic competence, which in turn presupposes linguistic competence. By examining GPT-4o’s responses across a range of syntactic challenges, this study emphasizes the need for more rigorous evaluation frameworks that go beyond surface-level fluency to assess models’ capacity for human-like linguistic reasoning and analysis.

Keywords: Large Language Model (LLM), GPT-4o, syntax, constituency, ambiguity, recursion
