The Korean Association for the Study of English Language and Linguistics
[ Article ]
Korea Journal of English Language and Linguistics - Vol. 24, No. 0, pp. 568-588
ISSN: 1598-1398 (Print) 2586-7474 (Online)
Print publication date 31 Jan 2024
Received: 01 Feb 2024; Revised: 11 Apr 2024; Accepted: 19 Jun 2024
DOI: https://doi.org/10.15738/kjell.24..202406.568

Investigating Grammatical Transfer in Korean-English GPT2 Language Models

Keonwoo Koo ; Jaemin Lee ; Myung-Kwan Park
(co-1st author) Ph.D. Candidate, Department of English, Dongguk University qjelrjsdn@naver.com
(co-1st author) MA Candidate, Department of English, Dongguk University whd7987@gmail.com
(corresponding author) Professor, Department of English, Dongguk University parkmk@dgu.edu


© 2024 KASELL All rights reserved
This is an open-access article distributed under the terms of the Creative Commons License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

With the recent success of artificial neural language models (LMs), their language acquisition has gained much attention (Futrell et al. 2019, Hu et al. 2020, Linzen et al. 2016, Warstadt et al. 2020, Wilcox et al. 2018). This paper delves into their second language (L2) acquisition, a largely unexplored area compared to their first language (L1) learning. The primary focus is on unraveling transfer effects originating from the L1's linguistic structures. By closely examining our LMs' performance on English grammar tasks, this study inspects how LMs encode abstract grammatical knowledge, particularly how pre-training biases acquired from Korean (L1) influence English (L2) performance. We present exploratory experiments in which LMs were first trained on a dataset representing the initial (L1) acquisition stage and then fine-tuned on a second language (L2) dataset. We analyzed cross-lingual transfer effects across diverse linguistic phenomena with the BLiMP test suite. We found that L1 pre-training did not accelerate linguistic generalization in the second language. Furthermore, our results revealed significant L1 interference, whereby knowledge of the first language hindered the LMs' ability to acquire and apply second language rules.
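The BLiMP evaluation mentioned above amounts to a forced-choice comparison of sentence probabilities within each minimal pair. The sketch below is a minimal illustration, not the authors' code, of how such scoring is typically done with a GPT-2 model from the Hugging Face transformers library; the "gpt2" checkpoint and the agreement sentence pair are placeholders, since the Korean-pre-trained, English-fine-tuned models used in the study are not reproduced here.

    # Minimal-pair (BLiMP-style) scoring sketch: the model "passes" an item when it
    # assigns a higher total log-probability to the grammatical sentence than to
    # its ungrammatical counterpart. Checkpoint and sentences are illustrative.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder checkpoint
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def sentence_log_prob(sentence):
        # Total log-probability of the sentence under the language model.
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids the model returns the mean cross-entropy over
            # the predicted tokens; multiplying by their count gives the total
            # negative log-likelihood, which is negated to yield a log-probability.
            loss = model(ids, labels=ids).loss
        return -loss.item() * (ids.size(1) - 1)

    # An illustrative subject-verb agreement minimal pair (not taken from BLiMP itself):
    good = "The cats that the dog chases are hungry."
    bad = "The cats that the dog chases is hungry."
    print(sentence_log_prob(good) > sentence_log_prob(bad))  # True counts as a pass

Accuracy on a BLiMP paradigm is then simply the proportion of such pairs for which the comparison comes out in favor of the grammatical sentence.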

Keywords:

second language acquisition, neural language model, GPT-2, transfer effects, L1-interference

References

  • Adger, D. 2003. Core Syntax: A Minimalist Approach. Oxford: Oxford University Press. [https://doi.org/10.1093/oso/9780199243709.001.0001]
  • Artetxe, M., G. Labaka and E. Agirre. 2018. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the AAAI Conference on Artificial Intelligence 32.1. [https://doi.org/10.1609/aaai.v32i1.11992]
  • Blevins, T., H. Gonen and L. Zettlemoyer. 2022. Analyzing the mono- and cross-lingual pretraining dynamics of multilingual language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 3575–3590. [https://doi.org/10.18653/v1/2022.emnlp-main.234]
  • Brown, T., B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal and D. Amodei. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877-1901.
  • Chang, T. A., Z. Tu and B. K. Bergen. 2022. The geometry of multilingual language model representations. arXiv preprint arXiv:2205.10964. [https://doi.org/10.18653/v1/2022.emnlp-main.9]
  • Chiswick, B. R. and P. W. Miller. 2003. A test of the critical period hypothesis for language learning. Journal of Multilingual and Multicultural Development 29(1), 16-29. [https://doi.org/10.2167/jmmd555.0]
  • Conneau, A., G. Lample, M. A. Ranzato, L. Denoyer and H. Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
  • Conneau, A., G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk and V. Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2475-2485. [https://doi.org/10.18653/v1/D18-1269]
  • Conneau, A., A. Baevski, R. Collobert, A. Mohamed and M. Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979. [https://doi.org/10.21437/Interspeech.2021-329]
  • Deshpande, A., T. Talukdar and K. Narasimhan. 2022. When is BERT multilingual? Isolating crucial ingredients for cross-lingual transfer. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3610–3623. [https://doi.org/10.18653/v1/2022.naacl-main.264]
  • Dong, C., C. C. Loy, K. He and X. Tang. 2015. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2), 295-307. [https://doi.org/10.1109/TPAMI.2015.2439281]
  • Ellis, R. 2010. Second language acquisition, teacher education and language pedagogy. Language Teaching 43(2), 182-201. [https://doi.org/10.1017/S0261444809990139]
  • Ettinger, A. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics 8, 24-48. [https://doi.org/10.1162/tacl_a_00298]
  • Futrell, R., E. Wilcox, T. Morita and R. Levy. 2019. RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint arXiv:1809.01329.
  • Giulianelli, M., J. Harding, F. Mohnert, D. Hupkes and W. Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 240–248. [https://doi.org/10.18653/v1/W18-5426]
  • Gulordava, K., P. Bojanowski, E. Grave, T. Linzen and M. Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1195–1205. [https://doi.org/10.18653/v1/N18-1108]
  • Grimes, B. F. and J. E. Grimes. 2022. Ethnologue: Languages of the World. Dallas, TX: SIL International.
  • Hatch, E. M. 1983. Psycholinguistics: A Second Language Perspective. Rowley, MA: Newbury House Publishers.
  • Hu, J., J. Gauthier, P. Qian, E. Wilcox and R. P. Levy. 2020. A systematic assessment of syntactic generalization in neural language models. arXiv preprint arXiv:2005.03692. [https://doi.org/10.18653/v1/2020.acl-main.158]
  • Huebner, P. A. and J. A. Willits. 2021. Using lexical context to discover the noun category: Younger children have it easier. In Psychology of Learning and Motivation, 279-331. [https://doi.org/10.1016/bs.plm.2021.08.002]
  • Huebner, P. A., E. Sulem, C. Fisher and D. Roth. 2021. BabyBERTa: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th Conference on Computational Natural Language Learning, 624-646. [https://doi.org/10.18653/v1/2021.conll-1.49]
  • Kirov, C. and R. Cotterell. 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate. Transactions of the Association for Computational Linguistics 6, 651-665. [https://doi.org/10.1162/tacl_a_00247]
  • Krashen, S. 1977. The monitor model for adult second language performance. Viewpoints on English as a Second Language, 152-161.
  • Krashen, S. 1981. Second language acquisition. Second Language Learning 3(7), 19-39.
  • Koo, K., J. Lee and M.-K. Park. 2022. An assessment of processing negative polarity items by an L2 neural language model trained on English textbooks. Language and Information Society 46, 103-126.
  • Koo, K., J. Lee and M.-K. Park. 2023. Hierarchical inductive bias in the L2 textbook-T5 and Child-T5 language model: A study of data and architecture. Korean Journal of Applied Linguistics 39(4), 179-196. [https://doi.org/10.17154/kjal.2023.12.39.4.179]
  • Lakretz, Y., G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene and M. Baroni. 2019. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 11–20. [https://doi.org/10.18653/v1/N19-1002]
  • Lample, G. and A. Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
  • Linzen, T., E. Dupoux and Y. Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4, 521-535. [https://doi.org/10.1162/tacl_a_00115]
  • Manning, C. D. 2015. Last words: Computational linguistics and deep learning. Computational Linguistics 41(4), 701–707. [https://doi.org/10.1162/COLI_a_00239]
  • Niu, J. and G. Penn. 2020. Grammaticality and language modelling. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. [https://doi.org/10.18653/v1/2020.eval4nlp-1.11]
  • Oba, M., T. Kuribayashi, H. Ouchi and T. Watanabe. 2023. Second language acquisition of neural language models. arXiv preprint arXiv:2306.02920. [https://doi.org/10.18653/v1/2023.findings-acl.856]
  • Papadimitriou, I. and D. Jurafsky. 2020. Learning music helps you read: Using transfer to study linguistic structure in language models. arXiv preprint arXiv:2004.14601. [https://doi.org/10.18653/v1/2020.emnlp-main.554]
  • Pérez-Mayos, L., A. T. García, S. Mille and L. Wanner. 2021. Assessing the syntactic capabilities of transformer-based multilingual language models. arXiv preprint arXiv:2105.04688. [https://doi.org/10.18653/v1/2021.findings-acl.333]
  • Pinker, S. and A. Prince. 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition 28(1-2), 73-193. [https://doi.org/10.1016/0010-0277(88)90032-7]
  • Radford, A., J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report.
  • Ri, R. and Y. Tsuruoka. 2022. Pretraining with artificial language: Studying transferable knowledge in language models. arXiv preprint arXiv:2203.10326. [https://doi.org/10.18653/v1/2022.acl-long.504]
  • Ruder, S. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  • Rumelhart, D. E. and J. L. McClelland. 1986. On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 216-271. [https://doi.org/10.7551/mitpress/5236.001.0001]
  • Sag, I. A., T. Wasow and E. M. Bender. 2003. Syntactic Theory: A Formal Introduction. Stanford: CSLI Publications.
  • Shi, F., X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, ... and D. Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, 31210-31227.
  • Sportiche, D., H. Koopman and E. Stabler. 2013. An Introduction to Syntactic Analysis and Theory. John Wiley and Sons.
  • Sprouse, J. and N. Hornstein. 2013. Experimental syntax and island effects: Toward a comprehensive theory of islands. In Experimental Syntax and Island Effects, 1-20. Cambridge: Cambridge University Press. [https://doi.org/10.1017/CBO9781139035309.001]
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, ... and I. Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
  • Warstadt, A., A. Singh and S. R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7, 625–641. [https://doi.org/10.1162/tacl_a_00290]
  • Warstadt, A., A. Parrish, H. Liu, A. Mohananey, W. Peng, S. F. Wang and S. R. Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics 8, 377-392. [https://doi.org/10.1162/tacl_a_00321]
  • Warstadt, A. and S. R. Bowman. 2022. What artificial neural networks can tell us about human language acquisition. Algebraic Structures in Natural Language, 17-60. [https://doi.org/10.1201/9781003205388-2]
  • Wilcox, E., R. Levy, T. Morita and R. Futrell. 2018. What do RNN language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042. [https://doi.org/10.18653/v1/W18-5423]
  • Wu, S. and M. Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077. [https://doi.org/10.18653/v1/D19-1077]
  • Wu, S., A. Conneau, H. Li, L. Zettlemoyer and V. Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464. [https://doi.org/10.18653/v1/2020.acl-main.536]
  • Yadavalli, A., A. Yadavalli and V. Tobin. 2023. SLABERT Talk pretty one day: Modeling second language acquisition with BERT. arXiv preprint arXiv:2305.19589. [https://doi.org/10.18653/v1/2023.acl-long.657]