The Korean Association for the Study of English Language and Linguistics
[ Article ]
Korea Journal of English Language and Linguistics - Vol. 23, No. 0, pp.461-481
ISSN: 1598-1398 (Print) 2586-7474 (Online)
Print publication date 30 Jan 2023
Received 12 Apr 2023 Revised 17 May 2023 Accepted 07 Jun 2023
DOI: https://doi.org/10.15738/kjell.23..202306.461

Decoding BERT’s Internal Processing of Garden-Path Structures through Attention Maps

Jonghyun Lee ; Jeong-Ah Shin
(first author) Senior Researcher, Institute of Humanities, Seoul National University museeq@snu.ac.kr
(corresponding author) Professor, Division of English Language and Literature, Dongguk University jashin@dongguk.edu


© 2023 KASELL All rights reserved
This is an open-access article distributed under the terms of the Creative Commons License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Recent deep neural language models such as BERT have achieved remarkable performance on natural language processing tasks, yet their internal processing remains difficult to interpret. This study examines attention maps to uncover how BERT processes garden-path sentences. The analysis focuses on BERT's use of linguistic cues, namely transitivity, plausibility, and the presence of a comma, and evaluates its capacity to reanalyze misinterpretations. The results show that BERT exhibits human-like syntactic processing: it attends to the presence of a comma, is sensitive to transitivity, and reanalyzes misinterpretations, although it initially lacks sensitivity to plausibility. By concentrating on attention maps, the present study offers insight into the inner workings of BERT and contributes to a deeper understanding of how advanced neural language models acquire and process complex linguistic structures.
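As a rough illustration of what an attention map is, the sketch below computes scaled dot-product attention weights, softmax(QK^T / sqrt(d)), over a toy sentence. It is not the authors' analysis pipeline: the sentence, the two-dimensional vectors, and the function names `softmax` and `attention_map` are hypothetical, and the same vectors serve as both queries and keys, which amounts to assuming identity projections.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Return one row of attention weights per query token.

    Row i, column j is how strongly token i attends to token j,
    computed as softmax(q_i . k_j / sqrt(d)) over all j.
    """
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        rows.append(softmax(scores))
    return rows

# Hypothetical tokens and toy vectors, chosen only for illustration;
# a real model learns high-dimensional vectors per layer and head.
tokens = ["While", "Anna", "dressed", ",", "the", "baby", "cried"]
vectors = [
    [0.2, 0.1],
    [0.9, 0.3],
    [0.4, 1.0],
    [0.1, 0.8],
    [0.3, 0.2],
    [1.0, 0.4],
    [0.5, 0.9],
]

# Identity projections assumed: the same vectors act as queries and keys.
att = attention_map(vectors, vectors)

# Print the map: each row shows where one token directs its attention.
for token, row in zip(tokens, att):
    print(f"{token:>8}: " + " ".join(f"{w:.2f}" for w in row))
```

In a trained model there is one such map per layer and attention head. Libraries such as Hugging Face Transformers can return them when a model is called with `output_attentions=True`, so a row like the one for "dressed" could be inspected to see how much weight falls on the comma or on the following noun.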

Keywords

attention map, natural language processing, psycholinguistics, Transformers, garden-path structure

Acknowledgments

This work was supported by the Park Chung-Jip Scholarship Fund for the Next Generation in English literature and language at Seoul National University in 2022.
