The Korean Association for the Study of English Language and Linguistics
[ Article ]
Korea Journal of English Language and Linguistics - Vol. 24, No. 0, pp.979-1010
ISSN: 1598-1398 (Print) 2586-7474 (Online)
Print publication date 31 Jan 2024
Received 29 Jul 2024 Revised 02 Sep 2024 Accepted 23 Sep 2024
DOI: https://doi.org/10.15738/kjell.24..202409.979

AI-powered text analysis of English literature: A Natural Language Processing (NLP) study centered on Jane Austen’s novel Emma

Young-kyo Oh
Lecturer, Dept. of Education, College of Education, Chonnam National University; 77, Yongbong-ro, Buk-gu, Gwangju, Korea, 61186 5young-kyo@hanmail.net


© 2024 KASELL All rights reserved
This is an open-access article distributed under the terms of the Creative Commons License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This study conducts a data-driven text analysis of Jane Austen’s novel Emma (1815). Text analysis is a research method that uses Natural Language Processing (NLP) techniques to extract meaningful information from large-scale unstructured text data and to uncover new meaning and insights at the contextual level by considering the relationships between a text and its words. The main techniques are text network analysis, topic modeling, and sentiment analysis. We apply these NLP algorithms to Emma as a representative English literary text. First, term frequency (TF) and term frequency-inverse document frequency (TF-IDF) analyses determine the relative importance of words based on how often they occur. Second, to examine the relationships between the novel’s characters, co-occurrence and network centrality analyses are carried out through text network analysis. Third, topic modeling with Latent Dirichlet Allocation (LDA) classifies the novel, which consists of three volumes and 55 chapters, into four topics. Finally, sentiment analysis calculates the degree of positivity and negativity in each volume to produce a quantified sentiment score. By objectively quantifying the literary content and character relationships in 19th-century English fiction, the study aims to help non-literature majors understand and appreciate classic English texts and to draw implications for teaching English text comprehension effectively.
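As a rough illustration of two of the steps named in the abstract (TF-IDF weighting and co-occurrence-based network centrality), the following Python sketch uses only the standard library. It is not the study's implementation: the abstract does not state which tools or formulas were used. The function names (`tokenize`, `tf_idf`, `cooccurrence_degree`), the specific TF-IDF variant, the choice of degree centrality, and the toy sentences (invented for this example, not quotations from the novel) are all assumptions. Topic modeling and sentiment analysis are not covered here.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: TF = count / document length, IDF = log(N / df).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def cooccurrence_degree(sentences, names):
    """Count name pairs that share a sentence and compute degree centrality."""
    edges = Counter()
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        present = sorted(n for n in names if n.lower() in tokens)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    neighbors = {n: set() for n in names}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Degree centrality: share of the other nodes a node is linked to.
    centrality = {n: len(neighbors[n]) / (len(names) - 1) for n in names}
    return edges, centrality


# Toy input: invented sentences standing in for units of the novel.
sentences = [
    "Emma walked with Harriet to the vicarage.",
    "Knightley scolded Emma after the picnic.",
    "Harriet admired Elton, and Emma encouraged her.",
]
names = ["Emma", "Harriet", "Knightley", "Elton"]

weights = tf_idf(sentences)
edges, centrality = cooccurrence_degree(sentences, names)
print(sorted(weights[1].items(), key=lambda kv: -kv[1])[:3])
print(edges.most_common())
print(centrality)
```

In this toy input the word "emma" occurs in every sentence, so its IDF is log(3/3) = 0 and its TF-IDF weight is zero in each sentence. This shows why TF-IDF, unlike raw frequency, discounts words that appear everywhere. In the network, "Emma" co-occurs with all three other names and therefore has a degree centrality of 1.0.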

Keywords: text analysis, Emma, network text analysis, topic modeling, sentiment analysis
