The Korean Association for the Study of English Language and Linguistics


Korea Journal of English Language and Linguistics - Vol. 20

[ Article ]
Korea Journal of English Language and Linguistics - Vol. 20, No. 1, pp.42-63
Abbreviation: KASELL
ISSN: 1598-1398 (Print)
Print publication date: 31 Mar 2020
Received: 10 Feb 2020; Revised: 10 Mar 2020; Accepted: 20 Mar 2020

Data preprocessing and transformation in the sentiment analysis using a deep learning technique
(Korean title: 딥러닝을 활용한 감정 분석 과정에서 필요한 데이터 전처리 및 형태 변형)
Hye-Jin Seo (서혜진) ; Jeong-Ah Shin (신정아)*
First author, Dongguk University
*Corresponding author, Dongguk University

Copyright 2020 KASELL
This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


This study examines how to preprocess and transform data efficiently so that deep learning techniques can be applied to the analysis of linguistic data. Researchers' interest in deep learning has grown explosively worldwide; however, linguists often find it difficult to connect linguistics with deep learning techniques or algorithms because they do not know how or where to begin. This study therefore presents a general procedure for training on data with deep learning algorithms in practice. In particular, we focus on how to preprocess and transform Tweet data for sentiment analysis using deep learning techniques. In addition, we introduce the latest deep learning algorithm, BERT, into the data preprocessing and transformation procedure. Data preprocessing is particularly important because the results of deep learning can vary significantly depending on it. Although the preprocessing procedure can differ according to the aim of the research, this study introduces the general approach that advanced researchers frequently use with deep learning algorithms. The study is expected to lower the barriers to applying deep learning techniques to linguistic data and to make it easier for researchers to conduct deep learning research related to linguistics.

Keywords: data preprocessing, transformation, sentiment analysis, deep learning
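
The abstract does not spell out the concrete preprocessing steps, so the following is only a minimal sketch of what preprocessing and transforming a Tweet for a BERT-style model might look like. It assumes a typical cleaning pass (dropping URLs and @-mentions, unwrapping hashtags, lowercasing) followed by BERT's input convention of `[CLS]` and `[SEP]` markers, padding, and an attention mask. The function names `clean_tweet` and `to_bert_style_input` are hypothetical, the sample tweet is invented, and whitespace splitting stands in for BERT's actual WordPiece tokenizer.

```python
import re

# Patterns for noise that is common in tweets (an assumption, not the paper's list).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Drop URLs and @-mentions, unwrap hashtags, squeeze spaces, lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    return SPACE_RE.sub(" ", text).strip().lower()


def to_bert_style_input(text: str, max_len: int = 16):
    """Wrap tokens in [CLS]/[SEP], pad to max_len, and build an attention mask.

    Whitespace splitting is a stand-in for WordPiece tokenization.
    Assumes max_len >= 2 so both special tokens fit.
    """
    tokens = ["[CLS]"] + text.split()[: max_len - 2] + ["[SEP]"]
    mask = [1] * len(tokens)          # 1 marks a real token
    padding = max_len - len(tokens)   # 0 marks padding
    return tokens + ["[PAD]"] * padding, mask + [0] * padding


tweet = "@user I LOVE this phone!! https://t.co/abc #happy"
cleaned = clean_tweet(tweet)
tokens, mask = to_bert_style_input(cleaned, max_len=8)
print(cleaned)
print(tokens)
print(mask)
```

In a real pipeline the token list would be mapped to vocabulary IDs by the pretrained model's own tokenizer; which cleaning steps to apply (for example, whether to keep emoticons or repeated punctuation as sentiment cues) depends on the aim of the research.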


Seo, Hye-Jin (서혜진), Graduate Student
Division of English Language and Literature, Dongguk University
30 Pildong-ro 1 gil, Jung-gu, Seoul 04620
Tel: 02) 2260-8705
E-mail:

Shin, Jeong-Ah (신정아), Professor
Division of English Language and Literature, Dongguk University
30 Pildong-ro 1 gil, Jung-gu, Seoul 04620
Tel: 02) 2260-3167
E-mail: