A sentiment analysis dataset for code-mixed Sinhala-English language

dc.contributor.author Uthpala, K.
dc.contributor.author Thirukumaran, S.
dc.date.accessioned 2022-08-17T04:42:32Z
dc.date.available 2022-08-17T04:42:32Z
dc.date.issued 2021-09-15
dc.identifier.uri http://drr.vau.ac.lk/handle/123456789/304
dc.description.abstract Communication is an essential part of human life, and technology is now involved in it far more widely than ever before. A single language often does not reflect how people communicate naturally. Many parts of the world are multilingual at a societal level because most people use two or more languages. In multilingual settings, people tend to mix words from all the languages they use, which is known as code-mixing. Moreover, there is an increasing demand for sentiment analysis of social media text, most of which is code-mixed. Such analysis can be helpful in various decision-making processes. Systems trained on monolingual data fail on code-mixed data because of the complexity of mixing at different levels. However, very few resources are available for building models on code-mixed data. Only a few datasets are available, for some popular language pairs such as Hindi-English and Spanish-English. There are no resources available for Sinhala-English code-mixed language, and researchers have not yet given attention to sentiment analysis for it. To address this, we present a sentiment-labelled Sinhala-English corpus built from comments on YouTube videos for sentiment analysis of code-mixed text. Comments from social media do not follow strict grammar rules and are very noisy, so we preprocessed the comments to clean them. We then used an annotation setup to label the comments and create a Sinhala-English dataset of 7,832 comments for sentiment analysis. Each comment was annotated by at least two annotators, and nine voluntary annotators contributed to the annotation process in total. The data are categorized into three classes: positive, negative, and neutral. We applied five machine learning algorithms to the newly created Sinhala-English code-mixed dataset to give insight into it. All of the classification algorithms achieved significant accuracy on the created dataset. The logistic regression algorithm showed the highest macro-average scores for precision, recall, and F1 on the presented dataset. en_US
dc.language.iso en en_US
dc.publisher Faculty of Applied Science en_US
dc.subject Code-mixed language en_US
dc.subject Sentiment analysis en_US
dc.subject Sinhala-English en_US
dc.title A sentiment analysis dataset for code-mixed Sinhala-English language en_US
dc.type Conference paper en_US
dc.identifier.proceedings FARS 2021 en_US
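
The abstract reports macro-average precision, recall, and F1 over the three sentiment classes. As a rough illustration of how such scores are typically computed, the sketch below averages the per-class values with equal weight. The function name `macro_scores`, the label strings, and the toy labels in the usage lines are assumptions made for this example; they are not taken from the paper or its dataset.

```python
from collections import Counter

# Assumed label names for the three classes named in the abstract.
LABELS = ("positive", "negative", "neutral")


def macro_scores(y_true, y_pred, labels=LABELS):
    """Return macro-averaged (precision, recall, F1) over `labels`.

    Each class contributes equally to the average, regardless of how
    many examples it has.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted this class, but it was wrong
            fn[truth] += 1   # missed the true class

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy, made-up labels purely to show the call; not real data.
toy_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
toy_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]
toy_precision, toy_recall, toy_f1 = macro_scores(toy_true, toy_pred)
```

Macro averaging is a common choice when classes may be imbalanced, since a frequent class cannot dominate the score.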

