Bibliša: Aligned Collection Search Tool

Creating a Synthetic Evaluation Dataset for the Serbian SentiWordNet
Izrada sintetičkog evaluativnog skupa podataka za Srpski SentiWordNet
INFOtheca, Scientific paper
INFOteka, Naučni rad
ID: 1.2024.1.3 Number: 1 Volume: 24 Month: 02 Year: 2025 UDC: 811.163.41’322.2
Saša Petalnikar
Institution: University of Belgrade, Multidisciplinary PhD Studies, Intelligent Systems, Belgrade, Serbia
Mail: sasa5linkar@gmail.com
Saša Petalnikar
Institucija: Univerzitet u Beogradu, Multidisciplinarne doktorske studije, Inteligentni sistemi, Beograd, Srbija
E-pošta: sasa5linkar@gmail.com
Abstract
This study presents the creation of a synthetic evaluation dataset for the Serbian SentiWordNet using Large Language Models (LLMs), specifically the Mistral model. Sentiment analysis resources for Serbian are scarce, and this research aims to bridge that gap by generating a dataset for evaluating and enhancing Serbian sentiment analysis tools. Sentiment polarity values from the English SentiWordNet were automatically mapped to the Serbian WordNet via the Inter-Lingual Index (ILI). To refine these values for better alignment with the Serbian language, a new evaluation dataset was created. Initially, 500 synsets from the Serbian WordNet were selected based on their alignment with the senti-pol-sr lexicon and with the mapped values from SentiWordNet. These synsets underwent sentiment polarity classification using the Mistral model. A balanced subset of 75 synsets was then randomly extracted, further refined for sentiment gradation, and manually reviewed. The findings demonstrate high model reliability, with approximately 93% of responses meeting the established acceptability criteria.
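The balanced random extraction mentioned in the abstract can be illustrated with a minimal sketch. It is not the author's code: the three polarity labels, the 25-per-class split, the synthetic synset IDs, and the function name `balanced_subset` are all assumptions made for illustration, since the abstract states only that 75 synsets were drawn from 500 in a balanced way.

```python
import random
from collections import defaultdict


def balanced_subset(labelled, per_class, seed=0):
    """Draw `per_class` synset IDs at random from each polarity class."""
    by_label = defaultdict(list)
    for synset_id, label in labelled:
        by_label[label].append(synset_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    subset = {}
    for label, ids in sorted(by_label.items()):
        if len(ids) < per_class:
            raise ValueError(f"class {label!r} has only {len(ids)} synsets")
        subset[label] = rng.sample(ids, per_class)  # sampling without replacement
    return subset


# Hypothetical input: 500 synsets with made-up IDs and labels (assumption).
labels = ["positive", "negative", "objective"]
labelled = [(f"synset-{i:03d}", labels[i % 3]) for i in range(500)]

# Assumed split: 25 synsets per class, giving 75 in total.
subset = balanced_subset(labelled, per_class=25)
total = sum(len(ids) for ids in subset.values())
print(total)
```

Sampling within each class separately, rather than from the pooled 500, is what keeps the subset balanced even when the classes differ in size.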
Apstrakt
U radu se predstavlja izrada sintetičkog skupa za evaluaciju Srpskog SentiWordNet-a koja koristi velike jezičke modele, posebno model Mistral. Zbog nedostatka resursa za analizu sentimenta na srpskom jeziku, cilj istraživanja je premošćavanje ovog jaza generisanjem skupa za evaluaciju i unapređenje alata za analizu sentimenta na srpskom. Vrednosti polariteta sentimenta iz engleskog SentiWordNet-a automatski su mapirane na Srpski Vordnet. Kako bi se ove vrednosti preciznije prilagodile srpskom jeziku, kreiran je poseban skup za evaluaciju. Inicijalno je odabrano 500 sinsetova iz Srpskog Vordneta, na osnovu njihove usklađenosti sa senti-pol-sr leksikonom i mapiranim vrednostima iz SentiWordNet-a. Ovi sinsetovi su klasifikovani prema polaritetu sentimenta korišćenjem Mistral-a. Izbalansirani podskup od 75 sinsetova nasumično je izdvojen, dodatno profinjen finijom gradacijom sentimenta i ručno pregledan. Rezultati pokazuju visoku preciznost, približno 93%.
Keywords: SentiWordNet, synthetic dataset, Large Language Models, Serbian, sentiment analysis
Ključne reči:
Pages: 53-70
Strane: 53-70