New Textual Corpora for Serbian Language Modeling	Novi tekstualni korpusi za modelovanje srpskog jezika
INFOtheca, Scientific paper	INFOteka, Naučni rad
ID: 1.2024.1.4 Number: 1 Volume: 24 Month: 02 Year: 2025 UDC: 811.163.4’322.2
Mihailo Škorić Institution: University of Belgrade, Faculty of Mining and Geology, Belgrade, Serbia Mail: mihailo.skoric@rgf.bg.ac.rs	Mihailo Škorić Institucija: Univerzitet u Beogradu, Rudarsko-geološki fakultet, Beograd, Srbija E-pošta: mihailo.skoric@rgf.bg.ac.rs
Nikola Janković Institution: University of Belgrade, Faculty of Philology, Belgrade, Serbia Mail: nikolajankovickv@gmail.com	Nikola Janković Institucija: Univerzitet u Beogradu, Filološki fakultet, Beograd, Srbija E-pošta: nikolajankovickv@gmail.com
Abstract This paper presents textual corpora for Serbian (and Serbo-Croatian) that can be used to train large language models and that are publicly available in one of several major online repositories of such resources. Each corpus is classified using multiple methods, and its characteristics are described in detail. Additionally, the paper introduces three new corpora: a new umbrella web corpus of Serbo-Croatian; a new high-quality corpus based on doctoral dissertations from all universities in Serbia, stored in the National Repository of Doctoral Dissertations (NARDUS); and a parallel corpus of dissertation abstracts and their translations, derived from the same source. The uniqueness of both the old and the new corpora is assessed using frequency-based stylometric methods, and the results are briefly discussed.	Apstrakt Ovaj rad će predstaviti tekstualne korpuse za srpski (i srpskohrvatski) koji se mogu koristiti za treniranje velikih jezičkih modela, a koji su javno dostupni na jednom od nekoliko značajnih veb repozitorijuma. Svaki korpus će biti klasifikovan pomoću više metoda i njegove karakteristike će biti detaljno opisane. Pored toga, rad će predstaviti tri nova korpusa: novi krovni veb-korpus za srpskohrvatski, novi visokokvalitetni korpus zasnovan na doktorskim disertacijama pohranjenim u Nacionalnom repozitorijumu doktorskih disertacija sa svih univerziteta u Srbiji, i paralelni korpus prevoda sažetaka iz istog izvora. Jedinstvenost starih i novih korpusa biće ocenjena putem stilometrijskih metoda zasnovanih na frekvenciji, i ukratko će se diskutovati o rezultatima.
Keywords: corpora, Serbian language, language models, evaluation	Ključne reči: korpusi, srpski jezik, jezički modeli, evaluacija
Pages: 71-96	Strane: 71-96

Bibliša: Aligned Collection Search Tool