    Authors: Sokol, Marin;

    The growth of the internet has allowed us to express and share our opinions on an unprecedented scale, while the modern lifestyle has made us busier than ever. As a result, many people make decisions based on limited data, even though vast amounts of pertinent data may be available. Multi-document summarization (MDS) methods could help alleviate this problem. The topic of this thesis is MDS methods for movie reviews, in particular the application of methods traditionally used on texts from other domains to texts from the movie review domain. Study the literature on MDS and movie review summarization. Devise an MDS system for the IMDb dataset. Currently, IMDb shows only the "most helpful" review on the front page for each movie, which does not represent the hundreds or thousands of other reviews that exist for that movie. Your task is to devise a model capable of producing a coherent and helpful review that better represents a number of other reviews and that could prove more useful to a user deciding whether to watch a specific movie. In addressing this task, rely on descriptive statistics to gain an understanding of the data, devise an evaluation strategy, implement baseline models, and then develop and train at least one state-of-the-art NLP model commonly used for this problem. Perform a thorough evaluation of the model, a comparison against sensible baselines, as well as a detailed error analysis and statistical analysis of the results.

    FER Repository
    Master thesis . 2021
    Data sources: FER Repository
    Authors: Vladika, Juraj;

    With an abundance of new data generated every day, machines are increasingly used to extract knowledge from data and to learn to predict new, valuable information. Gathering data with useful labels can be expensive and time-consuming. Active learning (AL) has emerged as an approach that selects only the most informative data instances and hands them to humans for labeling, thus reducing costs and saving time. This thesis lays out a theoretical and mathematical foundation for active learning. Five active learning selection strategies based on uncertainty measures and version-space search are described. Four classification tasks from the domain of natural language processing were solved using two machine learning models: a support vector machine and a recurrent neural network. Experiments were conducted to compare task performance when using active learning strategies versus random sampling. The analysis of the results showed that, on every dataset and with each model, at least one active learning strategy performed better than random sampling. It was also found that it is hard to know beforehand which AL strategy to use for querying data. Tips for implementing the active learning process efficiently in practice are presented.
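
    The contrast between uncertainty-based selection and random sampling described in the abstract above can be illustrated with a minimal sketch. It assumes least-confidence sampling as the uncertainty measure, which is one common choice and not necessarily one of the five strategies studied in the thesis. The function names and the toy probabilities are hypothetical.

```python
import random


def least_confidence_query(probabilities, k):
    """Return indices of the k unlabeled instances the model is least sure about.

    probabilities: one list of class probabilities per unlabeled instance.
    """
    # Uncertainty is 1 minus the probability of the most likely class.
    uncertainty = [1.0 - max(p) for p in probabilities]
    # Rank instance indices from most to least uncertain.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: uncertainty[i], reverse=True)
    return ranked[:k]


def random_query(n_instances, k, seed=0):
    """Baseline: pick k distinct instance indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_instances), k)


# Hypothetical model outputs for four unlabeled instances (two classes).
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
selected = least_confidence_query(pool_probs, k=2)
baseline = random_query(len(pool_probs), k=2)
```

    In a full active learning loop, the selected instances would be labeled by a human, added to the training set, and the model retrained before the next query round.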

    Authors: Palić, Kristijan;

    Nowadays, patient care is mostly documented in electronic health records (EHRs), which provide a large amount of textual information in digital format. While these records are primarily designed for archiving patient information and performing administrative healthcare tasks such as billing, many researchers have found secondary uses for them in various clinical informatics applications, such as automated clinical diagnosis. The topic of this thesis is the task of clinical note-diagnosis mapping, which uses external knowledge for diagnosis inference. The goal of the thesis was not to create a system that can be used in a real-life scenario, but to lay a foundation for future research. The results show that, even with limited infrastructure and time, it is possible to use external knowledge to infer a clinical diagnosis. With better infrastructure and the involvement of field specialists, models based on the approaches presented in this thesis might be usable in real life.

    Authors: Carin, Alen;

    Automatic solving of math word problems (MWPs), which are math problems stated in words, as typically found in elementary school textbooks, is a complex and still unresolved task in natural language processing (NLP). A very deep understanding of language is needed to solve this type of problem. In this work, as in many related works, the focus is on arithmetic word problems, which are a subset of math word problems. The work is motivated by the similarity between automatic math word problem solving and abstractive text summarization, and by recent improvements in automatic text summarizers. What both tasks have in common is the aim of producing a shorter version of the input text that best describes the original. In this work, two automatic math word problem solvers were produced: a reimplementation of the state-of-the-art model on the Ape210K dataset and a BERT-based sequence-to-sequence model. Although the reimplementation is not a perfect match with the original paper, it produced promising results, but a large gap remains between its results and those reported in the original paper. The BERT-based model did not produce meaningful results, potentially due to an implementation bug that has not yet been detected. Nevertheless, in future work it could achieve great results, and I hope that it will inspire further research.

    FER Repository
    Master thesis . 2021
    Data sources: FER Repository
    Authors: Blašković, Mirna;

    This paper sets out to analyse representations of trauma in five contemporary Irish novels in the light of trauma studies: Emer Martin’s The Cruelty Men (2018), Anne Enright’s The Gathering (2007), Julia O’Faolain’s No Country for Young Men (1980), William Trevor’s Fools of Fortune (1988), and Seamus Deane’s Reading in the Dark (1996). The paper examines issues of child abuse and various sorts of trauma, mainly set in an Irish postcolonial context. It also attempts to demonstrate how, in the selected novels, traumatic events from the past still haunt contemporary Irish society.

    Authors: Kurdija, Vedran;

    User comments contain invaluable information for service providers. A significant part of this information lies in the emotions and sentiment of the comment. Manual processing of these comments is expensive and time-consuming. The goal of this thesis was to develop and implement deep learning models that can perform this task. An analysis of the CATACX dataset, which contains public user comments in Croatian labeled with emotions and sentiments, was performed, and correlations between sentiments and emotions were investigated. Machine learning and deep learning models were developed and implemented: a logistic regression model, a long short-term memory (LSTM) model, and a bidirectional LSTM model. The models were used to classify emotions and sentiments and performed considerably better in predicting sentiments, suggesting that predicting emotions is a more difficult task.

    Authors: Čeović, Helena;

    This thesis focuses on developing a high-performing named entity recognition model for addresses. The challenges, which include the diversity, ambiguity, and complexity of the address entity, are introduced, as are the different model architectures used for training the classifier. These include the simpler logistic regression and random forest models as well as a more complex bidirectional LSTM network with a conditional random field layer (BiLSTM-CRF), implemented using the Flair framework. Experiments are conducted with variously configured models on two corpora that are tagged differently based on address entity granularity: the address as a whole and the address as a sequence of subparts. For both corpora, the best results are achieved by a BiLSTM-CRF model with a single RNN layer trained on either standalone BERT embeddings or a stacked combination of BERT and GloVe embeddings.

    FER Repository
    Master thesis . 2021
    Data sources: FER Repository
    Authors: Rajnović, Marko;

    The task of this thesis was to gather data about advertisements from Njuškalo, preprocess it, and then attempt price prediction. A sizable number of advertisements were scraped, and the gathered data was cleaned and prepared for machine learning. In total, three models were implemented: support vector regression, gradient boosting, and a feed-forward neural network. Each of these models had three variants: a non-text variant, with only numeric and categorical features; a text variant, with only the descriptions; and a combined variant, which joins the previous two. The models and variants achieved very similar results. No statistically significant difference was observed between the non-text and combined variants for the support vector regression and gradient boosting models, while the combined variant of the feed-forward network performed better than the non-text one. The model chosen as the best was gradient boosting in its non-text variant, as it took the least time to train and was shown to be statistically the best model for this task.

    Authors: Stone, Graham; Błaszczyńska, Marta; Lebon, Chloé; Morka, Agata; +5 Authors

    There have been significant recent developments in the OA publishing world, and an increasing focus on monographs in particular. There are a number of existing and emerging OA monograph policies, which are leading to an increased focus on business models. Given this dynamic landscape, it was felt that a more in-depth understanding was needed of European monograph publishers’ current business models for open access, their challenges, and their views on how infrastructure for open access monographs could be improved. This white paper builds on the previous OPERAS Business Models Special Interest Group white paper on Business Models for Open Access (Speicher, et al., 2018). In particular, OPERAS wished to gain a better understanding of how the social sciences and humanities (SSH) publishing community applies or could apply collaborative models for open access books, and what issues it encounters in this context. We further wanted to recognise the challenges publishers faced when engaging with or thinking about engaging in collaborative models for OA books. Are there sufficient funds, enough human resources? Are relevant infrastructures in place? What kind of support is needed? This white paper reports on an OPERAS survey, which was held between February and April 2021 and was designed to serve two core aims: (1) to further our understanding of the scholarly publishing landscape and of the challenges that publishers face in the context of publishing OA monographs, and (2) to identify main trends (including opportunities and challenges) and the knowledge of collaborative funding and infrastructure models in OA publishing in SSH. The survey received a total of 77 responses from 17 countries: 14 EU states, the UK, Norway, and the United States. The results provide a more comprehensive insight into how OPERAS can make a tangible change and best support the community in building sustainable paths of transition towards collaborative models for open access books. This white paper presents some early observations from the preliminary analysis of the findings.

    Report . 2021
    License: CC BY
    Data sources: ZENODO


    Authors: Pavlović, Luka;

    The aim of this thesis was to develop a machine learning model that successfully detects the presence of stress in Reddit posts. The dataset used in this thesis was created by researchers from Columbia University by scraping the social network Reddit. The data is divided into domains that contain one or more subreddits. Although the dataset contains a number of features, only three were used for the purposes of this thesis: the subreddit, the text of the post, and a label indicating the presence of stress. Four models were built in an attempt to achieve satisfactory performance. Two baseline models were created: a majority class classifier and a Support Vector Machine (SVM) that uses Term Frequency-Inverse Document Frequency (TF-IDF) vectors. In addition, two versions of a deep learning model were created: a Long Short-Term Memory (LSTM) network with Global Vectors for Word Representation (GloVe) and an LSTM with the FastText vectorizer. The hyperparameters of the SVM model and the LSTM with GloVe were optimized, and both models were evaluated on the test set. Despite high expectations for the deep learning models, the best results were obtained with the SVM model.
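
    The two baseline ingredients named in the abstract above, a majority class classifier and TF-IDF vectors, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis implementation: it uses whitespace tokenization and the plain weighting tf * log(N / df), whereas library implementations typically apply smoothed variants. The function names, toy posts, and labels are hypothetical and do not come from the dataset.

```python
import math
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda texts: [most_common_label for _ in texts]


def tfidf_vectors(documents):
    """Compute plain TF-IDF weights, tf * log(N / df), per document.

    Assumes every document is non-empty after whitespace tokenization.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy posts and labels (1 = stress, 0 = no stress).
posts = ["i feel so stressed today", "lovely walk today", "so stressed about exams"]
labels = [1, 0, 1]
predict = majority_class_baseline(labels)
vectors = tfidf_vectors(posts)
```

    A term that occurs in fewer posts receives a higher weight than one that is spread across many posts, which is what makes these sparse vectors useful input for a linear classifier such as an SVM.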

    FER Repository
    Bachelor thesis . 2021
    Data sources: FER Repository