Advanced search in Research products
48 Research products, page 1 of 5

  • Digital Humanities and Cultural Heritage
  • Research data
  • Other research products
  • 2013-2022
  • Dataset
  • CLARIN.SI repository

  • Research data . 2019 . Embargo End Date: 15 Oct 2019
    Open Access
    Authors: 
    Ulčar, Matej;
    Publisher: Faculty of Computer and Information Science, University of Ljubljana
    Project: EC | EMBEDDIA (825153)

    ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus (https://viri.cjvt.si/gigafida/System/Impressum) for 10 epochs. The 1,364,064 most common tokens were provided as the vocabulary during training. The model can also infer embeddings for out-of-vocabulary (OOV) words, since the neural network input is on the character level.
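
Why character-level input covers out-of-vocabulary words can be illustrated with a small sketch. This is not the bilm-tf code: the function name `encode_chars`, the 50-slot width, and the marker ids are assumptions chosen for illustration only.

```python
# Illustrative only: encode a token as a fixed-width sequence of UTF-8 byte ids,
# so any string gets a valid input, whether or not it is in the vocabulary.
MAX_LEN = 50                      # assumed number of slots per token
BOW, EOW, PAD = 256, 257, 258     # assumed ids for begin/end-of-word and padding


def encode_chars(token: str, max_len: int = MAX_LEN) -> list[int]:
    """Map a token to byte ids framed by BOW/EOW and padded to max_len."""
    data = token.encode("utf-8")[: max_len - 2]  # leave room for the two markers
    ids = [BOW, *data, EOW]
    return ids + [PAD] * (max_len - len(ids))


vocabulary = {"hiša", "mesto"}        # toy vocabulary
for word in ["hiša", "hišica"]:       # the second word is out of vocabulary
    ids = encode_chars(word)
    print(word, word in vocabulary, ids[:8])
```

Both words receive an input of the same shape, which is what lets a character-based network produce a representation for a word it never saw as a vocabulary item.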

  • Research data . 2016 . Embargo End Date: 29 May 2016
    Open Access
    Authors: 
    Popović, Maja; Arčan, Mihael;
    Publisher: Insight Centre for Data Analytics, National University of Ireland, Galway
    Project: EC | TraMOOC (644333)

    The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their post-edited versions, and error annotations of the performed post-edit operations. The main advantage of the corpus is the fusion of post-editing and error classification tasks, which have usually been seen as two independent tasks, although naturally they are not.

  • Research data . 2016 . Embargo End Date: 23 Jun 2016
    Open Access
    Authors: 
    Ljubešić, Nikola;
    Publisher: Faculty of Humanities and Social Sciences, University of Zagreb
    Project: EC | ABU-MATRAN (324414)

    srLex is a large inflectional lexicon of the Serbian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the srWaC v1.2 corpus. The MSD tagset follows the MULTEXT-East V5 tagset for Bosnian, available at http://nl.ijs.si/ME/V5/msd/html/msd-bs.html.
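
The 5-tuple layout and the per-million frequency can be sketched as follows. The tab-separated file format, the names `LexEntry`, `parse_entry` and `freq_per_million`, the sample line, and the corpus size are all assumptions for illustration; none of them is taken from srLex or srWaC.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    wordform: str
    lemma: str
    msd: str
    frequency: int
    per_million: float


def parse_entry(line: str) -> LexEntry:
    """Parse one line, assuming five tab-separated fields in the order above."""
    wordform, lemma, msd, freq, pm = line.rstrip("\n").split("\t")
    return LexEntry(wordform, lemma, msd, int(freq), float(pm))


def freq_per_million(count: int, corpus_tokens: int) -> float:
    """Relative frequency scaled to one million corpus tokens."""
    return count / corpus_tokens * 1_000_000


# Made-up example line and corpus size.
entry = parse_entry("kuće\tkuća\tNcfsg\t1200\t2.4\n")
print(entry.lemma, entry.frequency, entry.per_million)
print(freq_per_million(entry.frequency, 500_000_000))
```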

  • Research data . 2022 . Embargo End Date: 04 Feb 2022
    Restricted
    Authors: 
    Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko; Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Ferme, Marko; Borovič, Mladen; Boškovič, Borko; Ojsteršek, Milan; +1 more
    Publisher: Faculty of Electrical Engineering and Computer Science, University of Maribor
    Project: EC | EMBEDDIA (825153)

    The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens) written in 2000-2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/). The theses have significant metadata associated with them, and each thesis in the corpus contains its textual body, i.e. the text without front and back matter. The body is divided into chapters, then into pages, these into paragraphs, and then into sentences. The sentence tokens are tagged with morphosyntactic descriptions (detailed part-of-speech tags) and the words are lemmatised. As opposed to the previous version 1.0, the KAS corpus of Slovene academic writing 2.0 is cleaner and contains segmentation into chapters. The metadata also contains more information about the research fields of each work. Both versions consist of the same number of BSc/BA, MSc/MA, and PhD theses; however, the processing was done from scratch for 2.0, so the number of e.g. pages and tokens is different. Note also that the new version does not contain links to the PNG pictures of individual pages, nor does it contain annotated terms, both present in version 1.0. It is, unlike 1.0, also not mounted on the CLARIN.SI concordancers. The new version is distributed in the canonical TEI encoding, JSON, and as plain text files. In the TEI format, chapter names are denoted with the tag. Each entry in the JSON files has a string ID and a list containing the names of chapters as its first element and the texts as its second element. Chapters without text are represented as an empty string. The plain text files contain only text bodies without segmentation information. References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference.
https://doi.org/10.5281/zenodo.5562228
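
The JSON layout described in this record could be read along the following lines. This is a sketch under assumptions: it supposes that a file is a JSON object mapping each string ID to a two-element list (chapter names, chapter texts); the ID `kas-0001`, the chapter names, and the helper `non_empty_chapters` are invented for illustration.

```python
import json

# Toy data in the assumed layout: {id: [[chapter names], [chapter texts]]}.
sample = json.loads("""
{
  "kas-0001": [["Uvod", "Metode", "Priloge"],
               ["Besedilo uvoda.", "Opis metod.", ""]]
}
""")


def non_empty_chapters(entry):
    """Pair chapter names with texts, skipping chapters whose text is empty."""
    names, texts = entry
    return [(name, text) for name, text in zip(names, texts) if text != ""]


for thesis_id, entry in sample.items():
    for name, text in non_empty_chapters(entry):
        print(thesis_id, name, len(text))
```

Filtering on the empty string follows the statement that chapters without text are represented that way.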

  • Research data . 2020 . Embargo End Date: 30 Oct 2020
    Open Access
    Authors: 
    Armendariz, Carlos; Purver, Matthew; Ulčar, Matej; Pollak, Senja; Ljubešić, Nikola; Robnik-Šikonja, Marko; Granroth-Wilding, Mark; Vaik, Kristiina;
    Publisher: Queen Mary University
    Project: EC | EMBEDDIA (825153)

    The dataset contains human similarity ratings for pairs of words. The annotators were presented with contexts that contained both of the words in the pair and the dataset features two different contexts per pair. The words were sourced from the English, Croatian, Finnish and Slovenian versions of the original Simlex dataset.

  • Research data . 2017 . Embargo End Date: 21 Jun 2017
    Open Access
    Authors: 
    Dobrišek, Simon; Žganec Gros, Jerneja; Žibert, Janez; Mihelič, France; Pavešić, Nikola;
    Publisher: Faculty of Electrical Engineering, University of Ljubljana
    Project: EC | FLUINHIBIT (201634)

    The SOFES speech database (Spoken Flight Enquiries in Slovene) is a collection of transcribed and segmented audio recordings of spoken flight-information enquiries in Slovene. SOFES is built on the basis of the GOPOLIS speech database, which was acquired and compiled by the members of LUKS at the Faculty of Electrical Engineering, University of Ljubljana in the period 1996–1998. The main purpose of the GOPOLIS speech database was the development of an automatic spoken-dialogue system for users who are enquiring about flight information over the telephone. The content of SOFES is, however, sufficiently diverse to allow for the development of more generalized acoustic models of spoken Slovene, which are the key components of various speech technologies, such as speech recognizers and speech synthesizers, as well as biometric speaker-recognition systems, etc.

  • Research data . 2021 . Embargo End Date: 19 May 2021
    Open Access
    Authors: 
    Purver, Matthew; Shekhar, Ravi; Pranjić, Marko; Pollak, Senja; Martinc, Matej;
    Publisher: Styria Media Group
    Project: EC | EMBEDDIA (825153)

    The 24sata news portal consists of a portal with daily news and several smaller portals covering news on specific topics, such as automotive news, health, culinary content, and lifestyle advice. The dataset contains over 650,000 articles in Croatian from 2007 to 2019, as well as assigned tags.

    Description of the Dataset

    The dataset consists of 11 columns and 657806 rows. Each row represents a single news article published on the 24sata news portals. Besides 'www.24sata.hr', the biggest news portal, articles from other niche portals affiliated with 24sata are also included.

    Columns:
    'article_id' - Public id of the article on the news site. The article can be accessed by concatenating the site URL and article_id. For example, the article with article_id 614684 can be accessed at 'www.24sata.hr/--614684'. This id is, by itself, not unique across the dataset: articles from different portals can share the same article_id.
    'site' - The portal the article came from. There are eight different portals, ranging from daily news to more focused portals about automotive technologies and trends, health and wellness, culinary trends and recipes, or lifestyle advice.
    'title' - The title of the news article.
    'lead' - Lead text, a short introduction to the content of an article. Can be empty.
    'content' - The content of the news article, containing the bulk of the text. Can be empty if the whole article fits in the lead text.
    'tags' - Tags, zero or more, separated with a '|' character. Article tags are chosen by the author of the article.
    'section' - The main section of the news portal where the article was posted (does not need to be set). The most frequent section is 'Vijesti' (News).
    'subsection' - The subsection of the section where the article was posted (does not need to be set). Each section can have multiple subsections.
    'authors' - Article authors, zero or more, separated with a '|' character. Authors do not have to sign an article, so this can be empty.
    'published_from' - The date when the article appeared on the portal. Journalists can write an article in advance and pick a future date and time when it will appear on the site. Because of this, 'published_from' can be much later than 'date_created'.
    'date_created' - The date when the article was originally written. For all articles published before 2nd Feb 2010, 'date_created' is set to 2nd Feb 2010; this is the date when the portal was redesigned and the database with news articles recreated.
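
The column conventions above (a composite key, '|'-separated fields, and the URL pattern) can be sketched as follows. The CSV serialisation, the second portal name `example-portal.hr`, the row contents, and the helpers `split_multi` and `article_url` are assumptions for illustration; it is also assumed that the 'site' column holds the site URL and that every portal follows the '/--' pattern of the single example given.

```python
import csv
import io

# Two made-up rows using a subset of the columns described above.
raw = io.StringIO(
    "article_id,site,title,tags,authors\n"
    "614684,www.24sata.hr,Naslov,sport|nogomet,\n"
    "614684,example-portal.hr,Drugi naslov,,A. Autor|B. Autor\n"
)


def split_multi(value: str) -> list[str]:
    """Split a '|'-separated field; an empty field yields an empty list."""
    return value.split("|") if value else []


def article_url(site: str, article_id: str) -> str:
    """Build a URL following the pattern of the example 'www.24sata.hr/--614684'."""
    return f"{site}/--{article_id}"


articles = {}
for row in csv.DictReader(raw):
    key = (row["site"], row["article_id"])  # article_id alone is not unique
    articles[key] = {
        "url": article_url(row["site"], row["article_id"]),
        "tags": split_multi(row["tags"]),
        "authors": split_multi(row["authors"]),
    }

print(len(articles))
```

Keying on the pair (site, article_id) keeps both rows, whereas keying on article_id alone would silently overwrite one of them.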

  • Research data . 2016 . Embargo End Date: 05 Mar 2016
    Open Access
    Authors: 
    Ljubešić, Nikola; Klubička, Filip;
    Publisher: Faculty of Humanities and Social Sciences, University of Zagreb
    Project: EC | ABU-MATRAN (324414)

    hrLex is a large inflectional lexicon of the Croatian language where each entry consists of a (wordform, lemma, MSD) triple. The MSD tagset follows the revised MULTEXT-East V4 tagset for Croatian and Serbian, available at https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping.

  • Research data . 2016 . Embargo End Date: 09 Mar 2016
    Restricted
    Authors: 
    Ljubešić, Nikola; Esplà-Gomis, Miquel; Ortiz Rojas, Sergio; Klubička, Filip; Toral, Antonio;
    Publisher: Jožef Stefan Institute
    Project: EC | ABU-MATRAN (324414)

    The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext on the segment level is around 80% and on the word level around 84%.

  • Research data . 2014 . Embargo End Date: 22 May 2015
    Open Access
    Authors: 
    Erjavec, Tomaž;
    Publisher: Jožef Stefan Institute
    Project: EC | IMPACT (215064)

    The IMP digital library contains historical Slovene books and other publications, altogether 658 texts with over 45,000 pages from the period 1584-1919. Each text contains extensive metadata, per-page links to facsimiles, and hand-corrected transcriptions with structural and editorial annotations. These texts were annotated to be used as a language corpus. In the corpus, each word is marked up with its modernised form, lemma, and morphosyntactic description (fine-grained PoS tag). Note that the annotations are automatic, so they contain a fair number of errors. The digital library is available in source TEI P5 XML and derived HTML. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format used by various concordancers, e.g. CWB and Sketch Engine. Note that the vertical format does not contain all the information from the source TEI.
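
For readers unfamiliar with the vertical format, the following sketch reads a small fragment with one token per line and tab-separated attributes. The column order (word, modernised form, lemma, MSD), the `<s>` sentence tags, the sample tokens and tags, and the function `read_vertical` are assumptions for illustration and are not taken from the IMP distribution.

```python
# Made-up fragment; the column order (word, modernised form, lemma, MSD) is assumed.
sample = """<s>
vſe\tvse\tves\tPg-nsn
je\tje\tbiti\tVa-r3s-n
</s>
"""


def read_vertical(text):
    """Yield sentences as lists of (word, modernised, lemma, msd) tuples."""
    sentence = []
    for line in text.splitlines():
        if line.startswith("<"):          # treated as a structural tag line
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        if line:
            sentence.append(tuple(line.split("\t")))


for sent in read_vertical(sample):
    print([lemma for _, _, lemma, _ in sent])
```

The sketch treats every line starting with `<` as markup, so a token that itself begins with that character would need extra handling.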