The following results are related to Digital Humanities and Cultural Heritage.
53 research outcomes, page 1 of 6
  • research data . 2021 . Embargo End Date: 18 Jun 2021
    Open Access
    Authors:
    Macháček, Dominik; Žilinec, Matúš; Bojar, Ondřej;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | ELITR (825460)

    ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations. The corpus contains the source English video and audio. The...

  • research data . 2021 . Embargo End Date: 24 May 2021
    Open Access
    Authors:
    Novák, Michal; Zouhar, Vilém; Bojar, Ondřej;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | Bergamot (825303)

    The dataset used for the Ptakopět experiment on outbound machine translation. It consists of screenshots of web forms with user queries entered. The queries are also available in text form. The dataset comprises two language versions: English and Czech. Whereas the En...

  • research data . 2021 . Embargo End Date: 11 Mar 2021
    Open Access
    Authors:
    Nedoluzhko, Anna; Novák, Michal; Popel, Martin; Žabokrtský, Zdeněk; Zeman, Daniel;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | Bergamot (825303)

    CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 0.1 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morpho...

  • research data . 2020 . Embargo End Date: 02 Jul 2020
    Open Access
    Authors:
    Çano, Erion;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | ELITR (825460)

    OAGL is a paper metadata dataset consisting of 17528680 records which comprise various scientific publication attributes like abstracts, titles, keywords, publication years, venues, etc. The last field of each record is the page length of the corresponding publication. ...

  • research data . 2020 . Embargo End Date: 19 Jun 2020
    Open Access
    Authors:
    Barančíková, Petra; Bojar, Ondřej;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | Bergamot (825303)

    Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense or generalization. The dataset is a direct expansion of...

  • research data . 2020 . Embargo End Date: 14 Aug 2020
    Open Access
    Authors:
    Parida, Shantipriya; Bojar, Ondřej;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | ROXANNE (833635)

    Hindi Visual Genome 1.1 is an updated version of Hindi Visual Genome 1.0. The update concerns primarily the text part of Hindi Visual Genome, fixing translation issues reported during the WAT 2019 multimodal task. In the image part, only one segment and thus one i...

  • research data . 2019 . Embargo End Date: 05 Dec 2019
    Open Access
    Authors:
    Barančíková, Petra; Bojar, Ondřej;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | Bergamot (825303)

    COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The dataset consists of 4,262 unique sentences with an average length of 10 words,...

  • research data . 2019 . Embargo End Date: 31 Oct 2019
    Open Access
    Authors:
    Çano, Erion;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | ELITR (825460)

    OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples...

  • research data . 2019 . Embargo End Date: 21 Oct 2019
    Open Access
    Authors:
    Çano, Erion;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | ELITR (825460)

    OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release ver...

  • research data . 2019 . Embargo End Date: 12 Sep 2019
    Open Access
    Authors:
    Çano, Erion;
    Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    Project: EC | ELITR (825460)

    OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are...
