research data . Dataset . 2022

Data of the Shared Task on the Disambiguation of German Verbal Idioms at KONVENS 2021

Ehren, Rafael; Lichte, Timm; Waszczuk, Jakub; Kallmeyer, Laura
Open Access German
  • Published: 30 Jan 2022
  • Publisher: Zenodo
Abstract
This dataset was used in the Shared Task on the Disambiguation of German Verbal Idioms (VID) at KONVENS 2021. For further details, please refer to the description paper of the shared task:

Ehren, Rafael, Timm Lichte, Jakub Waszczuk & Laura Kallmeyer. 2021. Shared Task on the Disambiguation of German Verbal Idioms at KONVENS 2021. In Proceedings of the Shared Task on the Disambiguation of German Verbal Idioms at KONVENS 2021. https://doi.org/10.5281/zenodo.5730322. https://konvens.org/proceedings/2021/index.html.

Please cite this paper when using the dataset. The content of the zip file is identical to that of the data directory in the GitHub repository of the shared task.

The dataset consists of 9901 instances of a German VID type or its literal counterpart in context. The set of VID types was pre-selected, so it constitutes a lexical sample dataset. It is a merger of two datasets:
  • COLF-VID (instances with T*)
  • German SemEval-2013 task 5b (instances with S*)

The data comes in tsv files, and every line has the following format:

Instance_ID \t VID_type \t label \t text

Consider this example:

T890202.28.4077 in wasser fallen figuratively Der Streit ums Hormonfleisch zwischen USA und EG provozierte den Polizeieinsatz . Aber nicht nur der Steakverkauf , auch die Aktionen gegen den Hormonstand , auf die sich Gruppen der Bauernopposition schon vorbereitet hatten , <b>fielen</b> <b>ins</b> <b>Wasser</b> . Die Fleischexporteure der USA wollten ihrerseits die " Grüne Woche " zur " Aufklärung " nutzen .

The first column contains the ID (T890202.28.4077 in the example), the second the VID type (in wasser fallen), the third the label (figuratively), and the fourth the sentence containing either the instance of the VID type or its literal counterpart, together with two additional context sentences. The parts of the target expression are marked with the <b> tag (<b>fielen</b> <b>ins</b> <b>Wasser</b>).
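The four-column line format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset or the shared task tooling: the names `Instance`, `parse_line`, and `target_tokens` are hypothetical, and the sample line is a shortened version of the example instance with the columns joined by tab characters.

```python
import re
from typing import List, NamedTuple


class Instance(NamedTuple):
    """One line of a tsv file (hypothetical container, not from the dataset)."""
    instance_id: str
    vid_type: str
    label: str
    text: str


# Non-greedy match so that each <b>...</b> pair yields one token.
B_TAG = re.compile(r"<b>(.*?)</b>")


def parse_line(line: str) -> Instance:
    """Split a line into its four tab-separated columns."""
    # Split on at most three tabs so the text column stays in one piece.
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}")
    return Instance(*fields)


def target_tokens(text: str) -> List[str]:
    """Return the parts of the target expression marked with <b> tags."""
    return B_TAG.findall(text)


# Shortened version of the example instance (context sentences omitted).
sample = (
    "T890202.28.4077\tin wasser fallen\tfiguratively\t"
    "auch die Aktionen gegen den Hormonstand <b>fielen</b> <b>ins</b> <b>Wasser</b> ."
)
inst = parse_line(sample)
tokens = target_tokens(inst.text)
print(inst.vid_type, inst.label, tokens)
```

Splitting with a limit of three keeps the text column intact even if it should happen to contain a tab character; whether that ever occurs in the actual files is not stated in the description.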
There are four possible labels:
  • figuratively
  • literally
  • undecidable
  • both

The first two should be self-explanatory. The label undecidable was used by the annotators if it was not possible to disambiguate an instance given the context. The label both was applied when both the literal and the idiomatic readings were active.
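As a rough illustration of working with the four labels, the sketch below checks that each line carries one of them and counts how often each occurs. The function name `label_counts` is hypothetical, and the sample lines are invented placeholders, not real dataset instances.

```python
from collections import Counter
from typing import Iterable

# The four labels listed in the dataset description.
LABELS = {"figuratively", "literally", "undecidable", "both"}


def label_counts(lines: Iterable[str]) -> Counter:
    """Count labels over tsv lines, rejecting malformed lines and unknown labels."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}")
        label = fields[2]
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts[label] += 1
    return counts


# Invented placeholder lines, for illustration only.
demo_lines = [
    "T0.0.1\tin wasser fallen\tfiguratively\tText <b>a</b> .",
    "T0.0.2\tin wasser fallen\tliterally\tText <b>b</b> .",
    "S0.0.3\tin wasser fallen\tfiguratively\tText <b>c</b> .",
]
demo_counts = label_counts(demo_lines)
print(demo_counts)
```

To run this over a real file, one could pass an open file handle, for example `label_counts(open(path, encoding="utf-8"))`, where `path` points at one of the tsv files from the zip archive.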
Subjects
free text keywords: Natural Language Processing, Shared Task, Multiword Expressions
Communities
  • Digital Humanities and Cultural Heritage