Contributing metadata to the Ethnic and Migrant Minorities' (EMM) Survey Registry as a professional polling/survey company. A training video targeting professional polling/survey companies to entice them to document their surveys on the EMM Survey Registry. Target audience for the video: professional polling/survey companies producing quantitative surveys on ethnic and migrant minorities' integration and/or inclusion.
Anonymized responses to the ARIADNEplus questionnaire, gathered to inform the aggregation of metadata about archaeological resources to be included in the ARIADNEplus Knowledge Base and portal (https://portal.ariadne-infrastructure.eu/). The CSV file includes only the plain responses as provided by 31 archaeological content providers up to 18 October 2021. The Excel file also includes two additional sheets in which the responses about the formats and the aggregation update schedule have been normalised. The responses are discussed in deliverable D12.4 "Final report on data integration", currently under preparation.
The project provides the digital edition of the libretti staged for the election of the Council of the Elders in the Republic of Lucca. The celebration, known as funzione delle Tasche, was repeated every three years from 1636 to 1797. The present edition collects the works from 1636 to 1705 in order to analyze changes and recurring motifs throughout the 17th century in a republican context.
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and sentences with fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses in the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.
List of sense inventories:
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene

The corpus is available in a CoNLL-like tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS tag, its whitespace information (whether or not the token is followed by whitespace), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense "financial institution", but the sense inventory also contains the sense "edge of a river", as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt.
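As a minimal sketch, the seven-column layout described above could be read as follows. The file contents and field values here are invented examples, not taken from the actual corpus files; only the column order follows the description.

```python
# Sketch: parse the CoNLL-like tab-separated format described above.
# Columns, in order: token ID, form, lemma, UPOS tag, whitespace flag,
# sense ID, multiword-expression index. Sample values are hypothetical.

from typing import Dict, List

COLUMNS = ["id", "form", "lemma", "upos", "whitespace", "sense_id", "mwe_index"]

def parse_tokens(lines: List[str]) -> List[Dict[str, str]]:
    """Split each tab-separated line into a dict keyed by column name."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # assume a blank line separates sentences
            continue
        fields = line.split("\t")
        tokens.append(dict(zip(COLUMNS, fields)))
    return tokens

# Hypothetical two-token sentence:
sample = [
    "1\tBanks\tbank\tNOUN\tYes\ten:bank-1\t_",
    "2\tlend\tlend\tVERB\tYes\ten:lend-1\t_",
]
for tok in parse_tokens(sample):
    print(tok["form"], tok["upos"], tok["sense_id"])
```

This treats every row uniformly; a real reader would also need to handle any comment lines or metadata the actual files contain (see 00README.txt).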
MALDI-TOF-MS spectra of collagen extracted from modern reference and archaeological bone samples, used to develop markers for Zooarchaeology by Mass Spectrometry (ZooMS) to distinguish between Equus species. For each sample, digestions were performed with trypsin and with chymotrypsin separately. Information about the species of the samples can be found in the 'sample metadata.csv' file. Information on the extraction and digestion protocols can be found in the associated manuscript. The sequence data contain alignments of the proteins COL1A1 and COL1A2 for the available Equus collagen protein sequences. More information on these files can be found in the manuscript corresponding to this dataset.
SILKNOW Multimodal Cultural Heritage Dataset. Includes text descriptions, images, labels, and predictions made by individual modality classifiers. The data resulted from an export of the SILKNOW Knowledge Graph. See: https://zenodo.org/record/5743090 Repository with code using this dataset available at: https://github.com/silknow/multimodal_cultural_heritage
Goma Ra Kimbi [Woman Power] is a hybrid opera created by Alicja Pilarczyk and Shangazi Masika, based on the oral histories of Kenyan women. A series of Masika's poems about the everyday life of Kenyan women is sung by the Tumaini choir (made up of women from local villages around Kilifi town) and students from Pwani University, to a score written by Pilarczyk that features elements of rap, pop and classical music styles. The piece is also danced to an accompaniment of handmade rattles. Goma Ra Kimbi [Woman Power] reflects the important role that storytelling plays in the life of the tribe.
In 2018 the IPERION-CH Grounds Database was presented to examine how the data produced through the scientific examination of historic painting preparation (or ground) samples from multiple institutions could be combined in a flexible digital form, exploring the presentation of interrelated high-resolution images, text, complex metadata and procedural documentation. The original main user interface is live, though password protected at this time. Work within the SSHOC project aimed to reformat the data to create a more FAIR data-set: in addition to mapping it to a standard ontology to increase interoperability, it has also been made available in the form of open linkable data combined with a SPARQL end-point. A draft version of this live data presentation can be found here. This is a draft data-set, and further work is planned to debug and improve its semantic structure. This deposit contains the CIDOC-CRM mapped data formatted in XML and an example model diagram representing some of the key relationships covered in the data-set. Live access to this data, with documentation and worked examples, can be found at: https://rdf.ng-london.org.uk/sshoc
In 2007 the Raphael Research Resource project began to examine how complex conservation, scientific and art historical research could be combined in a flexible digital form, exploring the presentation of interrelated high-resolution images and text, along with how the data could be stored in relation to an event-driven ontology in the form of RDF triples. The original main user interface is still live. In 2021/21, as part of the SSHOC project, the raw data stored within the system were mapped to the CIDOC CRM using a custom set of Python scripts (https://doi.org/10.5281/zenodo.6461654). The SSHOC work aimed to make this data more FAIR: in addition to mapping it to a standard ontology to increase interoperability, it has also been made available in the form of open linkable data combined with a SPARQL end-point. This live data presentation can be found here. This deposit contains the CIDOC-CRM mapped data formatted in XML and an example model diagram representing some of the key relationships covered in the data-set. Live access to this data, with documentation and worked examples, can be found at: https://rdf.ng-london.org.uk/sshoc
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Project: EC | Bergamot (825303)
Document-level test suite for evaluating gender translation consistency. Our document-level test set consists of selected English documents from the WMT21 newstest annotated with gender information. Czech unannotated references are also included for convenience. We semi-automatically annotated person names and pronouns to identify the gender of these elements as well as their coreferences. Our proposed annotation consists of three elements: (1) an ID, (2) an element class, and (3) a gender. The ID identifies a person's name and its occurrences (name and pronouns). The element class identifies whether the tag refers to a name or a pronoun. Finally, the gender information defines whether the element is masculine or feminine. We applied a series of NLP techniques to automatically identify person names and coreferences. This initial process resulted in a set of 45 documents to be manually annotated. We then manually annotated these documents to make sure they are correctly tagged. See README.md for more details.
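The three-element annotation scheme above can be sketched as a small data structure. This is an illustrative model only: the class names, enum values, and the example sentence are assumptions, not the actual tag syntax used in the test set (see README.md for that).

```python
# Sketch of the annotation scheme described above: each annotated element
# carries (1) an ID linking a name to its coreferent pronouns, (2) an
# element class (name or pronoun), and (3) a gender. All names and the
# example sentence are hypothetical.

from dataclasses import dataclass
from enum import Enum

class ElementClass(Enum):
    NAME = "name"
    PRONOUN = "pronoun"

class Gender(Enum):
    MASCULINE = "masculine"
    FEMININE = "feminine"

@dataclass
class GenderAnnotation:
    entity_id: int              # shared by a name and all its coreferent pronouns
    element_class: ElementClass
    gender: Gender

# Hypothetical annotations for "Mary said she would come":
annotations = [
    GenderAnnotation(1, ElementClass.NAME, Gender.FEMININE),     # "Mary"
    GenderAnnotation(1, ElementClass.PRONOUN, Gender.FEMININE),  # "she"
]
# Consistency check a gender-translation evaluator might rely on:
# elements in the same coreference chain share an ID and a gender.
assert annotations[0].entity_id == annotations[1].entity_id
assert annotations[0].gender == annotations[1].gender
```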