Advanced search in Research products
The following results are related to Digital Humanities and Cultural Heritage.
3 Research products, page 1 of 1

  • Digital Humanities and Cultural Heritage
  • Publications
  • Research data
  • Dataset
  • Infoscience - EPFL scientific publications

  • Open Access
    Authors: 
    Barman, Raphaël; Ehrmann, Maud; Clematide, Simon; Oliveira;
    Country: Switzerland
    Project: SNSF | Media Monitoring of the P... (173719)

    This record contains the datasets and models used and produced for the work reported in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers". Please cite this paper if you are using the models/datasets or find it relevant to your research:

    @article{barman_combining_2020,
      title = {{Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers}},
      author = {Raphaël Barman and Maud Ehrmann and Simon Clematide and Sofia Ares Oliveira and Frédéric Kaplan},
      journal = {Journal of Data Mining \& Digital Humanities},
      volume = {HistoInformatics},
      DOI = {10.5281/zenodo.4065271},
      year = {2021},
      url = {https://jdmdh.episciences.org/7097},
    }

    Please note that this record contains data under different licenses.

    1. DATA

    Annotations (json files): the JSON files contain image annotations, with one file per newspaper containing region annotations (label and coordinates) in VIA format. The following licenses apply:
      • luxwort.json: those annotations are under a CC0 1.0 license. Please refer to the rights statement specified for each image in the file.
      • GDL.json, IMP.json and JDG.json: those annotations are under a CC BY-SA 4.0 license.

    Image files: the archive images.zip contains the Swiss titles image files (GDL, IMP, JDG) used for the experiments described in the paper. Those images are under copyright (property of the journal Le Temps and of ArcInfo) and can be used for academic research or educational purposes only. Redistribution, publication and commercial use are not permitted. These terms of use are similar to the following rights statement: http://rightsstatements.org/vocab/InC-EDU/1.0/

    2. MODELS

    Some of the best models are released under a CC BY-SA 4.0 license (they are also available as assets of the current GitHub release).
      • JDG_flair-FT: this model was trained on JDG using French Flair and FastText embeddings. It is able to predict the four classes presented in the paper (Serial, Weather, Death notice and Stocks).
      • Luxwort_obituary_flair-bpemb: this model was trained on Luxwort using multilingual Flair and Byte-pair embeddings. It is able to predict the Death notice class.
      • Luxwort_obituary_flair-FT_indomain: this model was trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data). It is also able to predict the Death notice class.

    Those models can be used to predict probabilities on new images using the same code as in the original dhSegment repository. Three parameters of the predict function need to be adjusted: 1) embeddings_path (the path to the embeddings list), 2) embeddings_map_path (the path to the compressed embedding map), and 3) embeddings_dim (the size of the embeddings). Please refer to the paper for further information or contact us.

    3. CODE

    https://github.com/dhlab-epfl/dhSegment-text

    4. ACKNOWLEDGEMENTS

    We warmly thank the journal Le Temps (owner of La Gazette de Lausanne and the Journal de Genève) and the group ArcInfo (owner of L'Impartial) for agreeing to share the related datasets for academic purposes. We also thank the National Library of Luxembourg for its support with all steps related to the Luxemburger Wort annotation release. This work was realized in the context of the impresso - Media Monitoring of the Past project and supported by the Swiss National Science Foundation under grant CRSII5_173719.

    5. CONTACT

    Maud Ehrmann (EPFL-DHLAB)
    Simon Clematide (UZH)
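The annotation files described under DATA store region labels and coordinates in VIA format. As a minimal sketch of how such a file might be read, the Python below walks a VIA-style structure and yields one bounding box per region. The sample data, the file name and the "label" attribute key are assumptions made for illustration; the key names actually used in luxwort.json, GDL.json, IMP.json and JDG.json should be checked in the files themselves.

```python
import json
from collections import Counter

# A tiny in-memory stand-in for a VIA annotation file. The file name, the
# "label" attribute key and the coordinates are hypothetical.
SAMPLE_VIA = json.loads("""
{
  "page_0001.png123456": {
    "filename": "page_0001.png",
    "size": 123456,
    "regions": [
      {"shape_attributes": {"name": "rect", "x": 10, "y": 20, "width": 100, "height": 50},
       "region_attributes": {"label": "Weather"}},
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [0, 40, 40, 0],
                            "all_points_y": [0, 0, 30, 30]},
       "region_attributes": {"label": "Death notice"}}
    ],
    "file_attributes": {}
  }
}
""")


def iter_regions(via, label_key="label"):
    """Yield (filename, label, (x0, y0, x1, y1)) for every annotated region."""
    # VIA project files nest the per-image entries under "_via_img_metadata";
    # plain annotation exports keep them at the top level.
    images = via.get("_via_img_metadata", via)
    for entry in images.values():
        regions = entry.get("regions", [])
        # Older VIA exports store regions as a dict keyed by index.
        if isinstance(regions, dict):
            regions = list(regions.values())
        for region in regions:
            shape = region["shape_attributes"]
            if shape.get("name") == "rect":
                x0, y0 = shape["x"], shape["y"]
                x1, y1 = x0 + shape["width"], y0 + shape["height"]
            elif "all_points_x" in shape:
                # Polygons and polylines: reduce to an enclosing box.
                xs, ys = shape["all_points_x"], shape["all_points_y"]
                x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
            else:
                # Other shapes (circle, ellipse, point) are skipped here.
                continue
            label = region.get("region_attributes", {}).get(label_key)
            yield entry["filename"], label, (x0, y0, x1, y1)


def count_labels(via, label_key="label"):
    """Count how many regions carry each label."""
    return Counter(label for _, label, _ in iter_regions(via, label_key))
```

To use it on a real file, replace SAMPLE_VIA with the result of json.load on an opened annotation file and pass the attribute key found in that file as label_key.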

  • Open Access
    Authors: 
    Ehrmann, Maud; Romanello, Matteo; Doucet, Antoine; Clematide, Simon;
    Publisher: Zenodo
    Country: Switzerland
    Project: EC | NewsEye (770299)

    HIPE-2022 datasets used for the HIPE 2022 shared task on named entity recognition and classification (NERC) and entity linking (EL) in multilingual historical documents.

    HIPE-2022 datasets are based on six primary datasets assembled and prepared for the shared task. Primary datasets are composed of historical newspapers and classical commentaries covering ca. 200 years, and feature several languages and different entity tag sets and annotation schemes. They originate from several European cultural heritage projects, from HIPE organizers' previous research project, and from the previous HIPE-2020 campaign. Some are already published, others are released for the first time for HIPE-2022.

    The HIPE-2022 shared task assembles and prepares these primary datasets in HIPE-2022 release(s), which correspond to a single package composed of neatly structured and homogeneously formatted files. Primary datasets undergo the following preparation steps:
      • conversion to the HIPE format (with correction of data inconsistencies and metadata consolidation);
      • rearrangement or composition of train and dev splits.

    Please also refer to:
      • HIPE-2022 shared task website: https://hipe-eval.github.io/HIPE-2022/
      • HIPE-2022 data repository: https://github.com/hipe-eval/HIPE-2022-data

    Here is an overview of the primary datasets:

    Dataset alias   Document type            Languages        Suitable for                  Project
    -------------   ----------------------   --------------   ---------------------------   --------------------
    hipe2020        historical newspapers    de, fr, en       NERC-Coarse, NERC-Fine, EL    CLEF-HIPE-2020
    newseye         historical newspapers    de, fi, fr, sv   NERC-Coarse, NERC-Fine, EL    NewsEye
    sonar           historical newspapers    de               NERC-Coarse, EL               SoNAR
    letemps         historical newspapers    fr               NERC-Coarse, NERC-Fine        LeTemps
    topres19th      historical newspapers    en               NERC-Coarse, EL               Living with Machines
    ajmc            classical commentaries   de, fr, en       NERC-Coarse, NERC-Fine, EL    AjMC

    The HIPE-2022 team expresses its greatest appreciation to the partnering projects, namely AJMC, impresso, HIPE-2020, Living with Machines, NewsEye, and SoNAR, for contributing their NE-annotated datasets (and hiding a part thereof for the time of the evaluation campaign).

    New releases are planned. Check the HIPE-2022 website for updates.
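The overview of primary datasets can also be expressed as a small lookup structure, which makes it easy to pick the datasets suited to a given task and language. The sketch below is illustrative only: the values are copied from the overview in this record, while the names PrimaryDataset, DATASETS and find_datasets are hypothetical and not part of any HIPE-2022 tooling.

```python
from typing import NamedTuple


class PrimaryDataset(NamedTuple):
    alias: str
    document_type: str
    languages: frozenset
    tasks: frozenset
    project: str


def _ds(alias, document_type, languages, tasks, project):
    # Languages and tasks are given as space-separated strings for brevity.
    return PrimaryDataset(alias, document_type,
                          frozenset(languages.split()),
                          frozenset(tasks.split()),
                          project)


NEWS = "historical newspapers"

# Values transcribed from the overview of primary datasets in this record.
DATASETS = [
    _ds("hipe2020", NEWS, "de fr en", "NERC-Coarse NERC-Fine EL", "CLEF-HIPE-2020"),
    _ds("newseye", NEWS, "de fi fr sv", "NERC-Coarse NERC-Fine EL", "NewsEye"),
    _ds("sonar", NEWS, "de", "NERC-Coarse EL", "SoNAR"),
    _ds("letemps", NEWS, "fr", "NERC-Coarse NERC-Fine", "LeTemps"),
    _ds("topres19th", NEWS, "en", "NERC-Coarse EL", "Living with Machines"),
    _ds("ajmc", "classical commentaries", "de fr en", "NERC-Coarse NERC-Fine EL", "AjMC"),
]


def find_datasets(task, language=None):
    """Return aliases of datasets suitable for a task, optionally in one language."""
    return [d.alias for d in DATASETS
            if task in d.tasks and (language is None or language in d.languages)]
```

For instance, find_datasets("EL", "fr") selects the datasets listed as suitable for entity linking that include French, and find_datasets("NERC-Fine") lists every dataset with fine-grained NERC annotations.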

  • Open Access
    Authors: 
    Ehrmann, Maud; Romanello, Matteo; Clematide, Simon; Flückiger, Alex;
    Publisher: Zenodo
    Country: Switzerland

    CLEF-HIPE-2020 (Identifying Historical People, Places and other Entities) is an evaluation campaign on named entity processing on historical newspapers in French, German and English, which was organized in the context of the impresso project and run as a CLEF 2020 Evaluation Lab.

    Data consists of manually annotated historical newspapers in French, German and English.

    For more information, please refer to:
      • the CLEF-HIPE-2020 website;
      • the CLEF-HIPE-2020-eval repository, for the necessary material to replicate the results of the shared task;
      • the CLEF-HIPE-2020 poster presented at CLEF 2019 in Lugano, Switzerland;
      • the CLEF-HIPE-2020 participation guidelines (v1.1);
      • the impresso Named Entity Annotation Guidelines (v2.2.0);
      • the CLEF-HIPE-2020 Extended Overview paper (bibtex below);
      • the participant team CEUR working note papers;
      • the workshop presentation video records.

    A second edition of HIPE is organised in 2022: https://hipe-eval.github.io/HIPE-2022/

    Please cite this paper if you are using the datasets or find the shared task results relevant to your research:

    @inproceedings{ehrmann_extended_2020,
      title = {Extended {Overview} of {CLEF HIPE} 2020: {Named Entity Processing} on {Historical Newspapers}},
      booktitle = {{CLEF 2020 Working Notes}. {Working Notes} of {CLEF} 2020 - {Conference} and {Labs} of the {Evaluation Forum}},
      author = {Ehrmann, Maud and Romanello, Matteo and Fl{\"u}ckiger, Alex and Clematide, Simon},
      editor = {Cappellato, Linda and Eickhoff, Carsten and Ferro, Nicola and N{\'e}v{\'e}ol, Aur{\'e}lie},
      year = {2020},
      volume = {2696},
      pages = {38},
      publisher = {{CEUR-WS}},
      address = {{Thessaloniki, Greece}},
      doi = {10.5281/zenodo.4117566},
      url = {https://infoscience.epfl.ch/record/281054},
    }

    CLEF-HIPE-2020 data v1.3 was used during the shared task; v1.4 is a post-evaluation release (with sentence splitting).