research data . Dataset . 2021

Datasets and Models for Historical Newspaper Article Segmentation

Barman, Raphaël; Ehrmann, Maud; Clematide, Simon; Oliveira;
Open Access
  • Published: 30 Jan 2021
  • Country: Switzerland
Abstract
This record contains the datasets and models used and produced for the work reported in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers". Please cite this paper if you are using the models/datasets or find it relevant to your research:

@article{barman_combining_2020,
  title = {{Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers}},
  author = {Raphaël Barman and Maud Ehrmann and Simon Clematide and Sofia Ares Oliveira and Frédéric Kaplan},
  journal = {Journal of Data Mining \& Digital Humanities},
  volume = {HistoInformatics},
  DOI = {10.5281/zenodo.4065271},
  year = {2021},
  url = {https://jdmdh.episciences.org/7097},
}

Please note that this record contains data under different licenses.

1. DATA

Annotations (JSON files): the JSON files contain image annotations, with one file per newspaper holding region annotations (label and coordinates) in VIA format. The following licenses apply:
  • luxwort.json: these annotations are under a CC0 1.0 license. Please refer to the rights statement specified for each image in the file.
  • GDL.json, IMP.json and JDG.json: these annotations are under a CC BY-SA 4.0 license.

Image files: the archive images.zip contains the image files of the Swiss titles (GDL, IMP, JDG) used for the experiments described in the paper. These images are under copyright (property of the journal Le Temps and of ArcInfo) and can be used for academic research or educational purposes only. Redistribution, publication or commercial use are not permitted. These terms of use are similar to the following rights statement: http://rightsstatements.org/vocab/InC-EDU/1.0/

2. MODELS

Some of the best models are released under a CC BY-SA 4.0 license (they are also available as assets of the current GitHub release):
  • JDG_flair-FT: this model was trained on JDG using French Flair and FastText embeddings. It is able to predict the four classes presented in the paper (Serial, Weather, Death notice and Stocks).
  • Luxwort_obituary_flair-bpemb: this model was trained on Luxwort using multilingual Flair and Byte-pair embeddings. It is able to predict the Death notice class.
  • Luxwort_obituary_flair-FT_indomain: this model was trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data). It is also able to predict the Death notice class.

These models can be used to predict probabilities on new images using the same code as in the original dhSegment repository. Three parameters of the predict function need to be set:
  1) embeddings_path (the path to the embeddings list),
  2) embeddings_map_path (the path to the compressed embedding map), and
  3) embeddings_dim (the size of the embeddings).
Please refer to the paper for further information or contact us.

3. CODE

https://github.com/dhlab-epfl/dhSegment-text

4. ACKNOWLEDGEMENTS

We warmly thank the journal Le Temps (owner of La Gazette de Lausanne and the Journal de Genève) and the group ArcInfo (owner of L'Impartial) for agreeing to share the related datasets for academic purposes. We also thank the National Library of Luxembourg for its support with all steps related to the Luxemburger Wort annotation release. This work was carried out in the context of the impresso - Media Monitoring of the Past project and supported by the Swiss National Science Foundation under grant CRSII5_173719.

5. CONTACT

  • Maud Ehrmann (EPFL-DHLAB)
  • Simon Clematide (UZH)
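As an illustration of the annotation files described under DATA, the following is a minimal sketch of how region annotations in VIA (VGG Image Annotator) JSON could be read and counted per label. It assumes the usual VIA layout, in which each image entry carries a "regions" collection whose items hold "shape_attributes" and "region_attributes". The attribute name "label", the helper names, and the sample data are hypothetical and are not taken from the released files; inspect one of the JSON files to find the attribute key actually used.

```python
import json
from collections import Counter


def load_via(path):
    """Load a VIA annotation file (e.g. one of the per-newspaper JSON files)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_regions(via):
    """Yield (filename, shape_attributes, region_attributes) for every region."""
    # VIA project files nest the per-image entries under "_via_img_metadata";
    # plain annotation exports are the per-image mapping itself.
    images = via.get("_via_img_metadata", via)
    for entry in images.values():
        if not isinstance(entry, dict) or "regions" not in entry:
            continue
        regions = entry["regions"]
        # Older VIA versions store regions as a dict keyed by index.
        if isinstance(regions, dict):
            regions = regions.values()
        for region in regions:
            yield (
                entry.get("filename"),
                region.get("shape_attributes", {}),
                region.get("region_attributes", {}),
            )


def count_labels(via, label_key="label"):
    """Count regions per label; label_key is an assumed attribute name."""
    counts = Counter()
    for _, _, attributes in iter_regions(via):
        counts[attributes.get(label_key, "<missing>")] += 1
    return counts


# Hypothetical in-memory sample mimicking the VIA structure.
sample = {
    "page1.png123": {
        "filename": "page1.png",
        "size": 123,
        "regions": [
            {
                "shape_attributes": {"name": "rect", "x": 10, "y": 20,
                                     "width": 100, "height": 50},
                "region_attributes": {"label": "Serial"},
            },
            {
                "shape_attributes": {"name": "polygon",
                                     "all_points_x": [0, 10, 10],
                                     "all_points_y": [0, 0, 10]},
                "region_attributes": {"label": "Death notice"},
            },
        ],
        "file_attributes": {},
    }
}

print(count_labels(sample))
```

To apply this to a downloaded file, something like count_labels(load_via("JDG.json"), label_key=...) should work once the real attribute key is known.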
Subjects
free text keywords: optical layout recognition, multimodal learning, historical newspaper segmentation, deep learning, digital humanities, historical newspapers, article segmentation
Communities
  • Digital Humanities and Cultural Heritage
  • NEANIAS Space Research Community
Funded by
SNSF | Media Monitoring of the Past
Project
  • Funder: Swiss National Science Foundation (SNSF)
  • Project Code: CRSII5_173719
  • Funding stream: Programmes | Sinergia