Advanced search in Research products
368 Research products (1 rule applied)

  • Digital Humanities and Cultural Heritage
  • French

  • Authors: Gutehrlé Nicolas; Atanassova Iana;

    Dataset for Logical-layout analysis on French Historical Newspapers

    This is a dataset for training and testing logical-layout analysis and recognition systems on French historical documents. The original data is part of the "Fond régional: Franche Comté" collection, which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).

    Description

    The dataset is divided into a train set and a test set. Both have been designed to cover as many as possible of the layouts that occur in the "Fond régional: Franche Comté" collection. To do so, we have divided them into three layout types:
    • 1c: documents where the text is displayed in one column, as in books;
    • 2c: documents where the text is displayed in two columns;
    • 3c+: documents where there are at least 3 columns of text, as in newspapers.

    Each of these folders contains subfolders whose names start with the letters "cb". Each name is the identifier of a newspaper collection such as "Le Petit Semeur". Each of these folders contains an XML file describing the collection, which is not relevant to logical-layout analysis. They also contain subfolders whose names start with the letters "bpt", which contain the following files:
    • XXX.xml: the original XML file as gathered from Gallica.
    • truelabels_block: a CSV file giving the true label of each TextBlock tag. Each line contains the page, the block_id, the first and last line of text of the block, and its label.
    • truelabels_line: a CSV file giving the true label of each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label.
    • XXX_docbook.xml: the document after processing by a logical-layout recognition system.

    The original XML gathers several kinds of information about the document, in particular the metadata (described using the Dublin Core schema), the page numbering, and the OCR, which is described in the XML ALTO format. As such, the files already provide the physical layout analysis and the reading order of the documents.

    The XML ALTO format provides the text content and physical layout of documents in the following manner. The OCR output for the whole document is available in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
    • Id: the tag's identifier
    • Height, Width: the text height and width
    • Vpos: the vertical position of the text on the page. The higher the value, the lower the text is on the page
    • Hpos: the horizontal position of the text on the page. The higher the value, the further to the right the text is on the page
    • Language: the language of the text (only for TextBlock tags).

    The blocks of text are labelled either as Text, Title, Header or Other. The lines of text are labelled either as Text, Firstline (to indicate the first line of a paragraph), Title, Header or Other. These labels are used in the truelabels_line, truelabels_block and XXX_docbook.xml files.

    You can access the original scan of every document on the Gallica website. To do so, use the following URL, replacing the <IDENTIFIER> part with the id of the document (e.g. bpt6k76208717): https://gallica.bnf.fr/ark:/12148/<IDENTIFIER>
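
    The TextBlock / TextLine / String nesting described above can be walked with a few lines of Python. This is a minimal sketch and not the authors' tooling. It assumes that word text is stored in a CONTENT attribute on String tags (as in the ALTO standard), ignores XML namespaces, and treats attribute names case-insensitively, since the description writes "Id" and "Vpos" while ALTO files commonly use upper case. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def attrs(elem):
    """Lower-case attribute names so 'ID', 'Id' and 'id' are treated alike."""
    return {k.lower(): v for k, v in elem.attrib.items()}


def iter_lines(root):
    """Yield (block_id, line_id, vpos, hpos, text) for every TextLine."""
    for block in root.iter():
        if local(block.tag) != "TextBlock":
            continue
        block_id = attrs(block).get("id")
        for line in block.iter():
            if local(line.tag) != "TextLine":
                continue
            a = attrs(line)
            # Words are the String children; SP children are spaces.
            words = [attrs(s).get("content", "")
                     for s in line if local(s.tag) == "String"]
            yield (block_id, a.get("id"),
                   float(a.get("vpos", 0)), float(a.get("hpos", 0)),
                   " ".join(words))


# Made-up sample, only to exercise the function (assumption, not real data).
SAMPLE = """<alto><Layout><Page><PrintSpace>
<TextBlock ID="B1" LANGUAGE="fr">
<TextLine ID="L2" VPOS="160" HPOS="40" HEIGHT="20" WIDTH="380">
<String CONTENT="Besançon"/>
</TextLine>
<TextLine ID="L1" VPOS="120" HPOS="40" HEIGHT="30" WIDTH="400">
<String CONTENT="LE"/><SP/><String CONTENT="PETIT"/><SP/><String CONTENT="SEMEUR"/>
</TextLine>
</TextBlock>
</PrintSpace></Page></Layout></alto>"""

# A higher Vpos means lower on the page, so sorting by (vpos, hpos)
# orders the lines of a single column from top to bottom.
lines = sorted(iter_lines(ET.fromstring(SAMPLE)), key=lambda r: (r[2], r[3]))
for block_id, line_id, vpos, hpos, text in lines:
    print(block_id, line_id, vpos, hpos, text)
```

    Sorting by position is only meaningful within one column; for the 2c and 3c+ layouts the reading order already present in the files is the safer guide.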

    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: ZENODO
    Citations: 0
    Popularity: Average
    Influence: Average
    Impulse: Average
    Views: 59
    Downloads: 12
  • Authors: Gutehrlé Nicolas; Atanassova Iana;

    Dataset for Logical-layout analysis on French historical newspapers

    This dataset is intended for training and testing logical-layout analysis and recognition systems on French historical documents published between 1900 and 1950. The original data is part of the "Fond régional: Franche-Comté" collection, which is curated by Gallica, the digital portal of the Bibliothèque Nationale de France (BnF).

    This dataset has the following structure:

    ├── train
    │   ├── 1c
    │   │   ├── cb32836282t
    │   │   │   ├── cb32836282t.xml
    │   │   │   ├── bpt6k112325g
    │   │   │   │   ├── bpt6k112325g.xml
    │   │   │   │   ├── truelabels_block.csv
    │   │   │   │   ├── truelabels_line.csv
    │   │   │   ├── …
    │   │   ├── …
    │   ├── 2c
    │   ├── 3c+
    └── test
        ├── 1c
        ├── 2c
        └── 3c+

    The dataset is divided into a train set and a test set. Both have been designed to cover as many as possible of the layouts that occur in the "Fond régional: Franche-Comté" collection. To do so, we have divided them into three layout types:
    • 1c: documents where the text is displayed in one column, as in books;
    • 2c: documents where the text is displayed in two columns;
    • 3c+: documents where there are at least 3 columns of text, as in newspapers.

    Each of the 1c, 2c, and 3c+ folders contains subfolders prefixed by "cb", each of which contains a collection of documents. For instance, "cb32836282t" is the identifier used in Gallica for "Le Petit écho du 21e Régiment d'infanterie", a French military periodical published during WWI. An XML file with the same name, for instance "cb32836282t.xml", contains metadata about the collection, such as its title, publisher, creator, number of issues, etc. This XML file serves only to describe the collection and is not to be used for logical-layout analysis.

    The issues in each collection can be found in the subfolders prefixed with "bpt". For instance, "bpt6k112325g" is the identifier used in Gallica for an issue of "Le Petit écho du 21e Régiment d'infanterie" published in September 1917. The information about each issue is given in three files, which are described below.

    1. bptXXXXXXXXXX.xml

    The original data, as collected from Gallica. The most important tags of this document and their values are:
    • oai: metadata about the document, such as its author, title, publisher, original publication date, number of issues, …
    • image_url: the URL of the document's scan (in high resolution)
    • pagination: a description of each page in the document (size of the page, whether it contains a table of contents, …)
    • num_pages: the total number of pages in the document
    • ocr: the OCR representation of the document in the XML ALTO format

    The XML ALTO format provides the text content and physical layout of documents in the following manner. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
    • Id: the tag's identifier
    • Height, Width: the text height and width
    • Vpos: the vertical position of the text on the page. The higher the value, the lower the text is on the page
    • Hpos: the horizontal position of the text on the page. The higher the value, the further to the right the text is on the page
    • Language: the language of the text (only for TextBlock tags).

    In addition to the attributes listed above, some TextBlock tags also have a Type attribute, which contains logical labels for the lines in the block. In this dataset it appears most often for tables or advertisements. Overall, TextBlock tags that have a Type attribute are rare in this dataset (only about 4 %).

    Note: The original scan of every document is accessible on the Gallica website, using the URL https://gallica.bnf.fr/ark:/12148/<IDENTIFIER>, where <IDENTIFIER> should be replaced by the id of the document (e.g. bpt6k112325g) or the collection (e.g. cb32836282t).

    2. truelabels_block.csv

    A CSV file where each line corresponds to a TextBlock tag from the file bptXXXXXXXXXX.xml. This CSV file contains the following columns:
    • page: the page on which the TextBlock tag is located
    • block_id: the id of the TextBlock tag
    • first_last_line: the text content of the first and last TextLine tags inside this TextBlock tag
    • classes: the logical label(s) associated with this TextBlock tag

    The possible values in the column classes are: Text, Title, Header and Other.

    3. truelabels_line.csv

    A CSV file where each line corresponds to a TextLine tag from the file bptXXXXXXXXXX.xml. This CSV file contains the following columns:
    • page: the page on which the TextLine tag is located
    • block_id: the id of the TextBlock tag that contains this TextLine tag
    • line_id: the id of this TextLine tag
    • text_line: the text content of this TextLine tag
    • classes: the logical label(s) associated with this TextLine tag

    The possible values in the column classes are: Text, Firstline, Title, Header and Other. Firstline indicates the first line of a paragraph.
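
    As an illustration of how the label files and the Gallica URL pattern might be used, here is a small Python sketch. It is not part of the dataset. The column names come from the description above, but the comma delimiter and the sample rows are assumptions made for the example, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter

GALLICA_ARK = "https://gallica.bnf.fr/ark:/12148/"


def gallica_url(identifier):
    """Build the Gallica URL of a document (bpt...) or a collection (cb...)."""
    return GALLICA_ARK + identifier


def label_counts(csv_file):
    """Count the values of the 'classes' column of a truelabels_line.csv file."""
    return Counter(row["classes"] for row in csv.DictReader(csv_file))


# Made-up rows with the documented columns; a comma delimiter is assumed.
SAMPLE = io.StringIO(
    "page,block_id,line_id,text_line,classes\n"
    "1,B1,L1,LE PETIT ECHO,Title\n"
    "1,B2,L2,Nous avons le plaisir,Firstline\n"
    "1,B2,L3,de vous annoncer,Text\n"
    "1,B2,L4,la parution du journal,Text\n"
)

counts = label_counts(SAMPLE)
url = gallica_url("bpt6k112325g")
print(counts)
print(url)
```

    With a real file, the StringIO object would be replaced by `open(path, newline="", encoding="utf-8")`; the encoding is likewise an assumption to check against the actual files.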

    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: Datacite
    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: ZENODO
    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: ZENODO
    Citations: 0
    Popularity: Average
    Influence: Average
    Impulse: Average
    Views: 144
    Downloads: 11
  • Authors: Schild, Erwan; Adler, Marie;

    Subset of 'MLSUM: The Multilingual Summarization Corpus' for a constraints annotation experiment.

    Description: MLSUM is a dataset of newspaper articles aimed at training summarization models. We use it for a constraints annotation experiment on newspaper titles according to their topic classification.

    Content: For a constraints annotation experiment based on data similarity, this dataset has been subsetted (75 articles randomly picked in each of the following 14 most used topics: 'economie', 'politique', 'sport', 'planete' (renamed 'ecologie'), 'sciences', 'police-justice', 'disparitions', 'emploi', 'sante', 'musiques', 'arts', 'educations', 'climat' (renamed 'meteo'), 'immobilier') and filtered (keeping only the articles whose topic is obvious from the title alone, without the body). Two reviewers worked on this task in order to limit the subjectivity of the filtering. This subset is used (1) to estimate the time needed to annotate title similarity with constraints (MUST-LINK, CANNOT-LINK) and (2) to test an interactive clustering methodology (constraints annotation and constrained clustering).

    Origin: The dataset is based on the original 'MLSUM: The Multilingual Summarization Corpus' dataset (https://doi.org/10.48550/arXiv.2004.14900).
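
    To make the MUST-LINK / CANNOT-LINK vocabulary concrete, the sketch below derives pairwise constraints from topic labels: two titles with the same topic get a MUST-LINK, two titles with different topics get a CANNOT-LINK. This is an assumption about how such constraints could be simulated from the topic classification, not the authors' annotation procedure. The titles are invented and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_constraints(labelled_titles):
    """Return (title_a, title_b, kind) for every unordered pair of titles.

    kind is "MUST-LINK" when both titles share a topic, else "CANNOT-LINK".
    """
    constraints = []
    for (title_a, topic_a), (title_b, topic_b) in combinations(labelled_titles, 2):
        kind = "MUST-LINK" if topic_a == topic_b else "CANNOT-LINK"
        constraints.append((title_a, title_b, kind))
    return constraints


# Invented titles, labelled with topics from the list above.
titles = [
    ("Le PSG s'impose à Marseille", "sport"),
    ("Une défaite amère pour les Bleus", "sport"),
    ("Les prix de l'immobilier reculent", "immobilier"),
]

constraints = pairwise_constraints(titles)
for title_a, title_b, kind in constraints:
    print(kind, "|", title_a, "|", title_b)
```

    The number of pairs grows quadratically with the number of titles, which is one reason the time needed for annotation is worth estimating before annotating a full subset.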

    Citations: 0
    Popularity: Average
    Influence: Average
    Impulse: Average
  • French training dataset for chatbots dealing with usual requests on bank cards.

    Description: This dataset contains examples of common customer requests relating to bank card management. It can be used as a training set for a small chatbot intended to process these usual requests.

    Content: The questions are asked in French. The dataset is divided into 10 intents of 100 questions each, for a total of 1 000 questions.

    Intents scope: Intents are constructed in such a way that all questions arising from the same intent have the same response or action. The scope covers: loss or theft of cards; swallowed cards; card orders; consultation of the bank balance; insurance provided by a card; card unlocking; virtual card management; management of bank overdrafts; management of payment limits; management of contactless mode.

    Origin: The scope of the intents is inspired by a chatbot currently in production, and the wording of the questions is inspired by usual customer requests.

    ZENODO
    Dataset . 2022
    License: CC BY
    Data sources: ZENODO
    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: ZENODO
    ZENODO
    Dataset . 2022
    License: CC BY
    Data sources: Datacite
    Citations: 1
    Popularity: Average
    Influence: Average
    Impulse: Average
    Views: 221
    Downloads: 96
  • Authors: Bros, Victor; Gatica-Perez, Daniel;

    The dataset contains 130 155 articles sourced from the websites of three Swiss francophone newspapers: Arc Info, La Cote, and Le Nouvelliste, spanning the time period from 01/01/2015 to 30/06/2022.

    Citations: 0
    Popularity: Average
    Influence: Average
    Impulse: Average
  • Authors: François, Thomas;

    This thesis offers a critical analysis of the methodological means needed to design a readability formula specific to French as a foreign language (FFL). A readability formula is a tool that automatically assesses the reading complexity of a text, using a set of linguistic features such as syntactic complexity, lexical load, etc. After summarizing the current views on the way an FFL learner understands a text, this study offers a review, as complete as possible, of readability studies on French, but also on English. This state of the art is followed by an inventory of the linguistic features that may cause reading difficulties. These are mainly drawn from the literature, but several of them are new, inspired by recent work in the cognitive psychology of reading. In a second step, based on these methodological considerations and on a corpus of texts taken from FFL textbooks, the ability of these linguistic features to adequately predict the difficulty of texts for FFL is evaluated using 406 variables. This work ends with the development of a new readability formula for FFL, which relies on natural language processing technologies. (LING 3) -- UCL, 2011

    Dépôt Institutionel ...
    Citations: 0
    Popularity: Average
    Influence: Average
    Impulse: Average
  • Authors: Bros, Victor; Gatica-Perez, Daniel;

    Description: The dataset contains 130 155 articles sourced from the websites of three Swiss francophone newspapers: Arc Info, La Cote, and Le Nouvelliste, spanning the time period from 01/01/2015 to 30/06/2022. The dataset, compiled from the temporary data feeds provided by the press agency, consists of the articles in their entirety, including the title, headline, and content, along with metadata for each article. The collected articles are primarily in French and are categorized by topic and region, all encoded in the JSON format.

    Reference: If you use this dataset, please cite the following publication: Victor Bros and Daniel Gatica-Perez, The Suisse Romande Local News Dataset, Idiap Technical Report, 2023.
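
    Since the articles are JSON records categorized by topic and region, a first look at the data could be a simple count per category. The sketch below shows the idea in Python. The description does not give the schema, so the field names "title", "topic" and "region" are assumptions, and the three records are made up for the example.

```python
import json
from collections import Counter

# Made-up records; the field names are assumptions, not the dataset's schema.
SAMPLE = """[
  {"title": "Match nul à Sion", "topic": "sport", "region": "Valais"},
  {"title": "Nouvelle ligne de bus", "topic": "mobilité", "region": "Neuchâtel"},
  {"title": "Victoire à domicile", "topic": "sport", "region": "Valais"}
]"""


def count_by(articles, *keys):
    """Count articles per combination of the given metadata keys."""
    return Counter(tuple(article.get(key) for key in keys) for article in articles)


articles = json.loads(SAMPLE)
counts = count_by(articles, "topic", "region")
for (topic, region), n in counts.most_common():
    print(topic, region, n)
```

    Using `article.get(key)` rather than indexing keeps the count going when a record lacks one of the fields; such records are grouped under `None`.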

    ZENODO
    Dataset . 2023
    Data sources: ZENODO
  • Authors: Zweigenbaum, Pierre;

    Speech is linear: it unfolds in time. In reality this linearity is broken at many levels, either by contingency (writing constraints) or intrinsically (linguistic units). It is, moreover, a form of transmission: a speaker producing an utterance conveys a thought and linguistic material that are a priori non-linear, which the interlocutor must reconstruct from this intermediate linear form when analysing the utterance. Computational linguistics, or natural language processing, aims to model linguistic phenomena computationally and to automate the processing of language utterances by computers; spelling correction, machine translation and information extraction are examples. It must therefore design algorithms that automatically solve many problems of going from broken lines to continuous lines and, conversely, of segmenting continuous lines into broken lines (continuous vs. broken line), or of reconstructing the non-linear structures underlying language (continuous line vs. non-line). We will see, on the one hand, how computing sets up multiple levels of representation of a text, some of which present it as a continuous line while others make it a broken line. We will present, on the other hand, how natural language processing cuts the line of a text into segments according to the linguistic units that compose it and, beyond these segments, seeks to recover the tree or graph of relations between these linguistic units.

  • Authors: Lemaire, Nathalie; François, Thomas; Debongnie, Jean-Claude; De Meyere, Damien; +5 Authors

    We describe the three-way collaboration between terminology, NLP and health sciences developed within the iMediate project (Interoperability of Medical Data through Information Extraction and Term Encoding). This research aims to develop technologies that automatically produce a structured representation of patient data from unstructured hospital texts and structured clinical data. The terminology partner works upstream of the structuring of medical documents that have little or no structure. Its role is to enrich, by identifying the variations observable in usage, the terminological resources that will be used by the three applications (annotator, categoriser and search engine) developed downstream in the project. Because medical nomenclatures do not suffice to account for the variation found in free medical texts, terminological enrichment consists in recording usage variations in a corpus and linking them, through synonym grouping, to the entries of the official nomenclatures. Since the size of the iMediate corpus borders on big data, the classical methodology of semi-automatic corpus exploitation had to be adapted. Rather than processing terminological units sequentially, the partners devised a methodology for extracting and validating candidate terms through cycles of continuous improvement of the manual and automatic processes.

  • Authors: Camps, Jean-Baptiste; Gabay, Simon; Clérice, Thibault; Cafiero, Florian;

    Pie Model for Classical French, for lemmatisation. Trained on a corpus of Classical French Theatre and Frantext Open Access Data.

    More information:
    - corpus: Camps, Jean-Baptiste, & Cafiero, Florian. (2019). Stylometric Analysis of Classical French Theatre [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3353421.
    - F. Cafiero and J.B. Camps, Why Molière most likely did write his plays, Science Advances, 27 Nov 2019: Vol. 5, no. 11, eaax5489, DOI: 10.1126/sciadv.aax5489, https://advances.sciencemag.org/content/5/11/eaax5489/.
    - J.B. Camps, S. Gabay, Th. Clérice and F. Cafiero, Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre, to be published.

    Current results on test data (evaluation report for task: lemma):

    subset             accuracy  precision  recall  support
    all                0.9909    0.9427     0.9414  4181
    ambiguous-tokens   0.9802    0.9307     0.9301  857
    unknown-targets    0.5714    0.4        0.4     14
    unknown-tokens     0.7188    0.5443     0.5443  64

    ZENODO
    Article . 2020
    License: CC BY
    Data sources: ZENODO
    ZENODO
    Article . 2020
    License: CC BY
    Data sources: Datacite
  • Authors: Gutehrlé Nicolas; Atanassova Iana;

    Dataset for Logical-layout analysis on French Historical Newspapers

    This is a dataset for training and testing logical-layout analysis and recognition systems on French historical documents. The original data is part of the "Fond régional: Franche Comté", which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).

    Description

    This dataset is divided into a train set and a test set. Both have been designed to cover as many as possible of the layouts that exist in the "Fond régional: Franche Comté" dataset. To do so, we have divided them into three layout types:
    • 1c: documents where the text is displayed in one column, as in books;
    • 2c: documents where the text is displayed in two columns;
    • 3c+: documents where there are at least 3 columns of text, as in newspapers.

    Each of these folders contains subfolders starting with the letters ‘cb’. These are the identifiers of newspaper collections such as « Le Petit Semeur ». Each of these subfolders contains an XML file describing the collection, which is not relevant to logical-layout analysis. They also contain subfolders starting with the letters ‘bpt’, which contain the following files:
    • XXX.xml: the original XML file as gathered from Gallica.
    • truelabels_block: a CSV file giving the true label of each TextBlock tag. Each line contains the page, the block_id, the first and last line of text of the block, and its label.
    • truelabels_line: a CSV file giving the true label of each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label.
    • XXX_docbook.xml: the document after processing by a Logical Layout recognition system.

    The original XML gathers several kinds of information about the document, in particular metadata (described using the DublinCore schema), the page numbering, and the OCR output, which is described in the XML ALTO format. As such, the files already provide the physical layout analysis and the reading order of the documents.

    The XML ALTO format provides the text content and physical layout of documents in the following manner. The OCR output for the whole document is available in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
    • Id: the tag's identifier
    • Height, Width: the text height and width
    • Vpos: the vertical position of the text on the page. The higher the value, the lower the text is on the page
    • Hpos: the horizontal position of the text on the page. The higher the value, the further to the right the text is on the page
    • Language: the language of the text (only for TextBlock tags).

    The blocks of text are labelled either as Text, Title, Header or Other. The lines of text are labelled either as Text, Firstline (to indicate the first line of a paragraph), Title, Header or Other. These labels are used in the truelabels_line.csv, truelabels_block.csv and XXX_docbook.xml files.

    You can access the original scan of every document on the Gallica website. To do so, use the following URL, replacing the <IDENTIFIER> part with the id of the document (e.g. bpt6k76208717): https://gallica.bnf.fr/ark:/12148/<IDENTIFIER>
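The TextBlock / TextLine / String nesting described above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the XML fragment is hand-written for illustration and is not taken from the dataset, and the attribute spellings follow the ALTO schema (upper-case ID, HPOS, VPOS, WIDTH, HEIGHT, with the word text in a CONTENT attribute on String) rather than the capitalisation used in the description. The helper names `local_name` and `iter_text_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; it is NOT taken from the dataset.
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="PAG_1">
      <PrintSpace>
        <TextBlock ID="TB_1" HPOS="100" VPOS="50" WIDTH="400" HEIGHT="70">
          <TextLine ID="TL_1" HPOS="100" VPOS="50" WIDTH="400" HEIGHT="30">
            <String CONTENT="Le"/><SP/><String CONTENT="Petit"/><SP/><String CONTENT="Semeur"/>
          </TextLine>
          <TextLine ID="TL_2" HPOS="100" VPOS="90" WIDTH="380" HEIGHT="30">
            <String CONTENT="Journal"/><SP/><String CONTENT="local"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def iter_text_lines(root):
    """Yield one dict per TextLine, in document order."""
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        for line in block:
            if local_name(line.tag) != "TextLine":
                continue
            # Simplification: words are joined with one space instead of
            # inspecting each SP tag.
            words = [el.get("CONTENT", "") for el in line
                     if local_name(el.tag) == "String"]
            yield {
                "block_id": block.get("ID"),
                "line_id": line.get("ID"),
                "vpos": float(line.get("VPOS", "0")),
                "hpos": float(line.get("HPOS", "0")),
                "text": " ".join(words),
            }


text_lines = list(iter_text_lines(ET.fromstring(SAMPLE_ALTO)))
for entry in text_lines:
    print(entry["block_id"], entry["line_id"], entry["vpos"], entry["text"])
```

Matching on the local tag name is a deliberate choice here: real ALTO files usually declare an XML namespace, and which one the Gallica exports use is not stated in the description.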

    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: ZENODO
  • Authors: Gutehrlé Nicolas; Atanassova Iana;

    Dataset for Logical-layout analysis on French historical newspapers

    This dataset is intended for training and testing Logical Layout Analysis and recognition systems on French historical documents published between 1900 and 1950. The original data is part of the "Fond régional: Franche-Comté", which is curated by Gallica, the digital portal of the Bibliothèque Nationale de France (BnF).

    This dataset has the following structure:

    ├── train
        ├── 1c
            ├── cb32836282t
                ├── cb32836282t.xml
                ├── bpt6k112325g
                    ├── bpt6k112325g.xml
                    ├── truelabels_block.csv
                    ├── truelabels_line.csv
                ├── …
            ├── …
        ├── 2c
        ├── 3c+
    └── test
        ├── 1c
        ├── 2c
        └── 3c+

    The dataset is divided into a train set and a test set. Both have been designed to cover as many as possible of the layouts that exist in the "Fond régional: Franche-Comté" dataset. To do so, we have divided them into three layout types:
    • 1c: documents where the text is displayed in one column, as in books;
    • 2c: documents where the text is displayed in two columns;
    • 3c+: documents where there are at least 3 columns of text, as in newspapers.

    Each of the 1c, 2c, and 3c+ folders contains subfolders prefixed by ‘cb’, each of which holds a collection of documents. For instance, « cb32836282t » is the identifier used in Gallica for « Le Petit écho du 21e Régiment d'infanterie », a French military periodical published during WWI. An XML file with the same name, for instance « cb32836282t.xml », contains metadata about the collection, such as its title, publisher, creator, number of issues, etc. This XML file serves only to describe the collection, and is not to be used for Logical-Layout analysis. The issues in each collection can be found in the subfolders prefixed with « bpt ». For instance, « bpt6k112325g » is the identifier used in Gallica for an issue of « Le Petit écho du 21e Régiment d'infanterie » published in September 1917.
    The information about each issue is given in three files, which are described below:

    1-bptXXXXXXXXXX.xml
    The original data, as collected from Gallica. The most important tags of this document and their values are described below:
    • oai: metadata about the document, such as its author, title, publisher, original publication date, number of issues, …
    • image_url: the URL of the document’s scan (in high resolution)
    • pagination: a description of each page in the document (size of the page, whether it contains a table of contents or not, …)
    • num_pages: the total number of pages in the document
    • ocr: the OCR representation of the document in the XML ALTO format

    The XML ALTO format provides the text content and physical layout of documents in the following manner. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
    • Id: the tag’s identifier
    • Height, Width: the text height and width
    • Vpos: the vertical position of the text on the page. The higher the value, the lower the text is on the page
    • Hpos: the horizontal position of the text on the page. The higher the value, the further to the right the text is on the page
    • Language: the language of the text (only for TextBlock tags).

    In addition to the attributes listed above, some TextBlock tags also have a Type attribute. This attribute contains logical labels of the lines in the block. In this dataset it appears most often for tables or advertisements. Overall, TextBlock tags that have a Type attribute are rare in this dataset (only about 4%).
    Note: The original scan of every document is accessible on the Gallica website, using the URL https://gallica.bnf.fr/ark:/12148/<IDENTIFIER>, where <IDENTIFIER> should be replaced by the id of the document (e.g.: bpt6k112325g) or the collection (e.g.: cb32836282t).

    2-truelabels_block.csv
    A CSV file where each line corresponds to a TextBlock tag from the file bptXXXXXXXXXX.xml. This CSV file contains the following columns:
    • page: the page on which the TextBlock tag is located
    • block_id: the id of the TextBlock tag
    • first_last_line: the text content of the first and last TextLine tags inside this TextBlock tag
    • classes: the logical label(s) associated with this TextBlock tag
    The possible values in the column classes are: Text, Title, Header and Other.

    3-truelabels_line.csv
    A CSV file where each line corresponds to a TextLine tag from the file bptXXXXXXXXXX.xml. This CSV file contains the following columns:
    • page: the page where the TextLine tag is located
    • block_id: the id of the TextBlock tag that contains this TextLine tag
    • line_id: the id of this TextLine tag
    • text_line: the text content of this TextLine tag
    • classes: the logical label(s) associated with this TextLine tag
    The possible values in the column classes are: Text, Firstline, Title, Header and Other. Firstline indicates the « first line » of a paragraph.
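To show how the folder layout and the truelabels_line.csv columns described above fit together, here is a minimal sketch that walks a dataset root and counts the line labels. It rests on several assumptions not confirmed by the description: the CSV files are comma-separated, UTF-8 encoded, start with a header row using the column names listed above, and hold a single label per `classes` cell. The functions `count_line_labels` and `gallica_url` are hypothetical helpers, and the tiny tree built in the temporary directory contains made-up rows, not real data.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def count_line_labels(dataset_root):
    """Count the `classes` values over every truelabels_line.csv under a root."""
    counts = Counter()
    # Layout: <split>/<layout type>/<cb... collection>/<bpt... issue>/truelabels_line.csv
    pattern = "*/*/cb*/bpt*/truelabels_line.csv"
    for csv_path in sorted(Path(dataset_root).glob(pattern)):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                counts[row["classes"]] += 1
    return counts


def gallica_url(identifier):
    """Build the scan URL from a document or collection identifier."""
    return f"https://gallica.bnf.fr/ark:/12148/{identifier}"


# Build a tiny stand-in tree with made-up rows so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    issue_dir = Path(tmp) / "train" / "1c" / "cb32836282t" / "bpt6k112325g"
    issue_dir.mkdir(parents=True)
    csv_file = issue_dir / "truelabels_line.csv"
    with csv_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["page", "block_id", "line_id", "text_line", "classes"])
        writer.writerow(["1", "TB_1", "TL_1", "made-up title", "Title"])
        writer.writerow(["1", "TB_2", "TL_2", "made-up first line", "Firstline"])
        writer.writerow(["1", "TB_2", "TL_3", "made-up text", "Text"])
        writer.writerow(["1", "TB_2", "TL_4", "more made-up text", "Text"])
    label_counts = count_line_labels(tmp)

print(label_counts)
print(gallica_url("bpt6k112325g"))
```

On the real dataset, `count_line_labels` would be pointed at the extracted archive root instead of the temporary directory; if a `classes` cell can hold several labels, the cell would need to be split before counting.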

    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: Datacite
    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: ZENODO
  • Authors: Schild, Erwan; Adler, Marie;

    Subset of 'MLSUM: The Multilingual Summarization Corpus' for a constraints annotation experiment.

    Description: MLSUM is a dataset of newspaper articles intended for training summarization models. We use it for a constraints annotation experiment on newspaper titles according to their topic classification.

    Content: For a constraints annotation experiment based on data similarity, this dataset was subsampled (75 articles picked at random in the 14 most used topics: 'economie', 'politique', 'sport', 'planete' (renamed to 'ecologie'), 'sciences', 'police-justice', 'disparitions', 'emploi', 'sante', 'musiques', 'arts', 'educations', 'climat' (renamed to 'meteo'), 'immobilier') and filtered (keeping only articles whose topic is obvious from the title alone, without the body). Two reviewers worked on this task in order to limit the subjectivity of the filtering. This subset is used (1) to estimate the time needed to annotate title similarity with constraints (MUST-LINK, CANNOT-LINK) and (2) to test an interactive clustering methodology (constraints annotation and constrained clustering).

    Origin: The dataset is based on the original 'MLSUM: The Multilingual Summarization Corpus' dataset (https://doi.org/10.48550/arXiv.2004.14900).
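The random selection step described above could look roughly like the following sketch. It is not the authors' code. The description does not say whether 75 is a per-topic or an overall figure; this sketch assumes 75 articles per topic. The record layout (dicts with `title` and `topic` keys), the function name `sample_per_topic` and the toy articles are all hypothetical, and the manual filtering by two reviewers is not modelled.

```python
import random

# Topic renames mentioned in the description.
RENAMES = {"planete": "ecologie", "climat": "meteo"}


def sample_per_topic(articles, topics, per_topic=75, seed=0):
    """Randomly pick up to `per_topic` articles for each topic, then rename topics."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    subset = []
    for topic in topics:
        pool = [a for a in articles if a["topic"] == topic]
        picked = rng.sample(pool, min(per_topic, len(pool)))
        for article in picked:
            subset.append({**article, "topic": RENAMES.get(topic, topic)})
    return subset


# Toy corpus with made-up titles: 100 articles for each of three topics.
toy_topics = ["sport", "planete", "climat"]
toy_articles = [
    {"title": f"made-up {topic} title {i}", "topic": topic}
    for topic in toy_topics
    for i in range(100)
]

subset = sample_per_topic(toy_articles, toy_topics)
print(len(subset), sorted({a["topic"] for a in subset}))
```

Using a dedicated `random.Random` instance keeps the draw independent of any other code that touches the global random state.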

  • image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/

    French training dataset for chatbots dealing with usual requests on bank cards. Description: This dataset contains examples of common customer requests relating to bank card management. It can be used as a training set for a small chatbot intended to process these usual requests. Content: The questions are asked in French. The dataset is divided into 10 intents of 100 questions each, for a total of 1,000 questions. Intents scope: Intents are constructed in such a way that all questions arising from the same intent have the same response or action. The scope covers: loss or theft of cards; swallowed cards; card orders; consultation of the bank balance; insurance provided by a card; card unlocking; virtual card management; management of bank overdrafts; management of payment limits; management of contactless mode. Origin: The intents scope is inspired by a chatbot currently in production, and the wording of the questions is inspired by usual customer requests.

    ZENODO
    Dataset . 2022
    License: CC BY
    Data sources: ZENODO
    ZENODO
    Dataset . 2021
    License: CC BY
    Data sources: ZENODO
    ZENODO
    Dataset . 2022
    License: CC BY
    Data sources: Datacite
    Citations: 1
    Popularity: Average
    Influence: Average
    Impulse: Average
    Powered by BIP!
    Views: 221
    Downloads: 96
    Powered by Usage counts
  • Authors: Bros, Victor; Gatica-Perez, Daniel;

    The dataset contains 130 155 articles sourced from the websites of three Swiss francophone newspapers: Arc Info, La Cote, and Le Nouvelliste, spanning the time period from 01/01/2015 to 30/06/2022.
