Advanced search in Research products
The following results are related to Digital Humanities and Cultural Heritage.
1,275 Research products, page 1 of 128

  • Digital Humanities and Cultural Heritage
  • Research data
  • Research software
  • Open Access
  • Dataset
  • ZENODO
  • SEDICI (UNLP) - Universidad Nacional de La Plata

  • Open Access English
    Authors: 
    Dhrangadhariya, Anjani; Müller, Henning;
    Publisher: Dryad

    This upload contains four main files.
    ds_cto_dict.zip: This zip file contains the four distant supervision dictionaries (P: participant.txt, I: intervention.txt, intervetion_syn.txt, O: outcome.txt) generated from clinicaltrials.gov using the methodology described in Distant-CTO (https://aclanthology.org/2022.bionlp-1.34/). These dictionaries were used to create distant supervision labelling functions as described in the Labelling sources subsection of the Methodology. The data was derived from https://clinicaltrials.gov/
    handcrafted_dictionaries.zip: This zip folder contains three files. 1) gender_sexuality.txt: a list of possible genders and sexual orientations found across the web. The list needs to be more comprehensive. 2) endpoints_dict.txt: outcome names and the names of questionnaires used to measure outcomes, assembled from PROM questionnaires and PROMs. 3) comparator_dict: a list of idiosyncratic comparator terms such as sham, saline, placebo, etc., compiled from the literature search. The list needs to be more comprehensive.
    test_ebm_correctedlabels.tsv: EBM-PICO is a widely used dataset with PICO annotations at two levels: span-level (coarse-grained) and entity-level (fine-grained). Span-level annotations encompass the full information about each class. Entity-level annotations cover the more fine-grained information, with PICO classes further divided into fine-grained subclasses. For example, the coarse-grained Participant span is further divided into participant age, gender, condition and sample size in the randomised controlled trial. This dataset comes pre-divided into a training set (n=4,933) annotated through crowd-sourcing and an expert-annotated gold test set (n=191) for evaluation. The EBM-PICO annotation guidelines caution about variable annotation quality. Abaho et al. developed a framework to post-hoc correct EBM-PICO outcomes annotation inconsistencies. Lee et al. studied annotation span disagreements, suggesting variability across the annotators. Low annotation quality in the training dataset is excusable, but errors in the test set can lead to faulty evaluation of the downstream ML methods. We evaluate 1% of the EBM-PICO training set tokens to gauge the possible reasons for the fine-grained labelling errors and use this exercise to conduct an error-focused PICO re-annotation of the EBM-PICO gold test set. The file 'test_ebm_correctedlabels.tsv' contains the error-corrected EBM-PICO gold test set. This dataset could be used as a complementary evaluation set alongside the EBM-PICO test set.
    error_analysis.zip: This .zip file contains three .tsv files for each PICO class to identify possible errors in about 1% (about 12,962 tokens) of the EBM-PICO training set.
    Objective: PICO (Participants, Interventions, Comparators, Outcomes) analysis is vital but time-consuming for conducting systematic reviews (SRs). Supervised machine learning can help fully automate it, but a lack of large annotated corpora limits the quality of automated PICO recognition systems. The largest currently available PICO corpus is manually annotated, which is an approach that is often too expensive for the scientific community to apply. Depending on the specific SR question, PICO criteria are extended to PICOC (C-Context), PICOT (T-timeframe), and PIBOSO (B-Background, S-Study design, O-Other), meaning the static hand-labelled corpora need to undergo costly re-annotation as per the downstream requirements. We aim to test the feasibility of designing a weak supervision system to extract these entities without hand-labelled data.
    Methodology: We decompose PICO spans into their constituent entities and re-purpose multiple medical and non-medical ontologies and expert-generated rules to obtain multiple noisy labels for these entities. The labels obtained from these sources are then aggregated using simple majority voting and generative modelling approaches. The resulting programmatic labels are used as weak signals to train a weakly-supervised discriminative model and observe performance changes. We explore mistakes in the currently available PICO corpus that could have led to inaccurate evaluation of several automation methods.
    Results: We present Weak-PICO, a weakly-supervised PICO entity recognition approach using medical and non-medical ontologies, dictionaries and expert-generated rules. Our approach does not use hand-labelled data.
    Conclusion: Weak supervision using Weak-PICO for PICO entity recognition has encouraging results, and the approach can potentially extend readily to more clinical entities.
    All the datasets can be opened using text editors or Google Sheets. The .zip files in the dataset can be opened using the archive utility on macOS and the unzip functionality in Linux. (All Windows and Apple operating systems support the use of ZIP files without additional third-party software.)
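The simple majority-voting aggregation mentioned in the Methodology can be illustrated with a minimal sketch. It is not the authors' implementation: the `ABSTAIN` value, the integer label encoding, and the tie-breaking rule are assumptions made only for this example.

```python
from collections import Counter
from typing import List

# Assumed convention: a labelling source emits ABSTAIN when it has no opinion on a token.
ABSTAIN = -1


def majority_vote(votes: List[int], abstain: int = ABSTAIN) -> int:
    """Return the most frequent non-abstain label for one token."""
    counted = Counter(v for v in votes if v != abstain)
    if not counted:
        # No source voted on this token, so the aggregate abstains too.
        return abstain
    best_count = max(counted.values())
    # Assumed tie-break: pick the smallest label id so the result is deterministic.
    return min(label for label, count in counted.items() if count == best_count)


def aggregate(label_matrix: List[List[int]]) -> List[int]:
    """Aggregate a tokens x sources matrix of noisy labels into one label per token."""
    return [majority_vote(row) for row in label_matrix]
```

Each row of the hypothetical `label_matrix` holds the votes of all labelling sources for one token. A generative label model, the second aggregation approach named in the description, would instead weight sources by their estimated accuracy.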

  • Open Access
    Authors: 
    Leonardo Santiago Benitez Pereira;
    Publisher: Zenodo

    Collection of 300 support tickets manually labeled for semantic similarity, obtained from an IT support company in the Florianópolis (Brazil) region. Each ticket is represented by an unstructured text field, which is typed by the user who opened the call. The labeling process was performed in 2022 by three IT support professionals. The corpus contains tickets in many languages, mainly English, German, Portuguese and Spanish. All Personally Identifiable Information (PII) and sensitive information were removed (substituted by a tag indicating the original content; for instance, the sentence "this text was written by Leonardo" is converted to "this text was written by [NAME]"). The removal was performed in three steps: first, the automated machine learning-based tool AWS Comprehend PII Removal was used; then, a sequence of custom regular expressions was applied; last, the entire corpus was manually verified.
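The second step of the removal procedure above, a sequence of regular expressions that substitute a tag for the original content, might look roughly like the sketch below. The patterns and the tag names `[EMAIL]`, `[IP]` and `[PHONE]` are hypothetical; the dataset description does not list the expressions actually used.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical patterns, applied in order; none of them is taken from the dataset.
PII_PATTERNS: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its tag, one pattern after another."""
    for pattern, tag in PII_PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

Order matters in such a sequence: here the IP pattern runs before the looser phone pattern so that dotted addresses are not tagged as phone numbers. Personal names, as in the `[NAME]` example, generally cannot be caught by regular expressions alone, which is consistent with the description's use of a machine-learning tool first and a manual check last.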

  • Open Access
    Authors: 
    Jan Moens; Koen De Groote;
    Publisher: Zenodo

    Appendix to 'Moens J. & De Groote K. 2022: Ieper - De Meersen. Deel 2. De studie van het leer' (Ypres - De Meersen. Part 2. The study of the leather), a research report of the agency Onroerend Erfgoed (Flanders Heritage Agency): complete inventory of the leather finds (as an .xlsx file)

  • Open Access English
    Authors: 
    Sarker, Abeed;
    Publisher: Zenodo

    This dataset accompanies the article titled: "Can accurate demographic information about people who use prescription medications non-medically be derived from Twitter?" submitted to PNAS. See the README.txt file for more details.

  • Open Access English
    Authors: 
    Plutniak, Sébastien;
    Publisher: Zenodo

    This table contains the persistent identifiers (PIDs) related to the digital versions of the articles of Dialektikê. Cahiers de typologie analytique, a journal of archaeology (ISSN 1147-114X and 1169-0046). These articles are referenced by multiple services:
    HAL – Hyper Articles en Ligne (https://hal.archives-ouvertes.fr).
    The Lithic Types blog (https://lithictypes.hypotheses.org).
    Isidore (https://isidore.science).
    Worldcat (https://www.worldcat.org).
    FRANTIQ (https://catalogue.frantiq.fr).
    Each row of the table corresponds to an article and presents the following data:
    authors: authors of the article.
    year: publication year of the article.
    doi: DOI of the digital version of the article stored in Zenodo.
    hdl-hypotheses: Isidore handle of the reference to the article on the Lithic Types blog.
    oclc-hypotheses: OCLC unique identifier of the reference to the article on the Lithic Types blog.
    hal: HAL unique identifier of the reference to the article.
    hdl-hal: Isidore handle of the reference to the article in HAL.
    ark-frantiq: ARK of the reference to the article in FRANTIQ.

  • Open Access English
    Authors: 
    Leonardo Santiago Benitez Pereira;
    Publisher: Zenodo

    Collection of 2229 support tickets manually classified into 7 categories, obtained from an IT support company in the Florianópolis (Brazil) region. Each ticket is represented by an unstructured text field, which is typed by the user who opened the call. The classification process was performed in 2020 by three IT support professionals. The corpus contains tickets in many languages, mainly English, German, Portuguese and Spanish. All Personally Identifiable Information (PII) and sensitive information were removed (substituted by a tag indicating the original content; for instance, the sentence "this text was written by Leonardo" is converted to "this text was written by [NAME]"). The removal was performed in three steps: first, the automated machine learning-based tool AWS Comprehend PII Removal was used; then, a sequence of custom regular expressions was applied; last, the entire corpus was manually verified.

  • Research data . 2022
    Open Access Italian
    Authors: 
    Shibingfeng, Zhang; Francesco, Fernicola; Federico, Garcea; Alberto, Barrón-Cedeño; Paolo, Bonora; Angelo, Pompilio;
    Publisher: Zenodo

    The corpus AriEmozione 2.0 contains a selection of operas composed between 1655 and 1765, with each verse annotated with an emotion. The annotation of AriEmozione 2.0 is conducted in a self-learning manner, leveraging the AriEmozione 1.0 corpus. Six emotion labels are used, namely:
    Amore (Love)
    Gioia (Joy)
    Ammirazione (Admiration)
    Rabbia (Anger)
    Tristezza (Sadness)
    Paura (Fear)
    This corpus contains about 89k verses. Each line in the tsv file is composed of:
    Verse ID: unique aria and verse ID. Each ID is composed of an aria ID and a verse ID. For example, ZAP1590034_00 means the first verse of aria ZAP1590034.
    Verse text: the text of the verse in the aria.
    Emotion: one of the six emotions.
    AriEmozione 2.0 is a subset of the materials collected by project CORAGO.
    How to cite:
    @article{zhang2022ariemozione,
      title={AriEmozione 2.0: Identifying Emotions in Opera Verses and Arias},
      author={Zhang, Shibingfeng and Fernicola, Francesco and Garcea, Federico and Bonora, Paolo and Barr{\'o}n-Cede\~no, Alberto},
      journal={Italian Journal of Computational Linguistics},
      volume={},
      issue_date={},
      year={in press}
    }
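The Verse ID convention described above (aria ID, underscore, zero-based verse number) can be unpacked with a short sketch. The column order, the absence of a header row, and the names `Verse` and `parse_rows` are assumptions for illustration; check the actual tsv file before relying on them.

```python
import csv
from typing import IO, Iterator, NamedTuple


class Verse(NamedTuple):
    aria_id: str      # e.g. "ZAP1590034"
    verse_index: int  # zero-based, so 0 is the first verse of the aria
    text: str
    emotion: str


def parse_rows(handle: IO[str]) -> Iterator[Verse]:
    """Yield one Verse per line, assuming three tab-separated columns and no header."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for verse_id, text, emotion in reader:
        # Split on the last underscore: everything before it is the aria ID.
        aria_id, _, index = verse_id.rpartition("_")
        yield Verse(aria_id, int(index), text, emotion)
```

Splitting on the last underscore, rather than the first, keeps the sketch working even if an aria ID were itself to contain an underscore.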

  • Research data . 2022
    Open Access Italian
    Authors: 
    Shibingfeng, Zhang; Francesco, Fernicola; Federico, Garcea; Alberto, Barrón-Cedeño; Paolo, Bonora; Angelo, Pompilio;
    Publisher: Zenodo

    The corpus AriEmozione 2.0 contains a selection of operas composed between 1655 and 1765, with each verse annotated with an emotion. The annotation of AriEmozione 2.0 is conducted in a self-learning manner, leveraging the AriEmozione 1.0 corpus. Six emotion labels are used, namely:
    Amore (Love)
    Gioia (Joy)
    Ammirazione (Admiration)
    Rabbia (Anger)
    Tristezza (Sadness)
    Paura (Fear)
    This corpus contains about 89k verses. Each line in the tsv file is composed of:
    Verse ID: unique aria and verse ID. Each ID is composed of an aria ID and a verse ID. For example, ZAP1590034_00 means the first verse of aria ZAP1590034.
    Verse text: the text of the verse in the aria.
    Emotion: one of the six emotions.
    AriEmozione 2.0 is a subset of the materials collected by project CORAGO.
    How to cite:
    @article{zhang2022ariemozione,
      title={AriEmozione 2.0: Identifying Emotions in Opera Verses and Arias},
      author={Zhang, Shibingfeng and Fernicola, Francesco and Garcea, Federico and Bonora, Paolo and Barr{\'o}n-Cede\~no, Alberto},
      journal={Italian Journal of Computational Linguistics},
      volume={},
      issue_date={},
      year={in press}
    }

  • Open Access German
    Authors: 
    Schumacher, Mareike; Flüh, Marie;
    Publisher: Zenodo

    This repository contains the datasets on which the article 'Made to be a woman. A case study on the categorization of gender using an individuation-based approach in the analysis of literary texts' is based.

  • Open Access
    Authors: 
    Ziku, Mariana; Bettina Fabos;
    Publisher: Zenodo

    Dataset containing 27 digital community heritage initiatives, defined as online initiatives where community members or users contribute to a community heritage-related common cause that promotes the interests of the community(ies) and/or the greater public. Access the report: Ziku, M., & Fabos, B. (2022). Digital Community Heritage and Open Access. CC Open Culture Working Group Digital Community Heritage. https://doi.org/10.21428/9eb74dbf.0c46e6be The dataset was collected in the framework of the Creative Commons Open Culture Working Group "Digital Community Heritage", funded by Creative Commons.
