3 Research products, page 1 of 1
- Research software | 2021 | Open Access | English
Authors: Giovanni Spitale; Federico Germani; Nikola Biller-Andorno
Publisher: Zenodo
The purpose of this tool is to perform NLP analysis on Telegram chats. Telegram chats can be exported as .json files from the official client, Telegram Desktop (v. 2.9.2.0). The files are parsed, and the content is used to populate a message dataframe, which is then anonymized (a minimal parsing sketch follows this description).

The software calculates and displays the following information:
- user count (number of users, new users per day, removed users per day);
- message count (number and relative frequency of messages, messages per day);
- autocoded messages (anonymized message dataframe with code weights assigned to each message based on a customizable set of regex rules);
- prevalence of codes (number and relative frequency);
- prevalence of lemmas (number and relative frequency);
- prevalence of lemmas segmented by autocode (number and relative frequency);
- mean sentiment per day;
- mean sentiment segmented by autocode.

The software outputs:
- messages_df_anon.csv - an anonymized file containing the progressive id of each message, the date, the univocal pseudonym of the sender, and the text;
- usercount_df.csv - user count dataframe;
- user_activity_df.csv - user activity dataframe;
- messagecount_df.csv - message count dataframe;
- messages_df_anon_coded.csv - an anonymized file containing the progressive id of each message, the date, the univocal pseudonym of the sender, the text, the codes, and the sentiment;
- autocode_freq_df.csv - general prevalence of codes;
- lemma_df.csv - lemma frequency;
- autocode_freq_df_[rule_name].csv - lemma frequency in coded messages, one file per rule;
- daily_sentiment_df.csv - daily sentiment;
- sentiment_by_code_df.csv - sentiment segmented by code;
- messages_anon.txt - anonymized text file generated from the message dataframe, for easy import into other software for text mining or qualitative analysis;
- messages_anon_MaxQDA.txt - anonymized text file generated from the message dataframe, formatted specifically for MaxQDA (to track speakers and codes).

Dependencies: pandas (1.2.1), json, random, os, re, tqdm (4.62.2), datetime (4.3), matplotlib (3.4.3), spaCy (3.1.2) + it_core_news_md, wordcloud (1.8.1), Counter, feel_it (1.0.3), torch (1.9.0), numpy (1.21.1), transformers (4.3.3).

This code is optimized for Italian; however:
- Lemma analysis is based on spaCy, which provides models for several other languages (https://spacy.io/models), so it can easily be adapted.
- Sentiment analysis is performed using FEEL-IT: Emotion and Sentiment Classification for the Italian Language (kudos to Federico Bianchi <f.bianchi@unibocconi.it>, Debora Nozza <debora.nozza@unibocconi.it>, and Dirk Hovy <dirk.hovy@unibocconi.it>). Their work is specific to Italian; to perform sentiment analysis in other languages, one could consider nltk.sentiment.

The code is structured in a JupyterLab notebook, heavily commented for future reference. The software comes with a toy dataset comprising Wikiquote passages copy-pasted into a chat created by the research group. Have fun exploring it.

Reference: Bianchi F, Nozza D, Hovy D. FEEL-IT: Emotion and Sentiment Classification for the Italian Language. In: Proceedings of the 11th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Association for Computational Linguistics; 2021. https://github.com/MilaNLProc/feel-it
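To make the parsing and anonymization step concrete, here is a minimal sketch in Python. It is not the authors' code: it assumes an export file named result.json whose "messages" list carries the "id", "date", "from" and "text" fields produced by recent versions of Telegram Desktop, and the pseudonym scheme ("user_001", ...) is invented purely for illustration.

import json
import random

import pandas as pd

# Load the Telegram Desktop export (file name is an assumption).
with open("result.json", encoding="utf-8") as f:
    export = json.load(f)

rows = []
for msg in export.get("messages", []):
    text = msg.get("text", "")
    # Formatted messages are stored as lists of strings and dicts;
    # flatten them to plain text.
    if isinstance(text, list):
        text = "".join(part if isinstance(part, str) else part.get("text", "")
                       for part in text)
    rows.append({"id": msg.get("id"),
                 "date": msg.get("date"),
                 "sender": msg.get("from"),
                 "text": text})

messages_df = pd.DataFrame(rows)

# Replace each sender with a univocal pseudonym, assigned in random
# order so the pseudonyms do not reveal who joined the chat first.
senders = list(messages_df["sender"].dropna().unique())
random.shuffle(senders)
pseudonyms = {s: f"user_{i:03d}" for i, s in enumerate(senders, start=1)}
messages_df["sender"] = messages_df["sender"].map(pseudonyms)

messages_df.to_csv("messages_df_anon.csv", index=False)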
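The autocoding and sentiment steps can be sketched along the same lines. The rule names and regex patterns below are made-up examples (the real rule set is user-defined), and the snippet assumes the SentimentClassifier.predict(list_of_strings) interface of feel_it 1.0.x, which returns "positive"/"negative" labels.

import re

import pandas as pd
from feel_it import SentimentClassifier

# Toy data standing in for the anonymized message dataframe.
messages_df = pd.DataFrame({
    "date": ["2021-09-01T10:00:00", "2021-09-02T11:30:00"],
    "sender": ["user_001", "user_002"],
    "text": ["Il vaccino funziona benissimo!", "Il lockdown è stato terribile."],
})

# Customizable regex rules: each code gets a weight equal to the
# number of matches found in the message (rules are hypothetical).
autocode_rules = {
    "vaccines": r"\bvaccin\w*",
    "lockdown": r"\block\s?down\w*",
}
for code, pattern in autocode_rules.items():
    messages_df[code] = messages_df["text"].apply(
        lambda t: len(re.findall(pattern, str(t), flags=re.IGNORECASE)))

# Sentiment per message, then mean sentiment per day.
clf = SentimentClassifier()
labels = clf.predict(messages_df["text"].fillna("").tolist())
messages_df["sentiment"] = [1 if lab == "positive" else -1 for lab in labels]
messages_df["day"] = pd.to_datetime(messages_df["date"]).dt.date
daily_sentiment_df = messages_df.groupby("day")["sentiment"].mean()
daily_sentiment_df.to_csv("daily_sentiment_df.csv")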
- Research software | 2020 | Open Access | English
Authors: Duchemin, Louis; Veber, Philippe; Boussau, Bastien
Publisher: Zenodo
Code base for the analysis presented in the manuscript "Bayesian investigation of SARS-CoV-2-related mortality in France": https://www.medrxiv.org/content/10.1101/2020.06.09.20126862v5
- Research software | 2020 | Open Access | English
Authors: Spitale, Giovanni
Publisher: Zenodo
Changelog v2.0.0 / what's new:
- RTF-to-TXT conversion and merging is now done in the notebook and no longer depends on external software;
- the parser has been rewritten due to changes in Factiva's output;
- the NLP pipeline has been rewritten to process data with different temporal depths;
- streamlined and optimized here and there :)

The COVID-19 pandemic generated (and keeps generating) a huge corpus of news articles, easily retrievable in Factiva with very targeted queries. The aim of this software is to provide the means to analyze this material rapidly. Data are retrieved from Factiva and downloaded by hand (...) in RTF; the RTF files are then converted to TXT.

Parser: takes as input files numerically ordered in a folder. This ordering is not fundamental (in case of multiple retrievals from Factiva), because the parser orders the articles by date using the date field contained in each article. Nevertheless, it is important to reduce duplicates (they increase the computational time needed to process the corpus), so before adding new articles to the folder, be sure to retrieve them from a timepoint that does not overlap with the articles already retrieved. In any case, in the last phase the dataframe is checked for duplicates, which are counted and removed; the duplicate articles are still processed by the parser, though, and this takes computational time. The parser removes search summaries, segments the text, and cleans it using regex rules. The resulting text is exported as a complete dataframe in a CSV file; a subset containing only title and text is exported as TXT, ready to be fed to the NLP pipeline. The parser is language agnostic: just change the path to the folder containing the documents to parse. (A sketch of the final de-duplication phase is given after this description.)

NLP pipeline: the pipeline imports the files generated by the parser (divided by month to put less load on the memory) and analyses them. It is not language agnostic: the correct linguistic settings must be specified in "setting up", "NLP" and "additional rules". First, some additional rules for NER are defined; some are general, some are language-specific, as specified in the relevant section. The files are opened and preprocessed, then lemma frequency and named-entity frequency are calculated for each month and for the whole corpus. All the dataframes are exported as CSV files for further analysis or data visualization. This code is optimized for English, German, French and Italian; nevertheless, being based on spaCy, which provides several other models (https://spacy.io/models), it could easily be adapted to other languages. (A frequency-counting sketch is given below as well.)

The whole software is structured in JupyterLab notebooks, heavily commented for future reference. This work is part of the PubliCo research project, supported by the Swiss National Science Foundation (SNF), project no. 31CA30_195905.
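As a rough illustration of the parser's final phase, the following sketch (column names and file layout are assumptions, not the actual notebook code) sorts the parsed articles by their date field, then counts and drops duplicates before exporting the full CSV and the title-and-text TXT subset.

import glob

import pandas as pd

# One row per parsed article; the parsed/*.csv layout is hypothetical.
articles_df = pd.concat(
    (pd.read_csv(path) for path in sorted(glob.glob("parsed/*.csv"))),
    ignore_index=True)

# Order articles by their own date field, regardless of file order.
articles_df["date"] = pd.to_datetime(articles_df["date"])
articles_df = articles_df.sort_values("date")

# Count and drop duplicates; they were still parsed, which is why
# retrieving non-overlapping time windows saves computation upstream.
n_before = len(articles_df)
articles_df = articles_df.drop_duplicates(subset=["title", "text"])
print(f"Removed {n_before - len(articles_df)} duplicate articles")

articles_df.to_csv("articles_df.csv", index=False)
articles_df[["title", "text"]].to_csv("articles_for_nlp.txt",
                                      sep="\t", index=False, header=False)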
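The monthly lemma and named-entity counts of the NLP pipeline can be sketched as follows. The model name (en_core_web_sm, standing in for the English/German/French/Italian models), the column names, and the output file names are assumptions; the actual notebooks also add language-specific NER rules before this step.

from collections import Counter

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")  # swap in the de/fr/it models as needed

articles_df = pd.read_csv("articles_df.csv", parse_dates=["date"])
articles_df["month"] = articles_df["date"].dt.to_period("M")

lemma_rows, ner_rows = [], []
for month, group in articles_df.groupby("month"):
    lemmas, entities = Counter(), Counter()
    for doc in nlp.pipe(group["text"].fillna("").tolist()):
        lemmas.update(tok.lemma_.lower() for tok in doc
                      if tok.is_alpha and not tok.is_stop)
        entities.update((ent.text, ent.label_) for ent in doc.ents)
    lemma_rows += [{"month": str(month), "lemma": lemma, "freq": freq}
                   for lemma, freq in lemmas.items()]
    ner_rows += [{"month": str(month), "entity": ent, "label": label, "freq": freq}
                 for (ent, label), freq in entities.items()]

pd.DataFrame(lemma_rows).to_csv("lemma_freq_by_month.csv", index=False)
pd.DataFrame(ner_rows).to_csv("ner_freq_by_month.csv", index=False)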