publication . Article . 2018

TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees

Sauvage, Thomas; Plouviez, Sophie; Schmidt, William E.; Fredericq, Suzanne
Open Access English
  • Published: 01 Mar 2018 Journal: BMC Research Notes, volume 11, issue 1, pages 1-6 (ISSN: 1756-0500)
  • Publisher: BMC
Abstract
Objective: The body of DNA sequence data lacking taxonomically informative sequence headers is rapidly growing in user and public databases (e.g. sequences lacking identification and contaminants). In the context of systematics studies, sorting such sequence data for taxonomic curation and/or molecular diversity characterization (e.g. crypticism) often requires the building of exploratory phylogenetic trees with reference taxa. The subsequent step of segregating DNA sequences of interest based on observed topological relationships can represent a challenging task, especially for large datasets.

Results: We have written TREE2FASTA, a Perl script that enables and expedites the sorting of FASTA-formatted sequence data from exploratory phylogenetic trees. TREE2FASTA takes advantage of the interactive, rapid point-and-click color selection and/or annotations of tree leaves in the popular Java tree-viewer FigTree to segregate groups of FASTA sequences of interest to separate files. TREE2FASTA allows for both simple and nested segregation designs to facilitate the simultaneous preparation of multiple data sets that may overlap in sequence content.

Electronic supplementary material: The online version of this article (10.1186/s13104-018-3268-y) contains supplementary material, which is available to authorized users.
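The color-based segregation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation and does not reproduce TREE2FASTA's options or its nested designs. It assumes that leaf colors are stored in the taxlabels block of the saved NEXUS tree as annotations of the form name[&!color=#rrggbb], and that leaf names contain no spaces and match the first word of each FASTA header. The function names leaf_colors, read_fasta and split_by_color are hypothetical.

```python
import re
from collections import defaultdict

# Assumed annotation style for a colored leaf: name[&!color=#rrggbb]
# (optionally single-quoted; names containing spaces are not handled).
LEAF_COLOR = re.compile(r"^\s*'?([^'\[\s]+)'?\[&[^\]]*!color=(#[0-9a-fA-F]{6})")


def leaf_colors(nexus_text):
    """Map each colored leaf name to its color, reading only the taxlabels block."""
    colors = {}
    in_labels = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("taxlabels"):
            in_labels = True
            continue
        if in_labels and stripped.startswith(";"):
            break  # end of the taxlabels block
        if in_labels:
            match = LEAF_COLOR.match(line)
            if match:
                colors[match.group(1)] = match.group(2).lower()
    return colors


def read_fasta(fasta_text):
    """Return {name: sequence}, where name is the first word of each header."""
    parts = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def split_by_color(nexus_text, fasta_text):
    """Group FASTA records by the color assigned to the matching tree leaf."""
    colors = leaf_colors(nexus_text)
    sequences = read_fasta(fasta_text)
    groups = defaultdict(dict)
    for name, color in colors.items():
        if name in sequences:  # leaves without a FASTA record are skipped
            groups[color][name] = sequences[name]
    return dict(groups)
```

Each entry of the returned dictionary corresponds to one color group; writing every group to its own FASTA file would mirror the "separate files" behavior described above. Uncolored leaves are left out of all groups in this sketch.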
Fields of Science and Technology classification (FOS)
03 medical and health sciences, 0301 basic medicine, 030104 developmental biology
Sustainable Development Goals (SDG)
16. Peace & justice
Subjects
free text keywords: Barcoding, Biodiversity, Clone, Contaminant, Cryptic, Environmental, Research Note, FigTree, Forensic, Metabarcoding, OTU, Phylogeny, Systematics, General Biochemistry, Genetics and Molecular Biology, General Medicine, lcsh:Medicine, lcsh:R, lcsh:Biology (General), lcsh:QH301-705.5, lcsh:Science (General), lcsh:Q1-390, Natural language processing, computer.software_genre, computer, Phylogenetics, Sequence (medicine), Java, computer.programming_language, Context (language use), Perl, Phylogenetic tree, Artificial intelligence, business.industry, business, Tree (data structure), Computer science, Sorting
Communities
  • Digital Humanities and Cultural Heritage
  • Science and Innovation Policy Studies
Funded by
NSF| Collaborative Research: ARTS: Integrative Research and Training in Tropical Taxonomy
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 1455569
  • Funding stream: Directorate for Biological Sciences | Division of Environmental Biology