Phylogenetic sequence data sets across 61,108 eukaryotic genera
GenBank rel. 194 clusters now available (alignments and trees under construction)
This database provides a snapshot of the current taxonomic distribution of nucleotide sequences in GenBank.
Its purpose is to convey information about the potential phylogenetic data sets (clusters, or sets of homologous sequences) that can be constructed from the database for taxa of interest. It mirrors the NCBI
The number of clusters is estimated by all-against-all BLAST searches and sequence clustering algorithms (for all nodes with fewer than 35,000 sequences, and excluding sequences longer than 7,500 nt).
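The two size limits mentioned above can be sketched as a simple pre-filter. This is only an illustration, not the database's actual pipeline: the `Sequence` class, its field names, and the choice to count a node's sequences before applying the length filter are all assumptions.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MAX_SEQUENCES_PER_NODE = 35_000  # a node must have fewer sequences than this
MAX_SEQUENCE_LENGTH_NT = 7_500   # sequences longer than this are excluded


@dataclass
class Sequence:
    gi: int         # hypothetical field: sequence identifier
    length_nt: int  # hypothetical field: length in nucleotides


def sequences_to_cluster(node_sequences):
    """Return the sequences of one taxonomy node that would enter clustering.

    Assumption: the node-size limit is checked on the unfiltered count.
    An empty list means the node is skipped entirely.
    """
    if len(node_sequences) >= MAX_SEQUENCES_PER_NODE:
        return []
    return [s for s in node_sequences if s.length_nt <= MAX_SEQUENCE_LENGTH_NT]
```

The surviving sequences would then be passed to the all-against-all BLAST step, which is not shown here.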
Model organisms are defined as any node (not subtree) having more than 100 clusters or more than 10,000 sequences. By default, sequence tallies for model organisms propagate upward in the tree along with those for nonmodel organisms, but this information can be excluded so that users can get a sense of the taxonomic breadth of the sequence diversity in the database. Note, however, that the bulk of "genomic" data for model organisms is not entered in the database at all (see below for types of sequences included).
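The model-organism rule and the optional exclusion from upward tallies can be sketched as follows. The dictionary layout (`children`, `seq_counts`, `cluster_counts`) and the function names are hypothetical and chosen only for illustration; the database's own schema may differ.

```python
MODEL_MIN_CLUSTERS = 100      # "more than 100 clusters"
MODEL_MIN_SEQUENCES = 10_000  # "more than 10,000 sequences"


def is_model_organism(n_clusters, n_sequences):
    """Apply the per-node (not per-subtree) definition given in the text."""
    return n_clusters > MODEL_MIN_CLUSTERS or n_sequences > MODEL_MIN_SEQUENCES


def subtree_tally(node, children, seq_counts, cluster_counts, include_model=True):
    """Sum sequence counts over a node and all of its descendants.

    When include_model is False, the node's own sequences are dropped from
    the tally if the node qualifies as a model organism.
    """
    own = seq_counts.get(node, 0)
    if not include_model and is_model_organism(cluster_counts.get(node, 0), own):
        own = 0
    return own + sum(
        subtree_tally(child, children, seq_counts, cluster_counts, include_model)
        for child in children.get(node, [])
    )
```

Calling `subtree_tally` twice on the same node, once with `include_model=True` and once with `include_model=False`, shows how much of a subtree's tally comes from model organisms alone.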
Cluster tallies are linked to a view of the data availability matrix for that node in the taxonomy tree, which can provide useful guidance for supermatrix and supertree construction. Sequences for each cluster can be downloaded as an unaligned FASTA file for further analysis. Provisional alignments and phylogenetic trees are also provided.
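A data availability matrix of the kind described above can be sketched as a presence/absence table with one row per taxon and one column per cluster. The input layout (a mapping from cluster identifier to taxon identifiers) and the function name are assumptions made for this example.

```python
def availability_matrix(cluster_members):
    """Build a taxon-by-cluster presence/absence matrix.

    cluster_members: dict mapping a cluster id to an iterable of taxon ids
    (hypothetical layout). Returns (taxa, clusters, matrix), where
    matrix[i][j] is 1 if taxa[i] has at least one sequence in clusters[j].
    """
    member_sets = {c: set(members) for c, members in cluster_members.items()}
    clusters = sorted(member_sets)
    taxa = sorted({ti for members in member_sets.values() for ti in members})
    matrix = [
        [1 if ti in member_sets[c] else 0 for c in clusters]
        for ti in taxa
    ]
    return taxa, clusters, matrix
```

Rows with many 1s mark taxa that are well sampled across clusters, and columns with many 1s mark clusters with broad taxonomic coverage, which is the information a supermatrix or supertree design would draw on.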
To see a list of "biodiversity research hotspots" (families with the largest increase in species since the last release), click here.
For a list of model organisms, click here.
For more information on how the clustering was implemented, click here.
For downloads of this or previous releases of the entire database, or downloads of trees only (new!), click here.
Finally, for more information about the developers, how to cite, etc., click here.
Types of sequences included: Only "core" nucleotide data are included, which excludes ESTs, STSs, and other kinds of bulk or high-throughput sequences.
Taxonomic coverage: At present the database contains sequences from eukaryotes. These represent the PLN, MAM, PRI, ROD, VRT, and INV divisions of GenBank.
GenBank release: 194 (Feb. 15, 2013)
Number of sequences in this database: 7306016
Number of nodes in our subtree(s) of the NCBI taxonomy tree: 601559
Number of terminal nodes: 508456
Number of nodes clustered (usually terminal taxa): 423425
Number of subtrees clustered (always internal nodes): 90477
Number of nodes with sequences that can be clustered: 504921
Total number of clusters: 3505181
Number of phylogenetically informative clusters (TIs >= 4): 192763
Number of singleton clusters (GIs = 1): 2505852
Number of large clusters (GIs >= 100): 35576
Number of large clusters (TIs >= 100): 8194
Size of largest cluster (w.r.t. GIs): 27299
Size of largest cluster (w.r.t. TIs): 6048
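The cluster categories in the statistics above can be reproduced with a small tally. This sketch assumes that "GIs" counts distinct sequence identifiers and "TIs" counts distinct taxon identifiers within a cluster; the list-of-pairs input layout and the key names are hypothetical.

```python
def summarize_clusters(clusters):
    """Tally clusters by the size categories used in the statistics above.

    clusters: iterable of clusters, each a list of (gi, ti) pairs
    (hypothetical layout: sequence identifier, taxon identifier).
    """
    summary = {
        "total": 0,
        "informative_ti_ge_4": 0,
        "singleton_gi_eq_1": 0,
        "large_gi_ge_100": 0,
        "large_ti_ge_100": 0,
        "largest_gi": 0,
        "largest_ti": 0,
    }
    for members in clusters:
        n_gi = len({gi for gi, _ in members})  # distinct sequences
        n_ti = len({ti for _, ti in members})  # distinct taxa
        summary["total"] += 1
        summary["informative_ti_ge_4"] += int(n_ti >= 4)
        summary["singleton_gi_eq_1"] += int(n_gi == 1)
        summary["large_gi_ge_100"] += int(n_gi >= 100)
        summary["large_ti_ge_100"] += int(n_ti >= 100)
        summary["largest_gi"] = max(summary["largest_gi"], n_gi)
        summary["largest_ti"] = max(summary["largest_ti"], n_ti)
    return summary
```

Note that the categories overlap: a cluster with at least 100 taxa is also counted as phylogenetically informative, so the category counts are not expected to sum to the total.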
Supported by a grant from the US NSF Assembling the Tree of Life Program.
Questions or comments? Contact Mike Sanderson (email@example.com)