
New version of tree, APIs

We have just released a new version of the synthetic tree, along with new versions of the APIs.

Tree

The biggest change in this version is that we have completely replaced the synthesis method used to produce the tree. We are still using neo4j to serve the tree, but have moved synthesis out of the graph database and into a make-based pipeline that uses a C++ library. This new method improves efficiency and reproducibility, and allows us to more clearly connect input sources with edges in the tree. In addition to support statements, the new pipeline also produces conflict statements about the inputs that do not support a given edge (we are working to get these displayed on the properties panel for each node).

You can view the new version here and read the release notes. We want to particularly highlight the self-documenting nature of the new method. Primary credit for the new method goes to Mark Holder and Ben Redelings – paper coming soon!

APIs

The largest change is to the node IDs. We have previously mentioned issues with node stability; in this version of the tree & APIs, we use Open Tree Taxonomy IDs for taxon nodes and mrca statements for non-taxon nodes, rather than unstable neo4j node IDs. These new identifiers will transfer (or fail gracefully) across new versions of the tree. We have also made input and output parameters more stable across methods.

We also make public the verbose subtree format that we use to build the tree browser – rather than simply a newick string, you can obtain the tree with all provenance information, including support and conflict.
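As a rough illustration of the new identifiers (the exact v3 method name and argument spelling may differ from this sketch, so check the API docs), a taxon node can be addressed directly by its OTT ID, for example the Panthera tigris ID used elsewhere on this blog:

curl -X POST http://api.opentreeoflife.org/v3/tree_of_life/node_info \
-H "content-type:application/json" \
-d '{"ott_id": 633213}'

Non-taxon nodes are addressed the same way, passing a node_id in the mrca form described above instead of an ott_id. Because these identifiers are defined in terms of taxa rather than internal database IDs, the same request should either resolve to the corresponding node in a later version of the tree or fail with a clear error rather than silently pointing somewhere else.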

All v2 methods should continue to work, but we plan to deprecate the v2 methods in June 2016.

Take a look at the API docs and release notes for more information.


FuturePhy clade workshops

OpenTree, FuturePhy and Arbor jointly held the first round of clade workshops in Gainesville at the end of February. There were three taxon-focused groups taking part, studying barnacles, beetles, and catfish – each with a very diverse set of participants. Expertise in the room included taxonomy, systematics, ecology, phylogenetic methods, bioinformatics, genomes, ontologies, and scientific illustration (to name a few). While each group had different goals for progress in understanding the biology of their taxon of interest, every group required a unified tree merging taxonomic and phylogenetic information for their clade. Sounds like a job for OpenTree!

In advance of the workshop, we created tree collections (ranked lists of published trees in OpenTree) for each clade, and completed the beta version of our new synthesis algorithm. While there was only limited curation of new studies in the lead-up to the meeting, during the workshop participants imported more than 40 new published phylogenies into the OpenTree database and curated tree collections. Not only will this burst of skilled curation improve the accuracy of the synthetic tree in the future, but it also allowed us to use our new rapid synthesis method to produce on-the-fly custom synthesis trees for each clade collection during the workshop. By reviewing these synthetic trees and updating the input trees and rankings, participant groups were able to simultaneously achieve a better understanding of the relationships in their clade of interest and of the OpenTree synthesis procedure. These clade synthetic trees were an efficient and reproducible method for providing a unified view of taxon relationships, which could then be compared to publications and to expert-curated supertrees produced by grafting existing trees.

On the last day, we asked each group to list the top features they want from OpenTree. This list included:

  • Better conflict visualization – between published input trees, the synthetic tree of life, and the Open Tree Taxonomy
  • Ways to summarize / visualize the annotations file created along with the synthetic tree (this file includes information about sources that support & conflict with each edge).
  • Provide a method for proposing new taxa for tips in a phylogeny that cannot be mapped to OTT (and for collecting supporting information about these new taxa)
  • Include branch length and / or time information in the synthetic tree
  • More fine-grained control over synthesis (be able to mask part of an input tree, suppress poorly-supported branches)

Many thanks to all of the participants, and in particular to Nico Cellinese and Rob Guralnick for local logistics, and to the University of Florida Informatics Institute for hosting. It was an enjoyable and productive meeting for the OpenTree crew, and hopefully for all the attendees!


Publication of first draft of the tree of life

We are excited to publish the first draft of the Open Tree of Life in PNAS:

http://www.pnas.org/content/early/2015/09/16/1423041112.abstract

Scientists have used gene sequences and morphological data to construct tens of thousands of evolutionary trees that describe the evolutionary history of animals, plants, and microbes. This study is the first, to our knowledge, to apply an efficient and automated process for assembling published trees into a complete tree of life. This tree and the underlying data are available to browse and download from the Internet, facilitating subsequent analyses that require evolutionary trees. The tree can be easily updated with newly published data. Our analysis of coverage not only reveals gaps in sampling and naming biodiversity but also further demonstrates that most published phylogenies are not available in digital formats that can be summarized into a tree of life.

This is only a first draft, and there are plenty of places where the tree does not represent what we know about phylogenetic relationships. We can improve this tree through incorporation of new taxonomic and phylogenetic data. Our data store of trees (which contains many more trees than are included in the draft tree of life) is also a resource for other analyses. If you want to contribute a published tree for synthesis (or for analyses of coverage, conflict, etc), you can upload it through our curation interface.


Many thanks to all of the people that provided data, discussion, review, curation, and code and of course to NSF Biology for funding this work!


Proposal for OpenTree node stability

Currently, OpenTree has two different types of node IDs. Taxonomy (OTT) IDs are assigned to named nodes when we construct a taxonomy release, and phylogenetic node IDs are assigned by the treemachine neo4j graph database for nodes that do not align to an OTT ID (i.e. nodes added due to phylogenetic resolution). The OTT IDs are fairly stable over time, but the neo4j node IDs are definitely not stable, and the same neo4j ID may point to a completely unrelated node in future versions of the graph.

This system is problematic because we expose both types of IDs in the APIs (and also in URLs for the tree browser). The lack of neo4j node stability therefore affects API calls that use nodeIDs, browser bookmarks to nodes in the synthetic tree, and feedback left by users about specific nodes in the tree (see feedback issue #63 and treemachine issue #183). The OTT IDs are problematic as well: it is not straightforward to document when we reuse an existing OTT ID, mint a new ID, or delete an existing ID when going from one version of the taxonomy to the next.

At our recent face-to-face meeting, we discussed a proposal for a node identifier registry and are looking for feedback. We don’t intend this system to be a universally-used set of node definitions (i.e. we aren’t trying to make a PhyloCode registry). We want a lightweight system that prevents exposure of unstable nodeIDs through the APIs to clients (including our own web application) and provides some measure of predictability. Feedback on this proposal would be greatly appreciated.

Requirements

  • be able to use the same node ID definitions across OTT and the synthetic tree
  • transparency about when we re-use a nodeID from a previous version of the tree or taxonomy (or not)
  • users get an error when using a node ID from a previous version where there is no current node that fits that definition
  • fixing errors (such as moving a snail found in a worm taxon to its proper location) should not involve massive numbers of ID changes
  • generation of node definitions based on a given taxonomy must be automated and efficient
  • application of node definitions to an existing tree / taxonomy must be automated and efficient

Proposal

Develop a lightweight registry of node definitions based on the structure of the OpenTree taxonomy. For each new version of the taxonomy and synthetic tree, use the registry to decide when to re-use existing node IDs and when to register a new definition + ID.

Leaf nodes will be assigned IDs during creation of OTT based on name (together with enough taxonomic context to separate homonyms).

The definition of the ID for a non-leaf node will include a list of IDs for nodes that are descendants of the intended clade, a list of IDs for nodes that are excluded from being descendants, and (optionally) a taxonomic name.

Definitions would never be deleted from the registry, although not all definitions will be used in any given tree / taxonomy.
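To make the proposal concrete, here is a minimal sketch in Python of what a registered definition and the re-matching step could look like. It is purely illustrative: the registry layout, the placeholder IDs, and the tree helpers (internal_nodes, descendant_leaf_ids) are invented for this example and are not part of any OpenTree code.

# Illustrative registry entry: a non-leaf definition lists a few sampled
# descendants, a few 'near miss' exclusions, and an optional taxonomic name.
REGISTRY = {
    "node1001": {
        "name": "SomeClade",                           # optional
        "includes": ["ott1", "ott2", "ott3", "ott4"],  # placeholder leaf IDs
        "excludes": ["ott5", "ott6", "ott7"],          # placeholder 'near misses'
    },
}

def matching_nodes(definition, tree):
    """All nodes of `tree` whose descendant leaves include every 'includes'
    ID and none of the 'excludes' IDs. (tree.internal_nodes() and
    tree.descendant_leaf_ids() are hypothetical helpers.)"""
    hits = []
    for node in tree.internal_nodes():
        leaves = tree.descendant_leaf_ids(node)
        if all(i in leaves for i in definition["includes"]) and \
           not any(e in leaves for e in definition["excludes"]):
            hits.append(node)
    return hits

def assign_ids(tree):
    """Re-use a registered ID only when exactly one node fits its definition;
    zero or multiple hits would trigger added constraints or a newly
    registered definition (see the implementation questions below)."""
    assigned = {}
    for node_id, definition in REGISTRY.items():
        hits = matching_nodes(definition, tree)
        if len(hits) == 1:
            assigned[hits[0]] = node_id
    return assigned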

Implementation questions

  • How many descendant and excluded nodes to include in the definitions: The definition needs some specificity but also can’t assume a complete list due to future addition of new species. Perhaps, for example, four descendants and three exclusions would be a decent compromise between one and thousands?
  • How to choose the specific nodes in the lists of descendants and exclusions: Chosen nodes should be ‘popular’ (occurring in as many sources as possible) and informative (if T has children T1 and T2, then at least one definition descendant should be taken from T1 and at least one from T2). Excluded nodes should be ‘near misses’ rather than arbitrarily chosen.
  • What to do when >1 node meets the definition: Add an option of adding constraints to the registered definition in order to remove the ambiguity while preserving the ID.
  • What to do when >1 definition matches a node: Ambiguous assignments can be resolved either by the addition of constraints, or by the creation of new ids.
  • Modification / versioning of definitions: If we add constraints to a definition (for example, to resolve ambiguity), does this mint a new ID or version the existing definition?

 


Workshop: Barriers to assembling phylogeny and data layers across the tree of life

The challenges to completing the Tree of Life and integrating data layers (NSF GoLife goals) are huge and vary across clades. Some groups have a nearly-complete tree but lack publicly available data layers, whereas other groups lack phylogenetic resolution or the resources to support tree / data integration. Partnering with Open Tree of Life and Arbor Workflows, FuturePhy will support a series of clade-based workshops to identify and solve specific challenges in tree of life synthesis and data layer integration.

RFP: 2 page proposals to fund small workshops and/or hackathons on completing the tree of life and integrating data layers for specific clades.
Proposal deadline: Nov. 1, 2015
Meeting dates: Feb 26-28, 2016 (note: changed from the originally announced Feb 20-23)
Location: Gainesville, University of Florida
Participants per workshop: 10 maximum funded (virtual attendees possible)
Contacts: mwestneat@uchicago.edu (FuturePhy), karen.cranston@gmail.com (OpenTree), lukejharmon@gmail.com (Arbor)

The full call for participation and a link to a proposal template are available at the FuturePhy website.

Have questions about this or future workshops? Attend our webinar Thursday, September 17 at 1 pm EDT. See details on how to connect.


The Open Tree of Life’s education and outreach site


A little-known side element of the Open Tree of Life project is the “Edu Tree of Life,” an interactive educational experience to engage the public. The site is nearing completion, and our goal has been to educate young students as well as the general public on topics surrounding evolution and phylogenetic trees. Our approach is to visually inform and engage users with colorful and entertaining animation, interactive features, and contextualization of facts and figures.

Our educational site is composed of three unique, interactive views of the ToL:

1) A “Big Picture” tree provides a zoomed-out timeline perspective of life’s history on earth and explains key elements of the tree of life using a stylized, graphic visualization. This ‘macro’ view presents the evolutionary history of Earth, starting from the creation of our planet and spanning all the way to present day. As the user moves up the timeline, the tree ‘grows’ in front of them, revealing historical information; each new screen also offers a detailed explanation of one of several core concepts surrounding evolution. Video explanations featuring animations and live narrators explain each of these core concepts, and ‘pop-up’ boxes alongside the videos offer additional information.

Key elements:

  • A macro View
  • Key Concepts
  • Timeline of Life
  • Chaptered Format/Parallax Scrolling
  • Videos and Animation

The core concepts we explore are:

  • The Origin of Life
  • The Three Domains of Life
  • Common Ancestors
  • Extinction
  • Biodiversity
  • Lateral/Horizontal Gene Transfer & Genes

2) The page titled “Categorizing Life on Earth” is a mid-sized view of life, a data-driven interactive tree with a focus on the groupings of species (clades). This tree uses a sampling of data to illustrate hierarchy with a familiar ‘tree’ structure that employs branching lines of evolution. It pulls images from Phylopic and data from EOL for descriptions. A user can expand and contract nodes to view clades they find interesting. Still to come: we are exploring ways to illustrate LGT, and we are working on connecting nodes back via their common ancestry, so that clicking any two nodes will show you a visualization of how those species are connected through the whole tree of life.

Key elements:

  • Mid-sized view of major clades
  • Data-driven interactive
  • Shows Common Ancestry, Phylogeny and Clades
  • Species groupings ending in Clades

3) The “Explore Species” page is our ‘micro view’ of species on Earth. This interactive spinning wheel allows a user to select any of about 180 species to learn about. The 180 species were chosen as exemplary based on many factors: some were chosen for their relative familiarity to the general public, but many were chosen due to specific scientific breakthroughs associated with them. Many were the first species within their field of study to be gene-sequenced, some are keystone species with important evolutionary relatives, and others have strange or unique characteristics worthy of mention.

The information offered for each species includes an image (when available), scientific and common names, the major domain within which the species resides, and then a brief description of the species. This was achieved using the Encyclopedia of Life’s online API, which allowed us to pull information and other resources off of their site to show on ours. As a way of opening an educational portal between the two, any species you click on in the Wheel of Life can also be visited on its parent page at the Encyclopedia of Life, where much more information about all species can be found. We hope that this partnership will prove very fruitful for bringing in casual interest and turning it into a burning passion for evolutionary science and history. Even if we only end up with a few more zoologists, we’ll be happy.

Key elements:

  • Micro View
  • Exemplary/representative species
  • Connects to EoL API, a gateway for further learning
  • Catalogue of interesting species.
  • Some info on Major Domains.
  • Fun, introductory look into species and their connections.

We welcome your comments. —John Allison and Karl Gude


FuturePhy

This is the first in a series of posts about several newly funded NSF phylogeny initiatives focused on both technical and community aspects of phylogeny. There is plenty of potential for mutually beneficial work with OpenTree, and we are excited to help.

First up… FuturePhy!

FuturePhy is an NSF-sponsored, three-year program of conferences, workshops and hackathons on the Tree of Life. The project aims to promote novel, integrative data analyses and visualization, interdisciplinary syntheses of phylogenetic sciences, and cross-cutting uses of phylogenetics to develop and address new research questions and applications.

The first phase of this mission is critical: to bring together a broad community of people from diverse backgrounds who are active in phylogenetics research, who use the tree of life in research or education, who will benefit in applied or practical ways from a comprehensive tree of life, or who come from a background that offers new perspectives on defining, addressing or transcending key challenges in phylogenetics.

Help accelerate progress in all aspects of phylogenetics research by joining FuturePhy today. Diverse opportunities will be available to attend FuturePhy sessions in person or virtually, and to link FuturePhy to existing projects and initiatives.

We invite you to participate in the project in several ways:

  1. Register on futurephy.org. Scientists from all aspects of the phylogenetic sciences, educators, members of the tree-using community, and others interested in phylogenetics are welcome.
  2. Take the community survey and let FuturePhy know what workshop and hackathon topics they should fund.
  3. Contribute to the discussion forum on futurephy.org. This is the best way to log your interest and contribute ideas.
  4. Send email to contact@futurephy.org with ideas or comments.
  5. Tweet to the FuturePhy community: @FuturePhy.
  6. Comment in the FuturePhy phylobabble thread.

Update on synthesis methods

The current Open Tree of Life synthesis methods are based on the Tree Alignment Graphs described by Smith et al 2013. The examples presented in that paper used much simpler datasets than the dataset that is used for draft tree synthesis by the Open Tree of Life (which contains hundreds of original source trees and the entire OTT taxonomy with over 2.3 million terminal taxa). To accommodate the goals of synthesis, some modifications were made to the methods presented in Smith et al 2013. The current version of the draft tree (v2, which is presented at http://tree.opentreeoflife.org as of February 2015 and described in a preprint on bioRxiv) was built using these modified methods. The changes to synthesis that were introduced since Smith et al 2013 are not well described elsewhere, so we present them below.

We are continually testing and improving the methods we use to develop synthesis trees, and through this process we have recently discovered some methodological properties of the modified TAG procedures that are undesirable for our synthesis goals. We are making progress toward fixing them for the next version of the draft tree, and there are details at the end of this post.

General background on the Open Tree of Life project and the draft tree

The overall goal of OpenTree is to summarize what is known about phylogenetic relationships in a transparent manner with a clear connection to analyses and the published studies that support different clades. Comprehensive coverage of published phylogenetic statements is a very long term goal which would require work from a large community of biologists. The short-term goal for the supertree presented on the tree browser is to summarize a small set of well-curated inputs in a clear manner.

Background on Tree Alignment Graph methods

The current synthesis method uses a Tree Alignment Graph (TAG), described in Smith et al 2013. We have been using TAGs because:

  • These graphs can provide a view on conflict and congruence among input trees.
  • TAG-based approaches are computationally tractable at the scale at which the Open Tree of Life project operates (2.3 million tips on the tree, and hundreds of input trees).
  • TAG-based approaches provide a straightforward way to handle inputs in which tips of a tree are assigned to higher taxa (any taxon above the species level). It is fairly common for published phylogenies to have tips mapped at the genus level (or higher).
  • When coupled with expert knowledge in the form of ranking of input trees, TAG methods can produce a sensible summary of our (rather limited) input trees. At this point in the project, our data store does not contain a large number of trees sufficiently curated* to be included in the supertree operations.

* Sufficiently curated = 1. tips mapped to taxa in the Open Tree Taxonomy; 2. rooted as described in the publication; 3. ingroup noted. Incorrect rootings and assignments of tips to taxa can introduce a lot of noise in the estimate, so we have opted for careful vetting of input trees rather than scraping together every estimate available. We are hopeful that community involvement in the curation will get us to a point of having enough input trees to allow more traditional supertree approaches to work well, so that we can present multiple estimates of the tree of life.

Methods used to produce the v2 draft tree

The open tree of life project has been alternating between phases where we (1) add more trees to our set of curated input trees, and then (2) generate new versions of the “synthetic” draft tree of life. Thus far two versions of the tree have been publicly posted to http://tree.opentreeoflife.org. The process of generating a new public draft tree involves the creation and critical review of many unpublished draft trees in order to detect errors or problems with the process (which could be due to misspecified taxa in input trees, software bugs, etc.).

This process has led to a few modifications of the TAG procedure as it was described in the PLoS Comp. Bio. paper. These modifications have been made to our treemachine software, and they include:

  • In the original paper, conflict was assessed by whether there was conflicting overlap among the descendant taxa of the nodes, not the edges. The software that produced the v2 tree assessed conflict between edges of the graph by looking for conflict based on the taxon sets contributed by each tree. This change is referred to as the “relationship taxa” rule (see this issue on GitHub).
  • The supertree operation moves from root to tips, and occasionally a species attaches to a node via a series of low-ranking relationships. When all of these are rejected (due to conflict with higher-ranking trees), the species would be absent from the full tree if we followed the original TAG description faithfully. Instead, the version of treemachine used for the v2 tree reattached these taxa based on their taxonomy after sweeping over the full tree.
  • The “Partially overlapping taxon sets” section of the paper described a procedure for eliminating order-dependence of the input trees. We have recently discovered a case in which the structure of a TAG built according to those procedures would differ depending on the input order of the trees. We have implemented a new procedure that pre-processes all the input trees, which removes this order-dependence (code for the new procedure can be accessed in the find-mrcas-when-creating-nodes branch of the treemachine repo on github).
  • To increase the overlap between different input trees, an additional step was implemented in treemachine that mapped the tips of an input tree to deeper nodes in the taxonomy that they may have represented. This was done by determining the most inclusive taxon that a tip could belong to without including any other tips in the tree, and then mapping the tip to that taxon instead of the taxon actually specified for the tip in the input tree itself. For example, if the only primate in a tree was Homo sapiens, but the tree contained other mammals from the taxon sister to Primates (in the taxonomy), then the Homo sapiens tip would be assigned to the taxon Primates (a sketch of this remapping rule appears below).
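Here is a rough sketch of that remapping rule in Python (illustrative only, not the treemachine implementation; the toy taxonomy and the child-to-parent dictionary representation are invented for this example):

from collections import defaultdict

def remap_tips(parent, tips):
    """Map each tip to the most inclusive taxon containing it but no other
    tip of the same input tree. `parent` is a child -> parent dict for the
    taxonomy; `tips` are the taxa the input tree's tips are mapped to."""
    # Count, for every taxon, how many of this tree's tips fall inside it.
    contained = defaultdict(int)
    for t in tips:
        node = t
        while node is not None:
            contained[node] += 1
            node = parent.get(node)
    remapped = {}
    for t in tips:
        node = t
        # Walk rootward while the parent taxon still contains only this tip.
        while parent.get(node) is not None and contained[parent[node]] == 1:
            node = parent[node]
        remapped[t] = node
    return remapped

# Toy taxonomy: Homo sapiens is the only primate among the tree's tips, so it
# is remapped to Primates; Mus musculus is likewise remapped to Rodentia.
parent = {
    "Homo sapiens": "Homo", "Homo": "Hominidae", "Hominidae": "Primates",
    "Primates": "Mammalia", "Mus musculus": "Mus", "Mus": "Muridae",
    "Muridae": "Rodentia", "Rodentia": "Mammalia", "Mammalia": None,
}
print(remap_tips(parent, ["Homo sapiens", "Mus musculus"]))
# {'Homo sapiens': 'Primates', 'Mus musculus': 'Rodentia'}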

Undesirable properties of the procedures used to produce v2

  • It was possible for edges to exist in the draft tree that were not supported by any of the input trees. There were a very small number (111) of such groups in the v2 tree; this GitHub issue discusses the problem more thoroughly. This is not an unusual property for a supertree method to have – in fact most supertree methods can produce such groups. And under some definitions of support (e.g. induced triples) these groupings would probably have had support in our input trees. However, not being able to link every branch in the supertree to a supporting branch in at least one input tree made the draft tree more difficult to understand. We are working on modifications to the procedure that do not produce these groupings.
  • There were 22 taxonomic groupings mislabeled in the supertree (see issue 154 for details). In addition, the definition of support used to indicate when an input tree “supported” a particular edge in the synthesis could be counterintuitive in some cases. The current view of the tree reports an input tree in the “supported by” panel if the branch in the draft tree passes along an edge that is parallel to an edge contributed by that input tree. Because some of the included taxa may have been culled from the group and reattached in a position closer to the root, an input tree can be in conflict with a grouping but still be listed as supporting it (see issues 155 and 157).

The draft tree contains over 2 million tips and many hundreds of thousands of internal edges, so the undesirable properties mentioned above affected only a tiny fraction (well under 0.1%) of the branches in draft tree v2. Nonetheless, we are in the process of developing fixes for these problems, which should further improve the interpretability as well as the biological accuracy of future versions.


Preprint: Synthesizing phylogeny and taxonomy into a comprehensive tree of life

We’ve just posted a preprint on bioRxiv of our submitted manuscript on how we are combining taxonomy and phylogeny into a comprehensive tree of life:

http://www.biorxiv.org/content/early/2014/12/05/012260

You can browse the complete tree at http://tree.opentreeoflife.org

Comments welcome (either here or on bioRxiv). Note that the authorship list is woefully incomplete – bioRxiv only allows 20 authors in the submission process. Here is the complete list:

Stephen A. Smith, Karen A. Cranston, James F. Allman, Joseph W. Brown, Gordon Burleigh, Ruchi Chaudhary, Lyndon M. Coghill, Keith A. Crandall, Jiabin Deng, Bryan T. Drew, Romina Gazis, Karl Gude, David S. Hibbett, Cody Hinchliff, Laura A. Katz, H. Dail Laughinghouse IV, Emily Jane McTavish, Christopher L. Owen, Richard Ree, Jonathan A. Rees, Douglas E. Soltis, Tiffani Williams


Tree-for-All hackathon series: Taxon sampling, part 1 

Sampling taxa with Python and Perl scripts

This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.  To read the whole series, go to the Introduction page.

More specifically, this is the first of two posts addressing the outputs of the “Sampling taxa” team, consisting of Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (OpenTree) and Arlin Stoltzfus (NIST).[1]

The “taxon sampling” idea

Although users seeking a tree may have a predetermined set of species in mind, often the user is focused on taxon T without having a prior list of species. For instance, the typical user interested in a tree of mammals does not really want the full tree of > 5000 known species of mammals, but some subset, e.g., a tree with a random subset of 100 species, or a tree of the 94 species with known genomes in NCBI, or a tree with one species for each of ~150 mammal families.

If we think about this more broadly, we can identify a number of different types of sampling, depending on what kinds of information we are using, and how we are using it. First, sampling T by sub-setting is simply getting all the species in T that satisfy some criterion, e.g., being on the IUCN red list of endangered species,  or having a genome entry in NCBI genomes, a species page in EOL, or an image in phylopic.org (organism silhouettes for adorning trees).

Second, we might use a kind of hierarchical taxonomic sampling to get 1 (or more) species from each genus (or family, order, etc.).


Poster from hackathon day 1, making the pitch for sampling taxa as a hackathon target

Third, we could reduce the complexity of a taxon or clade without using any outside information (what we might call down-sampling), e.g., get a random sample of N species from taxon T, down-sample nodes according to subnode density, or choose N species to maximize phylogenetic diversity.

Finally, we can imagine a kind of relevance sampling, where we choose (from taxon T) the top N species based on some external measure of importance or relevance, e.g., the number of occurrence records in iDigBio (or GBIF, iNaturalist, etc.), the number of google hits (i.e., popular species), or the number of PubMed hits (i.e., biomedically relevant species).

Products of the “taxon sampling” team

At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases:

  • sub-setting: get species in T with entries in NCBI genomes
  • down-sampling: get a random sample of N species from T
  • relevance sampling: get the N species in T with the most records in iDigBio

Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to match species names to OpenTree taxon identifiers (ottIds), and the induced_tree service to get a tree for species designated by these identifiers.

Here I’ll describe two projects based on command-line scripts in Python and Perl.  In the next post, I’ll describe how taxon sampling was implemented within existing platforms with graphical user interfaces, including Open Refine (spreadsheets), PhyloJIVE, and Arbor.

Down-sampling in Python

A simple down-sampling approach via random choice is implemented in the random_sample.py script developed by Dilrini De Silva (Oxford) and Jonathan Rees (OpenTree), as in this example:

python random_sample.py -t Mammalia -m random -n 50 -o my_induced_tree.nwk

Here, “Mammalia” can be replaced by another taxon name, “50” may be replaced by another number, and the -o flag is used to specify an output file. The script uses the ‘opentreelib’ python library (another hackathon product available on github) to interact with OpenTree. It retrieves the unique OTT id of the higher taxon specified via the -t flag, and queries OpenTree to retrieve a subtree under that node. It parses the subtree to identify the implicated species, selects a random sample of the species, and requests the induced subtree, writing this to a newick file.
(Figure: example induced subtree for a random sample of mammals)
This script also invokes a rendering library to create a graphic image of the tree from the command-line, as in the example (figure) showing a random sample of 10 mammals.
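For readers who want to see the moving parts, here is a rough standalone sketch of a similar down-sampling workflow written directly against the v2 web services (this is not the hackathon’s random_sample.py and does not use opentreelib; the tree_of_life/subtree method, the response fields “ot:ottId” and “newick”, and the “_ott<number>” convention for labels in the returned newick all follow my reading of the v2 API and should be checked against the docs):

import json, random, re, urllib.request

API = "http://api.opentreeoflife.org/v2"

def call(method, payload):
    req = urllib.request.Request(API + method,
                                 data=json.dumps(payload).encode(),
                                 headers={"content-type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read().decode())

# 1. Resolve the taxon name to an OTT id.
match = call("/tnrs/match_names", {"names": ["Felidae"]})
ott_id = match["results"][0]["matches"][0]["ot:ottId"]

# 2. Fetch the synthetic subtree below that taxon and crudely harvest the
#    OTT ids embedded in its labels (a real tool would parse the newick).
newick = call("/tree_of_life/subtree", {"ott_id": ott_id})["newick"]
ids = list({int(m) for m in re.findall(r"_ott(\d+)", newick)})

# 3. Take a random sample and request the induced tree for just those ids.
sample = random.sample(ids, 10)
print(call("/tree_of_life/induced_subtree", {"ott_ids": sample})["newick"])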

Sub-setting in Perl

The specific sub-setting challenge that the team picked was to get a tree for those species (in a named taxon) that have a genome entry in NCBI genomes. NCBI offers a programmable web-services interface called “eutils” to access its databases. Because NCBI searches can be limited to a named taxon, it is possible to query the genomes database with the “esearch” service for “Mammalia” (or Carnivora, Reptilia, Felidae, Thermoprotei), cross-link to NCBI’s taxonomy database using the “elink” service, get the species names using the “esummary” service, and then use OpenTree services (as described in the Introduction) to match names and extract the induced tree.

This 5-step workflow, which illustrates the potential for chaining together web services to build useful tools, was implemented by Arlin Stoltzfus (NIST) as a set of Perl scripts. The master script invokes 5 other standalone scripts, one for each step. The last 2 scripts are simply command-line wrappers for OpenTree’s match_names and induced_subtree methods. All the scripts are available in the Perl subdirectory of the team’s github repo. They are demonstrated in the brief (<2 min) screencast below.

Next

The taxon sampling group produced several other products.  In the next post, I’ll describe how taxon sampling was implemented within environments that provide a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE (phylogeographic visualization), and Arbor (phylogeny workflows).


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.

 


Tree-for-All hackathon series: taxon sampling, part 2

Sampling taxa in PhyloJiVE, Open Refine, and Arbor

This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.  To read the whole Tree-for-All series, go to the Introduction page.

More specifically, this is the second of two posts on work of the “taxon sampling” team: Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (Open Tree) and Arlin Stoltzfus (NIST).[1]  The team got significant help from Arbor team members Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis).

Products of the “taxon sampling” team

At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases of sampling up to N species from a taxon T:

  • sub-setting: get species in T with entries in NCBI genomes
  • down-sampling: get a random sample of N species from T
  • relevance sampling: get the N species in T with the most records in iDigBio

Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to convert species names to OT taxon ids, and the induced_tree service to get a tree for species designated by ids.

In the previous post, I described 2 projects based on command-line scripts in Python and Perl.  Below, I’ll describe how taxon sampling was implemented within existing platforms with graphical user interfaces, including Open Refine (spreadsheets), PhyloJiVE (phylogeographic visualizations), and Arbor (phylogeny workflows).

Relevance sampling in PhyloJIVE

Previously we defined “relevance sampling” as finding a subset of species in some taxon that is the most relevant by some external measure, e.g., number of hits in google (popular species).  In particular, the taxon-sampling team defined its target relative to iDigBio (Integrated Digitized Biocollections), which makes data and images for millions of biological specimens available in electronic format for the research community, government agencies, students, educators, and the general public.  The challenge is to get a tree for the N species in taxon T with the most records in iDigBio.  Because iDigBio has its own web-services interface, we can query it automatically using scripts.

A version of relevance sampling was implemented by Andréa Matsunaga (U. Florida) to show how phylogenies can be integrated into an environment for analyzing biodiversity data. For this demonstration, OpenTree services were invoked from within PhyloJIVE (Phylogeny Javascript Information Visualiser and Explorer), a web-based application that places biodiversity information (aggregated from many sources) onto compact phylogenetic trees.


PhyloJIVE live demo software developed by hackathon participant Andrea Matsunaga.  Choosing the “top 10 Felidae” menu item queries iDigBio for the cats with the most records, then obtains a tree on the fly by querying Open Tree.  Clicking on Leopardus pardalis (ocelot) on the resulting tree  opens up a map viewer showing the locations associated with records (red dots).

A live demo provides access to several pre-configured queries. For instance, choosing the “top 10 Felidae” menu item returns an OpenTree phylogeny for the 10 cat species most frequently implicated by iDigBio records. In the resulting view (above), mousing over the boxes reveals the number of records for each species. Clicking on a species (e.g., Leopardus pardalis above), shows a map of occurrence records.

Sub-setting and relevance-sampling in Open Refine

OpenRefine spreadsheet populated with counts of occurrence records captured by invoking iDigBio webservices directly

Open Refine (formerly Google Refine) is an open-source data management tool with an interface like a spreadsheet, but with some of the features of a database.  Nicky Nicolson (Kew Gardens) teamed up with Andréa Matsunaga (U. Florida) to explore how Open Refine’s scriptable features can be used to populate a spreadsheet with occurrence data from iDigBio (obtained via iDigBio’s web services), as shown above.

Phylogeny view generated from within OpenRefine by invoking a javascript phylogeny viewer

Further scripting can be used to generate a column of OpenTree taxonomy ids from a column of species names, by invoking the tnrs/match_names service. Finally, one can submit a query for the induced tree for a selected column of species identifiers. The image above shows a custom “OpenTree” item that has been added to the Open Refine menu to retrieve a tree, which is then visualized using a JavaScript viewer (the phylogeny view shown above).

The value of this demonstration, explained more fully on the refine-opentree project wiki, is that the user has considerable flexibility to create and manage a set of data using the Open Refine spreadsheet features, but also has the power to invoke external web services from iDigBio and OpenTree.

Sub-setting and relevance-sampling in Arbor

Arbor (http://arborworkflows.com) provides a framework for constructing and executing workflows used in evolutionary analysis.   Andréa Matsunaga (U. Florida) and Kayce Bell (U. New Mexico) worked with Arbor developers Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis) to implement approaches to sub-setting and relevance-sampling by producing code and workflows in python/Arbor.  A live demo of Arbor that includes OpenTree menu items is accessible at arbor.kitware.com.   One of the nice things about Arbor is that it provides a graphical workflow editor, allowing you to piece together workflows from modules, by connecting inputs and outputs.  The workflow shown below begins by querying iDigBio, and ends with generating an image of a tree.

arborworkflow

High-level view of Arbor workflow to capture iDigBio records, and then acquire matching taxon names and the induced tree from OpenTree

To view the OpenTree-specific menu items on the public Arbor instance hosted at kitware.com, you must click on the view (eye) icon next to “OpenTree.” Be warned that, at present, menu items are undergoing changes. The menu item currently entitled “Get ranked scoped scientific names from iDigBio” will return a list of species names that can then be used to retrieve a tree from OpenTree. The analysis takes a scientific name at any of various ranks (the scope), or runs a full taxonomic search (leave the scope at _all), and returns a list of species of the specified size, consisting either of the top-ranked species (those with the most records) or a random set of species that meet the criteria, depending on what you specify. This has also been incorporated into a menu item (“Workflow to get an induced tree from a configurable iDigBio query”) that takes the specifications for the iDigBio search as its input; this is the workflow shown above. In the screencast below, Kayce Bell explains exactly how to carry out the individual steps in Arbor.

As Arbor’s interface is designed to allow users to execute a variety of analyses on user-supplied data, there are ways to upload your own tabular data for processing. Currently data is expected to be in CSV format. Algorithms exist in Arbor to match species names against the OpenTree TNRS, request a tree matching specific taxa, and perform comparative analysis on trees and tables. Some auto discovery of tabular taxa names is supported, but it is recommended to have a first column entitled “species”, “name”, or “scientific name”. Online documentation for Arbor is currently being developed, and will be available through the Arbor website.

 


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.


Tree-for-All hackathon series: Introduction

The Tree-for-All: Introduction

Welcome to the first in a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich., Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.   This post is written by Arlin Stoltzfus (NIST)[1], one of the hackathon organizers (but not affiliated with Open Tree in any other way).  Below, I’m going to introduce the rationale and aims of the hackathon, describe the process, and summarize some of the projects.  In subsequent posts, we will discuss products and lessons learned.  The list of forward links will be updated as new posts appear:

Motivation: bridging the accessibility gap

The Open Tree of Life project aims to provide data resources for the scientific community, including

  • a grand synthetic tree covering millions of species, generated from thousands of source trees
  • a database of the source trees, published species trees used to generate the synthetic tree
  • a reference taxonomy used (among other things) to align names from different sources

The premise of synthesizing a grand Tree of Life, and making it available with source studies and a reference taxonomy, is that these resources are valuable.  To assess the value of these resources right now would be premature— we will  return to that question later.  For now, I will just point out that, until recently, when scientists in the bioinformatics community have needed a tree broadly covering the kingdoms of life, they have used the NCBI taxonomy hierarchy (multiple examples are cited by Stoltzfus, et al., 2012), an approach that causes phylogeneticists and systematists to groan.  Surely we are better off now, but determining how much better off we are probably will require further analysis.

For the present, it is important to understand that the value of a community resource is predicated on accessibility.  Most users would not know how to handle a tree with 3 million species, useful or not.  For the value of OpenTree’s resources to be realized, it is important to anticipate the needs of users, and support them with appropriate tools.

The aim of the recent Tree-for-all hackathon was to begin bridging this accessibility gap.  More specifically, the aim of the hackathon was to build capacity for the community to leverage Open Tree’s resources via their recently announced web services API (Application Programming Interface).   This enhanced capacity may take the form of end-user tools, library code, standards, and designs.

Technology: web services

Web services are a natural choice for accessibility, because they provide programmable access to a resource to anyone with a networked computer.  Most of the time when you use the web, you are sending a request for a specific page, and receiving results in HTML that are rendered by your browser.  But more generally, web services work by a standard protocol that allows you to send data and commands, and receive results.

Some services are so simple that you can access them just by typing in the URL box of your browser.  For instance, TreeBASE has a web-services API that allows you to access data with commands such as

http://purl.org/phylo/treebase/phylows/tree/TB2:Tr2026?format=nexus

which retrieves a particular tree in NEXUS format. When that isn’t enough, you can use a command-line tool such as cURL (“client URL”), found on most computer systems. I’ll give an example using cURL, then explain how to use a Chrome extension called DHC that provides a graphical user interface.

Open Tree’s web API can do many things, but let’s start with something simple: find out what the synthetic tree implies about the relationships of a set of named species, “Panthera tigris”, “Sorex araneus”, “Erinaceus europaeus”.   To get the tree, we need to chain together a workflow based on 2 web services, the match_names service (click to read the docs) to convert species names to OT taxon identifiers, and the induced_tree service to get a tree for species designated by identifiers.  In the first step, using cURL, we issue this command:

curl -X POST http://api.opentreeoflife.org/v2/tnrs/match_names \
-H "content-type:application/json" \
-d '{"names":["Panthera tigris","Sorex araneus","Erinaceus europaeus"]}'

This command matches our list of input names with the names in OpenTree’s taxonomy. If a species is in the tree, it will have an id in the taxonomy. The output of this command yields the matching identifiers 633213, 796660, and 42314.  To find them, scroll through the output and look for the “ottId” field, which refers to Open Tree taxonomy ids.  Once we have those ids, the next step is to use them to request the tree:

curl -X POST http://api.opentreeoflife.org/v2/tree_of_life/induced_subtree \
-H "content-type:application/json" \
-d '{"ott_ids":[633213, 796660, 42314]}'

which returns a Newick tree (embedded in JSON). OpenTree’s interface refers to this as the “induced” tree, though perhaps it is more appropriately called the implied tree: for any set of nodes in the synthetic tree, the structure of the larger tree immediately implies a topology for the subset, e.g., the tree of A, C and E implied by (A,(B,(C,(D,E)))) is (A,(C,E)).

To run these commands in DHC, start with the cURL command above, then copy and paste the service (the “http” part) and the body (after the -d), into the appropriate boxes, click on “JSON” below the body window (or set the header to content-type: application/json), choose “POST”, then hit “Send”.  The output will appear below.


DHC allows you to use web services in a one-off manner, interactively, but the real power of web services starts to emerge when they are invoked and processed in an automated way, within another program.
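For example, a minimal Python sketch that chains the same two calls shown above might look like this (the response field names “ot:ottId” and “newick” are taken from the v2 responses described earlier and are worth double-checking against the docs):

import json, urllib.request

def post(method, payload):
    req = urllib.request.Request("http://api.opentreeoflife.org/v2" + method,
                                 data=json.dumps(payload).encode(),
                                 headers={"content-type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read().decode())

names = ["Panthera tigris", "Sorex araneus", "Erinaceus europaeus"]
matched = post("/tnrs/match_names", {"names": names})
ott_ids = [r["matches"][0]["ot:ottId"] for r in matched["results"]]  # 633213, 796660, 42314
print(post("/tree_of_life/induced_subtree", {"ott_ids": ott_ids})["newick"])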

Process: Hackathon

Open Tree announced version 1 of its web services in May, at the same time we distributed an open call for participation in a “Tree-for-all” hackathon, which took place September 15 to 19 at University of Michigan, Ann Arbor.  The hackathon was organized and funded by Open Tree, the Arbor workflows project and NESCent’s HIP (Hackathons, Interoperability, Phylogenies) working group.

What, exactly, is a hackathon?  A hackathon is an intensive bout of computer programming, usually with a scope that allows for considerable creativity (when the objectives are pre-determined, the event might be called a “code sprint” instead).  Often it involves bringing together people who haven’t worked face-to-face before.

The tree-for-all hackathon followed a plan for a participant-driven 5-day meeting with ~30 people.  The participant pool is seeded with some hand-picked developers, but consists mainly of folks who have responded to an open call.  The people chosen to participate are not all elite super-coders— some are subject-matter experts without advanced coding skills.  On the morning of day 1, these participants hear informational presentations— in this case, about Open Tree’s data and services (above), the Arbor workflow project, and HIP’s vision of an interoperable web of evolutionary resources.  This is followed by open discussion of possible projects, a process that typically begins (via email list) long before the hackathon.

On the afternoon of Day 1 comes the make-or-break moment: pitching and team-formation.  Participants with ideas stand up, make a pitch for a software development target, and post it on the wall using a giant sticky note.  Others move from pitch to pitch, critiquing, suggesting ideas, and trying to find where they could contribute (or learn) the most.  Pitches evolve through this process, and eventually a set of teams emerges.  From this point on— days 2 to 5 of the hackathon— the meeting belongs to the teams.  The hackathon will succeed or fail, depending on the strength of the teams.

Hackathon participants gather to hear a progress report. Left to right: Matt Yoder, Stephen Smith, Cody Hinchliff (standing), Andréa Matsunaga, Joseph Brown, Zack Galbreath (standing), Chodon Sass, Alex Harkess, Julienne Ng (eyes only), Katie Lyons, Gaurav Vaidya (standing), Jorrit Poelen, Shan Kothari (facing left), David Winter, Julie Allen (standing), Karolis Ramanauskas, Nicky Nicolson, Josef Uyeda, Miranda Sinnott-Armstrong (standing), Rachel Warnock, François Michonneau, Luke Harmon, Kayce Bell, Jon Hill’s right arm.

Outcomes: Hackathon team projects

Over the coming weeks, I’m going to write about hackathon team projects and, ideally, provoke some other hackathon participants to do the same.  Hackathon teams are instructed (and cajoled) to focus on tangible outcomes, and the Tree-for-All hackathon produced a lot of them!  For now, here is a brief synopsis.

Integration of Trees and Traits involved hackathon participants Jeff Cavner (remote), Luke Harmon, Zack Galbreath, Jorrit Poelen, Julienne Ng, Alex Harkess, Chodon Sass, Shan Kothari, and Mark Westneat (remote).   They aimed to develop ways to integrate Open Tree’s resources into workflows for analysis of character data and other data.  They already have a nice presentation on their wiki.

Library wrappers for OT APIs involved Joseph Brown, Mark Holder (remote), Jon Hill, Matt Yoder, François Michonneau, Jeet Sukumaran, David Winter, and Karolis Ramanauskas.  The aim of this group was to develop programmable interfaces to Open Tree’s web services in Python, Ruby and R.  They developed an innovative test scheme in which all the libraries were subjected to the same tests.

Phylogeny visualization style-sheets were the focus of Peter Midford (remote), Jim Allman (remote), Pandurang Kolekar (remote), Daisie Huang, Gaurav Vaidya, Julie Allen, and Mike Rosenberg (remote).  Every year thousands of researchers generate  tree images, import them into a graphics editor, and add the same kinds of adornments (colored branches, numbers on nodes, images at the tips, brackets, etc).   The aim of this group was to develop and implement a scheme to treat graphical markups as styles in a separate document (because most tree formats don’t have room for markup), analogous to stylesheets for web pages.

The taxon sampling team included Andréa Matsunaga, Kayce Bell, Dilrini de Silva, Jonathan Rees, Nicky Nicolson and Arlin Stoltzfus.  This group focused on ways to get a phylogeny that represents a sample from a larger taxon— a sample that integrates some useful data, or is otherwise representative of the taxon.

The branch lengths team, including Lyndon Coghill (remote), Rachel Warnock, Josef Uyeda, Katie Lyons, Miranda Sinnott-Armstrong, Bob Thacker (remote), and Curt Lisle (remote) explored ways to address the challenge of adding branch lengths to the synthetic tree.  Like most supertrees, the synthetic tree lacks branch lengths, which limits its usefulness in many kinds of evolutionary studies.

A major knowledge engineering challenge for the Tree of Life community is to link knowledge to nodes in a comprehensive tree, and then ensure that this knowledge persists (as appropriate) when the tree is updated.  A scheme for addressing this challenge was developed and implemented by the annotation database group, including Cody Hinchliff, Karen Cranston, Stephen Smith, Joseph Brown, Mark Holder (remote), Hilmar Lapp (remote) and Temi Varghese.

Next

Next week, I’ll start to describe the work of the taxon sampling team.  To be sure you hear about future posts, click “Follow” in the WordPress bar above this pane.

 


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.


Accessing OpenTree data

With the soft release of v1.0 of the Open Tree of Life (see Karen Cranston’s Evolution talk for details), we also have methods for accessing the data:

* a not-very-pretty but functional page to download the entire 2.5 million tip tree as newick
* API access to subtrees and source trees as well as taxon name services
* clone the github repository of all input trees

A few folks have started to think about ways to interact with the very large newick file, specifically extracting subtrees. Yan Wong posted a perl solution a few weeks ago:

http://yanwong.me/?page_id=1090

Michael Elliot has a C++ package called Gulo which seems to be very efficient (see comments on the post):

http://www.michaelelliot.net/blog/2013/11/09/the-fastest-possible-phylogenetic-deletion-with-phylogenies-of-spotty-animals/

Thrilled to see people working with the data! I note that, despite having APIs to return a subtree or a pruned subtree, downloading all of the data and working with it locally is still an easy and flexible option for many users. We will continue to make our datasets available, and that download page should have more options and tree metrics soon!
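As a small example of the download-and-work-locally route, here is a sketch using the DendroPy library to prune the full newick down to a handful of taxa (the file names are placeholders, the taxon labels must match exactly how they appear in the downloaded file, which may include OTT ids, and reading a 2.5-million-tip newick will take considerable time and memory):

import dendropy

# Load the full draft tree downloaded from the page above (placeholder file name).
tree = dendropy.Tree.get(path="draft_tree.nwk", schema="newick")

# Keep only the taxa of interest; everything else is pruned away.
keep = ["Panthera tigris", "Sorex araneus", "Erinaceus europaeus"]
tree.retain_taxa_with_labels(keep)

tree.write(path="my_subtree.nwk", schema="newick")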


Apply for Tree-for-all: a hackathon to access OpenTree resources

Full call for participation and link to application: http://bit.ly/1ioPPMc

A global “tree of life” will transform biological research in a broad range of disciplines from ecology to bioengineering. To help facilitate that transformation, the OpenTree <http://opentreeoflife.org> project [1] now provides online access to >4000 published phylogenies, and a newly generated tree covering more than 2.5 million species.

The next step is to build tools to enable the community to use these resources.  To meet this aim, OpenTree <http://www.opentreeoflife.org/>, Arbor <http://www.arborworkflows.com/> [2] and NESCent’s HIP<http://www.evoio.org/wiki/HIP> working groups [3] are staging a week-long hackathon September 15 to 19 at U. Michigan, Ann Arbor.  Participants in this “Tree-for-all” will work in small teams to develop tools that use OpenTree’s web services to extract, annotate, or add data in ways useful to the community.  Teams also may focus on testing, expanding and documenting the web services.

How could a global phylogeny be useful in your research or teaching?  What other data from OpenTree would be valuable?  How could OpenTree web services be integrated into familiar workflows and analysis tools?   How could we add to the database of published trees, or enrich it with annotations?

If you can imagine using these resources, and you have the skills to work collaboratively to turn those ideas into products (as a coder, or working side-by-side with coders), we invite you to apply for the hackathon.  The full call for participation (http://bit.ly/1ioPPMc) provides instructions for how to apply, and how to share your ideas with potential teammates (strongly encouraged prior to applying).  Applications are due July 8th. Travel support is provided.  Women and underrepresented minorities are especially encouraged to apply.

If you have questions, contact Karen Cranston (karen.cranston@nescent.org, @kcranstn, OpenTree), Arlin Stoltzfus (arlin@umd.edu, HIP), Julie Allen (juliema@illinois.edu, HIP), or Luke Harmon (lukeh@uidaho.edu, Arbor).

[1] http://www.opentreeoflife.org

[2] http://www.arborworkflows.com/

[3] http://www.evoio.org/wiki/HIP (Hackathons, Interoperability, Phylogenies)


PhyloCode names are not useful for phylogenetic synthesis

Ok, the title is intentionally a bit provocative, but bear with me.

A primary aim of the Open Tree project is to synthesize increasingly comprehensive estimates of phylogeny from “source trees” — published phylogenies constructed to resolve relationships in disparate parts of the tree of life. The general idea is to combine these localized efforts into a unified whole, using clever bioinformatic algorithms.

In this context, a basic operational question is: how do we know if a clade in one source tree is the same as a clade in another source tree? This can be difficult to answer, because source trees are typically constructed from carefully selected samples of individual organisms and their characters (usually DNA sequences). If two source trees are inferred from completely non-overlapping samples of individual organisms, as is commonly the case, is it possible for them to have clades in common, or rather, is it possible for us to determine whether they have clades in common?

I would argue that the answer is yes, with a very important condition: that the organisms sampled for each tree are placed into a common taxonomic hierarchy that embodies a working hypothesis of named clades in the tree of life.

Note an important distinction here: a clade in a source tree depicts common ancestry of selected individual organisms, while a clade in the tree of life is a conceptual group defined by common ancestry that effectively divides all organisms, living and dead, into members and non-members. So a taxon in this sense is a name that refers to a particular tree-of-life clade whose membership is formalized by its position in the comprehensive taxonomic hierarchy.

By placing sampled organisms into a common taxonomic hierarchy, one can compute the relationships between source-tree clades and tree-of-life clades in terms of taxa, a process that I refer to as “taxonomic normalization.”
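To make that operational idea concrete, here is a toy sketch (my own illustration, not OpenTree code) of normalizing a source-tree clade to the least inclusive named taxon, in a shared taxonomy, that contains all of its sampled tips. The taxa and parent links are invented for the example.

    # Toy illustration (not OpenTree code): map a source-tree clade to the least
    # inclusive named taxon that contains all of its sampled tips, using a shared
    # taxonomic hierarchy. The taxa and parent links are invented for the example.
    PARENT = {
        "Homo sapiens": "Homo", "Pan troglodytes": "Pan",
        "Homo": "Hominidae", "Pan": "Hominidae", "Hominidae": "Primates",
        "Mus musculus": "Mus", "Mus": "Rodentia",
        "Primates": "Euarchontoglires", "Rodentia": "Euarchontoglires",
    }

    def lineage(taxon):
        """Path from a taxon up to the root of the shared taxonomy."""
        path = [taxon]
        while path[-1] in PARENT:
            path.append(PARENT[path[-1]])
        return path

    def normalize(clade_tips):
        """Least inclusive named taxon containing every sampled tip of the clade."""
        paths = [lineage(tip) for tip in clade_tips]
        shared = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
        return next(taxon for taxon in paths[0] if taxon in shared)

    print(normalize(["Homo sapiens", "Pan troglodytes"]))  # Hominidae
    print(normalize(["Homo sapiens", "Mus musculus"]))     # Euarchontoglires

Two clades sampled from entirely different organisms can still be recognized as “the same” clade when they normalize to the same named taxon, which is, roughly, the comparison I have in mind.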

An idea that emerges from this line of thinking is that the central paradigm of systematics is (or should be) the reciprocal illumination of phylogeny and taxonomy. That is, phylogenetic research tests and refines taxonomic concepts, and those taxonomic concepts in turn guide the selection of individual organisms for future research. I would argue that this, in a nutshell, is “phylogenetic synthesis.”

Which brings me to the title of this post. In the PhyloCode, taxonomic names are not hypothetical concepts that can be refuted or refined by data-driven tests. Instead, they are definitions involving specifiers (designated specimens) that are simply applied to source trees that include those specifiers. This is problematic for synthesis because if two source trees differ in topology, and/or they fail to include the appropriate specifiers, it may be impossible to answer the basic question I began with: do the trees share any clades (taxa) in common? If taxa are functions of phylogenetic topology, then there can be no taxonomic basis for meaningfully comparing source trees that either differ in topology, or do not permit the application of taxon definitions.

So phylogenetic synthesis requires taxa that are explicitly not functions of phylogenetic topology. Instead, taxa should exist independently as hypotheses to be tested by phylogenetic evidence, and as systematists we should strive to construct comprehensive taxonomic hierarchies. I think this is going to be the real key to making progress in answering the question, “what do we know about the tree of life, and how do we know it?”


Data sharing, OpenTree and GoLife

NSF has released GoLife, the new solicitation that replaces both AToL and AVAToL.  From the GoLife text:

The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.

Data completeness, open data, and data integration are key components of these proposals: inferring well-sampled trees that are linked with other types of data (molecular, morphological, ecological, spatial, etc.) and made easily available to scientific and non-scientific users. The solicitation requires that trees published by GoLife projects be shared in a way that allows them to be understood and re-used by Open Tree of Life and other projects:

Integration and standardization of data consistent with three AVAToL projects: Open Tree of Life (www.opentreeoflife.org), ARBOR (www.arborworkflows.com), and Next Generation Phenomics (www.avatol.org/ngp) is required. Other data should be made available through broadly accessible community efforts (i.e., specimen data through iDigBio, occurrence data through BISON, etc).

What does it mean to publish data consistent with Open Tree of Life? We have a short page on data sharing with OpenTree, and a publication in PLOS Currents Tree of Life with best practices for sharing phylogenetic data. Our phylogeny curation application allows you to upload and annotate phylogenies consistent with OpenTree synthesis, and you can quickly import trees from TreeBASE.

If you have questions about a GoLife proposal (or any other data sharing / integration issue), feel free to ask on our mailing list or contact Karen Cranston directly.


Social curation of phylogenetic studies

People associated with the Open Tree of Life effort are busy on several fronts: writing a paper describing the initial draft release of a comprehensive tree of life, continuing their efforts to obtain estimates of different parts of the tree, improving the Open Tree Taxonomy (OTT) used for name matching, experimenting with new methods for building large trees…

In the midst of that activity (and well aware that we missed our initial goal of having the first release in the first year of the grant), we have recently started to redesign the study curation tool. The goal is a curation tool built around git and GitHub. This decision could be described using a wide variety of adjectives ranging from “foolish” to “inspired” (and probably including several that are not printable on this family-friendly blog). So I (Mark Holder is writing this post) thought I’d explain the rationale behind the decision.

Why do we need to “curate” published trees in the first place?

Unfortunately, even when we can find a phylogenetic estimate in a digital format, some crucial information is often missing. The tasks in the “curation” process typically include:

  • matching the tips of the tree to the appropriate taxon in a taxonomy (OTT in our case; a sketch of this step follows this list);
  • indicating which parts of the tree are rooted with high confidence. Many phylogenetic estimation procedures produce unrooted estimates, and the trees that they emit are often arbitrarily rooted. Properly identifying the “outgroup” is important for the supertree methods that we are using; and
  • describing what the branch lengths and internal node labels on the tree mean.
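To give a flavor of the first task, here is a rough sketch of matching tip labels against OTT with the OpenTree taxonomic name resolution service (TNRS). The endpoint and the response fields (results, matches, ot:ottId) reflect my reading of the v2 API and should be treated as assumptions; check the API docs before building on them.

    import requests

    # Rough sketch of the first curation task: matching tip labels to OTT taxa
    # via the OpenTree taxonomic name resolution service (TNRS). Endpoint and
    # response field names follow my reading of the v2 API and may need adjusting.
    TNRS = "https://api.opentreeoflife.org/v2/tnrs/match_names"

    def match_tip_labels(labels):
        """Map each tip label to the OTT id of its first TNRS match (or None)."""
        resp = requests.post(TNRS, json={"names": labels})
        resp.raise_for_status()
        mapping = {}
        for result in resp.json().get("results", []):
            matches = result.get("matches", [])
            mapping[result["id"]] = matches[0]["ot:ottId"] if matches else None
        return mapping

    print(match_tip_labels(["Erica", "Canis lupus", "Bos taurus"]))

A homonym like Erica (see below) is exactly the sort of case where a “first match” heuristic fails, which is why a human curator still needs to review the mapping.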

In our first year of work on the Open Tree of Life project, we’ve also found many cases in which it would be nice if a downstream software tool could annotate the source tree.

For example, if a phylogeny of plants contains a single animal species, this odd sampling of species could be caused by an incorrect matching of names when the study was imported into the Open Tree of Life system (there are valid homonyms across the parts of life that are governed by different nomenclatural codes; the Wikipedia page on homonyms has a nice discussion of this topic, including the example of the genus name Erica being used for a jumping spider and for a large group of flowering plants known as “heath”). The warning signs of incorrect name matching may not be obvious when a new study is added to the Open Tree of Life system. Ideally, these potential errors would be flagged with comments so that a taxonomic expert could double-check the name matching.

Why not just build a database-driven website with a “page” for each study, so that you can update the study information in one place?

This is exactly what we have done. Fortunately for the project, Rick Ree’s lab already had a tool (phylografter) that did many of these tasks. Rick and his group have continued to improve phylografter as a part of the Open Tree of Life project. The fact that we started the project with a nice tool for study curation is a big part of the reason that we were able to get trees from about 2500 studies into the Open Tree of Life system in this first year (the other “big parts” are the herculean efforts of Bryan Drew, Romina Gazis, Jiabin Deng, Chris Owen, Jessica Grant, Laura Katz, and others to import and curate studies).

If it is not broken, why are we trying to “fix” it?

One of the primary goals of the Open Tree of Life project is to enable the community of biologists to collaboratively assemble phylogenetic knowledge. We are trying to build infrastructure for a system that is as inviting as possible to the community of biologists and software developers. Those goals imply that we should track the contribution of users in a fine-grained manner (so people will get the credit that they deserve), and that the system be open to contributions through many avenues (so that developers will not be constrained to work within one tightly integrated code base).

Phylografter is open in many senses: the code is open-source (see its repository), the study data can be exported via web services (this code snippet is an example of using the service), and interested parties can become study curators. However, the fundamental data store used by phylografter is an SQL database. All writing to the core data store has to be done via adding new functionality to the phylografter tool itself. This is certainly not impossible, but it is not very inviting to developers outside the project who want to dabble with the project.

For example, imagine that you wrote a tool that identifies groupings that might be the result of long-branch attraction. To integrate that sort of annotation tool into our current architecture, you would need to figure out which SQL tables would be affected, write an interface for adding this form of annotation, and implement a system for keeping track of the provenance of each change. This is all possible to do, but it is much more complicated than writing a tool that simply adds an annotation to a file.

Maybe it won’t be too hard to open up the database of phylogenetic studies as versioned text.

Fortunately, the process of adding corrections and annotations to a text file in a collaborative setting is a common problem, and some excellent software tools exist for dealing with this situation. In particular we can use the git content tracker to store the versions of a study in a reliable, secure manner with full history of the file and rich tools that allow many people to collaborate on the same file. GitHub offers some great add-on features (including dealing with authentication of users) and makes it easy to have a core data store that anyone can access. The Open Tree of Life is making heavy use of NexSON already, and that format supports rich annotation (though we do need to iron out the details of a controlled vocabulary). So we should not have to spend much time on designing the format of the files to be managed by git.
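As a purely hypothetical sketch of what “adding an annotation to a file” could look like once studies live as NexSON documents in a git repository: the file path, the annotation key, and the commit message below are all invented, not a settled part of our format (the controlled vocabulary is exactly the detail we still need to iron out).

    import json
    import subprocess

    # Hypothetical sketch: append a free-form annotation to a NexSON (JSON) study
    # file and record the change with git. The path and the "^ot:annotations" key
    # are placeholders, not part of any agreed-upon vocabulary.
    STUDY_FILE = "studies/pg_1234/pg_1234.json"

    def add_annotation(path, note, author):
        with open(path) as fh:
            nexson = json.load(fh)
        notes = nexson.setdefault("nexml", {}).setdefault("^ot:annotations", [])
        notes.append({"author": author, "body": note})
        with open(path, "w") as fh:
            json.dump(nexson, fh, indent=2, sort_keys=True)

    add_annotation(STUDY_FILE, "Possible long-branch-attraction artifact", "lba-checker")
    subprocess.check_call(["git", "add", STUDY_FILE])
    subprocess.check_call(["git", "commit", "-m", "Flag possible LBA grouping in pg_1234"])

The point is not the particular key names but the shape of the workflow: any external tool that can read and write JSON and run git can contribute annotations, without touching an SQL schema.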

We certainly aren’t the first to think of using git as the database for an application (see the gollum project and git-orm, for example). Nor are we the first to think of using GitHub to make data in systematics more open. I love Rutger Vos’s dump of TreeBASE data in https://github.com/rvosa/supertreebase. Ross Mounce has recently started putting many of the data files that he uses in his research on https://github.com/rossmounce/cladistic-data. Rod Page had a nice post a while back titled “Time to put taxonomy into GitHub.” I’m sure there are more examples.

git and GitHub keep coming up in the context of collaboratively editing data because most software developers who have used these tools recognize how they have transformed collaborative software development. Implementing a social tool is tough, but git seems to have done it right. Everyone gets an entire copy of the data (via git clone). You can make your changes and save them in your own sandbox (by committing to a fork or branch). When you think you have a set of changes that are of interest to others, you can ask that they be incorporated into the primary version of the database (via a pull request).

Of course, most biologists won’t want to use the git tool itself. Fortunately we have some very talented developers (Jim Allman, Jonathan “Duke” Leto, and Jonathan Rees) working on a web application that will hide the ugly details from most users. We’re also working on allowing phylografter to receive updated NexSON files, so we won’t have to abandon that tool for curating study data.

It is a bit scary to be adding a new tool this late in our timeline. But we’re really excited about the prospect of having a phylogenetic data curation tool built on top of a proven system for collaboration.

Comments, questions and suggestions are certainly welcome. The software dev page on our wiki has links to many of the communication tools that the Open Tree of Life software developers are using to discuss these (and other) ideas in more detail.

Mark Holder is an associate professor at the University of Kansas’s Department of Ecology and Evolutionary Biology.

Minor edits on Sunday, Oct 6 at 1:30 Eastern: links added for OTT and SQL


Recommending CC0 for GBIF data

GBIF (Global Biodiversity Information Facility) recently issued a request for comment on its data licensing policy. While Open Tree of Life does not currently use specimen data, we do use the GBIF classification to help resolve names and as part of the OpenTree backbone. Jonathan Rees, Karen Cranston, Todd Vision and Hilmar Lapp wrote a response recommending a CC0 waiver for all GBIF data. Here is our summary, and a link to the full response on figshare.

Summary

As a data aggregator, GBIF should aim for policies that benefit both its data providers and its data reusers. Clearly, a GBIF holding little or no data will have little value, but so will a GBIF full of data that is so encumbered with restrictions that reuse is stifled. Our response follows from the proposition that promoting data reuse should be a shared interest of all the parties: data providers, data users, and GBIF itself. We feel the consultation document missed the opportunity to recognize this shared interest, and that furthering the goal of data reuse should in fact be a primary yardstick by which different licensing options are measured.

Tracking the reuse of data is a critically important goal, as it provides a means of reward to data providers, allows scrutiny of derived results, and enables discovery of related research. Initiatives such as DataCite have made considerable progress in recent years by addressing the sociotechnical obstacles to tracking data reuse. By contrast, the consultation, in our view, puts undue weight on legal requirements for attribution. Legal instruments such as licenses are unsuitable for this purpose, were not designed for it, and offer little if any benefit. Moreover, in most of the world there is little to no formally recognized intellectual property protection for data, and it is on such protection that licenses rest.

In short, our recommendations are (1) that all data in GBIF be released under Creative Commons Zero (CC0), a public domain dedication that waives copyright rather than asserting it; (2) that GBIF set clear expectations, in the form of community norms, for how the data it serves should be referenced when reused; and (3) that GBIF work with partner organizations to promote standards and technologies that enable effective tracking of data reuse.

We note that our analysis is based on our understanding of the law; we are not legal professionals and this is not legal advice.

Full response

Response to GBIF request for consultation on data licenses. Karen Cranston, Todd Vision, Hilmar Lapp, Jonathan Rees. figshare.
http://dx.doi.org/10.6084/m9.figshare.799766


Open Tree of Life featured in weekly science journal

Four investigators on the Open Tree of Life project and four postdoctoral researchers affiliated with the project have published an article in the latest issue of Nature, an international weekly journal of science. They conclude that, unfortunately, most phylogenetic trees and nucleotide alignments from the past two decades have been irrevocably lost (“Data deposition: Missing data mean holes in tree of life”).

Click here for the website of Nature (subscription required).


A ‘Social’ Tree

Our group is faced with the challenge of gathering scattered research on some two million known species, placing them and their associated data on a single evolutionary tree of life, and then providing a way for new species to be added by researchers around the world (hence the “Open” in our name). Traditional data storage and software are simply not up to the task. With any number of scientists and researchers contributing their work to the Open Tree, standardization is key. The largest existing evolutionary trees (for example, the trees in Price et al., 2010) contain around 100,000 species. Creating a system that includes twenty times that number means redefining how data are stored and what types of storage are used.

This is no simple task. For instance, what criteria should be used to distinguish species? Genetic material? Variation in certain features? For species that are extinct, like early mammals or dinosaurs, we don’t have enough genetic material to say for certain where they fit in the tree. Does morphology (an animal’s form or shape) then come into play? Terminology must also be standardized: one research team might store their data under one scientific name, while another might use a completely different one.

      (more…)


Here are the slides used by biologist Dr. Karen Cranston (of the National Evolutionary Synthesis Center at Duke University), leader of our Open Tree of Life project, at the recent 2012 Evolution meetings in Ottawa, Canada. They offer a great overview of the project.

Open Tree of Life @Evolution 2012


Please follow us on Twitter.

https://twitter.com/#!/opentreeoflife


What the New York Times Has To Say About Our OpenTree Project

Tree of Life Project Aims for Every Twig and Leaf

By CARL ZIMMER
Published: June 4, 2012

(link to original article)

ROOTS: A simplified diagram of the tree of life, representing about 10 percent of known species. The Open Tree of Life project will combine thousands of such trees. (iPlant Collaborative)

In 1837, Charles Darwin opened a notebook and drew a simple tree with a few branches. Each branch, which he labeled with a letter, represented a species. In that doodle, he captured his newfound realization that species were related, having evolved from a common ancestor. Across the top of the page he wrote, “I think.”

Two decades later Darwin presented a detailed account of the tree of life in “On the Origin of Species.” And much of evolutionary biology since then has been dedicated to illuminating parts of the tree. Using DNA, fossils and other clues, scientists have been able to work out the relationships of many groups of organisms, making rough sketches of the entire tree of life. “Animals and fungi are in one part of the tree, and plants are far away in another part,” said Laura A. Katz, an evolutionary biologist at Smith College.

Now Dr. Katz and a number of other colleagues are doing something new. They are drawing a tree of life that includes every known species. A tree, in other words, with about two million branches.

(more…)


We’re live! Our press release.

Contact: Robin Ann Smith
rsmith@nescent.org
919-668-4544
National Evolutionary Synthesis Center (NESCent)

Researchers aim to assemble the tree of life for all 2 million named species

The resulting tree will be digital, downloadable, continuously updated

Durham, NC — A new initiative aims to build a grand tree of life that brings together everything scientists know about how all living things are related, from the tiniest bacteria to the tallest tree.

Scientists have been building evolutionary trees for more than 150 years, ever since Charles Darwin drew the first sketches in his notebook. But despite significant progress in fleshing out the major branches of the tree of life, today there is still no central place where researchers can go to browse and download the entire tree.

“Where can you go to see their collective results in one resource? The surprising thing is you can’t — at least not yet,” said Dr. Karen Cranston of the National Evolutionary Synthesis Center.

But now, thanks to a three-year, $5.76 million grant from the U.S. National Science Foundation, a team of scientists and developers from ten universities aims to make that a reality.

 

(more…)