Project: Phylogenetic Terraces and the impact of missing data on phylogenomics inference

In phylogenomics, one infers an evolutionary tree for a group of species using genetic information from multiple genes. In mathematical terms, this is an optimization problem: we search for a solution, a tree topology with branch lengths, among an exponentially large number of candidates.
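To give a sense of the size of this search space, the sketch below counts unrooted binary tree topologies using the standard combinatorial formula (2n-5)!! for n labelled species. The function name `count_unrooted_binary_trees` is hypothetical and chosen only for this illustration; it is not part of any of the tools listed on this page.

```python
def count_unrooted_binary_trees(n_taxa):
    """Number of unrooted binary topologies on n_taxa labelled leaves.

    Uses the double factorial (2n-5)!! = 1 * 3 * 5 * ... * (2n-5),
    valid for n_taxa >= 3; smaller inputs have a single topology.
    """
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, count_unrooted_binary_trees(n))
```

Already for 10 species there are more than two million candidate topologies, which is why exhaustive evaluation is not an option and heuristic tree search is used instead.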

It has been shown that, in the presence of missing gene sequences, the tree space is split into groups of equally optimal solutions, called phylogenetic terraces.
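A small illustration of the idea, under stated assumptions: in the usual formulation, two species trees lie on the same terrace when, for every gene, they induce the same subtree on the species that actually have a sequence for that gene. The sketch below checks this for toy trees written as nested tuples. The helper names (`leaf_set`, `induced_splits`, `same_terrace`) and the example data are hypothetical and are not taken from IQ-TREE, terraphast or Gentrius.

```python
from itertools import chain


def leaf_set(node):
    """Return the set of species names below a nested-tuple node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset(chain.from_iterable(leaf_set(child) for child in node))


def induced_splits(tree, taxa):
    """Non-trivial bipartitions of `tree` restricted to the species in `taxa`.

    Each edge splits the kept species into two sides; sides with fewer
    than two species carry no topological information and are dropped.
    """
    keep = frozenset(taxa) & leaf_set(tree)
    found = set()

    def visit(node):
        below = leaf_set(node) & keep
        above = keep - below
        if len(below) >= 2 and len(above) >= 2:
            found.add(frozenset([below, above]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def same_terrace(tree_a, tree_b, gene_taxa):
    """True if both trees induce identical subtrees for every gene."""
    return all(
        induced_splits(tree_a, taxa) == induced_splits(tree_b, taxa)
        for taxa in gene_taxa
    )


# Toy data: species D lacks a sequence for gene 2, species E for gene 1.
gene_taxa = [{"A", "B", "C", "D"}, {"A", "B", "C", "E"}]
tree_1 = (("A", "B"), "C", ("D", "E"))
tree_2 = (("A", "B"), ("C", "D"), "E")
tree_3 = (("A", "C"), "B", ("D", "E"))

if __name__ == "__main__":
    print(same_terrace(tree_1, tree_2, gene_taxa))
    print(same_terrace(tree_1, tree_3, gene_taxa))
```

Here `tree_1` and `tree_2` differ as species trees but agree on every per-gene subtree, so the data cannot distinguish them, while `tree_3` induces a different subtree for gene 1. Comparing split sets is only one simple way to test equality of induced subtrees; it is a sketch of the concept rather than the method used in the tools named on this page.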

In this project, I am developing methods, algorithms and software tools to study, explore and use the notion of equally optimal solutions, in order to improve inference and to simplify the handling of difficult datasets.

Contributions:

  • Generalised the concept of phylogenetic terraces to partial terraces, which occur even more often and lead to trees with very similar scores in supermatrix analysis
  • Developed rules to quickly detect partial and complete terraces during tree search
  • Developed a data structure and implemented the above rules in IQ-TREE, which led to a speed-up and increased precision in partitioned supermatrix analysis
  • Developed the Gentrius algorithm to generate trees from incomplete unrooted subtrees; it assesses the existence and uniqueness of species-trees in supertree analysis and generates all equally scoring species-trees in supermatrix analysis (i.e. all trees from one phylogenetic terrace)
  • Provided implementation of Gentrius in IQ-TREE
  • Developed a simulator to generate datasets with missing sequences
  • Studied datasets with missing sequences to uncover the characteristics that lead to large collections of equally optimal species-trees in supertree and supermatrix analysis
  • Collected and analysed published biological datasets
  • Prepared manuscripts and handled revisions
  • Visualised results
Keywords: phylogenomics, supermatrix, missing data, phylogenetic terraces
Bioinformatic Tools: MAFFT, SEQ-GEN, IQ-TREE, RAxML, terraphast, iTOL, FigTree
Data: DNA/protein sequences, multiple-sequence alignments, presence-absence matrices, simulated and biological datasets
Tech/DEV: Linux, macOS, C++, Python, R, Bash, Git, HPC/grid clusters, development and implementation of algorithms in IQ-TREE, custom pipelines, custom simulator
Publications:
  • O. Chernomor, C. Elgert and A. von Haeseler (2023) Gentrius: identifying equally scoring trees in phylogenomics with incomplete data. bioRxiv
  • O. Chernomor, A. von Haeseler, and B.Q. Minh (2016) Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices. Syst. Biol., 65, 997-1008.
  • O. Chernomor, B.Q. Minh, and A. von Haeseler (2015) Consequences of Common Topological Rearrangements for Partition Trees in Phylogenomic Inference. J. Comput. Biol., 22, 1129-1142.

Designed and coded by O. Chernomor

© 2023