A. Construction of whole genome phylogeny of living organisms, “Tree of Life”
The first task for this project is to develop one or more methods for comparing the whole genome sequences of two organisms, rather than just a set of highly conserved gene or protein sequences, as is currently practiced in the Multiple Sequence Alignment (MSA) method. Our starting point was to treat each whole genome sequence as a book in which each chromosome is a single string of letters without spaces between words. My group has developed the “Feature Frequency Profile (FFP)” method, a variation of the “Word Frequency Profile” method described in the field of Natural Language Analysis for comparing two books. Using the FFP method, we were able to construct phylogenic trees of the three domains of Life at an intermediate resolution and, at a high resolution, trees of two of the largest and most diverse groups of Life: Prokaryotes (Archaea and Bacteria combined) and Fungi, the largest kingdom of Eukarya. Compared with trees based on MSA methods, our trees show high similarity in grouping (clading) at high phylogenic levels, but substantial differences in the branching order of the clades at deeper evolutionary levels. Our next projects are to construct the phylogenic trees of other large “phylogenic” groups, such as protists, Eukaryotic algae, insects, plants and others, and ultimately a “Tree of Life” for all living organisms for which whole genome sequences are available.
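The idea of comparing two genomes through their feature frequencies can be sketched in a few lines of Python. This is only an illustration, not the group's implementation: the function names, the feature length k, and the use of the Jensen-Shannon divergence as the distance between two profiles are all assumptions made for this example, since the text above does not specify them.

```python
from collections import Counter
from math import log2


def feature_frequency_profile(sequence, k):
    """Return the normalized frequencies of all overlapping features of length k.

    The sequence is read as one unbroken string, like a book without spaces.
    """
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the feature length k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Distance between two profiles (assumed measure; 0 = identical, 1 = disjoint)."""
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mean = 0.5 * (pf + qf)
        if pf > 0:
            divergence += 0.5 * pf * log2(pf / mean)
        if qf > 0:
            divergence += 0.5 * qf * log2(qf / mean)
    return divergence


# Toy "genomes" (hypothetical data, far shorter than any real genome).
genome_a = "ATGCGATACGCTTGAATGCGATAC"
genome_b = "ATGCGATACGCTAGAATGCGTTAC"
genome_c = "GGGCCCGGGCCCGGGCCCGGGCCC"

k = 3  # hypothetical feature length
profile_a = feature_frequency_profile(genome_a, k)
profile_b = feature_frequency_profile(genome_b, k)
profile_c = feature_frequency_profile(genome_c, k)

print("distance a-b:", jensen_shannon_divergence(profile_a, profile_b))
print("distance a-c:", jensen_shannon_divergence(profile_a, profile_c))
```

Pairwise distances of this kind, computed for every pair of genomes, form a distance matrix from which a tree can be built with a standard distance-based tree-building method.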
B. Whole genome variation of human species vs. disease susceptibility
Most regions of the genomes of normal human cells have been found to have the same sequence among individuals, but a small fraction, spread throughout the genome, varies within a population. Of these variations, single nucleotide polymorphisms (SNPs) account for the largest number and have been identified at over 3 million genomic “tag” positions out of the 3 billion positions in a whole haploid genome. It is widely accepted that the analysis of SNPs may allow one to predict the genomic component of the susceptibility of individuals to complex diseases such as cancers, neurological diseases, and autoimmune diseases, as well as to other traits. So far, however, the results from current analysis methods (e.g., the Genome-wide Association Studies method) and their interpretation have yielded information of limited predictive value and practical utility for making health-related decisions at the individual or population level without information on family histories.
Recognizing the complexity and heterogeneity of cancer mechanisms, we have developed an empirical, SNP-based approach that uses a supervised machine-learning method, a branch of Artificial Intelligence, to predict the relative genomic susceptibility of an individual to 9 traits: 8 major cancer classes plus a healthy class. The multiclass accuracy of the combined prediction ranges from 33% to 56%, depending on the cancer classes of the testing sets, compared with 11% for a random prediction among 9 traits. Despite the limited SNP data available and the absence of rare SNPs in public databases at present, the results suggest that this framework, or an improved version of it, can predict cancer susceptibility with probability estimates useful for making health decisions for individuals or for a population. Our next projects are to use similar approaches to predict genomic susceptibility to various neurological diseases and autoimmune diseases.
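The 11% baseline quoted above follows from guessing uniformly at random among 9 classes (1/9 ≈ 11%). The short sketch below shows that arithmetic, together with how a multiclass accuracy is computed from predicted and true labels. The function names and the toy labels are hypothetical and serve only to illustrate the calculation; they are not the study's data or code.

```python
def chance_accuracy(n_classes):
    """Expected accuracy of a uniformly random guess among n_classes."""
    return 1.0 / n_classes


def multiclass_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# 9 traits: 8 cancer classes plus a healthy class.
baseline = chance_accuracy(9)
print(f"random baseline: {baseline:.1%}")

# Hypothetical toy labels, only to show how the accuracy is computed.
true_labels = ["healthy", "cancer_1", "cancer_2", "cancer_1", "healthy", "cancer_3"]
predicted = ["healthy", "cancer_1", "cancer_3", "cancer_2", "healthy", "cancer_3"]
accuracy = multiclass_accuracy(true_labels, predicted)
print(f"toy accuracy: {accuracy:.1%}")
```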
C. Whole genome variation of non-human species vs. traits
As longer-term projects, we plan to apply machine-learning approaches similar to those in B above to the genomic variations of various non-human species, such as crop and biofuel plants, insects, farm animals, and others, to predict traits such as drought resistance, high growth, insect resistance, and disease resistance.