[Paleopsych] Gene Finding with Hidden Markov Models

Wed Mar 30 12:14:40 UTC 2005

The application of phylogeny to HMMs is improving gene annotation | By 
Karen Heyman <mailto:kheyman at the-scientist.com>

Haemophilus influenzae made history in 1995 when it became the first 
free-living organism to have its genome completely sequenced. In the decade 
since, some 180 or so organisms have followed suit.
For every one of these genomes, the sequence is only the beginning. The 
challenge for the computational biologists charged with making sense of the 
data: to find the gene sequences hidden within those strings, billions of 
bases long, of As, Cs, Gs, and Ts. The genome annotation strategies these 
computer scientists cum biologists have developed clearly have come a long 
way. The most recent iteration (version 4.0) of the Drosophila genome 
annotation, for instance, updated only 25 predictions out of 13,472 
protein-coding genes.
But improvements can still be made. "If they were 100% reliable, then they 
would have been run on the April 2003 complete human sequence and that 
would've been it. Those would have been your genes," says David Haussler, 
who directs the Center for Biomolecular Science and Engineering at the 
University of California, Santa Cruz. Instead, the human gene count has 
been downsized by nearly one-third, from 35,000 genes to closer to 25,000.
Much of the field's slow progress stems directly from the contradiction at 
the heart of the field of computational biology: the attempt to wed the 
logic of math to the messiness of biology. Algorithms can be simple, 
elegant, and fast, but they may not capture biology, which is often 
anything but simple.
"I think what people overlooked was the complexity of biology," says 
Zhirong Bao, coauthor with Sean Eddy of Recon, a software program for 
identifying repeat sequences. "They start with simple and beautiful graph 
theories and implement them in software, assuming that all their DNA 
sequences are going to behave like their perfect nodes in their perfect 
graphs."
Now as more genomes are sequenced, researchers are looking to bring their 
computations in line with the underlying biology. They are creating 
software that incorporates phylogenetics, the descriptions of evolutionary 
distance, into the field's favorite computational tool, the hidden Markov 
model (HMM).
LEGACY LEGOS Eddy, associate professor at Washington University in St. 
Louis and coauthor of the textbook, Biological Sequence Analysis: 
Probabilistic Models of Proteins and Nucleic Acids (Cambridge University 
Press), has dubbed HMMs "the Legos of computational sequence analysis."1 
HMMs are special instances of graphical models, which were originally 
developed by computer scientists studying machine learning and speech 
recognition. In technical parlance, says Eddy, HMMs "describe a probability 
distribution over an infinite number of sequences." To the uninitiated, 
they resemble a cross between a flow chart and a doodle. In order to 
understand conceptually how HMMs work, consider their origin in speech r  
ecognition, says Haussler. In that field, a computer is asked, given a 
speech wave, what are the phonemes (sounds) that it encodes. The wave is 
the measured signal; the phonemes are the "hidden" signals that give the 
HMM its name.
"There is a probabilistic relationship between phonemes," Haussler 
explains. "After a 'th' sound can easily come an 'r' or an 'ah' or several 
other types of sounds, but not, for example, a 'k' sound. A hidden Markov 
model for speech incorporates all possible phonemes, and for each phoneme 
the probability that it's followed by any other phoneme."
Haussler says the HMM also "models the stochastic relationship between each 
phoneme and the speech wave one might measure for it. In this way it can be 
used to infer the sequence of phonemes that best fits a given segment of 
recorded speech." Translating that to molecular biology, he explains, the 
measured signal is the sequence of nucleotides, while the hidden signal is 
their function. "Biology is trying to speak a language to us, and the HMM 
model is helping us to distinguish the phonemes of that language."
Structurally, says Lincoln Stein of Cold Spring Harbor Laboratory, "HMMs 
work when you have information that moves from left to right, [when] the 
probability of an event to the right is dependent on previous events to the 
left." That logical structure ensures that almost all annotation problems 
can be cast as HMMs, says Eddy. "You've got an observed residue, and 
there's one or more labels [or 'states'] that you can attach to it." He 
continues, "You don't know what the states are. So as you move from left to 
right, you ask, 'what state am I in,' which is 'what label do I attach to 
this residue now?"' So, just as in speech recognition, there is a measured 
signal (the residue) and a hidden one (its label or function).
That's not to say HMMs will solve every problem, says Stein. Noncoding RNA 
genes, for instance, in which function is encoded both in sequence and in 
secondary structure, present a particular problem. "The problem with 
folding is that there's a kind of symmetry. You move left to right and then 
right to left, depending on where you are in the fold," he says.
Taking their cue from computational linguistics, biologists address these 
and similar problems using stochastic context-free grammars. SCFGs are 
really "just HMMs extended to a second dimension," says Eddy. "A lot of the 
concepts are the same in HMMs and SCFGs. SCFG algorithms are just a little 
nastier to work with, much more memory and compute time needed, which is 
the big thing limiting us from deploying SCFGs on every cool computational 
RNA problem we can think of."
Courtesy of Sean Eddy   PLAN 7 FROM OUTER SPACE? This diagram shows the 
Plan7 architecture used in the gene finding software, HMMER. Squares 
indicate match states (modeling consensus positions in the alignment). 
Diamonds indicate insert states (modeling insertions relative to consensus) 
and special random sequence emitting states. Circles indicate delete states 
(modeling deletions relative to consensus) and special begin/end states. 
Arrows indicate state transitions.
HMMS EVOLVE Molecular biologists use the term generically, yet many classes 
or implementations of HMM exist, each of which is adapted to particular 
annotation challenges. Eddy's own work concentrates on identifying 
structural and catalytic RNAs. His lab uses a subclass of HMMs called 
profile HMMs, which were presented in 1992 by Haussler and Anders Krogh as 
an alternative to profiles, the then-popular approach to protein 
classification.
"A profile is the technical term for the model used to represent a family 
of related proteins," says Haussler. Essentially, he says, it is "a 
computer trick" for deciding how to classify a given protein. But profiles 
turned out to be incomplete models, and Krogh and Haussler's innovation was 
to attack the same problem using HMMs. They were not the first researchers 
to apply HMMs to biology, but Krogh and Haussler's work2 on profile HMMs 
electrified the field. "Their technical report flew through the community," 
Eddy recalls. Today, profile HMMs drive the popular protein 
sequence-analysis software, HMMER.
Now Haussler, along with other researchers in the United States and Europe, 
is again generating excitement in the annotation community. This time the 
hubbub surrounds so-called PhyloHMMs, implementations that combine HMMs 
with phylogenetics to drive better gene prediction.3 Among their more 
popular current incarnations is the "conservation graph" on the UCSC Genome 
Browser. Haussler's graduate student, Adam Siepel, created the conservation 
graph.
PhyloHMMs should help with the "sequence weighting" problem that has 
plagued earlier attempts to use HMMs to extract information from multiple 
sequences. "If you just make a naive kind of assumption about these 
sequences all being independent, you introduce biases into your analyses," 
explains Siepel.
The original workaround was to assign arbitrary weights. "People came up 
with these ad hoc ways of adding weights," says Siepel. For instance, if 
the human genome was being compared to those of the chimp and the mouse, 
the researcher might assign less weight to the chimp sequence than the 
mouse, because the two primates are so closely related that it can be 
difficult to identify which sequences are under evolutionary pressure. Eddy 
says that these models have so little genuine statistical basis that he 
calls them "sequence weighting crap."
? 2005 Nature Publishing Group   WHERE'S THE SPLICE SITE? This "toy" HMM is 
trying to find the 5' splice site (5) in a sequence containing an exon 
(E)-intron (I) boundary. (Reprinted with permission from S. Eddy, Nat 
Biotechnol, 22:1315-6, 2004.)
By including phylogeny in the model, though, the weighting begins to make 
biological sense, says Eddy. "With sequence-to-sequence phylogenetic 
correlation, because the species are actually related to each other, the 
number of parameters goes down. There's a phylogenetic tree that now says 
how these sequences are related to each other." Still he warns, "The 
mathematics is a complicated beast, but this is where the field has to go."
Haussler and Siepel4 have used PhyloHMMs to create software they call 
"ExoniPhy," which detects new, novel sequences that are conserved from 
genomes by modeling how codons evolve. "We're trying to find genes that are 
not supported by cDNA evidence, not supported by sequenced mRNAs or ESTs," 
says Siepel, "Many genes, we believe, are being missed by the experimental 
techniques that are used to fish out mRNAs. One of the problems is that 
these genes are very rarely expressed or they're expressed at very low 
levels, so for example you might have a gene that's turned on in the ninth 
day of development for a few hours and then afterwards it's turned off, and 
so it never shows up in these libraries of mRNA."
A phylogenetic approach assumes that a critical gene, even a rarely 
expressed one, will be highly conserved. "Even if it's hard to find the 
mRNA, it's going to have an evolutionary signature in the genome, so we're 
trying to use these evolutionary footprints to help find genes that haven't 
been able to be found," concludes Siepel. He says his team believes it has 
detected "at least several hundred genes" that are strongly supported by 
the evolutionary/cross-species evidence, but they are not supported or only 
very weakly supported by cDNA evidence. They are now working on validating 
their predictions in the lab.
TWINSCAN 3.0 Other researchers, including Michael Brent, are trying similar 
approaches. Brent, a colleague of Eddy's at Washington University, created 
TWINSCAN, a gene-prediction tool that works by comparing two genomes. His 
lab is now working on a TWINSCAN update dubbed N-SCAN.5 As the name 
implies, N-SCAN allows comparisons between unlimited numbers of genomes.
Tests of the program have been encouraging, Brent says. "With N-SCAN, we're 
now up to more than one-third of the known genes [in the human genome] 
predicted exactly right. With more compact genomes like worms, we can get 
more than 60% of the known genomes exactly right."
Courtesy of Adam Siepel   A REAL-WORLD PHYLO-HMM: The conservation graph 
from the University of California, Santa Cruz, human genome browser 
[<http://genome.ucsc.edu>] is based on a PhyloHMM. This plot shows the 
known gene structure alongside the computer gene predictions and genomic 
comparisons with other species for the MET proto-oncogene. Boxes represent 
exons, while lines with chevrons represent introns.
Like ExoniPhy, N-SCAN combines HMM work with phylogenetic concepts, but 
Brent cites important distinctions. "The ExoniPhy model is based on an 
improvement to the classical continuous-time Markov chain model of DNA 
evolution. The key to that is that you have a phylogenetic tree which has 
branch lengths that represent the amount of evolution that's gone on since 
the common ancestor, but the pattern of evolution is the same throughout 
the entire tree."
In other words, the model assumes that the tolerance for substituting a C 
for a T in a particular position doesn't change throughout evolutionary 
history. Instead, what changes is how much time, hence opportunity, there 
has been for such a substitution to actually occur. (Haussler says Siepel 
is currently writing code that relaxes that assumption.)
By contrast, says Brent, "Our model is more flexible; it doesn't have that 
assumption. It allows for one branch to develop a separate pattern of 
evolution at a given site from another branch." Nevertheless, he praises 
ExoniPhy as "elegant" and says, "It has a higher specificity, a lower 
false-positive rate on exons than anybody else."
Such subtle distinctions ensure that no one program will be best suited for 
every application, says Lior Pachter, a mathematician at UC-Berkeley, and 
author of SLAM,7 which does simultaneous gene finding and sequence align  
ment. Indeed, the very idea of comparing various programs amuses Pachter. 
"People like to write programs like SLAM or TWINSCAN and then they compare 
them to other programs, and always in their comparisons they come out 
getting better results," he says. "But of course, it's impossible that 
everybody has the best program."
>From his experiences working on the mouse genome, Pachter observes, "There 
are certain genes that have certain properties that make them easy to 
predict and then we all agree, but there are a lot of cases where we 
disagree." He advises biologists, "It's always a good idea to try all the 
available tools. It's not always that one program is universally better 
than the other in every case."
Consider phylogenetics-based programs, for instance. Roderic Guigo of the 
Universitat Pompeu Fabra in Catalonia, Spain, developer of the comparative 
gene-prediction program called SGP2,6 notes that the phylogenetics data 
that make PhyloHMM-based programs so powerful are also their Achilles' 
heel. "While in general more accurate than purely ab initio methods," he 
writes in an E-mail, "a drawback of the PhyloHMM gene predictors is that 
they require the sequence of one or more genomes at the appropriate 
phylogenetic distance: C. elegans and C. brigasse, human and mouse, Fugu 
and Tetraodon, etc. [On occasion], such a second sequence does not exist."
One thing is certain: Even when these computer tools are finally perfected, 
biology will still be done at the bench, not at the keyboard. "Ultimately 
all computer predictions must be tested in the laboratory," says Haussler, 
"The more you predict, the more wet lab validation there is to be done." 
While lay people may find it ego-deflating that the gene set keeps 
shrinking, experimental biologists may find it something of a relief.
References
1. SR Eddy "What is a hidden Markov model?" Nat Biotechnol 2004, 22: 
1315-6. [PubMed Abstract 
<http://www.biomedcentral.com/pubmed/15470472>][Publisher Full Text 
<http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&cmd=  
prlinks&retmode=ref&id=15470472>]

2. A Krogh et al, "Hidden Markov models in computational biology: 
Applications to protein modeling," J Mol Biol 1994, 235: 1501-31. [PubMed 
Abstract <http://www.biomedcentral.com/pubmed/8107089>][Publisher Full Text 
<http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&cmd=  
prlinks&retmode=ref&id=8107089>]

3. A Siepel, D Haussler "Phylogenetic hidden Markov models," Statistical 
Methods in Molecular Evolution (Edited by: Nielsen R). New York: Springer 
2005.,

4. A Siepel, D Haussler "Computational identification of evolutionarily 
conserved exons," <http://www.cse.ucsc.edu/~acs/recomb2004.pdf> Proceedings 
of the 8th Annual International Conference on Research in Computational 
Biology (RECOMB) 2004.,

5. SS Gross, MR Brent "Using multiple alignments to improve gene 
prediction," Proceedings of RECOMB 2005, in press. ,

6. M Alexandersson et al, "SLAM: Cross-species gene finding and alignment 
with a generalized pair hidden Markov model," Genome Re 2003, 13: 496-502. 
[Publisher Full Text <http://dx.doi.org/10.1101/gr.424203>]

7. G Parra et al, "Comparative gene prediction in human and mouse," Genome 
Res 2003, 13: 108-17. [PubMed Abstract 
<http://www.biomedcentral.com/pubmed/12529313>][Publisher Full Text 
<http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&cmd=  
prlinks&retmode=ref&id=12529313>][PubMed Central Full Text 
<http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedi  
d=12529313>]