Monte Carlo:
Buffons Needle
Question about satellite DNA:
Tandemly Repeated (Satellite) DNA
Central Dogma:
Duplication, transcription(RNA synthesis), translation(protein synthesis).
Purines: A and G
rna
| On/ off, hybridization/nonhybridized |
|
Bernouilli(p) variable. |
| Sum of many such events |
|
Binomial |
and
are the parameters of the distributions,
often
unknown and we will need to estimate it, the estimator
will be
.
Example:
Different codons can encode the same amino acid.
Mycobacterium tuberculosis H37Rv [gbbct]: 3873 CDS's (1321373 codons) ------------------------------------------------------------------------ fields: [triplet] [frequency: per thousand] ([number]) ------------------------------------------------------------------------ UUU 6.2( 8159) UCU 2.2( 2933) UAU 6.1( 8024) UGU 2.2( 2917) UUC 23.3( 30825) UCC 11.5( 15256) UAC 14.6( 19330) UGC 6.6( 8745) UUA 1.6( 2140) UCA 3.5( 4685) UAA 0.5( 609) UGA 1.6( 2120) UUG 17.9( 23684) UCG 19.4( 25611) UAG 0.9( 1144) UGG 14.6( 19342) CUU 5.4( 7188) CCU 3.4( 4457) CAU 6.4( 8488) CGU 8.4( 11130) CUC 17.3( 22811) CCC 17.0( 22503) CAC 15.8( 20910) CGC 28.3( 37392) CUA 4.8( 6278) CCA 6.1( 8085) CAA 8.1( 10700) CGA 7.2( 9539) CUG 50.5( 66756) CCG 31.4( 41507) CAG 22.8( 30081) CGG 24.6( 32515) AUU 6.5( 8551) ACU 3.7( 4837) AAU 5.3( 7003) AGU 3.5( 4684) AUC 33.9( 44767) ACC 35.2( 46519) AAC 19.9( 26348) AGC 14.5( 19147) AUA 2.2( 2893) ACA 4.5( 5992) AAA 5.3( 6944) AGA 1.3( 1682) AUG 18.4( 24348) ACG 15.7( 20683) AAG 15.1( 19889) AGG 3.2( 4203) GUU 8.0( 10578) GCU 10.9( 14424) GAU 15.8( 20815) GGU 18.9( 24945) GUC 32.7( 43214) GCC 59.9( 79118) GAC 42.2( 55775) GGC 51.0( 67348) GUA 4.7( 6274) GCA 12.8( 16947) GAA 16.2( 21373) GGA 9.9( 13114) GUG 40.1( 52998) GCG 48.7( 64319) GAG 30.5( 40322) GGG 19.3( 25455)Proline is
| Codon | o/oo | count | ||
| CCU | 3.4 | ( 4457) | 0.059 | |
| CCC | 17.0 | ( 22503) | 0.294 | |
| CCA | 6.1 | ( 8085) | 0.106 | |
| CCG | 31.4 | ( 41507) | 0.542 | |
| Total | 57.9 | 76552 | 1.001 |
We study the log-likelihood of this :
Each of the counts is binomially distributed with probabilities as described above so that:
multin <-
function(pvec, num)
{
#Function that generates the output of
#putting num balls into length(pvec) boxes,
#with probabilities each given by the vector pvec
run <- runif(num)
gener <- cut(run, cumsum(c(0, pvec)))
return(gener)
}
R code for estimating theta with ML estimate:
> theta<-
function(xvec)
{
return((2 * xvec[3] + xvec[2])/(2 * sum(xvec)))
}
The Bayesian Paradigm can be seen in some ways as an extra step in the modelling world just as parametric modelling is. We have seen how we could use probabilistic models to infer about some unknown aspect either by confidence intervals or by hypothesis testing.
The motivation for any statistical analyses is that some ``target population'' is not well understood- some aspects of it are unknown or unsure.
The idea in this paradigm is to say thta any uncertainty can be modelled in a probabilistic way.
It is true that there are very rarely situations when one doesn't know anything at all, asked to measure the table, you won't want to use a ``pied de coulisse''(callipers) or a 100 yard measuring ribbon.
The probability model that we build can be quite approximate, it reflects one's beliefs and any prior experience we may have, it is described as personal or subjective.
Why ? Because it is different from person to person, examples that are easy to understand are about horse betting, the stock exchange...
So when the uncertainty about the model can be boiled down
to a parameter
the Bayesian statistician treats
as if
it were a random variable
whose distribution describes that
uncertainty.
Elliciting a whole distribution may seem a challenge,
in fact it's done by successive events of the type
,
and does NOT have to be very precise.
A subjective/personal probability is going to be subject to modification upon acquisition of further information supplied by experimental data.
Suppose a distribution with density
describes one's
present uncertainties about some probability model
with density
.
Those uncertainties will change with the acquisition of data
obtained by doing the experiment modelled by
.
Bayes theorem is essential in updating :
The probability of H given the data is called the posterior
probability of H, it is posterior to the data.
The unconditional probability of
:
is the prior
probability of H.
For given data
is the likelihood of H.
For given data we often write :
If it helps (some people have a better understanding of odds):
Represent the data by a random variable Y:
The marginal distribution of Y is not exhibited, it
is the proportionality factor.
It can be written :
Remark: One does NOT have to worry too much about the prior because as soon as the data comes in it is `swamped' in the following sense: Two people with divergent prior opinions but reasonably open-minded will be forced into arbitrarily close agreement about future observations by a sufficient amount of data. We will see an example of this later on.
When advance information is available the Bayesian method provides a routine way for updating uncertainty when new information comes in.
There are several steps to building a prior:
Suppose you prefer the latter.
Another type of thought experiment could be used to
build
for an
increasing sequence of
's.
This is not usually how priors are built though because it seems quite an exhaustive process to build up a whole density prior, instead we are going to use families of priors who have easy updating processes with regards to the specific likelihoods at hand.
Posterior odds = Prior odds
likelihood ratio.
The marginal distribution of Y is not exhibited, it
is the proportionality factor.
It can be written :
We see that an ``objective'' way of building priors for the binomial parameter was to use the `conjugate family' distribution that has the property that the updated distribution is in the same family.
A red billiard ball is then rolled n times under the same uniform assumption. r then denotes the number of times R goes less far than W went. Given X what inference can we make about p ?
Taken another way, we could have rolled n white balls first and then the red balls and looked hat how many balls there were before the red ball.
Or we could have rolled all the balls together and looked at where the red ball was, it is the a'th with probability 1/(n+1).
Let's say this again in our terminology:
We are looking for the posterior distribution
of
given X.
is a number between
and
The prior distribution of p is Uniform(0,1)=Beta(1,1).
For practical reasons, we define the precision as the inverse of
the variance: we denote by
and
The posterior mean is a weighted average of
the prior mean and the data, weights being proportional
to the respective precisions.
With a very gentle prior we would have a very low precision
, a very flat prior and mostly the posterior
is Normal with x as its mean.
Of course what we are usually interested in is the
posterior given an iid sample of size
, what you could
expect happens it is equivalent to adding one observation
from a distribution that has variance
.
In order to go further we need to extend what
we did before for the binomial and its Conjugate Prior to
the multinomial and the the Dirichlet Prior.
This is a probability distribution
on the
simplex
It is a
-dimensional version of the beta density.
The
Dirichlet has a parameter vector:
.
Throughout we write
.
is normalised to have total mass 1
the Dirichlet has density:
From a more practical point of view there are two simple procedures worth recalling here:
Each time, choose a color
with probability proportional
to the number of balls of that color
in the urn. If
is drawn, replace it along with
another ball of the same color.
There is not necessarily a random component in the
original problem that one wants to solve,usually
a problem for which there is no analytical solution.
An unknown parameter (deterministic) is expressed as a parameter
of some random distribution, that is then simulated.
The oldest well-known example is that of the stimation of
by Buffon, in his needle on the floorboards expeiment,
where supposing a needle of the same length as the width between
cracks we have:
There are certain desirable properties that we want estimators to have, consistency which ensures that as the sample size increases the estimates converges to the right answer is ensured here by the properties of Riemann sums. Other properties of interest are:
More precisely, the variance of crude Monte Carlo is
Simple description of a Markov chain
Chapter on Stochastic Simulation
Simple lesson on Markov chains
CpG islands, chapter 3 in DEKM.
Course notes:
Shamir's Course notes
CpG Island Applet Finder CpG Island Applet Finder
APPLICATION:imprinting.
imprinting
I showed an animation in class from: Good Applets and Didactic Material on all 3 algorithms
Underlying model is a graph with `Match', `delete' and `insert' states.
Regions of higher conservation are called functional domains, their resistance to change indicates they serve some critical function.
For those who already know about subsititution matrics (we will study next), the HMM model has more flexibility because it allows the transitions/transversion probabilities to change from position to position.
Downside: more parameters, have to have much more data.
Training is the first step, with training sequences
where the domains are already known.
Complete Tutorial by Rachel Karchin
Extensions:
Krogh's tutorial
Prediction of transmembrane helices
Two groups of amino acids whose differing
frequencies give information on the protein's
location are hydrophobic and hydrophilic.
Tutorial on Amino Acids
Slides by the inventor: Chris Burge, MIT
Jun Liu's overview of sequence analysis
The product multinomial model is
Continuous time Markov chains
A pdf version of the above maths
More Continuous-time Markov Chains
The HIV/SIV jump problem, when did it occur?
Did Modern Medicine Spread an Epidemic?
M. Singh - On Phylogeny(psfile)
Systematics and Phylogenetic Inference
An algorithmic approach to the ML tree
Amino Acid Evolution:
Codon substitution models, tests and consequences:
Ziheng Yang -psb 00- physicochemical properties aas
Modelling evolution at the Amino Acid Level- R. Goldstein
Software for Phylogeny:
Table of Methods for Studying Links between Variables:
| Techniques | Variables to explain | Explanatory Var. |
| Multiple Regression | 1 continuous | p continuous |
| Analysis of Variance | 1 continuous | p categorical |
| Analysis of Covariance | 1 continuous | p1 continuous |
| p2 categorical | ||
| Correspondence Analaysis | 1 categorical | 1 categorical |
| Canonical Correlation Analysis | q continuous | p continuous |
| Principal Components with | q continuous | p continuous |
| resepect to Instrumental Variables | ||
| Discriminant Analysis | 1 categorical | p continuous |
| Multidimensional Analysis | p continuous | p categorical |
| of Variance | ||
| Multidimensional Analysis | p continuous | p1 categorical |
| of Covariance | p2 continuous | |
| Regression Tree | 1 continuous | p1 continuous |
| p2 categorical | ||
| Classification Tree | 1 categorical | p1 continuous |
| p2 categorical | ||
| p2 categoricals |
Table of Methods for Representing Data:
| Techniques | Variables |
| Principal Components | p continuous |
| Multiple Correspondence Analysis | p categorical or |
| p categorical and q continuous | |
| Multidimensionnal Scaling | categoricals and continuous |
| Clustering (either hierarchical or not) | categorical |
Reading for next week, about micro-arrays: Anatomy of a Comparative Gene Expression Study
About Gene Expression read: Terry Speeds' lecturehttp://www.stat.Berkeley.EDU/users/terry/Classes/s246.2002/Week3/week3ab.pdf
Flash animation of DNA Microarray Methodology
Gene Expression Analysis and Genetic Network Modeling
Templates for looking at Gene expression
http://www.stat.Berkeley.EDU/users/terry/Classes/s246.2002/Week7/week7b.pdf
Analysis of Variance for Microarrays. MAANOVA: A Software Package for the Analysis of Spotted cDNA Microarray Experiments Discriminant Analysis