Bayes' theorem
Bayes' theorem is a result in probability theory, named after the Reverend Thomas Bayes, who proved a special case of it in the 18th century. It is used in statistical inference to update estimates of the probability that different hypotheses are true, based on observations and a knowledge of how likely those observations are, given each hypothesis. Its discrete version may appear to go little beyond an identity that is sometimes taken to be the definition of conditional probability, but there is also a continuous version. A frequent error is to think that reliance on Bayes' theorem is the essence of Bayesianism, whose essence is actually the degree-of-belief interpretation of probability, contrasted with various "frequency" interpretations.
We will start with the simplest case of only two hypotheses, H1 and H2. Suppose that we know that precisely one of the two hypotheses must be true, and suppose furthermore that we know their "prior" probabilities P(H1) and P(H2) = 1 - P(H1). Now some "data" D is observed, and we know the conditional probabilities of D given H1 and H2, written as P(D | H1) and P(D | H2). We want to compute the "posterior" probabilities of H1 and H2, given the observation of D. Bayes' theorem states that these probabilities can be computed as
    P(H1 | D) = c P(D | H1) P(H1)
    P(H2 | D) = c P(D | H2) P(H2)
where the constant c has to be chosen so that the sum of the two probabilities is 1, i.e.
    c = 1 / [P(D | H1) P(H1) + P(D | H2) P(H2)]
This theorem is a simple consequence of the definition of conditional probabilities.
A worked example
To illustrate, suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Somebody randomly picks a bowl, and then randomly picks a cookie. The cookie turns out to be a plain one. How likely is it that he picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than 50%, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. H1 corresponds to bowl #1, and H2 to bowl #2. Since the bowl was picked randomly, we know P(H1) = P(H2) = 50%. The "data" D consists in the observation of a plain cookie. From the contents of the bowls, we know that P(D | H1) = 30/40 = 75% and P(D | H2) = 20/40 = 50%. Bayes' formula then yields
    P(H1 | D) = P(D | H1) P(H1) / [P(D | H1) P(H1) + P(D | H2) P(H2)]
              = (75% x 50%) / (75% x 50% + 50% x 50%)
              = 37.5% / 62.5%
              = 60%
Initially, we estimated that he would pick bowl #1 with 50% probability, but after observing the plain cookie, we adjust our estimate to 60%.
The theorem is also true if we have more than just two hypotheses, say
H1, H2, H3, ..., of which precisely one is true. Suppose we know the prior probability distribution
    P(Hi), i = 1, 2, 3, ...
as well as the likelihood function
    P(D | Hi), i = 1, 2, 3, ...
Then the posterior probability distribution
    P(Hi | D), i = 1, 2, 3, ...
can be found by multiplying the prior probability distribution by the
likelihood function and then normalizing, so that we have
    P(Hi | D) = c P(D | Hi) P(Hi)
Here again the constant c must be chosen so that the posterior
probabilities sum to 1, i.e.
    c = 1 / [P(D | H1) P(H1) + P(D | H2) P(H2) + P(D | H3) P(H3) + ...]
The continuous case of Bayes' theorem also says the posterior distribution
results from multiplying the prior by the likelihood and then normalizing.
The prior and posterior distributions are usually identified with their
probability density functions.
For example, suppose the proportion of voters who will vote "yes" is
an unknown number p between 0 and 1. A sample of n voters is
drawn randomly from the population, and it is observed that x of
those n voters will vote "yes". The likelihood function is then
    L(p) = [constant] p^x (1 - p)^(n - x)
Multiplying that by the prior probability density function of p
and then normalizing gives the posterior probability distribution
of p, and thus updates probabilities in light of the new data
given by the opinion poll. Thus if the prior probability distribution
of p is uniform on the interval [0,1], then the posterior
probability distribution would have a density of the form
    f(p | x) = [constant] p^x (1 - p)^(n - x)
and this "constant" would be different from the one that appears
in the likelihood function.
Bayesianism
Bayesianism is the philosophical tenet that the rules of mathematical
probability apply not only when probabilities are relative frequencies
assigned to random events, but also when they are degrees of belief
assigned to uncertain propositions. Updating these degrees of belief in light of new evidence almost invariably involves application of Bayes' theorem.
See also:
Justification for Bayes' theorem
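As a rough numerical check on the two examples in this article, the "multiply the prior by the likelihood, then normalize" recipe can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the theorem's statement: the function names posterior and uniform_prior_posterior_density are made up here, and the poll figures (7 "yes" answers out of 10) are invented for illustration. The sketch also relies on the standard fact that, under a uniform prior, the normalizing constant of the posterior density is (n + 1) times the binomial coefficient "n choose x".

```python
from fractions import Fraction
from math import comb


def posterior(priors, likelihoods):
    """Discrete case: multiply each prior by its likelihood, then normalize."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)  # equals 1/c in the article's notation
    return [u / total for u in unnormalized]


def uniform_prior_posterior_density(p, n, x):
    """Continuous case: posterior density of p under a uniform prior on [0, 1].

    Assumes x "yes" answers among n sampled voters. The constant
    (n + 1) * comb(n, x) makes the density integrate to 1.
    """
    return (n + 1) * comb(n, x) * p**x * (1 - p) ** (n - x)


# Cookie example: equal priors for the two bowls, plain cookie observed.
cookie_posterior = posterior(
    priors=[Fraction(1, 2), Fraction(1, 2)],
    likelihoods=[Fraction(30, 40), Fraction(20, 40)],
)
print(cookie_posterior)  # [Fraction(3, 5), Fraction(2, 5)], i.e. 60% and 40%

# Hypothetical poll (numbers invented for illustration): 7 "yes" out of 10.
n, x = 10, 7
steps = 10_000
# Midpoint-rule approximation of the integral of the density over [0, 1].
area = sum(
    uniform_prior_posterior_density((k + 0.5) / steps, n, x) for k in range(steps)
) / steps
print(area)  # should be very close to 1
```

Exact fractions are used for the cookie example so that the 60% result comes out without rounding error; the continuous example uses ordinary floating-point numbers, so its integral is only approximately 1.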