
Re: [Phys-L] Bayesian statistics



On 09/03/2013 03:02 AM, Brian Blais wrote:
> do you have a reference for this method that I could read? Sounds very interesting.

Try this:
John S. Denker and Christopher J. C. Burges
"Image Segmentation and Recognition"
http://research.microsoft.com/en-us/um/people/cburges/papers/path-norm.pdf

It is reasonably easy to read, although a bit long (24 pages).

What that leaves out is any discussion of the prior art. I ran around
asking a bunch of experts what kind of probability was produced by a
neural network. The answer was "likelihood". When I asked why, there
wasn't a detailed answer. They thought I was asking a stupid question.
It was just "obvious" that all such things involved the likelihood, i.e.
the a_priori probability P(image|classification). When I pointed out
that you would really rather have the a_posteriori probability
P(classification|image) or -- even better! -- the joint probability
P(classification,image), they quoted Mick Jagger:
"You can't ... always get ... what you waa-ant...."
They pointed out that P(i|c) was /proportional/ to P(c|i) via the Bayes
inversion formula, and they got "state of the art" results using P(i|c).
The problem is, their "constant" of proportionality was not constant,
so they could say everything in the world was "proportional" to everything
else, and I was not persuaded that it was OK to substitute P(i|c) for
P(c|i). Again they quoted Jagger. They had algorithms for computing
the likelihood P(i|c) and they didn't have algorithms for P(c|i), so I
should please shut up and do what was doable.
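
Here is a toy numerical illustration of that last point (numbers invented,
not from the paper): the factor relating P(i|c) to P(c|i) is P(c)/P(i),
which changes from class to class whenever the prior P(c) is non-uniform,
so the maximum-likelihood class and the maximum-a_posteriori class need
not agree.

  # Toy example: two candidate classifications for one particular image i.
  classes = ["0", "6"]
  prior      = {"0": 0.95, "6": 0.05}   # P(c): "0" is far more common
  likelihood = {"0": 0.10, "6": 0.30}   # P(i|c) for this image

  # Bayes inversion: P(c|i) = P(i|c) P(c) / P(i), with P(i) = sum_c P(i|c) P(c)
  joint = {c: likelihood[c] * prior[c] for c in classes}   # P(i,c)
  p_i = sum(joint.values())                                # P(i)
  posterior = {c: joint[c] / p_i for c in classes}         # P(c|i)

  print(max(classes, key=likelihood.get))   # "6"  -- maximum likelihood
  print(max(classes, key=posterior.get))    # "0"  -- maximum a_posteriori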

Eventually I figured out the answer to my own question:
Learning is a search through the space of all possible probability measures.

This idea is quite liberating. You can /train/ the machine to produce
P(c|i) or P(i|c) or P(i,c) or whatever, if you feed it the appropriate
training signals.
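
A minimal sketch of what I mean by "appropriate training signals" (my own
toy example, not from the paper): minimizing cross-entropy against class
labels drives the output toward the a_posteriori probability P(c|i).
Training against a density-style target would estimate P(i|c) or P(i,c)
instead.

  import numpy as np
  rng = np.random.default_rng(0)

  # A tiny world with three discrete "images" and a known posterior P(c=1|i).
  true_posterior = np.array([0.1, 0.5, 0.9])
  i = rng.integers(0, 3, size=20000)                          # training images
  c = (rng.random(20000) < true_posterior[i]).astype(float)   # training labels

  w = np.zeros(3)                    # one free parameter per "image"
  sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
  counts = np.bincount(i, minlength=3)

  for _ in range(500):               # gradient descent on the cross-entropy loss
      p = sigmoid(w)
      grad = np.zeros(3)
      np.add.at(grad, i, p[i] - c)   # d(cross-entropy)/dw, summed per image
      w -= grad / counts             # per-image average, unit learning rate

  print(np.round(sigmoid(w), 2))     # approximately [0.1, 0.5, 0.9] = P(c=1|i)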

Note that the same idea applies to curve fitting and other forms of data
reduction. Least-squares fitting is maximum likelihood. AFAICT everything
in Bevington is maximum likelihood ... even though you would much rather
be using maximum a_posteriori.
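
To make that concrete (a sketch with invented numbers, not any particular
textbook problem): with Gaussian noise, maximizing the likelihood is exactly
minimizing the sum of squares, while putting a zero-mean Gaussian prior on
the coefficients and maximizing a_posteriori gives ridge regression, with
the ridge parameter equal to the noise-to-prior variance ratio.

  import numpy as np
  rng = np.random.default_rng(1)

  x = np.linspace(0, 1, 10)
  y = 2.0 * x + 0.5 + rng.normal(0, 0.3, size=x.shape)    # noisy straight line
  A = np.column_stack([x, np.ones_like(x)])               # design matrix

  # Maximum likelihood == ordinary least squares:
  beta_ml, *_ = np.linalg.lstsq(A, y, rcond=None)

  # Maximum a_posteriori with a Gaussian prior (variance tau^2) on the
  # coefficients and Gaussian noise (variance sigma^2) == ridge regression:
  sigma2, tau2 = 0.3**2, 1.0**2
  lam = sigma2 / tau2
  beta_map = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y)

  print(beta_ml)    # least-squares slope and intercept
  print(beta_map)   # MAP estimate, pulled slightly toward zero by the prior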

===================

Note that for pedagogical and psychological reasons I now prefer to call
it the set-theory approach rather than the measure-theory approach. Note
the contrast:
-- Measure theory is considered advanced. It is something people see
in upper-division undergraduate classes, if at all.
-- Set theory is taught in grade school these days
http://www.google.com/search?q=%22common+core%22+%22union%22+%22intersection%22+%22set%22

As pointed out by Apostol, to do probability you do not need all the heavy
machinery of measure theory; you need very little beyond elementary set
theory.
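
As a concrete sketch of that (my own toy example, not from Apostol): a
probability measure on a finite sample space is just an additive assignment
of numbers to subsets, and conditional probability is nothing beyond
intersection and division.

  from fractions import Fraction

  # Sample space: the 36 outcomes of rolling two dice, each equally weighted.
  omega = {(a, b) for a in range(1, 7) for b in range(1, 7)}
  weight = {w: Fraction(1, 36) for w in omega}

  def P(event):                  # measure of a subset of omega
      return sum(weight[w] for w in event)

  A = {w for w in omega if w[0] + w[1] == 7}   # "the total is 7"
  B = {w for w in omega if w[0] == 3}          # "the first die shows 3"

  print(P(A))              # 1/6
  print(P(A | B))          # P(A or B) = P(A) + P(B) - P(A and B) = 11/36
  print(P(A & B) / P(B))   # conditional probability P(A|B) = 1/6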

> It might be interesting for others to see some simple problems, where
> it makes a difference, done out in a couple different methods. That
> sort of thing really helps highlight the strengths of different
> methods.

Yeah. We're talking about
a) The frequentist interpretation
b) The Bayesian interpretation
c) The set-theory interpretation

At some level, these all produce the same fundamental formulas, such as
the definition of conditional probability, et cetera. So it may be that
a 100% logical Vulcan mathematician would see all of these as the same
thing. However, for the rest of us -- especially for teachers and
students -- interpretations matter. A bad mental model can create
treeemendous barriers to understanding the formulas, even if the formulas
are the same, independent of model, independent of interpretation.

I see this as somewhat akin to "interpretations" of quantum mechanics:
Copenhagen interpretation, Everett interpretation, et cetera. I say
"That which interprets least interprets best."
That harks back to Newton's dictum, "hypotheses non fingo" ... which
in turn expresses an idea put forth by Galileo, marking Day One of
modern science.

I like the set-theory approach because it adds the least amount of
cruft to the basic ideas. Specifically:
-- The frequentist approach is waaaay too dependent on the large-N
limit. This creates conceptual problems when dealing with small
systems, such as the statistical mechanics of a single biomolecule,
or the thermodynamics of a Szilárd engine, et cetera.
-- "Bayesian" means different things to different people, but all too
often, this approach treats as constant things that aren't really
constant. The frequentist approach tends to have this problem, too.
-- Both the frequentist approach and the Bayesian approach tend to
be dogmatic about the normalization, whereas set theory is not.
In theory, it is possible to work around this limitation, but it
increases the computational workload slightly, and increases the
conceptual / cognitive workload enormously. That is to say, encoding
an un-normalized probability within a normalized probability is
possible ... but it's unphysical, counterintuitive, and very hard
to interpret.
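
To illustrate the normalization point (a toy sketch of my own; the
Boltzmann-style weights are just an example, nothing from the thread): in
the set-theory picture you can carry an un-normalized measure around
indefinitely, because ratios and conditional probabilities never see the
normalization; you divide by the total measure only if and when you
actually want a normalized probability.

  import math

  # Un-normalized weights over three energy levels (Boltzmann factors, kT = 1).
  energy = {"ground": 0.0, "first": 1.0, "second": 2.0}
  mu = {state: math.exp(-E) for state, E in energy.items()}   # not normalized

  def M(event):                  # measure of a subset of the state space
      return sum(mu[s] for s in event)

  excited = {"first", "second"}
  print(M({"first"}) / M(excited))      # P(first | excited), no Z needed
  print(M({"ground"}) / M(mu.keys()))   # normalize only when you want P(ground)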

Set theory is quite abstract, not requiring any pre-Copernican epicycles
... but that does not prevent us from building nice mental models of it.
We can visualize it in terms of pie charts, histograms, et cetera.
http://www.av8n.com/physics/probability-intro.htm
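
A few lines of code suffice to draw such pictures (a sketch assuming
matplotlib is available; the three-outcome measure is made up):

  import matplotlib.pyplot as plt

  p = {"A": 0.5, "B": 0.3, "C": 0.2}                # a toy measure on three outcomes
  fig, (ax1, ax2) = plt.subplots(1, 2)
  ax1.pie(list(p.values()), labels=list(p.keys()))  # pie chart: area = probability
  ax2.bar(list(p.keys()), list(p.values()))         # histogram-style: height = probability
  plt.show()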

Last but not least: The modern (post-1933) definition of probability
does not depend on any prior notion of randomness. The idea of probability
without randomness may come as a shock to a physics professor who has
been trained on the frequentist approach (e.g. Feynman volume I chapter
6) ... but pedagogically speaking it's a feature not a bug, because
the typical incoming student does not have a reliable intuition about
randomness. So add this to the long list of things that can be explained
more easily to students than to teachers. The modern (post-1908) view
of spacetime WITHOUT time dilation and WITHOUT length contraction is
another topic on the same list.
http://www.av8n.com/physics/spacetime-welcome.htm