
# Re: [Phys-L] Bayesian statistics

• From: John Denker <jsd@av8n.com>
• Date: Mon, 02 Sep 2013 18:54:24 -0700

On 09/02/2013 12:07 PM, Brian Blais wrote:

> ... wondering what you think of E. T. Jaynes' approach to Bayesian
> inference. He does not make use of set-theoretic definitions, but
> in my reading of him, he does seem to admit that these have
> identical consequences in applications.
>
> 1) do you agree?

In general I don't like terms like "Bayesian" or "Darwinian".
By way of analogy: I make good use of Newton's laws, but does
that make me a "Newtonian"? I hope not. Am I required to accept
everything that has been said by Newton, or about Newton? I hope
not.

As previously mentioned: "Bayesian" means different things to
different people. Some people consider Jaynes to be an orthodox
Bayesian, and some don't ... and I don't care. That's because
I like to pick and choose, on an idea-by-idea basis, from any
school of thought. I latch onto the good ideas and ignore the
rest.

Set theory is powerful enough to include everything you want to
do. Conventional statistics can be understood as a subset, as
a corollary. Interestingly, the converse is not true, AFAICT.
That is to say, the set-theory approach is *not* a subset of the
other approaches. This is covered by the following questions:

> 2) do you find some *quantitative* improvement using the
> set-theoretic definitions. I mean, is there an actual problem where
> one method works and the other not.
>
> 3) is there some *practical* improvement using the set-theoretic
> definitions. I mean, are there problems that are much easier to
> solve, even if both methods yield the same result in the end?

The answer to (2) and (3) is the same.

Once upon a time, I was at Bell Labs, working on a pattern recognition
problem, using a machine-learning approach. Note that speech synthesis
and speech recognition were a core activity of Bell Labs, and had been
all along, ever since the days of Aleck Bell. There were dozens of guys
working on this. Seriously smart guys. They knew literally every trick
in the book, and indeed several of them had /written/ books on the subject.
My boss's boss's boss was one of the leaders in this field.

The effort included very fundamental research and very applied development.
The development guys had a huuuge software system that was doing OCR with
a 2% error rate, which was the same as people could do on the same data
set, so this was considered quite an achievement.

There were also a bunch of seriously smart guys at various other institutions,
working on similar problems.

I don't want to go into details, but one symptom was expressed by John
von Neumann, who was not the village idiot: "With four parameters I can
fit an elephant, and with five I can make him wiggle his trunk."

Then word got out that my buddies and I were fitting 100,000 adjustable
parameters, with good results.

As a separate matter, there was an issue with "maximum likelihood"
learning in general, including curve fitting in particular. Many
standard texts assumed this was the right thing to do ... sometimes
tacitly, sometimes in a brief footnote. Those who suspected it was
not the right thing to do did it anyway! That's because the alternatives
were considered impossible or ridiculously impractical.

Then word got out that I had a scheme to learn maximum a posteriori (MAP),
not maximum likelihood. This is P(a|b) instead of P(b|a). The statistics
research guys did not believe this was possible. The development guys were
skeptical, but after much inveigling and cajoling they tried my idea, and
the error rate went down from 2% to 0.2%.
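
To make the P(a|b) versus P(b|a) distinction concrete, here is a minimal
sketch using a toy coin-flip problem. This is not the scheme described
above; it is only the textbook contrast between the two estimates. It
assumes that "a" stands for the hypothesis (the coin's bias) and "b" for
the observed data, and it assumes a Beta(2, 2) prior. The function names,
the prior, and the data are all made up for illustration.

```python
from fractions import Fraction


def ml_estimate(heads, flips):
    """Maximum likelihood: the bias that maximizes P(data | bias)."""
    return Fraction(heads, flips)


def map_estimate(heads, flips, alpha=2, beta=2):
    """Maximum a posteriori: the bias that maximizes P(bias | data).

    Assumes a Beta(alpha, beta) prior on the bias (an illustrative choice).
    The posterior is then Beta(heads + alpha, tails + beta), and this
    returns its mode, which is valid when both posterior parameters
    exceed 1.
    """
    return Fraction(heads + alpha - 1, flips + alpha + beta - 2)


# Three flips, three heads: very little data.
heads, flips = 3, 3
print(ml_estimate(heads, flips))   # 1    (claims the coin always lands heads)
print(map_estimate(heads, flips))  # 4/5  (the prior tempers the conclusion)
```

With a flat prior (alpha = beta = 1) the two estimates coincide, so in this
toy setting maximum likelihood is the special case of MAP in which the prior
is ignored.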

It may be that these solutions "could" have been found using old-school
Bayesian and/or frequentist methods ... or maybe not. I don't know. I
know for sure that a bunch of smart, highly-motivated people tried for
many years without success. Also, there is a good reason why you would
/expect/ the set-theoretic approach to work better. The punch line is
that machine learning should be considered a search through the space
of all possible probability measures ... so it really helps to use an
approach where the measure itself is a central focus of attention. To
say the same thing the other way, if you have an approach that focuses
attention on "the" truth and/or "the" state of nature, you are never
going to discover the solution, not in a million years. Indeed, I
have told this story to world-famous statisticians who not only didn't
discover the solution, but didn't even believe it after I told them