
Re: [Phys-L] Bernoulli process; analysis using Monte Carlo, SVD, and ternary plots



On 05/30/2015 01:16 AM, I wrote:

in this case we have a /generalized/ Bernoulli process

Some folks have expressed [off list] interest in this.
It's analogous to situations that show up in physics and
chemistry all the time. So, let me explain some additional
ways of looking at it. That is, today we will do the data
analysis in a more professional way. Also we can consider
a slightly simpler example.

Suppose there are two candidates, X and Y. Consider a
scenario where some oracle has told us a_priori that
exactly 49% of the population favors X, and 49% favors
Y. The remaining 2% we call Z, no matter whether that
represents a third candidate or undecided or "none of
the above" or whatever.

We now sample this distribution. Each /sample/ consists
of asking N=50 people at random. Since this is a rather
small sample, there will be substantial "sampling error".

We have three observable variables, namely the sampled
X, sampled Y, and sampled Z. However, they span a space
with only two dimensions, because there is a constraint:
No matter what, X+Y+Z has to add up to 100%.

There are powerful techniques for visualizing three variables
in two dimensions. A good place to start is a ternary plot,
as explained in general here:
https://www.av8n.com/physics/axes.htm#sec-ternary
[As always, if you don't want to bother with security, you
can use http: instead of https:.]
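
In case it helps, converting a composition [X, Y, Z] (with
X+Y+Z = 1) into 2D plotting coordinates is just barycentric
interpolation between the three corners of the triangle. Here
is a minimal sketch in C++, assuming the corner placement used
in the plot described below (100% Y at lower-left, 100% X at
lower-right, 100% Z at top); the function name is mine:

  #include <cmath>
  #include <utility>

  // Map a composition (x, y, z), with x + y + z == 1, to plane
  // coordinates by barycentric interpolation between the corners:
  // 100% Y at (0, 0), 100% X at (1, 0), 100% Z at (0.5, sqrt(3)/2).
  std::pair<double,double> ternary_point(double x, double y, double z)
  {
      double px = x*1.0 + y*0.0 + z*0.5;
      double py = z*(std::sqrt(3.0)/2.0);
      return {px, py};
  }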

I did a Monte Carlo simulation of the XYZ scenario and
made a scatter plot of the results. Here it is:
https://www.av8n.com/physics/img48/bernoulli-xyz.png

We can learn a tremendous amount from this plot.
-- Contours of constant X are shown in red, and run
in a southwest-northeast direction. 100% X is the
lower-right (southeast) corner.
-- Contours of constant Y are shown in blue and run
in a northwest-southeast direction. 100% Y is the
lower-left (southwest) corner.
-- Contours of constant Z are shown in green and run
in an east-west direction. 100% Z is the top corner.

As expected, the data points are most densely clustered
in the vicinity of [X, Y, Z] = [.49, .49, .02].

Pollsters have a nasty habit of saying that their numbers
have a "sampling error" equal to 1/√N (where N is the
sample size). In our example, 1/√50 = 14%. Yesterday I
briefly mentioned that this was a problem; today I will
give some constructive suggestions on how to do better.

Looking at the ternary plot, we know that 14% uncertainty
is just ridiculous. Each little triangle in the plot is
10 percentage points on a side.
*) If we move in an easterly direction, increasing X at the
expense of Y along a contour of constant Z, the distribution
looks "roughly" Gaussian. The standard deviation in the
east-west direction is easy to compute (see the check just
after this list); it is noticeably less than 14%, but not
too wildly less.
*) If we move in a northerly direction, increasing Z at
the expense of both X and Y equally, we see all sorts
of ugly stuff:
-- The width in this direction is very much smaller
than the width in the perpendicular direction.
14% is a wild overestimate of the uncertainty in
this direction.
-- The distribution is nowhere near Gaussian. It is
nowhere near symmetric. We can move a few percent
towards the north and still see nonzero probability
density, but we cannot possibly move away from the
peak more than 2% towards the south without busting
the envelope, i.e. violating the constraint that Z
has to be positive. In other words, we have lopsided
error bars in the ±Z direction.
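
To back up the first bullet item: the standard multinomial
formulas say Var(X) = pX(1-pX)/N and Cov(X,Y) = -pX·pY/N.
Moving east along a contour of constant Z means ΔX = -ΔY,
i.e. moving along the unit vector (1, -1, 0)/√2, so the
relevant coordinate is s = (X - Y)/√2 and

  Var(s) = [Var(X) + Var(Y) - 2 Cov(X,Y)] / 2
         = [0.49·0.51 + 0.49·0.51 + 2·0.49·0.49] / (2·50)
         ≈ 0.0098

so the standard deviation in the east-west direction is about
√0.0098 ≈ 9.9% ... which agrees with the SVD result below.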

It may be that you can afford to run the poll in the field
only one time. That corresponds to just one dot in the
aforementioned scatter plot. Constructive suggestion:
Use Monte Carlo to simulate the experiment. That gives
you lots of dots, at a very low cost. Without that, it's
next to impossible to figure out the statistics.
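
In case anybody wants to see the idea in miniature, here is a
sketch in C++ using only the standard <random> library. This
is not the bernoulli-nway.c code mentioned at the end; it just
illustrates one way to simulate the poll many times:

  #include <cstdio>
  #include <random>

  int main()
  {
      const int N = 50;          // respondents per poll
      const int trials = 10000;  // number of simulated polls
      std::mt19937 rng(42);
      // a-priori population fractions for X, Y, Z
      std::discrete_distribution<int> pick({0.49, 0.49, 0.02});

      for (int t = 0; t < trials; t++) {
          int count[3] = {0, 0, 0};
          for (int i = 0; i < N; i++) count[pick(rng)]++;
          // the sampled fractions for this poll = one dot on the plot
          std::printf("%g %g %g\n", count[0]/double(N),
                      count[1]/double(N), count[2]/double(N));
      }
      return 0;
  }

Each line of output is one simulated poll; feed the whole thing
to your favorite plotting program (or the ternary plot above).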

Not-very-constructive warning: If you think in terms of
"error bars on X" and "error bars on Y" you are guaranteed
to get the wrong answer ... and you won't even know how
grossly wrong it is.

Marginally-constructive suggestion: Look at the covariance
matrix. This is the /minimum/ you must do if you want to
have any semblance of professionalism. The formula for a
multi-dimensional Gaussian can be written

dP(R) = exp(- ΔR^T Σ^-1 ΔR/2) [1]

where R is the vector
      [ X ]
  R = [ Y ]                  [2]
      [ Z ]

and Σ is the covariance matrix. R^T is the transpose of R.
Σ^-1 is the matrix inverse of Σ.

Equation [1] is the natural generalization of the familiar
expression for a one-dimensional Gaussian:
dP(x) = exp(- Δx^2 / σ^2 / 2) [3]

In simple cases, the covariance matrix will be more-or-less
diagonal using whatever variables you have chosen. Then each
diagonal element is the variance i.e. σ^2 for the corresponding
variable. In such a case, congratulations; you can continue
using those variables without too much hassle. However, alas,
the problems that land on my desk are almost never simple.
In the present case, the covariance matrix is a mess. There are
tremendous off-diagonal elements. These indicate correlations.
Indeed, the correlations are so bad that the matrix is singular,
and Σ^-1 does not even exist, strictly speaking.
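
For what it's worth, getting Σ from the Monte Carlo samples is
a one-liner if you keep the samples in a matrix. A sketch using
armadillo (the variable names are mine), with one row per
simulated poll and one column per variable:

  #include <armadillo>

  // S: one row per simulated poll, columns = [X, Y, Z] fractions
  arma::mat covariance_of(const arma::mat& S)
  {
      return arma::cov(S);   // rows are observations, columns are variables
  }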

I said that looking at the covariance matrix is "marginally"
constructive because (by itself) it provides little more than
a warning. It's like a sign that says you are in the middle
of a minefield. You know you've got a big problem, but can't
solve it without additional skill and effort.

SVD to the rescue. Singular Value Decomposition offers you
a way out of the minefield.

SVD is one of those things (rather like a Fourier transform)
that you cannot easily do by hand ... but you can compute it,
and then verify that it is correct, and then understand a lot
of things by looking at it. In particular, given the SVD, it
is trivial to invert the matrix; keep the same eigenvectors,
and take the arithmetical reciprocal of the eigenvalues.
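
Here is a sketch of that step using armadillo. Since Σ is
symmetric, eig_sym does the job; the (pseudo)inverse comes from
reciprocating the nonzero eigenvalues. The tolerance below is an
arbitrary choice of mine, just to catch the roundoff-level zero:

  #include <armadillo>

  // Sigma = eigvec * diagmat(eigval) * eigvec.t()
  // Pseudo-inverse: keep the eigenvectors, reciprocate the
  // eigenvalues that are not (numerically) zero.
  arma::mat pseudo_inverse(const arma::mat& Sigma, double tol = 1e-10)
  {
      arma::vec eigval;
      arma::mat eigvec;
      arma::eig_sym(eigval, eigvec, Sigma);

      arma::vec inv_eigval(eigval.n_elem);
      for (arma::uword i = 0; i < eigval.n_elem; i++)
          inv_eigval(i) = (eigval(i) > tol) ? 1.0/eigval(i) : 0.0;

      return eigvec * arma::diagmat(inv_eigval) * eigvec.t();
  }

For a symmetric matrix such as Σ, arma::svd(U, s, V, Sigma) gives
the same information, with the singular values playing the role
of the eigenvalues.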

Specifically, SVD will give you the eigenvectors and the
corresponding eigenvalues of the covariance matrix Σ. In
our example, the raw covariance matrix is:

0.004991, -0.004797, -0.0001944,
-0.004797, 0.004995, -0.0001985,
-0.0001944, -0.0001985, 0.0003929,

If you unwisely ignore the off-diagonal elements and take the
diagonal elements (i.e. the variances) to be the squares of
the error bars, you get
0.07065, 0.07068, 0.01982,

For two of the variables we (allegedly) have 7% uncertainty.
This is half of what pollsters conventionally quote. Beware
that 7% is wrong, as discussed below. Meanwhile 14% is
conceptually wrong, for a different reason ... unless you
are willing to make three mistakes that cancel each other
out. As a separate matter, note that the alleged uncertainty
on Z is only 2%. That's not as crazy as 14%, but it's not
exactly right, either.

The eigenvectors of Σ are:
-0.7069, -0.4085, 0.5774,
0.7073, -0.408, 0.5774,
-0.0003135, 0.8165, 0.5774,
and the corresponding eigenvalues are
0.00979, 0.0005894, 1.275e-14,

The third eigenvalue is actually zero. Because of roundoff
errors it is represented here by a super-tiny floating-point
number. The zero eigenvalue corresponds to a zero-length
error bar in a certain direction. It tells us that if we change
X by itself, keeping Y and Z constant, we violate the constraint
that X+Y+Z must add up to 1. Ditto for changing Y or Z by
itself. This corresponds to moving in the third dimension
in the ternary plot, i.e. leaving the plane of the paper.
It is not allowed. In accordance with equation [3], any
movement at all in the direction of a zero-length error bar
will cause the probability to vanish.

The first eigenvector is the "cheap" one. You know this
because it corresponds to the large eigenvalue of Σ (and hence
the small eigenvalue of Σ^-1). To a good approximation, it
represents increasing X at the expense of Y (and/or vice versa)
along a contour of constant Z. You know this by looking at
the eigenvector components in the first column. SVD is telling
us a lot of useful stuff.

Taking the square root of the eigenvalue, we find that the
error bar is 9.9% in the cheap direction ... not 7% and not
14% but smack in between (geometric average). Meanwhile
the calculation suggests an error bar of 2.4% in the other
direction, i.e. increasing Z at the expense of X and Y
equally. Alas, this is not as useful as one might have
hoped, because the distribution is grossly non-Gaussian in
this direction. Starting from the peak of the distribution,
we can move several times 2.4% toward the north, but we
cannot move even one times 2.4% toward the south.

Overall, we can get a fairly decent handle on what's going
on using a combination of Monte Carlo, ternary plots, and
Singular Value Decomposition.

==============

Here's how to understand where the pollsters get their
bogus 14% number.
a) If we ignore the constraint on X+Y+Z, we make one
mistake.
b) If we ignore the correlations between X and Y, i.e.
ignore the fact that the blob in the scatter plot is
elongated in one direction and not the other, we
make another mistake.
c) If we use a bogus value for the so-called "sampling
error" we make a third mistake.

Now it turns out that if we combine these three mistakes
and sprinkle fairy dust on them in just the right way,
we can /sometimes/ obtain a numerically correct answer.

Specifically: Suppose we ignore the constraint and
imagine that we can move the X variable by itself.
We compute a so-called probability using the bogus
formula

dP(X) = exp(- ΔX^2 / σ^2 / B^2 / 2) [4]

where σ is the genuine uncertainty (10%) and B is
a measure of how bogus the formula is.

We then write a similar formula in the Y direction.

dP(Y) = exp(- ΔY^2 / σ^2 / B^2 / 2) [5]

To get the joint probability of moving X and Y at
the same time, we multiply the two probabilities.
This is where we make the mistake of ignoring the
correlations. If we correctly accounted for the
correlations, we would use either equation [4] or
equation [5] ... not the product of the two. That's
because the /conditional/ probability that Y moves
when X moves is nearly 100%.

The product is off by a factor of 2 inside the
exponent. We can make that disappear by choosing
B^2 = 2. So the fake error bar is σB which is √2
longer than the real σ. Also the real σ is √2 longer than
what you would expect from a one-dimensional sampling
error, because the blob in the scatter plot is rotated
45 degrees relative to the "natural" variables X and Y.
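
Plugging in the numbers from our example: the genuine
one-dimensional sampling error is √(0.49·0.51/50) ≈ 7%; the
real σ in the cheap direction is √2 longer, about 10%; and the
fake error bar σB is another √2 longer, about 14% ... which is
exactly the bogus 1/√N figure.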

To summarize, the pollsters make two mistakes of
a factor of √2 apiece, plus a third mistake that
is a factor of 2 in the other direction. These
mistakes cancel sometimes, but only SOMETIMES.

In particular, if you think it is OK to treat X and
Y as independent, you could just as easily increase
X and Y together -- rather than increasing one
while decreasing the other -- and the mistake in
this case would be a factor of infinity. Fudging
the "sampling error" by a factor of 2 will not
fix this. Not even close.

On top of all that, the 14% error bar is just ridiculous
when applied to the Z variable, or to any variable that
is not near 50%, as discussed yesterday.

Nitpicky note: In the scatter plot, the sample size
is N=50, so all the sampled data points X, Y, and Z
must occur at integer multiples of 0.02. However, I
post-processed them to smear them out by a tiny random
amount, to make them easier to see when many are piled
on top of each other. Sometimes the less-faithful
representation is more informative.

In summary:

A) As the facetious proverb says: When all else
fails, LOOK AT THE DATA.

Professional statisticians spend a *lot* of time just
staring at data. They cover the walls of their office
with plots. If the first representation doesn't tell
the tale, they construct another, and another, and
another.

As a corollary: If you have three variables with a
constraint, a ternary plot might be just the ticket.
It takes a few hours of staring at such things before
you get good at interpreting them ... so don't panic
if it's not 100% clear at first sight.

B) You can do a lot with Monte Carlo. It's not a
substitute for thinking ... but Monte Carlo /plus/
thinking is quite a bit better than either one
separately.

C) You can do a lot with SVD. A matrix in its undecomposed
form is next to impossible to understand (unless it
happens to be already diagonal) ... especially in more
than two dimensions. SVD will diagonalize it for you.
It's astonishing how many postdocs there are running
around, with physics degrees from Big Name institutions,
who have no clue about SVD.
-- Bevington _Data Reduction_ does not mention it.
-- Taylor _Error Analysis_ does not mention it.

Possibly constructive suggestion: I have not done a
systematic survey, but I've had good luck with the
"armadillo" package. It provides a thin layer of
nice C++ bindings on top of the LAPACK linear algebra
package. If anybody wants to play with the code I
used for this little project, it's at
https://www.av8n.com/physics/bernoulli-nway.c
https://www.av8n.com/physics/ternary-plot.gnumeric
or probably mostly equivalently
https://www.av8n.com/physics/ternary-plot.xls
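
If you just want to poke at the numbers without downloading
anything, here is a minimal self-contained armadillo example
(on a typical Linux box, compile with: g++ foo.cpp -larmadillo).
It feeds the raw covariance matrix quoted above into SVD; for a
symmetric positive-semidefinite matrix like this one, the
singular values coincide with the eigenvalues:

  #include <armadillo>

  int main()
  {
      // the raw covariance matrix from the example above
      arma::mat Sigma = {{ 0.004991,  -0.004797,  -0.0001944},
                         {-0.004797,   0.004995,  -0.0001985},
                         {-0.0001944, -0.0001985,  0.0003929}};
      arma::mat U, V;
      arma::vec s;
      arma::svd(U, s, V, Sigma);
      U.print("singular vectors (columns):");
      s.print("singular values:");
      return 0;
  }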