Chronology Current Month Current Thread Current Date
[Year List] [Month List (current year)] [Date Index] [Thread Index] [Thread Prev] [Thread Next] [Date Prev] [Date Next]

[Phys-L] polls and multinomials and zero-knowledge password proofs



Hi Folks --

I recently figured out something that had been bugging me
for years. You've seen lots of polls where they survey
1000 voters and then quote a "4%" statistical uncertainty
on the result. Have you ever wondered where that "4%"
comes from?

At first glance it seems completely bonkers. Once you
understand it, it's just mostly bonkers, not completely.

If you toss a coin 1000 times, the expected mean is 50%
with a statistical uncertainty of ± 1.6% ... which IMHO
is a lot less than 4%. The formula is simple: it's
σ = ½ √N
= ½ / √N × 100% of N

It is easy to rederive the formula; I can do it in my head
in less time than it takes to tell about it.

Hint: After the first coin toss, the coin is half
a unit above or below the average, so the variance
is obviously 0.5, and thereafter the variances just
add linearly, and the standard deviation is the
square root thereof.

Here's the sneaky bit: The pollsters are not quoting
the standard deviation; instead they are quoting the
*99% confidence* error bars. That's larger by a factor
of 2.756, as you can see here:

Interval Probability
± 0 σ 0.00 %
± 1 σ 68.27 %
± 2 σ 95.45 %
± 2.576 σ 99.00 %
± 3 σ 99.73 %

Multiplying 1.6 by 2.576 gives us 4.1 ... so we can write
σ = 1.6% (sigma) and Σ = 4.1% (big sigma).

So Σ = 4.1% is not completely bonkers. It means that
given a large (perhaps infinite) population, if you
draw a random sample of 1000 coins, the sample result
will be within 4.1% of the population result 99% of
the time.

HOWEVER, this is still mostly bonkers, for multiple
reasons.

1) Voters are not coins. Right now there are multiple
candidates polling in the single digits. Obviously you
cannot have somebody at 1% ± 4%, since probabilities
are always non-negative. In fact, when the population
probability is 1%, the sample uncertainty is σ = 0.3%
or Σ = 0.8% ... which is a lot smaller than what the
pollsters are quoting. The general formula is
σ = √(N p (1-p))
which is easy to remember and/or rederive.
I mean, seriously, what else could it be? It scales
like √N and gives the right answer a p=0, p=1, and
p=½ ... and it couldn't possibly be any simpler.

2) Voters are definitely not coins. The dominant
uncertainty in the poll is non-statistical.

2a) Some of this is due to voter behavior. They
change their minds from day to day, so there is no
abstract unchanging population from which to draw.

Also voters flat-out lie to pollsters.

2b) There are also systematic flaws in pollster
behavior. For example, if you conduct the poll by
phone, you are biased against people who screen
their calls and don't pick up for pollsters.

What's worse, if you conduct an online poll of
people who visit Igor's Rat Chow Emporium, you
are biased against the anti-rat faction.

And so on.

=========================================

Here's a bit about computational tactics.
Executive summary: Multinomial = conditional binomial.
This is sometimes very convenient.

First, some terminology:
-- A single toss of a 2-sided coin is called a
Bernoulli process.
-- The odds of seeing M heads (aka successes) in N
trials is called a binomial distribution.
-- A single roll of a multi-sided die is called a
multinoulli process or sometimes a categorical
process.
-- The odds of seeing M successes in N trials is
called a multinomial distribution.

To summarize the four variants:

__coin__ __die__

single toss: Bernoulli multinoulli

M out of N: binomial multinomial

Now suppose you want to do a Monte Carlo integral,
so you need to generate random samples that follow
a given distribution. That's easy to do for a one
dimensional distribution, but sometimes nasty for
multidimensional distributions. Some environments
(e.g. perl) provide a multinomial random generator,
from which the other three variants can be obtained
as obvious special cases. However, other environments
(e.g. spreadsheets) provide only the binomial variant.
So what to do?

Well, if you think about it, each candidate follows
a binomial distribution, namely that candidate against
the rest of the field. So you can easily generate
samples for that one candidate. Now here's the trick:
You can then handle candidate #2 by computing the
conditional probability for #2, conditioned on the
outcome you just saw for #1, using the /remaining/
probability applied to the /remaining/ part of the
sample, i.e. the ones that didn't vote for #1. And
so on, iteratively.

Once you know the trick you can derive the formulas
in less time than it takes to look them up, but if you
want to check your work you can look at page 12 here:
http://www3.jouy.inra.fr/miaj/public/nosdoc/rap2012-5.pdf

These are usable formal formulas for the probabilities,
although perhaps not optimal, because they tend to
obscure the symmetries of the situation, such as the
fact that the probabilities are independent of the
order in which they are calculated. But the formulas
give the right answer, and the Monte Carlo samples
don't care how they were calculated, since they're
just numbers.

By way of contrast: You cannot generate the random
samples just by treating each candidate as an
independent binomial distribution (competing against
the rest of the field) because there are tremendous
correlations. The probabilities need to sum to
unity. The /conditional/ binomials get this right,
but independent binomials would not.

======================================

As for passwords: BC is correct, as is Randall Munroe:
Trying to densely pack a lot of randomness into a short
password is a Bad Idea. It is left over from the days
when passwords were required to be short, which was
idiocy on top of stupidity.

It is a step in the right direction to pack a sufficient
amount of randomness into a long passphrase.

It is correct to view this as a human-interface issue,
not just a mathematical question concerning the amount
of randomness. The password needs to be easy for the
user to remember, but hard for the bad guys to guess.
Password security is serious business; if you don't
believe me, just ask President Hillary Clinton.

The discussion to this point is correct as far as it
goes, but leaves out several crucial issues.

For starters, consider the task of remembering passwords
for *multiple* sites. Computer passwords were introduced
back when you needed only one, and they do *not* scale
well to multiple sites.
-- It is a Bad Idea to use the same password for more
than one site. The problem is, the bad guys will
obtain your password by attacking a weak, low-value
site and then exploit it at high-value sites.
-- If you use a different password for each site, it
is a human-factors disaster. You'll never be able
to remember them all, especially for rarely-used
sites. The "password recovery" procedures are
either very expensive or very insecure.

Teachers are human-factors experts. It is our job to
understand what things will be remembered and what
will not. One of my favorite sayings is:
Utility is the best mnemonic.

That is, something that gets used all the time will
never be forgotten.

So..... The solution is to use a /keychain/ program
aka /password wallet/ program. The program stores all
your passwords. You just need to remember one master
password that unlocks the wallet.
-- This makes it easy to use different passwords for
different sites.
-- None of the sites every sees your master password.
-- This makes it easy to use a long, highly-random
password for each site, 20 or 30 randomly generated
characters. There is no incentive to shorten them,
since you will have have to type them. The wallet
program puts them on the clipboard for you, or
(better) stuffs them directly into the appropriate
field on the login form.
-- In particular, it's best if the wallet program
checks the address on the form before stuffing
the password, to make sure that a high-value
password cannot possibly be sent to a low-security
site.

All that is true as far as it goes; HOWEVER it is
mostly nonsense in the larger sense.

Sending passwords over the wire is an abomination.

Once you recognize the need for a wallet program,
you can with zero additional complexity replace
it with an /authentication agent/ which performs
computation, not just rote memory.

Specifically: There is such a thing as a zero-knowledge
proof. The site at the other end of the wire can
verify that you know the password, even though it
has has never seen the password. There is nothing
of yours the site needs to store or protect.

Code to implement zero-knowledge password proofs is
widely available. This should have been deployed a
long time ago. Excuse me while I go pound my head
on the desk.

See
https://en.wikipedia.org/wiki/Zero-knowledge_password_proof
and references therein.