Chronology Current Month Current Thread Current Date
[Year List] [Month List (current year)] [Date Index] [Thread Index] [Thread Prev] [Thread Next] [Date Prev] [Date Next]

Re: [Phys-L] polls and multinomials and zero-knowledge password proofs



“… since probabilities are always non-negative. In fact, when the population probability is 1%, the sample uncertainty is σ = 0.3% or Σ = 0.8% ... which is a lot smaller than what the pollsters are quoting."



Which reminds me of: https://ui.adsabs.harvard.edu/abs/1994PhRvA..49.1562S/abstract


The only poll that matters is the one held on or about election day.

Also voters flat-out lie to pollsters.

anyone determined the proportion of exit poll liars?


bc thinks maybe the degree of agreement w/ the exit poll and the result.

p.s. Here’s the UK version https://www.bbc.com/news/election-2019-50716626 A US evaluation: https://academic.oup.com/poq/article/82/1/1/4837043

On 2019/Dec/05, at 10:41, Marx, David via Phys-l <phys-l@mail.phys-l.org> wrote:

Thanks, JD, for the exposition on polling. The most useful stuff is at the end of your post, which most people here probably know. Nevertheless, polls are used to sway public opinion as the public is mostly ignorant of such facts. The only poll that matters is the one held on or about election day. The others are mostly worthless.



-----Original Message-----
From: Phys-l <phys-l-bounces@mail.phys-l.org> On Behalf Of John Denker via Phys-l
Sent: Thursday, December 5, 2019 12:31 PM
To: Phys-L@Phys-L.org
Cc: John Denker <jsd@av8n.com>; Bryan Mumford <bryan@bmumford.com>; nancyseese@redshift.com
Subject: [Phys-L] polls and multinomials and zero-knowledge password proofs

[This message came from an external source. If suspicious, report to abuse@ilstu.edu<mailto:abuse@ilstu.edu>]

Hi Folks --

I recently figured out something that had been bugging me for years. You've seen lots of polls where they survey
1000 voters and then quote a "4%" statistical uncertainty on the result. Have you ever wondered where that "4%"
comes from?

At first glance it seems completely bonkers. Once you understand it, it's just mostly bonkers, not completely.

If you toss a coin 1000 times, the expected mean is 50% with a statistical uncertainty of ± 1.6% ... which IMHO is a lot less than 4%. The formula is simple: it's
σ = ½ √N
= ½ / √N × 100% of N

It is easy to the formula; I can do it in my head in less time than it takes to tell about it.

Hint: After the first coin toss, the coin is half
a unit above or below the average, so the variance
is obviously 0.5, and thereafter the variances just
add linearly, and the standard deviation is the
square root thereof.

Here's the sneaky bit: The pollsters are not quoting the standard deviation; instead they are quoting the *99% confidence* error bars. That's larger by a factor of 2.756, as you can see here:

Interval Probability
± 0 σ 0.00 %
± 1 σ 68.27 %
± 2 σ 95.45 %
± 2.576 σ 99.00 %
± 3 σ 99.73 %

Multiplying 1.6 by 2.576 gives us 4.1 ... so we can write σ = 1.6% (sigma) and Σ = 4.1% (big sigma).

So Σ = 4.1% is not completely bonkers. It means that given a large (perhaps infinite) population, if you draw a random sample of 1000 coins, the sample result will be within 4.1% of the population result 99% of the time.

HOWEVER, this is still mostly bonkers, for multiple reasons.

1) Voters are not coins. Right now there are multiple candidates polling in the single digits. Obviously you cannot have somebody at 1% ± 4%, since probabilities are always non-negative. In fact, when the population probability is 1%, the sample uncertainty is σ = 0.3% or Σ = 0.8% ... which is a lot smaller than what the pollsters are quoting. The general formula is
σ = √(N p (1-p))
which is easy to remember and/or rederive.
I mean, seriously, what else could it be? It scales like √N and gives the right answer a p=0, p=1, and p=½ ... and it couldn't possibly be any simpler.

2) Voters are definitely not coins. The dominant uncertainty in the poll is non-statistical.

2a) Some of this is due to voter behavior. They change their minds from day to day, so there is no abstract unchanging population from which to draw.

Also voters flat-out lie to pollsters.

2b) There are also systematic flaws in pollster behavior. For example, if you conduct the poll by phone, you are biased against people who screen their calls and don't pick up for pollsters.

What's worse, if you conduct an online poll of people who visit Igor's Rat Chow Emporium, you are biased against the anti-rat faction.

And so on.

=========================================

Here's a bit about computational tactics.
Executive summary: Multinomial = conditional binomial.
This is sometimes very convenient.

First, some terminology:
-- A single toss of a 2-sided coin is called a
Bernoulli process.
-- The odds of seeing M heads (aka successes) in N
trials is called a binomial distribution.
-- A single roll of a multi-sided die is called a
multinoulli process or sometimes a categorical
process.
-- The odds of seeing M successes in N trials is
called a multinomial distribution.

To summarize the four variants:

__coin__ __die__

single toss: Bernoulli multinoulli

M out of N: binomial multinomial

Now suppose you want to do a Monte Carlo integral, so you need to generate random samples that follow a given distribution. That's easy to do for a one dimensional distribution, but sometimes nasty for multidimensional distributions. Some environments (e.g. perl) provide a multinomial random generator, from which the other three variants can be obtained as obvious special cases. However, other environments (e.g. spreadsheets) provide only the binomial variant.
So what to do?

Well, if you think about it, each candidate follows a binomial distribution, namely that candidate against the rest of the field. So you can easily generate samples for that one candidate. Now here's the trick:
You can then handle candidate #2 by computing the conditional probability for #2, conditioned on the outcome you just saw for #1, using the /remaining/ probability applied to the /remaining/ part of the sample, i.e. the ones that didn't vote for #1. And so on, iteratively.

Once you know the trick you can derive the formulas in less time than it takes to look them up, but if you want to check your work you can look at page 12 here:
http://www3.jouy.inra.fr/miaj/public/nosdoc/rap2012-5.pdf

These are usable formal formulas for the probabilities, although perhaps not optimal, because they tend to obscure the symmetries of the situation, such as the fact that the probabilities are independent of the order in which they are calculated. But the formulas give the right answer, and the Monte Carlo samples don't care how they were calculated, since they're just numbers.

By way of contrast: You cannot generate the random samples just by treating each candidate as an independent binomial distribution (competing against the rest of the field) because there are tremendous correlations. The probabilities need to sum to unity. The /conditional/ binomials get this right, but independent binomials would not.

======================================

As for passwords: BC is correct, as is Randall Munroe:
Trying to densely pack a lot of randomness into a short password is a Bad Idea. It is left over from the days when passwords were required to be short, which was idiocy on top of stupidity.

It is a step in the right direction to pack a sufficient amount of randomness into a long passphrase.

It is correct to view this as a human-interface issue, not just a mathematical question concerning the amount of randomness. The password needs to be easy for the user to remember, but hard for the bad guys to guess.
Password security is serious business; if you don't believe me, just ask President Hillary Clinton.

The discussion to this point is correct as far as it goes, but leaves out several crucial issues.

For starters, consider the task of remembering passwords for *multiple* sites. Computer passwords were introduced back when you needed only one, and they do *not* scale well to multiple sites.
-- It is a Bad Idea to use the same password for more
than one site. The problem is, the bad guys will
obtain your password by attacking a weak, low-value
site and then exploit it at high-value sites.
-- If you use a different password for each site, it
is a human-factors disaster. You'll never be able
to remember them all, especially for rarely-used
sites. The "password recovery" procedures are
either very expensive or very insecure.

Teachers are human-factors experts. It is our job to understand what things will be remembered and what will not. One of my favorite sayings is:
Utility is the best mnemonic.

That is, something that gets used all the time will never be forgotten.

So..... The solution is to use a /keychain/ program aka /password wallet/ program. The program stores all your passwords. You just need to remember one master password that unlocks the wallet.
-- This makes it easy to use different passwords for
different sites.
-- None of the sites every sees your master password.
-- This makes it easy to use a long, highly-random
password for each site, 20 or 30 randomly generated
characters. There is no incentive to shorten them,
since you will have have to type them. The wallet
program puts them on the clipboard for you, or
(better) stuffs them directly into the appropriate
field on the login form.
-- In particular, it's best if the wallet program
checks the address on the form before stuffing
the password, to make sure that a high-value
password cannot possibly be sent to a low-security
site.

All that is true as far as it goes; HOWEVER it is mostly nonsense in the larger sense.

Sending passwords over the wire is an abomination.

Once you recognize the need for a wallet program, you can with zero additional complexity replace it with an /authentication agent/ which performs computation, not just rote memory.

Specifically: There is such a thing as a zero-knowledge proof. The site at the other end of the wire can verify that you know the password, even though it has has never seen the password. There is nothing of yours the site needs to store or protect.

Code to implement zero-knowledge password proofs is widely available. This should have been deployed a long time ago. Excuse me while I go pound my head on the desk.

See
https://en.wikipedia.org/wiki/Zero-knowledge_password_proof
and references therein.
_______________________________________________
Forum for Physics Educators
Phys-l@mail.phys-l.org
http://www.phys-l.org/mailman/listinfo/phys-l
_______________________________________________
Forum for Physics Educators
Phys-l@mail.phys-l.org
http://www.phys-l.org/mailman/listinfo/phys-l