
Re: [Phys-L] throwing out data -- or not



Executive summary: Don't throw out data unless you can
prove you know what you're doing.

On 07/04/2018 05:05 PM, bernard cleyet wrote:

when to "throw out" a datum: [1]

Scroll to the Appendix:
https://scott.physics.ucsc.edu/pdf/133_draftman.pdf

That document has many appendices. I assume we are talking
about the one that begins on page 57.

OTOH: The current 133 manual has a different reason for "throwing
out" data. However, I'm not free to reference it.

The only reason is bad equipment, [2a]
in which case all data taken must be "thrown out". [2b]

Those are two very different answers.
-- Example [1] seems much too lenient.

-- Rule [2a] seems too narrow. There are lots of things that
could go wrong. In particular, it is common to find that
the equipment per se is behaving according to specifications,
but the experimental strategy and tactics are suboptimal,
perhaps due to uncontrolled variables and/or problems with
the theoretical model.
-- Remedy [2b] is usually correct, especially in student-lab
situations. Throw out one ==> throw out all. And even
that can be problematic. Any situation where you have to
throw out data is begging for trouble.

Tangential remark: If you look hard enough you can find
complicated situations where it makes sense to make an
exception to [2b]. Tremendous sophistication and effort
are required to handle such exceptions properly.

Good rule of thumb:
Throwing out data ==> students are in way over their heads.

Returning to example [1]: It is unhelpful to give students
the idea that it is OK to throw out data with little or no
justification other than an improvement in the chi-square.

Constructive suggestion: If you have a good way to explain
the data in terms of real physics, do not throw out the data;
instead *model* the data. In example [1], rather than fitting
to a single exponential, fit to the sum of two exponentials.

In example [1], as presented in the manual, there isn't enough
data at small x to reliably fit the second exponential ... but
that means you need more data, not less!
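
To make the two-exponential suggestion concrete, here is a minimal
sketch of one way to do such a fit. It is not the method used in the
manual. It assumes a model y = A*exp(-mu1*x) + B*exp(-mu2*x) and uses
separable least squares: for each trial pair (mu1, mu2) on a grid, the
amplitudes A and B enter linearly and are solved for directly. The
data, the thickness units, the grid, and all function names are
hypothetical, and the fit is unweighted for brevity; a real analysis
would weight each point by its counting uncertainty.

```python
import math

def fit_amplitudes(xs, ys, mu1, mu2):
    """For fixed decay constants, solve the 2x2 normal equations for A and B."""
    e1 = [math.exp(-mu1 * x) for x in xs]
    e2 = [math.exp(-mu2 * x) for x in xs]
    s11 = sum(a * a for a in e1)
    s12 = sum(a * b for a, b in zip(e1, e2))
    s22 = sum(b * b for b in e2)
    t1 = sum(a * y for a, y in zip(e1, ys))
    t2 = sum(b * y for b, y in zip(e2, ys))
    det = s11 * s22 - s12 * s12
    amp1 = (t1 * s22 - t2 * s12) / det
    amp2 = (t2 * s11 - t1 * s12) / det
    # Sum of squared residuals for this trial pair of decay constants.
    sse = sum((y - amp1 * a - amp2 * b) ** 2 for y, a, b in zip(ys, e1, e2))
    return amp1, amp2, sse

def fit_two_exponentials(xs, ys, mu_grid):
    """Grid search over mu1 < mu2; returns (sse, A, mu1, B, mu2) with the smallest sse."""
    best = None
    for i, mu1 in enumerate(mu_grid):
        for mu2 in mu_grid[i + 1:]:  # strictly larger, so the 2x2 system is not singular
            amp1, amp2, sse = fit_amplitudes(xs, ys, mu1, mu2)
            if best is None or sse < best[0]:
                best = (sse, amp1, mu1, amp2, mu2)
    return best

# Hypothetical noise-free data: a slow component plus a fast one that
# matters only at small absorber thickness x (units assumed to be mm).
xs = [0.25 * k for k in range(13)]
ys = [1000.0 * math.exp(-0.1 * x) + 500.0 * math.exp(-2.0 * x) for x in xs]
mu_grid = [0.05 * k for k in range(1, 61)]

sse, amp1, mu1, amp2, mu2 = fit_two_exponentials(xs, ys, mu_grid)
print(f"slow: A={amp1:.1f} mu1={mu1:.2f}   fast: B={amp2:.1f} mu2={mu2:.2f}")
```

Note that the fast component is pinned down only by the first few
x-values. With real, noisy data and only one or two points in that
region, the four parameters will be poorly constrained, which is the
"need more data" point made above.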

AT THE VERY LEAST: If you decide to fit to the large-x tail
of the data, don't just throw away the small-x data. If you
can't properly model it, retain it anyway, and just give it
zero weight in the fit. Display it in all the graphs.
Acknowledge that it doesn't fit the 2-parameter model.
Clearly identify which points have been given zero weight.
This is an amateurish solution, but at least it is honest.
It is the simplest honest solution.
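
The zero-weight approach can be sketched as follows. This is only an
illustration under stated assumptions: the data are made up (a single
exponential with extra counts added to the first two points), the
function and variable names are hypothetical, and the fit is a weighted
straight line through ln(counts) versus x. The weights assume Poisson
counting statistics, where the variance of ln(N) is approximately 1/N,
so each fitted point gets weight N. Excluded points get weight zero but
stay in the data set and in the printed table.

```python
import math

def weighted_log_fit(xs, counts, weights):
    """Weighted straight-line fit of ln(counts) vs x; returns (amplitude, mu)."""
    ys = [math.log(c) for c in counts]
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Hypothetical data: single exponential plus an excess at small x.
xs = [0.5 * k for k in range(8)]
counts = [1000.0 * math.exp(-0.1 * x) for x in xs]
counts[0] += 400.0
counts[1] += 60.0

# Points with x < 1.0 are kept in the data set but given zero weight.
used = [x >= 1.0 for x in xs]
weights = [c if u else 0.0 for c, u in zip(counts, used)]

amplitude, mu = weighted_log_fit(xs, counts, weights)

# Every point appears in the table; the zero-weight ones are flagged.
table = []
for x, c, u in zip(xs, counts, used):
    model = amplitude * math.exp(-mu * x)
    flag = "fitted" if u else "ZERO WEIGHT"
    table.append((x, c, model, flag))
    print(f"{x:5.2f}  {c:8.1f}  {model:8.1f}  {flag}")
```

The same table would drive the graph: plot all rows, use a different
symbol for the zero-weight rows, and let the reader see how far they
sit from the fitted curve.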

If you want to be professional about it, the second-best
remedy would be to use thinner sheets of lead at the beginning,
to get more x-values under the initial part of the curve, i.e.
the part where the second exponential is significant.

An even better remedy, if you think betas are involved, is to
redesign the experiment to distinguish betas from gammas. In
particular, use a few sheets of PMMA or other low-atomic-weight
material in front of the lead. That will stop the betas while
leaving the gammas relatively undisturbed. It's entirely possible
that the extra counts are coming from low-energy gammas, instead
of or in addition to betas, and you'll never know unless you
check.

I'm not an expert in this area, but I'm skeptical that you can
explain the data by saying significant numbers of 500 keV betas
are making it through 0.85 mm of lead, so you might be badly
fooling yourself by throwing out the second data point. It's
trying to tell you something. Hint: Look up the stopping-power
data before putting too much emphasis on the beta story.
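
As a rough plausibility check before looking up real stopping-power
tables, one can estimate the electron range with the Katz-Penfold
empirical formula, R = 0.412 * E^(1.265 - 0.0954 ln E) in g/cm^2 for E
in MeV between 0.01 and 2.5. The sketch below makes several
assumptions: the formula was calibrated mainly on aluminum and is only
approximate for lead, 500 keV is treated as the electron energy, the
density of lead is taken as 11.34 g/cm^3, and bremsstrahlung and
backscatter are ignored. The function name is hypothetical. It is no
substitute for tabulated data.

```python
import math

def beta_range_g_per_cm2(energy_mev):
    """Katz-Penfold empirical range, valid for about 0.01 to 2.5 MeV."""
    if not 0.01 <= energy_mev <= 2.5:
        raise ValueError("formula is only intended for 0.01 to 2.5 MeV")
    exponent = 1.265 - 0.0954 * math.log(energy_mev)
    return 0.412 * energy_mev ** exponent

LEAD_DENSITY = 11.34  # g/cm^3 (assumed handbook value)

# Convert mass range (g/cm^2) to a thickness in mm of lead.
range_mm = 10.0 * beta_range_g_per_cm2(0.5) / LEAD_DENSITY
sheet_mm = 0.85  # thickness quoted above

print(f"estimated range of 0.5 MeV electrons in lead: {range_mm:.2f} mm")
print(f"sheet thickness / estimated range: {sheet_mm / range_mm:.1f}")
```

On these assumptions the estimated range comes out at well under the
0.85 mm sheet thickness, by a factor of several, which is consistent
with the skepticism expressed above.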

Yeah, I know that lab time is a precious resource and taking
more data is a burden. And I am quite aware that a 4-parameter
fit is disproportionately more trouble than a 2-parameter fit.
But seriously, folks, what's the point of teaching the wrong
concept quickly? Take the time to do it right.

Bottom line: Don't throw out data unless you can prove you
know what you're doing. The rationale given in example [1]
is nowhere near sufficient.