
Re: [Phys-L] Data Science for Physicists - Is there a dataset all students should see?



On 11/16/22 2:51 PM, Paul Nord wrote in part:

And there are so many other things you can do with the Simbad and Gaia
data. There are many more dimensions than just position, brightness, and
color.

Yes. That's interesting because it requires more than just
a simple SQL query. Finding the rogue stars requires doing
some /computations/ on the data.
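
For concreteness, here is a minimal sketch of what I mean,
assuming astroquery is installed and using Gaia DR3 column
names (untested against the live archive; the velocity cut
is arbitrary):

  import numpy as np
  from astroquery.gaia import Gaia   # assumes astroquery is available

  # Pull a manageable sample of nearby stars with good parallaxes.
  job = Gaia.launch_job_async("""
      SELECT TOP 100000 source_id, parallax, pmra, pmdec
      FROM gaiadr3.gaia_source
      WHERE parallax > 10 AND parallax_over_error > 10
        AND pmra IS NOT NULL AND pmdec IS NOT NULL
  """)
  tab = job.get_results()

  # The "computation" part: tangential velocity in km/s,
  # v_t = 4.74 * mu / parallax  (mu in mas/yr, parallax in mas).
  mu  = np.hypot(tab['pmra'], tab['pmdec'])
  v_t = 4.74 * mu / tab['parallax']

  # Flag unusually fast movers as rogue-star candidates.
  print((v_t > 100.0).sum(), "stars with v_t > 100 km/s")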

=====================

A) This is an important topic. Important and very difficult
to do right.

B) The astronomy data seems very attractive. I can't think
of anything better. Even so, it's not ideal, so we should
keep looking. Here are some things to keep in mind:

1) Success catastrophe: The HR diagram is so intricate and
so informative that it is tempting to spend a lot of time
understanding it. In other words, there is an overabundance
of motivation.

This is no worse than, say, conservation of momentum:
innumerable applications. The solution, as always, is to
take the spiral approach. Mention some of the HR motivations
without going into detail, and tell people that if they want
to know more, they can take an astronomy class.

2) Every student produces the same HR diagram. The result
can be looked up on the web. So this takes away some of the
motivation. The query used to garner the data this year is
the same as last year, so there is an incentive to recycle
somebody else's solution.

At some point it becomes the worst sort of cookbook
exercise: Follow the prescribed steps and you will get a
graph with lots of dots, whether or not you understand the
process or the result.

3) The HR diagram contains so much data that it creates a
visualization problem: lots of data points sit on top of
other data points. There are ways of dealing with this, but
they require tremendous amounts of skill and effort.
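
One standard way out, sketched below with stand-in random
numbers (the real inputs would be the color and absolute
magnitude columns from the query): replace the scatter plot
with a two-dimensional histogram on a log color scale, so
the plot shows density rather than a solid blob of
overlapping dots.

  import numpy as np
  import matplotlib.pyplot as plt
  from matplotlib.colors import LogNorm

  rng = np.random.default_rng(0)             # stand-in data only
  bp_rp = rng.normal(1.0, 0.5, 200_000)      # pretend color index
  abs_g = rng.normal(8.0, 2.0, 200_000)      # pretend absolute magnitude

  # Bin the points instead of plotting each one on top of the others.
  plt.hist2d(bp_rp, abs_g, bins=300, norm=LogNorm(), cmap='viridis')
  plt.gca().invert_yaxis()                   # bright stars at the top
  plt.xlabel('BP - RP color')
  plt.ylabel('Absolute G magnitude')
  plt.colorbar(label='stars per bin')
  plt.show()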

4) In some sense, constructing the HR diagram is too easy.
That's good for the first turn of the pedagogical spiral,
but as always, the starting point should not be the ending
point. In particular: In my life, big-data problems are not
only data-intensive, they are also algorithm-intensive and
CPU-intensive. Consider for example weather forecasting. You
need a metric boatload of data, then you need some really
clever guys to come up with the outline of an algorithm for
analyzing it, and then you need some different really clever
guys to optimize the inner loop so that it runs efficiently.
And you need the world's biggest supercomputer.

I've never done weather forecasting, but I've done other
things where it took huge amounts of time and money and
effort to scrounge up the data, and then it took another
year to figure out how to analyze it.

5) Plumbing. The astronomy data is so clean as to be
atypical. It does not have the following problem, but
suppose it did: Suppose the apparent magnitude data was
available in one place, and the distance data in another
place, and the color data in another place ... all in
different formats. You would have to do a lot of database
plumbing before you could perform a join on the available
data. That would be more representative of the data I have
to deal with. Again, we want to start simple, but the
starting point must not be the ending point. If students are
expected to cope with real-world data, they need to learn
how to unscrew the screwed-up data.
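
To make that concrete, here is a hedged sketch of the sort
of plumbing I have in mind. The file names and column names
are entirely hypothetical; the point is the renaming and
joining, not the specifics:

  import numpy as np
  import pandas as pd

  # Hypothetical scattered sources: magnitudes in a CSV, parallaxes in
  # a fixed-width text file, colors in JSON -- all keyed on a star ID,
  # but with inconsistent column names.
  mags  = pd.read_csv('apparent_mags.csv')     # star_id, m_app
  plx   = pd.read_fwf('parallaxes.txt')        # ID, plx_mas
  color = pd.read_json('colors.json')          # source, bp_rp

  # Normalize the keys before joining.
  plx   = plx.rename(columns={'ID': 'star_id'})
  color = color.rename(columns={'source': 'star_id'})

  # Inner joins keep only the stars present in all three sources.
  stars = mags.merge(plx, on='star_id').merge(color, on='star_id')

  # Only now is the physics a one-liner:
  # M = m + 5*log10(parallax in mas) - 10
  stars['M_abs'] = stars['m_app'] + 5*np.log10(stars['plx_mas']) - 10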

If nothing else, dealing with some screwed-up data would
give students an object lesson in what not to do when
designing their own database schemata later.

6) This strongly overlaps with the topic of *uncertainty*.
Analyzing uncertainty makes sense if you have plenty of
data, whereas with a tiny amount of data it is difficult or
outright impossible.

In the introductory class, it is common to require students
to calculate "the" uncertainty. However, the calculation is
almost never compared to observations, because that would
expose how utterly bogus the concepts and methods are.

Having a big set of authentic data would revolutionize the
teaching of uncertainty.
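
As a sketch of what "plenty of data" buys you (stand-in
random numbers again, not a prescription): with a big sample
you can bootstrap the uncertainty of a statistic empirically
and compare it to the textbook formula, instead of just
asserting it.

  import numpy as np

  rng  = np.random.default_rng(1)
  data = rng.normal(5.0, 2.0, 50_000)   # stand-in for a big measured sample

  # Resample with replacement and watch the statistic scatter.
  boot_means = np.array([
      rng.choice(data, size=data.size, replace=True).mean()
      for _ in range(1000)
  ])

  print("mean            :", data.mean())
  print("bootstrap sigma :", boot_means.std())
  print("textbook sigma  :", data.std(ddof=1) / np.sqrt(data.size))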

7) The HR data is pretty much two-dimensional. YMMV, but
most of the data that lands on my desk is very
high-dimensional. For example, suppose you have a million
hand-written ZIP codes. Each digit can be imaged as a 16×16
pixel map, so it is a vector in a 256-dimensional space.

QM molecular orbital calculations can also become rather
high-dimensional.

Coping with high-dimensional data is IMHO just as important
as coping with large numbers of data points. Often the two
go together. I have no idea how to handle this in an
introductory class.
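
That said, one standard first move is worth showing, sketched
here with random stand-in vectors rather than real digit
images: principal component analysis via the SVD, which
projects the 256-dimensional points onto the few directions
that carry most of the variance.

  import numpy as np

  rng = np.random.default_rng(2)
  X = rng.random((10_000, 256))   # stand-in for flattened 16x16 digit images

  # Center the data, then take the SVD; the right singular vectors
  # are the principal directions.
  Xc = X - X.mean(axis=0)
  U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
  explained = s**2 / np.sum(s**2)

  k  = 2
  X2 = Xc @ Vt[:k].T              # each image is now just k numbers
  print("variance captured by first", k, "components:", explained[:k].sum())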