
Re: [Phys-L] Data Science for Physicists - Is there a dataset all students should see?



Gathering ideas into a couple of lists here.

Data sets ===
1. Astronomy archives
a) Gaia
b) SIMBAD
2. Weather, climate, and geography data
a) GOES
b) MODIS
c) GISTEMP
3. Particle Physics
a) CERN Open Data
b) ML Physics Portal
4. Create your own in the lab
a) accelerometer on a damaged fan


Suggested skills ===
1. Data Preprocessing - missing entries, bad data, outliers, fixing
formatting, etc. (a rough sketch of this and of item 3 follows the list)
2. Database JOIN from disparate sources
3. Analysis
a) statistical summary
b) histogramming
c) fitting both linear and nonlinear functions
4. Working with High Dimensional Data
a) visualization methods
b) data selection/cuts, simple and along many dimensions
c) dimensionality reduction
5. Data Classification and Machine Learning
a) Methods - K means, decision trees, PCA, Neural Networks, etc.
b) Oversampling
c) Synthetic data
6. Monte Carlo simulation
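
To make items 1 and 3 concrete, here is a rough Python sketch (pandas +
scipy). The file name, column names, and the damped-oscillation model are
placeholders for illustration, not a prescription:

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# Hypothetical data file with time and signal columns; the names are
# made up for illustration (think accelerometer on a wobbly fan).
df = pd.read_csv("lab_data.csv")

# 1. Preprocessing: drop missing entries and reject gross outliers.
df = df.dropna(subset=["t", "signal"])
df = df[np.abs(df["signal"] - df["signal"].mean()) < 5 * df["signal"].std()]

# 3a/3b. Statistical summary and a quick histogram.
print(df["signal"].describe())
df["signal"].hist(bins=50)

# 3c. Nonlinear fit to a damped oscillation (an arbitrary example model).
def damped(t, A, tau, omega, phi):
    return A * np.exp(-t / tau) * np.cos(omega * t + phi)

popt, pcov = curve_fit(damped, df["t"], df["signal"], p0=[1.0, 10.0, 2.0, 0.0])
print("fit parameters:", popt)
print("1-sigma uncertainties:", np.sqrt(np.diag(pcov)))

Even that much touches missing data, outlier cuts, a summary, a histogram,
and a nonlinear fit with parameter uncertainties.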

Items 1 through 4 are a good fit for the current discussion. Item 5
borders on some data science methods that I don't think have a lot of
general use in physics. Item 6 is a little outside of the bounds of the
discussion I wanted to have, but I can't quite justify keeping it out of
the data analysis goals for an undergraduate education. That might be
another category.

John's concern about too narrowly defining a specific lab activity is
well taken. The solution is to keep pedagogical goals in mind with any
analysis. And suggesting an example analysis is just meant to be an
idea starter. In addition to an HR diagram, the astronomy databases could
be used to find Hubble's Law, search for binary stars, or analyze
exoplanets.
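
For what it's worth, a color-magnitude (HR) diagram from Gaia is only a
few lines with astroquery. The cuts below are arbitrary illustrative
choices, not a recommended query:

import numpy as np
import matplotlib.pyplot as plt
from astroquery.gaia import Gaia

# ADQL query against the Gaia DR3 source table: color, magnitude, and
# parallax for nearby, well-measured stars (all cuts are arbitrary).
query = """
SELECT TOP 50000 bp_rp, phot_g_mean_mag, parallax
FROM gaiadr3.gaia_source
WHERE parallax > 10
  AND parallax_over_error > 10
  AND bp_rp IS NOT NULL
"""
job = Gaia.launch_job_async(query)
stars = job.get_results().to_pandas()

# Absolute magnitude from apparent magnitude and parallax (in mas).
stars["abs_g"] = stars["phot_g_mean_mag"] + 5 * np.log10(stars["parallax"] / 100.0)

plt.scatter(stars["bp_rp"], stars["abs_g"], s=1)
plt.gca().invert_yaxis()   # brighter stars toward the top
plt.xlabel("BP - RP color")
plt.ylabel("Absolute G magnitude")
plt.show()

Swapping in other columns (proper motions, radial velocities, variability
flags) is where the extra dimensions come in.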

At the introductory level we are probably just writing down a list of
things that people are already doing. Now that I say that, I'm concerned
that many data acquisition packages may become simply a black box that
spits out the answer after the student turns a couple of cranks. That's
fine, but we absolutely want students to be able to put data into a
spreadsheet in an organized way, create a graph, and do some analysis of
that data. I don't think there's much debate about the intro level labs
and courses.

Random comment:
My astronomy colleague is using a program called TOPCAT for managing data
from astronomy databases. (http://www.star.bris.ac.uk/~mbt/topcat/)
It is designed for astronomy and understands how to interpret some data
types that are specific to the field. But it could be used as a general
tool for analysis.


On Thu, Nov 17, 2022 at 8:28 AM Gollmer, Steve via Phys-l <
phys-l@mail.phys-l.org> wrote:

With regard to point B7 (high dimensionality), you can use multispectral
satellite data. I use data from the Moderate Resolution Imaging
Spectroradiometer (MODIS). However, it needs to be geolocated and you need
to visually sort through the 36 channels to remove channels that have too
much saturation. The virtue of using this data is that you can generate a
correlation matrix between the channels and do Principal Component Analysis
to transform the channels into 4 or 5 principal components, which contain
about 98% of the information in the image. The next step is to run a
nearest neighbor classifier. Once you plot the image based on the
classification results, you can identify major surface features, such as
deep water, shallow water, heavy vegetation, snow cover, and mountains.
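
In scikit-learn terms that sequence is roughly the sketch below; the
synthetic arrays stand in for the geolocated, saturation-screened channels
and a few hand-labeled training pixels:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (n_pixels, n_channels) multispectral data, given
# a low-rank structure so that PCA has something to find.
rng = np.random.default_rng(0)
channels = rng.normal(size=(10000, 4)) @ rng.normal(size=(4, 30))
channels += 0.1 * rng.normal(size=channels.shape)

# A handful of hand-labeled pixels (5 surface classes), also synthetic.
train_idx = rng.choice(len(channels), size=50, replace=False)
train_labels = rng.integers(0, 5, size=50)

# Standardize, then keep enough components for ~98% of the variance.
channels_std = (channels - channels.mean(axis=0)) / channels.std(axis=0)
pca = PCA(n_components=0.98)
scores = pca.fit_transform(channels_std)
print("components kept:", pca.n_components_)

# Nearest-neighbor classification of every pixel in the reduced space.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(scores[train_idx], train_labels)
classes = clf.predict(scores)
# 'classes' can be reshaped to the image dimensions and plotted as a class map.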

An alternative is to use data from the GOES satellite, which consists of 16
channels (https://www.star.nesdis.noaa.gov/goes/conus.php?sat=G16).
However, the available images have continental, country and state outlines
as well as a false color palette imposed on some channels. If I can find
easy access to the unprocessed channels, this would streamline the process
immensely and be a good project for introducing physics students to big
data.

Steve



On Wed, Nov 16, 2022 at 11:00 PM John Denker via Phys-l <
phys-l@mail.phys-l.org> wrote:

On 11/16/22 2:51 PM, Paul Nord wrote in part:

And there are so many other things you can do with the Simbad and Gaia
data. There are many more dimensions than just position, brightness,
and color.

Yes. That's interesting because it requires more than just
a simple SQL query. Finding the rogue stars requires doing
some /computations/ on the data.

=====================

A) This is an important topic. Important and very difficult
to do right.

B) The astronomy data seems very attractive. I can't think
of anything better. Even so, it's not ideal, so we should
keep looking. Here are some things to keep in mind:

1) Success catastrophe: The HR diagram is so intricate and
so informative that it is tempting to spend a lot of time
understanding it. In other words, there is an overabundance
of motivation.

This is no worse than, say, conservation of momentum:
innumerable applications. The solution, as always, is to
take the spiral approach. Mention some of the HR motivations
without going into detail, and tell people if they want to
know more they can take an astronomy class.

2) Every student produces the same HR diagram. The result
can be looked up on the web. So this takes away some of the
motivation. The query used to garner the data this year is
the same as last year, so there is an incentive to recycle
somebody else's solution.

At some point it becomes the worst sort of cookbook
exercise: Follow the prescribed steps and you will get a
graph with lots of dots, whether or not you understand the
process or the result.

3) The HR diagram contains so much data that it creates a
visualization problem. Lots of data points sitting on top of
others. There are ways of dealing with this, but they
require tremendous amounts of skill and effort.

4) In some sense, constructing the HR diagram is too easy.
That's good for the first turn of the pedagogical spiral,
but as always, the starting point should not be the ending
point. In particular: In my life, big-data problems are not
only data-intensive, they are also algorithm-intensive and
CPU-intensive. Consider for example weather forecasting. You
need a metric boatload of data, then you need some really
clever guys to come up with the outline of an algorithm for
analyzing it, and then you need some different really clever
guys to optimize the inner loop so that it runs efficiently.
And you need the world's biggest supercomputer.

I've never done weather forecasting, but I've done other
things where it took huge amounts of time and money and
effort to scrounge up the data, and then it took another
year to figure out how to analyze it.

5) Plumbing. The astronomy data is so clean as to be
atypical. It does not have the following problem, but
suppose it did: Suppose the apparent magnitude data was
available in one place, and the distance data in another
place, and the color data in another place ... all in
different formats. You would have to do a lot of database
plumbing before you could perform a join on the available
data. That would be more representative of the data I have
to deal with. Again, we want to start simple, but the
starting point must not be the ending point. If students are
expected to cope with real-world data, they need to learn
how to unscrew the screwed-up data.

If nothing else, dealing with some screwed-up data would
give students an object lesson in what not to do when
designing their own database schemata later.
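
A minimal pandas sketch of that cleanup-then-join pattern; the tiny inline
tables, column names, and units are invented purely to stand in for the
"all in different formats" problem:

import numpy as np
import pandas as pd

# Three "sources", each keyed and formatted differently (in real life these
# would be separate files or query results, not inline tables).
mags   = pd.DataFrame({"source_id": [101, 102, 103], "m_app": [5.1, 7.3, 9.0]})
dists  = pd.DataFrame({"star": [101, 103, 104], "parallax_mas": [25.0, 4.0, -1.0]})
colors = pd.DataFrame({"ID": ["101", "102", "103"], "b_minus_v": [0.65, 1.2, 0.3]})

# Plumbing: one key name, one key type, physical units, and drop junk rows.
dists = dists.rename(columns={"star": "source_id"})
dists = dists[dists["parallax_mas"] > 0]
dists["dist_pc"] = 1000.0 / dists["parallax_mas"]
colors = colors.rename(columns={"ID": "source_id"})
colors["source_id"] = colors["source_id"].astype(int)

# Only now does the join work.
stars = mags.merge(dists, on="source_id").merge(colors, on="source_id")
stars["abs_mag"] = stars["m_app"] - 5 * np.log10(stars["dist_pc"]) + 5
print(stars)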

6) This strongly overlaps with the topic of *uncertainty*.
Analyzing uncertainty makes sense if you have plenty of
data, whereas with a tiny amount of data it is difficult or
outright impossible.

In the introductory class, it is common to require students
to calculate "the" uncertainty. However, the calculation is
almost never compared to observations, because that would
expose how utterly bogus the concepts and methods are.

Having a big set of authentic data would revolutionize the
teaching of uncertainty.
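
As a sketch of what "compared to observations" can mean once the data set
is big (everything below is synthetic, just to show the comparison):

import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a large pile of repeated measurements.
data = rng.normal(loc=9.81, scale=0.30, size=100_000)

# Textbook claim: the uncertainty of an N-sample mean is s / sqrt(N).
N = 25
predicted = data.std(ddof=1) / np.sqrt(N)

# Observation: split the data into many independent N-sample groups and
# look at how much their means actually scatter.
means = data[: (len(data) // N) * N].reshape(-1, N).mean(axis=1)
observed = means.std(ddof=1)

print(f"predicted uncertainty of the mean: {predicted:.4f}")
print(f"observed scatter of group means:   {observed:.4f}")

With real, messy, correlated data the two numbers need not agree, which is
exactly the conversation worth having.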

7) The HR data is pretty much two-dimensional. YMMV, but
most of the data that lands on my desk is very high
dimensional. For example, suppose you have a million
hand-written ZIP codes. Each digit can be imaged as a 16×16
pixel map. So it is a vector in a 256-dimensional space.

QM molecular orbital calculations can also become rather
high dimensional.

Coping with high-dimensional data is IMHO just as important
as coping with large numbers of data points. Often the two
go together. I have no idea how to handle this in an
introductory class.




--
Steven Gollmer
*Senior Professor of Physics*
Science and Mathematics
*Cedarville University*
o: 937-766-7764
https://stevegollmer.people.cedarville.edu/
_______________________________________________
Forum for Physics Educators
Phys-l@mail.phys-l.org
https://www.phys-l.org/mailman/listinfo/phys-l