[Phys-L] Data Science for Physicists - Is there a dataset all students should see?



The term "Big Data" gets a lot of lip service these days. More and more
fields of physics find themselves with huge sets of data, yet
undergraduate students are not typically exposed to the analysis of such
data sets unless they get involved in a research internship. Even then,
in classic physicist style, students learn only enough about data
management to get the job done, and the data they analyze is specific to
one field of physics and to one particular research project.

*Is there one data set that every physics student should see?* I have in
mind some complex set of data that requires searching, classifying, and
summarizing with more advanced statistical tools. It would need to be in
a field that is generally accessible to undergraduates.

That suggests that we might look to kinematics. But I could imagine
several data sets, depending on the subfield of the instructor: astronomy
might have one, while an undergraduate program that emphasises materials
science would use another. Whichever data set is used, the fundamental
pedagogical goal of teaching this analysis should be clear from the
result: in whatever field students go into, they should feel prepared to
grapple with a large data set.


FOOTNOTE: What Is Big Data?
As a simple definition of "Big Data" I'm going to say that it is any data
set that will not fit into an Excel spreadsheet. Excel allows 1,048,576
rows (about 1M) and 16,384 columns (about 16k). (That's cell XFD1048576
if you were curious.) But I think we are all aware that Excel breaks long
before you fill up all of those possible locations. A large file may fail
to load without lots of RAM, and recalculating anything complex in a large
data set can be painfully slow. One is never quite sure if the computer is
hung or if Excel will finish in 5 minutes or 5 hours. So in practice, if
your data set has between a few thousand and a million data values, we'll
call that "Big Data." Functionally, it means that one will need a tool
other than Excel to analyze it.
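
The cell reference quoted above can be checked with a few lines of Python.
This is only a sketch: the helper name column_index is made up for
illustration, and the row and column counts are the worksheet limits of
the .xlsx format.

```python
def column_index(label: str) -> int:
    """Convert an Excel column label such as 'XFD' to its 1-based index."""
    index = 0
    for ch in label.upper():
        # Column labels are a base-26 number written with the digits A..Z.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


MAX_ROWS = 2 ** 20  # 1,048,576 rows in an .xlsx worksheet
MAX_COLS = 2 ** 14  # 16,384 columns in an .xlsx worksheet

print(column_index("XFD"))   # 16384, the last column
print(MAX_ROWS * MAX_COLS)   # 17179869184 cells in a completely full sheet
```

That is roughly 17 billion possible cells, far more than any desktop
machine will comfortably hold in a spreadsheet.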

By the standards of the modern data-driven world, this threshold is still
rather small. An itemized list of sales at your local supermarket for a
month would likely
exceed this size. Receipts from the major online retailers will include
orders of magnitude more transactions. Astronomy surveys, daily weather
recordings, particle physics experiments, and many more scientific studies
are creating enormous data sets. Even a simple sensor in the lab is
capable of generating thousands of data values per second. Fortunately,
most physics lab experiments last less than 2 seconds.
=====

I'm working with the AAPT committee on labs to suggest that the
recommendations on curriculum include some data science concepts. I
haven't yet convinced my colleagues to add this, but this idea is my little
part of the discussion.
Intro level: Make an organized table in Excel.
Advanced level: ...?
I'd like to write, "Understand 3rd Normal Form," but that doesn't mean
anything to physicists. Nor does it motivate the need.
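
For anyone who has not met the term, here is a minimal sketch of what
"3rd Normal Form" might look like for lab data, using Python's built-in
sqlite3 module. The table names, column names, and values are all invented
for illustration. The idea is that a fact about a sensor (such as its
units) is stored once in its own table rather than repeated on every row
of readings, as it would be in a flat spreadsheet.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each fact about a sensor is stored exactly once.
    CREATE TABLE sensor (
        sensor_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        units     TEXT NOT NULL
    );
    -- Each reading points at a sensor instead of copying its details.
    CREATE TABLE reading (
        reading_id INTEGER PRIMARY KEY,
        sensor_id  INTEGER NOT NULL REFERENCES sensor(sensor_id),
        t_seconds  REAL NOT NULL,
        value      REAL NOT NULL
    );
""")

# Hypothetical example data, not from any real experiment.
conn.execute("INSERT INTO sensor VALUES (1, 'motion detector', 'm')")
conn.executemany(
    "INSERT INTO reading (sensor_id, t_seconds, value) VALUES (?, ?, ?)",
    [(1, 0.0, 0.50), (1, 0.1, 0.55), (1, 0.2, 0.65)],
)

# A join rebuilds the flat, spreadsheet-style view whenever it is needed.
rows = conn.execute("""
    SELECT s.name, s.units, r.t_seconds, r.value
    FROM reading AS r
    JOIN sensor AS s ON s.sensor_id = r.sensor_id
    ORDER BY r.t_seconds
""").fetchall()

for row in rows:
    print(row)
```

Correcting the units in the one sensor row fixes every reading at once,
which may be a more motivating way to state the need than the name
"3rd Normal Form" itself.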

Happy to hear your thoughts. Or, critique anything I've said.

Paul