
Predicting Time Series (was PHYS-L Nov99 Tech...)



At 13:59 11/4/99 -0500, David Bowman wrote:

Time span              Avg. per diem number of posts
03/21/95 - 08/27/96    14.9
09/05/96 - 12/31/97    15.2
01/01/98 - 12/31/98    14.1
01/01/99 - 11/04/99    18.5


I suspect it may still be too soon to tell if the recent upswing in
volume represents a new long-term average baseline level. However,
the 1999 volume so far does seem to be over 6 standard deviations
above the mean of previous years.

David Bowman

At 21:32 11/4/99 -0600, I responded:

This looks like a staggering jump: 6 S.D.s!
I wonder how the data would look if we took
(as a null hypothesis) that these are samples from a single
population: specifically, how many S.D.s is this year's daily number
distant from the average daily number for all years?

Could it possibly be a rather small distance
- perhaps even insignificant? :-)
(I haven't worked the numbers...)

Brian W.
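One way to actually work those numbers is sketched below in Python
with NumPy - just a convenient stand-in for whatever tool you prefer
(I used NLREG for the fits further down), and the yearly averages are
David's figures from the table above.

  import numpy as np

  prior   = np.array([14.9, 15.2, 14.1])  # 1995/96, 1996/97, 1998 averages
  current = 18.5                           # 1999 so far

  # Distance of 1999 from the mean of the earlier years, in their own S.D.s
  z_prior = (current - prior.mean()) / prior.std(ddof=1)        # roughly 6.6

  # Distance of 1999 from the mean of ALL four years, in the S.D. of all four
  all_years = np.append(prior, current)
  z_all = (current - all_years.mean()) / all_years.std(ddof=1)  # roughly 1.5

  print(z_prior, z_all)

The first figure is David's six-plus standard deviations; the second
is the single-population comparison I was musing about, and it comes
out very much smaller.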

Making predictions from incomplete data is what business does.
And we all could make a fortune if we simply developed a process
to prognosticate financial futures (for example) with rather
better than chance success. But this is not at all easy.

A futures trader once told me the most secure predictions
depended on cyclical factors. Wind, weather, rainfall, sunshine,
hurricanes and tornadoes can be depended on to have a seasonal
tendency, and they can impact crop prices etc.

In David Bowman's scanty data set, a cycle is hard
to extract from only four data points.
But acknowledging the human gift for extracting patterns from
data, it seems reasonable to place these points on a chart.

I began with a straight horizontal line placed according to a
least-squares rule.

The equation of this line is simply the mean:
Quantity of daily posts = (sum of the n yearly averages)/n = 15.675
The sum of squared deviations is 11.3.
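For anyone following along without NLREG, here is a minimal sketch of
the same fit in Python with NumPy (my own translation, not the
package's output), using the four yearly averages from the table:

  import numpy as np

  qty = np.array([14.9, 15.2, 14.1, 18.5])  # avg posts per day, four spans

  # The least-squares horizontal line is just the mean of the data
  p0 = qty.mean()                            # 15.675
  ss = np.sum((qty - p0) ** 2)               # about 11.3
  print(p0, ss)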

But this is somehow unexciting, too conservative by far.
Is it plausible to suppose that an internet facility will not
grow?

I next tried an equation of this form:
Qty = P0 + P2*YearCode

If we code each time span by the last digit of its closing year
(so YearCode runs 6, 7, 8, 9), the fit resolves to:
Qty = 8.4 + 0.97 * YearCode

Here the sum of squared deviations is 6.6.
For year 10 (2000) this equation leads us to expect not 15.7 posts
per day but 18.1.
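The same straight-line fit can be sketched this way, again assuming
YearCodes 6 through 9 for the four spans:

  import numpy as np

  year = np.array([6, 7, 8, 9])              # last digit of the closing year
  qty  = np.array([14.9, 15.2, 14.1, 18.5])

  slope, intercept = np.polyfit(year, qty, 1)           # about 0.97 and 8.4
  ss = np.sum((qty - (intercept + slope * year)) ** 2)  # about 6.6
  print(intercept + slope * 10)              # about 18.1 for year 10 (2000)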

Rather than a 'simple interest' style model, we might now consider
a 'compound interest' or exponential model, like this:
Qty = P0 + EXP(P2*YearCode)

Many natural growth processes follow this model (at least until
resource scarcity or saturation sets in), so we come up with
Qty = 10.554 + EXP(0.2140*YearCode)
and this predicts 19.1 posts per day next year.

But though the sum of squared deviations is only 5.7, the darn curve
doesn't pass through any of the data points. (At least the previous
sloping line came really close to one year's data.)
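This offset-exponential fit needs a proper nonlinear optimizer; in
the same sketch spirit, SciPy's curve_fit will do, given a rough
starting guess for the two parameters:

  import numpy as np
  from scipy.optimize import curve_fit

  year = np.array([6.0, 7.0, 8.0, 9.0])
  qty  = np.array([14.9, 15.2, 14.1, 18.5])

  def model(x, p0, p2):
      return p0 + np.exp(p2 * x)

  # The keyword p0 below is curve_fit's initial guess for (P0, P2)
  (p0, p2), _ = curve_fit(model, year, qty, p0=(10.0, 0.2))
  ss = np.sum((qty - model(year, p0, p2)) ** 2)   # about 5.7
  print(p0, p2, model(10.0, p0, p2))              # near 10.55, 0.214, 19.1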

Perhaps we should try a polynomial of second degree (or what amounts to
the same in this case, a parabola) like this:
Qty = P0 + P1*YearCode + P2*YearCode^2

At least this can bend around a dip in the data such as ours.
This turns out to be
Qty = 64.775 - 14.405*YearCode + 1.025*YearCode^2

(We could have used this parabolic formulation instead, with the
same resulting curve:)
Qty = P0 + P1*(YearCode - P2)^2   giving
Qty = 14.164 + 1.025*(YearCode - 7.027)^2

The sum of squared deviations is 2.4 in either case.
This predicts 23.2 posts per day next year, by the way.
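In the same sketch, np.polyfit handles the second-degree polynomial
directly:

  import numpy as np

  year = np.array([6, 7, 8, 9])
  qty  = np.array([14.9, 15.2, 14.1, 18.5])

  # np.polyfit returns coefficients from the highest power down
  coeffs = np.polyfit(year, qty, 2)      # about [1.025, -14.405, 64.775]
  ss = np.sum((qty - np.polyval(coeffs, year)) ** 2)     # about 2.4
  print(np.polyval(coeffs, 10))          # about 23.2 for year 10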

So in my innocence, I see that increasing the number of
parameters that I can fit, first one (straight horizontal line),
then two (straight sloping line),
then three (a parabola, concave up),
keeps reducing the minimal sum of squared deviations.

Let's go the whole hog, then, with four parameters
(as many parameters as data points, in fact, hehe).

A polynomial of degree three can turn twice, so it should catch
four scattered points; let's see if it does:
Qty = P0 + P1*YearCode + P2*YearCode^2 + P3*YearCode^3

This optimizes to
Qty = -402.7 + 177.3*YearCode - 24.85*YearCode^2
      + 1.15*YearCode^3


You can see that there is something rather unphysical about this
formulation: where on earth does that -403 fit in the scheme of
things?
Anyway, we really hit the jackpot with this one - the sum of squared
deviations is a modest 6 times 10^-24, or as close to zero as this
regression plotter wants to get. So every point was split in two in
best William Tell style.

And THIS equation predicts 35.3 posts per day next year.
And you thought it was tough to read, delete or discard PHYS-L today?
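And the interpolating cubic, in the same sketch notation:

  import numpy as np

  year = np.array([6, 7, 8, 9])
  qty  = np.array([14.9, 15.2, 14.1, 18.5])

  # Four parameters for four points: the cubic passes through every datum
  coeffs = np.polyfit(year, qty, 3)   # about [1.15, -24.85, 177.3, -402.7]
  print(np.polyval(coeffs, year))     # reproduces 14.9, 15.2, 14.1, 18.5
  print(np.polyval(coeffs, 10))       # about 35.3 for year 10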

So what's it to be? 15.7, 18.1, 19.1, 23.2, or 35.3 daily notes
next year?
There were somewhat plausible reasons for each of these values,
computed by the best means technology can provide.

Here's what I try to remember: Sylvania, Mullard, Philips and many
other vacuum tube makers were busy making their predictions of future
sales of vacuum tubes in just this way in December 1947 (the month
Bell Labs first demonstrated the transistor).
Even worse, they continued to do the same in 1948, 1949, 1950, 1960...
until they realised they were cast as the dinosaurs in the man versus
monster movie.

(I entered these models and data in a shareware package from
Phillip H. Sherrod called NLREG 4.1. It plotted the curves, ran the
nonlinear regressions, and produced a page of statistics for all the
models in a total of 30 minutes. Highly recommended.)

Respectfully
brian whatcott <inet@intellisys.net>
Altus OK