
Re: [Phys-L] Confessions of a Repentant Modeler in a Time of Plague.

On 4/19/2020 3:00 PM, David Bowman wrote:

For those (bc & BW?) who want to attempt to fit past data to a previously defined form for a multi-parameter family of functions for the purpose of predicting future outcomes, *without* attempting to model any of the complicated underlying dynamics of a real communicable disease epidemic (which would typically involve many coupled independent degrees of freedom, each obeying their own mutually coupled ODEs, with frequent time dependent updating of parameters of said degrees of freedom, and which would have no closed form solution if modeled well), and who *only* want to fit some prior data, without understanding how & why the data are what they are, so as to just make some simple model predictions from a reasonable data fit, I hereby offer the simplest 1-degree-of-freedom model form I could think of which has 4 parameters separately controlling 4 different constraining aspects of any semi-realistic model, *and* which also has a closed form solution of a single ODE describing its temporal variation.
I rejoiced to see David's note last night which offers a method to separate the growth exponential from the decay exponential of a time series. One small (?) step for him, but a giant step forward for me!
 I will not resist the impulse to liken

..." *without* attempting to model any of the complicated underlying dynamics of a real communicable disease epidemic, (which would typically involve many coupled independent degrees of freedom, each obeying their own mutually coupled ODEs, with frequent time dependent updating of parameters of said degrees of freedom, and which would have no closed form solution, if modeled well)..." the physicist's desire to model aerodynamic lift
with a model of the complicated dynamics of air molecules and gust packets. [Unworthy AND ungrateful - I realize! ]

I consider a simple model for a cumulative sigmoidal growth process with a single dependent variable y and a temporal independent variable t on the real line (or a uniformly sampled real line), having the following 4 aspects which need to be determined by the fitting procedure. These are:

1.) An initial exponential growth rate in the t --> - ∞ limit. I'll call this rate a (for alpha). In that initial limit the growth is assumed to be exponential in character, behaving like
(Constant)*exp(a*t). For uniformly sampled data with sample time interval dt we can treat the initial growth behavior for the n-th sample as having the equivalent form

(Constant)*(1 + g)^n

where ln(1 + g) = a*dt & g = exp(a*dt) - 1. Here g is the initial compounded growth 'interest' rate for unit sampling time interval dt. We discretely model the continuous time t = dt*n for the nth data point, where we take the starting time t = 0 to coincide with starting sample n = 0 for convenience.

2.) Asymptotic long time (t --> + ∞, or n --> + ∞) maximum limiting cumulative value y_∞. IOW, lim t --> + ∞ y(t) --> y_∞ and/or lim n --> + ∞ y_n --> y_∞.

3.) The ratio of the final exponential decay rate (at which the maximum asymptote is approached as t --> + ∞) to the initial exponential growth rate at t --> - ∞. We call this rate ratio p. Thus the final asymptote is approached as t --> + ∞ like y(t) = y_∞ - (Constant)*exp(-a*p*t) , or for discrete samples y_n = y_∞ - (Constant)/(1 + g)^(p*n) .

4.) The location of the growth process's inflection point along the independent parameter, t* for continuous time or n* for uniformly sampled data. Here t* signals when the process stops accelerating and begins to decelerate toward the final bound y_∞.

The ODE describing the continuous process does not refer explicitly to this 4th parameter t* (or n*); its value is determined by an integration constant, which is fixed by the conditions of the particular fit calibrating the horizontal offset needed in the independent parameter.
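The relation in aspect 1.) between the continuous rate a and the compounded per-sample rate g can be checked numerically; a minimal sketch (the rate 0.3/day is an arbitrary illustrative value, not from any fit below):

```python
import math

def discrete_growth_rate(a, dt=1.0):
    """Compounded per-sample 'interest' rate g equivalent to a continuous
    exponential rate a over one sample interval dt: g = exp(a*dt) - 1."""
    return math.exp(a * dt) - 1.0

# Round trip: recover a from g via ln(1 + g) = a*dt.
a = 0.3                       # illustrative continuous growth rate (per day)
g = discrete_growth_rate(a)   # equivalent daily compounded rate
a_back = math.log(1.0 + g)    # recovers 0.3
```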

The simplest (for me) ODE having the above properties which is still explicitly solvable (for me) is the equation

dy/dt = a*y*(1 - (y/y_∞)^p) .

Note the ODE depends on a, y_∞, & p, but not on t*. Note also the special case of p = 1 is the logistic ODE for a logistic process. The general solution to this ODE for a growth process is

y(t) = y_∞/(1 + p*exp(a*p*(t* - t)))^(1/p)

This functional form has initial exponential growth rate a (at t --> - ∞), asymptotic bound y_∞, final decay rate a*p (at t --> + ∞) and inflection point at t = t* where the value of the process at the inflection point is y(t*) = y_∞/(1 + p)^(1/p).
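As a sanity check, a central-difference derivative of this closed form can be compared against the right-hand side of the ODE above; the parameter values here are arbitrary illustrative choices, not the fitted ones:

```python
import math

def y_closed(t, a, y_inf, p, t_star):
    """Closed-form solution y(t) = y_inf/(1 + p*exp(a*p*(t_star - t)))**(1/p)."""
    return y_inf / (1.0 + p * math.exp(a * p * (t_star - t))) ** (1.0 / p)

def ode_rhs(y, a, y_inf, p):
    """Right-hand side of dy/dt = a*y*(1 - (y/y_inf)**p)."""
    return a * y * (1.0 - (y / y_inf) ** p)

# Compare dy/dt from the closed form (central difference) with the ODE.
a, y_inf, p, t_star = 0.2, 70000.0, 0.5, 25.0   # illustrative values only
t, h = 10.0, 1e-5
dy_num = (y_closed(t + h, a, y_inf, p, t_star)
          - y_closed(t - h, a, y_inf, p, t_star)) / (2.0 * h)
dy_ode = ode_rhs(y_closed(t, a, y_inf, p, t_star), a, y_inf, p)
# dy_num and dy_ode agree to within the finite-difference error
```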

The discrete version of the process for uniformly sampled values is

y_n = y_∞/(1 + p*(1 + g)^(p*(n* - n)))^(1/p) .

Again the inflection point (where the growth rate is maximum) has a value of y_n* = y_∞/(1 + p)^(1/p). Also again the p = 1 special case is the logistic process function.
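A sketch of the discrete form, verifying the two facts just stated (the inflection value y_∞/(1 + p)^(1/p), and the p = 1 logistic special case); the parameter values are illustrative, not the fitted ones:

```python
def y_discrete(n, g, y_inf, p, n_star):
    """Sampled solution y_n = y_inf/(1 + p*(1 + g)**(p*(n_star - n)))**(1/p)."""
    return y_inf / (1.0 + p * (1.0 + g) ** (p * (n_star - n))) ** (1.0 / p)

g, y_inf, p, n_star = 0.85, 70000.0, 0.156, 24.8   # illustrative values only

# At the inflection sample the value is y_inf/(1 + p)**(1/p):
y_infl = y_discrete(n_star, g, y_inf, p, n_star)

# With p = 1 the formula is the ordinary logistic y_inf/(1 + (1 + g)**(n_star - n)):
y_logistic = y_discrete(10, g, y_inf, 1.0, n_star)
```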

Using uniformly sampled data and plotting the relative growth per sample d_n == y_n/y_(n-1) - 1 versus the cumulative value y_n allows one to determine g, y_∞ & p.

If the plot is a straight line the process is indeed logistic with p = 1; the vertical intercept is g and the horizontal intercept is y_∞. If the plot is smooth and fittable with a curved graph using a value of p ≠ 1, then the simple growth/decay asymmetric process with p ≠ 1 adequately describes the process. If the fit is not good, because the underlying dynamics is not captured by the model or the statistical noise is too great, then the model doesn't work so well. BTW, in the p ≠ 1 situation the fit's vertical intercept is *still* g and its horizontal intercept is *still* y_∞. The parameter p just introduces more curvature to the graph between these intercepts as p deviates from 1 in value. The value for n* (or equivalently t*) can still be found afterward, once parameters g, y_∞ & p are found from the d_n vs y_n plot, by a method similar to what I already described previously for the logistic model. The p-dependent functional form for the curved d_n vs y_n plot is the messy formula

d_n = (1 + g)*(1 - (1 - 1/(1 + g)^p)*(y_n/y_∞)^p)^(1/p) - 1 .
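The messy formula can be checked by generating a sequence y_n from the discrete solution and comparing d_n computed directly (y_n/y_(n-1) - 1) against d_n computed from the formula; the parameter values here are illustrative, not the fitted ones:

```python
def y_discrete(n, g, y_inf, p, n_star):
    """Sampled solution y_n = y_inf/(1 + p*(1 + g)**(p*(n_star - n)))**(1/p)."""
    return y_inf / (1.0 + p * (1.0 + g) ** (p * (n_star - n))) ** (1.0 / p)

def d_from_y(y_n, g, y_inf, p):
    """d_n as a function of the cumulative value y_n (the formula above)."""
    return (1.0 + g) * (1.0 - (1.0 - 1.0 / (1.0 + g) ** p)
                        * (y_n / y_inf) ** p) ** (1.0 / p) - 1.0

g, y_inf, p, n_star = 0.85, 70000.0, 0.156, 24.8   # illustrative values only
checks = []
for n in range(5, 40, 7):
    y_n = y_discrete(n, g, y_inf, p, n_star)
    d_direct = y_n / y_discrete(n - 1, g, y_inf, p, n_star) - 1.0
    checks.append(abs(d_direct - d_from_y(y_n, g, y_inf, p)))
# every entry of `checks` sits at machine-precision level
```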

I tried out this 4-parameter model on the US COVID-19 death data and did all the fitting by hand with some eyeballing of how the curvature went. The resulting fit seemed to agree with the data to within the statistical sampling noise, with nearly imperceptible apparent systematic deviation. Since I did the fitting by hand I did not properly deal with the error bars/bands in the data. I figured someone else could do that if they were so inclined. I only wanted a proof-of-concept fit to see if this phenomenological model may have any bearing on anything. It turns out it performs far better than the simple logistic model does.

Here are the parameters I got from my fit using data from March 19 - April 17. After doing the fit I added in April 18's data, and that data point had no influence on the parameters, as it fit just as well as all the previous data.
g = 0.850453366
y_∞ = 69994.5
p = 0.1555328
n* = 24.822212 (note n = 0 corresponds to March 19, 2020)

This means the process's inflection point value is

y_n* = 27631.5 ( = 69994.5/2.53314124 = y_∞/(1 + p)^(1/p) ) .
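The two derived numbers quoted in this thread (the inflection value y_n* here, and the tail decay rate 1/(1 + g)^p - 1 mentioned further below) follow directly from the fitted parameters; a minimal check:

```python
# Fitted parameters as quoted above (guard digits and all):
g, y_inf, p = 0.850453366, 69994.5, 0.1555328

# Inflection-point value y_n* = y_inf/(1 + p)**(1/p):
y_star = y_inf / (1.0 + p) ** (1.0 / p)      # about 27631.5

# Asymptotic daily compounded decay rate in the tail, 1/(1 + g)**p - 1:
tail_rate = 1.0 / (1.0 + g) ** p - 1.0       # about -0.0913
```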
I attempted to fit data for this period using David's continuous and discrete formulations, but I failed to close on a fit (the reason is all but certain to be my fumbling).

Then I uncoupled just one of the parameters from interacting with the others, to end up with a function of this kind. From your

y(t) = y_∞/(1 + p*exp(a*p*(t* - t)))^(1/p)

I transcribed:

DEATHS = K1 / (1 + a*exp(a*b*(25-day)))^(1/c)
You can see that I found the terminal parameter (1/p) to be one parameter too far, so I set it to an independent parameter c.
I then watched the fruits of your labor take wing (hand-fitting data is no walk in the park, I find!)
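A sketch of the kind of nonlinear regression described here, using scipy's curve_fit on the transcribed DEATHS form. This is my own minimal reconstruction: the synthetic series, parameter values, and starting guesses are invented for illustration and are not Brian's actual data or fit.

```python
import numpy as np
from scipy.optimize import curve_fit

def deaths_model(day, K1, a, b, c):
    """Transcribed form DEATHS = K1/(1 + a*exp(a*b*(25 - day)))**(1/c),
    with the inflection day hardwired at 25 as in the text."""
    return K1 / (1.0 + a * np.exp(a * b * (25.0 - day))) ** (1.0 / c)

# Synthetic stand-in "data" generated from known parameters (not real deaths):
days = np.arange(0, 31, dtype=float)
y = deaths_model(days, 70000.0, 0.16, 1.2, 0.16)

# Fit from a nearby starting guess; a real fit would also weight by error bars.
popt, pcov = curve_fit(deaths_model, days, y,
                       p0=(60000.0, 0.2, 1.0, 0.2), maxfev=20000)
```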

Caveat note: *please* do not be misled by all the sig figs listed here. In reality these parameters have *much* lower precision than quoted above. But since I didn't do any extensive error analysis (since I actually did the fit by hand), I don't know the real uncertainty ranges on them, nor their mutual covariances, as their variations interact with each other. So I just kept an oversupply of guard digits in all calculations & report those raw numbers.

Because of the low p value (0.1555328), the d_n vs. y_n curve has most of its strong curvature near the low-n region, where the data number statistics are very poor.
I too noticed that fitting US DEATHS to an exponential curve in the early stages was unsatisfactory. I rationalized that, too often, early casualties attracted alternative labels like pneumonia, flu, or natural causes of various kinds. By happy chance I found the Oklahoma deaths were more tractable to exponential fitting.
It also means that for none of the data used is the actual effective daily growth value anywhere near as large as 0.8504. This g value is only the *projected* initial daily growth rate in the *limit* of the *infinite* past, n --> - ∞. The actual growth factors (where the data are) are typically around (d_n =) 0.37 (with over 50% noise) for early days near March 19, where the number statistics are extremely poor anyway. The p ≠ 1 model's projected value for y_∞ near 70000 is apparently much more believable than the anomalously low value, around 32000, projected last week by a naive logistic model from early April data. The model's asymptotic daily compounded decay rate toward the final bound is 1/(1 + g)^p - 1 = -0.09128, in the tail after nearly all the deaths have happened.

BTW, the US death data has a huge anomalous spike in the daily count for April 14 (I think because NY reported all its prior COVID-19 deaths happening outside of hospitals all at once on that day). This means a naive fit to the raw Worldometers death data will necessarily give bad results because of this reporting spike. So I took the liberty of massaging the data as follows. Out of the 6185 daily deaths reported for April 14, I guesstimated that probably no more than about 2800 of them were really from that day, and the remainder must have happened at some earlier time. Since I didn't know when those extra non-hospital deaths actually happened, I assumed they most likely would have been spread out over all the previous days in a relatively uniform way. So I increased the death counts of all days before April 14 uniformly by a constant *relative* 14% above their raw daily values, and this then accounted for all the excess deaths reported on April 14 above my guesstimate of 2800.

David Bowman
I observed that Worldometers noted on that date that US data would be extended to non-hospital cases and deaths, but I could not bring myself to massage the source data; instead I deleted the three entries immediately preceding the step increase, the ones most impacted.
And YES, this constitutes gardening the data too!
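David's uniform relative redistribution of the April 14 spike can be sketched as follows; the toy numbers here are invented for illustration (for the real series the computed relative boost came out near 14%):

```python
def redistribute_spike(daily, spike_index, assumed_true_value):
    """Move the excess count reported on one anomalous day onto all earlier
    days as a constant *relative* boost, preserving the cumulative total."""
    excess = daily[spike_index] - assumed_true_value
    prior_total = sum(daily[:spike_index])
    factor = 1.0 + excess / prior_total          # uniform relative boost
    adjusted = [d * factor for d in daily[:spike_index]]
    adjusted.append(assumed_true_value)
    adjusted.extend(daily[spike_index + 1:])
    return adjusted

# Toy daily series with an artificial reporting spike at index 4:
raw = [100.0, 200.0, 300.0, 400.0, 1500.0, 350.0]
smoothed = redistribute_spike(raw, 4, 500.0)   # boost factor is 2.0 here
```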

To confirm that I had a viable implementation of your splendid function, I added a FAKE data point several days in the future, to check that the decay curve could follow it well.
Actually two versions: one with a low future datapoint, "DAVID 1", and another, "DAVID 2", with a higher value for this FAKE future datapoint. The nonlinear regression I used was perfectly comfortable with either.
It will take a few more data points before I can expect the future exponential to be representative of future results. I placed plots for both variants on IMGUR.

Besides the usual fit for cumulative deaths, I add the plot of residuals by day, which one wants to show random scatter, and also those residuals against the expected bell-shaped distribution. As well, I took the differential of the fitted curve to find the peak day. [I expect I lost any more direct determination of the peak by using the aggregated-data curve function.]
This amounted to four plots for the low-ball target and four plots for the higher-ball target:
us deaths david 1
us deaths david 1 residuals
us deaths david 1 residuals vs expectation
us death rate david 1
us deaths david 2
us deaths david 2 residuals
us deaths david 2 residuals vs expectation
us death rate david 2

Thank you so much!

Brian W