RE: [R] Help with predict.lm

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Wed 20 Apr 2005 - 07:27:01 EST

On 19-Apr-05 Ted Harding wrote:
> On 19-Apr-05 Mike White wrote:

```>> Hi
>> I have measured the UV absorbance (abs) of 10 solutions
>> of a substance at known concentrations (conc) and have
>> used a linear model to plot a calibration graph with
>> confidence limits.  I now want to predict the concentration
>> of solutions with UV absorbance results given in the
>> new.abs data.frame, however predict.lm only appears to work
>> for new "conc" variables not new "abs" variables.
>> [...]
>>     conc<-seq(100, 280, 20) #  mg/l
>>     abs<-c(1.064, 1.177, 1.303, 1.414, 1.534, 1.642, 1.744,
>>            1.852, 1.936,2.046) # absorbance units
>>     lm.calibration<-lm(abs ~ conc)
>>     pred.w.plim <- predict(lm.calibration,  interval="prediction")
>>     pred.w.clim <- predict(lm.calibration,  interval="confidence")
>>     matplot(conc, cbind(pred.w.clim, pred.w.plim[,-1]),
>>        lty=c(1,2,2,3,3), type="l", ylab="abs", xlab= "conc mg/l")
>>     points(conc, abs, pch=21, col="blue")
>>
>>     new.abs<-data.frame(abs=c(1.251, 1.324, 1.452))

>> [...]
```

> The correct approach has in the past been the subject of
> at times quite controversial discussion, under the title
> indeed of "The Calibration Problem". Nowadays this problem
> would be approached by making the concentrations to be
> "predicted" additional unknown parameters, and evaluating
> likelihood ratios for possible values of these.
>
> I don't have time at the moment to go into this approach,
> but will try to write something later.

Here is the basic theory, and an associated method, for the case of estimating 1 value of X ("conc") corresponding to 1 given value of Y ("abs"), when data x1,...,xn and y1,...yn have been obtained (your data "conc" and "abs" above, respectively).

The model

y = a + b*x + e

is assumed, where the x-values are given (as typically in a calibration experiment -- e.g. measuring standard test samples of known y-value), and the y-values are measured according to the above, with errors e assumed distributed as N(0,s^2).

For simplicity, centre the x and y observations at their means, so that Sum[xi]=0 and Sum[yi]=0.

Let an observation Y of y be given.
To infer the corresponding value of X.
(X is measured from the mean of the xi, Y from the mean of the yi).

Let X play the role of an additional parameter in the problem. Then the likelihood function for (a,b,s,X) is

(1/(s^(n+1))*exp(-(Sum[(y-a-b*x)^2] + (Y-a-b*X)^2)/s^2)

The MLEs of a#, b#, s#, X# of a, b, s, X are then given by

a# = 0

b# = (Sum[xy])/(Sum[x^2])

X# = Y/b#

(s#)^2 = (1/(n+1))Sum[(y - (b#)*x)^2

Now suppose that X is set at a fixed value (again denoted by X). Then the MLEs a~, b~, s~ of a, b and s are now given by

a~ = (Y*Sum[x^2] - X*Sum[x*y])/D

b~ = ((n+1)*Sum[x*y] + n*X*Y)/D

(s~)^2 = (Sum[(y - a~ - (b~)*x)^2] + (Y - a~ - (b~)*X)^2)/(n+1)

where

D = (n+1)*Sum[x^2] + n*X^2

The likelihood ratio (profile likelihood for X) of the hypothesis that X has a fixed value (X) versus the hypothesis that X might be anything is therefore

((s#)/(s~))^(n+1)

which depends only on

R(X) = (s#)^2/(s~)^2

where s~ (but not s#) depends on X.

Now

(n-1)*((s~)^2 - (s#)^2)/(s#^2) = (n-2)*(1 - R(X))/R(X)

has the F distribution with 1 and (n-1) d.f., quite independent of the true values of a, b and s^2, and large values of R(X) correspond to small values of this "F ratio".

Hence the MLE of X is Y/B# and a confidence set for X at a given confidence level P0 is the set of all X such that

(n-2)*(1 - R(X))/R(X) < qf(1-P0,1,n-2)

(in the notation of R's F functions pf, qf, rf, etc.)

R.calib <- function(x,y,X,Y){
n<-length(x) ; mx<-mean(x) ; my<-mean(y) ;   x<-(x-mx) ; y<-(y-my) ; X<-(X-mx) ; Y<-(Y-my)

ah<-0 ; bh<-(sum(x*y))/(sum(x^2)) ; Xh <- Y/bh

sh2 <- (sum((y-ah-bh*x)^2))/(n+1)

D<-(n+1)*sum(x^2) + n*X^2
at<-(Y*sum(x^2) - X*sum(x*y))/D; bt<-((n+1)*sum(x*y) + n*X*Y)/D   st2<-(sum((y - at - bt*x)^2) + (Y - at - bt*X)^2)/(n+1)

R<-(sh2/st2)

F<-(n-2)*(1-R)/R

```  x<-(x+mx) ; y<-(y+my) ;
X<-(X+mx) ; Y<-(Y+my) ; Xh<-(Xh+mx) ;
PF<-(pf(F,1,(n-2)))
list(x=x,y=y,X=X,Y=Y,R=R,F=F,PF=PF,
ahat=ah,bhat=bh,sh2=sh2,
atil=at,btil=bt,st2=st2,
Xhat=Xh)
```

}

Now lets take your original data and the first Y-value in your list (namely Y = 1.251), and suppose you want a 95% confidence interval for X. The X-value corresponding to Y which you would get by regressing x (conc) on y (abs) is X = 131.3813 so use this as a "starting value".

So now run this function with x<-conc, y<-abs, and these values of X and Y:

R.calib(x,y,131.3813,1.251)

You get a long list of stuff, amongst which

\$PF
[1] 0.02711878

and

\$Xhat
[1] 131.2771

So now you know that Xhat (the MLE of X for that Y) = 131.2771 and the F-ratio probability is 0.027...

You want to push \$PF upwards till it reaches 0.05, so work *outwards* in the X-value:

R.calib(x,y,131.4000,1.251)\$PF
[1] 0.03198323

R.calib(x,y,131.4500,1.251)\$PF
[1] 0.0449835

...

R.calib(x,y,131.4693,1.251)\$PF
[1] 0.04999851

R.calib(x,y,131.2771,1.251)\$PF
[1] 1.987305e-06

R.calib(x,y,131.2000,1.251)\$PF
[1] 0.02005908

R.calib(x,y,131.1000,1.251)\$PF
[1] 0.04604698

...

R.calib(x,y,131.0847,1.251)\$PF
[1] 0.0500181

So now you have the MLE Xhat = 131.2771, and the two limits of a 95% confidence interval (131.0847, 131.4693) for X, corresponding to the given value 1.251 of Y.

I leave it to you (or any other interested parties) to wrap this basic method in a proper routine which will avoid the above groping (but at least it shows explicitly what's going on).

As a refinement: In you original query, you put up three values of Y: 1.251, 1.324, 1.452

Now you could, of course, simply apply the above method to each one separately as above.

However, the same theoretical approach can be used to obtain a joint confidence region for the three corresponding X values jointly, and this is theoretically more correct, since these are 3 extra parameters and they will have a covariance structure. You would need to work through the algebra for k such Y-values and the corresponding X-values.

In practice this may not matter much -- probably not at all for your data.

Note: The above analysis of the single-case situation was exhibited in an address I gave to a modelling symposium in 1985, subsequently published along with the other presentations in a special number of The Statistician:

Modelling: the classical approach
The Statistician (1986) vol 35, pp. 115-134

NB that the equation for alpha-tilde (corresponding to a~ above) in that article has a misprint ("Y" missing before Sum(x^2)).

This analysis had something in common with work by Phil Brown:

P.J. Brown (1982) Multivariate Calibration   JRSS Ser. B, vol 44, pp. 287-321

though Brown's approach did not start from the likelihood ratio principle.

Hoping this helps!
Best wishes,
Ted.

E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
```Date: 19-Apr-05                                       Time: 22:27:01
------------------------------ XFMail ------------------------------

______________________________________________
```
R-help@stat.math.ethz.ch mailing list