Re: [R] lm ~ v1 + log(v1) + ... improve adj Rsq ¿any sense?

From: Mike Marchywka <>
Date: Tue, 22 Mar 2011 21:55:55 -0400

> Date: Tue, 22 Mar 2011 09:31:01 -0700
> From:
> To:
> Subject: [R] lm ~ v1 + log(v1) + ... improve adj Rsq ¿any sense?
> Dear all,
> I want to improve my adjusted R-squared. I've checked some established
> models and they include the same variable twice, once transformed and once
> not. It also improves my adjusted R-squared.
> But isn't this bad for collinearity? Do I interpret the coefficients as
> usual?

I'm not sure how many replies you got or whether your question was answered, but offhand let me see if I understand your concern. If your data cover only a limited range of v1, over which log(v1) is well approximated by the linear term of its Taylor expansion, then sure, it can be hard to tell a linear dependence from a log dependence or to quantify a mixture of the two. If you try to find a and b to fit y = a*f(x) + b*g(x) so as to minimize some error, you should be able to see the issues on paper.

Presumably log is not linear over a larger range, and any error function, like SSE, would have a "reasonably" peaked minimum for some values of the two coefficients, but you could do a sensitivity analysis to check: find the second derivatives of your error function, or just perturb the coefficients a bit. I guess if there is some direction where the error does not change as a and b vary, then you have the case you are worried about.

I'm not sure what you consider to be "usual", but when I'm doing something like this I usually have some physical interpretation in mind. Most uninformatively, you could interpret these coefficients as those which minimize your error given the data you have :) What you do from there depends on a lot of specifics. To tell whether a given function seems to be appropriate for the data, it is always good to look at a plot of residuals.

Note that the ability to find a unique set of coefficients that minimizes a given error does not require the two terms attached to the coefficients to be uncorrelated; it only requires that they not be exactly linearly dependent over your data. Polynomial fits are a common example (log having a Taylor series just constrains a lot of coefficient relationships LOL).
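To make the sensitivity check concrete, here is a minimal sketch in R using simulated data, not yours. The variable names (x_narrow, x_wide), the helper curvature(), the ranges and the "true" coefficients are all made up for illustration. It relies on the fact that for y = c + a*x + b*log(x) the Hessian of SSE with respect to the coefficients is 2*X'X, so the ratio of its largest to smallest eigenvalue gives a rough idea of how flat the error surface is in its worst direction.

```r
# Hypothetical simulated data; nothing here comes from the original poster.
set.seed(1)

# Hessian of SSE for y = c + a*x + b*log(x) is 2 * X'X. Its eigenvalues
# measure how sharply the minimum is peaked along each direction.
curvature <- function(x) {
  X  <- cbind(1, scale(x), scale(log(x)))  # standardise so units do not dominate
  ev <- eigen(2 * crossprod(X), symmetric = TRUE, only.values = TRUE)$values
  list(cor = cor(x, log(x)), ratio = max(ev) / min(ev))
}

x_narrow <- runif(200, 9, 11)   # log(x) is almost linear on this range
x_wide   <- runif(200, 1, 100)  # log(x) is visibly curved on this range

narrow <- curvature(x_narrow)
wide   <- curvature(x_wide)
print(rbind(narrow = unlist(narrow), wide = unlist(wide)))

# Fit on the wide range, then perturb one coefficient to see how SSE responds.
y   <- 2 - 0.3 * x_wide + 2.5 * log(x_wide) + rnorm(200)
fit <- lm(y ~ x_wide + log(x_wide))
sse <- function(beta) sum((y - model.matrix(fit) %*% beta)^2)
b0  <- coef(fit)
sse_min   <- sse(b0)
sse_shift <- sse(b0 + c(0, 0.01, 0))  # nudge the linear coefficient only
print(c(sse_min = sse_min, sse_shift = sse_shift))
```

A correlation very close to 1 together with a large eigenvalue ratio would suggest the data cannot separate the linear and log terms well. Something like plot(fitted(fit), resid(fit)) would be the natural follow-up to look at the residuals.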

P-values and confidence intervals are another matter with post hoc exploratory work, but I'll let a statistician comment on that, as well as on the meaning of the R output.
Usually the final decision on a putative model improvement comes from your ability to infer something about the underlying system, although you may just want a simple empirical approximation and be more worried about meeting a given error with a limited number of computations, etc.

Apparently you found in a retrospective literature search that everyone else is using the log term.
Sometimes you see people ask questions like, "Given that in 10 papers on the subject 4 of them used the log term and these authors have historically been right 50 percent of the time, but the other 6 are right 40 percent of the time, what are the chances that the log term should be included?" I will also avoid commenting on this question, except to say that it illustrates one of a number of ways people do approach these problems, and that you need to decide what is relevant to your situation.

>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  1.73140    7.22477   0.240  0.81086
> v1          -0.33886    0.20321  -1.668  0.09705 .
> log(v1)      2.63194    3.74556   0.703  0.48311
> v2          -0.01517    0.01089  -1.394  0.16507
> log(v3)     -0.45719    0.27656  -1.653  0.09995 .
> factor1     -1.81517    0.62155  -2.920  0.00392 **
> factor2     -1.87330    0.84375  -2.220  0.02759 *
>
> Analysis of Variance Table
>
> Response: height rise
>            Df Sum Sq Mean Sq F value    Pr(>F)
> v1          1  51.25  51.246 21.4128 6.842e-06 ***
> log(v1)     1  13.62  13.617  5.6897  0.018048 *
> v2          1   2.84   2.836  1.1850  0.277713
> log(v3)     1   3.02   3.024  1.2638  0.262357
> factor1     1  17.62  17.616  7.3608  0.007279 **
> factor2     1  11.80  11.797  4.9294  0.027586 *
> Residuals 190 454.71   2.393
> Thanks,
> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.
Received on Wed 23 Mar 2011 - 02:09:02 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 23 Mar 2011 - 02:10:25 GMT.
