[R] R: Optimization

From: Simone Vincenzi <svincenz_at_nemo.unipr.it>
Date: Wed 06 Sep 2006 - 00:10:34 EST


Thanks for the help.
I thought about taking the logarithms of the expression but to cope with the problem of zero values both in yield and the values of the environmental variables another transformation is needed (I simply add one to the yield value and to the environmental variables values). I follow the suggestion kindly provided by Spencer Graves but I found the following problems, which depend probably on my understanding:
1) I don't understand why it is needed to test the no-constant model (which provides a better fitting of the model). Can I simply assume a no-constant model?
2) Here comes the big problem. I don't understand what it is done with the models fit.0 and fit.1. In both cases I have to leave out one variable from the linear model but I don't understand why in this way I would test the constraint that all the coefficients should sum to 1. And it is not possible to apply anova(fit0,fit.0) and anova(fit1,fit.1) because I have different response variables.

In conclusion, I don't actually know how to proceed and if in R this kind of analysis is possible.
Any help and any further explanation would be very appreciated. Below I report part of the data frame I'm using for the analysis. The environmental variables (Salinity, Hydrodynamism, Sediment, Oxygen, Chlorophyll and Bathymetry) values have been transformed using suitability function (and thus are bounded between 0 and 1) while Yield is in kg/m2.

   Sedi Sal Bathy Chl Hydro Oxy Yield

1  1.00 1.00 0.51   1 0.17 0.75 1.5
2  0.50 0.95 1.00   1 0.09 0.94 0.4
3  0.50 1.00 0.17   1 0.44 0.90 1.8
4  1.00 0.98 1.00   1 0.10 0.89 4.5
5  0.13 0.84 0.73   1 0.16 0.84 0.4
6  0.50 0.90 0.91   1 0.22 0.84 0.4
7  0.13 0.75 1.00   1 0.14 0.86 0.2
8  0.13 0.84 0.75   1 0.10 0.83 0.3
9  0.13 0.78 0.97   1 0.06 0.84 0.5
10 0.13 0.87 0.70   1 0.45 0.85 1.0
11 1.00 0.77 1.00   1 0.19 0.86 1.5
12 1.00 0.94 0.81   1 0.47 0.86 3.0
13 1.00 0.93 1.00   1 0.45 0.89 2.5
14 0.50 1.00 1.00   1 0.54 0.84 4.0
15 0.50 1.00 1.00   1 0.25 0.88 2.2
16 1.00 1.00 0.56   1 0.25 0.90 5.0
17 1.00 0.90 0.56   1 0.40 0.90 1.5
18 0.50 0.97 1.00   1 0.22 0.95 1.0
19 0.54 0.96 1.00   1 0.18 0.91 0.3
20 1.00 0.97 0.33   1 0.39 0.90 3.0


And here the results of the fitted models so far.

summary(fit0)

Call:
lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) + log(Chl +

  1. + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1) - 1, data = data.df)

Residuals:

     Min 1Q Median 3Q Max -1.05485 -0.23759 0.01331 0.18692 1.23803

Coefficients:

              Estimate Std. Error t value Pr(>|t|)

log(Sal + 1)   5.07193    1.00584   5.042 1.98e-06 ***
log(Bathy + 1)  0.05804    0.20684   0.281 0.779561    
log(Chl + 1)  -8.53941    2.11720  -4.033 0.000106 ***
log(Hydro + 1)  1.73835    0.28815   6.033 2.56e-08 ***
log(Oxy + 1)   4.19951    1.98459   2.116 0.036750 *  
log(Sedi + 1)  1.15953    0.14807   7.831 4.51e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.3735 on 103 degrees of freedom
Multiple R-Squared: 0.9148,     Adjusted R-squared: 0.9098 
F-statistic: 184.2 on 6 and 103 DF,  p-value: < 2.2e-16 


> summary(fit1)
Call: lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) + log(Chl + 1) + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1), data = (data.df)) Residuals: Min 1Q Median 3Q Max -1.057766 -0.227720 0.006146 0.192200 1.222543 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.49400 4.76603 -0.943 0.3479 log(Sal + 1) 5.13499 1.00861 5.091 1.63e-06 *** log(Bathy + 1) 0.06685 0.20716 0.323 0.7476 log(Chl + 1) -2.48909 6.75718 -0.368 0.7134 log(Hydro + 1) 1.70674 0.29025 5.880 5.22e-08 *** log(Oxy + 1) 4.63758 2.03928 2.274 0.0251 * log(Sedi + 1) 1.14005 0.14959 7.621 1.34e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.3737 on 102 degrees of freedom Multiple R-Squared: 0.7589, Adjusted R-squared: 0.7447 F-statistic: 53.51 on 6 and 102 DF, p-value: < 2.2e-16
> summary(fit.0)
Call: lm(formula = log((Yield + 1)/(Sedi + 1)) ~ log((Sal + 1)/(Sedi + 1)) + log((Bathy + 1)/(Sedi + 1)) + log((Chl + 1)/(Sedi + 1)) + log((Hydro + 1)/(Sedi + 1)) + log((Oxy + 1)/(Sedi + 1)) - 1, data = subset(data.df)) Residuals: Min 1Q Median 3Q Max -1.5762 -0.2591 0.1065 0.5023 1.4533 Coefficients: Estimate Std. Error t value Pr(>|t|) log((Sal + 1)/(Sedi + 1)) 3.0170 1.5304 1.971 0.05134 . log((Bathy + 1)/(Sedi + 1)) -0.3982 0.3139 -1.268 0.20748 log((Chl + 1)/(Sedi + 1)) 8.8137 2.3941 3.681 0.00037 *** log((Hydro + 1)/(Sedi + 1)) 0.3309 0.4066 0.814 0.41760 log((Oxy + 1)/(Sedi + 1)) -12.5242 2.1881 -5.724 1.01e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.5766 on 104 degrees of freedom Multiple R-Squared: 0.523, Adjusted R-squared: 0.5001 F-statistic: 22.8 on 5 and 104 DF, p-value: 2.179e-15
> summary(fit.1)
Call: lm(formula = log((Yield + 1)/(Sedi + 1)) ~ log((Sal + 1)/(Sedi + 1)) + log((Bathy + 1)/(Sedi + 1)) + log((Chl + 1)/(Sedi + 1)) + log((Hydro + 1)/(Sedi + 1)) + log((Oxy + 1)/(Sedi + 1)), data = subset(data.df)) Residuals: Min 1Q Median 3Q Max -1.05552 -0.23441 0.01263 0.18355 1.24473 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.84916 0.15474 11.950 < 2e-16 *** log((Sal + 1)/(Sedi + 1)) 5.03863 1.00977 4.990 2.46e-06 *** log((Bathy + 1)/(Sedi + 1)) 0.05279 0.20767 0.254 0.7999 log((Chl + 1)/(Sedi + 1)) -10.96682 2.27270 -4.825 4.86e-06 *** log((Hydro + 1)/(Sedi + 1)) 1.74631 0.28980 6.026 2.64e-08 *** log((Oxy + 1)/(Sedi + 1)) 3.95938 1.98206 1.998 0.0484 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.3751 on 103 degrees of freedom Multiple R-Squared: 0.5606, Adjusted R-squared: 0.5393 F-statistic: 26.28 on 5 and 103 DF, p-value: < 2.2e-16
> anova(fit1,fit0)
Analysis of Variance Table Res.Df RSS Df Sum of Sq F Pr(>F) 1 102 14.2409 2 103 14.3650 -1 -0.1241 0.8891 0.3479 Thanks for the help Simone Vincenzi _________________________________________ Simone Vincenzi, PhD Student Department of Environmental Sciences University of Parma Parco Area delle Scienze, 33/A, 43100 Parma, Italy Phone: +39 0521 905696 Fax: +39 0521 906611 e.mail: svincenz@nemo.unipr.it -----Messaggio originale----- Da: Spencer Graves [mailto:spencer.graves@pdf.com] Inviato: luned́ 4 settembre 2006 21.31 A: Simone Vincenzi Cc: r-help@stat.math.ethz.ch Oggetto: Re: [R] Optimization Have you considered talking logarithms of the expression you mentioned: log(Yield) = a1*log(A)+b1*log(B)+c2*log(C)+... where a1 = a/(a+b+...), etc. This model has two constraints not present in ordinary least squares: First, the intercept is assumed to be zero. Second, the coefficients in this log formulation must sum to 1. If I were you, I might use something like "lm" to test them both. To explain how, I'll modify the notation, replacing A by X1, B by X2, ..., up to Xkm1 (= X[k-1]) and Xk for k different environmental variables. Then I might try something like the following: fit0 <- lm(log(Yield) ~ log(X1) + ... + log(Xk)-1 ) fit1 <- lm(log(Yield) ~ log(X1) + ... + log(Xk) ) fit.1 <- lm(log(Yield/Xk) ~ log(X1/Xk) + ... + log(Xkm1/Xk) ) fit.0 <- lm(log(Yield/Xk) ~ log(X1/Xk) + ... + log(Xkm1/Xk)-1 ) anova(fit1, fit0) would test the no-constant model, and if I haven't made a mistake in this, anova(fit0, fit.0) and anova(fit1, fit.1) would test the constraint that all the coefficients should sum to 1. If you would like further help from this listserve, please provide commented, minimal, self-contained, reproducible code to help potential respondents understand your question and concerns (as suggested in the posting guide "www.R-project.org/posting-guide.html"). Hope this helps. Spencer Graves Simone Vincenzi wrote:
> Dear R-list,
> I'm trying to estimate the relative importance of 6 environmental
variables
> in determining clam yield. To estimate clam yield a previous work used the
> function Yield = (A^a*B^b*C^c...)^1/(a+b+c+...) where A,B,C... are the
> values of the environmental variables and the weights a,b,c... have not
been
> calibrated on data but taken from literature. Now I'd like to estimate the
> weights a,b,c... by using a dataset with 110 observations of yield and
> values of the environmental variables. I'm wondering if it is feasible or
if
> the number of observation is too low, if some data transformation is
needed
> and which R function is the most appropriate to try to estimate the
weights.
> Any help would be greatly appreciated.
>
> Simone Vincenzi
>
> _________________________________________
> Simone Vincenzi, PhD Student
> Department of Environmental Sciences
> University of Parma
> Parco Area delle Scienze, 33/A, 43100 Parma, Italy
> Phone: +39 0521 905696
> Fax: +39 0521 906611
> e.mail: svincenz@nemo.unipr.it
>
>
>
> --
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
-- No virus found in this incoming message. -- ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Received on Wed Sep 06 02:04:04 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 06 Sep 2006 - 15:25:05 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.