[R] Help with colinearity problem in multiple linear regression

From: Caleb Welton <cwelton_at_greenplum.com>
Date: Thu, 06 Mar 2008 11:16:07 -0800


Hello,

  For basic linear regression lm() does the job well, for datasets that are larger than memory biglm() seems to work.

  I'm working on a parallel implementation of multiple linear regression for datasets that are too large for memory.

  Currently I am working over least squares:     calculating: t(X) %*% X and t(X) %*% y     separately in parallel on each node

  This generates a small nxn matrix and an n vector which can then be brought together, summed, and solved for the linear regression result:

   ( solve(t(X) %*% X) ) %*% ( t(X) %*% y )

  Except that this can have problems with t(X) %*% X being singular.

  My understanding is that both lm() and biglm() deal with this via using QR decomposition, but I can't figure out how this technique would be combined with intermediate summary results which is required for the parallelism technique to work.

  Does anyone have any pointers how this might be accomplished?

Thanks,
  Caleb



# Bigger example

# Introduce artificial colinearity to test case
P = cbind(USArrests[2]*2, USArrests)
names(P) = c("Introduced", "Murder", "Assault", "UrbanPop", "Rape")

# Split the data into partitions to be calculated separately
# In the real case the full data would have been too large and
# each instance would have received just a chunk of data.
P1 = P[1:25,]
P2 = P[26:50]

# Partition 1 - calculated on host 1

P1_X = as.matrix(cbind(1, P1[,c("UrbanPop", "Assault", "Rape", "Introduced")]))

P1_A = t(P1_X) %*% P1_X
P1_b = t(P1_X) %*% P1[,"Murder"]

# Partition 2 - calculated on host 2

P2_X = as.matrix(cbind(1, P2[,c("UrbanPop", "Assault", "Rape", "Introduced")]))

P2_A = t(P2_X) %*% P2_X
P2_b = t(P2_X) %*% P2[,"Murder"]

# repeat for partitions 3:N
# ...

# collect data from the all the partitions, this is the key for
# the parallelism technique, I don't know how to do this step if
# I was somehow doing QR decomposition.

A = P1_A + P2_A      # ... + P3_A  ... + PN_A
b = P1_b + P2_b      # ... + P3_b  ... + PN_b

# calculate regression, this fails because of sigularity
solve(A) %*% b

# If I exclude the introduced column it works, but I'm not
# sure how this would be generalized.

solve(A[-5,-5]) %*% b[-5]

# Compare to lm()

lm(Murder~UrbanPop+Assault+Rape+Introduced, P)



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu 06 Mar 2008 - 19:40:36 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 06 Mar 2008 - 20:30:19 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive