Re: [R] pairwise cross tabulation tables

From: Charles C. Berry <cberry_at_tajo.ucsd.edu>
Date: Thu, 10 Jan 2008 16:23:35 -0800

On Thu, 10 Jan 2008, AndyZon wrote:

>
> Thank you so much, Chuck!
>
> This is brilliant, I just tried some dichotomous variables, it was really
> fast.

Yes, and if you are on a multicore system with multithreaded linear algebra, crossprod() will distribute the job across the cores making the elapsed time shorter (by almost half on my Core 2 Duo MacBook as long as I have nothing else gobbling up CPU cycles)!

>
> Most categorical variables I am interested in are 3 levels, they are
> actually SNPs, I want to look at their interactions. My question is: after
> generating 0-1 codings, like 00, 01, 10, how should I use "crossprod()"?
> Should I just apply this function on these 2*n columns (originally I have n
> variables), and then operate on the generated cell counts?

If I followed you here, and you have ONLY those three categories, then yes.

Try a test case with perhaps 3 SNPs and a few subjects. Table the results the old-fashioned way via table() or xtabs() or even by hand. Then look at what crossprod( test.case ) gives you.
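
A minimal sketch of such a test case, in case it helps. The data are made up, genotypes are assumed to be coded 0/1/2 with 0 as the reference, and the names 'snps', 'test.case' and 'blk' are only for illustration:

```r
set.seed(1)
n.subj <- 12

# Hypothetical test data: 3 SNPs, genotypes coded 0, 1, 2
snps <- matrix(sample(0:2, n.subj * 3, replace = TRUE), ncol = 3,
               dimnames = list(NULL, c("snp1", "snp2", "snp3")))

# Two zero-one columns per SNP: indicators for genotype 1 and genotype 2
# (genotype 0 is the reference category and gets no column)
test.case <- do.call(cbind, lapply(colnames(snps), function(nm) {
  z <- cbind(snps[, nm] == 1, snps[, nm] == 2) * 1
  colnames(z) <- paste(nm, 1:2, sep = ".")
  z
}))

res <- crossprod(test.case)

# The old-fashioned way, for one pair of SNPs
tab12 <- table(factor(snps[, "snp1"], levels = 0:2),
               factor(snps[, "snp2"], levels = 0:2))

# The snp1 x snp2 block of 'res' should equal the non-reference cells of tab12
blk <- res[c("snp1.1", "snp1.2"), c("snp2.1", "snp2.2")]
all(blk == unclass(tab12)[2:3, 2:3])
```

The cells involving the reference genotype are not in 'res' directly; they follow from the margins and the number of subjects.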

---

If '11' shows up you'll have to use a 'contr.treatment' style 
approach. ( run 'example( contrasts )' and look at what is going on).

Guess what these give:

 	contrasts( factor( c( "00","01","10" ) ) )
 	contrasts( factor( c( "00","01","10","11" ) ) )

then run them if you have trouble seeing why '11' changes the picture.

---

BTW, what I said (below) suggests that crossprod() returns integer 
values, but its storage.mode is actually "double".
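
A quick check of that, plus the arithmetic behind the size quoted below, assuming 8 bytes per double (the toy matrix 'x' is made up):

```r
# A small integer zero-one matrix
x <- matrix(c(1L, 0L, 1L, 1L, 0L, 1L), ncol = 2)
storage.mode(x)               # "integer"

res.small <- crossprod(x)
storage.mode(res.small)       # "double"

# Size of a 10000 x 10000 result stored as doubles, in units of 2^30 bytes
gib.dichotomous <- 1e4^2 * 8 / 2^30
gib.dichotomous               # about 0.745
```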

HTH,

Chuck

> I am confused about this.
>
> Your input will be greatly appreciated.
>
> Andy

>
>
>
> Charles C. Berry wrote:
>>
>> On Wed, 9 Jan 2008, AndyZon wrote:
>>
>>>
>>> Hi,
>>>
>>> I have a huge number of categorical variables, say at least 10000, and
>>> I put them into a matrix, each column is one variable. The question is:
>>> how can I make all of the pairwise cross tabulation tables efficiently?
>>> The straightforward solution is to use for-loops by looping two indexes
>>> on the table() function, but it was just too slow. Is there a more
>>> efficient way to do that? Any guidance will be greatly appreciated.
>>
>> The totals are merely the crossproducts of a suitably constructed binary
>> (zero-one) matrix that is used to encode the categories. See
>> '?contr.treatment' if you cannot grok 'suitably constructed'.
>>
>> If the categories are all dichotomies coded as 0:1, you can use
>>
>> res <- crossprod( dat )
>>
>> to find the totals for the (1,1) cells
>>
>> If you need the full tables, you can get them from the marginal totals
>> using
>>
>> diag( res )
>>
>> to get the number in each '1' margin and
>>
>> nrow(dat)
>>
>> to get the table total, from which the numbers in each '0' margin are
>> obtained by subtracting the corresponding '1' margin.
>>
>> With dichotomous variables, dat has 10000 columns and you will only need
>> 10000^2 integers or about 0.75 Gigabytes to store the 'res'. And it takes
>> about 20 seconds to run 1000 rows on my MacBook. Of course, 'res' has a
>> redundant triangle
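
The steps quoted above for dichotomous variables can be sketched as follows. The data are simulated, and 'n11', 'n10', 'n01', 'n00' are illustrative names rather than anything from the original post:

```r
set.seed(2)
dat <- matrix(rbinom(20 * 4, 1, 0.4), ncol = 4)   # made-up 0/1 data
p   <- ncol(dat)
n   <- nrow(dat)                                  # table total

res <- crossprod(dat)                  # counts for the (1,1) cells
m1  <- diag(res)                       # number in each '1' margin
M1  <- matrix(m1, nrow = p, ncol = p)  # M1[i, j] = m1[i]

n11 <- res
n10 <- M1 - res                        # variable i is 1, variable j is 0
n01 <- t(M1) - res                     # variable i is 0, variable j is 1
n00 <- n - n11 - n10 - n01             # both are 0

# Rebuild the full 2 x 2 table for one pair and compare with table()
i <- 1; j <- 2
full <- matrix(c(n00[i, j], n10[i, j], n01[i, j], n11[i, j]), 2, 2,
               dimnames = list(c("0", "1"), c("0", "1")))
tab <- table(factor(dat[, i], levels = 0:1), factor(dat[, j], levels = 0:1))
all(full == unclass(tab))
```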
>>
>> This approach generalizes to any number of categories:
>>
>> To extend this to more than two categories, you will need to do for each
>> such column what model.matrix(~factor( dat[,i] ) ) does by default
>> ( using 'contr.treatment' ) - construct zero-one codes for all but one
>> (reference) category.
>>
>> Note that with 10000 trichotomies, you will have a result with
>>
>> 10000^2 * ( 3-1 )^2
>>
>> integers needing about 3 Gigabytes, and so on.
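
A rough sketch of the model.matrix() coding described above, on made-up three-level data. The level labels "AA", "AB", "BB" and the column naming scheme are assumptions for illustration only:

```r
set.seed(3)
lev <- c("AA", "AB", "BB")
dat <- matrix(sample(lev, 15 * 3, replace = TRUE), ncol = 3)

# For each column, take the default (contr.treatment) coding from
# model.matrix() and drop the intercept, leaving zero-one codes for
# all but the reference category
codes <- lapply(seq_len(ncol(dat)), function(i) {
  f  <- factor(dat[, i], levels = lev)
  mm <- model.matrix(~ f)[, -1, drop = FALSE]
  colnames(mm) <- paste0("v", i, ".", levels(f)[-1])
  mm
})
X   <- do.call(cbind, codes)
res <- crossprod(X)
dim(res)                              # 3 variables * (3 - 1) codes = 6 each way

# Size of the result for 10000 trichotomies stored as doubles,
# in units of 2^30 bytes
gib.trichotomous <- 1e4^2 * (3 - 1)^2 * 8 / 2^30
gib.trichotomous                      # about 2.98
```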
>>
>> HTH,
>>
>> Chuck
>>
>> p.s. Why on Earth are you doing this????
>>
>>
>>>
>>> Andy
>>> --
>>> View this message in context:
>>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14723520.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> Charles C. Berry (858) 534-2098
>> Dept of Family/Preventive
>> Medicine
>> E mailto:cberry_at_tajo.ucsd.edu UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
>>
>>
>>
>
> --
> View this message in context: http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14744086.html
> Sent from the R help mailing list archive at Nabble.com.
>
>
Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry_at_tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
Received on Fri 11 Jan 2008 - 00:26:23 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 12 Jan 2008 - 00:30:06 GMT.
