Re: [R] pairwise cross tabulation tables

From: Charles C. Berry <cberry_at_tajo.ucsd.edu>
Date: Sat, 12 Jan 2008 10:59:41 -0800

On Fri, 11 Jan 2008, AndyZon wrote:

>
>
> Another question:
>
> Is it possible to make 3-variable cross tabulations (I mean, plus another
> dichotomous variable), such as 2*3*3, 3*2*3, or 3*3*2? Can we do it in
> similar ways as just 2-variables?

If you mean to stratify all of the other variables by just a single third variable at a time, yes.

You want each row of the matrix you submit to crossprod() to be the elements of the outer product of the corresponding rows of two binary matrices. One of those is the binary representation (a la model.matrix( ~third.factor-1 )) of the third variable and the other is the binary representation of all the other variables (as described in my earlier posts).

The package 'tensor' has utilities for manipulating tensors (hmm, was that obvious?) to construct a matrix like that to submit to crossprod().

However, you might just roll your own function.
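
A rough sketch of rolling your own, assuming the other variables are dichotomies held in a 0-1 matrix. The toy data and the names dat, z, w and res3 are made up for illustration. Each block of columns of w is dat multiplied by one indicator column of the third variable, which is the row-wise outer product described above:

```r
# Hypothetical toy data: 3 dichotomous variables plus a 2-level third factor.
set.seed(1)
n <- 50
dat <- matrix(rbinom(n * 3, 1, 0.5), nrow = n,
              dimnames = list(NULL, c("a", "b", "c")))
third.factor <- factor(sample(c("lo", "hi"), n, replace = TRUE))

# Zero-one indicator columns, one per level of the third variable.
z <- model.matrix(~ third.factor - 1)

# Row-wise outer product: block j is dat with each row multiplied by z[, j].
w <- do.call(cbind, lapply(seq_len(ncol(z)), function(j) dat * z[, j]))
colnames(w) <- paste(rep(levels(third.factor), each = ncol(dat)),
                     colnames(dat), sep = ":")

# (1,1) cell counts for every pair of variables within each stratum.
res3 <- crossprod(w)

# Check one cell by hand: a = 1 and b = 1 among subjects in stratum "hi".
by.hand <- sum(dat[, "a"] == 1 & dat[, "b"] == 1 & third.factor == "hi")
res3["hi:a", "hi:b"] == by.hand
```

The blocks of res3 that pair columns from different strata are all zero, so crossprod( w, dat ) would give the same counts in less space.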

If you want to do all possible three-way combinations, you are looking at a very large problem - for 10000 binary variables, you will need a vector with more than choose(10000,3) elements - needing more than a terabyte as doubles - to store the result.
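
The arithmetic behind that estimate, as a small sketch that assumes just one 8-byte double per three-variable combination (the full tables would need more); the names n.triples and tb are made up here:

```r
n.triples <- choose(10000, 3)   # distinct three-variable combinations
n.triples                       # about 1.67e11
tb <- n.triples * 8 / 1e12      # terabytes at 8 bytes per double
tb                              # about 1.33
```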

If this is really what you want, I am unclear why you would want to do this. I suspect that the genetical problem you wish to solve is much smaller than this or that an approach that uses less brute force would suffice.

HTH, Chuck

>
> Thank you very much!
>
> Andy
>
>
> Charles C. Berry wrote:
>>
>> On Thu, 10 Jan 2008, AndyZon wrote:
>>
>>>
>>> Thank you so much, Chuck!
>>>
>>> This is brilliant, I just tried some dichotomous variables, it was really
>>> fast.
>>
>>
>> Yes, and if you are on a multicore system with multithreaded linear
>> algebra, crossprod() will distribute the job across the cores making the
>> elapsed time shorter (by almost half on my Core 2 Duo MacBook as long as I
>> have nothing else gobbling up CPU cycles)!
>>
>>>
>>> Most categorical variables I am interested in are 3 levels, they are
>>> actually SNPs, I want to look at their interactions. My question is: after
>>> generating 0-1 codings, like 00, 01, 10, how should I use "crossprod()"?
>>> Should I just apply this function on these 2*n columns (originally I have n
>>> variables), and then operate on the generated cell counts?
>>
>>
>> If I followed you here, and you have ONLY those three categories, then
>> yes.
>>
>> Try a test case with perhaps 3 SNPs and a few subjects. Table the results
>> the old fashioned way via table() or xtabs() or even by hand. Then look at
>> what crossprod( test.case ) gives you.
>>
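
One possible test case along those lines, as a sketch only. The genotypes below are invented, and the names snp, cp and tab are hypothetical; test.case holds two 0-1 columns per SNP, with "00" as the reference category:

```r
# Hypothetical test case: 3 SNPs, 8 subjects, genotypes coded "00", "01", "10".
snp <- data.frame(
  s1 = c("00", "01", "10", "01", "00", "10", "01", "00"),
  s2 = c("01", "01", "00", "10", "00", "10", "01", "10"),
  s3 = c("10", "00", "01", "01", "00", "10", "10", "01"),
  stringsAsFactors = TRUE
)

# Two 0-1 columns per SNP: one flags "01", the other flags "10".
test.case <- do.call(cbind, lapply(names(snp), function(v) {
  m <- cbind(snp[[v]] == "01", snp[[v]] == "10") + 0
  colnames(m) <- paste(v, c("01", "10"), sep = ".")
  m
}))

cp <- crossprod(test.case)

# The old fashioned way, for one pair of SNPs.
tab <- table(snp$s1, snp$s2)

# These two should agree: subjects with s1 == "01" and s2 == "10".
cp["s1.01", "s2.10"]
tab["01", "10"]
```

The diagonal of cp holds the marginal counts for each non-reference genotype, so the cells involving "00" follow by subtraction.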
>> ---
>>
>> If '11' shows up you'll have to use a 'contr.treatment' style
>> approach. ( run 'example( contrasts )' and look at what is going on).
>>
>> Guess what these give:
>>
>> contrasts( factor( c( "00","01","10" ) ) )
>> contrasts( factor( c( "00","01","10","11" ) ) )
>>
>> then run them if you have trouble seeing why '11' changes the picture.
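
For anyone who would rather not guess, a short sketch of the comparison, assuming the default treatment contrasts for unordered factors are in effect (the names three and four are made up):

```r
three <- contrasts(factor(c("00", "01", "10")))
four  <- contrasts(factor(c("00", "01", "10", "11")))

dim(three)     # 3 levels, 2 zero-one columns
dim(four)      # 4 levels, 3 zero-one columns

# "11" gets an indicator column of its own; it is not coded as "01" plus "10".
four["11", ]
```

With only three genotypes the two binary digits already flag the two non-reference categories. Once '11' appears, the raw digits would count a '11' subject in both columns, so a third indicator column is needed.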
>>
>> ---
>>
>> BTW, what I said (below) suggests that crossprod() returns integer
>> values, but its storage.mode is actually "double".
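
A quick way to see this for yourself, sketched on a tiny made-up integer matrix (the names x, res and res.int are illustrative only):

```r
x <- matrix(c(1L, 0L, 1L, 1L, 0L, 1L), nrow = 3)   # integer 0-1 codes
res <- crossprod(x)

storage.mode(x)      # "integer"
storage.mode(res)    # "double"

# The counts are whole numbers, so converting a copy loses nothing here.
res.int <- res
storage.mode(res.int) <- "integer"
```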
>>
>> HTH,
>>
>> Chuck
>>
>>> I am confused about this.
>>>
>>> Your input will be greatly appreciated.
>>>
>>> Andy
>>
>>
>>
>>
>>>
>>>
>>>
>>> Charles C. Berry wrote:
>>>>
>>>> On Wed, 9 Jan 2008, AndyZon wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a huge number of categorical variables, say at least 10000, and I
>>>>> put them into a matrix, each column is one variable. The question is: how
>>>>> can I make all of the pairwise cross tabulation tables efficiently? The
>>>>> straightforward solution is to use for-loops by looping two indexes on the
>>>>> table() function, but it was just too slow. Is there a more efficient way
>>>>> to do that? Any guidance will be greatly appreciated.
>>>>
>>>> The totals are merely the crossproducts of a suitably constructed binary
>>>> (zero-one) matrix that encodes the categories. See '?contr.treatment'
>>>> if you cannot grok 'suitably constructed'.
>>>>
>>>> If the categories are all dichotomies coded as 0:1, you can use
>>>>
>>>> res <- crossprod( dat )
>>>>
>>>> to find the totals for the (1,1) cells
>>>>
>>>> If you need the full tables, you can get them from the marginal totals
>>>> using
>>>>
>>>> diag( res )
>>>>
>>>> to get the number in each '1' margin and
>>>>
>>>> nrow(dat)
>>>>
>>>> to get the table total, from which you get the number in each '0' margin
>>>> by subtracting the corresponding '1' margin.
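
Putting those pieces together, a minimal sketch on made-up data. The helper full.table() is hypothetical, not something from a package; it rebuilds the full 2 x 2 table for one pair of columns from res, diag( res ) and nrow( dat ):

```r
# Hypothetical toy data: 4 dichotomous variables coded 0:1, 20 subjects.
set.seed(2)
dat <- matrix(rbinom(20 * 4, 1, 0.4), nrow = 20,
              dimnames = list(NULL, paste0("v", 1:4)))

res  <- crossprod(dat)   # (1,1) cell counts for every pair of columns
ones <- diag(res)        # number in each '1' margin
n    <- nrow(dat)        # table total

# Hypothetical helper: full 2 x 2 table for columns i (rows) and j (columns).
full.table <- function(i, j) {
  n11 <- res[i, j]
  n10 <- ones[i] - n11                 # i is 1, j is 0
  n01 <- ones[j] - n11                 # i is 0, j is 1
  n00 <- n - ones[i] - ones[j] + n11   # both are 0
  matrix(c(n00, n10, n01, n11), nrow = 2,
         dimnames = list(c("0", "1"), c("0", "1")))
}

full.table(1, 2)
```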
>>>>
>>>> With dichotomous variables, dat has 10000 columns and you will only need
>>>> 10000^2 integers or about 0.75 Gigabytes to store the 'res'. And it takes
>>>> about 20 seconds to run 1000 rows on my MacBook. Of course, 'res' has a
>>>> redundant triangle.
>>>>
>>>> This approach generalizes to any number of categories:
>>>>
>>>> To extend this to more than two categories, you will need to do for each
>>>> such column what model.matrix(~factor( dat[,i] ) ) does by default
>>>> ( using 'contr.treatment' ) - construct zero-one codes for all but one
>>>> (reference) category.
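
A sketch of that encoding, assuming trichotomies coded 0, 1, 2 with 0 as the reference category. The data and the names codes and res are made up, and the intercept column that model.matrix() adds by default is dropped:

```r
# Hypothetical data: 3 trichotomous variables in the columns of dat.
set.seed(3)
dat <- matrix(sample(0:2, 30 * 3, replace = TRUE), nrow = 30,
              dimnames = list(NULL, c("a", "b", "c")))

# For each column, zero-one codes for all but the reference category.
codes <- do.call(cbind, lapply(colnames(dat), function(v) {
  f <- factor(dat[, v], levels = 0:2)
  m <- model.matrix(~ f)[, -1, drop = FALSE]   # drop the intercept column
  colnames(m) <- paste(v, levels(f)[-1], sep = ".")
  m
}))

res <- crossprod(codes)   # cell counts for all non-reference categories

# One cell checked by hand: a == 1 and b == 2.
res["a.1", "b.2"] == sum(dat[, "a"] == 1 & dat[, "b"] == 2)
```

Cells involving the reference category follow from diag( res ) and nrow( dat ) by subtraction, as in the dichotomous case.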
>>>>
>>>> Note that with 10000 trichotomies, you will have a result with
>>>>
>>>> 10000^2 * ( 3-1 )^2
>>>>
>>>> integers needing about 3 Gigabytes, and so on.
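
The sizes quoted here can be reproduced as below, under two assumptions: 8 bytes per stored value (crossprod() returns doubles) and a gigabyte taken as 2^30 bytes. The variable names are made up for illustration:

```r
bytes.per.cell <- 8    # assumption: values stored as doubles

gb.dichotomies  <- 10000^2 * bytes.per.cell / 2^30
gb.trichotomies <- 10000^2 * (3 - 1)^2 * bytes.per.cell / 2^30

gb.dichotomies     # about 0.75
gb.trichotomies    # about 3
```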
>>>>
>>>> HTH,
>>>>
>>>> Chuck
>>>>
>>>> p.s. Why on Earth are you doing this????
>>>>
>>>>
>>>>>
>>>>> Andy
>>>>> --
>>>>> View this message in context:
>>>>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14723520.html
>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>
>>>>> ______________________________________________
>>>>> R-help_at_r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>> Charles C. Berry (858) 534-2098
>>>> Dept of Family/Preventive
>>>> Medicine
>>>> E mailto:cberry_at_tajo.ucsd.edu UC San Diego
>>>> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego
>>>> 92093-0901
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14744086.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>
>
> --
> View this message in context: http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14767832.html
> Sent from the R help mailing list archive at Nabble.com.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry_at_tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901

Received on Sat 12 Jan 2008 - 19:06:03 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 12 Jan 2008 - 19:30:06 GMT.
