Re: [R] reordering huge data file

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Mon, 21 Jan 2008 17:24:47 -0800 (PST)


On Mon, 21 Jan 2008, Boks, M.P.M. wrote:

>
> Dear R-experts,
>
> My problem is how to handle a 10GB data file containing genotype data. The
> file is in a particular format (Illumina final report) and needs to be
> altered and merged with phenotype data for further analysis.
>

If the file lists all the SNPs for one individual, then all the SNPs for the next individual, and so on, you can process it one individual at a time: read in 305000 lines of data, look up the phenotype, then write out one line of output, e.g. with cat().
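
A minimal sketch of that one-individual-at-a-time loop, under simplifying assumptions that are not from the original post: the input is tab-separated with three columns (sample id, SNP name, genotype), has no header block, and holds exactly n_snps consecutive lines per individual. A real Illumina final report has a header and more columns, so the scan() template would need adjusting. The function name reshape_by_individual and the pheno columns id and phenotype are hypothetical.

```r
# Sketch: turn a long genotype file (grouped by individual) into one
# output line per individual, without ever holding the whole file in memory.
# Assumed layout (hypothetical): id <TAB> snp <TAB> genotype, no header.
reshape_by_individual <- function(infile, outfile, pheno, n_snps) {
  con <- file(infile, open = "r")   # keep open so each scan() resumes where the last stopped
  on.exit(close(con))
  out <- file(outfile, open = "w")
  on.exit(close(out), add = TRUE)

  repeat {
    # Read the block of lines belonging to one individual
    fields <- scan(con, what = list(id = "", snp = "", geno = ""),
                   nlines = n_snps, sep = "\t", quiet = TRUE)
    if (length(fields$id) == 0) break            # end of file
    id <- fields$id[1]
    stopifnot(all(fields$id == id))              # block must hold a single individual
    ph <- pheno$phenotype[match(id, pheno$id)]   # NA if no phenotype is recorded
    # One output line: id, phenotype, then all genotypes in file order
    cat(id, ph, fields$geno, sep = "\t", file = out)
    cat("\n", file = out)
  }
  invisible(NULL)
}

# Tiny made-up example: 2 individuals, 3 SNPs each
infile <- tempfile(); outfile <- tempfile()
writeLines(c("A1\trs1\tAA", "A1\trs2\tAB", "A1\trs3\tBB",
             "A2\trs1\tAB", "A2\trs2\tAB", "A2\trs3\tAA"), infile)
pheno <- data.frame(id = c("A1", "A2"), phenotype = c(1, 0),
                    stringsAsFactors = FALSE)
reshape_by_individual(infile, outfile, pheno, n_snps = 3)
res <- readLines(outfile)
```

Memory use is bounded by one individual's block, so with n_snps = 305000 only that many lines are held at a time, whatever the total file size.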

As another approach, I've been using the ncdf package for handling Illumina genotype data (slightly larger datasets, and multiple phenotypes). This has been faster and more compact than SQLite (because it doesn't need indexes to do random access by person and by SNP). It is then easy to write analyses by SNP (association tests), by person (allele sharing, population structure), or even by genomic region (all SNPs in chr9q21.3).
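
To illustrate the kind of access a netCDF file gives, here is a rough sketch using ncdf4, the successor to the ncdf package mentioned above (a swap on my part, not what the post used). Everything specific here is an assumption for illustration: the variable name "genotype", the 0/1/2 allele-count coding, the -1 missing value, and the tiny made-up dimensions.

```r
library(ncdf4)  # assumption: ncdf4 is installed

# Made-up data: 5 SNPs x 4 people, genotypes coded 0/1/2
n_snp <- 5; n_person <- 4
geno <- matrix(rep(0:2, length.out = n_snp * n_person),
               nrow = n_snp, ncol = n_person)

# Define a SNP x person array stored as single bytes
ncfile <- tempfile(fileext = ".nc")
snp_dim    <- ncdim_def("snp", units = "index", vals = 1:n_snp)
person_dim <- ncdim_def("person", units = "index", vals = 1:n_person)
geno_var   <- ncvar_def("genotype", units = "allele count",
                        dim = list(snp_dim, person_dim),
                        missval = -1, prec = "byte")
nc <- nc_create(ncfile, geno_var)
ncvar_put(nc, geno_var, geno)
nc_close(nc)

# Random access by array position: no index is needed in either direction
nc <- nc_open(ncfile)
by_snp    <- ncvar_get(nc, "genotype", start = c(3, 1), count = c(1, n_person)) # SNP 3, everyone
by_person <- ncvar_get(nc, "genotype", start = c(1, 2), count = c(n_snp, 1))    # person 2, all SNPs
region    <- ncvar_get(nc, "genotype", start = c(2, 1), count = c(3, n_person)) # SNPs 2-4, everyone
nc_close(nc)
```

If the SNPs are stored in genomic order, a region such as chr9q21.3 becomes a contiguous range along the first dimension, which is what the last read imitates.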

     -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley_at_u.washington.edu	University of Washington, Seattle

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Tue 22 Jan 2008 - 01:27:16 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 22 Jan 2008 - 01:30:08 GMT.
