[R] Selecting a subsample so that it follows a distribution.

From: Bryo <brynedal_at_gmail.com>
Date: Wed, 02 Mar 2011 07:14:02 -0800 (PST)

Hi All,

I want to select rows at random from a large data.frame so that the selection follows a particular distribution, defined by a given subset of this data.frame. How can I do this? More details, and what I've done so far, are given below.

I have gene expression data and gene sets of interest. To look at enrichment of differential expression I'm using a simple permutation approach: selecting a random set of genes (the same size as the differentially expressed set), recording the overlap, and repeating 10 000 times. The problem: expression level and significance of differential expression are correlated (higher expression gives more power). Hence I want to do a biased permutation, selecting random genes that together follow the same expression level distribution.

This is what I've done so far:
geneExp is my data.frame with DE statistics: 6585 rows of genes, column one is the gene ID.
geneSet is my gene set, column one is the gene ID. index holds the row indices in geneExp of the differentially expressed genes.

dSign=density(geneExp[index,'baseMean']) #baseMean is a measure of expression level

prob=lapply(geneExp[,"baseMean"],function(x) approx(dSign$x,dSign$y,x)$y)
prob=unlist(prob)

So when I am doing my permutation I do:


for (i in 1:10000) {


And thereafter look at the distribution of random overlaps compared to the initially observed overlap.
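
A minimal sketch of a weighted permutation loop of this kind, run on simulated data. It assumes the weights are passed to `sample()` through its `prob` argument; the names `expr`, `gene_ids`, `de_index`, `gene_set` and `n_perm` are made up for illustration and are not taken from the original code.

```r
set.seed(42)

# Simulated stand-ins (hypothetical, not the poster's data)
n_genes  <- 2000
gene_ids <- paste0("g", seq_len(n_genes))
expr     <- rnorm(n_genes, mean = 5, sd = 2)  # stand-in for an expression measure
de_index <- sample(n_genes, 150)              # rows called differentially expressed
gene_set <- sample(gene_ids, 100)             # gene set of interest

# Sampling weights: density of the DE genes' expression, evaluated per gene
# (genes outside the density grid get weight 0 instead of NA)
d_sign <- density(expr[de_index])
prob   <- approx(d_sign$x, d_sign$y, xout = expr, yleft = 0, yright = 0)$y

# Observed overlap between the DE genes and the gene set
observed <- sum(gene_ids[de_index] %in% gene_set)

# Null distribution of overlaps from weighted random gene selections
n_perm  <- 1000
overlap <- integer(n_perm)
for (i in seq_len(n_perm)) {
  pick       <- sample(n_genes, length(de_index), prob = prob)
  overlap[i] <- sum(gene_ids[pick] %in% gene_set)
}

# One-sided empirical p-value, with +1 so it can never be exactly zero
p_value <- (sum(overlap >= observed) + 1) / (n_perm + 1)
```

The vector `overlap` is the null distribution of overlaps, and `p_value` is the share of random selections that overlap the gene set at least as much as the observed DE genes do.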

But the distribution of values that this permutation gives is NOT equal to the distribution of the significant genes; it is a lot narrower, simply because my method assumes a uniform distribution of values to choose from.
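
One possible way around this, sketched on simulated data and offered as a suggestion rather than a tested solution: drawing genes with weight w(x) yields a sample whose density is roughly proportional to w(x) times the density of all genes, so weighting by the target density alone gives the product of the two densities, which is narrower than the target. Dividing the target density by the background density should cancel that factor. The names `expr`, `de_index`, `dens_at`, `w_naive` and `w_ratio` are hypothetical.

```r
set.seed(1)

# Simulated stand-ins (hypothetical): log-scale expression for 6585 genes,
# with the chance of being called DE rising with expression.
n_genes  <- 6585
expr     <- rnorm(n_genes, mean = 5, sd = 2)
de_index <- which(runif(n_genes) < plogis(expr - 9))

# Evaluate a kernel density estimate at arbitrary points (0 outside its grid)
dens_at <- function(d, x) approx(d$x, d$y, xout = x, yleft = 0, yright = 0)$y

f_target     <- dens_at(density(expr[de_index]), expr)  # density of DE genes
f_background <- dens_at(density(expr), expr)            # density of all genes

w_naive <- f_target                 # implicitly assumes a uniform background
w_ratio <- f_target / f_background  # corrects for where the genes actually lie

# Pool many weighted draws so the resulting distributions can be compared
draw <- function(w, n_rep = 50) {
  unlist(lapply(seq_len(n_rep), function(i)
    expr[sample(n_genes, length(de_index), prob = w)]))
}
naive_draws <- draw(w_naive)
ratio_draws <- draw(w_ratio)

summary_tab <- rbind(
  target = c(mean = mean(expr[de_index]), sd = sd(expr[de_index])),
  naive  = c(mean = mean(naive_draws),    sd = sd(naive_draws)),
  ratio  = c(mean = mean(ratio_draws),    sd = sd(ratio_draws))
)
print(round(summary_tab, 2))
```

Two caveats: sampling without replacement with unequal weights only approximately reproduces the target, and the ratio becomes unstable where the background density is very small, so the match is worth checking on the real data (for example with a Q-Q plot of the drawn values against the DE genes).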

Sorry if this was a complicated message, I would highly appreciate any help or comments!


Received on Wed 02 Mar 2011 - 16:05:08 GMT
