Re: [R] Calculated mean value based on another column bin from dataframe.

From: David Winsemius <dwinsemius_at_comcast.net>
Date: Wed, 06 Apr 2011 17:20:52 -0400

On Apr 6, 2011, at 9:46 AM, Fabrice Tourre wrote:

> Dear Henrique Dallazuanna,
>
> Thank you very much for your suggestion.
>
> It is obvious that your method is better than me.
>
> Is it possible to use cut, table,by etc? Whether there is some
> aggregate function in R can do this?
>
> Thanks.
>
> On Wed, Apr 6, 2011 at 2:16 PM, Henrique Dallazuanna
> <wwwhsd_at_gmail.com> wrote:
>> Try this:
>>
>> fil <- sapply(ran, '<', e1 = dat[,1]) & sapply(ran[2:(length(ran) +
>> 1)], '>=', e1 = dat[,1])
>> mm <- apply(fil, 2, function(idx)mean(dat[idx, 2]))
>>
>> On Wed, Apr 6, 2011 at 5:48 AM, Fabrice Tourre <fabrice.ciup_at_gmail.com
>> > wrote:
>>> Dear list,
>>>
>>> I have a dataframe with two column as fellow.
>>>
>>>> head(dat)
>>> V1 V2
>>> 0.15624 0.94567
>>> 0.26039 0.66442
>>> 0.16629 0.97822
>>> 0.23474 0.72079
>>> 0.11037 0.83760
>>> 0.14969 0.91312
>>>
>>> I want to get the column V2 mean value based on the bin of column of
>>> V1. I write the code as fellow. It works, but I think this is not
>>> the
>>> elegant way. Any suggestions?
>>>
>>> dat<-read.table("dat.txt",head=F)
>>> ran<-seq(0,0.5,0.05)
>>> mm<-NULL
>>> for (i in c(1:(length(ran)-1)))
>>> {
>>> fil<- dat[,1] > ran[i] & dat[,1]<=ran[i+1]
>>> m<-mean(dat[fil,2])
>>> mm<-c(mm,m)
>>> }
>>> mm
>>>
>>> Here is the first 20 lines of my data.
>>>
>>>> dput(head(dat,20))
>>> structure(list(V1 = c(0.15624, 0.26039, 0.16629, 0.23474, 0.11037,
>>> 0.14969, 0.16166, 0.09785, 0.36417, 0.08005, 0.29597, 0.14856,
>>> 0.17307, 0.36718, 0.11621, 0.23281, 0.10415, 0.1025, 0.04238,
>>> 0.13525), V2 = c(0.94567, 0.66442, 0.97822, 0.72079, 0.8376,
>>> 0.91312, 0.88463, 0.82432, 0.55582, 0.9429, 0.78956, 0.93424,
>>> 0.87692, 0.83996, 0.74552, 0.9779, 0.9958, 0.9783, 0.92523, 0.99022
>>> )), .Names = c("V1", "V2"), row.names = c(NA, 20L), class =
>>> "data.frame")
>>>
>>> ______________________________________________

Here is how I would have done it with findInterval and tapply which is very similar to using a `cut` and `table` approach:

 > dat$grp <- findInterval(dat$V1, seq(0,0.5,0.05) )  > tapply(dat$V2, dat$grp, mean)

         1 2 3 4 5 6 8 0.9252300 0.8836100 0.9135429 0.9213600 0.8493450 0.7269900 0.6978900 #####---------------

You do not get exactly the same form of the result as with Henrique's method. His yields:
 > mm
  [1] 0.9252300 0.8836100 0.9135429 0.9213600 0.8493450

0.7269900       NaN
  [8] 0.6978900       NaN       NaN       NaN

####----------------

The cut approach would yield this, which is more informatively labeled. (I'm wasn't completely sure the second to last word in the prior sentence was a real word, but several dictionaries seem to think so.):

 > dat$grp2 <- cut(dat$V1 , breaks=ran)
 > tapply(dat$V2, dat$grp2, mean)
   (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3]   0.9252300 0.8836100 0.9135429 0.9213600 0.8493450 0.7269900 (0.3,0.35] (0.35,0.4] (0.4,0.45] (0.45,0.5]

         NA 0.6978900 NA NA

>

David Winsemius, MD
West Hartford, CT



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Wed 06 Apr 2011 - 21:53:53 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 06 Apr 2011 - 23:00:28 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive