Re: [R] (Newbie) Aggregate for NA values

From: Adaikalavan Ramasamy <ramasamy_at_cancer.org.uk>
Date: Sat 25 Feb 2006 - 03:05:07 EST

I think it makes perfect sense for R to drop these rows, since an 'NA' in a grouping variable carries no information about which group the row belongs to. I do not know of an elegant solution, but I would suggest that you recode these 'NA' values into an informative value.

Here is one possibility:

 df <- data.frame( AA=1:10, BB=rep(1:5,2), CC=rep(1:2,5), DD=rnorm(10) )
 df[ 9:10, "CC" ] <- NA

 df[is.na(df)] <- "lala" ## change NA's into informative category ##

 aggregate( df$DD, by=list( df$CC ), mean )

     Group.1          x
   1       1  1.1533763
   2       2  0.6427338
   3    lala -0.2745249
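Note that df[is.na(df)] <- "lala" replaces every NA in the data frame, not only those in CC, and turns CC into a character column. A more targeted variant recodes just the grouping variable. The following is a minimal sketch on the deterministic data from the original question (so the means can be checked by hand); the label "missing" and the names grp_c and res are arbitrary choices for illustration.

```r
# Data from the original question (deterministic, unlike the rnorm() column)
tmp_df <- data.frame(
  tmp_b = rep(1:5, 2),
  tmp_c = rep(1:2, 5),
  tmp_d = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
)
tmp_df$tmp_c[9:10] <- NA

# Recode only the grouping column; "missing" is an arbitrary label
grp_c <- ifelse(is.na(tmp_df$tmp_c), "missing", as.character(tmp_df$tmp_c))

# The rows that had NA in tmp_c now form their own group
res <- aggregate(tmp_df$tmp_d, by = list(tmp_c = grp_c), FUN = mean)
print(res)
```

Rows 9 and 10 have tmp_d values 3 and 4, so the "missing" group should come out with a mean of 3.5 next to 1.75 and 2.00 for groups 1 and 2.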

 aggregate( df$DD, by=list( df$BB, df$CC ), mean )

      Group.1 Group.2           x
   1        1       1  0.47264081
   2        2       1  0.63795211
   3        3       1  1.66756015
   4        5       1  1.83535232
   5        1       2  0.89914287
   6        2       2  1.11102134
   7        3       2  0.22268699
   8        4       2  0.33808394
   9        4    lala -0.60154608
   10       5    lala  0.05249622
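If you would rather not invent a label, one alternative (not what I did above) is to keep NA as an explicit factor level with addNA() and summarise with tapply(). This is only a sketch under the assumption that a named vector or matrix of means is acceptable instead of the data frame that aggregate() returns; the names c_with_na, means, counts and two_way are illustrative. Group counts are included because they help judge how much weight the NA rows carry.

```r
# Data from the original question
tmp_b <- rep(1:5, 2)
tmp_c <- rep(1:2, 5)
tmp_c[9:10] <- NA
tmp_d <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)

# addNA() turns NA into a real factor level, so it is not dropped
c_with_na <- addNA(tmp_c)

# Mean of tmp_d per level of tmp_c, including the NA level
means <- tapply(tmp_d, c_with_na, mean)

# Number of rows per level, including NA
counts <- table(tmp_c, useNA = "ifany")

# Two-way summary: rows are tmp_b, columns are tmp_c (third column is NA)
two_way <- tapply(tmp_d, list(tmp_b, c_with_na), mean)

print(means)
print(counts)
print(two_way)
```

Empty combinations in the two-way table (for example tmp_b = 4 with tmp_c = 1) show up as NA cells, which here means "no rows", not "missing group".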

Regards, Adai

On Fri, 2006-02-24 at 10:16 -0500, Vivek Satsangi wrote:
> Folks,
>
> Sorry if this question has been answered before or is obvious (or
> worse, statistically "bad"). I don't understand what was said in one
> of the search results that seems somewhat related.
>
> I use aggregate to get a quick summary of the data. Part of what I am
> looking for in the summary is, how much influence might the NA's have
> had, if they were included, and is excluding them from the means
> causing some sort of bias. So I want the summary stat for the NA's
> also.
>
> Here is a simple example session (edited to remove the typos I made,
> comments added later):
>
> > tmp_a <- 1:10
> > tmp_b <- rep(1:5,2)
> > tmp_c <- rep(1:2,5)
> > tmp_d <- c(1,1,1,2,2,2,3,3,3,4)
> > tmp_df <- data.frame(tmp_a,tmp_b,tmp_c,tmp_d);
> > tmp_df$tmp_c[9:10] <- NA ;
> > tmp_df
>    tmp_a tmp_b tmp_c tmp_d
> 1      1     1     1     1
> 2      2     2     2     1
> 3      3     3     1     1
> 4      4     4     2     2
> 5      5     5     1     2
> 6      6     1     2     2
> 7      7     2     1     3
> 8      8     3     2     3
> 9      9     4    NA     3
> 10    10     5    NA     4
> > aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_b,tmp_df$tmp_c),mean);
>   Group.1 Group.2 x
> 1       1       1 1
> 2       2       1 3
> 3       3       1 1
> 4       5       1 2
> 5       1       2 2
> 6       2       2 1
> 7       3       2 3
> 8       4       2 2
> # Only one row for each (tmp_b, tmp_c) combination, NA's getting dropped.
>
> > aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_c),mean);
>   Group.1    x
> 1       1 1.75
> 2       2 2.00
>
> What I want in this last aggregate is, a mean for the values in tmp_d
> that correspond to the tmp_c values of NA. Similarly, perhaps there is
> a way to make the second last call to aggregate return the values of
> tmp_d for the NA values of tmp_c also.
>
> How can I achieve this?
>
> --
> -- Vivek Satsangi
> Student, Rochester, NY USA
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>



Received on Sat Feb 25 03:13:03 2006
