Re: [R] Problem with ddply in the plyr-package: surprising output of a date-column

From: William Dunlap <wdunlap_at_tibco.com>
Date: Mon, 25 Apr 2011 11:55:06 -0700

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces_at_r-project.org
> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Brian Diggs
> Sent: Monday, April 25, 2011 11:05 AM
> To: christoph.jaeckel_at_wi.tum.de
> Cc: r-help_at_r-project.org
> Subject: Re: [R] Problem with ddply in the plyr-package:
> surprising output of a date-column
>
> On 4/25/2011 10:19 AM, Christoph Jäckel wrote:
> > Hi Together,
> >
> > I have a problem with the plyr package - more precisely with the
> > ddply function - and would be very grateful for any help. I hope the
> > example here is precise enough for someone to identify the problem.
> > Basically, in this step I want to identify observations that are
> > identical in terms of certain identifiers (ID1, ID2, ID3) and save
> > those observations (in this step, without deleting any rows or
> > manipulating any data) in a separate data.frame. However, I get the
> > warning message below and the column with dates is messed up.
> > Interestingly, the value column (the type is factor here, but
> > changing that with as.integer doesn't make any difference) is
> > handled correctly. Any idea what I am doing wrong?
> >
> > df <- data.frame(cbind(ID1=c(1,2,2,3,3,4,4),
> >     ID2=c('a','b','b','c','d','e','e'),
> >     ID3=c("v1","v1","v1","v1","v2","v1","v1"),
> >     Date=c("1985-05-1","1985-05-2","1985-05-3","1985-05-4",
> >            "1985-05-5","1985-05-6","1985-05-7"),
> >     Value=c(1,2,3,4,5,6,7)))
> > df[,1]<- as.character(df[,1])
> > df[,2]<- as.character(df[,2])
> > df$Date<- strptime(df$Date,"%Y-%m-%d")
> >
> > # Apparently there are two observations that have the same IDs:
> > # ID1=2 and ID1=4
> > ddply(df,.(ID1,ID2,ID3),nrow)
> > # I want to save those IDs in a separate data.frame, so the desired
> > # output is:
> > df[c(2:3,6:7),]
> >
> > # My idea: write a custom function that only returns observations
> > # with multiple rows.
> > # Seems to work, except that the Date column doesn't make any sense
> > # anymore.
> > # Warning message: In output[[var]][rng]<- df[[var]]: number of items
> > # to replace is not a multiple of replacement length
> > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
> >
> > # Notice that it works perfectly if I only have one observation with
> > # multiple rows
> > ddply(df[1:6,],.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>
> Works for me:
>
> > df[c(2:3,6:7),]
>   ID1 ID2 ID3      Date Value
> 2   2   b  v1 1985-05-2     2
> 3   2   b  v1 1985-05-3     3
> 6   4   e  v1 1985-05-6     6
> 7   4   e  v1 1985-05-7     7
> > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>   ID1 ID2 ID3      Date Value
> 1   2   b  v1 1985-05-2     2
> 2   2   b  v1 1985-05-3     3
> 3   4   e  v1 1985-05-6     6
> 4   4   e  v1 1985-05-7     7
> [ ... version info elided ... ]
> A couple of things: there was just an update of plyr to 1.5.2; maybe
> that fixes what you are seeing? Also, your df consists of only
> factors. cbind-ing the data before turning it into a data.frame makes
> it a character matrix which gets converted to factors.
>
> > str(df)
> 'data.frame': 7 obs. of 5 variables:
> $ ID1 : Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4
> $ ID2 : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
> $ ID3 : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
> $ Date : Factor w/ 7 levels "1985-05-1","1985-05-2",..: 1 2 3 4 5 6 7
> $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7

The OP's data.frame contained a POSIXlt (not factor) object in the "Date" column:

  > str(df)
  'data.frame': 7 obs. of 5 variables:
   $ ID1  : chr  "1" "2" "2" "3" ...
   $ ID2  : chr  "a" "b" "b" "c" ...
   $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
   $ Date : POSIXlt, format: "1985-05-01" "1985-05-02" ...
   $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7

and apparently plyr's equivalent of rbind doesn't support that class.
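
A quick way to see why POSIXlt is an awkward thing to keep in a data.frame column: underneath, it is a list of date-time components rather than one plain vector, whereas POSIXct and Date are plain numeric vectors. That is presumably what trips up code that fills a column piece by piece, though I have not traced it through plyr's source. A minimal base-R sketch (the variable names and tz = "UTC" are just for illustration):

```r
# Two of the OP's dates, parsed the same way as in the original post.
lt <- strptime(c("1985-05-1", "1985-05-2"), "%Y-%m-%d", tz = "UTC")
ct <- as.POSIXct(lt)   # seconds since the epoch
d  <- as.Date(lt)      # days since the epoch

class(lt)                 # "POSIXlt" "POSIXt"
is.list(unclass(lt))      # TRUE: a list of components (sec, min, hour, ...)
is.numeric(unclass(ct))   # TRUE: one number per date-time
is.numeric(unclass(d))    # TRUE: one number per date
```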

If you want to continue using POSIXlt objects you can get your immediate result without ddply; subscripting will do the job:

  > nDups <- with(df, ave(rep(0,nrow(df)), ID1, ID2, ID3, FUN=length))
  > print(nDups)
  [1] 1 2 2 1 1 2 2
  > df[nDups>1, ]
    ID1 ID2 ID3       Date Value
  2   2   b  v1 1985-05-02     2
  3   2   b  v1 1985-05-03     3
  6   4   e  v1 1985-05-06     6
  7   4   e  v1 1985-05-07     7

  > str(.Last.value)
  'data.frame': 4 obs. of 5 variables:
   $ ID1  : chr  "2" "2" "4" "4"
   $ ID2  : chr  "b" "b" "e" "e"
   $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1
   $ Date : POSIXlt, format: "1985-05-02" "1985-05-03" ...
   $ Value: Factor w/ 7 levels "1","2","3","4",..: 2 3 6 7
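
For comparison, the same rows can be picked out with duplicated() instead of ave(). This is a different base-R idiom from the one above, not something from the original post: a row belongs to a repeated ID combination if it duplicates an earlier row or a later one. A sketch, with the data rebuilt so that it runs on its own (the names ids and dup, and tz = "UTC", are only illustrative):

```r
# Rebuild the example data; the Date column is POSIXlt, as in the OP's df.
df <- data.frame(ID1 = c("1", "2", "2", "3", "3", "4", "4"),
                 ID2 = c("a", "b", "b", "c", "d", "e", "e"),
                 ID3 = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
                 Value = c(1, 2, 3, 4, 5, 6, 7),
                 stringsAsFactors = FALSE)
df$Date <- strptime(c("1985-05-1", "1985-05-2", "1985-05-3", "1985-05-4",
                      "1985-05-5", "1985-05-6", "1985-05-7"),
                    "%Y-%m-%d", tz = "UTC")

ids <- c("ID1", "ID2", "ID3")
# TRUE for every row whose ID combination occurs more than once:
# duplicated() flags repeats after the first, fromLast = TRUE flags the rest.
dup <- duplicated(df[ids]) | duplicated(df[ids], fromLast = TRUE)
dups <- df[dup, ]
print(dups)
```

Only the ID columns are passed to duplicated(), so the class of the Date column does not come into play when the flags are computed.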

If you need plyr for other tasks you ought to use a different class for your date data (or wait until plyr can deal with POSIXlt objects).
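
As a sketch of the "different class" route: converting the POSIXlt values to POSIXct (or to Date, if the time of day does not matter) before they go into the data.frame gives a column that is a plain numeric vector underneath. The tz = "UTC" below is my assumption, not something from the original post, and the ddply call is guarded so the snippet still runs when plyr is not installed:

```r
# Parse the dates as in the original post, then convert away from POSIXlt.
dates_lt <- strptime(c("1985-05-1", "1985-05-2", "1985-05-3", "1985-05-4",
                       "1985-05-5", "1985-05-6", "1985-05-7"),
                     "%Y-%m-%d", tz = "UTC")   # tz is an assumption
df <- data.frame(ID1 = c(1, 2, 2, 3, 3, 4, 4),
                 ID2 = c("a", "b", "b", "c", "d", "e", "e"),
                 ID3 = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
                 Value = c(1, 2, 3, 4, 5, 6, 7),
                 stringsAsFactors = FALSE)
df$Date <- as.POSIXct(dates_lt)   # as.Date(dates_lt) would also do for pure dates
str(df$Date)

# Same custom function as in the original post, run only if plyr is available.
if (requireNamespace("plyr", quietly = TRUE)) {
  dups <- plyr::ddply(df, plyr::.(ID1, ID2, ID3),
                      function(d) if (nrow(d) <= 1) NULL else d)
  print(dups)
}
```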


>
> Maybe that has something to do with the odd "dates", since they are
> not really dates at all, just string representations of factor levels.
> Compare with:
>
> DF <- data.frame(ID1=c(1,2,2,3,3,4,4),
> ID2=c('a','b','b','c','d','e','e'),
> ID3=c("v1","v1","v1","v1","v2","v1","v1"),
> Date=as.Date(c("1985-05-1","1985-05-2","1985-05-3",
> "1985-05-4","1985-05-5","1985-05-6","1985-05-7")),
> Value=c(1,2,3,4,5,6,7))
> str(DF)
> #'data.frame': 7 obs. of 5 variables:
> # $ ID1 : num 1 2 2 3 3 4 4
> # $ ID2 : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
> # $ ID3 : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
> # $ Date : Date, format: "1985-05-01" "1985-05-02" ...
> # $ Value: num 1 2 3 4 5 6 7
>
> This version also works for me.
>
> ddply(DF,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
> #   ID1 ID2 ID3       Date Value
> # 1   2   b  v1 1985-05-02     2
> # 2   2   b  v1 1985-05-03     3
> # 3   4   e  v1 1985-05-06     6
> # 4   4   e  v1 1985-05-07     7
>
> > Thanks in advance,
> >
> > Christoph
> >
> > ------------------------------------------------------------
> > Christoph Jäckel (Dipl.-Kfm.)
> > ------------------------------------------------------------
> > Research Assistant
> >
> > Chair for Financial Management and Capital Markets | Lehrstuhls für
> > Finanzmanagement und Kapitalmärkte
> >
> > TUM School of Management | Technische Universität München
> >
> > Arcisstr. 21 | D-80333 München | Germany
> >
>
>
> --
> Brian S. Diggs, PhD
> Senior Research Associate, Department of Surgery
> Oregon Health & Science University
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Received on Mon 25 Apr 2011 - 18:58:36 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 25 Apr 2011 - 20:20:37 GMT.