Re: [R] Problem with ddply in the plyr-package: surprising output of a date-column

From: Brian Diggs <diggsb_at_ohsu.edu>
Date: Mon, 25 Apr 2011 13:06:47 -0700

On 4/25/2011 11:55 AM, William Dunlap wrote:
>
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-bounces_at_r-project.org
>> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Brian Diggs
>> Sent: Monday, April 25, 2011 11:05 AM
>> To: christoph.jaeckel_at_wi.tum.de
>> Cc: r-help_at_r-project.org
>> Subject: Re: [R] Problem with ddply in the plyr-package:
>> surprising output of a date-column
>>
>> On 4/25/2011 10:19 AM, Christoph Jäckel wrote:
>>> Hi Together,
>>>
>>> I have a problem with the plyr package - more precisely with the ddply
>>> function - and would be very grateful for any help. I hope the example
>>> here is precise enough for someone to identify the problem. Basically,
>>> in this step I want to identify observations that are identical in
>>> terms of certain identifiers (ID1, ID2, ID3) and just want to save
>>> those observations (in this step, without deleting any rows or
>>> manipulating any data) in a separate data.frame. However, I get the
>>> warning message below and the column with dates is messed up.
>>> Interestingly, the value column (the type is factor here, but if you
>>> change that with as.integer it doesn't make any difference) is handled
>>> correctly. Any idea what I am doing wrong?
>>>
>>> df<- data.frame(cbind(ID1=c(1,2,2,3,3,4,4),
>>>                       ID2=c('a','b','b','c','d','e','e'),
>>>                       ID3=c("v1","v1","v1","v1","v2","v1","v1"),
>>>                       Date=c("1985-05-1","1985-05-2","1985-05-3","1985-05-4",
>>>                              "1985-05-5","1985-05-6","1985-05-7"),
>>>                       Value=c(1,2,3,4,5,6,7)))
>>> df[,1]<- as.character(df[,1])
>>> df[,2]<- as.character(df[,2])
>>> df$Date<- strptime(df$Date,"%Y-%m-%d")
>>>
>>> # Apparently there are two observations that have the same IDs: ID1=2 and ID1=4
>>> ddply(df,.(ID1,ID2,ID3),nrow)
>>> # I want to save those IDs in a separate data.frame, so the desired output is:
>>> df[c(2:3,6:7),]
>>>
>>> # My idea: write a custom function that only returns observations with
>>> # multiple rows.
>>> # Seems to work except that the Date column doesn't make any sense anymore
>>> # Warning message: In output[[var]][rng]<- df[[var]]: number of items
>>> # to replace is not a multiple of replacement length
>>> ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>>>
>>> # Notice that it works perfectly if I only have one observation with
>>> # multiple rows
>>> ddply(df[1:6,],.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>>
>> Works for me:
>>
>> > df[c(2:3,6:7),]
>> ID1 ID2 ID3 Date Value
>> 2 2 b v1 1985-05-2 2
>> 3 2 b v1 1985-05-3 3
>> 6 4 e v1 1985-05-6 6
>> 7 4 e v1 1985-05-7 7
>> > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>> ID1 ID2 ID3 Date Value
>> 1 2 b v1 1985-05-2 2
>> 2 2 b v1 1985-05-3 3
>> 3 4 e v1 1985-05-6 6
>> 4 4 e v1 1985-05-7 7
>> [ ... version info elided ... ]
>> A couple of things: there was just an update of plyr to 1.5.2; maybe
>> that fixes what you are seeing? Also, your df consists of only factors.
>> cbind-ing the data before turning it into a data.frame makes it a
>> character matrix which gets converted to factors.
>>
>> > str(df)
>> 'data.frame': 7 obs. of 5 variables:
>> $ ID1 : Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4
>> $ ID2 : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
>> $ ID3 : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
>> $ Date : Factor w/ 7 levels "1985-05-1","1985-05-2",..: 1 2 3 4 5 6 7
>> $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
>
> The OP's data.frame contained a POSIXlt (not factor) object
> in the "Date" column
> > str(df)
> 'data.frame': 7 obs. of 5 variables:
> $ ID1 : chr "1" "2" "2" "3" ...
> $ ID2 : chr "a" "b" "b" "c" ...
> $ ID3 : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
> $ Date : POSIXlt, format: "1985-05-01" "1985-05-02" ...
> $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7

Thanks, Bill. Somehow I missed that, even though the OP had it in his code and I had copied it into my testing window. My mistake for not running that line and checking the result.

> and apparently plyr's equivalent of rbind doesn't support that class.

plyr primarily uses rbind.fill to combine the pieces, and testing it directly shows that it does not handle POSIXlt columns. (With only one argument, though, it just passes the data.frame back, which is why it worked when there was just a single duplicate; that path bypasses the code that cannot handle POSIXlt.)
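
A rough way to see why POSIXlt columns are awkward to stack (this is my reading, not something I have confirmed in plyr's source): a POSIXlt value is stored as a list of component vectors, whereas POSIXct is a single numeric vector. The sketch below uses only base R, and the variable names lt and ct are made up for illustration.

```r
# Two of the dates from the example, parsed the same way the OP did
lt <- strptime(c("1985-05-1", "1985-05-2"), "%Y-%m-%d")
class(lt)                 # "POSIXlt" "POSIXt"

# Underneath, POSIXlt is a list with one vector per date-time component
is.list(unclass(lt))      # TRUE
names(unclass(lt))        # includes "sec", "min", "hour", "mday", "mon", "year", ...

# POSIXct stores the same instants as one plain numeric vector
ct <- as.POSIXct(lt)
is.numeric(unclass(ct))   # TRUE
length(unclass(ct))       # 2
```

Code that fills a column by assigning into a slice of a pre-allocated vector would presumably have trouble with the list layout, which may be what the "number of items to replace" warning reflects.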

> If you want to continue using POSIXlt objects you can get your
> immediate result without ddply; subscripting will do the job:
> > nDups<- with(df, ave(rep(0,nrow(df)), ID1, ID2, ID3, FUN=length))
> > print(nDups)
> [1] 1 2 2 1 1 2 2
> > df[nDups>1, ]
> ID1 ID2 ID3 Date Value
> 2 2 b v1 1985-05-02 2
> 3 2 b v1 1985-05-03 3
> 6 4 e v1 1985-05-06 6
> 7 4 e v1 1985-05-07 7
> > str(.Last.value)
> 'data.frame': 4 obs. of 5 variables:
> $ ID1 : chr "2" "2" "4" "4"
> $ ID2 : chr "b" "b" "e" "e"
> $ ID3 : Factor w/ 2 levels "v1","v2": 1 1 1 1
> $ Date : POSIXlt, format: "1985-05-02" "1985-05-03" ...
> $ Value: Factor w/ 7 levels "1","2","3","4",..: 2 3 6 7
>
> If you need plyr for other tasks you ought to use a different
> class for your date data (or wait until plyr can deal with
> POSIXlt objects).

If you do want to change classes, both Date and POSIXct are choices that will work with plyr.
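
A minimal sketch of that change, assuming plyr is installed (the ddply call is skipped otherwise). The data mirror the OP's example but are built without cbind so the columns keep their types; the name dups and the anonymous function's argument d are just illustrative.

```r
# Build the example data directly, keeping IDs as character and Value numeric
df <- data.frame(ID1 = c("1", "2", "2", "3", "3", "4", "4"),
                 ID2 = c("a", "b", "b", "c", "d", "e", "e"),
                 ID3 = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
                 Date = c("1985-05-1", "1985-05-2", "1985-05-3", "1985-05-4",
                          "1985-05-5", "1985-05-6", "1985-05-7"),
                 Value = c(1, 2, 3, 4, 5, 6, 7),
                 stringsAsFactors = FALSE)

# Convert the strptime() result (POSIXlt) to POSIXct before storing it.
# Alternative for pure dates: df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
df$Date <- as.POSIXct(strptime(df$Date, "%Y-%m-%d"))

# Keep only the ID combinations that occur more than once
if (requireNamespace("plyr", quietly = TRUE)) {
  dups <- plyr::ddply(df, c("ID1", "ID2", "ID3"),
                      function(d) if (nrow(d) <= 1) NULL else d)
  print(dups)
}
```

If the time of day does not matter, Date is probably the simpler of the two classes to carry around.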

>
>>
>> Maybe that has something to do with the odd "dates" since they are not
>> really dates at all, just string representations of factor levels.
>> Compare with:
>>
>> DF<- data.frame(ID1=c(1,2,2,3,3,4,4),
>> ID2=c('a','b','b','c','d','e','e'),
>> ID3=c("v1","v1","v1","v1","v2","v1","v1"),
>> Date=as.Date(c("1985-05-1","1985-05-2","1985-05-3",
>> "1985-05-4","1985-05-5","1985-05-6","1985-05-7")),
>> Value=c(1,2,3,4,5,6,7))
>> str(DF)
>> #'data.frame': 7 obs. of 5 variables:
>> # $ ID1 : num 1 2 2 3 3 4 4
>> # $ ID2 : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
>> # $ ID3 : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
>> # $ Date : Date, format: "1985-05-01" "1985-05-02" ...
>> # $ Value: num 1 2 3 4 5 6 7
>>
>> This version also works for me.
>>
>> ddply(DF,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>> # ID1 ID2 ID3 Date Value
>> #1 2 b v1 1985-05-02 2
>> #2 2 b v1 1985-05-03 3
>> #3 4 e v1 1985-05-06 6
>> #4 4 e v1 1985-05-07 7
>>
>>> Thanks in advance,
>>>
>>> Christoph
>>>
>>>
>>> ------------------------------------------------------------
>>>
>>> Christoph Jäckel (Dipl.-Kfm.)
>>>
>>>
>>> ------------------------------------------------------------
>>>
>>> Research Assistant
>>>
>>> Chair for Financial Management and Capital Markets | Lehrstuhls für
>>> Finanzmanagement und Kapitalmärkte
>>>
>>> TUM School of Management | Technische Universität München
>>>
>>> Arcisstr. 21 | D-80333 München | Germany
>>>
>>
>>
>> --
>> Brian S. Diggs, PhD
>> Senior Research Associate, Department of Surgery
>> Oregon Health & Science University
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

Received on Mon 25 Apr 2011 - 20:10:08 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 25 Apr 2011 - 20:20:37 GMT.