Re: [R] R Newbie, please help!

From: Joshua Wiley <jwiley.psych_at_gmail.com>
Date: Fri, 04 Jun 2010 00:01:03 -0700

I am not exactly sure how your filtering code is working, but take a look at

?na.omit

You will probably need a few additional steps if you want to remove all rows related to a particular id. Also look at ?subset which is a good general way to subset your data.
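
As a rough sketch of those additional steps: this assumes a stock counts as complete when it has as many rows as there are unique dates (the same rule as in your loop), and that there are no duplicated id/date_ rows. The column names id and date_ come from your code; the small Returns data frame below is made up purely for illustration.

```r
## Made-up example data: stock 3 is missing one date
Returns <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3),
  date_  = c(1, 2, 3, 1, 2, 3, 1, 2),
  totret = c(1.0, 1.1, 1.2, 2.0, 2.1, 2.2, 3.0, 3.1)
)

## Number of dates a complete stock should have
n.dates <- length(unique(Returns$date_))

## Row count for each id, repeated so it lines up with the rows of Returns
n.per.id <- ave(seq_along(Returns$id), Returns$id, FUN = length)

## Keep only the rows whose id has a row for every date
Returns.filter <- Returns[n.per.id == n.dates, ]
```

Because the comparison is done on whole vectors, no while() loop is needed; subset(Returns, n.per.id == n.dates) should give the same rows.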

Josh

On Thu, Jun 3, 2010 at 11:45 PM, Jeffery Ding <jefferyding_at_gmail.com> wrote:
> Thanks, you have been tremendously helpful!
> I will be able to implement option 2, after I filter out stocks with
> incomplete data sets.
>
> So far, for my filtering code I have:
>
> ##Filtering
>
> x<-length(unique(Returns$date_))
> y<-unique(Returns$id)
> Returns.filter<-Returns
>
> i<-1
>
> while(i<=length(y)) {
>     a<-sum(Returns$id==y[i])
>     if(a<x) {
>         ##remove all rows for this id, y[i] (a is only its row count)
>         Returns.filter<-Returns.filter[Returns.filter$id!=y[i],]
>     }
>     i<-i+1
>     }
>
>
>
> On Fri, Jun 4, 2010 at 2:40 PM, Joshua Wiley <jwiley.psych_at_gmail.com> wrote:
>>
>> Hey Jeff,
>>
>> I have a few ideas.  Each has some different requirements, and to help
>> you choose, I benchmarked them.
>>
>>
>> ###START###
>>
>> ##Basic data
>> > test <- data.frame(totret=rnorm(10^7), id=rep(1:10^4, each=10^3),
>> > time=rep(c(1, rep(0, 999)), 10^4))
>>
>> ##Option 1: probably the most general, but also the slowest by far.
>> ##The idea is that it does the calculation for each stock/ID, and then
>> ##concatenates [c()] an NA in front.
>>
>> > system.time(test[,"dailyreturns"] <- unlist(by(test[,"totret"],
>> > test[,"id"], function(x) {c(NA, x[-1]/x[-length(x)])})), gcFirst=TRUE)
>>   user  system elapsed
>>  49.11    0.42   49.86
>>
>> ##Option 2: Assumes that you have the same number of measurements for
>> ##each stock/ID so you can just assign an NA every nth row.
>> ##This is fairly fast
>>
>> > system.time(test[-1,"dailyreturns"] <-
>> > test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
>>   user  system elapsed
>>   1.11    0.21    1.31
>> > system.time(test[seq(1, 10^7, by=10^3),"dailyreturns"] <- NA,
>> > gcFirst=TRUE)
>>   user  system elapsed
>>   0.39    0.04    0.42
>>
>> ##Option 3: Assumes that you have some variable (time in my little
>> ##test data) that somehow indicates when each stock/ID has its first
>> ##measurement.  In the example, the first measurement gets a 1 and
>> ##subsequent ones a 0.  So we just assign NA in 'dailyreturns' every time
>> ##the "time" column has a 1.  Again, a big assumption, but fairly
>> ##quick.
>>
>> > system.time(test[-1,"dailyreturns"] <-
>> > test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
>>   user  system elapsed
>>   1.06    0.17    1.25
>> > system.time(test[which(test[,"time"]==1),"dailyreturns"] <- NA,
>> > gcFirst=TRUE)
>>   user  system elapsed
>>   0.46    0.09    0.55
>>
>> ###END###
>>
>> I really feel like there should be a faster way that is also more
>> general, but it is late and I am not coming up with any better ideas
>> at the moment.  Perhaps somehow finding the first instance of a
>> stock/ID?  Anyway, this was simulated on 10 million rows, so maybe
>> by() works plenty fast for you.
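
One possible way to find the first instance of each stock/ID without an extra indicator column is duplicated(), which is FALSE the first time a value appears. A minimal sketch, assuming the rows are already grouped by id and ordered by time within each id; the tiny test data frame here is made up for illustration and is not the benchmark data above.

```r
## Made-up data: two stocks with three measurements each
test <- data.frame(totret = c(1, 2, 4, 10, 5, 20),
                   id     = c(1, 1, 1, 2, 2, 2))

## Ratio of each row to the previous row, ignoring stock boundaries for now
test[-1, "dailyreturns"] <- test[-1, "totret"] / test[-nrow(test), "totret"]

## duplicated() is FALSE on the first row of each id; set those rows to NA
test[!duplicated(test[, "id"]), "dailyreturns"] <- NA
```

This keeps the vectorized division from Options 2 and 3 but does not require equal group sizes or a "time" column. It has not been timed on the 10 million row data.
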
>>
>> Josh
>>
>>
>> On Thu, Jun 3, 2010 at 10:20 PM, Jeff08 <jefferyding_at_gmail.com> wrote:
>> >
>> > Hey Josh,
>> >
>> > Thanks for the quick response!
>> >
>> > I guess I have to switch from the Java mindset to the matrix/vector
>> > mindset of R.
>> >
>> > Your code worked very well, but I just have one problem:
>> >
>> > Essentially I have a time series of stock A, followed by a time series
>> > of stock B, etc. So there are break points in the data: the points where
>> > it switches stocks have incorrect returns, and should be NA at t=0 for
>> > each stock.
>> >
>> > Is there an easy way to account for this in R? What I was thinking of is
>> > whether there is a way to make a filter rule, such as: if the ID of the
>> > row matches Stock A, then perform this.
>> >
>> >>>"Hello Jeff,
>> >
>> > Try this:
>> >
>> > test <- data.frame(totret=rnorm(10^7)) #create some sample data
>> > test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"]
>> >
>> > The general idea is to take the column "totret" excluding the first row,
>> > divided by "totret" excluding the last row.  In effect this divides each
>> > value at t+1 by the value at t, because the two vectors are offset by
>> > one row.
>> >
>> > I assigned the result to a new column "dailyreturn".  For 10^7 rows,
>> > it took 1.92 seconds on my system."
>> > --
>> > View this message in context:
>> > http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242703.html
>> > Sent from the R help mailing list archive at Nabble.com.
>> >
>> > ______________________________________________
>> > R-help_at_r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Joshua Wiley
>> Senior in Psychology
>> University of California, Riverside
>> http://www.joshuawiley.com/
>
>
>
> --
> Jeffery Ding
> Duke University, Class of 2012
> (224) 622-3398 | jd116_at_duke.edu
>

-- 
Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/

Received on Fri 04 Jun 2010 - 07:03:43 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 04 Jun 2010 - 07:40:26 GMT.
