Re: [R] help with mysql and R: partitioning by quintile

From: Dennis Murphy <djmuser_at_gmail.com>
Date: Sat, 14 May 2011 11:37:57 -0700

Hi:

Is this what you're after?

tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2))) ds$int <- with(ds, cut(freq, c(0, tq)))
with(ds, table(int))

int
 (0,1] (1,2] (2,4] (4,7] (7,16]

    10 6 7 6 6

HTH,
Dennis

On Sat, May 14, 2011 at 9:42 AM, gj <gawesh_at_gmail.com> wrote:
> Hi Jim,
> Thanks very much for the code. I modified it a bit because I needed to
> allocate the track ratings by userid (eg if user 1 plays track x once, he
> gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and not
> by track (sorry if this wasn't clear in my original post).
>
> This is almost working! What I can't get right at the moment is the cutoff
> interval for the ratings.
> Any help please?
>
> Sample data:
>
> userid,track,freq
> 1,1,1
> 1,2,10
> 1,3,1
> 1,4,1
>
> 1,5,15
> 1,6,4
> 1,7,16
> 1,8,6
> 1,9,1
> 1,10,1
> 1,11,2
> 1,12,2
> 1,13,1
> 1,14,6
> 1,15,7
> 1,16,13
> 1,17,3
> 1,18,2
> 1,19,5
> 1,20,2
> 1,21,2
> 1,22,6
> 1,23,4
> 1,24,1
> 1,25,1
> 1,26,16
> 1,27,4
> 1,28,1
> 1,29,4
> 1,30,4
> 1,31,4
> 1,32,1
> 1,33,14
> 1,34,2
> 1,35,7
>
>
>>tq
>
>
> $'1'
>
>
> 20%  40%  60%  80% 100%
>   1    2    4    7   16
>
>
> >From this distribution I would expect to distribute ratings as follows:
> freq: <=1  <=2  <=4  <=7  <=16
> rating: 1      2      3      4       5
>
> But my output is (the ratings are correct except at the cut off points):
> "userid" "freq" "track" "rating"
> 1 1 1 2
> 1 10 2 5
> 1 1 3 2
> 1 1 4 2
> 1 15 5 5
> 1 4 6 4
> 1 16 7 5
> 1 6 8 4
> 1 1 9 2
> 1 1 10 2
> 1 2 11 3
> 1 2 12 3
> 1 1 13 2
> 1 6 14 4
> 1 7 15 5
> 1 13 16 5
> 1 3 17 3
> 1 2 18 3
> 1 5 19 4
> 1 2 20 3
> 1 2 21 3
> 1 6 22 4
> 1 4 23 4
> 1 1 24 2
> 1 1 25 2
> 1 16 26 5
> 1 4 27 4
> 1 1 28 2
> 1 4 29 4
> 1 4 30 4
> 1 4 31 4
> 1 1 32 2
> 1 14 33 5
> 1 2 34 3
> 1 7 35 5
>
> This is the code:
> # cheers jim holtman
> x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,

> sep=',')
> # get the quantiles for each track
> tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
> # create a matrix with the rownames as the tracks to use in the findInterval
> tqm <- do.call(rbind, tq)
> #now put the ratings
> require(data.table)
> x.dt <- data.table(x)
> x.new <- x.dt[,list(freq = freq,track=track,rating =
> findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = TRUE) +
> 1L),by=userid]
>
> Regards
> Gawesh
>
> On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholtman_at_gmail.com> wrote:
>
>> try this:
>>
>> > # create some data
>> > x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '')
>> +               , track = rep(1:20, 20)
>> +               , freq = floor(runif(400, 10, 200))
>> +               , stringsAsFactors = FALSE
>> +               )
>> > # get the quantiles for each track
>> > tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8, 1))
>> > # create a matrix with the rownames as the tracks to use in the
>> findInterval
>> > tqm <- do.call(rbind, tq)
>> > # now put the ratings
>> > require(data.table)
>> > x.dt <- data.table(x)
>> > x.new <- x.dt[,
>> +               list(userid = userid
>> +                  , freq = freq
>> +                  , rating = findInterval(freq
>> +                                        # use track as index into
>> quantile matrix
>> +                                        , tqm[as.character(track[1L]),]
>> +                                        , rightmost.closed = TRUE
>> +                                        ) + 1L
>> +                  )
>> +              , by = track]
>> >
>> > head(x.new)

>>     track userid freq rating
>> [1,]     1     u1   10      1
>> [2,]     1     u2   15      1
>> [3,]     1     u3  126      4
>> [4,]     1     u4  117      3
>> [5,]     1     u5   76      2
>> [6,]     1     u6  103      3
>> >
>>
>>
>> On Sun, May 8, 2011 at 2:48 PM, gj <gawesh_at_gmail.com> wrote:
>> > Hi,
>> >
>> > I have a mysql table with fields userid,track,frequency e.g
>> > u1,1,10
>> > u1,2,100
>> > u1,3,110
>> > u1,4,200
>> > u1,5,120
>> > u1,6,130
>> > .
>> > u2,1,23
>> > .
>> > .
>> > where "frequency" is the number of times a music track is played by a
>> > "userid"
>> >
>> > I need to turn my 'frequency' table into a rating table (it's for a
>> > recommender system). So, for each user, I need to categorise the
>> frequency
>> > of tracks played by quintile so that each particular track can have 5
>> > ratings (1-5), with the ratings allocated as follows: inter-quintile
>> range
>> > 100-80% = rating 5,   inter-quintile range 80-60% = rating 4,
>> > ..., inter-quintile range 20-0% = rating 1)
>> >
>> > Hence, I need to create a table with fields userid,track,rating:
>> > u1,1,1
>> > u1,2, 3
>> > ...
>> >
>> > Can anybody help me to do this with R?
>> >
>> > Regards
>> > Gawesh
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help_at_r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 14 May 2011 - 18:41:21 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 14 May 2011 - 23:00:06 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive