# Re: [R] help with mysql and R: partitioning by quintile

From: gj <gawesh_at_gmail.com>
Date: Sat, 14 May 2011 23:09:04 +0100

```1,1,1
1,2,10
1,3,1
1,4,1
1,5,15
1,6,4
1,7,16
1,8,6
1,9,1
1,10,1
1,11,2
1,12,2
1,13,1
1,14,6
1,15,7
1,16,13
1,17,3
1,18,2
1,19,5
1,20,2
1,21,2
1,22,6
1,23,4
1,24,1
1,25,1
1,26,16
1,27,4
1,28,1
1,29,4
1,30,4
1,31,4
1,32,1
1,33,14
1,34,2
1,35,7

```

It is a sample of the history of tracks played: userid,track and frequency. What I want is to convert the frequency into a rating scale (1-5) based on the frequency at which a user has played a track, using the following interquintile ranges for the cfd:
0%-20% = rating 1, 20%-40% = rating 2, .... ,80%-100%=rating 5

Jim kindly provided the following code:
# cheers jim holtman
sep=',')
># get the quantiles for each user(we want the frequency distribution to be
based on user)
>tq <- tapply(x\$freq,x\$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
># create a matrix with the rownames as the tracks to use in the
findInterval
>tqm <- do.call(rbind, tq)
>#now put the ratings
>require(data.table)
>x.dt <- data.table(x)
>x.new <- x.dt[,list(freq = freq,track=track,rating =
findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = TRUE) + 1L),by=userid]

```     userid freq track rating
[1,]      1    1     1      2
[2,]      1   10     2      5
[3,]      1    1     3      2
[4,]      1    1     4      2
[5,]      1   15     5      5
[6,]      1    4     6      4

```

which is almost what I wanted except that the ratings are 1 point higher for tracks where the frequency is at the cut-off points in the interquintile range.
To illustrate the quintiles are:

> tq\$`1`

20% 40% 60% 80% 100%
1 2 4 7 16

So, ideally I want (note the different ratings):

```     userid freq track rating
[1,]      1    1     1      1
[2,]      1   10     2      5
[3,]      1    1     3      1
[4,]      1    1     4      1
[5,]      1   15     5      5
[6,]      1    4     6      3

```

Can anybody help me? I'm new to R (as you have probably guessed). Sorry for the long explanation.

Regards
Gawesh

On Sat, May 14, 2011 at 7:37 PM, Dennis Murphy <djmuser_at_gmail.com> wrote:

> Hi:
>
> Is this what you're after?
>
> tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2)))
> ds\$int <- with(ds, cut(freq, c(0, tq)))
> with(ds, table(int))
>
> int
> (0,1] (1,2] (2,4] (4,7] (7,16]
> 10 6 7 6 6
>
> HTH,
> Dennis
>
> On Sat, May 14, 2011 at 9:42 AM, gj <gawesh_at_gmail.com> wrote:
> > Hi Jim,
> > Thanks very much for the code. I modified it a bit because I needed to
> > allocate the track ratings by userid (eg if user 1 plays track x once, he
> > gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and
> not
> > by track (sorry if this wasn't clear in my original post).
> >
> > This is almost working! What I can't get right at the moment is the
> cutoff
> > interval for the ratings.
> >
> > Sample data:
> >
> > userid,track,freq
> > 1,1,1
> > 1,2,10
> > 1,3,1
> > 1,4,1
> >
> > 1,5,15
> > 1,6,4
> > 1,7,16
> > 1,8,6
> > 1,9,1
> > 1,10,1
> > 1,11,2
> > 1,12,2
> > 1,13,1
> > 1,14,6
> > 1,15,7
> > 1,16,13
> > 1,17,3
> > 1,18,2
> > 1,19,5
> > 1,20,2
> > 1,21,2
> > 1,22,6
> > 1,23,4
> > 1,24,1
> > 1,25,1
> > 1,26,16
> > 1,27,4
> > 1,28,1
> > 1,29,4
> > 1,30,4
> > 1,31,4
> > 1,32,1
> > 1,33,14
> > 1,34,2
> > 1,35,7
> >
> >
> >>tq
> >
> >
> > \$'1'
> >
> >
> > 20% 40% 60% 80% 100%
> > 1 2 4 7 16
> >
> >
> > >From this distribution I would expect to distribute ratings as follows:
> > freq: <=1 <=2 <=4 <=7 <=16
> > rating: 1 2 3 4 5
> >
> > But my output is (the ratings are correct except at the cut off points):
> > "userid" "freq" "track" "rating"
> > 1 1 1 2
> > 1 10 2 5
> > 1 1 3 2
> > 1 1 4 2
> > 1 15 5 5
> > 1 4 6 4
> > 1 16 7 5
> > 1 6 8 4
> > 1 1 9 2
> > 1 1 10 2
> > 1 2 11 3
> > 1 2 12 3
> > 1 1 13 2
> > 1 6 14 4
> > 1 7 15 5
> > 1 13 16 5
> > 1 3 17 3
> > 1 2 18 3
> > 1 5 19 4
> > 1 2 20 3
> > 1 2 21 3
> > 1 6 22 4
> > 1 4 23 4
> > 1 1 24 2
> > 1 1 25 2
> > 1 16 26 5
> > 1 4 27 4
> > 1 1 28 2
> > 1 4 29 4
> > 1 4 30 4
> > 1 4 31 4
> > 1 1 32 2
> > 1 14 33 5
> > 1 2 34 3
> > 1 7 35 5
> >
> > This is the code:
> > # cheers jim holtman
> > sep=',')
> > # get the quantiles for each track
> > tq <- tapply(x\$freq,x\$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
> > # create a matrix with the rownames as the tracks to use in the
> findInterval
> > tqm <- do.call(rbind, tq)
> > #now put the ratings
> > require(data.table)
> > x.dt <- data.table(x)
> > x.new <- x.dt[,list(freq = freq,track=track,rating =
> > findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed =
> TRUE) +
> > 1L),by=userid]
> >
> > Regards
> > Gawesh
> >
> > On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholtman_at_gmail.com> wrote:
> >
> >> try this:
> >>
> >> > # create some data
> >> > x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '')
> >> + , track = rep(1:20, 20)
> >> + , freq = floor(runif(400, 10, 200))
> >> + , stringsAsFactors = FALSE
> >> + )
> >> > # get the quantiles for each track
> >> > tq <- tapply(x\$freq, x\$track, quantile, prob = c(.2, .4, .6, .8, 1))
> >> > # create a matrix with the rownames as the tracks to use in the
> >> findInterval
> >> > tqm <- do.call(rbind, tq)
> >> > # now put the ratings
> >> > require(data.table)
> >> > x.dt <- data.table(x)
> >> > x.new <- x.dt[,
> >> + list(userid = userid
> >> + , freq = freq
> >> + , rating = findInterval(freq
> >> + # use track as index into
> >> quantile matrix
> >> + , tqm[as.character(track[1L]),]
> >> + , rightmost.closed = TRUE
> >> + ) + 1L
> >> + )
> >> + , by = track]
> >> >
> >> track userid freq rating
> >> [1,] 1 u1 10 1
> >> [2,] 1 u2 15 1
> >> [3,] 1 u3 126 4
> >> [4,] 1 u4 117 3
> >> [5,] 1 u5 76 2
> >> [6,] 1 u6 103 3
> >> >
> >>
> >>
> >> On Sun, May 8, 2011 at 2:48 PM, gj <gawesh_at_gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I have a mysql table with fields userid,track,frequency e.g
> >> > u1,1,10
> >> > u1,2,100
> >> > u1,3,110
> >> > u1,4,200
> >> > u1,5,120
> >> > u1,6,130
> >> > .
> >> > u2,1,23
> >> > .
> >> > .
> >> > where "frequency" is the number of times a music track is played by a
> >> > "userid"
> >> >
> >> > I need to turn my 'frequency' table into a rating table (it's for a
> >> > recommender system). So, for each user, I need to categorise the
> >> frequency
> >> > of tracks played by quintile so that each particular track can have 5
> >> > ratings (1-5), with the ratings allocated as follows: inter-quintile
> >> range
> >> > 100-80% = rating 5, inter-quintile range 80-60% = rating 4,
> >> > ..., inter-quintile range 20-0% = rating 1)
> >> >
> >> > Hence, I need to create a table with fields userid,track,rating:
> >> > u1,1,1
> >> > u1,2, 3
> >> > ...
> >> >
> >> > Can anybody help me to do this with R?
> >> >
> >> > Regards
> >> > Gawesh
> >> >
> >> > [[alternative HTML version deleted]]
> >> >
> >> > ______________________________________________
> >> > R-help_at_r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>
> >>
> >>
> >> --
> >> Jim Holtman
> >> Data Munger Guru
> >>
> >> What is the problem that you are trying to solve?
> >>
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>

[[alternative HTML version deleted]]

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 14 May 2011 - 22:19:42 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 14 May 2011 - 23:40:07 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.