Re: [R] help with mysql and R: partitioning by quintile

From: gj <gawesh_at_gmail.com>
Date: Sun, 15 May 2011 00:28:17 +0100

Jim's suggestion did the trick:
tqm <- do.call(rbind, tq) + 0.001

head(x.new)     userid freq track rating
[1,]      1    1     1      1
[2,]      1   10     2      5
[3,]      1    1     3      1
[4,]      1    1     4      1
[5,]      1   15     5      5
[6,]      1    4     6      3


Dennis, what you suggested didn't work.
Thanks a lot guys! :-)
But before I can smile, I need to resolve a problem inherent in the data. When the play history lacks diversity (in terms of frequency), I want to assign a neutral rating of 3 (for my recommender system I use rating 1 is 'don't like' and 5 is 'i like').
Can that be done in R?

For example:
input:
userid,track,freq

1,1,1
1,2,1
1,3,1
1,4,1
1,5,2
1,6,2
1,7,1
1,8,1
1,9,1
1,10,1
1,11,2
1,12,2
1,13,1
1,14,1
1,15,1
1,16,1
1,17,2
1,18,2
1,19,2
1,20,2
1,21,2
1,22,1
1,23,1
1,24,1
1,25,1
1,26,1
1,27,1
1,28,1
1,29,1
1,30,1
1,31,1
1,32,1
1,33,1
1,34,1
1,35,1

gives output:

> head(x.new) userid freq track rating

[1,]      1    1     1      1
[2,]      1    1     2      1
[3,]      1    1     3      1
[4,]      1    1     4      1
[5,]      1    2     5      4
[6,]      1    2     6      4


Ideally I want to give a neutral rating in this case :

     userid freq track rating
[1,]      1    1     1      3
[2,]      1    1     2      3
[3,]      1    1     3      3
[4,]      1    1     4      3
[5,]      1    2     5      3
[6,]      1    2     6      3


Regards
Gawesh

On Sat, May 14, 2011 at 11:52 PM, jim holtman <jholtman_at_gmail.com> wrote:

> An easy way is to just offset the quantiles by a small increment so
> that boundary condition is less likely. If you change the line
>
> tqm <- do.call(rbind, tq) + 0.001
>
> in my example, that should do the trick.
>
> On Sat, May 14, 2011 at 6:09 PM, gj <gawesh_at_gmail.com> wrote:
> > Hi,
> > I think I haven't been able to explain correctly what I want. Here
> another
> > try:
> > Given that I have the following input:
> > userid,track,freq
> >
> > 1,1,1
> > 1,2,10
> > 1,3,1
> > 1,4,1
> > 1,5,15
> > 1,6,4
> > 1,7,16
> > 1,8,6
> > 1,9,1
> > 1,10,1
> > 1,11,2
> > 1,12,2
> > 1,13,1
> > 1,14,6
> > 1,15,7
> > 1,16,13
> > 1,17,3
> > 1,18,2
> > 1,19,5
> > 1,20,2
> > 1,21,2
> > 1,22,6
> > 1,23,4
> > 1,24,1
> > 1,25,1
> > 1,26,16
> > 1,27,4
> > 1,28,1
> > 1,29,4
> > 1,30,4
> > 1,31,4
> > 1,32,1
> > 1,33,14
> > 1,34,2
> > 1,35,7
> >
> > It is a sample of the history of tracks played: userid,track and
> frequency.
> > What I want is to convert the frequency into a rating scale (1-5) based
> on
> > the frequency at which a user has played a track, using the following
> > interquintile ranges for the cfd:
> > 0%-20% = rating 1, 20%-40% = rating 2, .... ,80%-100%=rating 5
> >
> > Jim kindly provided the following code:
> > # cheers jim holtman
> >>x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,
> > sep=',')
> >># get the quantiles for each user(we want the frequency distribution to
> be
> > based on user)
> >>tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
> >># create a matrix with the rownames as the tracks to use in the
> > findInterval
> >>tqm <- do.call(rbind, tq)
> >>#now put the ratings
> >>require(data.table)
> >>x.dt <- data.table(x)
> >>x.new <- x.dt[,list(freq = freq,track=track,rating =
> > findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed =
> TRUE) +
> > 1L),by=userid]
> >>head(x.new)
> >
> >
> > userid freq track rating
> > [1,] 1 1 1 2
> > [2,] 1 10 2 5
> > [3,] 1 1 3 2
> > [4,] 1 1 4 2
> > [5,] 1 15 5 5
> > [6,] 1 4 6 4
> >
> >
> > which is almost what I wanted except that the ratings are 1 point higher
> for
> > tracks where the frequency is at the cut-off points in the interquintile
> > range.
> > To illustrate the quintiles are:
> >
> >> tq$`1`
> > 20% 40% 60% 80% 100%
> > 1 2 4 7 16
> >
> >
> >
> > So, ideally I want (note the different ratings):
> >
> >
> > userid freq track rating
> > [1,] 1 1 1 1
> > [2,] 1 10 2 5
> > [3,] 1 1 3 1
> > [4,] 1 1 4 1
> > [5,] 1 15 5 5
> > [6,] 1 4 6 3
> >
> >
> > Can anybody help me? I'm new to R (as you have probably guessed). Sorry
> for
> > the long explanation.
> >
> > Regards
> > Gawesh
> >
> > On Sat, May 14, 2011 at 7:37 PM, Dennis Murphy <djmuser_at_gmail.com>
> wrote:
> >
> >> Hi:
> >>
> >> Is this what you're after?
> >>
> >> tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2)))
> >> ds$int <- with(ds, cut(freq, c(0, tq)))
> >> with(ds, table(int))
> >>
> >> int
> >> (0,1] (1,2] (2,4] (4,7] (7,16]
> >> 10 6 7 6 6
> >>
> >> HTH,
> >> Dennis
> >>
> >> On Sat, May 14, 2011 at 9:42 AM, gj <gawesh_at_gmail.com> wrote:
> >> > Hi Jim,
> >> > Thanks very much for the code. I modified it a bit because I needed to
> >> > allocate the track ratings by userid (eg if user 1 plays track x once,
> he
> >> > gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and
> >> not
> >> > by track (sorry if this wasn't clear in my original post).
> >> >
> >> > This is almost working! What I can't get right at the moment is the
> >> cutoff
> >> > interval for the ratings.
> >> > Any help please?
> >> >
> >> > Sample data:
> >> >
> >> > userid,track,freq
> >> > 1,1,1
> >> > 1,2,10
> >> > 1,3,1
> >> > 1,4,1
> >> >
> >> > 1,5,15
> >> > 1,6,4
> >> > 1,7,16
> >> > 1,8,6
> >> > 1,9,1
> >> > 1,10,1
> >> > 1,11,2
> >> > 1,12,2
> >> > 1,13,1
> >> > 1,14,6
> >> > 1,15,7
> >> > 1,16,13
> >> > 1,17,3
> >> > 1,18,2
> >> > 1,19,5
> >> > 1,20,2
> >> > 1,21,2
> >> > 1,22,6
> >> > 1,23,4
> >> > 1,24,1
> >> > 1,25,1
> >> > 1,26,16
> >> > 1,27,4
> >> > 1,28,1
> >> > 1,29,4
> >> > 1,30,4
> >> > 1,31,4
> >> > 1,32,1
> >> > 1,33,14
> >> > 1,34,2
> >> > 1,35,7
> >> >
> >> >
> >> >>tq
> >> >
> >> >
> >> > $'1'
> >> >
> >> >
> >> > 20% 40% 60% 80% 100%
> >> > 1 2 4 7 16
> >> >
> >> >
> >> > >From this distribution I would expect to distribute ratings as
> follows:
> >> > freq: <=1 <=2 <=4 <=7 <=16
> >> > rating: 1 2 3 4 5
> >> >
> >> > But my output is (the ratings are correct except at the cut off
> points):
> >> > "userid" "freq" "track" "rating"
> >> > 1 1 1 2
> >> > 1 10 2 5
> >> > 1 1 3 2
> >> > 1 1 4 2
> >> > 1 15 5 5
> >> > 1 4 6 4
> >> > 1 16 7 5
> >> > 1 6 8 4
> >> > 1 1 9 2
> >> > 1 1 10 2
> >> > 1 2 11 3
> >> > 1 2 12 3
> >> > 1 1 13 2
> >> > 1 6 14 4
> >> > 1 7 15 5
> >> > 1 13 16 5
> >> > 1 3 17 3
> >> > 1 2 18 3
> >> > 1 5 19 4
> >> > 1 2 20 3
> >> > 1 2 21 3
> >> > 1 6 22 4
> >> > 1 4 23 4
> >> > 1 1 24 2
> >> > 1 1 25 2
> >> > 1 16 26 5
> >> > 1 4 27 4
> >> > 1 1 28 2
> >> > 1 4 29 4
> >> > 1 4 30 4
> >> > 1 4 31 4
> >> > 1 1 32 2
> >> > 1 14 33 5
> >> > 1 2 34 3
> >> > 1 7 35 5
> >> >
> >> > This is the code:
> >> > # cheers jim holtman
> >> >
> x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,
> >> > sep=',')
> >> > # get the quantiles for each track
> >> > tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
> >> > # create a matrix with the rownames as the tracks to use in the
> >> findInterval
> >> > tqm <- do.call(rbind, tq)
> >> > #now put the ratings
> >> > require(data.table)
> >> > x.dt <- data.table(x)
> >> > x.new <- x.dt[,list(freq = freq,track=track,rating =
> >> > findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed =
> >> TRUE) +
> >> > 1L),by=userid]
> >> >
> >> > Regards
> >> > Gawesh
> >> >
> >> > On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholtman_at_gmail.com>
> wrote:
> >> >
> >> >> try this:
> >> >>
> >> >> > # create some data
> >> >> > x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '')
> >> >> + , track = rep(1:20, 20)
> >> >> + , freq = floor(runif(400, 10, 200))
> >> >> + , stringsAsFactors = FALSE
> >> >> + )
> >> >> > # get the quantiles for each track
> >> >> > tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8,
> 1))
> >> >> > # create a matrix with the rownames as the tracks to use in the
> >> >> findInterval
> >> >> > tqm <- do.call(rbind, tq)
> >> >> > # now put the ratings
> >> >> > require(data.table)
> >> >> > x.dt <- data.table(x)
> >> >> > x.new <- x.dt[,
> >> >> + list(userid = userid
> >> >> + , freq = freq
> >> >> + , rating = findInterval(freq
> >> >> + # use track as index into
> >> >> quantile matrix
> >> >> + ,
> tqm[as.character(track[1L]),]
> >> >> + , rightmost.closed = TRUE
> >> >> + ) + 1L
> >> >> + )
> >> >> + , by = track]
> >> >> >
> >> >> > head(x.new)
> >> >> track userid freq rating
> >> >> [1,] 1 u1 10 1
> >> >> [2,] 1 u2 15 1
> >> >> [3,] 1 u3 126 4
> >> >> [4,] 1 u4 117 3
> >> >> [5,] 1 u5 76 2
> >> >> [6,] 1 u6 103 3
> >> >> >
> >> >>
> >> >>
> >> >> On Sun, May 8, 2011 at 2:48 PM, gj <gawesh_at_gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I have a mysql table with fields userid,track,frequency e.g
> >> >> > u1,1,10
> >> >> > u1,2,100
> >> >> > u1,3,110
> >> >> > u1,4,200
> >> >> > u1,5,120
> >> >> > u1,6,130
> >> >> > .
> >> >> > u2,1,23
> >> >> > .
> >> >> > .
> >> >> > where "frequency" is the number of times a music track is played by
> a
> >> >> > "userid"
> >> >> >
> >> >> > I need to turn my 'frequency' table into a rating table (it's for a
> >> >> > recommender system). So, for each user, I need to categorise the
> >> >> frequency
> >> >> > of tracks played by quintile so that each particular track can have
> 5
> >> >> > ratings (1-5), with the ratings allocated as follows:
> inter-quintile
> >> >> range
> >> >> > 100-80% = rating 5, inter-quintile range 80-60% = rating 4,
> >> >> > ..., inter-quintile range 20-0% = rating 1)
> >> >> >
> >> >> > Hence, I need to create a table with fields userid,track,rating:
> >> >> > u1,1,1
> >> >> > u1,2, 3
> >> >> > ...
> >> >> >
> >> >> > Can anybody help me to do this with R?
> >> >> >
> >> >> > Regards
> >> >> > Gawesh
> >> >> >
> >> >> > [[alternative HTML version deleted]]
> >> >> >
> >> >> > ______________________________________________
> >> >> > R-help_at_r-project.org mailing list
> >> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> > PLEASE do read the posting guide
> >> >> http://www.R-project.org/posting-guide.html
> >> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jim Holtman
> >> >> Data Munger Guru
> >> >>
> >> >> What is the problem that you are trying to solve?
> >> >>
> >> >
> >> > [[alternative HTML version deleted]]
> >> >
> >> > ______________________________________________
> >> > R-help_at_r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>

        [[alternative HTML version deleted]]



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 14 May 2011 - 23:33:26 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 15 May 2011 - 13:10:06 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive