From: John Dennison <dennison.john_at_gmail.com>

Date: Sun, 06 Mar 2011 17:34:12 -0500

...

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sun 06 Mar 2011 - 23:29:23 GMT


So there are a couple of parts to this question. I am trying to apply the
rpart/random forest algorithms to transaction lists. That is to say, I am
trying to train models to deduce which transactions in a customer's history
are the most predictive, in order to apply the model to future test data and
identify accounting irregularities (i.e. this account had x and y, so it
should also have had z). I have used the arules package with some success,
but its output does not identify which independent transactions are most
telling, just rates of co-occurrence (i.e. x appears with y 75% of the time,
not that x and y OR a and b should also have z). Independent transaction
groupings of this form are potentially very meaningful.
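
For reference, the co-occurrence rate described above is what association-rule mining calls confidence: the share of customers holding the left-hand items who also hold the right-hand item. A minimal base R sketch on a made-up four-customer matrix (the items x, y, z and the helper name confidence are illustrative assumptions, not taken from the real data):

```r
# Toy incidence matrix: one row per customer, one column per item (made-up data).
m <- rbind(
  c(x = TRUE, y = TRUE,  z = TRUE),
  c(x = TRUE, y = TRUE,  z = TRUE),
  c(x = TRUE, y = TRUE,  z = FALSE),
  c(x = TRUE, y = FALSE, z = FALSE)
)

# Confidence of the rule lhs => rhs: among rows containing every lhs item,
# the fraction that also contain the rhs item.
confidence <- function(m, lhs, rhs) {
  has_lhs <- rowSums(m[, lhs, drop = FALSE]) == length(lhs)
  sum(has_lhs & m[, rhs]) / sum(has_lhs)
}

conf_x_y  <- confidence(m, "x", "y")          # x => y
conf_xy_z <- confidence(m, c("x", "y"), "z")  # {x, y} => z
print(c(x_to_y = conf_x_y, xy_to_z = conf_xy_z))
```

This only scores one rule at a time; it does not say which of several alternative item groupings is the most predictive of z, which is the gap a tree model is meant to fill.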

OK, now to the actual R questions.

I can load my transaction lists, which look like this, using the read.transactions function in arules:

cust1 | 2
cust1 | 3
cust1 | 5
cust2 | 5
cust2 | 3
cust3 | 2

...

# read the data into a sparse binary transaction matrix
library(arules)
txn <- read.transactions(file = "tranaction_list.txt", rm.duplicates = TRUE,
                         format = "single", sep = "|", cols = c(1, 2))

# transaction matrix to matrix
a <- as(txn, "matrix")

# matrix to data.frame
b <- as.data.frame(a)

I end up with a data.frame like:

X X.1 X.2 X.3 X.4 X.5 ...

cust1 0 1 1 0 1

cust2 0 0 1 0 1

cust3 0 1 0 0 0

...

However, as.data.frame(a) turns the matrix into a numeric data.frame, so when I run rpart it automatically treats the columns as continuous when it builds the classification tree.

Calling rpart like this:

library(rpart)
names <- colnames(b)
tree_X.9911 <- rpart(X.9911 ~ .,
                     data = b[, names],
                     method = "class")

returns:

1) root 20000 625 0 (0.96875000 0.03125000)
  2) X.9342< 0.5 19598 311 0 (0.98413103 0.01586897) *
  3) X.9342>=0.5 402 88 1 (0.21890547 0.78109453)
    6) X.9984>=0.5 81 7 0 (0.91358025 0.08641975) *
    7) X.9984< 0.5 321 14 1 (0.04361371 0.95638629)
      14) X.9983>=0.5 14 0 0 (1.00000000 0.00000000) *
      15) X.9983< 0.5 307 0 1 (0.00000000 1.00000000)
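
The pairs of class probabilities in parentheses suggest that the fit above is already a classification tree. With 0/1 predictors, a numeric split at 0.5 separates exactly the same rows as a yes/no split, so a numeric coding and a factor coding should lead to the same predictions. A small sketch on simulated data to check this (the columns x, y, z are hypothetical stand-ins for item columns such as X.9342, and the rule generating z is made up):

```r
library(rpart)

set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.5)
# Hypothetical rule: z occurs only when both x and y occur.
z <- factor(as.integer(x == 1 & y == 1), levels = c(0, 1))

# Same data twice: numeric 0/1 predictors versus factor predictors.
d_num <- data.frame(x = x, y = y, z = z)
d_fac <- data.frame(x = factor(x, levels = c(0, 1)),
                    y = factor(y, levels = c(0, 1)),
                    z = z)

fit_num <- rpart(z ~ ., data = d_num, method = "class")
fit_fac <- rpart(z ~ ., data = d_fac, method = "class")

# The fitted object records which kind of tree was built.
print(fit_num$method)

# Cross-tabulate the class predictions from the two codings.
pred_num <- predict(fit_num, d_num, type = "class")
pred_fac <- predict(fit_fac, d_fac, type = "class")
print(table(numeric = pred_num, factor = pred_fac))
```

If the off-diagonal cells of the table are zero, the two codings agree on every row, and the conversion to factors mainly changes how the splits are printed.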

I understand that rpart would approach the numeric columns with a regression approach, but is there any way to force it to view them as logical (yes/no or T/F) codes? I can't successfully transform the data.frame to a factor. I tried:

b_factor<-as.factor(b)

Error in sort.list(y) :

'x' must be atomic for 'sort.list'

Have you called 'sort' on a list?
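
The error arises because a data.frame is a list of columns, while as.factor() expects a single atomic vector. One common workaround is to convert column by column with lapply(). A minimal sketch on a toy data.frame (the three columns, the customer row names, and the no/yes labels are illustrative assumptions):

```r
# Toy stand-in for the 0/1 data.frame built from the transaction matrix.
b <- data.frame(X.1 = c(1, 0, 1),
                X.2 = c(1, 1, 0),
                X.3 = c(0, 0, 0),
                row.names = c("cust1", "cust2", "cust3"))

# Assigning into b_factor[] keeps the data.frame shape and row names
# while replacing every column with a two-level factor.
b_factor <- b
b_factor[] <- lapply(b, factor, levels = c(0, 1), labels = c("no", "yes"))

str(b_factor)
```

Fixing the levels explicitly keeps both levels on every column, even one such as X.3 where only a single value occurs.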

Furthermore, I am afraid that if I am ever successful I will compound my memory problems. 20,000 rows by 3,000 columns (a substantial subset of the total training data) is already causing my 8 GB Linux box to moan. Does a factor/logical column take up more room than a numeric column populated by 1s and 0s? I remember reading that R stores its factors as numeric anyway. I know that more precise variable selection could reduce my memory usage, but that is what rpart is good at: highlighting the most meaningful/predictive variables, so that I may be able to work with larger numbers of customers by removing uninformative variables.
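
A quick way to compare storage is object.size(). R stores doubles in 8 bytes per element and integers, logicals, and factor codes in 4, so a factor or logical column should not be larger than a numeric (double) column of the same length. A sketch on a made-up 0/1 vector:

```r
n <- 1e6
dbl <- rep(c(0, 1), length.out = n)   # numeric (double) 0/1 values
int <- as.integer(dbl)                # the same values as integers
lgl <- as.logical(dbl)                # the same values as TRUE/FALSE
fac <- factor(dbl, levels = c(0, 1))  # the same values as a two-level factor

# Size in bytes of each representation.
sizes <- sapply(list(double = dbl, integer = int, logical = lgl, factor = fac),
                function(v) as.numeric(object.size(v)))
print(sizes)
```

At 20,000 x 3,000 cells, that works out to roughly 480 MB as doubles versus roughly 240 MB as integers, logicals, or factors, before counting any temporary copies made during conversion or model fitting.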

My final question: is there a better approach to loading the data? rpart only works on a data.frame (to the best of my knowledge). How can one coerce a list into a form where predictive models can be applied?
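
One possible alternative, sketched here with base R only rather than arules, is to read the customer/item pairs with read.table() and cross-tabulate them into a customer-by-item data.frame of factors. The inline text mirrors the sample rows shown earlier; the column names cust and item, the item prefix, and the no/yes labels are assumptions made for illustration:

```r
# Sample pairs in the same "customer | item" layout as the transaction file.
raw <- "cust1 | 2
cust1 | 3
cust1 | 5
cust2 | 5
cust2 | 3
cust3 | 2"

pairs <- read.table(text = raw, sep = "|", strip.white = TRUE,
                    col.names = c("cust", "item"), stringsAsFactors = FALSE)

# Cross-tabulate customers against items; a count above zero means the
# customer has that item, so duplicate pairs collapse automatically.
inc <- unclass(table(pairs$cust, pairs$item)) > 0

# One row per customer, one yes/no factor column per item.
b <- as.data.frame(inc)
names(b) <- paste0("item", colnames(inc))  # syntactic names for use in formulas
b[] <- lapply(b, factor, levels = c(FALSE, TRUE), labels = c("no", "yes"))

print(b)
```

This builds a dense table, so it does not by itself ease the memory pressure; it only avoids the intermediate numeric matrix.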

I am new to the world of R, and to data mining for that matter. I am loving the diversity of applications and would appreciate any help.

Many Thanks,

John



Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Mon 07 Mar 2011 - 09:50:19 GMT.
