[R] Question on RandomForest in unsupervised mode

From: Irilenia Nobeli <irilenia.nobeli_at_kcl.ac.uk>
Date: Wed, 06 Jun 2007 17:27:23 +0100


Hi,

I tried running the randomForest() function on a dataset without predefined classes. According to the manual, calling randomForest() without a response variable (class labels) makes the function assume that you are running in unsupervised mode. As I understand it, my data are then all assigned to one class, while a synthetic data set is generated and assigned to a second class. The online manual suggests that an OOB misclassification error of ~40% or more in this two-class problem would indicate that the x-variables look like independent variables to random forests (and I assume that in this case the proximities obtained from randomForest() would not be informative for clustering).
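The two-class construction described above can be sketched explicitly. This is a minimal illustration only, not the package's internal code: it assumes the synthetic class is built by sampling each column independently from its own observed values, and the names x, synth, xx, yy, rf, oob_error and orig_votes are hypothetical. Fitting an ordinary supervised forest to the combined data keeps the confusion matrix and OOB error available for inspection.

```r
library(randomForest)

set.seed(1)
x <- iris[, -5]   # example data, as used later in this post
n <- nrow(x)

# Assumption: the synthetic class is made by sampling every column
# independently (with replacement) from its observed values, which
# removes any dependence between the variables.
synth <- as.data.frame(lapply(x, function(col) sample(col, n, replace = TRUE)))

# Stack the original and synthetic rows and label the two classes.
xx <- rbind(x, synth)
yy <- factor(rep(c("original", "synthetic"), each = n))

# Ordinary supervised two-class forest on the combined data.
rf <- randomForest(xx, yy, ntree = 500)

rf$confusion                                # OOB confusion matrix
oob_error <- rf$err.rate[rf$ntree, "OOB"]   # OOB error after the last tree
orig_votes <- rf$votes[seq_len(n), ]        # vote fractions for the original rows
```

An oob_error close to 0.5 would mean the forest can barely tell the original rows from the synthetic ones, which is the situation the manual describes for independent-looking variables.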

When I run randomForest() in unsupervised mode, my first problem is that I get NULL entries for the confusion matrix and the err.rate, but I suppose this is normal behaviour. My only information (apart from the proximities, of course) seems to be the votes, from which I can deduce whether the variables are meaningful or not. The second problem is that, in my case, all my observations seem to get about 20-40% of their votes from class 1 and the rest from class 2 (i.e. class 2 always "wins"). Assuming that class 1 was assigned to my original data, I'd say this is rather surprising. Initially I thought this was simply a problem of my data not being meaningful, but I repeated the forest with the "iris" example data and got more or less the same result. I simply ran:

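library(randomForest)
iris.urf <- randomForest(iris[, -5])  # no response given: unsupervised mode
iris.urf$votes                        # per-observation votes for class 1 and class 2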

and again most of the votes came from class 2, although here the vote percentages are slightly more balanced than with my data (approximately 40 to 60% most of the time).
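To make the observation above easier to reproduce, here is a small sketch of how the unsupervised fit could be inspected. It assumes the randomForest package is installed; set.seed() and the name mds are additions for illustration, and the use of cmdscale() on 1 - proximity is just one possible way to look at the proximities, not something the original run did.

```r
library(randomForest)

set.seed(1)
iris.urf <- randomForest(iris[, -5], proximity = TRUE)

# In unsupervised mode these components are reported as NULL.
is.null(iris.urf$confusion)
is.null(iris.urf$err.rate)

# Average vote fraction per class over all observations.
colMeans(iris.urf$votes)

# Treat 1 - proximity as a dissimilarity and project it to two dimensions.
mds <- cmdscale(1 - iris.urf$proximity, k = 2)
plot(mds, col = as.integer(iris$Species), xlab = "Dim 1", ylab = "Dim 2")
```

If the proximities carry useful structure, the three species should separate at least partly in the plot, even though the species labels were never given to the forest.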

Has anyone got experience with unsupervised randomForest() in R and can explain to me why I'm observing this behaviour? In general, any hints about pitfalls regarding random forests in unsupervised mode would be very much appreciated.

Many thanks in advance,

Irilenia



Irilenia (Irene) Nobeli
Randall Division of Cell and Molecular Biophysics
New Hunt's House (room 3.14)
King's College London, Guy's Campus
London, SE1 1UL
U.K.
irilenia.nobeli_at_kcl.ac.uk
+44(0)207-8486329

R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 06 Jun 2007 - 16:45:58 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 06 Jun 2007 - 17:31:45 GMT.
