Re: [R] cluster

From: Weiwei Shi <>
Date: Wed 27 Jul 2005 - 05:38:07 EST

Dear Chris:

You are right, and it IS too general. I think I should have asked something like "what kinds of clustering algorithms or functions are available in R", which might be easier to answer. But for that, I can probably google or use help() in R to find out. I want to know more about how clustering performs on this kind of problem, and I hope someone who has faced a similar situation or problem can share their experience. And I will share my experience later :)

As to the reason for using down-sampling here: it is one of the straightforward ways to deal with an imbalanced-data classification problem. In my understanding of classification problems, two things, among others, are important: feature construction/selection and sample selection. I had an idea (which others may already have discovered) that finding the best subset of features for clustering (to get the largest inter-cluster dissimilarity and the largest intra-cluster similarity) might help the subsequent classification step. I quickly read through the abstract of your paper, and I think your approach applies feature selection (use p instead of n), while in my proposal I would like to try both.
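To make the sample-selection idea concrete, here is a minimal sketch of medoid-based down-sampling: cluster the majority class and keep only the cluster medoids, which are actual rows of the data. It is written in plain Python (standard library only) so it stays self-contained; in R, pam/clara in package cluster follow the same medoid idea. The function names `k_medoids` and `downsample_majority`, the Euclidean distance, the first-k initialisation, and the toy data are all my own assumptions for illustration, and this simple alternating scheme is not the PAM algorithm itself.

```python
import math


def k_medoids(points, k, dist=math.dist, max_iter=100):
    """Return the indices of k medoids found by a simple alternating scheme.

    Assumption: medoids are initialised to the first k points, so the
    result is deterministic but can depend on the order of the rows.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Precompute all pairwise distances once.
    d = [[dist(p, q) for q in points] for p in points]

    medoids = list(range(k))
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        # A medoid always stays in its own cluster, so no cluster is empty.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = i if i in clusters else min(medoids, key=lambda m: d[i][m])
            clusters[nearest].append(i)

        # Update step: within each cluster, pick the member with the
        # smallest total distance to the other members.
        new_medoids = sorted(
            min(members, key=lambda c: sum(d[c][j] for j in members))
            for members in clusters.values()
        )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def downsample_majority(majority, k):
    """Keep only the k medoid rows of the majority class."""
    return [majority[i] for i in k_medoids(majority, k)]


# Toy majority class (made-up data): two well-separated groups.
majority = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(downsample_majority(majority, 2))  # prints [(0, 0), (10, 10)]
```

The reduced majority sample would then be combined with the untouched minority class before fitting the decision tree. The distance function is passed in as a parameter because, as discussed, the choice of distance is probably the key decision.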

Thanks for any further advice!


On 7/26/05, Christian Hennig <> wrote:
> Dear Weiwei,
> your question sounds a bit too general and complicated for the R-list.
> Perhaps you should look for personal statistical advice.
> The quality of methods (and especially distance choice) for down-sampling
> certainly depends on the structure of the data set. I do not see at the moment why
> you need any down-sampling at all, and you should find out first if and
> why it's a good thing to do (by whatever method).
> An obvious candidate for a clustering algorithm would be pam/clara in
> package cluster, because this approach chooses points already in the data
> set as cluster centroids (and produces therefore a proper subsample),
> which does not apply to most other clustering methods.
> However, in
> C. Hennig and L. J. Latecki: The choice of vantage objects for image
> retrieval. Pattern Recognition 36 (2003), 2187-2196.
> the clustering approach has been clearly outperformed by some stepwise
> selection approaches for down-sampling - admittedly in a different kind of
> problem, but I think that the reasons for this may apply also to your
> situation.
> You can compare different clusterings (or choices of a subset) by
> cross-validation or
> bootstrap applied to the resulting decision tree in the classification
> problem.
> Best,
> Christian
> On Mon, 25 Jul 2005, Weiwei Shi wrote:
> > Dear listers:
> >
> > Here I have a question on clustering methods available in R. I am
> > trying to down-sample the majority class in a classification problem
> > on an imbalanced dataset. Since I don't want to lose information in
> > the original dataset, I don't want to use naive down-sampling: I think
> > using clustering on the majority class' side to select
> > "representative" samples might help. So, my question is, which
> > clustering method should be tested to get the best result. I think the
> > key thing might be the selection of "distance" considering the next
> > step in which I would like to use decision trees.
> >
> > Please share your experience in using clustering (any available
> > implementation outside R is also welcome).
> >
> > weiwei
> > --
> > Weiwei Shi, Ph.D
> >
> > "Did you always know?"
> > "No, I did not. But I believed..."
> > ---Matrix III
> >
> > ______________________________________________
> > mailing list
> >
> > PLEASE do read the posting guide!
> >
> *** NEW ADDRESS! ***
> Christian Hennig
> University College London, Department of Statistical Science
> Gower St., London WC1E 6BT, phone +44 207 679 1698

Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

Received on Wed Jul 27 05:47:10 2005
