From: Weiwei Shi <helprhelp_at_gmail.com>

Date: Fri 05 Aug 2005 - 05:13:14 EST

Dear listers:

I have an idea for outlier detection and would like to implement it
in R first. I hope to get some input from all the gurus here.

I have chosen a distance-based approach:

step 1:

Calculate the distance between every pair of rows of a data frame. To
account for the different scales of the variables, I chose the
Mahalanobis distance, using the variances as the scaling factors.
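A minimal sketch of step 1, assuming that "using variance as scaler" means dividing each squared coordinate difference by that variable's variance. This is the same as a Mahalanobis distance with a diagonal covariance matrix, and it can be obtained by rescaling the columns and calling dist(). The data matrix X and its column names are made up for illustration.

```r
# Hypothetical data: 6 observations on 3 variables with very different scales
set.seed(42)
X <- cbind(a = rnorm(6), b = rnorm(6, sd = 100), c = runif(6, 0, 0.01))

# Divide each column by its standard deviation, so that squared
# differences end up divided by the variance
Z <- sweep(X, 2, apply(X, 2, sd), "/")

# Euclidean distance on the rescaled data = variance-scaled distance
D <- as.matrix(dist(Z))
```

D[i, j] then holds the scaled distance between rows i and j. Correlations between variables are ignored in this version.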

step 2:

Let k be the number of points in one "cluster". k is chosen by
answering the following question: how many neighbors does a point need
in order not to be an outlier?

For each point, take the (k-1) smallest distances from step 1, then keep the largest of those (k-1) distances as the value for that point.
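Step 2 could be sketched as follows. The data, the planted far-away point, and the choice k = 5 are assumptions for illustration only; the largest of the (k-1) smallest distances is simply the distance to the (k-1)-th nearest neighbor.

```r
# Hypothetical data: 100 bivariate normal points plus one planted far-away point
set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2), c(8, 8))

# Pairwise distances on standardized columns (stand-in for step 1)
D <- as.matrix(dist(scale(X)))
diag(D) <- Inf                 # a point is not its own neighbor

k <- 5                         # assumed cluster size
# For each row: the largest of the (k-1) smallest distances
knn_score <- unname(apply(D, 1, function(d) sort(d)[k - 1]))
```

Points with few close neighbors get a large knn_score; here the planted point (row 101) should get the largest one.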

step 3:

Look at the distribution of those maxima over all the points. The
multivariate problem thus becomes a univariate one, and the outliers
among the maxima identify the outlying points.
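One possible way to do step 3, as a sketch only: the step does not say which univariate rule to use, so the boxplot rule (upper quartile plus 1.5 times the interquartile range) is an assumption here, and the scores are simulated stand-ins for the maxima from step 2.

```r
# Hypothetical scores: 100 ordinary values plus one clearly extreme value
set.seed(1)
score <- c(rexp(100), 25)

# Boxplot-style upper fence (assumed rule, not the only option)
upper <- quantile(score, 0.75) + 1.5 * IQR(score)

# Indices of the points whose score lies above the fence
outliers <- which(score > upper)
```

Only the upper tail is checked, since a small maximum means a point has close neighbors.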

My questions are:

1. I don't know whether the Mahalanobis distance is appropriate, since
most clustering algorithms implemented in R (like pam or clara) use
Euclidean or Manhattan distance.

2. Is there a way to get the Mahalanobis distance matrix for every
pair of rows of a data frame or matrix?

3. My approach allows a point to belong to more than one
k-cluster. Is there a similar algorithm in R, or a published one?
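Regarding question 2, one way that seems to work is sketched below, assuming the full sample covariance matrix is wanted rather than the variances alone: transform the data with the Cholesky factor of the covariance and take ordinary Euclidean distances with dist(). The data matrix X is made up; mahalanobis() from the stats package returns squared distances and is used only as a check.

```r
# Hypothetical data: 20 observations on 3 variables
set.seed(42)
X <- matrix(rnorm(60), ncol = 3)

S <- cov(X)              # sample covariance matrix
R <- chol(S)             # upper triangular, S = t(R) %*% R
Z <- X %*% solve(R)      # transformed ("whitened") rows

# Euclidean distances between rows of Z are Mahalanobis distances
# between the corresponding rows of X
D <- as.matrix(dist(Z))

# Check one pair against stats::mahalanobis(), which returns the squared distance
check <- as.numeric(mahalanobis(X[1, ], X[2, ], S))
```

For a data frame, as.matrix() on its numeric columns would be needed first.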

Thanks for any suggestions,

weiwei

-- Weiwei Shi, Ph.D

Received on Fri Aug 05 05:34:27 2005

"Did you always know?"

"No, I did not. But I believed..."

---Matrix III

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
