From: Aamir M <intuitionist_at_gmail.com>

Date: Wed 01 Jun 2005 - 06:07:11 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Wed Jun 01 06:10:46 2005


Martin> there is no 'm' in the book there, but they talk about the
Martin> exponent "^ 2" used in some places {but "^ 1" in other places},
Martin> notably in 5.2 "Why did we choose FANNY?"

Martin> There is no "fuzziness" parameter defined there, so can you be
Martin> more specific?

Martin> Is it the exponent 2 in u_{jv}^2 ?
Martin> That one is currently fixed at 2, and yes, that could be made a
Martin> parameter though K & R warn against going all the way to "1"
Martin> where their algorithm can happen to converge very slowly.

Yes, that is what I am referring to. If you refer to equation (1) in section 4.1 of K&R (1990), where the FANNY objective function is defined, you can see that the membership values are all raised to the power 2. That choice of exponent is arbitrary; its value should instead be a user-specified parameter. In Fuzzy k-Means this exponent is called the "m parameter" or the "fuzziness parameter".

Now that you mention it, I see that K&R did in fact comment on this in section 5.2. K&R say that setting m=1 will cause slower convergence; in Fuzzy k-Means, setting m=1 will cause a hard clustering (minimum fuzziness), and setting m=infinity will cause maximum fuzziness (i.e. all cluster membership values will be equal to 1/k). They go on to say that "exponents equal to 2 seem to be a reasonable choice, as is confirmed by actual clustering analyses." I do not know about FANNY, but in Fuzzy k-Means, studies have shown that values of the exponent between 1 and 2 can lead to better results than the rather arbitrary choice of m=2.
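To make the effect of m concrete, here is a minimal sketch of the standard Fuzzy k-Means membership formula for a single point, given its distances to the k cluster centers. It is written in plain Python for illustration only; it is not FANNY and not code from the cluster package, and the function name `fkm_memberships` and the example distances are made up for this sketch.

```python
def fkm_memberships(dists, m):
    """Fuzzy k-Means memberships of one point, given its distances to the k centers.

    Uses u_v = d_v^(-2/(m-1)) / sum_w d_w^(-2/(m-1)), which is defined for m > 1.
    """
    if m <= 1.0:
        raise ValueError("m must be > 1; m -> 1 is the hard-clustering limit")
    dmin = min(dists)
    if dmin == 0.0:
        # The point coincides with one or more centers: share membership among them.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    p = 2.0 / (m - 1.0)
    # Ratios dmin/d lie in (0, 1], so the power cannot overflow when m is close to 1.
    weights = [(dmin / d) ** p for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical distances from one point to k = 3 cluster centers.
dists = [1.0, 2.0, 3.0]
for m in (1.01, 1.5, 2.0, 1000.0):
    print(m, [round(u, 4) for u in fkm_memberships(dists, m)])
```

As m approaches 1, nearly all of the membership goes to the nearest center; as m grows, the memberships flatten towards 1/k.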

Aamir> Is there, then, any way to compute the FANNY
Aamir> clustering membership values of a test data point
Aamir> without affecting the clustering membership values of
Aamir> the training data? Perhaps there are enough
Aamir> conditions to use the objective function as a way of
Aamir> computing the membership values of the test data?

Martin> That's an interesting proposal, at least the way I choose to understand you :-)

Martin> Yes, why not look at the objective function C {eq.(1), p.182}

Martin> One could think of optimizing it with respect to new data only,
Martin> by keeping all "old data" memberships.
Martin> For that to work, one would need the n dissimilarities
Martin>    d[i', j]  where  i'         : `index for' new data
Martin>                     j = 1,..,n : indices for training data.
Martin> Is this feasible in your situation?

Yes, this would be feasible, I think. If I understand it correctly, this would just involve recomputing the DAISY dissimilarity matrix on the combined set of both training data and test data. It seems that the resulting optimization problem would also be uniquely solvable.
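As a sketch of what this restricted problem might look like for a single new observation i', assume the exponent stays fixed at 2 and that d(i', i') = 0. The symbols A_v, B_v, S_v and w_v are shorthand introduced here for illustration; they are not notation from K&R.

```latex
% FANNY objective on the n training observations (exponent fixed at 2):
\[
  C \;=\; \sum_{v=1}^{k}
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} u_{iv}^{2}\, u_{jv}^{2}\, d(i,j)}
         {2 \sum_{j=1}^{n} u_{jv}^{2}}
\]

% Quantities that depend only on the fixed training memberships
% and on the n new dissimilarities d(i', j):
\[
  A_v = \sum_{i=1}^{n}\sum_{j=1}^{n} u_{iv}^{2}\, u_{jv}^{2}\, d(i,j), \qquad
  B_v = \sum_{j=1}^{n} u_{jv}^{2}, \qquad
  S_v = \sum_{j=1}^{n} u_{jv}^{2}\, d(i', j)
\]

% Objective after adding i' with unknown memberships w_1, ..., w_k.
% The cross terms (i', j) and (j, i') each contribute w_v^2 S_v,
% and the (i', i') term vanishes because d(i', i') = 0.
\[
  C(w) \;=\; \sum_{v=1}^{k} \frac{A_v + 2\, w_v^{2}\, S_v}{2\,\bigl(B_v + w_v^{2}\bigr)}
  \qquad \text{subject to} \quad w_v \ge 0, \quad \sum_{v=1}^{k} w_v = 1
\]
```

With the training memberships held fixed, A_v, B_v and S_v are constants, so only the k values w_v remain to be determined, and only the n dissimilarities d(i', j) are needed as new input.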

Martin> Alternatively, when we *did* assume ``all continuous'' data
Martin> *and* the use of simple Euclidean distances,
Martin> we could easily compute the cluster centers, determine (by
Martin> minimization!) memberships for new observations.

The problem of "predicting" fuzzy cluster memberships for new data appears to be much simpler in Euclidean space; one could just compare the new data to the cluster centers computed in Fuzzy k-Means. Unfortunately, the data I'm working with is not all continuous.
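For the all-continuous, Euclidean case, comparing a new observation to the cluster centers could be sketched roughly as follows. This is a plain-Python illustration of the usual Fuzzy k-Means membership formula with the centers held fixed; it is not part of fanny(), and the function name `predict_memberships` and the example centers are hypothetical.

```python
import math


def predict_memberships(x, centers, m=2.0):
    """Memberships of one new point x in k clusters whose centers are held fixed.

    The training data and their memberships are never touched; only the
    Euclidean distances from x to the centers are used.
    """
    if m <= 1.0:
        raise ValueError("m must be > 1")
    dists = [math.dist(x, c) for c in centers]
    dmin = min(dists)
    if dmin == 0.0:
        # x sits exactly on one or more centers: share membership among them.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    p = 2.0 / (m - 1.0)
    # Scale by the smallest distance so the powers stay within (0, 1].
    weights = [(dmin / d) ** p for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical centers, as if taken from an earlier Fuzzy k-Means fit.
centers = [(0.0, 0.0), (4.0, 0.0)]
print(predict_memberships((1.0, 0.0), centers, m=2.0))
```

Because the function only reads the centers, the memberships of the training data are unaffected by scoring new points.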

Martin> In any case that needs some assumptions (and code!) currently
Martin> not part of fanny().

I'll have to work on this. Though I'm guessing fanny() is written in FORTRAN, which I cannot (yet) program in.

- Aamir


This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:32:17 EST