From: Richard A. O'Keefe <ok_at_cs.otago.ac.nz>

Date: Fri 09 Dec 2005 - 09:43:31 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri Dec 09 09:48:14 2005


I am trying to automatically construct a distance function from
a training set in order to use it to cluster another data set.
The variables are nominal. One variable is a "class" variable
having two values; it is kept separate from the others.

I have a method which constructs a distance matrix for the levels of a nominal variable in the context of the other variables.

I want to construct a linear combination of these per-variable distances that gives a distance between whole cases which is well associated with the class variable, in the sense that

"combined distance between two cases large => they most likely belong to different classes."

So from my training set I construct a set of

(d1(x1,y1), ..., dn(xn,yn), x_class != y_class) rows bound together as a data frame (actually I construct it by columns), and then the obvious thing to try was

glm(different.class ~ ., family = binomial(), data = distance.frame)
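To make the setup concrete, here is a toy, self-contained version of that construction (the per-variable distances are simulated uniforms standing in for the real learned distances; all names here are illustrative):

```r
## Toy stand-in for the real construction: each d_i column would really be
## d_i(x_i, y_i) looked up from the learned distance matrix for variable i.
set.seed(1)
n.pairs <- 200
d1 <- runif(n.pairs)                  # d1(x1, y1) for each training pair (x, y)
d2 <- runif(n.pairs)                  # d2(x2, y2)
## pairs far apart on d1 are made more likely to differ in class
different.class <- rbinom(n.pairs, 1, plogis(-1 + 3 * d1 - d2))
distance.frame  <- data.frame(d1, d2, different.class)

fit <- glm(different.class ~ ., family = binomial(), data = distance.frame)
coef(fit)    # the d2 coefficient can easily come out negative here
```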

The trouble is that this gives me both positive and negative coefficients, whereas the linear combination is guaranteed to be a metric only if the coefficients are all non-negative.

There are four fairly obvious ways to deal with that:

(1) just force the negative coefficients to 0 and hope.

This turns out to work rather well, but still...

(2) keep all the coefficients but take max(0, linear combination of distances).

This turns out to work rather well, but still...

(3) Drop the variables with negative coefficients from the model,
refit, and iterate until no negative coefficients remain.

This can hardly be said to work; sometimes nearly all the variables
are dropped.

(4) Use a version of glm() that will let me constrain the coefficients
to be non-negative.
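For concreteness, options (1)-(3) might be sketched like this on simulated data (the columns d1..d3 and the fitting details are purely illustrative, not my real training set):

```r
## Simulated stand-in for the distance frame described above.
set.seed(1)
n <- 200
distance.frame <- data.frame(d1 = runif(n), d2 = runif(n), d3 = runif(n))
distance.frame$different.class <-
    rbinom(n, 1, plogis(-1 + 3 * distance.frame$d1 - distance.frame$d3))
fit <- glm(different.class ~ ., family = binomial(), data = distance.frame)

## (1) force the negative weights to zero and hope
w1 <- coef(fit)[-1]          # per-variable weights, intercept dropped
w1[w1 < 0] <- 0

## (2) keep the signed weights, but truncate the combined score at zero;
##     d is the vector of per-variable distances between two cases
w2 <- coef(fit)[-1]
dist2 <- function(d) max(0, sum(w2 * d))

## (3) drop variables with negative weights, refit, iterate
vars <- setdiff(names(distance.frame), "different.class")
repeat {
    f    <- reformulate(vars, response = "different.class")
    fit3 <- glm(f, family = binomial(), data = distance.frame)
    bad  <- names(which(coef(fit3)[-1] < 0))
    if (length(bad) == 0) break
    vars <- setdiff(vars, bad)
    if (length(vars) == 0) break   # the failure mode noted above: everything dropped
}
```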

I *have* searched the R-help archives, and I see that the question about logistic regression with constrained coefficients has come up before, but it didn't really get a satisfactory answer. I've also searched the documentation of more contributed packages than I could possibly understand.

There is obviously some way to do this using R's general non-linear optimisation functions. However, I don't know how to formulate logistic regression that way.
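In case it helps anyone answer: my guess at such a formulation would be to minimise the negative binomial log-likelihood directly with optim(), using method "L-BFGS-B" so the distance weights can be bounded below by zero while the intercept stays free. A sketch on simulated data (untested against my real problem):

```r
## Non-negative logistic regression via optim(), a sketch.
## X: model matrix with an intercept column; y: 0/1 response.
set.seed(1)
n <- 200
X <- cbind(1, d1 = runif(n), d2 = runif(n), d3 = runif(n))
y <- rbinom(n, 1, plogis(-1 + 3 * X[, "d1"] - X[, "d3"]))

## negative log-likelihood of the binomial/logit model:
## -sum(y * eta - log(1 + exp(eta))), written with log1p for stability
negloglik <- function(beta) {
    eta <- drop(X %*% beta)
    sum(log1p(exp(eta))) - sum(y * eta)
}

fit <- optim(par    = rep(0, ncol(X)),
             fn     = negloglik,
             method = "L-BFGS-B",
             lower  = c(-Inf, 0, 0, 0))   # intercept free, weights >= 0
fit$par
```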

This whole thing is heuristic. I am not hell-bent on (ab?)using logistic regression this way. It was just an obvious thing to try. Suggestions for other means to the same end will be welcome.


This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:35 EST