[R] logistic regression

From: Stephen Choularton <mail_at_bymouth.com>
Date: Fri 27 May 2005 - 14:22:33 EST


Hi  

I am working on corpora of automatically recognized utterances, looking for features that predict error in the hypothesis the recognizer is proposing.  

I am using the glm functions to do logistic regression. I do this type of thing:  

and end up with a model:  

> summary(logistic.model)
 

Call:
glm(formula = similarity ~ ., family = binomial, data = data)  

Deviance Residuals:

    Min       1Q   Median       3Q      Max
-3.1599   0.2334   0.3307   0.4486   1.2471

Coefficients:

                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           11.1923783  4.6536898   2.405  0.01617 *  
length                -0.3529775  0.2416538  -1.461  0.14410    
meanPitch             -0.0203590  0.0064752  -3.144  0.00167 ** 
minimumPitch           0.0257213  0.0053092   4.845 1.27e-06 ***
maximumPitch          -0.0003454  0.0030008  -0.115  0.90838    
meanF1                 0.0137880  0.0047035   2.931  0.00337 ** 
meanF2                 0.0040238  0.0041684   0.965  0.33439    
meanF3                -0.0075497  0.0026751  -2.822  0.00477 ** 
meanF4                -0.0005362  0.0007443  -0.720  0.47123    
meanF5                -0.0001560  0.0003936  -0.396  0.69187    
ratioF2ToF1            0.2668678  2.8926149   0.092  0.92649    
ratioF3ToF1            1.7339087  1.7655757   0.982  0.32607    
jitter                -5.2571384 10.8043359  -0.487  0.62656    
shimmer               -2.3040826  3.0581950  -0.753  0.45120    
percentUnvoicedFrames  0.1959342  1.3041689   0.150  0.88058    
numberOfVoiceBreaks   -0.1022074  0.0823266  -1.241  0.21443    
percentOfVoiceBreaks  -0.0590097  1.2580202  -0.047  0.96259    
meanIntensity         -0.0765124  0.0612008  -1.250  0.21123    
minimumIntensity       0.1037980  0.0331899   3.127  0.00176 ** 
maximumIntensity      -0.0389995  0.0430368  -0.906  0.36484    
ratioIntensity        -2.0329346  1.2420286  -1.637  0.10168    
noSyllsIntensity       0.1157678  0.0947699   1.222  0.22187    
startSpeech            0.0155578  0.1343117   0.116  0.90778    
speakingRate          -0.2583315  0.1648337  -1.567  0.11706    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
(Dispersion parameter for binomial family taken to be 1)
 
    Null deviance: 2462.3  on 4310  degrees of freedom
Residual deviance: 2209.5  on 4287  degrees of freedom
AIC: 2257.5
 
Number of Fisher Scoring iterations: 6
 
 
I have seen models where almost all the features are significant at the
one-in-a-thousand level, but I accept that I could improve my model by
normalizing some of the features (some are left skewed and I understand
that I will get a better fit by taking their logs, for example).
 
What really worries me is that the logistic function produces
predictions that appear to fall well outside 0 to 1.
 
If I make a dataset of the medians of the above features and use my
logistic.model on it, it produces a figure of:
 

> x = predict(logistic.model, medians)
> x
[1] 2.82959
>
which is well outside the range of 0 to 1.

The actual distribution of all the predictions is:

> summary(pred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -1.516   2.121   2.720   2.731   3.341   6.387
>
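
For context, predict() on a glm object returns values on the link scale by default (type = "link"), which for a binomial model is the log-odds and is not bounded by 0 and 1; type = "response" returns fitted probabilities instead. The sketch below is only an illustration on simulated data: the names sim, x1, y and fit are made up and are not taken from the corpus described above.

```r
# Simulated data; 'sim', 'x1' and 'y' are made-up names for illustration only.
set.seed(1)
sim <- data.frame(x1 = rnorm(200))
sim$y <- rbinom(200, size = 1, prob = plogis(2 + 0.8 * sim$x1))

fit <- glm(y ~ x1, family = binomial, data = sim)

# Default is type = "link": the linear predictor (log-odds), unbounded.
eta <- predict(fit, newdata = sim)

# type = "response" applies the inverse logit and returns probabilities.
p <- predict(fit, newdata = sim, type = "response")

range(eta)                  # may fall outside [0, 1]
range(p)                    # always within [0, 1]
all.equal(p, plogis(eta))   # the two scales are related through plogis()

# The median-case value quoted above, moved to the probability scale.
p_median <- plogis(2.82959)
p_median
```

Read this way, the 2.82959 quoted above would be a log-odds value, and plogis() maps it to a probability of roughly 0.94.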
I can get the model to give some sort of prediction by doing this:
> pred = predict(logistic.model, data)
> pred[pred <= 1.5] = 0
> pred[pred > 1.5] = 1
> t = table(pred, data[,24])
> t

pred    0    1
   0  102  253
   1  255 3701
>
> classAgreement(t)
$diag
[1] 0.8821619

$kappa
[1] 0.2222949

$rand
[1] 0.7920472

$crand
[1] 0.1913888

>
but as you can see, I am using a break point well outside the range 0 to 1, and the kappa is rather low (I think). I am a bit of a novice at this, and the results worry me. Can anyone comment on whether the results look strange, or point out anything I am doing wrong?

Stephen
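
The agreement figures can be checked by hand from the 2 x 2 table in the post. The sketch below recomputes the observed agreement and Cohen's kappa in base R from the four posted counts, and also shows what a cut at 1.5 would correspond to if the predictions are on the log-odds scale. The object names (tab, po, pe, kappa, cut_prob) are chosen here for illustration; classAgreement() itself comes from the e1071 package and is not needed for this check.

```r
# Confusion table as posted: rows are predictions, columns are data[,24].
# matrix() fills column by column, so the first column is (102, 255).
tab <- matrix(c(102, 255, 253, 3701), nrow = 2,
              dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

n  <- sum(tab)
po <- sum(diag(tab)) / n                       # observed agreement ($diag)
pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
kappa <- (po - pe) / (1 - pe)                  # Cohen's kappa ($kappa)

# A cut at 1.5 on the log-odds scale, expressed as a probability.
cut_prob <- plogis(1.5)

c(diag = po, kappa = kappa, cut_prob = cut_prob)
```

The high raw agreement next to a low kappa reflects the class imbalance in the table: 3954 of the 4311 cases are in class 1, so chance agreement alone is already about 0.85.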
Received on Fri May 27 14:28:45 2005
