Re: [R] Handle missing values

From: Ted Harding <Ted.Harding_at_manchester.ac.uk>
Date: Mon, 23 Jun 2008 10:59:00 +0100 (BST)


On 23-Jun-08 09:35:10, Francisco Pastor wrote:
> Hi everyone
> I am new to R and have a question about missing values. I am
> trying to do a cluster analysis of monthly temperatures and
> my data are 14 columns with spatial coordinates (lat,lon)
> and 12 monthly values:
>
> lat - lon - temp1 - temp2 - temp3 - .... - temp12
>
> If I omit missing values (my missing values are 99.00) with
>
> mydata <- na.omit(mydata)
>
> every row with a missing value (i.e. eleven good temperature values
> and one month missing) is deleted. I would like to retain all valid
> values for the k-means analysis, excluding only the missing ones.
>
> I've been trying and searching about na.omit, na.action, na.exclude
> but can't find the right point.
>
> Any help would be appreciated.

As ?na.omit states, "incomplete cases" (any row in which one or more values are missing) are removed by na.omit(), so you are getting what you ask for.
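One caveat: na.omit() only looks for NA, so values coded as 99.00 have to be recoded to NA first, otherwise no row is dropped at all. A minimal sketch, using a small made-up data frame in place of your mydata (the values and the three-month layout are invented for illustration):

```r
# Toy stand-in for mydata: lat, lon and three monthly temperatures (invented values)
mydata <- data.frame(
  lat   = c(39.5, 40.1, 38.7),
  lon   = c(-0.4, -0.9, -1.2),
  temp1 = c(10.2, 99.00, 11.5),
  temp2 = c(12.8, 11.9, 13.1),
  temp3 = c(15.0, 14.2, 99.00)
)

# Recode the 99.00 sentinel to NA, in the temperature columns only
temp_cols <- grep("^temp", names(mydata))
temps <- mydata[temp_cols]
temps[temps == 99] <- NA
mydata[temp_cols] <- temps

# na.omit() now drops every row that has at least one NA
complete_only <- na.omit(mydata)
nrow(complete_only)
```

Restricting the recoding to the temperature columns is a precaution, so that a coordinate which happened to equal 99 would not be turned into NA by accident.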

Also, many functions "silently" do the same thing. For example, fitting a linear model with lm() will, by default, also remove incomplete cases before fitting.
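To see this for lm(), one can compare the number of rows supplied with the number actually used. A small sketch with invented data, assuming the default setting of options("na.action"), which is na.omit:

```r
# Invented data: one response and one predictor, each with one NA
d <- data.frame(
  y = c(1.0, 2.1, NA, 4.2, 5.1, 5.9),
  x = c(1, 2, 3, NA, 5, 6)
)

fit <- lm(y ~ x, data = d)  # incomplete rows are dropped before fitting
nrow(d)                     # rows supplied
nobs(fit)                   # rows actually used in the fit
fit$na.action               # which rows were dropped
```

summary(fit) also reports the number of observations deleted due to missingness, which is the only visible hint in the standard output.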

What happens when you apply a clustering function depends on how that function is written to deal with incomplete cases. I'm no expert on the various clustering functions in R, so I hope others can give specific advice.
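As an illustration of how much this varies between functions (a sketch on invented numbers, not a recommendation): kmeans() in the stats package stops with an error when given NA, whereas dist() leaves out the missing coordinates pair by pair and rescales the sum, so a distance-based method such as hclust() can still be run. Note that this swaps hierarchical clustering in for k-means, which is a different technique.

```r
# Invented matrix: 4 stations x 3 months, with one month missing
temps <- rbind(
  c(10, 12, 15),
  c(11, 13, NA),
  c(20, 22, 25),
  c(21, 23, 26)
)

# kmeans() refuses input containing NA; capture the error instead of stopping
km <- tryCatch(kmeans(temps, centers = 2), error = function(e) e)
inherits(km, "error")

# dist() uses only the coordinates present in both rows and scales the
# sum up in proportion to the number of columns actually used
d <- dist(temps)

# Hierarchical clustering on those distances, cut into two groups
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
groups
```

Whether the rescaling done by dist() is appropriate for monthly temperatures (where a missing July is not interchangeable with a missing January) is a separate question that the code does not answer.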

Often, however, to do what you want will require code to be written specially. For example, if you have 14 columns as in your example with columns 3-14 temperatures, and you wanted to compute means, variances and covariances of the temperatures, then for the means you could simply take the temperatures one by one, and compute the mean over the non-missing values. Similarly for the variances. For the covariances you could take the "pairwise complete" cases: for each pair of temperatures (say col 3 and col 7) you would use the cases

  mydata[(!is.na(mydata[,3]))&(!is.na(mydata[,7])),c(3,7)]
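The same per-column and pairwise computations are also available through arguments of the built-in functions, without subsetting by hand. A sketch on an invented three-column stand-in for the temperature columns:

```r
# Invented stand-in for the temperature columns (three "months" only)
temps <- data.frame(
  temp1 = c(10, 11, NA, 13),
  temp2 = c(12, NA, 14, 15),
  temp3 = c(15, 16, 17, NA)
)

# Mean of each column over its non-missing values
col_means <- colMeans(temps, na.rm = TRUE)

# Variance of each column over its non-missing values
col_vars <- sapply(temps, var, na.rm = TRUE)

# Covariances: each entry uses the rows complete for that pair of columns
pair_cov <- cov(temps, use = "pairwise.complete.obs")
pair_cov
```

With use = "pairwise.complete.obs" the diagonal of the result agrees with the per-column variances, since each diagonal entry is computed from all non-missing values of that column.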

And so on. However, you could end up with inconsistencies between variances and covariances with such code (e.g. the variance-covariance matrix might not be positive definite); this would not happen if you confined yourself to complete cases.
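A deliberately extreme, invented example of such an inconsistency: each pair of columns is observed together on only three rows, and the three pairwise relationships contradict one another, so the assembled matrix has a negative eigenvalue.

```r
# Invented data: every pair of columns overlaps on exactly three rows
x <- c(1, 2, 3, NA, NA, NA, 1, 2, 3)
y <- c(1, 2, 3, 1, 2, 3, NA, NA, NA)
z <- c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
m <- cbind(x, y, z)

# x and y rise together, y and z rise together, but x and z move oppositely
S <- cov(m, use = "pairwise.complete.obs")
S

# A genuine variance-covariance matrix has no negative eigenvalues
ev <- eigen(S, symmetric = TRUE)$values
min(ev)   # negative here

# In this constructed case there are no complete rows at all
nrow(na.omit(m))
```

Real data will rarely be this contradictory, but the check on the eigenvalues is cheap and worth doing before passing a pairwise matrix to anything that assumes a valid covariance matrix.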

So it all depends on what you want to do with the data, and on how the R functions which address your objectives behave when faced with incomplete data.

Hoping this helps,
Ted.



E-Mail: (Ted Harding) <Ted.Harding_at_manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 23-Jun-08                                       Time: 10:58:57
------------------------------ XFMail ------------------------------

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 23 Jun 2008 - 10:02:36 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 23 Jun 2008 - 10:30:46 GMT.
