Re: [R] na.omit - Is it working properly?

From: peter dalgaard <pdalgd_at_gmail.com>
Date: Wed, 04 May 2011 08:02:51 +0200

On May 3, 2011, at 21:18 , Kalicin, Sarah wrote:

>
> I have a workaround for this, but can someone explain why the first example does not work properly? I believe it worked in the previous version of R, selecting just the rows with F.WW=200525 and omitting the NAs. I just upgraded to 2.13. I am also concerned that the row numbers differ between the selections; should I be worried? FYI, I show only the first few rows for demonstration, so please do not worry that the number of rows shown is not equal. - Sarah
>
> With na.omit around the column, the result shows values in the F.WW column other than 200525, along with NA. I was hoping that this would omit all the NAs and show all the rows where P$F.WW=200525. I believe it did with the previous version of R.

That's highly unlikely. na.omit(P$F.WW) has fewer elements than there are rows in P, so you get vector recycling in the style of

> thuesen[c(F,F,F,F,T),]

   blood.glucose short.velocity

5            7.2           1.27
10          12.2           1.22
15           6.7           1.52
20          16.1           1.05

(now why don't we get the usual warning about "not a multiple of" in this case?)
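
A minimal sketch of that recycling, using a made-up seven-row data frame d (an assumption for illustration, not the thuesen or P data): a logical index of length 3 is silently reused along the rows.

```r
# Hypothetical toy data, only for illustration
d <- data.frame(id = 1:7, x = c(10, 20, 30, 40, 50, 60, 70))

# Logical index shorter than nrow(d): R recycles it to
# FALSE FALSE TRUE FALSE FALSE TRUE FALSE
idx <- c(FALSE, FALSE, TRUE)
picked <- d[idx, ]

# Rows 3 and 6 are selected
print(picked)
```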

Worse, if you omit observations prior to comparison, the result won't line up. E.g. in the thuesen data, obs. 16 has a missing short.velocity, so every later position shifts by one:

> thuesen[na.omit(thuesen$short.velocity)==1.12,]

   blood.glucose short.velocity

16           8.6             NA
22           4.9           1.03

whereas in fact

> subset(thuesen, short.velocity==1.12)

   blood.glucose short.velocity

17           4.2           1.12
23           8.8           1.12
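
The same misalignment can be reproduced without the ISwR data. This is only a sketch on a hypothetical six-row data frame d whose second value is missing; the names d, id and v are invented for the example.

```r
# Hypothetical toy data with one NA in position 2
d <- data.frame(id = 1:6, v = c(1.5, NA, 2.0, 1.5, 3.0, 2.0))

# na.omit() drops the NA, so the comparison has 5 elements, not 6,
# and every position after the NA is shifted by one
shifted <- na.omit(d$v) == 2.0   # FALSE TRUE FALSE FALSE TRUE

# The short index is recycled and picks the wrong rows
wrong <- d[shifted, ]

# subset() compares against the full column and stays aligned
right <- subset(d, v == 2.0)

print(wrong)
print(right)
```

Here wrong contains rows 2 and 5 (v is NA and 3.0), while right contains rows 3 and 6, the ones that actually hold 2.0.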




> P[na.omit(P$F.WW)==200525, c(51, 52)]
> F.WW R.WW
> 45 200525 NA
> 53 NA NA
> 61 200534 200534
> 63 200608 200608
> 66 200522 200541
> 80 NA NA
> 150 200521 200516
> 231 200530 200530
>
> No na.omit, the F.WW=200525 seems to work, but lots of NA included. This is what is expected!! The row numbers are not the same as the above example, except the first row.
>> P[P$F.WW==200525, c(51, 52)]

> F.WW R.WW
> 45 200525 NA
> NA NA NA
> NA.1 NA NA
> NA.2 NA NA
> NA.3 NA NA
> 57 200525 200526
> 65 200525 NA
> 67 200525 NA
> 70 200525 200525
> NA.4 NA NA
> NA.5 NA NA
> 86 200525 NA

Presumably a number of rows got omitted here? The NAs are a bit of a pain, but that's the way things work: comparing NA with 200525 gives NA, so R cannot tell whether to include that observation, and you get an NA-filled row.

> thuesen[thuesen$short.velocity==1.12,]

   blood.glucose short.velocity

NA            NA             NA
17           4.2           1.12
23           8.8           1.12

To avoid this, test explicitly for NA using is.na(), or use subset(), which does it internally.
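
A short sketch of those options on a hypothetical data frame d (invented for the example). The which() variant is an addition not mentioned above; it works because which() drops NA from a logical vector.

```r
# Hypothetical toy data with one NA in position 2
d <- data.frame(id = 1:6, v = c(1.5, NA, 2.0, 1.5, 3.0, 2.0))

# Plain comparison: the NA in the index produces an all-NA row
raw <- d[d$v == 2.0, ]

# Explicit NA test keeps only rows known to match
safe <- d[!is.na(d$v) & d$v == 2.0, ]

# which() returns the positions of TRUE only, ignoring NA
by_pos <- d[which(d$v == 2.0), ]

# subset() treats NA in the condition as FALSE
by_subset <- subset(d, v == 2.0)

print(raw)
print(safe)
```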

>
> na.omit excludes the NAs. This is what I want. The concern I have is why the row numbers do not match any of those shown in the examples above.

>> na.omit(P[P$F.WW==200525, c(51, 52)])

> F.WW R.WW
> 57 200525 200526
> 70 200525 200525
> 161 200525 200525
> 245 200525 200525
> 246 200525 200525
> 247 200525 200526
> 256 200525 200525
> 266 200525 200525
> 269 200525 200525
> 271 200525 200526
> 276 200525 200526
> 278 200525 200526
>

Well, now you remove rows with NA _anywhere_, so e.g. row #65 is out because R.WW is missing. I expect #161 and higher were just chopped from the earlier list.
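
To illustrate the row-wise behaviour, here is a sketch on a hypothetical two-column data frame (the names d, a and b are made up): a row that matches on a is still dropped by na.omit() when b is missing, and the surviving rows keep their original row names.

```r
# Hypothetical toy data: row 2 matches a == 5 but has b missing
d <- data.frame(a = c(5, 5, NA, 5, 7), b = c(1, NA, 3, 4, 5))

# Select on one column only
sel <- subset(d, a == 5)

# na.omit() on a data frame drops any row with an NA in any column
dropped <- na.omit(sel)

print(rownames(sel))
print(rownames(dropped))
```

Row 2 appears in sel but not in dropped, which is the same reason row #65 disappears above.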

In short, nothing out of the ordinary seems to be going on here.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes_at_cbs.dk  Priv: PDalgd_at_gmail.com

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 05 May 2011 - 06:25:03 GMT
