Re: [R] Function for finding NA's

From: David Winsemius <dwinsemius_at_comcast.net>
Date: Sun, 03 Apr 2011 17:44:55 -0400

On Apr 3, 2011, at 3:46 PM, Tyler Rinker wrote:

> Thanks David,
>
> After seeing the simplicity of your function versus the convoluted
> mess I worked up I now understand why it's not necessary to have a
> package to find NA's (and from what you said is a part of other
> packages such as Hmisc already).

I'm actually not aware that any of the `describe` variants will return the indices of NA's. In the case of a real dataset, such an object could be fairly large. It was the other descriptive functions that I said were probably already coded.
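
To illustrate the size concern, here is a minimal sketch. The data frame `dfrm` and its columns are made up for the example and are not from Hmisc or any other package. Per-column NA counts stay at one number per column, while a list of NA indices grows with the number of missing cells.

```r
# Hypothetical data frame, invented for illustration only
dfrm <- data.frame(
  x = c(1, NA, 3, NA),
  y = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

# One count per column: stays compact however many rows there are
na_counts <- colSums(is.na(dfrm))

# Row indices of the NAs in each column: grows with the number of missing cells
na_index <- lapply(dfrm, function(col) which(is.na(col)))

na_counts
na_index
```

For a quick overview the counts are often enough; the index list is mainly useful when you intend to go back and inspect or repair specific rows.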

>
> I am at the 2 1/2 month mark as an R user and have loads to learn.
> Simpler is better. Thanks David for your time and I will take the
> information you gave and put it to use in new situations.

You should also familiarize yourself with complete.cases() and the various functions that handle na.action parameters (linked from that help page). Note that complete.cases returns a logical vector (not the cases themselves) and is designed for indexing matrices or dataframes.
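
A small sketch of how that indexing might look. The data frame here is typed in by hand to resemble the `cities` example from my earlier message, so treat the exact values as assumptions.

```r
# Hand-built data, loosely modeled on the `cities` example (values assumed)
cities <- data.frame(
  long = c(-58.38194, 14.25),
  lat  = c(-34.59972, 40.83333),
  city = c("Buenos Aires", NA),
  pop  = c(NA, NA),
  stringsAsFactors = FALSE
)

# complete.cases() returns one logical per row, not the rows themselves
ok <- complete.cases(cities[, c("long", "lat", "city")])

# Use the logical vector to index the data frame
complete_rows <- cities[ok, ]

# na.omit() is one of the na.action-style functions: it drops every row
# that has an NA in any column
dropped <- na.omit(cities)
```

Here `ok` is TRUE only for the first row, and `na.omit(cities)` leaves no rows at all because `pop` is missing everywhere, which is why it can be worth restricting complete.cases() to the columns you actually need.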

>
> Tyler
>
> > CC: r-help_at_r-project.org
> > From: dwinsemius@comcast.net
> > To: tyler_rinker_at_hotmail.com
> > Subject: Re: [R] Function for finding NA's
> > Date: Sun, 3 Apr 2011 14:19:40 -0400
> >
> >
> > On Apr 3, 2011, at 1:44 PM, Tyler Rinker wrote:
> >
> > >
> > > Quick question,
> > >
> > > > I tried to find a function in available packages to find NA's for an
> > > > entire data set (or single variables) and report the row of missing
> > > > values (NA's for each column). I searched the typical routes
> > > through the blogs and the help manuals for 15 minutes. Rather than
> > > spend any more time searching I created my own function to do this
> > > (probably in less time than it would have taken me to find the
> > > function).
> > >
> > > Now I still have the same question: Is this function (NAhunter I
> > > call it) already in existence? If so please direct me (because I'm
> > > sure they've written better code more efficiently). I highly doubt
> > > > I'm the first person to want to find all the missing values in a
> > > data set so I assume there is a function for it but I just didn't
> > > > spend enough time looking. If there is no existing function (big if
> > > > here), is this something people feel is worthwhile for me to put
> > > into a package of some sort?
> >
> > I'm not sure that it would have occurred to people to include it in a
> > package. Consider:
> >
> > getNa <- function(dfrm) lapply(dfrm, function(x) which(is.na(x) ) )
> >
> > > cities
> >        long       lat         city pop
> > 1 -58.38194 -34.59972 Buenos Aires  NA
> > 2  14.25000  40.83333         <NA>  NA
> > > getNa(cities)
> > $long
> > integer(0)
> >
> > $lat
> > integer(0)
> >
> > $city
> > [1] 2
> >
> > $pop
> > [1] 1 2
> >
> > There are several packages with functions by the name `describe` that
> > do most or all of the rest of what you have proposed. I happen to use
> > Harrell's Hmisc but the other versions should also be reviewed if you
> > want to avoid re-inventing the wheel.
> > --
> > David.
> >
> > >
> > > Tyler
> > >
> > > Here's the code:
> > >
> > > NAhunter<-function(dataset)
> > > {
> > >   # Summarise one column and report the positions of its NAs
> > >   find.NA<-function(variable)
> > >   {
> > >     if(is.numeric(variable)){
> > >       n<-length(variable)
> > >       mean<-mean(variable, na.rm=TRUE)
> > >       median<-median(variable, na.rm=TRUE)
> > >       sd<-sd(variable, na.rm=TRUE)
> > >       NAs<-is.na(variable)
> > >       total.NA<-sum(NAs)
> > >       percent.missing<-total.NA/n
> > >       descriptives<-data.frame(n,mean,median,sd,total.NA,percent.missing)
> > >       rownames(descriptives)<-c(" ")
> > >       Case.Number<-1:n
> > >       Missing.Values<-ifelse(NAs>0,"Missing Value"," ")
> > >       missing.value<-data.frame(Case.Number,Missing.Values)
> > >       missing.values<-missing.value[ which(Missing.Values=='Missing Value'),]
> > >       list("NUMERIC DATA","DESCRIPTIVES"=t(descriptives),"CASE # OF MISSING VALUES"=missing.values[,1])
> > >     }
> > >     else{
> > >       n<-length(variable)
> > >       NAs<-is.na(variable)
> > >       total.NA<-sum(NAs)
> > >       percent.missing<-total.NA/n
> > >       descriptives<-data.frame(n,total.NA,percent.missing)
> > >       rownames(descriptives)<-c(" ")
> > >       Case.Number<-1:n
> > >       Missing.Values<-ifelse(NAs>0,"Missing Value"," ")
> > >       missing.value<-data.frame(Case.Number,Missing.Values)
> > >       missing.values<-missing.value[ which(Missing.Values=='Missing Value'),]
> > >       list("CATEGORICAL DATA","DESCRIPTIVES"=t(descriptives),"CASE # OF MISSING VALUES"=missing.values[,1])
> > >     }
> > >   }
> > >   dataset<-data.frame(dataset)
> > >   # Note: these two calls change global printing options
> > >   options(scipen=100)
> > >   options(digits=2)
> > >   lapply(dataset,find.NA)
> > > }
> > >
> > > ______________________________________________
> > > R-help_at_r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> > David Winsemius, MD
> > West Hartford, CT
> >

David Winsemius, MD
West Hartford, CT



Received on Sun 03 Apr 2011 - 22:06:33 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 03 Apr 2011 - 22:20:28 GMT.

