# [R] How to deal with missing data?

From: Chaouch, Aziz <achaouch_at_nrcan.gc.ca>
Date: Fri 19 May 2006 - 22:57:07 EST

Hi All,

This question is not directly related to R itself; it is about how to deal with missing data. I want to build wind roses, i.e. circular histograms of wind directions and associated speeds, to look for trends or changes in wind patterns over several decades at some meteorological stations. The database I have contains hourly records of wind direction and speed over the past 50 years, so it is obviously huge! Of course a lot of data are missing, and this causes problems. Two major problems arise from the temporal distribution of the wind records:

1. Data are missing because of station shutdowns (consecutive missing data over days, weeks, months and even years for some stations!!!)
2. In the past, wind observations were recorded only during the daytime, whereas more recent records cover both day and night
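
To see how much each of these two patterns matters for a given station, it may help to tabulate the gaps first. The following is only a rough sketch in Python: the function names are hypothetical, and it assumes the hourly series is held as a list in which a missing record is `None`, with a parallel list giving the hour of day (0-23) of each record.

```python
from itertools import groupby


def missing_runs(values):
    """Lengths of each run of consecutive missing (None) records, in order."""
    return [
        sum(1 for _ in group)
        for is_missing, group in groupby(values, key=lambda v: v is None)
        if is_missing
    ]


def coverage_by_hour(hours, values):
    """Fraction of non-missing records for each hour of day (0-23).

    Hours with no records at all are reported as NaN.
    """
    total = [0] * 24
    present = [0] * 24
    for hour, value in zip(hours, values):
        total[hour] += 1
        if value is not None:
            present[hour] += 1
    return [p / t if t else float("nan") for p, t in zip(present, total)]


# Made-up example: six hourly records, three of them missing.
speeds = [3.2, None, None, 4.1, None, 2.7]
hours = [10, 11, 12, 13, 14, 15]
print(missing_runs(speeds))          # long runs suggest station shutdowns
print(coverage_by_hour(hours, speeds)[10:16])
```

Long runs would point to shutdowns, while systematically low coverage at night-time hours in the early decades would point to the daytime-only recording.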

On top of these situations, data can also be missing "at random". The analysis is complicated by the fact that wind direction is a circular variable, so specific tools must be used to handle it. I know there are different ways to deal with missing data, such as multiple imputation, but most assume that the variables are Gaussian. Moreover, when a record is missing in the database, it is missing for all variables, so it is apparently not possible to use other variables to estimate the missing wind records.
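
To illustrate why a circular variable needs specific tools: the arithmetic mean of the directions 350° and 10° is 180° (due South), although both winds blow from almost due North. A minimal sketch of the usual circular mean and mean resultant length, in Python with a hypothetical function name, assuming directions in degrees with missing records already dropped:

```python
import math


def circular_mean_deg(directions_deg):
    """Circular mean direction (degrees in [0, 360)) and mean resultant length.

    Each direction is treated as a unit vector; the mean direction is the
    angle of the vector sum, and the mean resultant length (between 0 and 1)
    measures how concentrated the directions are.
    """
    n = len(directions_deg)
    if n == 0:
        raise ValueError("need at least one direction")
    s = sum(math.sin(math.radians(d)) for d in directions_deg)
    c = sum(math.cos(math.radians(d)) for d in directions_deg)
    mean = math.degrees(math.atan2(s, c)) % 360.0
    resultant_length = math.hypot(s, c) / n
    return mean, resultant_length


mean, r = circular_mean_deg([350.0, 10.0])
print(mean, r)  # mean is at North (0/360 degrees), not 180
```

A resultant length close to 0 means the directions are spread around the circle, in which case the mean direction itself is not very meaningful.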

For now I'm considering the following:
• Copulas: look at copula functions to build a bivariate distribution of wind direction and speed, and simulate values from it to fill in the missing data. Produce several estimates of each missing value to assess the variability of the final results. The bivariate distribution should be modelled separately for every 5- or 10-year interval to accommodate a possible trend in the data.

• Time series approach: wind direction and wind speed appear to be autocorrelated over time. However, this seems to be due to non-stationarity, since computing the autocorrelation on the first differences of the series destroys everything (the correlation of wind direction is computed using the circular-circular correlation coefficient as defined by Mardia 1976).
• Correlation with other meteo stations: this is a problem because wind patterns are affected by topography, for instance, and even nearby stations may have different wind patterns. The correlation between stations is also questionable, since a N wind will first affect northern stations while a S wind will first affect southern stations, so I guess the lagged correlation between stations may appear lower than it should be.
• Neural networks: a data-driven approach, but since missing records are missing for all variables, I do not have many inputs to feed into the network.
• Data weighting: this sounds stupid, but I tried to give each record a weight according to the time difference between records. Records next to a missing value receive more weight than others, and the weight grows with the number of missing values between two records. I thought of this because I remember using Voronoi polygons in spatial statistics to weight data according to the density of the monitoring network. However, I'm not confident in this approach because I don't like the idea of giving a higher weight to a record simply because it is surrounded by missing values....
• Do nothing! Sometimes it's better to work with the raw data than to apply questionable techniques. Computing wind roses from raw data surely produces artefacts, but....
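
To make the data-weighting idea above concrete, here is a rough sketch in Python of a one-dimensional analogue of Voronoi weighting. Each available record is weighted by the length of the time interval that lies closer to it than to any other record. The function names, the mirrored treatment of the first and last records, and the 8-sector rose centred on North are all assumptions made for illustration, not a recommended method.

```python
def voronoi_time_weights(times):
    """Normalised 1-D Voronoi weights for sorted, distinct observation times.

    A record's cell runs from the midpoint with the previous record to the
    midpoint with the next one. At both ends the single known half-gap is
    mirrored (an arbitrary choice).
    """
    n = len(times)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    widths = []
    for i, t in enumerate(times):
        left = (t - times[i - 1]) / 2 if i > 0 else (times[1] - times[0]) / 2
        right = (times[i + 1] - t) / 2 if i < n - 1 else (times[-1] - times[-2]) / 2
        widths.append(left + right)
    total = sum(widths)
    return [w / total for w in widths]


def weighted_rose(directions_deg, weights, n_sectors=8):
    """Weighted frequency per direction sector, with sector 0 centred on North."""
    width = 360.0 / n_sectors
    rose = [0.0] * n_sectors
    for d, w in zip(directions_deg, weights):
        # Shift by half a sector so that e.g. 350 and 10 degrees both fall in sector 0.
        k = int(((d % 360.0) + width / 2) // width) % n_sectors
        rose[k] += w
    return rose


# Made-up example: records at hours 0, 1, 2, 6, 7 (hours 3-5 are missing).
times = [0, 1, 2, 6, 7]
directions = [350.0, 10.0, 90.0, 100.0, 180.0]
weights = voronoi_time_weights(times)
print(weights)  # the records at hours 2 and 6 carry the most weight
print(weighted_rose(directions, weights))
```

The example shows the behaviour I am uneasy about: the two records bordering the gap each get 2.5 times the weight of the others, purely because their neighbours are missing.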

Well, now you know more or less that I do not know much about missing data and desperately need your help :) If you have any hints on techniques I could use, or any general advice, please let me know.

Thanks a lot,

Aziz


R-help@stat.math.ethz.ch mailing list