Re: [R] Calculate NAs from known data: how to?

From: Ted Harding <>
Date: Tue 17 Oct 2006 - 16:07:52 GMT

On 17-Oct-06 Torleif Markussen Lunde wrote:
> Hi
> In a dataset I have length and age for cod. The age, however,
> is only given for 40-100% of the fish. What I need to do is
> to fill in the NAs in a correct way, so that age has a value
> for each length. This is to be done for each sample seperately
> (there are 324 samples), meaning the NAs for sample no 1 shall
> be calculated from the known values from sample no 1.
> As for example length 55 cm can be both 4 and 5 years, I guess
> a fish with NA age and length 55 cm should be given a "random"
> age given a probability for example "55 cm = 4 years has a p=75%,
> while 55 cm = 4 years has a p=25%". Those "p-values" should be
> calculated from the real data.
> How can this be done in R, and what is the right way to do it?
> Sample number 1 is given below.

A question with many ramifications!

First of all, there are several possible approaches to imputing missing values. You are wise in recognising that there is uncertainty in this, in general and also for your data set. For this, I would normally recommend a Multiple Imputation approach, since this would proceed by sampling from a posterior distribution for Age given Length, as estimated by Maximum Likelihood from your data. The differing results of successive imputations then exhibit a variability corresponding to the uincertainty about what value to impute. Furthermore, when subsequent analyses (such as estimating the paameters of a growth curve from Age and Length, or estimating population dynamics from Age distributions in successive years) are carried out, these can be done for each of the imputations in the multiple set, and the results (and etimated standard errors) can be combined to give an overall estimate and the unceertainty in this--which not only includes the variability in the complete data, but also the uncertainty due to imputation.

For this approach, I would be inclined to start with Shafer's "norm" or "mix" packages, available in R. But see below.

However, I have had a look at the data for "Sample 1" which you included. This throws up several features which should be taken into account, and which indicate that "blind" use of an imputation package may not be the best approach.

First, I made a CSV file from your table (3 columns: length, age, sample). Then:


Now: index which lines have "NA" for age, then a histogram of Length when Age is present:

  ix0 <-

Superimpose a histogram of Length when Age is NA:

  hist(L[ix0],add=TRUE,breaks=5*(0:20),col="red")   hist(L[!ix0],add=TRUE,breaks=5*(0:20))

(the repetition of L[!ix0] is done because the red has overlaid it for one length range, and the repetition restores it).

This immediately shows that, in general, missing Age is unusual, except for Length in the range (25:50) in which it is the majority. An alternative picture of the same scene appears if you first make the histogram of all lengths:

  X11() ## to get a second graphics window   hist(L,breaks=5*(0:20))

Now you can ask why there should be so many NAs for Age in the Length range (25:30) and, indeed (2nd histogram) why there are so many specimens anyway in that Length range compared with their neghbours. Comparing the two histograms idicates that the excess in that Length range arises from the fish with NA Ages: in the first histogram the number of non-NA Ages in )25:30) is very comparable with the numbers of all fish in other Length ranges.

So my immediate suspicion is that there is something special about the Length range (25:30) in relation to whether Age is recorded or not.

The next thing that emerges from the first histogram is the sharp decline in numbers above Length=55.

I therefore wonder why this also happens.

Is 55cm a magic (e.g. legal) length threshold for cod? And is 30cm also special in some way? If so, could there be some pressure on whoever records the data not to take measurements of Age when the fish is near but under 30cm, or to record a value of Length just below the threshold (i.e. incorrectly)? Does this happen in your other data sets?

As well as these "why" questions, however, there is a a technical issue arising from the fact that many missing Ages occur in a very limited range of Length:

  [1] 27 ### total number of Age = NA

  [1] 12 ### Age = NA in (25:50)
  > sum((L[!ix0]>25)&(L[!ix0]<=30))
  [1] 11 ### Age is known in (25:30)

This is: whether "Age=NA" is "uninformative" about Age given recorded Length. More precisely, whether

  Prob[Age=NA | true Age, recorded Length]

since many imputation techniques depend on this. If it is not true, then being missing is "informative" about Age (i.e. the distribution of Age at Length is different for Age=NA cases than for Age != NA cases), though you would not be able to ascertain this without a suitable model for Prob[missing | Age, Length].

The above remarks concern possible traps and pitfalls in approaching the problem, which need to be considered.

Next, it is useful to look at the relationship between Length and Age where bothe are known. First plot L vs A:


(I've added (0,0) to the list of (Age,Length) to indicate the difference between 1st-year growth and subsequent years).

This shows a smooth growth curve, with quite regular scatter with the possible exception of the 3 highest values of L (which I'm inclined to treat as "outliers").

Now look at it logarithmically:


The three outliers still stand out, but for the rest it seems that a straight line with constant variance is a good description.

Therefore, onitting these three, and working with non-NA cases:



  LA.lm <- lm(log(L1) ~ log(A1))

              Estimate Std. Error t value Pr(>|t|)   (Intercept) 3.05062 0.03139 97.20 <2e-16 ***   log(A1) 0.55405 0.02454 22.58 <2e-16 ***   Residual standard error: 0.1255 on 91 degrees of freedom

indicating a growth curve

  L(A) = exp(3.05062)*(A ^ 0.55405)

Now re-do

  plot(c(0,A),c(0,L), ylim=c(0,100))

and now

  AA <-0.1*(0:70)
  lines(AA,exp(3.05062)*(AA ^ 0.55405),col="green")

which looks pretty good (at least for Ages 1:5); maybe it is visubly starting to level off at Age 6 (ignoring outliers) as growth curves do. Nevertheless, for the sake of illustration, I'll stick with the above.

Note that the intercept (3.05..) and coefficent (0.55...) are estimated with relatively small SE, so I'll adopt a normally distributed log(L), conditional on log(A), with mean given by the above fit and SD = 0.1255.

Now, in order to infer what A might be, given L, we need to get a distribution for A given L; and for this we need a distribution for A.


  hist(A,breaks=0.5 + (0:7))

Here we see that the two extreme Ages (1 and 6, since Age=7 is 2 outliers) are a bit less frequent than the other; but it looks as though it might be feasible (at least as an approximation) to see if we can work with a uniform distribution of Age:

   1 2 3 4 5 6
   9 19 16 16 21 12

    Chi-squared test for given probabilities     data: table(A1)
    X-squared = 6.2903, df = 5, p-value = 0.279

so the data are quite consistent with a uniform distribution, i.e. p1 = P[Age=1] = p2 = P[Age=2] = ... = 1/6.


  P[Age = A | Length = L]

for A = 1,2,3,4,5,6 and Const is such that P[Age=A|Length=L] sums to 1 over these Ages, and

  mean.log.LA = 3.05062)*(A ^ 0.55405)

  sd.log.LA = 0.1255

and here we are proceeding as though pA is constant (1/6).

Now we are close. For a given Length L, compute the probabilities that Age = 1,2,3,4,5,6 and then sample from this diatribution to make a "random" imputation.

Here (simplified because of the uniform prior on pA) the result is simply proportional to



  As <- (1:6)
  mean.log.LA <- 3.05062 + 0.55405*log(As)   sd.log.LA <- 0.1255

  pA <- function(L){
  temp<-exp(-(log(L)-mean.log.LA)^2/(2*sd.log.LA^2))   temp/sum(temp)

Now (e.g.) for L=30, we get

  0.0182 0.8695 0.1087 0.0036 0.0001 0.0000

so that A=2 is likely, A=3 is possible, others unlikely.


  0.0000 0.0700 0.5299 0.3191 0.0709 0.0101

  0.0000 0.0003 0.0540 0.3108 0.3980 0.2370

  0.0000 0.0000 0.0016 0.0600 0.3216 0.6167

  0.0000 0.0000 0.0001 0.0090 0.1610 0.8299

Thus, in particular, for L=50, at least 3 Ages (5,6,7) are likely to occur. This shows the desirability of making multiple imputations to represent this undertainty.

So, for example, you could sample 10 imputed values for a Length=50 fish as

  sample((1:6),10,replace=TRUE,prob=pA(50))   5 5 5 6 6 4 6 4 4 3 ### --> 3, 4^3, 5^3, 6^3

Such considerations could be developed into an approach for your probalem. The problem with many Multiple Imputation packages (and Shafer's are among them) is that they generally do not have provision to incorporate a user-defined model for the relation between variables more sophisticated than a multivariate normal, or a model where continuous variable means depend in an unstructured way on categorical covariates.

Given what was observed in your Sample1, I moght try with "norm", using log(A) and log(L) as if they werre bivariate normals (and hence a linear regression of log(L) on log(A)). This might work (and looked reasonable for Sample 1). But if you want to incorporate a more complex growth curve model, then you may have difficulty doing so using a package.

Yu may like to have a look at the packages "gee" and "geepack" in R to see what has been done, but -- for istance -- I don't see how you would incorporate a "von Bertalanffy" link function without doing a lot of work!

Probably, developing you own approach along the lines shown above will do what you want.


  1. The above approach is derived from observation of the particular sample, as are the remarks about problematic features of the data. Hence it is purely illustrative of a possible way to approach the problem. Whether it would be useful for the rest of your 324 samples is beyond my knowledge!
  2. The fitted linear growth rate of log(L) with Age is again compatible with the data in your sample. More generally, and in particular given the possible flattening at Age = 6 which may be discernible in your data, it would be better in a general context to fit a more realistic growth curve (e.g. von Bertalanffy). This would not affect the general strategy of the approach, and would affect only the expression used to evaluate mean.log.LA in the equations.
  3. The adoption of a uniform prior distribution for Age was chosen (rather, leapt on) for simplicity since it was compatible with the data. In real life (i.e. 324 samples) things may look different. In any case, you may have a model for the distribution of Age in your fish populations (though the features noted in the histograms of Age need to be accounted for).

Hoping this help!
Best wishes,

E-Mail: (Ted Harding) <> Fax-to-email: +44 (0)870 094 0861
Date: 17-Oct-06                                       Time: 17:07:48
------------------------------ XFMail ------------------------------

______________________________________________ mailing list PLEASE do read the posting guide and provide commented, minimal, self-contained, reproducible code. Received on Wed Oct 18 02:31:06 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Tue 17 Oct 2006 - 17:30:10 GMT.

Mailing list information is available at Please read the posting guide before posting to the list.