Re: [R] options in 'rnorm' to set the lower bound of normal dist

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu, 27 Mar 2008 15:12:51 +0000 (GMT)

On Thu, 27 Mar 2008, Ted.Harding_at_manchester.ac.uk wrote:

> Brian Ripley's suggestions of truncated normal and
> log-normal are of course resources for ensuring that
> you get positive simulated results.

Wanting to simulate from the fitted mean and sd does *not* mean you want to emulate the data -- indeed, it is often done to see if peculiar aspects of data have a noticeable effect on the rest of the analysis.

I've done this with analytical chemist colleagues several times, although I did point out it was better to use robust summaries.

> However, I think youre real problem (having looked at
> the numbers you quote) is that you should not be thinking
> of using a normal distribution, or anything similar, at all.
>
> Your variables dat\$x and dat\$y have:
> mean(dat\$x)
> ## [1] 0.3126667
> sd(dat\$x)
> ## [1] 0.3372137
>
> mean(dat\$y)
> ## [1] 0.4223137
> sd(dat\$y)
> ## [1] 0.5120841
>
> so in both cases the SD is about the same as the mean, and
> getting negative values from simulated normal distributions
> with these means and SDs is inevitable.
>
> But now look at the two series of 15 numbers in dat\$x and dat\$y:
> The first 12 are of the order of 0.15 and 0.20 respectively,
> while the final 3 in each case are of the order of 1.0 and 1.5
> respectively. And this is where your large SD is coming from.
> Neither series is from a simple normal distribution.
>
> mean(dat\$x[1:12])
> ## [1] 0.1519167
> sd(dat\$x[1:12])
> ## [1] 0.02447432
> and it is impossible for the last 3 ( 1.032 0.803 1.032)
> to have come from a normal distribution giving the first 12.
>
> mean(dat\$y[1:12])
> ## [1] 0.1807932
> sd(dat\$y[1:12])
> ## [1] 0.03380083
> with a similar conclusion; and likewise for the last three
> 1.551043 1.063100 1.551043
>
> Note that, for the first 12 in each case, the SD less that
> 1/5 of the mean:
>
> mean(dat\$x[1:12])/sd(dat\$x[1:12])
> ## [1] 6.207186
> mean(dat\$y[1:12])/sd(dat\$y[1:12])
> ## [1] 5.348779
>
> so, where the first 12 of x and y are concerned, if you sampled
> from normal distributions with the same means and SDs you would
> get a negative number with probability less than
>
> pnorm(-5)
> ## [1] 2.866516e-07
>
> What you in fact have here is that the numbers are in two groups,
> each with a small SD relative to its mean:
>
> dat\$x dat\$y
> Mean SD Mean SD
> ------------------------+-------------------
> 1:12 0.152 0.024 | 0.181 0.034
> |
> 13:15 0.956 0.132 | 1.39 0.282
>
> Note that for dat\$x Mean/SD approx = 6 for each sub-series,
> and for data\$y Mean/SD approx = 5 for each subseries, so
> you could be looking at results which display a nearly
> constant coefficient of variation. Now, this is indeed a
> property of the log-normal distribution (as well as of
> others), so that could indeed be worth considering.
> However, you still have to account for the apparent split
> noted above into distinct groups.
>
> So you are really facing a modelling question: why did
> the numbers come out as they did, and what is a good way
> to represent that mechanism as a distribution?
>
> With best wishes,
> Ted.
>
> On 27-Mar-08 12:27:55, Tom Cohen wrote:
>>
>> Dear list,
>> I have a dataset containing values obtained from two different
>> instruments (x and y).
>> I want to generate 5 samples from normal distribution for each
>> instrument based on
>> their means and standard deviations. The problem is values from both
>> instruments are
>> non-negative, so if using rnorm I would get some negative values. Is
>> there any options

>> to determine the lower bound of normal distribution to be 0 or can I
>> simulate the
>> samples in different ways to avoid the negative values?
>>
>>
>> > dat
>> id x y
>> 75 101 0.134 0.1911315
>> 79 102 0.170 0.1610306
>> 76 103 0.134 0.1911315
>> 84 104 0.170 0.1610306
>> 74 105 0.134 0.1911315
>> 80 106 0.170 0.1610306
>> 77 107 0.134 0.1911315
>> 81 108 0.170 0.1610306
>> 82 109 0.170 0.1610306
>> 78 111 0.170 0.1610306
>> 83 112 0.170 0.1610306
>> 85 113 0.097 0.2777778
>> 2 201 1.032 1.5510434
>> 1 202 0.803 1.0631001
>> 5 203 1.032 1.5510434
>>
>> mu<-apply(dat[,-1],2,mean)
>> sigma<-apply(dat[,-1],2,sd)
>> len<-5
>> n<-20
>> s1<-vector("list",len)
>> set.seed(7)
>> for(i in 1:len){
>> s1[[i]]<-cbind.data.frame(x=rnorm(n*i,mean=mu[1],sd=sigma[1]),
>> y=rnorm(n*i,mean=mu[2],sd=sigma[2]))
>> }
>>
>> Thanks for any help,
>> Tom
>>
>>
>> ---------------------------------
>> Sök efter kärleken!
>>
>> [[alternative HTML version deleted]]
>>
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding_at_manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 27-Mar-08 Time: 14:00:44
> ------------------------------ XFMail ------------------------------
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> and provide commented, minimal, self-contained, reproducible code.
>

```--
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help