Re: [R] exploring dist()

From: Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
Date: Sun, 20 Mar 2011 19:43:47 +0000

On Fri, 2011-03-18 at 06:21 -0700, bra86 wrote:
> Hello, everybody,
>
> I hope somebody could help me with a dist() function.
> I have a data frame of size 2*4087 (col*row), where col corresponds to the
> treatment and rows are

So you have 4087 species? If yes, normally you'd have the species in the columns and the samples/treatments in the rows.

> species, values are Hellinger distances, I should reconstruct a distance
> matrix

This doesn't make sense - distances would mean you have a square symmetric matrix but 2 * 4087 isn't square. Do you mean you have Hellinger **transformed** the data such that when you take the Euclidean distances of this transformed data you get the Hellinger distance rather than the Euclidean distance?

If yes, and once you sort out the rows/columns issue (R wants the samples in rows), then it is reasonably simple.

Here is a much simplified example with 5 species and 4 samples:

dat <- data.frame(runif(4, 1, 10), runif(4, 2, 10), runif(4, 4, 20),
                  runif(4, 1, 4), runif(4, 0, 5))
names(dat) <- paste("spp", LETTERS[1:5])
rownames(dat) <- paste("samp", 1:4)

So we have data that looks like this:

> dat
          spp A    spp B     spp C    spp D     spp E
samp 1 6.974237 7.933403  5.460453 3.975219 4.6818142
samp 2 1.049801 6.751013 14.143798 1.777532 4.0261914
samp 3 5.742314 2.243850 15.613524 3.476935 0.4144043
samp 4 5.985012 9.576440  8.722579 3.411262 1.8126338

Then I apply a Hellinger transformation:

require(vegan)
datH <- decostand(dat, method = "hellinger")

So at this point we have something that I think you are telling us you have:

> datH
           spp A     spp B     spp C     spp D     spp E
samp 1 0.4901864 0.5228086 0.4337378 0.3700782 0.4016244
samp 2 0.1945069 0.4932488 0.7139447 0.2530989 0.3809156
samp 3 0.4570334 0.2856942 0.7536245 0.3556336 0.1227769
samp 4 0.4503635 0.5696823 0.5436922 0.3400073 0.2478481
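
For reference, the Hellinger transformation is just the square root of the row-wise proportions, so it can be sketched in base R without vegan. The matrix `x` below is made-up illustration data (not the data above), and the names `x` and `xH` are my own; as far as I know this is the same calculation decostand() performs for method = "hellinger", but treat it as a sketch rather than a replacement for it.

```r
# Hypothetical abundance data: 3 samples (rows) by 4 species (columns)
x <- matrix(c(10, 0, 5,  5,
               2, 8, 0, 10,
               4, 4, 4,  4),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste("samp", 1:3),
                            paste("spp", LETTERS[1:4])))

# Hellinger transformation: square root of each value divided by its row total
xH <- sqrt(x / rowSums(x))

# Sanity check: the squared values in each row sum to 1
rowSums(xH^2)

# Euclidean distances on the transformed data are the Hellinger distances
dist(xH)
```

The row sums of the squared values being 1 is a quick way to check whether data you have been handed really are Hellinger transformed with samples in rows.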

We can use dist() on this data frame via:

dij <- dist(datH)

If we look at the object created, we see the **printed** representation of the dissimilarity matrix, which is a 4*4 matrix in this example:

> dij
          samp 1    samp 2    samp 3
samp 2 0.4253576                    
samp 3 0.4874570 0.4367179          
samp 4 0.2010581 0.3543312 0.3750363

Note that the diagonal and the upper triangle of the matrix are not printed, or even stored, because they are redundant (the diagonal is all 0 and the upper triangle mirrors the lower triangle).

dist() actually creates a vector of numbers that will fill the lower triangle of the dissimilarity matrix. This saves on storage space. If you want to add the diagonal and upper triangle, you can get them in one of two ways:

  1. dist(datH, diag = TRUE, upper = TRUE)
  2. as.matrix(dij)

However, only the second actually returns a matrix with 16 numbers. The former still only computes and stores the 6 pair-wise distances, but when **printed** it shows the full matrix.
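
You can check the difference between the two for yourself. The following is a small sketch using made-up data (the objects `m`, `d`, `d2` and `dm` are just names I've chosen for illustration), with 4 samples so the counts match the example above:

```r
# Hypothetical data: 4 samples (rows) by 3 variables (columns)
m <- matrix(c(1, 2,  3,
              4, 5,  6,
              7, 8, 10,
              2, 0,  1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste("samp", 1:4), NULL))

d <- dist(m)

class(d)         # a "dist" object, not a matrix
length(d)        # 6 stored distances, i.e. 4 * 3 / 2
attr(d, "Size")  # number of samples the distances relate to

# diag/upper only affect how the object is printed, not what is stored
d2 <- dist(m, diag = TRUE, upper = TRUE)
length(d2)       # still 6

# as.matrix() expands to the full square, symmetric matrix
dm <- as.matrix(d)
dim(dm)
isSymmetric(dm)
```

In general a "dist" object for n samples holds n * (n - 1) / 2 numbers, whereas the full matrix holds n * n.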

If you really have species in rows and samples in columns, then you can transpose your matrix, e.g.

datH.t <- t(datH)

and then compute the dissimilarity matrix as above.
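
To see why the orientation matters, here is a small sketch with made-up numbers laid out the way I think your data are, species in rows and treatments in columns (6 species rather than 4087, and the object names `sp` and `by_sample` are hypothetical):

```r
# Hypothetical layout: 6 species (rows) by 2 treatments (columns)
sp <- matrix(c(0.50, 0.10,
               0.30, 0.40,
               0.20, 0.60,
               0.40, 0.30,
               0.45, 0.50,
               0.50, 0.35),
             nrow = 6, byrow = TRUE,
             dimnames = list(paste("spp", LETTERS[1:6]),
                             c("treat 1", "treat 2")))

# Without transposing, dist() compares the species with one another
length(dist(sp))        # 15 distances, i.e. 6 * 5 / 2

# Transpose so the treatments are in rows, then compute distances
by_sample <- t(sp)
dim(by_sample)          # 2 rows, 6 columns

d <- dist(by_sample, method = "euclidean")
length(d)               # 1: a single distance between the two treatments
as.matrix(d)            # the same distance shown as a 2 x 2 matrix
```

So with only two treatments there is just one pair-wise distance to compute, whereas leaving the species in rows asks dist() for the distances among all the species, which for 4087 species is a very large lower triangle.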

Does this help?

G

> with a dist() function. I know that "euclidean" method should be used.
>
> When I type:
> dist(dframe,"euclidean")
> it gives me a truncated table, where values are missing.
>
> I suppose that I have to define something for the values,
> but I have no idea what exactly, because I am not familiar with r at all.
>
> I would be very appreciated for every kind of suggestions or tips.
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/exploring-dist-tp3387187p3387187.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

Received on Sun 20 Mar 2011 - 19:48:56 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 21 Mar 2011 - 11:50:23 GMT.