Re: [R] Unexpected behaviour in reading genomic coordinate files of R-2.7.0

From: Michael Lawrence <mflawren_at_fhcrc.org>
Date: Thu, 29 May 2008 16:03:40 -0700

This does not really address your problem, but I thought you might want to know that the rtracklayer package in Bioconductor already supports parsing BED files, as well as GFF and WIG. Its main purpose is to load the tracks into genome browsers, like UCSC.
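
A minimal sketch of reading a BED file that way, assuming a current Bioconductor release with rtracklayer installed. The two records are copied from the quoted message, the temporary file and the names bed_path and tracks are made up for illustration, and the class of the returned object may differ between rtracklayer versions:

```r
# Write a tiny two-record BED file so the example is self-contained.
# The records are the two rows shown in the quoted message.
bed_path <- tempfile(fileext = ".bed")
writeLines(c(
  "chr1\t100088396\t100088446\tseq1\t0\t+",
  "chr1\t100088764\t100088814\tseq2\t0\t-"
), bed_path)

# Only attempt the import if rtracklayer is actually installed.
if (requireNamespace("rtracklayer", quietly = TRUE)) {
  # import() parses the file according to the given format; the strand
  # column is handled by the BED parser rather than by type guessing.
  tracks <- rtracklayer::import(bed_path, format = "BED")
  print(tracks)
}
```

The requireNamespace() guard is only there so the snippet does nothing instead of failing on a machine without the package.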

Michael

On Wed, May 28, 2008 at 1:11 AM, Margherita <atreeneedsaforest_at_yahoo.it> wrote:

> Great R people,
>
> I have noticed strange behaviour in read.delim() and friends in R
> 2.7.0. I will describe the problem and the solution I have already
> found, both to check whether this behaviour is expected or needs a
> correction, and to show anyone who runs into the same difficulty a way
> to overcome it.
>
> Here is the problem:
> I have some genomic coordinate files (for example BED files, a standard
> format) containing a column (Strand) in which there is either a "+" or a
> "-".
> In R-2.6.2patched (and every earlier version I have used) I never had
> problems reading them in, for example:
> > a <- read.table("coords.bed", skip=1)
> > disp(a)
> class data.frame
> dimensions are 38650 6
> first rows:
> V1 V2 V3 V4 V5 V6
> 1 chr1 100088396 100088446 seq1 0 +
> 2 chr1 100088764 100088814 seq2 0 -
>
> If I do exactly the same command on the same file in R-2.7.0 the result I
> obtain is:
> > a <- read.table("coords.bed", skip=1)
> > disp(a)
> class data.frame
> dimensions are 38650 6
> first rows:
> V1 V2 V3 V4 V5 V6
> 1 chr1 100088396 100088446 seq1 0 0
> 2 chr1 100088764 100088814 seq2 0 0
>
> and I completely lose the strand information: the values are all zeros! I
> have also tried putting quotes around "+" and "-" in the file before
> reading it, setting stringsAsFactors=FALSE in the read.table() call, and
> setting "encoding" to a few different alternatives, but the result was
> always the same: they are all converted to 0.
>
> Then I tried scan() and I saw it was reading the character "+" properly:
> > scan("coords.bed", skip=1, nlines=1, what="ch")
> Read 6 items
> [1] "chr1"         "100088396"    "100088446.00" "seq1"         "0"
> [6] "+"
> ...my conclusion is that a lone "+" or "-" is not treated as "character"
> in the data frame creation step; it is treated as "numeric" and, having
> no digits, is converted to 0.
> Is it correct for this to happen even when they are surrounded by
> quotes?
>
> Anyway, my temporary solution (which works without the need of changing the
> files) is:
> a <- read.table("coords.bed", skip=1, colClasses=c("character", "numeric",
> "numeric", "character", "numeric", "character"))
> > a[1:2,]
> V1 V2 V3 V4 V5 V6
> 1 chr1 100088396 100088446 seq1 0 +
> 2 chr1 100088764 100088814 seq2 0 -
>
> Another way to avoid losing the strand information was to manually replace
> "-" with "R" and "+" with "F" in the file before reading it into R. But
> that is much more cumbersome, since + and - are, for example, the standard
> notation in BED files accepted and generated by the Genome Browser and
> other genome sites.
>
> Please let me know what you think. P.S. I saw this first in the Fedora
> version (rpm automatically updated), but it also reproduces in the
> Windows version.
>
> Thank you all for your work and for making R the wonderful tool it
> is!
>
> Cheers,
>
> Margherita
>
> --
>
> -----------------------------------------------------------------------------------
> Margherita Mutarelli, PhD Seconda Universita' di Napoli
> Dipartimento di Patologia Generale
> via L. De Crecchio, 7
> 80138 Napoli - Italy
> Tel/Fax. +39.081.5665802
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
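
The colClasses workaround from the quoted message can be reproduced without the original file. This is only a sketch: the temporary file, its "track" header line, and the names bed_path and bed_classes are made up for illustration, and the two records are the rows shown in the quoted output.

```r
# Build a small stand-in for coords.bed: one header line to skip,
# followed by two tab-separated records with a strand column.
bed_path <- tempfile(fileext = ".bed")
writeLines(c(
  "track name=example",
  "chr1\t100088396\t100088446\tseq1\t0\t+",
  "chr1\t100088764\t100088814\tseq2\t0\t-"
), bed_path)

# Declare every column type up front so read.table() does not guess;
# the sixth column (strand) is forced to character.
bed_classes <- c("character", "numeric", "numeric",
                 "character", "numeric", "character")

a <- read.table(bed_path, skip = 1, colClasses = bed_classes)

# The strand column keeps its "+" and "-" values.
print(a$V6)
```

Declaring the classes explicitly bypasses automatic type conversion for all six columns, which is why the result no longer depends on how a given R version treats a lone "+" or "-".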




Received on Thu 29 May 2008 - 23:07:17 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 29 May 2008 - 23:30:43 GMT.
