Re: [R] read.table mystery

From: Johannes Graumann <johannes_graumann_at_web.de>
Date: Sun, 06 Mar 2011 21:41:28 +0300

Opted for a solution with 100 column names, a count the data is unlikely to ever reach ...

Thanks for your guidance.

Joh
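
A minimal sketch of the col.names approach discussed in the quoted thread below. Instead of a fixed 100 names, it asks count.fields() for the largest field count in the file and passes exactly that many names to read.table(). The temporary file, the "protein_N" filler lines and the names n_fields and dat are assumptions made for illustration; only the long line comes from the original post.

```r
# Hypothetical sample file: six two-field lines (made-up filler), then the
# three-field line quoted in the original post. The extra ":" appears only
# after the first five lines, which is what trips up read.table().
sample_lines <- c(
  paste0(1:6, ":>protein_", 1:6),
  paste0("1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, ",
         "mitochondrial OS=Drosophila melanogaster GN=dare PE=2 SV=1")
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Count the fields on every line, not just the first five.
n_fields <- max(count.fields(path, sep = ":", quote = "", comment.char = ""))

# Supply exactly that many column names so no row is wrapped.
dat <- read.table(
  path,
  sep = ":",
  header = FALSE,
  quote = "",
  comment.char = "",
  fill = TRUE,
  as.is = TRUE,                              # keep text as character
  col.names = paste0("V", seq_len(n_fields))
)

dat[7, ]
```

This reads the file twice, which should be acceptable for moderately sized files and avoids carrying many empty columns around.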

On Sunday 06 March 2011 20:57:11 Sarah Goslee wrote:
> You could pre-process your data into a more sensible format.
> Or you could use scan to read each line of the file, count the number of
> colons, then use read.table with ncolons + 1 columns.
> Or you could use read.table with many more columns than are ever going to
> be in the data, then delete the empty ones.
> Or you could use read.table to read everything in as a single column, then
> use strsplit() to split it at the colons.
>
> There are generally lots of ways to do things, but they vary in efficiency
> both on the programming side and the execution side. For instance, the
> lots-of-columns solution is by far the easiest on the programmer, but is
> terribly inefficient and may fail completely for very large datasets.
>
> Sarah
>
> On Sun, Mar 6, 2011 at 12:47 PM, Johannes Graumann
> <johannes_graumann_at_web.de> wrote:
> > Thank you for pointing this out. This is really inconvenient as I do not
> > know a priori how many and where those darn cases containing an
> > additional (or more) ":" might be ...
> >
> > This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
> > somewhere (1 & 2 both integerable).
> >
> > na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))
> >
> > More robust pointers anyone?
> >
> > Joh
> >
> > Sarah Goslee wrote:
> >> Not so much a mystery. read.table() only looks at the first 5 lines when
> >> deciding how many columns your file has (as described in the Details
> >> section of the help).
> >>
> >> The easiest solution is to add a col.names argument to read.table() with
> >> the correct number of names.
> >>
> >> You may want to also include as.is=TRUE if you don't want your data to
> >> be imported as factors. If you expect character but have factor you may
> >> get unexpected results later.
> >>
> >> Sarah
> >>
> >> On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
> >> <johannes_graumann_at_web.de> wrote:
> >>> Hello,
> >>>
> >>>
> >>> Please have a look at the code below, which I use to read in the
> >>> attached file. As line 18 of the file reads
> >>> "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase,
> >>> mitochondrial OS=Drosophila
> >>> melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a 3
> >>> column data frame with most of the last column empty and line 18 to
> >>> produce a data.frame row like so:
> >>>
> >>> V1 = 1065
> >>> V2 = >sp|Q9V3T9|ADRO_DROME NADPH
> >>> V3 = adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
> >>>      melanogaster GN=dare PE=2 SV=1
> >>>
> >>> Why is that not so?
> >>>
> >>> Thanks for any hint.
> >>>
> >>> Sincerely, Joh
> >>>
> >>> read.table(
> >>>   "/tmp/testfile.txt",
> >>>   sep=":",
> >>>   header=FALSE,
> >>>   quote="",
> >>>   fill=TRUE
> >>> )[19,]
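
For comparison, a rough sketch of the other suggestion made in this thread: read each line as a single string and split it at the colons with strsplit(). The temporary file, the "protein_N" filler lines and the helper names (parts, padded, dat) are assumptions for illustration, not code from the thread. Shorter rows are padded with NA so that every row has the same width.

```r
# Hypothetical sample file: six two-field lines (made-up filler), then the
# three-field line quoted in the original post.
sample_lines <- c(
  paste0(1:6, ":>protein_", 1:6),
  paste0("1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, ",
         "mitochondrial OS=Drosophila melanogaster GN=dare PE=2 SV=1")
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Read whole lines, then split each one at every literal ":".
parts <- strsplit(readLines(path), ":", fixed = TRUE)

# Pad shorter rows with NA up to the widest row.
n_fields <- max(lengths(parts))
padded <- lapply(parts, function(p) {
  c(p, rep(NA_character_, n_fields - length(p)))
})

# Bind the rows into a character matrix, then into a data frame (V1, V2, ...).
dat <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
dat$V1 <- as.integer(dat$V1)   # first field holds the numeric identifier

dat[7, ]
```

Unlike read.table() with fill=TRUE, this version never guesses the column count from the first lines, at the cost of doing the type conversion by hand.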



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sun 06 Mar 2011 - 19:12:16 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.