Re: [R] read.table mystery

From: David Winsemius <dwinsemius_at_comcast.net>
Date: Sun, 06 Mar 2011 13:48:32 -0500

On Mar 6, 2011, at 12:47 PM, Johannes Graumann wrote:

> Thank you for pointing this out. This is really inconvenient as I do not
> know a priori how many and where those darn cases containing an additional
> (or more) ":" might be ...

There is a count.fields function that might assist with this task.
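As a rough sketch of what count.fields reports, the snippet below counts the ":"-separated fields on each line. The file contents are invented for illustration (they only mimic the layout discussed in this thread, with shortened sequence chunks), and a temporary file stands in for /tmp/testfile.txt.

```r
# Hypothetical sample lines, NOT the poster's real file
sample_lines <- c(
  "1:>sp|P00001|AAAA_HUMAN Some protein OS=Homo sapiens",
  "1:MKTAYIAKQR",
  "61:QISFVKSHFS",
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, mitochondrial",
  "1065:MGINCLNIFR"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Fields per line when splitting on ":"; quote = "" and comment.char = ""
# stop quote characters and "#" from being treated specially
n_fields <- count.fields(tmp, sep = ":", quote = "", comment.char = "")

# Lines carrying an additional ":" are the ones with more than 2 fields
extra_colon <- which(n_fields > 2)
```

Here `extra_colon` points at the line whose header contains "NADPH:adrenodoxin", which is the kind of case that cannot be known a priori.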

You seem to have a multiline (variable number of lines) format of:

NNNN:>sp|header with "|" AND white space separators
NNNN:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEEEEE
NNNN+60:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEE
NNNN+120:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDE
NNNN+180:EXCEPT_LAST

No way that read.table can work. You might create an index with the location of the high-count headers and then reprocess.

log.idx <- count.fields("/tmp/testfile.txt") > 1
corpus <- readLines("/tmp/testfile.txt")
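A self-contained version of the index idea might look like the following. The sample lines are made up, a temporary file replaces /tmp/testfile.txt, and the sketch assumes the file has no blank lines (count.fields skips them by default, which would put the index out of step with readLines). The grepl() alternative is my own suggestion for headers that happen to contain no white space.

```r
# Hypothetical sample lines, NOT the poster's real file
sample_lines <- c(
  "1:>sp|P00001|AAAA_HUMAN Some protein OS=Homo sapiens",
  "1:MKTAYIAKQR",
  "61:QISFVKSHFS",
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, mitochondrial",
  "1065:MGINCLNIFR"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Default sep = "" splits on white space: headers give several fields,
# sequence lines (no white space) give exactly one
log.idx <- count.fields(tmp, quote = "", comment.char = "") > 1
corpus  <- readLines(tmp)

# The index is only usable if it lines up with the lines read
stopifnot(length(log.idx) == length(corpus))
headers <- corpus[log.idx]

# Alternative index that does not depend on white space in the header
alt.idx <- grepl("^[0-9]+:>", corpus)
```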

Then parse the headers and rejoin the broken multi-line content. There may be worked examples in the archive for variable number multi-line file formats.
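One possible way to do the parsing and rejoining, offered as a sketch rather than a tested solution for the real file: split every line at its first ":" only, mark header lines by the leading ">", and paste the sequence chunks of each record together. The input lines and the column names start, header and sequence are assumptions for illustration.

```r
# Hypothetical input, as readLines() might return it (NOT the real file)
corpus <- c(
  "1:>sp|P00001|AAAA_HUMAN Some protein OS=Homo sapiens",
  "1:MKTAYIAKQR",
  "61:QISFVKSHFS",
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, mitochondrial",
  "1065:MGINCLNIFR"
)

# Split at the FIRST ":" only, so later colons stay in the content
offset  <- as.integer(sub(":.*$", "", corpus))
content <- sub("^[^:]*:", "", corpus)

# Header lines start with ">"; cumsum() gives every line its record number
is.header <- startsWith(content, ">")
record    <- cumsum(is.header)

# Group sequence chunks by record; fixed factor levels keep records
# that have a header but no sequence lines
chunks <- split(content[!is.header],
                factor(record[!is.header], levels = seq_len(sum(is.header))))
sequence <- vapply(chunks, paste, character(1), collapse = "",
                   USE.NAMES = FALSE)

result <- data.frame(
  start    = offset[is.header],
  header   = sub("^>", "", content[is.header]),
  sequence = sequence,
  stringsAsFactors = FALSE
)
```

Because the split happens at the first ":" only, a header such as "NADPH:adrenodoxin ..." survives intact however many colons it contains.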

-- 
David.



>
> This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
> somewhere (1 & 2 both integerable).
>
> na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))
>
> More robust pointers anyone?
>
> Joh
>
> Sarah Goslee wrote:
>
>> Not so much a mystery. read.table() only looks at the first 5 lines when
>> deciding how many columns your file has (as described in the Details
>> section of the help).
>>
>> The easiest solution is to add a col.names argument to read.table() with
>> the correct number of names.
>>
>> You may want to also include as.is=TRUE if you don't want your data to
>> be imported as factors. If you expect character but have factor you may
>> get unexpected results later.
>>
>> Sarah
>>
>> On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
>> <johannes_graumann_at_web.de> wrote:
>>> Hello,
>>>
>>> Please have a look at the code below, which I use to read in the attached
>>> file. As line 18 of the file reads "1065:>sp|Q9V3T9|ADRO_DROME
>>> NADPH:adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
>>> melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a
>>> 3-column data frame with most of the last column empty and line 18 to
>>> produce a data.frame row like so:
>>>
>>> V1
>>> 1065
>>> V2
>>> >sp|Q9V3T9|ADRO_DROME NADPH
>>> V3
>>> adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
>>> melanogaster GN=dare PE=2 SV=1
>>>
>>> Why is that not so?
>>>
>>> Thanks for any hint.
>>>
>>> Sincerely, Joh
>>>
>>> read.table(
>>> "/tmp/testfile.txt",
>>> sep=":",
>>> header=FALSE,
>>> quote="",
>>> fill=TRUE
>>> )[19,]
>>
>> ---
>> Sarah Goslee
>> http://www.functionaldiversity.org
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
Received on Sun 06 Mar 2011 - 18:52:28 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 07 Mar 2011 - 05:20:19 GMT.
