Re: [R] Parsing

From: jim holtman <jholtman_at_gmail.com>
Date: Wed, 09 Jul 2008 11:47:37 -0400

How much time is it taking on the files, and how many files do you have to process? I tried it with your data duplicated so that I had 57K lines, and it took 27 seconds to process. How much faster do you want it to be?
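
A timing like the one above can be reproduced roughly as follows. This is only a sketch: `sample_lines`, `parse_loop`, and `n_copies` are hypothetical names, the function simply wraps the loop from the quoted message below, and the replication factor is kept small here so the example runs quickly (about 2000 copies of the 28-line sample would give roughly 57K lines).

```r
# Sample records from the thread; each record ends with "//".
sample_lines <- c(
  "x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
  "z z_string", "w w_string",
  "stuff stuff stuff", "stuff stuff stuff", "stuff stuff stuff", "//",
  "x x_string1", "y y_string1", "z z_string1", "w w_string1",
  "stuff stuff stuff", "stuff stuff stuff", "stuff stuff stuff", "//",
  "x x_string2", "y y_string2", "id1 id1_string1", "id2 id2_string1",
  "z z_string2", "w w_string2",
  "stuff stuff stuff", "stuff stuff stuff", "stuff stuff stuff", "//"
)

# The loop from the quoted message, wrapped in a function (hypothetical name).
parse_loop <- function(lines, keys = c("x", "y", "id1", "id2", "w")) {
  # One-row template with every wanted field set to NA.
  template <- as.data.frame(
    as.list(setNames(rep(NA_character_, length(keys)), keys)),
    stringsAsFactors = FALSE
  )
  out <- template
  saved <- NULL
  for (i in seq_along(lines)) {
    if (lines[i] == "//") {
      saved <- rbind(saved, out)  # grows the data.frame one row at a time
      out <- template
    } else {
      parts <- strsplit(lines[i], "\\s+")[[1]]
      if (parts[1] %in% keys) out[[parts[1]]] <- parts[2]
    }
  }
  saved
}

# Replicate the sample to get a larger input, then time the parse.
n_copies <- 200
big <- rep(sample_lines, times = n_copies)
timing <- system.time(result <- parse_loop(big))
print(timing[["elapsed"]])  # elapsed seconds; depends on the machine
print(dim(result))
```

Raising `n_copies` shows how the run time scales with input size, which is the number that matters when deciding whether a rewrite is worth the effort.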

On Wed, Jul 9, 2008 at 10:57 AM, Paolo Sonego <paolo.sonego_at_gmail.com> wrote:
> Thanks so much Jim! It works without a glitch!
> My only problem is that the text files to be parsed are quite big, up to
> several thousand rows (my apologies for the incomplete information in my
> former post), so loops are not my first choice. I'll take a look at 'lapply'
> using your code as a model. Thanks again!
>
> Sincerely,
> Paolo
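
One possible shape for an 'lapply' version is sketched here. It is not code from the thread: `parse_records` and `sample_lines` are hypothetical names, the record number for each line is derived with cumsum() on the "//" delimiter, and the value is taken to be everything after the first run of whitespace (the loop version keeps only the second token, which is the same thing for the sample data).

```r
# Sample records from the thread; each record ends with "//".
sample_lines <- c(
  "x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
  "z z_string", "w w_string", "stuff stuff stuff", "//",
  "x x_string1", "y y_string1", "z z_string1", "w w_string1",
  "stuff stuff stuff", "//",
  "x x_string2", "y y_string2", "id1 id1_string1", "id2 id2_string1",
  "z z_string2", "w w_string2", "stuff stuff stuff", "//"
)

parse_records <- function(lines, keys = c("x", "y", "id1", "id2", "z", "w")) {
  is_end <- lines == "//"
  # Record number of each line: delimiters seen before this line, plus one.
  rec_id <- cumsum(is_end) - is_end + 1L
  keep <- !is_end
  # One character vector of lines per record.
  records <- split(lines[keep], rec_id[keep])
  rows <- lapply(records, function(rec) {
    key <- sub("\\s.*$", "", rec)     # first token of each line
    val <- sub("^\\S+\\s+", "", rec)  # rest of the line
    val[match(keys, key)]             # NA where a key is absent
  })
  out <- do.call(rbind, rows)
  dimnames(out) <- list(NULL, keys)
  out
}

m <- parse_records(sample_lines)
print(m)
```

Because the rows are collected in a list and bound once at the end, nothing is grown inside a loop. Any lines after the last "//" would be treated as one more (partial) record.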
>
> jim holtman wrote:
>>
>> This should do what you want. It uses loops; you can work at
>> replacing those with 'lapply' and such. It all depends on whether it is
>> going to take you more time to rewrite the code than to process a set
>> of data; you never did say how large the data was. This also "grows"
>> a data.frame, but you have not indicated how efficient it has to be.
>> So this could be used as a model.
>>
>> > x <- readLines(textConnection("x x_string
>> + y y_string
>> + id1 id1_string
>> + id2 id2_string
>> + z z_string
>> + w w_string
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + //
>> + x x_string1
>> + y y_string1
>> + z z_string1
>> + w w_string1
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + //
>> + x x_string2
>> + y y_string2
>> + id1 id1_string1
>> + id2 id2_string1
>> + z z_string2
>> + w w_string2
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + //"))
>>
>> > # I assume that each group is delimited by "//"
>> > # initialize data.frame with desired values
>> > .keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, w=NA)
>> > .out <- .keys  # for the first pass
>> > .save <- NULL
>> > for (i in seq_along(x)){
>> +     if (x[i] == "//"){  # output the current data
>> +         .save <- rbind(.save, .out)
>> +         .out <- .keys  # setup for the next pass
>> +     } else {
>> +         .split <- strsplit(x[i], "\\s+")
>> +         if (.split[[1]][1] %in% names(.out)){
>> +             .out[[.split[[1]][1]]] <- .split[[1]][2]
>> +         }
>> +     }
>> + }
>>
>> > .save
>>           x         y         id1         id2         w
>> 1  x_string  y_string  id1_string  id2_string  w_string
>> 2 x_string1 y_string1        <NA>        <NA> w_string1
>> 3 x_string2 y_string2 id1_string1 id2_string1 w_string2
>>
>>
>> On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <paolo.sonego_at_gmail.com>
>> wrote:
>>
>>>
>>> Dear R users,
>>>
>>> I have a big text file formatted like this:
>>>
>>> x x_string
>>> y y_string
>>> id1 id1_string
>>> id2 id2_string
>>> z z_string
>>> w w_string
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> //
>>> x x_string1
>>> y y_string1
>>> z z_string1
>>> w w_string1
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> //
>>> x x_string2
>>> y y_string2
>>> id1 id1_string1
>>> id2 id2_string1
>>> z z_string2
>>> w w_string2
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> //
>>> ...
>>> ...
>>>
>>>
>>> I'd like to parse this file and retrieve the x, y, id1, id2, z, w fields
>>> and save them into a matrix object:
>>>
>>> x          y          id1          id2          z          w
>>> x_string   y_string   id1_string   id2_string   z_string   w_string
>>> x_string1  y_string1  NA           NA           z_string1  w_string1
>>> x_string2  y_string2  id1_string1  id2_string1  z_string2  w_string2
>>> ...
>>> ...
>>>
>>> The id1 and id2 fields are not always present within a section (the
>>> interval between x and the last stuff), and I'd like to insert an NA
>>> when they are absent (see above) so that all fields end up with the
>>> same length: length(x) == length(y) == length(id1) == ... .
>>>
>>> Without the id1 and id2 fields the task is easily solved by importing
>>> the text file with readLines and retrieving the individual fields with
>>> grep:
>>>
>>> input <- readLines("file.txt")
>>> x <- grep("^x\\s", input, value = TRUE)
>>> id1 <- grep("^id1\\s", input, value = TRUE)
>>> ...
>>>
>>> I'd like to accomplish this task entirely in R (no SQL, no perl script),
>>> possibly without using loops.
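
The grep approach above stops working once a field can be missing, because each grep() result no longer lines up with the records it came from. A loop-free variant that keeps track of the record number is sketched below. It is an illustration only, not code from the thread: `sample_lines` stands in for readLines("file.txt"), every other name is hypothetical, and it assumes each record is closed by a "//" line.

```r
# Stand-in for: input <- readLines("file.txt")
sample_lines <- c(
  "x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
  "z z_string", "w w_string", "stuff stuff stuff", "//",
  "x x_string1", "y y_string1", "z z_string1", "w w_string1",
  "stuff stuff stuff", "//",
  "x x_string2", "y y_string2", "id1 id1_string1", "id2 id2_string1",
  "z z_string2", "w w_string2", "stuff stuff stuff", "//"
)
input <- sample_lines
keys <- c("x", "y", "id1", "id2", "z", "w")

# Record number of each line: delimiters seen before this line, plus one.
is_end <- input == "//"
rec_id <- cumsum(is_end) - is_end + 1L

# First token of every line, and which lines carry a wanted field.
key <- sub("\\s.*$", "", input)
hit <- key %in% keys

# Start from a matrix of NA and fill only the cells that are present,
# addressing each cell by (record number, field position).
n_rec <- if (any(hit)) max(rec_id[hit]) else 0L
result <- matrix(NA_character_, nrow = n_rec, ncol = length(keys),
                 dimnames = list(NULL, keys))
result[cbind(rec_id[hit], match(key[hit], keys))] <-
  sub("^\\S+\\s+", "", input[hit])

print(result)
```

Cells that are never assigned stay NA, so absent id1 and id2 fields need no special handling. If a key occurs twice within one record, the later line wins.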
>>>
>>> Any suggestions are quite welcome!
>>>
>>> Regards,
>>> Paolo
>>>
>>> ______________________________________________
>>> R-help_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

Received on Wed 09 Jul 2008 - 16:01:44 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 09 Jul 2008 - 17:31:15 GMT.