Re: [R] Parsing

From: Paolo Sonego <paolo.sonego_at_gmail.com>
Date: Wed, 09 Jul 2008 16:57:27 +0200

Thanks so much, Jim! It works without a glitch! My only problem is that the text files to be parsed are quite big, up to several thousand rows (my apologies for the incomplete information in my former post), so loops are not my first choice. I'll take a look at 'lapply' using your code as a model. Thanks again!

Sincerely,
Paolo
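
The 'lapply' rewrite mentioned above could look roughly like the sketch below. It is not code from the thread: the names is_delim, record, and parse_record are made up for illustration, and it assumes every record ends with a "//" line and that each field line has the form "key value". One difference from the quoted loop: if a key is repeated within a record, match() keeps the first occurrence, whereas the loop keeps the last.

```r
# Sample input in the format discussed in the thread; records end with "//".
x <- c("x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
       "z z_string", "w w_string", "stuff stuff stuff", "//",
       "x x_string1", "y y_string1", "z z_string1", "w w_string1",
       "stuff stuff stuff", "//")

keys <- c("x", "y", "id1", "id2", "w")  # fields to keep, as in the quoted code

# Number the records: counting the delimiters seen so far gives a record index.
is_delim <- x == "//"
record   <- cumsum(is_delim) + 1L
body     <- x[!is_delim]
group    <- record[!is_delim]

# Turn the lines of one record into a named character vector (NA if absent).
parse_record <- function(lines) {
  parts <- strsplit(lines, "\\s+")
  field <- vapply(parts, `[`, character(1), 1)
  value <- vapply(parts, `[`, character(1), 2)
  out <- value[match(keys, field)]
  names(out) <- keys
  out
}

# One call per record instead of one loop iteration per line; rbind only once.
rows   <- lapply(split(body, group), parse_record)
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(result)
```

Because the rows are bound together in a single do.call(rbind, ...) at the end, this avoids growing a data.frame line by line.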

jim holtman wrote:
> This should do what you want: (it uses loops; you can work at
> replacing those with 'lapply' and such -- it all depends on if it is
> going to take you more time to rewrite the code than to process a set
> of data; you never did say how large the data was). This also "grows"
> a data.frame, but you have not indicated how efficient it has to be.
> So this could be used as a model.
>
>
>> x <- readLines(textConnection("x x_string
> + y y_string
> + id1 id1_string
> + id2 id2_string
> + z z_string
> + w w_string
> + stuff stuff stuff
> + stuff stuff stuff
> + stuff stuff stuff
> + //
> + x x_string1
> + y y_string1
> + z z_string1
> + w w_string1
> + stuff stuff stuff
> + stuff stuff stuff
> + stuff stuff stuff
> + //
> + x x_string2
> + y y_string2
> + id1 id1_string1
> + id2 id2_string1
> + z z_string2
> + w w_string2
> + stuff stuff stuff
> + stuff stuff stuff
> + stuff stuff stuff
> + //"))
>
>> # I assume that each group is delimited by "//"
>> # initialize data.frame with desired values
>> .keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, w=NA)
>> .out <- .keys # for the first pass
>> .save <- NULL
>> for (i in seq_along(x)){
> +     if (x[i] == "//"){ # output the current data
> +         .save <- rbind(.save, .out)
> +         .out <- .keys # setup for the next pass
> +     } else {
> +         .split <- strsplit(x[i], "\\s+")
> +         if (.split[[1]][1] %in% names(.out)){
> +             .out[[.split[[1]][1]]] <- .split[[1]][2]
> +         }
> +     }
> + }
>
>> .save
>           x         y         id1         id2         w
> 1  x_string  y_string  id1_string  id2_string  w_string
> 2 x_string1 y_string1        <NA>        <NA> w_string1
> 3 x_string2 y_string2 id1_string1 id2_string1 w_string2
>
>
> On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <paolo.sonego_at_gmail.com> wrote:
>
>> Dear R users,
>>
>> I have a big text file formatted like this:
>>
>> x x_string
>> y y_string
>> id1 id1_string
>> id2 id2_string
>> z z_string
>> w w_string
>> stuff stuff stuff
>> stuff stuff stuff
>> stuff stuff stuff
>> //
>> x x_string1
>> y y_string1
>> z z_string1
>> w w_string1
>> stuff stuff stuff
>> stuff stuff stuff
>> stuff stuff stuff
>> //
>> x x_string2
>> y y_string2
>> id1 id1_string1
>> id2 id2_string1
>> z z_string2
>> w w_string2
>> stuff stuff stuff
>> stuff stuff stuff
>> stuff stuff stuff
>> //
>> ...
>> ...
>>
>>
>> I'd like to parse this file and retrieve the x, y, id1, id2, z, w fields and
>> save them into a matrix object:
>>
>> x         y         id1         id2         z         w
>> x_string  y_string  id1_string  id2_string  z_string  w_string
>> x_string1 y_string1 NA          NA          z_string1 w_string1
>> x_string2 y_string2 id1_string1 id2_string1 z_string2 w_string2
>> ...
>> ...
>>
>> id1, id2 fields are not always present within a section (the interval
>> between x and the last stuff) and
>> I'd like to insert a NA when they are absent (see above) so that
>> length(x)==length(y)==length(id1)==... .
>>
>> Without the id1, id2 fields the task is easily solvable importing the text
>> file with readLines and retrieving the single fields with grep:
>>
>> input = readLines("file.txt")
>> x = grep("^x\\s", input, value = T)
>> id1 = grep("^id1\\s", input, value = T)
>> ...
>>
>> I'd like to accomplish this task entirely in R (no SQL, no perl script),
>> possibly without using loops.
>>
>> Any suggestions are quite welcome!
>>
>> Regards,
>> Paolo
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
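
For the character matrix requested in the original question (including the z column, which the quoted data.frame leaves out), a loop-free sketch using matrix indexing might look like this. It is only an illustration under assumptions: the variable names are hypothetical, every key line is assumed to carry a value after the key, and if a key repeats within a record the last value wins.

```r
# Sample input in the format from the original question.
x <- c("x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
       "z z_string", "w w_string", "stuff stuff stuff", "//",
       "x x_string1", "y y_string1", "z z_string1", "w w_string1",
       "stuff stuff stuff", "//")

keys <- c("x", "y", "id1", "id2", "z", "w")

# Record number for each non-delimiter line.
is_delim <- x == "//"
record   <- cumsum(is_delim)[!is_delim] + 1L
body     <- x[!is_delim]

# First token is the field name, second token is its value.
field <- sub("\\s.*$", "", body, perl = TRUE)
value <- sub("^\\S+\\s+(\\S+).*$", "\\1", body, perl = TRUE)
keep  <- field %in% keys

# Start from an all-NA matrix, then fill (row, column) pairs in one assignment.
m <- matrix(NA_character_, nrow = max(record), ncol = length(keys),
            dimnames = list(NULL, keys))
m[cbind(record[keep], match(field[keep], keys))] <- value[keep]
print(m)
```

Fields missing from a record simply stay NA, so all columns end up with the same length without any per-record bookkeeping.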



Received on Wed 09 Jul 2008 - 15:02:35 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 09 Jul 2008 - 16:31:34 GMT.
