Re: [R] Parsing

From: jim holtman <jholtman_at_gmail.com>
Date: Wed, 09 Jul 2008 08:30:42 -0400

This should do what you want. It uses loops; you can work at replacing those with 'lapply' and such, but whether that is worthwhile depends on whether rewriting the code would take you more time than processing a set of data (you never did say how large the data is). This also "grows" a data.frame, but you have not indicated how efficient it has to be. So this could be used as a model.

> x <- readLines(textConnection("x      x_string
+ y      y_string
+ id1    id1_string
+ id2    id2_string
+ z      z_string
+ w      w_string
+ stuff  stuff  stuff
+ stuff  stuff  stuff
+ stuff  stuff  stuff
+ //
+ x      x_string1
+ y      y_string1
+ z      z_string1
+ w      w_string1
+ stuff  stuff  stuff
+ stuff  stuff  stuff
+ stuff  stuff  stuff
+ //
+ x      x_string2
+ y      y_string2
+ id1    id1_string1
+ id2    id2_string1
+ z      z_string2
+ w      w_string2
+ stuff  stuff  stuff
+ stuff  stuff  stuff
+ stuff  stuff  stuff
+ //"))

> # I assume that each group is delimited by "//"
> # initialize data.frame with desired values
> .keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, w=NA)
> .out <- .keys # for the first pass
> .save <- NULL
> for (i in seq_along(x)){
+     if (x[i] == "//"){  # output the current data
+         .save <- rbind(.save, .out)
+         .out <- .keys    # setup for the next pass
+     } else {
+         .split <- strsplit(x[i], "\\s+")
+         if (.split[[1]][1] %in% names(.out)){
+             .out[[.split[[1]][1]]] <- .split[[1]][2]
+         }
+     }
+ }

> .save
          x         y         id1         id2         w
1  x_string  y_string  id1_string  id2_string  w_string
2 x_string1 y_string1        <NA>        <NA> w_string1
3 x_string2 y_string2 id1_string1 id2_string1 w_string2
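
If you do want to drop the loop and avoid growing a data.frame, something along these lines might work. It is only a minimal sketch under a few assumptions: every record ends with a "//" line, each wanted line is "key value" separated by whitespace, and a key appears at most once per record. The names 'keys', 'rec' and 'out' are illustrative, the sample input is shortened, and 'z' is included here because the original request listed it. The result is a character matrix filled in one step by matrix indexing, with NA wherever a field is missing.

```r
# Shortened sample input: two records, the second without id1/id2
x <- c("x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
       "z z_string", "w w_string", "stuff stuff stuff", "//",
       "x x_string1", "y y_string1", "z z_string1", "w w_string1",
       "stuff stuff stuff", "//")

keys <- c("x", "y", "id1", "id2", "z", "w")  # fields to keep (assumed set)

# Record number of each non-delimiter line: count the "//" lines before it
is_delim <- x == "//"
rec  <- cumsum(is_delim)[!is_delim] + 1L
body <- x[!is_delim]

# First token is the key, second token is the value
parts <- strsplit(body, "\\s+")
key   <- vapply(parts, `[`, character(1), 1)
val   <- vapply(parts, `[`, character(1), 2)

# Preallocate the whole result as NA, then fill by (row, column) pairs
keep <- key %in% keys
out  <- matrix(NA_character_, nrow = max(rec), ncol = length(keys),
               dimnames = list(NULL, keys))
out[cbind(rec[keep], match(key[keep], keys))] <- val[keep]

out
```

Because the matrix is allocated once, nothing is copied per record as happens with repeated rbind; as.data.frame(out) converts it if a data.frame is preferred.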

On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <paolo.sonego_at_gmail.com> wrote:
> Dear R users,
>
> I have a big text file formatted like this:
>
> x x_string
> y y_string
> id1 id1_string
> id2 id2_string
> z z_string
> w w_string
> stuff stuff stuff
> stuff stuff stuff
> stuff stuff stuff
> //
> x x_string1
> y y_string1
> z z_string1
> w w_string1
> stuff stuff stuff
> stuff stuff stuff
> stuff stuff stuff
> //
> x x_string2
> y y_string2
> id1 id1_string1
> id2 id2_string1
> z z_string2
> w w_string2
> stuff stuff stuff
> stuff stuff stuff
> stuff stuff stuff
> //
> ...
> ...
>
>

> I'd like to parse this file and retrieve the x, y, id1, id2, z, w fields and
> save them into a matrix object:
>
> x y id1 id2 z w
> x_string y_string id1_string id2_string z_string w_string
> x_string1 y_string1 NA NA z_string1 w_string1
> x_string2 y_string2 id1_string1 id2_string1 z_string2 w_string2
> ...
> ...
>
> id1, id2 fields are not always present within a section (the interval
> between x and the last stuff) and
> I'd like to insert a NA when they are absent (see above) so that
> length(x)==length(y)==length(id1)==... .
>
> Without the id1, id2 fields the task is easily solvable importing the text
> file with readLines and retrieving the single fields with grep:
>
> input = readLines("file.txt")
> x = grep("^x\\s", input, value = T)
> id1 = grep("^id1\\s", input, value = T)
> ...
>
> I'd like to accomplish this task entirely in R (no SQL, no perl script),
> possibly without using loops.
>
> Any suggestions are quite welcome!
>
> Regards,
> Paolo
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

Received on Wed 09 Jul 2008 - 12:35:29 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 09 Jul 2008 - 15:31:18 GMT.
