Re: [R] preprocessing data

From: Jean Eid <jeaneid_at_chass.utoronto.ca>
Date: Wed 17 Aug 2005 - 04:07:24 EST


Thank you Gabor,

Jean

On Tue, 16 Aug 2005, Gabor Grothendieck wrote:

> On 8/16/05, Jean Eid <jeaneid@chass.utoronto.ca> wrote:
> > Dear all,
> >
> > My question is concerning the line
> > "This is adequate for small files, but for anything more complicated we
> > recommend using the facilities of a language like perl to pre-process
> > the file."
> >
> > in the import/export manual.
> >
> > I have a large fixed-width file that I would like to preprocess in Perl or
> > awk. The problem is that I do not know where to start. Does anyone have a
> > simple example of how to turn a fixed-width file into a CSV or
> > tab-delimited file using either of these tools? I guess I am looking for
> > something like a "Perl for dummies" or "awk for dummies" that does this.
> > Any pointers to websites will be greatly appreciated.
> >
>
>
>
> Try to do it in R first. I have found that I rarely need to go to
> an outside language to massage my data.
>
> # fixed-width fields of widths 10 and 5
> Lines <- readLines("mydata.dat")
> data.frame( field1 = as.numeric(substring(Lines, 1, 10)),
>             field2 = as.numeric(substring(Lines, 11, 15)) )
>
> If you do find that you have speed or memory problems that
> require that you go outside of R to preprocess your data
> then the gawk version of awk has a FIELDWIDTHS variable that
> makes handling fixed fields very easy. The first line of the
> gawk program below sets two fields of widths 10 and 5,
> respectively. The second line runs once per input line: the
> dummy assignment $1 = $1 forces gawk to rebuild the line from
> its fields, and print then writes the rebuilt line, by
> default with a single space between successive fields:
>
> BEGIN { FIELDWIDTHS = "10 5" }
> { $1 = $1; print }
>
> In R, do the following assuming the above two lines are in
> split.awk:
>
> read.table(pipe("gawk -f split.awk mydata.dat"))
>
> or else run gawk outside of R then read in the output file
> created:
>
> gawk -f split.awk mydata.dat > mydata2.dat
>
> For more information, google for
>
> FIELDWIDTHS gawk
>
> for that portion of the manual on FIELDWIDTHS -- it includes
> an example and, of course, the whole manual is there too. The
> book by Kernighan et al. (The AWK Programming Language) is also good.
>
> I have used both awk and perl and I think it's unlikely you
> would need perl given that you have R at your disposal for
> the hard parts and awk is easier to learn, better designed
> and more focused on this sort of task.
>
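For readers whose awk is not gawk (FIELDWIDTHS is a gawk extension), the same split can be sketched with substr(), which any POSIX awk provides. This is only an illustrative sketch and not part of the reply above: the sample rows and the output name mydata.csv are made up here, and the widths 10 and 5 are carried over from the example in the quoted message. It also writes commas, since the original question asked for CSV.

```shell
# Hypothetical sample data: two fixed-width fields of widths 10 and 5.
# printf reuses the format for each pair of arguments, giving two rows.
printf '%-10s%5s\n' 12.5 3 100 42 > mydata.dat

# Cut each field out by character position, trim the space padding,
# and join the two fields with a comma.
awk '{
    f1 = substr($0, 1, 10)    # characters 1-10
    f2 = substr($0, 11, 5)    # characters 11-15
    gsub(/^ +| +$/, "", f1)   # strip leading/trailing spaces
    gsub(/^ +| +$/, "", f2)
    print f1 "," f2
}' mydata.dat > mydata.csv

cat mydata.csv
```

With the sample rows above this prints 12.5,3 and 100,42. The resulting file should then be readable in R with read.csv("mydata.csv", header = FALSE).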



