Re: [R] queer data set

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Tue 16 Aug 2005 - 09:45:53 EST


On 15-Aug-05 S.O. Nyangoma wrote:
> I have a dataset that is basically structureless. Its dimension
> varies from row to row, and the separators are a mixture of tabs and
> semicolons (;). An example is
>
HEADER1 HEADER2 HEADER3 HEADER3

A1       B1      C1       X11;X12;X13
A2       B2      C2       X21;X22;X23;X24;X25
A3       B3      C3       
A4       B4      C4       X41;X42;X43
A5       B5      C5       X51

>
> etc., say. Note that a blank under HEADER3 corresponds to
> non-occurrence, and all semicolon (;) delimited variables are under
> HEADER3. These values run into tens of thousands. I want to give some
> order to this queer matrix, turning it into something like:
>
> HEADER1 HEADER2 HEADER3 HEADER3
> A1 B1 C1 X11
> A1 B1 C1 X12
> A1 B1 C1 X13
> A1 B1 C1 X14
> A2 B2 C2 X21
> A2 B2 C2 X22
> A2 B2 C2 X23
> A2 B2 C2 X24
> A2 B2 C2 X25
> A2 B2 C2 X26
> A3 B3 C3 NA
> A4 B4 C4 X41
> A4 B4 C4 X42
> A4 B4 C4 X43
>
> Is there a brilliant R-way of doing such task?
>
> Goodday. Stephen.

I don't know about a brilliant R trick (though I'm sure others do).

But (on my usual hobby-horse) if you have 'awk' available (and don't mind using it) then it will do the job:

First create an 'awk' program file as follows:

  {for(i in A) delete A[i]}
  {if($4==""){A[1]="NA"}
    else {split($4,A,";")}}
  {B = $1 "\t" $2 "\t" $3}
  {for(i in A) print B "\t" A[i]}

and call this, say, split.awk

Then run

  awk -f split.awk

and feed it the lines of your primary dataset as above. Here's a cut&paste from my Linux session, where the first block of lines after "awk -f split.awk" are the lines being input to the program, starting with the header, followed by the output of the program starting with the header again:

awk -f split.awk
HEADER1 HEADER2 HEADER3 HEADER3

A1       B1      C1       X11;X12;X13
A2       B2      C2       X21;X22;X23;X24;X25
A3       B3      C3       
A4       B4      C4       X41;X42;X43
A5       B5      C5       X51
HEADER1 HEADER2 HEADER3 HEADER3
A1      B1      C1      X11
A1      B1      C1      X12
A1      B1      C1      X13
A2      B2      C2      X24
A2      B2      C2      X25
A2      B2      C2      X21
A2      B2      C2      X22
A2      B2      C2      X23
A3      B3      C3      NA
A4      B4      C4      X41
A4      B4      C4      X42
A4      B4      C4      X43
A5      B5      C5      X51

In unixoid systems, with a large file of such lines, the command would be

  cat yourdatafile | awk -f split.awk

and then you would only see the output, not the input as shown above, and you can of course redirect it into a new file with

  cat yourdatafile | awk -f split.awk > newdatafile

Note, however, that the output lines for the input line containing X21;X22;X23;X24;X25 do not come out in the same order as the values appear in that line (X24 and X25 are printed first), though they are all there.

This is a "feature" of the way 'awk' handles arrays: they are "associative arrays" indexed by key rather than by position, and the order in which "for(i in A)" visits the keys is not defined by the language.

This may not matter for your application; but if it does matter then I'm not sure how to force the correct order.
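
One possibility, offered as a sketch rather than a definitive fix: split() returns the number of pieces it produced and stores them in A[1], A[2], and so on, so a counting loop from 1 up to that number should visit them in their original order. The shell function name expand_rows and the rule that skips blank lines are assumptions added for illustration; neither appears in the program above.

```shell
# expand_rows is a hypothetical helper name; it wraps an order-preserving
# variant of split.awk and reads the named files, or stdin if none are given.
expand_rows() {
  awk '
    NF == 0 { next }                      # skip blank lines (added assumption)
    {
      n = split($4, A, ";")               # A[1..n] hold the pieces in input order
      if (n == 0) { A[1] = "NA"; n = 1 }  # an empty 4th field becomes NA
      B = $1 "\t" $2 "\t" $3
      for (i = 1; i <= n; i++)            # counting loop, not "for (i in A)"
        print B "\t" A[i]
    }
  ' "$@"
}

# Small demonstration on two of the sample lines
printf 'A2 B2 C2 X21;X22;X23;X24;X25\nA3 B3 C3\n' | expand_rows
```

With a large file this could be run as "expand_rows yourdatafile > newdatafile", which also avoids the need for 'cat'.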

Hoping this helps,
Ted.



E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 16-Aug-05                                       Time: 00:45:49
------------------------------ XFMail ------------------------------
