Re: [R] multiple separators in sep argument for read.table?

From: Ted Harding <Ted.Harding_at_manchester.ac.uk>
Date: Sat, 19 Apr 2008 08:52:59 +0100 (BST)


On 19-Apr-08 06:19:09, Johan Jackson wrote:
> Hello,
> Is there any way to add multiple separators in the sep= argument
> in read.table? I would like to be able to create different columns
> if I see a white space OR a "/".
>
> Thanks in advance,
> JJ

As well as Brian Ripley's suggestion for how to do it within R, if you have access to the 'awk' program (as on all Unix/Linux systems and, in principle, installable in Windows) then you can pre-process the file outside of R along the following lines. First, here is a test file "temp.txt":

R1C1 R1C2;R1C3
R2C1,R2C2 R2C3
R3C1,R3C2;R3C3

where each line has 3 fields, separated by any of " " or "," or ";" and it is desired to obtain a purely comma-separated version of it.

awk '
  BEGIN{FS="[ ]|[;]|[,]";OFS=","};{$1=$1};{print $0} ' < temp.txt > temp2.txt

produces a file temp2.txt with contents

R1C1,R1C2,R1C3
R2C1,R2C2,R2C3
R3C1,R3C2,R3C3

The logic is that the initialisation
  BEGIN{FS="[ ]|[;]|[,]";OFS=","}
sets up the Field Separator variable FS as a regular expression which matches any one of " " ";" "," and the Output Field Separator OFS to be ",".

$0 denotes the entire input line. Assigning to a field, as in "$1=$1" (which sets the first field equal to itself), forces 'awk' to rebuild $0 from its fields, and it is at that point that the fields are joined using OFS, so that "," now separates them in $0.
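
To see why the "$1=$1" step matters, a small check along the following lines compares the output with and without it. The input line and the variable names are made up purely for illustration:

```shell
# Hypothetical one-line input: three fields separated by " " and ";".
line='R1C1 R1C2;R1C3'

# Without the assignment, $0 is printed unchanged: OFS is never applied.
without=$(printf '%s\n' "$line" | awk 'BEGIN{FS="[ ]|[;]|[,]";OFS=","};{print $0}')

# With $1=$1, awk rebuilds $0 from its fields, joining them with OFS.
with=$(printf '%s\n' "$line" | awk 'BEGIN{FS="[ ]|[;]|[,]";OFS=","};{$1=$1};{print $0}')

echo "without: $without"
echo "with:    $with"
```

The first result should still contain the original " " and ";", while the second should be purely comma-separated.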

Hence an 'awk' program to handle the case you describe could be

awk '
  BEGIN{FS="[ ]|[/]";OFS=" "};{$1=$1};{print $0} ' < myrawfile > myfinalfile
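
As a quick sanity check of that command, here is a sketch using a single invented input line (I do not know what your real data look like, so the date-plus-two-numbers layout is only an assumption):

```shell
# Hypothetical raw line: a "/"-separated date followed by two numbers.
printf '12/03/2008 4.5 7.1\n' > myrawfile

# Split on a space or a "/", and write the fields back space-separated.
awk 'BEGIN{FS="[ ]|[/]";OFS=" "};{$1=$1};{print $0}' < myrawfile > myfinalfile

cat myfinalfile
```

Since the output is separated by single spaces, it should then be readable by read.table() with its default sep, which splits on white space.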

It gets slightly more interesting if your "white space" separating two fields might be any number of consecutive spaces or a TAB, say.

In that case something like

awk '
  BEGIN{FS="[ ][ ]*|[;]|[,]|[\t]";OFS=","};{$1=$1};{print $0} ' < myrawfile > myfinalfile

might be needed. Here "[ ][ ]*" means "one space followed by zero or more spaces", and "\t" is the notation for TAB.

If I change the test file above to

R1C1   R1C2;R1C3
R2C1,R2C2	R2C3
R3C1,R3C2;R3C3

where the long blank in the first line is 3 consecutive " ", and the long blank in the second line is a single TAB, then the last 'awk' program above (the one with "[ ][ ]*" and "[\t]" in FS) generates exactly the same output as before.
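
Because runs of spaces and TABs are easily lost when a file is pasted into an email, the modified test file can be rebuilt with printf and the check repeated, roughly as follows (same file names as above; the use of printf to create the file is my addition):

```shell
# Recreate the modified test file: three spaces in line 1, a TAB in line 2.
printf 'R1C1   R1C2;R1C3\n' >  temp.txt
printf 'R2C1,R2C2\tR2C3\n'  >> temp.txt
printf 'R3C1,R3C2;R3C3\n'   >> temp.txt

# Treat a run of spaces, a ";", a "," or a TAB as a single field separator.
awk 'BEGIN{FS="[ ][ ]*|[;]|[,]|[\t]";OFS=","};{$1=$1};{print $0}' < temp.txt > temp2.txt

cat temp2.txt
```

This should print the same three purely comma-separated lines as the first example.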

Just a thought! I'm always tempted to suggest that people use 'awk' in conjunction with R, not only to deal with the kind of relatively simple substitutions you describe, but also for exploring and cleaning up the sort of mess that people can send you after exporting a CSV file from an Excel spreadsheet, etc. (It would go on for too long to give examples of this sort of thing.)
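
One very small instance of the "exploring" side, as a sketch only: counting how many comma-separated fields each line has quickly shows whether the rows of an exported file are consistent. The file name export.csv and its contents are invented for the illustration, and the naive split on "," ignores commas inside quoted fields.

```shell
# Hypothetical messy export: the third line has a stray extra comma.
printf 'id,name,score\n1,Ann,3.5\n2,Bob,,4.0\n3,Cy,2.8\n' > export.csv

# Tally how many lines have each field count (naive split on ",").
tally=$(awk 'BEGIN{FS=","};{n[NF]++};END{for(k in n) print k " fields: " n[k] " lines"}' export.csv | sort -n)

echo "$tally"
```

Any field count that occurs on only a few lines points to the rows worth inspecting by hand before the file goes anywhere near read.table().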

With best wishes,
Ted.



E-Mail: (Ted Harding) <Ted.Harding_at_manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 19-Apr-08                                       Time: 08:52:56
------------------------------ XFMail ------------------------------

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Sat 19 Apr 2008 - 07:57:57 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 19 Apr 2008 - 08:30:29 GMT.