Re: [R] Tools For Preparing Data For Analysis

From: Christophe Pallier <christophe_at_pallier.org>
Date: Fri, 22 Jun 2007 17:27:42 +0200

If I understand correctly (from your Perl script):

  1. you count the number of occurrences of each "(echo, muga)" pair in the first file;
  2. you remove from the second file the lines that correspond to these occurrences.

If this is indeed your aim, here's a solution in R:

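# Running count: y[i] is the number of times x[i] has occurred in x[1:i]
cumcount <- function(x) {
 y <- numeric(length(x))
 for (i in seq_along(x)) {   # seq_along() also behaves when x is empty
     y[i] <- sum(x[1:i] == x[i])
 }
 y
}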

both <- read.csv('both_echo.csv')
v <- table(paste(both$echo, "_", both$muga, sep=""))

semi <- read.csv('qual_echo.csv')
s <- paste(semi$echo, "_", semi$muga, sep="")
cs = cumcount(s)
count = v[s]
count[is.na(count)]=0

semi2 <- data.frame(semi, s, cs, count, keep = cs > count)

> semi2

  echo muga quant     s cs count  keep
1   10   20     0 10_20  1     0  TRUE
2   10   20     0 10_20  2     0  TRUE
3   10   21     0 10_21  1     1 FALSE
4   10   21     0 10_21  2     1  TRUE
5   10   24     0 10_24  1     0  TRUE
6   10   25     0 10_25  1     2 FALSE
7   10   25     0 10_25  2     2 FALSE
8   10   25     0 10_25  3     2  TRUE
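
The last step (actually dropping the rows flagged FALSE) is not shown above. Here is a minimal sketch of it that runs without the CSV files, assuming the sample rows from the quoted message typed in by hand. The names key_both, key_semi and semi_only are only illustrative, and I use match() plus as.integer() for the lookup so the result is a plain vector rather than a table:

```r
# Sample rows copied from the quoted message; only the columns used here.
both <- data.frame(
  echo = rep(10, 11),                                   # the "Semiquant" column
  muga = c(37, 13, 26, 31, 15, 21, 22, 22, 4, 25, 25)   # the "MUGA" column
)
semi <- data.frame(
  echo  = rep(10, 8),
  muga  = c(20, 20, 21, 21, 24, 25, 25, 25),
  quant = 0
)

# Same helper as above: how many times x[i] has appeared in x[1:i].
cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) y[i] <- sum(x[1:i] == x[i])
  y
}

key_both <- paste(both$echo, both$muga, sep = "_")
key_semi <- paste(semi$echo, semi$muga, sep = "_")

# Occurrences of each key in 'both'; keys absent from 'both' count as 0.
v <- table(key_both)
count <- as.integer(v[match(key_semi, names(v))])
count[is.na(count)] <- 0L

# Keep a row once its running count exceeds the number of matches in 'both'.
keep <- cumcount(key_semi) > count
semi_only <- semi[keep, ]
print(semi_only)
```

As in the Perl script, it is the first occurrences of a duplicated pair that get dropped; since the rows are otherwise indistinguishable, that choice should not matter.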


My code is not very readable...
Yet, the 'trick' of using a helper function like 'cumcount' might be instructive.
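
For what it is worth, the same running count can probably be had without an explicit loop. The sketch below assumes base R's ave(), which applies a function within groups of equal values; cumcount_ave is a hypothetical name, not something from the code above, and the test vector is just the keys from the example output:

```r
# Loop version, as in the message above.
cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) y[i] <- sum(x[1:i] == x[i])
  y
}

# Variant without an explicit loop: number the elements 1, 2, 3, ...
# within each group of equal values of x.
cumcount_ave <- function(x) {
  ave(seq_along(x), x, FUN = seq_along)
}

s <- c("10_20", "10_20", "10_21", "10_21", "10_24", "10_25", "10_25", "10_25")
print(cumcount(s))      # 1 2 1 2 1 1 2 3
print(cumcount_ave(s))  # 1 2 1 2 1 1 2 3
```

The loop rescans x[1:i] for every i, so on long vectors the ave() variant should be noticeably cheaper, though I have not timed it.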

Christophe Pallier

On 6/22/07, Kevin E. Thorpe <kevin.thorpe_at_utoronto.ca> wrote:
>
> I am posting to this thread that has been quiet for some time because I
> remembered the following question.
>
> Christophe Pallier wrote:
> > Hi,
> >
> > Can you provide examples of data formats that are problematic to read and
> > clean with R?
>
> Today I had a data manipulation problem that I don't know how to do in R
> so I solved it with perl. Since I'm always interested in learning more
> about complex data manipulation in R I am posting my problem in the
> hopes of receiving some hints for doing this in R.
>
> If anyone has nothing better to do than play with other people's data,
> I would be happy to send the raw files off-list.
>
> Background:
>
> I have been given data that contains two measurements of left
> ventricular ejection fraction. One of the methods is echocardiogram
> which sometimes gives a true quantitative value and other times a
> semi-quantitative value. The desire is to compare echo with the
> other method (MUGA). In most cases, patients had either quantitative
> or semi-quantitative. Some patients had both. The data came
> to me in excel files with, basically, no patient identifiers to link
> the "both" with the semi-quantitative patients (the "both" patients
> were in multiple data sets).
>
> What I wanted to do was extract from the semi-quantitative data file
> those patients with only semi-quantitative. All I have to link with
> are the semi-quantitative echo and the MUGA and these pairs of values
> are not unique.
>
> To make this more concrete, here are some portions of the raw data.
>
> "Both"
>
> "ID NUM","ECHO","MUGA","Semiquant","Quant"
> "B",12,37,10,12
> "D",13,13,10,13
> "E",13,26,10,15
> "F",13,31,10,13
> "H",15,15,10,15
> "I",15,21,10,15
> "J",15,22,10,15
> "K",17,22,10,17
> "N",17.5,4,10,17.5
> "P",18,25,10,18
> "R",19,25,10,19

>
> Semi-quantitative
>
> "echo","muga","quant"
> 10,20,0 <-- keep
> 10,20,0 <-- keep
> 10,21,0 <-- remove
> 10,21,0 <-- keep
> 10,24,0 <-- keep
> 10,25,0 <-- remove
> 10,25,0 <-- remove
> 10,25,0 <-- keep
>
> Here is the perl program I wrote for this.
>
> #!/usr/bin/perl
>
> open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
> # Discard first row;
> $_ = <BOTH>;
> while(<BOTH>) {
>     chomp;
>     ($id, $e, $m, $sq, $qu) = split(/,/);
>     $both{$sq,$m}++;
> }
> close(BOTH);
>
> open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
> print OUT "pid,echo,muga,quant\n";
> $pid = 2001;
>
> open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
> # Discard first row
> $_ = <QUAL>;
> while(<QUAL>) {
>     chomp;
>     ($echo, $muga, $quant) = split(/,/);
>     if ($both{$echo,$muga} > 0) {
>         $both{$echo,$muga}--;
>     }
>     else {
>         print OUT "$pid,$echo,$muga,$quant\n";
>         $pid++;
>     }
> }
> close(QUAL);
> close(OUT);
>
> open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
> print OUT "pid,echo,muga,quant\n";
> $pid = 3001;
>
> open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
> # Discard first row;
> $_ = <BOTH>;
> while(<BOTH>) {
>     chomp;
>     ($id, $e, $m, $sq, $qu) = split(/,/);
>     print OUT "$pid,$sq,$m,0\n";
>     print OUT "$pid,$qu,$m,1\n";
>     $pid++;
> }
> close(BOTH);
> close(OUT);
>
>
> --
> Kevin E. Thorpe
> Biostatistician/Trialist, Knowledge Translation Program
> Assistant Professor, Department of Public Health Sciences
> Faculty of Medicine, University of Toronto
> email: kevin.thorpe_at_utoronto.ca Tel: 416.864.5776 Fax: 416.864.6057
>
> ______________________________________________
> R-help_at_stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Christophe Pallier (http://www.pallier.org)

Received on Fri 22 Jun 2007 - 15:41:55 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 22 Jun 2007 - 16:32:10 GMT.
