Re: [R] A really simple data manipulation example

From: Christophe Pallier
Date: Wed, 27 Jun 2007 10:33:30 +0200

For merging, selecting and aggregating, R is not too bad: I believe the following code is more or less equivalent to your Vilno example, isn't it?

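# to input real data, replace the following two lines with calls to 'read.table'
# (the first argument of each data.frame() call was lost in the archive;
#  'patient.id=rep(1:10' is a reconstruction, not the original text)
labresults <- data.frame(patient.id=rep(1:10,5), visit.num=gl(5,10), sodium=rnorm(50))
demo <- data.frame(patient.id=rep(1:10,1), gender=gl(2,5,labels=c('F','M')))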

data <- merge(labresults, demo)
data <- subset(data, visit.num==2)  # keep visit 2 only, as the Vilno example does

with(data, aggregate(sodium, list(gender=gender), mean))
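
The code above skips one step of the Vilno example: recoding the sentinel value -9 as missing. A minimal sketch of the whole pipeline with that step included might look like the following. The inline CSV text, the column name patient.id, and the sodium values are all made up for illustration; with real data the two read.csv() calls would point at files instead.

```r
# Hypothetical inline data standing in for two CSV files.
lab_csv <- "patient.id,visit.num,sodium
1,1,140
1,2,138
2,1,141
2,2,-9
3,1,139
3,2,142
4,2,144"

demo_csv <- "patient.id,gender
1,F
2,F
3,M
4,M"

labresults <- read.csv(text = lab_csv)
demo <- read.csv(text = demo_csv)

# Merge lab rows with demographics on the shared patient.id column.
data <- merge(labresults, demo, by = "patient.id")

# Recode the sentinel -9 as NA so it does not distort the mean.
data$sodium[data$sodium == -9] <- NA

# Keep only the rows for visit 2.
data <- subset(data, visit.num == 2)

# Mean sodium per gender; the formula interface drops NA rows by default.
result <- aggregate(sodium ~ gender, data = data, FUN = mean)
print(result)
```

With this toy input, the F group averages over a single non-missing value (138) and the M group over two (142 and 144), so the result has one row per gender.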

If the data sets are very large, then doing merges/selection/aggregation outside of R can be a good idea.


On 6/27/07, Robert Wilkins wrote:
> In response to those who asked for a better explanation of what the
> Vilno software does, here's a simple example that gives some idea of
> what it does.
> LABRESULTS is a dataset with multiple rows per patient, with lab
> sodium measurements. It has columns: PATIENT_ID, VISIT_NUM, and SODIUM.
> DEMO is a dataset with one row per patient, with demographic data.
> It has columns: PATIENT_ID, GENDER.
> Here's a simple example; the following paragraph of code is a
> data processing function (dpf):
> mergeby PATIENT_ID ;
> if (SODIUM == -9) SODIUM = NULL ;
> if (VISIT_NUM != 2) deleterow ;
> turnoff; // just means end-of-paragraph , version 1.0 won't need this.
> Can you guess what it does? The lab result rows are merged with the
> demographic rows, just to get the gender information merged in.
> Obviously, they are merged by patient. The code -9 is used to denote
> "missing", so convert that to NULL. I'm about to take a statistic for
> visit 2, so rows with visit 0 or 1 must be deleted. I'm assuming, for
> visit 2, each patient has at most one row. Now, for each sex group,
> take the average sodium level. After the select statement, I have just
> two rows, for male and female, with the average sodium level in the
> AVERAGE_SODIUM column. Now the sendoff statement just stores the
> current data table into a datafile, called RESULTS_DATASET.
> So you have a sequence of data tables, each calculation reading in the
> current table , and leaving a new data table for the next calculation.
> So you have input datasets, a bunch of intermediate calculations, and
> one or more output datasets. Pretty simple idea.
> *****************************************
> Some caveats:
> LABRESULTS and DEMO are binary datasets. The asciitobinary and
> binarytoascii statements are used to convert between binary datasets
> and comma-separated ascii data files. (You can use any delimiter:
> comma, vertical bar , etc). An asciitobinary statement is typically
> just two lines of code.
> The dpf begins with the inlist statement and, for the moment,
> needs "turnoff;" as the last line. Version 1.0 won't require the use
> of "turnoff;", but version 0.85 does. It only means this paragraph of
> code ends here (a program can, of course, contain many paragraphs:
> data processing functions, print statements, asciitobinary statements,
> etc.).
> If you've worked with lab data, you know lab data does not look so
> simplistic. I need a simple example.
> Vilno has a lot of functionality, many-to-many joins, adding columns,
> firstrow() and lastrow() flags, and so forth. A fair amount of complex
> data manipulations have already been tested with test programs (in
> the tarball). Of course a simple example cannot show you that; it's
> just a small taste.
> *********************************************
> If you've never used SPSS or SAS before, you won't care, but this
> programming language falls in the same family as the SPSS and SAS
> programming languages. All three programming languages have a fair
> amount in common, but are quite different from the S programming
> language. The vilno data processing function can replace the SAS
> datastep. (It can also replace PROC TRANSPOSE and much of PROC MEANS,
> except standard deviation calculations still need to be included in
> the select statement).
> ********************************************
> I hope that helps.
> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

Christophe Pallier

Received on Wed 27 Jun 2007 - 08:40:22 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
