[R] A really simple data manipulation example

From: Robert Wilkins <irishhacker_at_gmail.com>
Date: Tue, 26 Jun 2007 18:59:44 -0500

In response to those who asked for a better explanation of what the Vilno software does, here's a simple example that gives some idea of what it does.

LABRESULTS is a dataset with multiple rows per patient, holding lab sodium measurements. It has columns: PATIENT_ID, VISIT_NUM, and SODIUM. DEMO is a dataset with one row per patient, holding demographic data. It has columns: PATIENT_ID, GENDER.

The following paragraph of code is a data processing function (dpf):

mergeby PATIENT_ID ;
if (SODIUM == -9) SODIUM = NULL ;
if (VISIT_NUM != 2) deleterow ;
select AVERAGE_SODIUM = avg(SODIUM) by GENDER ;
sendoff(RESULTS_DATASET) GENDER AVERAGE_SODIUM ;
turnoff ; // just means end-of-paragraph, version 1.0 won't need this.

Can you guess what it does? The lab result rows are merged with the demographic rows by patient, just to bring in the gender information. The code -9 denotes "missing", so it is converted to NULL. I'm about to compute a statistic for visit 2, so rows for any other visit (here, visits 0 and 1) are deleted. I'm assuming that, for visit 2, each patient has at most one row. Next, for each gender group, take the average sodium level. After the select statement, I have just two rows, one for male and one for female, with the average sodium level in the AVERAGE_SODIUM column. Finally, the sendoff statement stores the current data table in a datafile called RESULTS_DATASET.
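For readers who don't know Vilno, here is a rough Python sketch of the same steps. It is only an analogy, not how Vilno works internally. The sample rows are made up for illustration, the function name average_sodium_by_gender is hypothetical, and the sketch assumes that avg() skips NULL values (as SQL's AVG does) and that every patient in LABRESULTS also appears in DEMO.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the LABRESULTS and DEMO datasets.
LABRESULTS = [
    {"PATIENT_ID": 1, "VISIT_NUM": 1, "SODIUM": 139},
    {"PATIENT_ID": 1, "VISIT_NUM": 2, "SODIUM": 140},
    {"PATIENT_ID": 2, "VISIT_NUM": 2, "SODIUM": -9},   # -9 means "missing"
    {"PATIENT_ID": 3, "VISIT_NUM": 2, "SODIUM": 144},
    {"PATIENT_ID": 4, "VISIT_NUM": 2, "SODIUM": 138},
]
DEMO = [
    {"PATIENT_ID": 1, "GENDER": "F"},
    {"PATIENT_ID": 2, "GENDER": "F"},
    {"PATIENT_ID": 3, "GENDER": "M"},
    {"PATIENT_ID": 4, "GENDER": "M"},
]


def average_sodium_by_gender(labresults, demo):
    """Mimic the dpf: merge, recode -9, keep visit 2, average by gender."""
    # mergeby PATIENT_ID: look up each lab row's gender by patient.
    gender_of = {row["PATIENT_ID"]: row["GENDER"] for row in demo}

    values = defaultdict(list)
    for row in labresults:
        gender = gender_of[row["PATIENT_ID"]]
        # if (SODIUM == -9) SODIUM = NULL
        sodium = None if row["SODIUM"] == -9 else row["SODIUM"]
        # if (VISIT_NUM != 2) deleterow
        if row["VISIT_NUM"] != 2:
            continue
        group = values[gender]  # make sure the group exists even if all NULL
        if sodium is not None:  # assumption: avg() ignores NULL values
            group.append(sodium)

    # select AVERAGE_SODIUM = avg(SODIUM) by GENDER
    return [
        {
            "GENDER": gender,
            "AVERAGE_SODIUM": sum(vals) / len(vals) if vals else None,
        }
        for gender, vals in sorted(values.items())
    ]


# sendoff(RESULTS_DATASET): here the result is simply kept in a variable.
RESULTS_DATASET = average_sodium_by_gender(LABRESULTS, DEMO)
print(RESULTS_DATASET)
```

With this sample data the result has two rows, one per gender, just as the select statement leaves two rows in the Vilno version.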

So you have a sequence of data tables: each calculation reads in the current table and leaves a new data table for the next calculation.

So you have input datasets, a bunch of intermediate calculations, and one or more output datasets. Pretty simple idea.

Some caveats:

LABRESULTS and DEMO are binary datasets. The asciitobinary and binarytoascii statements are used to convert between binary datasets and comma-separated ascii data files. (You can use any delimiter: comma, vertical bar, etc.) An asciitobinary statement is typically just two lines of code.
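As a loose analogy for that round trip, here is a small Python sketch. It is not Vilno's file format or syntax: pickle merely stands in for "some binary format", the function names ascii_to_binary and binary_to_ascii are hypothetical, and the sketch assumes the ascii file has a header row, which may not match what Vilno expects.

```python
import csv
import io
import pickle


def ascii_to_binary(text, delimiter=","):
    """Parse delimited text (with a header row) into a binary blob."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return pickle.dumps(list(reader))


def binary_to_ascii(blob, delimiter=","):
    """Turn the binary blob back into delimited text."""
    rows = pickle.loads(blob)
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# A vertical-bar-delimited file, converted to binary and back.
demo_text = "PATIENT_ID|GENDER\n1|F\n2|M\n"
blob = ascii_to_binary(demo_text, delimiter="|")
print(binary_to_ascii(blob, delimiter="|"))
```

The delimiter is just a parameter on each side of the conversion, so the same binary data can be written back out with a different delimiter than it was read with.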

The dpf begins with the inlist statement and, for the moment, needs "turnoff ;" as the last line. Version 1.0 won't require "turnoff ;", but version 0.85 does. It only means that this paragraph of code ends here (a program can, of course, contain many paragraphs: data processing functions, print statements, asciitobinary statements, etc.).

If you've worked with lab data, you know that real lab data is never this simple, but I needed a simple example.

Vilno has a lot of functionality: many-to-many joins, adding columns, firstrow() and lastrow() flags, and so forth. A fair number of complex data manipulations have already been tested with test programs (in the tarball). Of course, a simple example cannot show you all that; it's just a small taste.
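To give a feel for what first-row and last-row flags are useful for, here is a Python sketch. It rests on an assumption this message does not spell out: that firstrow() and lastrow() mark the first and last row within each by-group, much as FIRST. and LAST. variables do in a SAS data step. The function name flag_first_last and the FIRST/LAST column names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def flag_first_last(rows, key):
    """Add FIRST/LAST flags per group; rows must already be sorted by key."""
    flagged = []
    for _, group in groupby(rows, key=itemgetter(key)):
        group = list(group)
        for i, row in enumerate(group):
            flagged.append(
                {**row, "FIRST": i == 0, "LAST": i == len(group) - 1}
            )
    return flagged


# Hypothetical lab rows, sorted by patient and then by visit.
lab_rows = [
    {"PATIENT_ID": 1, "VISIT_NUM": 0},
    {"PATIENT_ID": 1, "VISIT_NUM": 1},
    {"PATIENT_ID": 1, "VISIT_NUM": 2},
    {"PATIENT_ID": 2, "VISIT_NUM": 0},
]
flagged_rows = flag_first_last(lab_rows, "PATIENT_ID")

# Keep only each patient's last visit.
last_visits = [row for row in flagged_rows if row["LAST"]]
print(last_visits)
```

Flags like these make it easy to pick out, say, each patient's baseline or final visit without a separate aggregation step.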

If you've never used SPSS or SAS before, you won't care, but this programming language falls in the same family as the SPSS and SAS programming languages. All three programming languages have a fair amount in common, but are quite different from the S programming language. The vilno data processing function can replace the SAS datastep. (It can also replace PROC TRANSPOSE and much of PROC MEANS, except standard deviation calculations still need to be included in the select statement).

I hope that helps.


R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 27 Jun 2007 - 00:09:01 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 27 Jun 2007 - 11:32:36 GMT.
