Re: [R] Tools For Preparing Data For Analysis

From: Gabor Grothendieck <>
Date: Sat, 09 Jun 2007 22:16:46 -0400

That can be elegantly handled in R through R's object oriented programming by defining a class for the fancy input. See this post: for a simple example of that style.

On 6/9/07, Robert Wilkins <> wrote:
> Here are some examples of the type of data crunching you might have to do.
> In response to the requests by Christophe Pallier and Martin Stevens.
> Before I started developing Vilno, some six years ago, I had been working in
> the pharmaceutical industry for eight years (it's not easy to show you actual
> data though, because it's all confidential of course).
> Lab data can be especially messy, especially if one clinical trial allows
> the physicians to use different labs. So let's consider lab data.
> Merge normal ranges into the lab data. This has to be done by lab-site
> and lab test code (PLT for platelets, etc.), obviously. I've seen cases where
> you also need to match by sex and age. The sex column in the normal ranges
> could be: blank, F, M, or B (B meaning both sexes). The age column in
> the normal ranges could be: blank, or something like "40 <55". Even worse,
> you could have an ageunits column in the normal ranges dataset: usually "Y",
> but if there are children in the clinical trial, you will have "D" or "M",
> for Days and Months. If the clinical trial is for adults, all rows with "D"
> or "M" should be tossed out at the start. Clearly the statistical programmer
> has to spend time looking at the data, before writing the program. Remember,
> all of these details can change any time you move to a new clinical trial.
> So for the lab data, you have to merge in the patient's date of birth,
> calculate age, and somehow relate that to the age-group column in the normal
> ranges dataset.
> (By the way, in clinical trial data preparation, the SAS datastep is much
> more useful and convenient, in my opinion, than the SQL SELECT syntax, at
> least 97% of the time. But in the middle of this program, when you merge the
> normal ranges into the lab data, you get a better solution with PROC SQL
> (just the SQL SELECT statement implemented inside SAS). This is because of the
> trickiness of the age match-up, and because the SAS datastep does not do well
> with many-to-many joins.)
> Merge in various study drug administration dates into the lab data. Now, for
> each lab record, calculate treatment period ( or cycle number ), depending
> on the statistician's specifications and the way the clinical trial is
> structured.
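
One possible way to derive a cycle number from dosing dates, sketched in Python. It assumes each study drug administration date starts a new cycle and that a lab drawn on a dosing date belongs to the cycle starting that day; the statistician's actual specification may differ. The function name is hypothetical.

```python
from bisect import bisect_right
from datetime import date


def cycle_number(lab_date: date, dose_dates: list) -> int:
    """Return 0 for labs before the first dose, else the 1-based cycle number.

    dose_dates must be sorted in ascending order.
    """
    # bisect_right counts how many dosing dates fall on or before lab_date.
    return bisect_right(dose_dates, lab_date)
```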
> Different clinical sites chose to use different lab providers. So, for
> example, for Monocytes, you have 10 different units ( essentially 6 units,
> but spelling inconsistencies as well). The statistician has requested that
> you use standardized units in some of the listings ( % units, and only one
> type of non-% unit, for example ). At the same time, lab values need to be
> converted ( *1.61 , divide by 1000, etc. ). This can be very time consuming
> no matter what software you use, and, in my experience, when the SAS
> programmer asks for more clinical information or lab guidebooks, the
> response is incomplete, so he does a lot of guesswork. SAS programmers do
> not have expertise in lab science, hence the guesswork.
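
The unit clean-up above is often done with a lookup table that maps every observed spelling to a canonical unit and a conversion factor. The sketch below is in Python; the table entries are illustrative only and are not taken from any lab guidebook, so each real spelling and factor would have to be confirmed with the lab or the statistician.

```python
# Observed spelling (lower-cased, spaces removed) -> (canonical unit, multiplier).
# Illustrative entries only; a real table must be verified for each trial.
UNIT_TABLE = {
    "10^9/l": ("10^9/L", 1.0),
    "x10e9/l": ("10^9/L", 1.0),
    "k/ul": ("10^9/L", 1.0),     # thousands per microlitre
    "/ul": ("10^9/L", 0.001),    # cells per microlitre: divide by 1000
    "%": ("%", 1.0),
}


def standardize(value: float, unit: str) -> tuple:
    """Convert a value to its canonical unit, failing loudly on unknown units."""
    key = unit.strip().lower().replace(" ", "")
    if key not in UNIT_TABLE:
        raise ValueError(f"unmapped unit: {unit!r}")
    canonical, factor = UNIT_TABLE[key]
    return value * factor, canonical
```

Raising an error on an unmapped unit, rather than guessing, makes the remaining guesswork visible so it can be taken back to the clinical team.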
> Your program has to accommodate numeric values, "1.54", quasi-numeric values
> "<1" , and non-numeric values "Trace".
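
The idea of defining a class for this kind of input can be illustrated with a small value type. This is a Python analogy, not the R code from the post referred to in the reply; the class name and fields are hypothetical, and the pattern deliberately accepts only plain non-negative decimals with an optional "<" or ">" in front.

```python
import re
from dataclasses import dataclass
from typing import Optional

_PATTERN = re.compile(r"([<>]?)\s*(\d+(?:\.\d+)?)")


@dataclass(frozen=True)
class LabValue:
    raw: str                  # text exactly as received (trimmed)
    number: Optional[float]   # numeric part, or None for text such as "Trace"
    qualifier: str            # "", "<" or ">"

    @classmethod
    def parse(cls, text: str) -> "LabValue":
        raw = text.strip()
        m = _PATTERN.fullmatch(raw)
        if m:
            return cls(raw, float(m.group(2)), m.group(1))
        return cls(raw, None, "")

    @property
    def is_exact(self) -> bool:
        """True only for plain numbers, not for '<1' or 'Trace'."""
        return self.number is not None and self.qualifier == ""
```

Keeping the raw text alongside the parsed number means the listing can still print "<1" or "Trace" unchanged while summaries use only the exact values.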
> Your data listing is tight for space, so print "PROLONGED CELL CONT" as
> "PRCC".
> Once normal ranges are merged in, figure out which values are out-of-range
> and high , which are low, and which are within normal range. In the data
> listing, you may have "H" or "L" appended to the result value being printed.
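
A small sketch of the flagging step in Python, assuming the merged normal range gives numeric lower and upper bounds and that values equal to a bound count as normal (an assumption; some labs treat the bounds differently).

```python
def range_flag(value: float, low: float, high: float) -> str:
    """Return 'L' below the range, 'H' above it, '' when within (inclusive)."""
    if value < low:
        return "L"
    if value > high:
        return "H"
    return ""
```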
> For each treatment period, you may need a unique lab record selected, in
> case there are two or three for the same treatment period. The statistician
> will tell the SAS programmer how. Maybe the averages of the results for that
> treatment period, maybe the lab record closest to the mid-point of the
> treatment period. This isn't for the data listing, but for a summary table.
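
Both selection rules mentioned above can be sketched in a few lines of Python. The record layout (a dict with "date" and "value" keys) is hypothetical, and ties on distance from the mid-point are resolved by taking the earlier record in the list, which is an arbitrary choice.

```python
from datetime import date
from statistics import mean


def closest_to_midpoint(records: list, start: date, end: date) -> dict:
    """Pick the record whose date is nearest the middle of the period."""
    midpoint = start + (end - start) / 2
    return min(records, key=lambda r: abs(r["date"] - midpoint))


def period_average(records: list) -> float:
    """Average of the result values within one treatment period."""
    return mean(r["value"] for r in records)
```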
> For the differentials ( monocytes, lymphocytes, etc) , merge in the WBC
> (total white blood cell count) values , to convert values between % units
> and absolute count units.
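
The conversion itself is simple once WBC is merged in: the absolute count is the WBC count times the percentage divided by 100. A Python sketch, assuming WBC and the absolute count are expressed in the same unit (for example 10^9/L); the function names are hypothetical.

```python
def pct_to_absolute(pct: float, wbc: float) -> float:
    """Absolute differential count from a percentage and the total WBC."""
    return wbc * pct / 100.0


def absolute_to_pct(absolute: float, wbc: float) -> float:
    """Percentage of total WBC represented by an absolute count."""
    if wbc == 0:
        raise ValueError("WBC is zero; percentage is undefined")
    return absolute / wbc * 100.0
```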
> When printing the values in the data listing, you need "H" or "L" to the
> right of the value. But you also need the values to be well lined up ( the
> decimal place ). This can be stupidly time consuming.
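
One way to line up the decimal points and keep the flag in a fixed column, sketched in Python. It assumes the values arrive as already formatted strings and that each flag is "H", "L" or empty; the function name is hypothetical.

```python
def align_decimal(values: list, flags: list) -> list:
    """Pad each value so decimal points share a column, then append the flag."""
    parts = [v.partition(".") for v in values]
    int_width = max(len(whole) for whole, _, _ in parts)
    frac_width = max(len(frac) for _, _, frac in parts)
    lines = []
    for (whole, dot, frac), flag in zip(parts, flags):
        cell = whole.rjust(int_width)
        if frac_width:
            # Values without a decimal point get a space where the point would be.
            cell += (dot or " ") + frac.ljust(frac_width)
        lines.append((cell + " " + flag).rstrip())
    return lines
```

For example, `align_decimal(["1.54", "12.3", "150"], ["H", "", "L"])` pads the three values to a common width so the points sit in one column and the flags in another.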
> I think you see why clinical trials statisticians and SAS programmers enjoy
> lots of job security.

This could be readily handled in R using object-oriented programming. You would specify a class for the strange input,

Received on Sun 10 Jun 2007 - 02:22:03 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 10 Jun 2007 - 09:31:28 GMT.
