Re: [R] How to sum values across multiple variables using a wildcard?

From: Marc Schwartz <MSchwartz_at_mn.rr.com>
Date: Tue 21 Feb 2006 - 12:18:47 EST

On Mon, 2006-02-20 at 18:41 -0600, mtb954@gmail.com wrote:
> I have a dataframe called "data" with 5 records (in rows) each of
> which has been scored on each of many variables (in columns).
>
> Five of the variables are named var1, var2, var3, var4, var5 using
> headers. The other variables are named using other conventions.
>
> I can create a new variable called var6 with the value 15 for each
> record with this code:
>
> > var6=var1+var2+var3+var4+var5
>
> but this is tedious for my real dataset with dozens of variables. I
> would rather use a wildcard to add up all the variables that begin
> with "Var" like this pseudocode:
>
> > Var6=sum(var*)
>
> Any suggestions for implementing this in R? Thanks! Mark

Here is one approach using grep().

Given a data frame called MyDF with the following structure:

> str(MyDF)

`data.frame': 10 obs. of 20 variables:

 $ other4 : num  -0.869  0.376 -2.022  0.619 -0.129 ...
 $ var8   : num  -0.380  1.428 -1.075 -0.796 -0.588 ...
 $ var4   : num  -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
 $ other9 : num   0.0210 -0.6455  0.0289  1.2405 -1.3359 ...
 $ var10  : num   0.647 -0.798  0.180  1.135 -0.258 ...
 $ other2 : num   0.1332 -0.2227  0.0423  0.6881  2.0304 ...
 $ other10: num  0.811 2.166 0.569 0.302 0.669 ...
 $ var1   : num  -0.774 -1.812 -1.230 -0.969  0.245 ...
 $ var2   : num  -0.0538  0.3712  0.8222 -0.8025 -0.6914 ...
 $ other6 : num  0.871 0.291 2.079 1.098 1.025 ...
 $ other1 : num  -0.5130  0.1358  0.8744  0.0997  1.7458 ...
 $ var9   : num   0.664 -0.456  0.415  2.090 -0.283 ...
 $ other3 : num  -0.425 -0.283  0.706 -1.879 -0.828 ...
 $ other7 : num   0.100  0.177  0.570 -0.631 -1.009 ...
 $ var3   : num   1.446 -0.862  0.184  1.077  0.146 ...
 $ var5   : num   0.402 -0.498 -0.906  0.641  1.690 ...
 $ var6   : num   0.892 -0.242  0.561  0.530 -0.291 ...
 $ other5 : num  -1.210  0.815 -1.284 -0.152  0.329 ...
 $ other8 : num  -0.265 -1.278  1.152  0.232 -1.189 ...
 $ var7   : num  -0.616 -0.994 -0.263  1.626 -1.372 ...


Note that the column names are either var* or other*.

Using grep() we get the indices of the column names that contain "other" plus one or more following characters, where the "other" begins the word:

> grep("\\bother.", names(MyDF))

 [1] 1 4 6 7 10 11 13 14 18 19

See ?regexp for more information. Note that I use "\\b" to being the search at the starting word boundary and then "." to require that there be following characters. Thus, this would not match " other1" or " other".

You can then use the following to subset the data frame MyDF and sum the rows for the requested columns:

> rowSums(MyDF[, grep("\\bother.", names(MyDF))])

        1 2 3 4 5 6 7 -1.344893 1.531417 2.715234 1.616971 1.307379 4.655568 4.638446

        8 9 10
-2.640485 -2.226270 -2.158248

You could use grep("other", names(MyDF)), but this would also get "other" if it appears anywhere in the name. For example:

> grep("other", "Thisotherone")

[1] 1

It just depends upon your naming schema and how strict you need to be in the search.

HTH, Marc Schwartz



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Tue Feb 21 12:24:14 2006

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:42:38 EST