Re: [R] Basic data structures

From: Satoshi Takahama <s.takahama_at_yahoo.com>
Date: Sun, 10 Aug 2008 17:45:23 -0700 (PDT)


Suppose I want to have a regexp match against a string, and return all the matching substrings in a vector of strings.

   regexp <- "[ab]+"
   strlist <- c( "abc", "dbabddadd", "aaa" )
   matches <- gregexpr(regexp,strlist)

With this input, I'd want to return list( list("ab"), list("bab", "a"), list("aaa") ).

Now the matches object prints out as

   [[1]]
   [1] 1
   attr(,"match.length")
   [1] 2

   [[2]]
   [1] 2 7
   attr(,"match.length")
   [1] 3 1

   [[3]]
   [1] 1
   attr(,"match.length")
   [1] 3

which, if I'm interpreting this correctly, means that it is a list (not a vector, because vectors can only have atomic elements) of three elements, each of which is a vector of integers (the matching positions) with an attribute match.length (the length of the corresponding match), which is in turn a vector of integers.

===
Question: is there a more compact standard print format for this? It's a bit disconcerting that printing the 2x2 list list(list(1,2),list(3,4)) takes 16 lines while the corresponding 2x2 array takes 2 lines! (I guess that arrays are "more native").

Here is one way:

> (mat <- t(sapply(matches,function(x)
+        list(start.index=`attributes<-`(x,NULL),
+             match.length=attr(x,"match.length")))))
     start.index match.length
[1,] 1           2
[2,] Integer,2   Integer,2
[3,] 1           3

The object returned by this expression is a 3x2 matrix of mode "list". Indexing with "[" returns a list of length one holding the element; indexing with "[[" returns the element itself:

> mat[2,1]
$start.index
[1] 2 7

> mat[[2,1]]
[1] 2 7

also, see below...
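Another compact view, as a rough sketch rather than a standard print format: flatten the gregexpr() result into a data frame with one row per match. The column names and the object name "compact" are my own choices for illustration, and the sketch assumes every string has at least one match (gregexpr() reports -1 otherwise).

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

# One row per match: which string it came from, where it starts, how long it is.
compact <- data.frame(
  string       = rep(seq_along(matches), sapply(matches, length)),
  start.index  = unlist(matches),
  match.length = unlist(lapply(matches, attr, "match.length"))
)
print(compact)
```

This prints four rows for the example input, which is easier to scan than the nested list output.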

===
Now, matches[[1]], the first element of matches, describes the matches in the first string. To extract those strings, I can write

   substr( strlist[[1]],
           matches[[1]],
           attr(matches[[1]],"match.length")+matches[[1]]-1 )

which correctly gives "ab".

Question: This looks awfully clumsy; is there some more idiomatic way to do this, in particular to refer to the match.length attribute without using a quoted string or the attr function? attributes(matches[[1]])$match.length and attributes(matches[[1]])[[1]] work, but seem even clumsier.

Check out the gsubfn package - I'm still learning it myself, but it may provide the functionality you seek. For instance, I believe what you are trying to accomplish is

> strapply(strlist,regexp,identity)
or
> strapply(strlist,regexp,c)
[[1]]
[1] "ab"

[[2]]
[1] "bab" "a"

[[3]]
[1] "aaa"
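If you would rather stay in base R, here is a minimal sketch of the same extraction. The helper name extract_matches is made up for this example, and it assumes every string contains at least one match. It relies on substring(), which recycles its start and stop arguments.

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

# Hypothetical helper: cut every match out of one string, given that
# string's element of the gregexpr() result.
extract_matches <- function(s, m) {
  substring(s, m, m + attr(m, "match.length") - 1)
}

# Apply the helper to each (string, match data) pair; keep the result a list.
found <- mapply(extract_matches, strlist, matches,
                SIMPLIFY = FALSE, USE.NAMES = FALSE)
print(found)
```

Later releases of R (2.14.0 and newer, as far as I know) also provide regmatches(strlist, matches) in base, which should give the same list without a helper.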

===
Question: R uses names like xxx.yyy in many places. Is this just a convention to represent spaces (the way most languages use "_"), or is there some semantics attached to "."?

In many examples that I have seen, programmers use "." where other languages traditionally use "_", because "_" was an assignment operator in earlier versions of R. "_" is no longer an assignment operator, so it is now permitted in variable names as well.

The "." notation also plays a role in R's implementation of OOP. R has two object-oriented approaches: S3 and S4. In both, methods are associated with generic functions rather than with the object itself (which I understand is similar to Lisp's CLOS). For S3, a function named function.objectclass is the method that the generic "function" applies to objects of class "objectclass".

For instance, print() is a generic function:
> print
function (x, ...)
UseMethod("print")
<environment: namespace:base>

If you want to define a method for a particular class of objects, you can use the xxx.yyy syntax:
> print.regexp <- function(x, ...)
+   for(i in seq_along(x))
+     cat(i,":", x[[i]], "| match.length =",
+         attr(x[[i]],"match.length"),"\n")

> class(matches) <- "regexp"
> print(matches)

1 : 1 | match.length = 2 
2 : 2 7 | match.length = 3 1 
3 : 1 | match.length = 3 

You can assign a class (or classes) to each object; this information is used to make method dispatch decisions for generic functions. S3 does not check that an object's attributes are consistent with its class; S4 is a more formal implementation of OOP in R.
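To make the dispatch rule and the lack of checking concrete, here is a small sketch. The generic "describe" and its methods are invented purely for this illustration; they are not part of R.

```r
# 'describe' is a hypothetical generic; UseMethod() picks a method by class.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) "some object"

# Method chosen for objects whose class attribute includes "regexp".
describe.regexp <- function(x, ...) {
  paste("match data for", length(x), "strings")
}

m <- structure(list(1L, c(2L, 7L), 1L), class = "regexp")
describe(m)      # dispatches to describe.regexp
describe(1:3)    # no describe.integer exists, so describe.default is used

# S3 performs no validation: any object can carry any class label.
bogus <- structure("not match data", class = "regexp")
describe(bogus)  # still dispatches to describe.regexp
```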

Check out
(S3) http://www-128.ibm.com/developerworks/linux/library/l-r3.html
(S4) http://developer.r-project.org/howMethodsWork.pdf

The first reference also mentions how to implement infinite sequences in R - which may answer part of your question below.

===
Question: Is it good practice in R to treat a string as a vector of characters so that R's powerful vector operations can be used on it? How would I do that?

I'm sure it can be done by defining your own objects and methods, but it's not done out-of-the-box (that I'm aware of). I believe the most common string operations used by R users are extraction and concatenation; these are effectively achieved by substr(), substring() and paste(), rather than "[", c(), or "+", as you seem to have figured out. In my experience, R's standard objects and functions for string-like objects are immediately convenient for manipulating file and variable names but not necessarily for hard-core text processing.
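One base-R route that may come close, offered as a sketch rather than a recommendation: strsplit() with an empty split string breaks a string into a character vector with one element per character, after which the usual vector operations apply. The variable names below are just for illustration.

```r
s <- "dbabddadd"

# Split into single characters; strsplit() returns a list, so take element 1.
chars <- strsplit(s, "")[[1]]

# Ordinary vector operations now work on the characters.
n_ab  <- sum(chars %in% c("a", "b"))       # how many characters are "a" or "b"
rev_s <- paste(rev(chars), collapse = "")  # reverse, then glue back together

print(chars)
print(n_ab)
print(rev_s)
```

Whether this counts as good practice probably depends on the task; for pattern matching the regular expression functions are usually more direct.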

===
Now suppose I want to list *all* the matches in matches[[2]]. I try:

   substr( strlist[[2]],
           matches[[2]],
           attr(matches[[2]],"match.length")+matches[[2]]-1 )

but only get the first one, so it seems that the recycling rule for vectors doesn't apply here (same thing with [2] instead of [[2]]). Where does recycling apply and not apply?

I don't know if there's a hard rule for that (though I usually expect recycling to work for mathematical operators and plotting functions), but in this case I hope the strapply() function above will solve your problem. Otherwise, an inelegant way would be to use Map() or mapply():

> mapply(function(x,y) substr( strlist[[2]],x,y),
+     matches[[2]],
+     attr(matches[[2]],"match.length")+matches[[2]]-1)
[1] "bab" "a"
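For this particular case the difference lies between substr() and substring(). As I read the documentation, substr() returns a result as long as its first argument, so with a single string only the first start/stop pair is used, whereas substring() recycles all three arguments to the longest one. A small sketch with the positions from matches[[2]] written out by hand:

```r
s     <- "dbabddadd"
first <- c(2L, 7L)   # start positions of the two matches
last  <- c(4L, 7L)   # end positions (start + match.length - 1)

one  <- substr(s, first, last)     # length(s) is 1, so only the first pair is used
both <- substring(s, first, last)  # s is recycled to match first and last

print(one)
print(both)
```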

===
Question: Is there some operator (using promises?) to make strlist[[2]] into a (lazy) infinite vector/list?

Like an iterator? There is some mention of infinite sequences in the IBM DeveloperWorks article above, but I've personally never tried implementing one in R.
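For what it's worth, one way such an iterator could be sketched is with a closure that remembers its position and wraps around, so the vector behaves as if it were recycled forever. This is only my guess at what is wanted, not the approach from the article, and make_cycle is a made-up name.

```r
# Hypothetical helper: returns a function that yields the next element of x
# on each call, starting over from the beginning when it runs out.
make_cycle <- function(x) {
  i <- 0L
  function() {
    i <<- i %% length(x) + 1L  # advance the stored position, wrapping around
    x[[i]]
  }
}

next_char  <- make_cycle(c("a", "b", "c"))
first_five <- vapply(1:5, function(k) next_char(), character(1))
print(first_five)
```

Nothing is computed until next_char() is called, which is about as lazy as this gets without using promises directly.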

Hope this helps,
Satoshi



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 11 Aug 2008 - 02:29:24 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 11 Aug 2008 - 05:33:41 GMT.
