Re: [R] using (g)sub for efficient string handling (was Re: transforming one column into 2 columns)

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Sat, 2 Feb 2008 14:19:25 -0500

This does not answer your question directly, but note that strapply in the gsubfn package can be used to select strings by content:

> library(gsubfn)
> (x <- strapply(txt, "Variation_....", simplify = c))
[1] "Variation_0001" "Variation_5452" "Variation_4192" "Variation_4193"
[5] "Variation_8246" "Variation_8246"
> paste(x, collapse = ";")
[1] "Variation_0001;Variation_5452;Variation_4192;Variation_4193;Variation_8246;Variation_8246"
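
A per-record variant can also be sketched in base R. This is a minimal sketch, not part of the original message: it assumes an R version that provides regmatches() (R 2.14.0 or later), and the txt vector here is a shortened, illustrative stand-in for the one quoted below.

```r
# Shortened stand-in for the 'txt' vector quoted below (illustrative data only)
txt <- c("Variation_0001 // chr1:1083805-1283805 /// Variation_5452 // chr1:1142956-1147823",
         "Variation_4192 // Array CGH /// Variation_4193 // Array CGH /// Variation_8246 // CopyNumber",
         "Variation_8246 // chr1:2224111-3755284 // CopyNumber")

# gregexpr() locates every match within each element;
# regmatches() extracts them as a list with one character vector per record
ids <- regmatches(txt, gregexpr("Variation_[0-9]+", txt))

# Collapse each record's matches into one semicolon-separated string
results <- vapply(ids, paste, character(1), collapse = ";")
results
```

Because the matches are kept per element rather than flattened with simplify = c, results should have one string per record, in the shape of the results vector given in the quoted message.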

On Feb 2, 2008 1:40 PM, Benilton Carvalho <bcarvalh_at_jhsph.edu> wrote:
> That actually reminds me of a problem I had to tackle a while ago.
>
> Say I have the following:
>
> txt <- c("Variation_0001 // chr1:1083805-1283805 // Array CGH //
> 15286789 // Iafrate et al. (2004) // CopyNumber /// Variation_5452 //
> chr1:1142956-1147823 // Computational mapping of resequencing
> traces // 16902084 // Mills et al. (2006) // CopyNumber",
> "Variation_4192 // chr1:2062347-2242269 // Array CGH // 17160897 //
> Wong et al. (2007) // CopyNumber /// Variation_4193 //
> chr1:2145626-2314237 // Array CGH // 17160897 // Wong et al. (2007) //
> CopyNumber /// Variation_8246 // chr1:2224111-3755284 // Affymetrix
> 500K and 100K SNP Mapping Arrays // 17638019 // Zogopoulos et al.
> (2007) // CopyNumber", "Variation_8246 // chr1:2224111-3755284 //
> Affymetrix 500K and 100K SNP Mapping Arrays // 17638019 // Zogopoulos
> et al. (2007) // CopyNumber")
>
> For each record, I'm interested in keeping the following:
>
> results <- c("Variation_0001;Variation_5452",
> "Variation_4192;Variation_4193;Variation_8246", "Variation_8246")
>
> My solution was:
>
> theNames <- function(tmp)
> sapply(strsplit(tmp, " /+ "),
> function(y)
> paste(y[grep("Variation_", y)],
> collapse=";"))
>
> But my wish was to know the regular expression that I needed to select
> everything but "Variation_\\d+"... For example, something like:
>
> gsub( NOT "Variation_\\d+", ";", txt, perl=TRUE)
>
> Suggestions?
>
> b
>
> On Feb 2, 2008, at 1:03 PM, Peter Dalgaard wrote:
>
> > Benilton Carvalho wrote:
> >> help("strsplit")
> >> b
> >>
> > Yes, but...
> >
> > The postprocessing gets a bit awkward. It might be easier to use
> > sub() to get rid of the first/last bit of the string i.e.
> >
> > C2 <- sub("^.*:", "", Col)
> > C1 <- sub(":.*$", "", Col)
> >
> > An orthogonal idea is
> >
> > con <- textConnection(as.character(Col))
> > read.table(con, sep=":")
> > close(con)
> >
> >> On Feb 2, 2008, at 12:43 PM, joseph wrote:
> >>
> >>> Hello
> >>>
> >>> I have a data frame and one of its columns is as follows:
> >>>
> >>> Col
> >>> chr1:71310034
> >>> chr15:37759058
> >>> chr22:18262638
> >>> chrUn:31337214
> >>> chr10_random:4369261
> >>> chrUn:3545097
> >>>
> >>> I would like to get rid of colon (:) and replace this column
> >>> with two new columns containing the terms on each side of the
> >>> colon. The new columns
> >>> should look as follows:
> >>>
> >>> Col_a          Col_b
> >>> chr1           71310034
> >>> chr14          23354088
> >>> chr15          37759058
> >>> chr22          18262638
> >>> chrUn          31337214
> >>> chr10_random   4369261
> >>> chrUn          3545097
> >>>
> >>> Any help will be much appreciated
> >>>
> >>> Joseph
> >>>
> >>> ______________________________________________
> >>> R-help_at_r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> > --
> >    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
> >   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
> >  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> > ~~~~~~~~~~ - (p.dalgaard_at_biostat.ku.dk)              FAX: (+45) 35327907
> >
>
>



Received on Sat 02 Feb 2008 - 19:24:18 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 02 Feb 2008 - 21:30:10 GMT.
