Re: [R] split character vector by multiple keywords simultaneously

From: sunny <sunayan_at_gmail.com>
Date: Sun, 08 May 2011 04:43:26 -0700 (PDT)

Andrew Robinson-6 wrote:
>
> A hack would be to use gsub() to prepend e.g. XXX to the keywords that
> you want, perform a strsplit() to break the lines into component
> strings, and then substr() to extract the pieces that you want from
> those strings.
>
> Cheers
>
> Andrew
>

Thanks, that got me started. I am sure there are much easier ways of doing this, but in case someone comes looking, here's my solution:

keywordlist <- c("Company name:", "General manager:", "Manager:")

# Attach "XXX" to the beginning of each keyword (temp is the character vector holding the raw text, one record per element):
for (i in 1:length(keywordlist)) {
temp <- gsub(keywordlist[i],paste("XXX",keywordlist[i],sep=""),temp) }

# Split each row into a list:
temp <- strsplit(temp,"XXX")
# Eliminate empty elements:
temp <- lapply(temp, function(x) x[which(x!='')])

# Since each keyword happens to end with a colon, split each list element generated above into exactly two parts: pre-colon for the keyword and post-colon for the value.
# Values may contain colons themselves, so avoid spurious splits by using n=2 in the str_split_fixed function from the stringr package:
library(stringr)
temp <- lapply(temp,function(x) str_split_fixed(x,':',n=2))
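If stringr is not available, splitting at the first colon only can probably also be done in base R. The following is a minimal sketch, not part of the solution above; the input strings are made up for illustration, and regmatches() with invert=TRUE on a regexpr() match is used to cut each string at its first colon:

```r
# Hypothetical example strings; the first value contains a colon itself.
x <- c("Company name: Foo: Bar Inc", "Manager: C. Lee")

# regexpr() finds only the first match in each string, so invert = TRUE
# returns the text before and after that first colon.
parts <- regmatches(x, regexpr(":", x, fixed = TRUE), invert = TRUE)

# Strip the whitespace left around each piece.
parts <- lapply(parts, trimws)
```

Each element of parts is then a length-2 character vector holding the keyword and the value.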

# Convert each list element into a data frame. The transpose makes sure that the first row of each data frame is the set of keywords. Each data frame has 2 rows - one with the keywords and the second with the values:
temp <- lapply(temp, function(x) replace(as.data.frame(t(x)),,t(x)))

# Copy the first row of each data frame to the name of the corresponding column:
for (i in 1:length(temp)) {
names(temp[[i]]) <- as.character(temp[[i]][1,]) }

# Now join all the data frames in the list by column names, using rbind.fill from the plyr package. This way it doesn't matter if some keywords are absent in some cases:
library(plyr)
final_data <- do.call(rbind.fill,temp)

# We now have one large data frame with the odd numbered rows containing the keywords and the even numbered rows containing the values. Since we already have the keywords in the column names, we can eliminate the odd numbered rows:
final_data <- final_data[seq(2,dim(final_data)[1],2),]
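
For anyone who wants to try the idea end to end, here is a self-contained sketch of the same steps in base R only. It is an illustration under assumptions, not the code above: the two input strings are invented, parse_record and all_keys are hypothetical helper names, and the stringr/plyr steps are swapped for sub() and name-based indexing.

```r
# Hypothetical input: one record per string; the second record has no
# "General manager:" field and has a colon inside a value.
temp <- c("Company name: Acme Ltd General manager: A. Smith Manager: B. Jones",
          "Company name: Foo: Bar Inc Manager: C. Lee")
keywordlist <- c("Company name:", "General manager:", "Manager:")

# Mark every keyword with "XXX", then split each record on the marker.
for (kw in keywordlist) {
  temp <- gsub(kw, paste0("XXX", kw), temp, fixed = TRUE)
}
pieces <- lapply(strsplit(temp, "XXX", fixed = TRUE), function(x) x[x != ""])

# Split each piece at its first colon: keyword before, value after.
parse_record <- function(x) {
  keys <- trimws(sub(":.*$", "", x))
  vals <- trimws(sub("^[^:]*:", "", x))
  setNames(vals, keys)
}
records <- lapply(pieces, parse_record)

# One column per keyword; keywords missing from a record become NA.
all_keys <- sub(":$", "", keywordlist)
final_data <- as.data.frame(
  do.call(rbind, lapply(records, function(r) unname(r[all_keys]))),
  stringsAsFactors = FALSE)
names(final_data) <- all_keys
```

Indexing each record by keyword name puts the values straight into the right columns, so there are no keyword rows to drop afterwards. Like the original, this relies on no keyword being a substring of another with the same capitalisation.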

-S.

--
View this message in context: http://r.789695.n4.nabble.com/split-character-vector-by-multiple-keywords-simultaneously-tp3497033p3506776.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sun 08 May 2011 - 12:03:40 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 08 May 2011 - 13:00:05 GMT.
