Re: [R] Pattern match

From: David Winsemius <dwinsemius_at_comcast.net>
Date: Thu, 21 Apr 2011 08:30:28 -0400

On Apr 21, 2011, at 5:27 AM, neetika nath wrote:

> Thank you Dennis,
>
> Yes, the problem is the input file. I have an .rdf file, and its format
> is the same as in my earlier post. If I open the file in Notepad++, the
> lines are broken with CR+LF characters. Is there any way to retrieve the
> SpeciesScientific information without changing the input file?

You might consider attaching the original file named with an extension of `.txt`, since your verbal description does not match your included example. What I see after the various servers have passed this around and inserted line-ends is the string `SpeciesScientific` in the first line, rather than in the third.
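
In the meantime, here is a minimal sketch of one way to get at the values without editing the input file. It rests on assumptions I cannot check without the real file: that each physical line holds 80 data characters followed by a `+` continuation marker, and that dropping that marker and joining the lines restores the original record. The temporary file and the names `rdf_lines`, `path`, `record`, `hits` and `species` are only for illustration.

```r
# Hypothetical stand-in for the .rdf file, built from the lines in the
# original post (assumed layout: 80 data characters plus a trailing '+').
rdf_lines <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+",
  "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+",
  "255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+"
)

# Write the sample with CR+LF line endings; binary mode keeps the
# separators exactly as given on every platform.
path <- tempfile(fileext = ".rdf")
con <- file(path, "wb")
writeLines(rdf_lines, con, sep = "\r\n")
close(con)

# readLines() accepts LF, CRLF or CR as the end-of-line marker,
# so the file can be read unchanged.
x <- readLines(path)

# Assumption: a '+' at the end of a line is a continuation marker, not data.
# Drop it and glue the pieces back into one record.
record <- paste(sub("\\+$", "", x), collapse = "")

# Find every SpeciesScientific=(...) field, then keep only the value.
hits <- regmatches(record,
                   gregexpr("SpeciesScientific=\\([^)]*\\)", record))[[1]]
species <- sub("^SpeciesScientific=\\((.*)\\)$", "\\1", hits)
species
```

With the sample lines above, `species` should hold "Homo sapiens" and "Achromobacter cycloclastes". If the trailing `+` turns out to be data rather than a continuation marker, the `sub("\\+$", "", x)` step would need to change.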

-- 
David

-- 

>
> Thank you
>
> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmuser_at_gmail.com>
> wrote:
>
>> Hi:
>>
>> This is a bit of a roundabout approach; I'm sure that folks with regex
>> expertise will trump this in a heartbeat. I modified the last piece of
>> the string a bit to accommodate the approach below. Depending on where
>> the strings have line breaks, you may have some odd '\n' characters
>> inserted.
>>
>> # Step 1: read the input as a single character string
>> u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
>> cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"
>>
>> # Step 2: Split the input by the ';' delimiter and then use lapply()
>> # to split variable names from values.
>> # This results in a nested list for ulist2.
>> ulist <- strsplit(u, ';')
>> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>>
>> # Step 3: Break out the results into a matrix whose first column is
>> # the variable name and whose second column is the value (with parens
>> # included). This avoids dealing with nested lists.
>> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>>
>> # Step 4: Strip off the parens
>> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
>> colnames(w) <- c('Name', 'Value')
>> w
>>       Name                 Value
>>  [1,] "SpeciesCommon"      "Human"
>>  [2,] "SpeciesScientific"  "Homo sapiens"
>>  [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>>  [4,] "BondInvolved"       "C-H"
>>  [5,] "EzCatDBID"          "S00343"
>>  [6,] "BondFormed"         "O-H,O-H"
>>  [7,] "Bond"               "255B"
>>  [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>>  [9,] "CatalyticSwissProt" "P25006"
>> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
>> [11,] "SpeciesCommon"      "Bacteria"
>> [12,] "Reactive"           "Ce+"
>>
>> # Step 5: Subset out the values of the SpeciesScientific variables
>> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>>                          Value
>> 2                 Homo sapiens
>> 10 Achromobacter\ncycloclastes
>>
>>
>> One possible 'advantage' of this approach is that if you have a number
>> of string records of this type, you can create nested lists for each
>> string and then manipulate the lists to get what you need. Hopefully
>> you can use some of these ideas for other purposes as well.
>>
>> Dennis
>>
>>
>>
>> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkihathi_at_gmail.com> wrote:
>>> Hi ALL,
>>>
>>> I have a very simple question about pattern matching. Could anyone
>>> tell me how I can use R to retrieve a string pattern from a text
>>> file? For example, my file contains the following information:
>>>
>>> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
>>> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
>>> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
>>> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>>>
>>> and I want to extract the “SpeciesScientific=(?)” information from
>>> this file. The problem is in the 3rd line, where the word
>>> SpeciesScientific is split by a +.
>>>
>>> Could anyone help me please?
>>> Thank you
>>>
>>>
>>> --
>>> View this message in context: http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
David Winsemius, MD
West Hartford, CT
Received on Thu 21 Apr 2011 - 12:32:46 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 22 Apr 2011 - 11:20:31 GMT.
