Re: [R] Pattern match

From: neetika nath <nikkihathi_at_gmail.com>
Date: Fri, 22 Apr 2011 14:29:06 +0100

Thank you so much.

On Fri, Apr 22, 2011 at 1:29 PM, David Winsemius <dwinsemius_at_comcast.net> wrote:

>
> On Apr 22, 2011, at 6:42 AM, neetika nath wrote:
>
>
> Thank you for your message. Please see the attached file for the template/test
> dataset of my file.
>
>
> On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius <dwinsemius_at_comcast.net> wrote:
>
>>
>> On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
>>
>>> Thank you Dennis,
>>>
>>> Yes, the problem is the input file. I have an .rdf file, and its format is
>>> the same as what I posted earlier. If I open the file in Notepad++, the
>>> lines are broken with CR+LF characters. Is there any way to retrieve the
>>> SpeciesScientific information without changing the input file?
>>>
>>
>> You might consider attaching the original file named with an extension of
>> `.txt`, since your verbal description does not match your included example.
>> What I see after the various servers have passed this around and inserted
>> line-ends is the string `SpeciesScientific` in the first line, rather than
>> in the third.
>>
> lcon <- file("/Users/davidwinsemius/Downloads/temp_test.txt")
> lines <- readLines(lcon)
> lines
> #-----don't paste---
>  [1] "--"
>  [2] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
>  [3] "lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+"
>  [4] "C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+"
>  [5] "y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+"
>  [6] "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+"
>  [7] "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+"
>  [8] ""
>  [9] "--"
> [10] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
> [11] "$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+"
> # end don't paste-------------
>
>
> # So the first goal is to collapse the broken lines, but only within the
> # boundaries of "--"
> # Find the line numbers with "--"
> startidx <- grep("\\-\\-", lines)
> startidx
> #[1] 1 9 17
> endidx <- c(startidx[-1]-1, length(lines))
> endidx
> #[1] 8 16 25
> # Now collapse within those ranges
> unplus <- sapply(seq_along(startidx), function(x) {
>     gsub("\\+", "", paste(lines[startidx[x]:endidx[x]], collapse = ""))
> })
> # break on what appears to be the correct delimiter, ";"
> lapply(unplus, function(longline)
>     grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]))
> #[[1]]
> #[1] 7
>
> #[[2]]
> #[1] 5
>
> #[[3]]
> #[1] 6
> # Seems to succeed (admittedly after some errors that were elided), so save it
>
> lidx <- lapply(unplus, function(longline) grep("SpeciesScientific=\\(",
> strsplit(longline, ";")[[1]] ) )
> #Create a properly split list to work with
> breaklist <- strsplit(unplus, ";")
> # And extract the desired elements
> sapply(1:length(startidx), function(idx) breaklist[[idx]][ lidx[[idx]] ] )
> #[1] "SpeciesScientific=(Homo sapiens)"
> #[2] "SpeciesScientific=(Achromobacter cycloclastes)"
> #[3] "SpeciesScientific=(Triticum aestivum)"
> # Pulling the species from this simple list is left as a reader's exercise
>
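The exercise left to the reader above could be approached along these lines. This is only a minimal sketch, assuming every extracted element has the exact form `SpeciesScientific=(...)`; the vector `hits` is a hypothetical name that simply restates the three results printed above so the example runs on its own.

```r
# Results printed above, restated so the example is self-contained
# (`hits` is an illustrative name, not part of the original code)
hits <- c("SpeciesScientific=(Homo sapiens)",
          "SpeciesScientific=(Achromobacter cycloclastes)",
          "SpeciesScientific=(Triticum aestivum)")

# Capture whatever sits between the outer parentheses and keep only that group
species <- sub("^SpeciesScientific=\\((.*)\\)$", "\\1", hits)
species
```

Anchoring the pattern with `^` and `$` means an element that does not have the expected shape is returned unchanged, which makes malformed records easy to spot.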
> --
>> David
>>
>
>>
>> --
>>
>>>
>>> Thank you
>>>
>>> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmuser_at_gmail.com>
>>> wrote:
>>>
>>> Hi:
>>>>
>>>> This is a bit of a roundabout approach; I'm sure that folks with regex
>>>> expertise will trump this in a heartbeat. I modified the last piece of
>>>> the string a bit to accommodate the approach below. Depending on where
>>>> the strings have line breaks, you may have some odd '\n' characters
>>>> inserted.
>>>>
>>>> # Step 1: read the input as a single character string
>>>> u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
>>>> cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"
>>>>
>>>> # Step 2: Split input lines by the ';' delimiter and then use lapply()
>>>> # to split variable names from values.
>>>> # This results in a nested list for ulist2.
>>>> ulist <- strsplit(u, ';')
>>>> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>>>>
>>>> # Step 3: Break out the results into a matrix whose first column is
>>>> # the variable name and whose second column is the value (with parens
>>>> # included). This avoids dealing with nested lists.
>>>> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>>>>
>>>> # Step 4: Strip off the parens
>>>> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
>>>> colnames(w) <- c('Name', 'Value')
>>>> w
>>>>       Name                 Value
>>>>  [1,] "SpeciesCommon"      "Human"
>>>>  [2,] "SpeciesScientific"  "Homo sapiens"
>>>>  [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>>>>  [4,] "BondInvolved"       "C-H"
>>>>  [5,] "EzCatDBID"          "S00343"
>>>>  [6,] "BondFormed"         "O-H,O-H"
>>>>  [7,] "Bond"               "255B"
>>>>  [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>>>>  [9,] "CatalyticSwissProt" "P25006"
>>>> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
>>>> [11,] "SpeciesCommon"      "Bacteria"
>>>> [12,] "Reactive"           "Ce+"
>>>>
>>>> # Step 5: Subset out the values of the SpeciesScientific variables
>>>> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>>>>                          Value
>>>> 2                 Homo sapiens
>>>> 10 Achromobacter\ncycloclastes
>>>>
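The stray '\n' in the second value above comes from the line break inside the input string. One way to tidy it, sketched here under the assumption that any embedded line break should become a single space, is a `gsub()` call on the values. The vector `value` is a hypothetical stand-in that restates the two results shown above.

```r
# The two extracted values, restated so the example is self-contained
# (`value` and `clean` are illustrative names)
value <- c("Homo sapiens", "Achromobacter\ncycloclastes")

# Replace any run of carriage returns or newlines with one space
clean <- gsub("[\r\n]+", " ", value)
clean
```

Matching `\r` as well as `\n` also covers input that was written with CR+LF line ends.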
>>>>
>>>> One possible 'advantage' of this approach is that if you have a number
>>>> of string records of this type, you can create nested lists for each
>>>> string and then manipulate the lists to get what you need. Hopefully
>>>> you can use some of these ideas for other purposes as well.
>>>>
>>>> Dennis
>>>>
>>>>
>>>>
>>>> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkihathi_at_gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a very simple question regarding pattern matching. Could anyone
>>>>> tell me how I can use R to retrieve a string pattern from a text file?
>>>>> For example, my file contains the following information:
>>>>>
>>>>> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
>>>>> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
>>>>> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
>>>>> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>>>>>
>>>>> and I want to extract the “SpeciesScientific = (?)” information from this
>>>>> file. The problem is in the 3rd line, where the word SpeciesScientific is
>>>>> split by the +.
>>>>>
>>>>> Could anyone help me please?
>>>>> Thank you
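The wrapped-line problem described in this question can be sketched directly in R. This is a minimal sketch, not the method used elsewhere in the thread: it assumes a trailing "+" always marks a line that continues on the next one, and it restates the four lines of the question (as I read them) in a hypothetical vector `wrapped` rather than reading them from a file.

```r
# The four lines from the question, restated so the example is self-contained
# (`wrapped`, `joined` and `found` are illustrative names)
wrapped <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+",
  "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+",
  "255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+"
)

# Drop the trailing "+" continuation marker and glue the pieces back together
joined <- paste(sub("\\+$", "", wrapped), collapse = "")

# Pull out every SpeciesScientific=(...) field from the rejoined text
found <- regmatches(joined, gregexpr("SpeciesScientific=\\([^)]*\\)", joined))[[1]]
found
```

Removing only a "+" at the end of a line, rather than every "+", leaves any "+" that belongs to the data itself untouched.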
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>
>>>>> ______________________________________________
>>>>> R-help_at_r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
> <temp_test.txt>
>
>
> David Winsemius, MD
> West Hartford, CT
>
>


Received on Fri 22 Apr 2011 - 13:35:09 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 22 Apr 2011 - 14:10:32 GMT.
