Re: [R] unique/subset problem

From: Weiwei Shi <helprhelp_at_gmail.com>
Date: Fri 26 Jan 2007 - 20:53:40 GMT

Oh, I forgot: you can also convert a factor to a character vector, e.g.
dataset$genome1 <- as.character(dataset$genome1)

And you don't have to use
as.numeric(dataset$score) if you pass as.is=TRUE to read.table.
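
Here is a minimal sketch of what is going on, assuming a small made-up data frame (the names toy, pruned, txt and dt and all the values are hypothetical, not your data). Subsetting a factor column keeps all of the original levels, so levels() still lists every genome; converting to character, or rebuilding the factor, reports only the values that are left.

```r
# Hypothetical toy data standing in for the real file (values are made up).
toy <- data.frame(
  genome1 = factor(c("sent", "paer", "aero", "aful")),
  score   = c(-200.6, -100.2, -1.0, -2.5)
)

# Keep only the rows with score below -5 (rows 1 and 2 here).
pruned <- subset(toy, score < -5)

# Subsetting keeps every level of the factor, including the unused ones.
all_levels <- levels(pruned$genome1)

# Converting to character reports only the values actually present.
present <- unique(as.character(pruned$genome1))

# Rebuilding the factor drops the unused levels.
used_levels <- levels(factor(pruned$genome1))

# With as.is = TRUE, read.table keeps strings as character from the start.
txt <- "genome1\tscore\nsent\t-200.6\naero\t-1.0"
dt <- read.table(text = txt, header = TRUE, sep = "\t", as.is = TRUE)
str(dt)
```

So counting length(unique(as.character(x))) rather than length(levels(x)) should give the number of genomes that survive the subset.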

HTH, weiwei

On 1/26/07, Weiwei Shi <helprhelp@gmail.com> wrote:
> Check
> ?read.table
>
> and add as.is=TRUE to the call. Strings are then read as character
> vectors, which avoids the factor issue.
>
> Then repeat your work.
>
> For example
> > x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10)
> > str(x0,1)
> `data.frame': 10 obs. of 7 variables:
> $ V1: Factor w/ 10 levels "-4086733916",..: 10 9 8 7 6 5 4 3 2 1
> $ V2: Factor w/ 10 levels "-1963744741",..: 10 8 7 4 5 6 3 9 1 2
> $ V3: Factor w/ 7 levels "-1687428658",..: 7 4 4 2 5 1 6 6 3 4
> $ V4: Factor w/ 2 levels "5","MECHANISM": 2 1 1 1 1 1 1 1 1 1
> $ V5: Factor w/ 2 levels "0","TYPE": 2 1 1 1 1 1 1 1 1 1
> $ V6: Factor w/ 2 levels "USER_","alexey": 1 2 2 2 2 2 2 2 2 2
> $ V7: Factor w/ 2 levels "3","TRUST": 2 1 1 1 1 1 1 1 1 1
> > x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10, as.is=T)
> > str(x0,1)
> `data.frame': 10 obs. of 7 variables:
> $ V1: chr "LINK_ID" "-4293537751" "-4247422653" "-4223137153" ...
> $ V2: chr "ID1" "65259" "1020286" "-518245428" ...
> $ V3: chr "ID2" "6436" "6436" "-2099509019" ...
> $ V4: chr "MECHANISM" "5" "5" "5" ...
> $ V5: chr "TYPE" "0" "0" "0" ...
> $ V6: chr "USER_" "alexey" "alexey" "alexey" ...
> $ V7: chr "TRUST" "3" "3" "3" ...
>
> HTH,
>
> weiwei
>
> On 1/26/07, lalitha viswanath <lalithaviswanath@yahoo.com> wrote:
> > Hi
> > I read in my dataset using
> > dt <- read.table("filename")
> > calling unique(levels(dt$genome1)) yields the
> > following
> >
> > [1] "aero" "aful" "aquae" "atum_D" "bbur" "bhal" "bmel" "bsub"
> > [9] "buch" "cace" "ccre" "cglu" "cjej" "cper" "cpneuA" "cpneuC"
> > [17] "cpneuJ" "ctraM" "ecoliO157" "hbsp" "hinf" "hpyl" "linn" "llact"
> > [25] "lmon" "mgen" "mjan" "mlep" "mlot" "mpneu" "mpul" "mthe"
> > [33] "mtub" "mtub_cdc" "nost" "pabyssi" "paer" "paero" "pmul" "pyro"
> > [41] "rcon" "rpxx" "saur_mu50" "saur_n315" "sent" "smel" "spneu" "spyo"
> > [49] "ssol" "stok" "styp" "synecho" "tacid" "tmar" "tpal" "tvol"
> > [57] "uure" "vcho" "xfas" "ypes"
> >
> > It shows 60 genomes, which is correct.
> >
> > I extracted a subset as follows
> > possible_relatives_subset <- subset(dt, Y < -5)
> > I am pasting the results below
> > genome1 genome2 parameterX Y
> > 21 sent ecoliO157 0.00590 -200.633493
> > 22 sent paer 0.18603 -100.200570
> > 27 styp ecoliO157 0.00484 -240.708645
> > 28 styp paer 0.18497 -30.250127
> > 41 paer sent 0.18603 -60.200570
> > 44 paer styp 0.18497 -80.250127
> > 49 paer hinf 0.18913 -90.056333
> > 53 paer vcho 0.18703 -10.153929
> > 55 paer pmul 0.18587 -100.208042
> > 67 paer buch 0.21485 -80.898667
> > 70 paer ypes 0.18460 -107.267454
> > 82 paer xfas 0.26268 -61.920552
> > 95 hinf ecoliO157 0.07654 -163.018417
> > 96 hinf paer 0.18913 -10.056333
> > 103 vcho ecoliO157 0.09518 -140.921153
> > 104 vcho paer 0.18703 -10.153929
> > 107 pmul ecoliO157 0.07328 -165.215225
> > 108 pmul paer 0.18587 -10.208042
> > 131 buch ecoliO157 0.15412 -11.746939
> > 132 buch paer 0.21485 -8.898667
> > 137 ypes ecoliO157 0.02705 -19.171851
> > 138 ypes paer 0.18460 -10.267454
> > 171 ecoliO157 sent 0.00590 -20.633493
> > 174 ecoliO157 styp 0.00484 -20.708645
> > 179 ecoliO157 hinf 0.07654 -6.018417
> > 183 ecoliO157 vcho 0.09518 -14.921153
> > 185 ecoliO157 pmul 0.07328 -6.215225
> > 197 ecoliO157 buch 0.15412 -11.746939
> > 200 ecoliO157 ypes 0.02705 -9.171851
> > 211 ecoliO157 xfas 0.25833 -71.091552
> > 217 xfas ecoliO157 0.25833 -75.091552
> > 218 xfas paer 0.26268 -64.920552
> >
> > I think even a cursory look will tell us that there
> > are not as many unique genomes in the subset results.
> > (around 8/10).
> > However when I do
> > unique(levels(possible_relatives_subset$genome1)), I
> > get
> >
> > [1] "aero" "aful" "aquae" "atum_D" "bbur" "bhal" "bmel" "bsub"
> > [9] "buch" "cace" "ccre" "cglu" "cjej" "cper" "cpneuA" "cpneuC"
> > [17] "cpneuJ" "ctraM" "ecoliO157" "hbsp" "hinf" "hpyl" "linn" "llact"
> > [25] "lmon" "mgen" "mjan" "mlep" "mlot" "mpneu" "mpul" "mthe"
> > [33] "mtub" "mtub_cdc" "nost" "pabyssi" "paer" "paero" "pmul" "pyro"
> > [41] "rcon" "rpxx" "saur_mu50" "saur_n315" "sent" "smel" "spneu" "spyo"
> > [49] "ssol" "stok" "styp" "synecho" "tacid" "tmar" "tpal" "tvol"
> > [57] "uure" "vcho" "xfas" "ypes"
> >
> > Where am I going wrong?
> > I tried calling unique without the levels too, which
> > gives me the following response
> >
> > [1] sent styp paer hinf vcho pmul buch ypes ecoliO157 xfas
> > 60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes
> >
> > --- Weiwei Shi <helprhelp@gmail.com> wrote:
> >
> > > Then you need to provide more details about the
> > > calls you made and your dataset.
> > > For example, you can show us the output of
> > > str(prunedrelatives, 1)
> > >
> > > How did you call unique on prunedrelatives, and so on?
> > > I made a test data set and it gave me what you wanted
> > > (omitted here).
> > >
> > > On 1/26/07, lalitha viswanath
> > > <lalithaviswanath@yahoo.com> wrote:
> > > > Hi
> > > > > The pruned dataset has 8 unique genomes in it while
> > > > > the dataset before pruning has 65 unique genomes in it.
> > > > > However calling unique on the pruned dataset seems to
> > > > > return 65 no matter what.
> > > > >
> > > > > Any assistance in this matter would be appreciated.
> > > >
> > > > Thanks
> > > > Lalitha
> > > > --- Weiwei Shi <helprhelp@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > Even if you removed "many" genome1 entries by setting
> > > > > > score < -5, that does not necessarily mean you changed
> > > > > > the set of unique values.
> > > > > >
> > > > > > To check this, you can do something like
> > > > > > p0 <- unique(dataset[dataset$score < -5, "genome1"])  # same as subset
> > > > > > p1 <- unique(dataset[dataset$score >= -5, "genome1"])
> > > > > >
> > > > > > setdiff(p1, p0)
> > > > > >
> > > > > > If the output above is empty, then even though you
> > > > > > removed many genome1 entries, it did not change the
> > > > > > set of unique values.
> > > > >
> > > > > HTH,
> > > > >
> > > > > weiwei
> > > > >
> > > > >
> > > > >
> > > > > On 1/25/07, lalitha viswanath
> > > > > <lalithaviswanath@yahoo.com> wrote:
> > > > > > Hi
> > > > > > > I am new to R programming and am using subset to
> > > > > > > extract part of a data as follows
> > > > > > >
> > > > > > > names(dataset) = c("genome1","genome2","dist","score");
> > > > > > > prunedrelatives <- subset(dataset, score < -5);
> > > > > > >
> > > > > > > However when I use unique to find the number of unique
> > > > > > > genomes now present in prunedrelatives I get results
> > > > > > > identical to calling unique(dataset$genome1) although
> > > > > > > subset has eliminated many genomes and records.
> > > > > > >
> > > > > > > I would greatly appreciate your input about using
> > > > > > > "unique" correctly in this regard.
> > > > > >
> > > > > > Thanks
> > > > > > Lalitha
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > ____________________________________________________________________________________
> > > > > > TV dinner still cooling?
> > > > > > Check out "Tonight's Picks" on Yahoo! TV.
> > > > > >
> > > > > > ______________________________________________
> > > > > > R-help@stat.math.ethz.ch mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > PLEASE do read the posting guide
> > > > > http://www.R-project.org/posting-guide.html
> > > > > > and provide commented, minimal,
> > > self-contained,
> > > > > reproducible code.
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Weiwei Shi, Ph.D
> > > > > Research Scientist
> > > > > GeneGO, Inc.
> > > > >
> > > > > "Did you always know?"
> > > > > "No, I did not. But I believed..."
> > > > > ---Matrix III
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > ____________________________________________________________________________________
> > > > Bored stiff? Loosen up...
> > > > Download and play hundreds of games for free on
> > > Yahoo! Games.
> > > > http://games.yahoo.com/games/front
> > > >
> > >
> > >
> > > --
> > > Weiwei Shi, Ph.D
> > > Research Scientist
> > > GeneGO, Inc.
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
> >
> >
> >
> >
> > ____________________________________________________________________________________
> > We won't tell. Get more on shows you hate to love
> > (and love to hate): Yahoo! TV's Guilty Pleasures list.
> > http://tv.yahoo.com/collections/265
> >
>
>
> --
> Weiwei Shi, Ph.D
> Research Scientist
> GeneGO, Inc.
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
>

-- 
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sat Jan 27 07:57:21 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 26 Jan 2007 - 21:30:33 GMT.
