From: <rkevinburton_at_charter.net>

Date: Sat, 12 Jul 2008 22:47:00 -0700


R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sun 13 Jul 2008 - 05:51:47 GMT


Thank you. This was very informative.

When I run this command (str(y)), I get something like:

 $ WOMEN.X MEN 3 :'data.frame': 0 obs. of 5 variables:
  ..$ DayOfYear  : int(0)
  ..$ Quantity   : int(0)
  ..$ Fraction   : num(0)
  ..$ Category   : Factor w/ 46 levels "(Unknown)","10\" Plates",..:
  ..$ SubCategory: Factor w/ 246 levels "(Unknown)","70's Disco",..:

What do 'Factor w/ 46 levels ...' and 'Factor w/ 246 levels ...' mean in this output?

Thanks again.

Kevin
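
As a rough illustration of the 'Factor w/ N levels' question above: a factor stores each value as an integer code that points into a table of distinct labels (its levels), and str() reports how many labels there are. The sketch below uses made-up toy data (the names category, f and empty are hypothetical, not from the poster's file). It also shows why a subset with 0 observations can still report all of its levels.

```r
# Toy data; 'category', 'f' and 'empty' are illustrative names only.
category <- c("WOMEN", "MEN", "WOMEN", "(Unknown)")
f <- factor(category)

levels(f)      # the distinct labels
nlevels(f)     # how many distinct labels there are (3 here)
as.integer(f)  # the integer codes that str() prints after the levels
str(f)         # "Factor w/ 3 levels ..." followed by the codes

# Subsetting keeps the full level table unless it is dropped explicitly,
# so an empty subset still reports every level.
empty <- f[0]
nlevels(empty)              # still 3
nlevels(droplevels(empty))  # 0 once unused levels are dropped
```

So 'Factor w/ 46 levels' would mean the Category column has 46 distinct labels across the whole data set, regardless of how many rows the particular subset holds.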

---- jim holtman <jholtman_at_gmail.com> wrote:

> Is this something like what you were asking for? The output of a
> 'split' will be a list of the dataframe subsets for the categories you
> have specified.
>
> > x <- data.frame(g1=sample(LETTERS[1:2],30,TRUE),
> +     g2=sample(letters[1:2], 30, TRUE),
> +     g3=1:30)
> > y <- split(x, list(x$g1, x$g2))
> > str(y)
> List of 4
>  $ A.a:'data.frame': 7 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1
>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>   ..$ g3: int [1:7] 3 4 6 8 9 13 24
>  $ B.a:'data.frame': 7 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2
>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>   ..$ g3: int [1:7] 10 11 16 17 18 20 25
>  $ A.b:'data.frame': 6 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1
>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2
>   ..$ g3: int [1:6] 2 12 23 26 27 29
>  $ B.b:'data.frame': 10 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2
>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2 2 2 2 2
>   ..$ g3: int [1:10] 1 5 7 14 15 19 21 22 28 30
> > y
> $A.a
>    g1 g2 g3
> 3   A  a  3
> 4   A  a  4
> 6   A  a  6
> 8   A  a  8
> 9   A  a  9
> 13  A  a 13
> 24  A  a 24
>
> $B.a
>    g1 g2 g3
> 10  B  a 10
> 11  B  a 11
> 16  B  a 16
> 17  B  a 17
> 18  B  a 18
> 20  B  a 20
> 25  B  a 25
>
> $A.b
>    g1 g2 g3
> 2   A  b  2
> 12  A  b 12
> 23  A  b 23
> 26  A  b 26
> 27  A  b 27
> 29  A  b 29
>
> $B.b
>    g1 g2 g3
> 1   B  b  1
> 5   B  b  5
> 7   B  b  7
> 14  B  b 14
> 15  B  b 15
> 19  B  b 19
> 21  B  b 21
> 22  B  b 22
> 28  B  b 28
> 30  B  b 30
>
> > y[[2]]
>    g1 g2 g3
> 10  B  a 10
> 11  B  a 11
> 16  B  a 16
> 17  B  a 17
> 18  B  a 18
> 20  B  a 20
> 25  B  a 25
> >
> >
> >
>
>
> On Sat, Jul 12, 2008 at 8:51 PM, <rkevinburton_at_charter.net> wrote:
> > OK. Now I know that I am dealing with a data frame. One last question on this topic. a <- read.csv() gives me a dataframe. If I have 'c <- split(x, x$Category)', then what is returned by split in this case? c[1] seems to be OK but c[2] is not right in my mind. If I run ci <- split(nrow(a), a$Category). And then ci[1] seems to be the rows associated with the first category, c[2] is the indices/rows associated with the second category, etc. But this seems different than c[1], c[2], etc.
> >
> > Using the techniques below I can get the information on the categories. Now as an extra level of complexity there are SubCategories within each Category. Assume that the SubCategory names are not unique within the dataset so if I want the SubCategory data I need to retrieve the indices (or data) for the Category and SubCategory pair. In other words if I have a Category that ranges from 'A' to 'Z', it is possible that I might have a subcategory A a, A b (where a and b are the sub category names). I also might have B a, B b. I want all of the sub categories A a. NOT the subcategories a (because that might include B a which would be different). I am guessing that this will take more than a simple 'split'.
> >
> > Thank you.
> >
> > Kevin
> >
> > ---- Duncan Murdoch <murdoch_at_stats.uwo.ca> wrote:
> >> On 12/07/2008 3:59 PM, rkevinburton_at_charter.net wrote:
> >> > I am sorry but if read.csv returns a dataframe and a dataframe is like a matrix and I have a set of input like below and a[1,] gives me the first row, what is the second index? From what I read and your input I am guessing that it is the column number. So a[1,1] would return the DayOfYear column for the first row, right? What does a$DayOfYear return?
> >>
> >> a$DayOfYear would be the same as a[,1] or a[,"DayOfYear"], i.e. it would
> >> return the entire first column.
> >>
> >> Duncan Murdoch
> >>
> >> >
> >> > Thank you for your patience.
> >> >
> >> > Kevin
> >> >
> >> > ---- Duncan Murdoch <murdoch_at_stats.uwo.ca> wrote:
> >> >> On 12/07/2008 12:31 PM, rkevinburton_at_charter.net wrote:
> >> >>> I am using a simple R statement to read in the file:
> >> >>>
> >> >>> a <- read.csv("Sample.dat", header=TRUE)
> >> >>>
> >> >>> There is a lot of data but the first few lines look like:
> >> >>>
> >> >>> DayOfYear,Quantity,Fraction,Category,SubCategory
> >> >>> 1,82,0.0000390392720794458,(Unknown),(Unknown)
> >> >>> 2,78,0.0000371349173438631,(Unknown),(Unknown)
> >> >>> . . .
> >> >>> 71,2,0.0000009521773677913,WOMEN,Piratesses
> >> >>> 72,4,0.0000019043547355827,WOMEN,Piratesses
> >> >>> 73,3,0.0000014282660516870,WOMEN,Piratesses
> >> >>> 74,14,0.0000066652415745395,WOMEN,Piratesses
> >> >>> 75,2,0.0000009521773677913,WOMEN,Piratesses
> >> >>>
> >> >>> If I read the data in as above, the command
> >> >>>
> >> >>> a[1]
> >> >>>
> >> >>> results in the output
> >> >>>
> >> >>> [ reached getOption("max.print") -- omitted 16193 rows ]]
> >> >>>
> >> >>> Shouldn't this be the first row?
> >> >> No, the first row would be a[1,]. read.csv() returns a dataframe, and
> >> >> those are indexed with two indices to treat them like a matrix, or with
> >> >> one index to treat them like a list of their columns.
> >> >>
> >> >> Duncan Murdoch
> >> >>
> >> >>> a$Category[1]
> >> >>>
> >> >>> results in the output
> >> >>>
> >> >>> [1] (Unknown)
> >> >>> 4464 Levels: Tags ... WOMEN
> >> >>>
> >> >>> But
> >> >>>
> >> >>> a$Category[365]
> >> >>>
> >> >>> gives me:
> >> >>>
> >> >>> [1] 7 Plates (Dessert),Western\n120,5,0.0000023804434194784,7 Plates (Dessert)
> >> >>> 4464 Levels: Tags ... WOMEN
> >> >>>
> >> >>> There is something fundamental about either vectors or the read.csv command that I am missing here.
> >> >>>
> >> >>> Thank you.
> >> >>>
> >> >>> Kevin
> >> >>>
> >> >>> ---- jim holtman <jholtman_at_gmail.com> wrote:
> >> >>>> Please provide commented, minimal, self-contained, reproducible code,
> >> >>>> or at least a before/after of what your data would look like. Taking a
> >> >>>> guess at what you are asking, here is one way of doing it:
> >> >>>>
> >> >>>>
> >> >>>> > x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),a=1:20, b=runif(20))
> >> >>>> > x
> >> >>>>    cat  a          b
> >> >>>> 1    B  1 0.65472393
> >> >>>> 2    C  2 0.35319727
> >> >>>> 3    B  3 0.27026015
> >> >>>> 4    A  4 0.99268406
> >> >>>> 5    C  5 0.63349326
> >> >>>> 6    A  6 0.21320814
> >> >>>> 7    C  7 0.12937235
> >> >>>> 8    A  8 0.47811803
> >> >>>> 9    A  9 0.92407447
> >> >>>> 10   A 10 0.59876097
> >> >>>> 11   A 11 0.97617069
> >> >>>> 12   A 12 0.73179251
> >> >>>> 13   B 13 0.35672691
> >> >>>> 14   C 14 0.43147369
> >> >>>> 15   C 15 0.14821156
> >> >>>> 16   C 16 0.01307758
> >> >>>> 17   B 17 0.71556607
> >> >>>> 18   B 18 0.10318424
> >> >>>> 19   C 19 0.44628435
> >> >>>> 20   B 20 0.64010105
> >> >>>> > # create a list of the indices of the data grouped by 'cat'
> >> >>>> > split(seq(nrow(x)), x$cat)
> >> >>>> $A
> >> >>>> [1]  4  6  8  9 10 11 12
> >> >>>>
> >> >>>> $B
> >> >>>> [1]  1  3 13 17 18 20
> >> >>>>
> >> >>>> $C
> >> >>>> [1]  2  5  7 14 15 16 19
> >> >>>>
> >> >>>> > # or do you want the data
> >> >>>> > split(x, x$cat)
> >> >>>> $A
> >> >>>>    cat  a         b
> >> >>>> 4    A  4 0.9926841
> >> >>>> 6    A  6 0.2132081
> >> >>>> 8    A  8 0.4781180
> >> >>>> 9    A  9 0.9240745
> >> >>>> 10   A 10 0.5987610
> >> >>>> 11   A 11 0.9761707
> >> >>>> 12   A 12 0.7317925
> >> >>>>
> >> >>>> $B
> >> >>>>    cat  a         b
> >> >>>> 1    B  1 0.6547239
> >> >>>> 3    B  3 0.2702601
> >> >>>> 13   B 13 0.3567269
> >> >>>> 17   B 17 0.7155661
> >> >>>> 18   B 18 0.1031842
> >> >>>> 20   B 20 0.6401010
> >> >>>>
> >> >>>> $C
> >> >>>>    cat  a          b
> >> >>>> 2    C  2 0.35319727
> >> >>>> 5    C  5 0.63349326
> >> >>>> 7    C  7 0.12937235
> >> >>>> 14   C 14 0.43147369
> >> >>>> 15   C 15 0.14821156
> >> >>>> 16   C 16 0.01307758
> >> >>>> 19   C 19 0.44628435
> >> >>>>
> >> >>>>
> >> >>>> On Sat, Jul 12, 2008 at 3:32 AM, <rkevinburton_at_charter.net> wrote:
> >> >>>>> I have searched the archive and I could not find what I need so I will try to ask the question here.
> >> >>>>>
> >> >>>>> I read a table in (read.table)
> >> >>>>>
> >> >>>>> a <- read.table(.....)
> >> >>>>>
> >> >>>>> The table has column names like DayOfYear, Quantity, and Category.
> >> >>>>>
> >> >>>>> The values in the row for Category are strings (characters).
> >> >>>>>
> >> >>>>> I want to get all of the rows grouped by Category. The number of unique category names could be around 50. Say for argument sake the number of categories is exactly 50. Can I somehow get a vector of length 50 containing the rows corresponding to the category (another vector)? I realize I can access any row a[i]$Category (right?). But I want a vector containing the rows corresponding to each distinct Category name.
> >> >>>>>
> >> >>>>> Thank you.
> >> >>>>>
> >> >>>>> Kevin
> >> >>>>>
> >> >>>>> ______________________________________________
> >> >>>>> R-help_at_r-project.org mailing list
> >> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >> >>>>>
> >> >>>>
> >> >>>> --
> >> >>>> Jim Holtman
> >> >>>> Cincinnati, OH
> >> >>>> +1 513 646 9390
> >> >>>>
> >> >>>> What is the problem you are trying to solve?
> >> >>> ______________________________________________
> >> >>> R-help_at_r-project.org mailing list
> >> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> >>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
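
A possible sketch for the Category/SubCategory pair question quoted above, using a made-up data frame (a, all_pairs, used_pairs and idx are hypothetical names, and the toy values are not the poster's data). Splitting on a list of two factors yields one element per combination of their levels, including combinations that never occur, which is one way an element such as 'WOMEN.X MEN 3' can end up with 0 obs. Passing drop = TRUE to split() keeps only the pairs that are actually present.

```r
# Toy stand-in for the real file; all names and values are assumptions.
a <- data.frame(
  DayOfYear   = 1:6,
  Category    = factor(c("A", "A", "A", "B", "B", "B")),
  SubCategory = factor(c("a", "a", "b", "b", "b", "c"))
)

# One list element per Category x SubCategory combination, empty ones included.
all_pairs <- split(a, list(a$Category, a$SubCategory))
length(all_pairs)         # 6: 2 categories x 3 subcategories
nrow(all_pairs[["B.a"]])  # 0: this pair never occurs in the data

# drop = TRUE discards the combinations that have no rows.
used_pairs <- split(a, list(a$Category, a$SubCategory), drop = TRUE)
length(used_pairs)        # 4: A.a, A.b, B.b, B.c

# Rows for the pair (A, a) only; rows with (B, a) would not be mixed in.
used_pairs[["A.a"]]

# The same grouping, but returning row indices instead of the rows.
idx <- split(seq_len(nrow(a)), list(a$Category, a$SubCategory), drop = TRUE)
idx[["A.a"]]

# Single brackets return a one-element list; double brackets return the
# data frame stored inside it.
class(used_pairs[1])    # "list"
class(used_pairs[[1]])  # "data.frame"
```

The element names are built as Category.SubCategory; if the labels themselves contain dots, the sep argument of split() can be set to a different separator.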


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Sun 13 Jul 2008 - 09:31:23 GMT.
