# Re: [R] difficult data manipulation question

From: jim holtman <jholtman_at_gmail.com>
Date: Tue 04 Jul 2006 - 11:59:43 EST

Here is a modification of Gabor's solution that returns the data frame itself, keeping for each column name only the column with the most non-zero entries:

```
# test data
# read in header separately so R does not make column names unique
Lines <- "AAA BBB CCC DDD AAA BBB
0 2 1 2 0 0
2 3 7 6 0 1
1.5 4 9 9 6 0
1.0 6 10 11 3 3
"
DF <- read.table(textConnection(Lines), skip = 1)
names(DF) <- scan(textConnection(Lines), what = "", nlines = 1)

f <- function(x) x[which.max(colSums(DF[x]!=0))]
tapply(seq(DF), names(DF), f)
```
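The `tapply` one-liner above packs three steps together: group the column positions by name, count the non-zeros in each column, and pick the best position per group. A minimal sketch that spells those steps out is given here. The names `groups`, `nonzero`, and `best` are illustrative only, and the `text =` argument of `read.table` and `scan` is swapped in for `textConnection` purely to keep the sketch self-contained.

```r
# Rebuild the test data so this sketch runs on its own
Lines <- "AAA BBB CCC DDD AAA BBB
0 2 1 2 0 0
2 3 7 6 0 1
1.5 4 9 9 6 0
1.0 6 10 11 3 3
"
DF <- read.table(text = Lines, skip = 1)
names(DF) <- scan(text = Lines, what = "", nlines = 1, quiet = TRUE)

# Step 1: column positions grouped by (possibly duplicated) column name
groups <- split(seq_along(DF), names(DF))

# Step 2: number of non-zero entries in every column, by position
nonzero <- colSums(DF != 0)

# Step 3: for each name, keep the position of the column with the most
# non-zeros (which.max returns the first maximum, so ties go to the left)
best <- sapply(groups, function(idx) idx[which.max(nonzero[idx])])
best
```

For the test data, `best` should hold positions 1 to 4, one per distinct name, which agrees with the expectation in the original question that the first four columns are kept.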

```
#================added code================#
# compute the number of non-zeros in each column
MostZeros <- colSums(DF != 0)
# determine which column is the maximum
x.max <- lapply(unique(names(DF)), function(.name){
    .col <- which(names(DF) == .name)  # find columns of matching names
    .max <- which.max(MostZeros[.col]) # determine max
    .col[.max]  # return the column number of the max
})
DF[unlist(x.max)] # select only the unique maximums
```
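Since the original question was about a matrix rather than a data frame, the same idea can be wrapped in a small function that works on a matrix directly. This is only a sketch under stated assumptions: `pick_fullest_columns` is a hypothetical name, not something from this thread or from any package, and it assumes the matrix has column names and no `NA` entries.

```r
# Hypothetical helper: for each distinct column name, keep the column with
# the most non-zero entries (ties go to the leftmost such column).
pick_fullest_columns <- function(m) {
    stopifnot(is.matrix(m), !is.null(colnames(m)))
    nonzero <- colSums(m != 0)                       # non-zero count per column
    groups  <- split(seq_len(ncol(m)), colnames(m))  # positions sharing a name
    keep    <- vapply(groups,
                      function(idx) idx[which.max(nonzero[idx])],
                      integer(1))
    m[, sort(keep), drop = FALSE]                    # keep original column order
}

# The matrix from the original question, with duplicated column names
TEMP <- matrix(c(0,   2, 1,  2,  0, 0,
                 2,   3, 7,  6,  0, 1,
                 1.5, 4, 9,  9,  6, 0,
                 1.0, 6, 10, 11, 3, 3),
               nrow = 4, byrow = TRUE,
               dimnames = list(NULL,
                               c("AAA", "BBB", "CCC", "DDD", "AAA", "BBB")))

result <- pick_fullest_columns(TEMP)
result
```

Selecting by position rather than by name is what makes this safe with duplicated names: `TEMP[, "AAA"]` would only ever return the first matching column.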

On 7/3/06, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
>
> Try this:
>
> # test data
> # read in header separately so R does not make column names unique
> Lines <- "AAA BBB CCC DDD AAA BBB
> 0 2 1 2 0 0
> 2 3 7 6 0 1
> 1.5 4 9 9 6 0
> 1.0 6 10 11 3 3
> "
> DF <- read.table(textConnection(Lines), skip = 1)
> names(DF) <- scan(textConnection(Lines), what = "", nlines = 1)
>
> f <- function(x) x[which.max(colSums(DF[x]!=0))]
> tapply(seq(DF), names(DF), f)
>
> On 7/3/06, markleeds@verizon.net <markleeds@verizon.net> wrote:
> >
> > hi everyone :
> >
> > suppose i have a matrix in which some column names are identical so,
> > for example, TEMP
> >
> > "AAA", "BBB", "CCC", "DDD","AAA", "BBB"
> > 0 2 1 2 0 0
> > 2 3 7 6 0 1
> > 1.5 4 9 9 6 0
> > 1.0 6 10 11 3 3
> >
> >
> > I didn't even check yet whether identical column names are allowed
> > in a matrix but i hope they are.
> >
> > assuming that they are, then i would like to be able to take the matrix
> > and make a new matrix with the following requirements.
> >
> > 1) whenever there is a unique column name, just take that column for the
> > new matrix
> >
> > 2) whenever the column name is not unique, take the one
> > that has the most non zero elements ? ( in the case of
> > ties, i don't care which one is picked ).
> >
> > so, in this case, the resulting matrix would just be the first 4
> > columns.
> >
> > i realize ( or at least i think ) that
> > sum( TEMP[(TEMP[,columnname] !=0) ,columnname]) will give me the
> > number of non zero elements in a column with the name columnname
> > but how to use this to deal with the non uniqueness to solve my particular
> > problem is beyond me. plus, i think the command will
> > bomb because columnname will not always be unique ?
> > Thanks for any help. I realize this is not a trivial problem so I really
> > appreciate it.
> >
> > Mark
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > http://www.R-project.org/posting-guide.html
> >

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390 (Cell)
+1 513 247 0281 (Home)

What is the problem you are trying to solve?
