Yes, all of this is possible in R, and more.

You might find the which() function helpful for subsetting. You could
write a simple function to automate this. For graphing facilities, see
plot(), par(), postscript(), etc.
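
As a minimal sketch of the which() approach, assuming a data frame named
expr1 with the columns described in the question quoted below. The values
are simulated purely for illustration, and filter_rows is a hypothetical
helper name, not an existing R function:

```r
# Simulated stand-in for the real data; column names follow the question.
set.seed(1)
n <- 1000
expr1 <- data.frame(
  name        = paste0("gene", seq_len(n)),
  expression  = runif(n, 0, 5000),
  reliability = sample(c(50, 100), n, replace = TRUE),
  x           = runif(n, 0, 40),
  y           = runif(n, 0, 40),
  stringsAsFactors = FALSE
)

# which() turns a logical condition into row indices (NAs are dropped).
idx <- which(expr1$expression > 2000 & expr1$reliability == 100)
sel <- expr1[idx, ]

# Hypothetical helper: keep the rows of df where the condition holds.
filter_rows <- function(df, keep) df[which(keep), , drop = FALSE]

# Narrow the already-filtered rows further.
sel2 <- filter_rows(sel, sel$x > 20)

# Plot x against y for the first selection.
plot(sel$x, sel$y, xlab = "x", ylab = "y")
```

Because every step returns an ordinary data frame, the filters can be
applied in any order and the result of one step can feed the next.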

In my opinion, it might not be worth the time and effort to load the data
into MySQL if you only want to run a couple of queries. Plus, R has
excellent graphing facilities. If you really want to automate the
process, then Perl combined with gnuplot seems like a good option. The
choice depends on which software you are most comfortable with.

Another advantage of R is that it is an interactive language, so it is
great for exploratory analysis with minimal effort (unlike Excel, in
which you spend 90% of your time dragging the mouse and sorting the
data).
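
To give a rough idea of what such an interactive session could look like,
here is a sketch that picks the 100 rows with the largest x and then
narrows them by expression. It assumes the column names from the question
quoted below; the data are simulated and the object names top100 and high
are only illustrative:

```r
# Simulated stand-in for the real data.
set.seed(2)
expr1 <- data.frame(
  name       = paste0("gene", 1:500),
  expression = runif(500, 0, 5000),
  x          = runif(500, 0, 40),
  stringsAsFactors = FALSE
)

# order() gives the row indices sorted by decreasing x; keep the first 100.
top100 <- expr1[order(expr1$x, decreasing = TRUE)[1:100], ]

# Narrow further; the result is again an ordinary data frame.
high <- top100[which(top100$expression > 2000), ]

# Quick looks at the intermediate result.
summary(high$expression)
head(high$name)
```

Each line can be typed at the prompt and its result inspected before
deciding on the next filter.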

See the Bioconductor project, which focuses on genomic and expression
data and has many great functions specifically designed for microarray
data and the like. I doubt you will be able to find such a vast
collection of tools for free anywhere else.

Good luck.

Is it possible to use R as a data-mining tool? Here's the problem
I've got. I have a couple of data sets consisting of results from a cDNA
microarray experiment - the details of the biology don't really matter
here; the same theory applies to any other data-mining task (that's why
I thought it would be more appropriate to post this on r-user).

Each of these datasets consists of about 30000 rows by 20 to 30 columns.
Let's say that each row represents (very roughly speaking) a gene, and
the columns are details about its level of expression, the reliability
of the measurement, coordinates and so on.

The main objective here is to identify some genes (rows) according to
some criteria. To do that, I want to be able to selectively filter the
rows, graph some convenient variables, do some further filtering and so
on.

Let me take a more concrete example to make myself clear. Let's say that
I load a given dataset into a data frame, namely expr1. This data frame
would have the fields expr1$name, expr1$expression, expr1$reliability,
expr1$x, expr1$y and so on, containing, for instance, 26000 rows. Now
from these 26000 I'd like to select only those satisfying
expr1$expression > 2000 and expr1$reliability == 100, and plot expr1$x
against expr1$y for them. I'd then have a reduced subset of the original
dataset. Let's say now that I want to narrow my filter even more,
selecting only (among the ones I have already selected) those where
expr1$x > 20.

This would be done many times and in different orders. I'd like to be
able to, among those 26000 rows, take only the 100 with the greatest
values of expr1$x. And so on, many times, until I find a set of suitable
rows.

What is the proper way to do that using R, if any? I've played a little
with data frames (I could, for instance, use expr1$name[expr1$x > 20] to
get the names of those genes whose x > 20), but it seemed a little
clumsy. Should I keep trying to manipulate the data frame directly, or
should I perhaps save it in a MySQL database and do the queries using
RMySQL? Or maybe there is a better option?

I know that the things I've described are pretty easy to do using, for
instance, M$ Excel (I've seen them working in it). You just select
drop-down menus and filter the rows to your liking. But I would really
like to be able to accomplish this task using R and other open source
tools like MySQL, Perl, etc.

Thank you in advance,

