Re: [R] RJDBC vs RMySQL vs ???

From: Ralf B <ralf.bierig_at_gmail.com>
Date: Thu, 24 Jun 2010 11:15:19 -0400

Unfortunately, I have a lot of errors with RMySQL -- but that is another thread...

Ralf

On Thu, Jun 24, 2010 at 10:31 AM, James W. MacDonald <jmacdon_at_med.umich.edu> wrote:
> Hi Ralf,
>
> Ralf B wrote:
>>
>> Sorry for the lack of details. Since I run the same SQL first directly
>> on MySQL (using the MySQL Query Browser) and then again using R
>> through the RJDBC interface, I assume that I won't simply have a badly
>> constructed SQL query. However, just to clear possible objection, here
>> the SQL:
>>
>>
>> # Extracts vector of data points
>> getData <- function(connection) {
>>        queryStart <- "SELECT id1, id2, x, y FROM `mytable` "
>>        queryEnd <- ";"
>>        query <- paste(queryStart, " WHERE id1 IN(", id1s, ") AND id2 IN(",
>> id2s, ") AND subtype='TYPE1'", queryEnd)
>>        # execute query
>>        data =  dbGetQuery(connection, query)
>>        return(data)
>> }
>>
>> When running this method using either RGUI or the command line, I have
>> a runtime that reaches an incredible 10 minutes (!) for selecting
>> about 50k - 80k data points (which I consider not much) based on the
>> range of IDs I choose. The table size is about 5-8 million data points
>> total. The same SQL query directly executed in MySQL Query Browser
>> takes about 20 seconds which I would consider fine. There are no
>> indices created for any of the fields but since the query runs a lot
>> faster in the query browser I don't suspect this to be the main
>> reason.
>>
>> Any ideas?
>
> Well, the RJDBC rforge page has this note:
>
> Note: The current implementation of RJDBC is done entirely in R, no Java
> code is used. This means that it may not be extremely efficient and could be
> potentially sped up by using Java native code. However, it was sufficient
> for most tasks we tested. If you have performance issues with RJDBC, please
> let us know and tell us more details about your test case.
>
> And from my quick peek at the page, it appears RJDBC is designed to allow
> one to query any DBMS. Since RMySQL is MySQL-specific, it may be more
> efficient. Anyway, why don't you just try it and see?
>
> Best,
>
> Jim
>
>
>>
>> Best,
>> Ralf
>>
>>
>>
>>
>> On Wed, Jun 23, 2010 at 4:36 PM, James W. MacDonald
>> <jmacdon_at_med.umich.edu> wrote:
>>>
>>> Hi Ralf,
>>>
>>> Ralf B wrote:
>>>>
>>>> I am running a simple SQL SELECT statement that involvs 50k + data
>>>> points using R and the RJDBC interface. I am facing very slow response
>>>> times in both the RGUI and the R console. When running this SQL
>>>> statement directly in a SQL client I have processing times that are a
>>>> lot lot faster (which means that the SQL statement itself is not the
>>>> problem).
>>>>
>>>> Did any of you compare RJDBC vs RMySQL or is there a better, more
>>>> efficient way to extract large data from databases using R? Would you
>>>> recommend dumping data out completely into flat files and working with
>>>> flat files instead? I expected that this would not be such a problem
>>>> given that businesses maintain their data in DBs and R is supposed to
>>>> be good in shifting around data. Am I doing something wrong?
>>>
>>> Well, if you don't show people what you have done, how can anybody tell
>>> if
>>> you are doing something wrong or not?
>>>
>>> I have no experience with RJDBC, so cannot say anything about that.
>>> However,
>>> I have always found RMySQL to be speedy enough. As an example:
>>>
>>>> library(RMySQL)
>>>
>>> Loading required package: DBI
>>>>
>>>> con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user =
>>>> "genome", dbname = "hg18")
>>>> system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129
>>>> where
>>>> chrom='chr1' and chromStart between 1 and 1e8;")
>>>
>>> + )
>>>  user  system elapsed
>>>  7.95    0.06   38.59
>>>>
>>>> dim(a)
>>>
>>> [1] 508676      2
>>>
>>> So 40 seconds to get half a million records. Since this is via the
>>> internet,
>>> I have to imagine things would be much faster querying a local DB.
>>>
>>> But then you never say what constitutes 'slow' for you, so maybe this is
>>> slow as well?
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>> Ralf
>>>>
>>>> ______________________________________________
>>>> R-help_at_r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> Douglas Lab
>>> University of Michigan
>>> Department of Human Genetics
>>> 5912 Buhl
>>> 1241 E. Catherine St.
>>> Ann Arbor MI 48109-5618
>>> 734-615-7826
>>> **********************************************************
>>> Electronic Mail is not secure, may not be read every day, and should not
>>> be
>>> used for urgent or sensitive issues
>>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be
> used for urgent or sensitive issues
>



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu 24 Jun 2010 - 15:17:34 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 24 Jun 2010 - 15:50:35 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive