Re: [Rd] RSQLite indexing

From: Kasper Daniel Hansen <khansen_at_stat.berkeley.edu>
Date: Mon, 22 Oct 2007 18:07:18 -0700

On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:

>
> I am trying to use RSQLite for storing data and I need to create
> indexes on
> two variables in the table. It appears from searching the web that
> the CREATE
> INDEX operation in SQLite is relatively slow for large files, and
> this has been
> my experience as well.
>
> The two index variables are crossed. One has about 350,000 levels
> [yes, it's
> genetic association data]. The other will have about 4000 levels
> eventually,
> but is up to about 100 now. When the data were entered they were
> already ordered by this second index variable.
>
> Creating the index took about an hour on the 100-level, presorted
> variable and about 12 hours on the 350,000-level unsorted
> variable. I'm looking for advice on how to reduce this. Specifically
> 1/ would it be faster if the variable with more levels was the
> presorted one?
> 2/ would it be faster or slower if the index were created before
> adding all the data?
> 3/ are there any options that can be set to speed up the indexing?
>
> The SQLite database will not be the primary archive for the data,
> so optimizations that are risky in the case of power loss or
> hardware failure are still acceptable. Since Bioconductor seems to
> use SQLite a lot I'm hoping there is some simple solution.

I have not used RSQLite, but have some experience doing this thing for big sqlite databases using the command line client. Every database in sqlite has a number of parameters associated with it. You want to make sure that cache_size is at _most_ 2000 (yes, I know this is totally counterintuitive as it tells sqlite to use as little memory as possible). You also tell it to be non-synchronous. In sqlite the commands are
sqlite> pragma default_cache_size = 2000 sqlite> pragma_synchronous = off
You can test the setting of these parameters by just doing a sqlite> pragma default_cache_size
As far as I remember, cache size can only be set when you create the database. I have no idea how RSQlite handles it.

When I asked about this problem on the sqlite mailing list, the sqlite-creator said that this was a "locality of reference problem" and that it was being "worked on". And that I could search the archives for more info (which did not help me back then).

I don't know whether or not sorting helps.

Another thing to do is to check in what amount sqlite sits idle while doing I/O. It is probably impossible to avoid some idleness with such a thing, but it should of course be kept to a minimum.

It is true that some of the other databases are probably much faster at creating indices. But in the post-index analysis, sqlite is a really fast database, probably amongst the fastest there is. It does not do a good job a converting your queries into smart queries, but if you are doing something simple, it is blazingly fast with the right user options.

Kasper



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue 23 Oct 2007 - 01:09:59 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 25 Oct 2007 - 11:37:11 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.