Re: [Rd] RSQLite indexing

From: Jeffrey Horner <jeff.horner_at_vanderbilt.edu>
Date: Mon, 22 Oct 2007 18:03:54 -0500

Thomas Lumley wrote on 10/22/2007 04:54 PM:
> I am trying to use RSQLite for storing data and I need to create indexes on
> two variables in the table. It appears from searching the web that the CREATE
> INDEX operation in SQLite is relatively slow for large files, and this has been
> my experience as well.
>
> The two index variables are crossed. One has about 350,000 levels [yes, it's
> genetic association data]. The other will have about 4000 levels eventually,
> but is up to about 100 now. When the data were entered they were already ordered by this second index variable.
>
> Creating the index took about an hour on the 100-level, presorted variable and about 12 hours on the 350,000-level unsorted variable. I'm looking for advice on how to reduce this. Specifically
> 1/ would it be faster if the variable with more levels was the presorted one?
> 2/ would it be faster or slower if the index were created before adding all the data?
> 3/ are there any options that can be set to speed up the indexing?
>
> The SQLite database will not be the primary archive for the data, so optimizations that are risky in the case of power loss or hardware failure are still acceptable. Since Bioconductor seems to use SQLite a lot I'm hoping there is some simple solution.

Well then, while some may think it overkill to run a mysql database server for 1 to a handfull of clients, the benefits of using typed columns (rather than strings, which might have something to do with slow indexing) plus all the other goodies could save you the extra day or two speeding up SQLite indexing.

A project here at vanderbilt stores (mostly read-only) data in mysql 5.0 tables in myisam format. The tables have 3 to 5 million records each. Index creation speed for a varchar(9) column with 1,559,100 levels takes roughly 1.5 minutes, for example.

Jeff

-- 
http://biostat.mc.vanderbilt.edu/JeffreyHorner

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 22 Oct 2007 - 23:07:43 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 25 Oct 2007 - 11:37:11 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.