Re: [Rd] RSQLite indexing

From: Kasper Daniel Hansen <khansen_at_stat.berkeley.edu>
Date: Tue, 23 Oct 2007 13:09:03 -0700

On Oct 22, 2007, at 6:07 PM, Kasper Daniel Hansen wrote:

> On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:
>
>>
>> I am trying to use RSQLite for storing data and I need to create
>> indexes on
>> two variables in the table. It appears from searching the web that
>> the CREATE
>> INDEX operation in SQLite is relatively slow for large files, and
>> this has been
>> my experience as well.
>>
>> The two index variables are crossed. One has about 350,000 levels
>> [yes, it's
>> genetic association data]. The other will have about 4000 levels
>> eventually,
>> but is up to about 100 now. When the data were entered they were
>> already ordered by this second index variable.
>>
>> Creating the index took about an hour on the 100-level, presorted
>> variable and about 12 hours on the 350,000-level unsorted
>> variable. I'm looking for advice on how to reduce this. Specifically
>> 1/ would it be faster if the variable with more levels was the
>> presorted one?
>> 2/ would it be faster or slower if the index were created before
>> adding all the data?
>> 3/ are there any options that can be set to speed up the indexing?
>>
>> The SQLite database will not be the primary archive for the data,
>> so optimizations that are risky in the case of power loss or
>> hardware failure are still acceptable. Since Bioconductor seems to
>> use SQLite a lot I'm hoping there is some simple solution.
>
> I have not used RSQLite, but have some experience doing this thing
> for big sqlite databases using the command line client. Every
> database in sqlite has a number of parameters associated with it. You
> want to make sure that cache_size is at _most_ 2000 (yes, I know this
> is totally counterintuitive as it tells sqlite to use as little
> memory as possible). You also tell it to be non-synchronous. In
> sqlite the commands are
> sqlite> pragma default_cache_size = 2000
> sqlite> pragma_synchronous = off
> You can test the setting of these parameters by just doing a
> sqlite> pragma default_cache_size
> As far as I remember, cache size can only be set when you create the
> database. I have no idea how RSQlite handles it.
>
> When I asked about this problem on the sqlite mailing list, the
> sqlite-creator said that this was a "locality of reference problem"
> and that it was being "worked on". And that I could search the
> archives for more info (which did not help me back then).
>
> I don't know whether or not sorting helps.
>
> Another thing to do is to check in what amount sqlite sits idle while
> doing I/O. It is probably impossible to avoid some idleness with such
> a thing, but it should of course be kept to a minimum.
>
> It is true that some of the other databases are probably much faster
> at creating indices. But in the post-index analysis, sqlite is a
> really fast database, probably amongst the fastest there is. It does
> not do a good job a converting your queries into smart queries, but
> if you are doing something simple, it is blazingly fast with the
> right user options.

Let me just emphasize (based on Seth's email) that the default_cache_size settings only is for an indexing command. For actual operations on the database using the index, like a select statement, you would want to increase the cache_size to something bigger.

My comments are based on experience with a database with 315 *10^6 rows, and the importance of this is quite dependent on database size.

Kasper



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue 23 Oct 2007 - 20:33:30 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 25 Oct 2007 - 11:37:11 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.