Re: [R] Fast update of a lot of records in a database?

From: McGehee, Robert <Robert.McGehee@geodecapital.com>
Date: Sat 20 May 2006 - 06:13:20 EST


Hadley,
There are several reasons that running one large load and one large update would be significantly faster than thousands of individual updates.

First, the time it takes to execute a query does not grow linearly with the size of the query. That is, the statement "SELECT TOP 100 * FROM table" takes about 1.8 times as long as "SELECT TOP 10 * FROM table", not 10 times as long (based on estimated query costs for tables in my database on MS-SQL). The reason is that SQL servers are optimized to perform well on large queries, and many of the steps in a SQL operation are needlessly repeated when multiple UPDATE/SELECT statements are issued rather than one large UPDATE/SELECT.
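As a rough illustration of that per-statement overhead from the R side (the DSN, table, and column names below are made up for the sketch), you could time the two approaches with RODBC:

    library(RODBC)
    ch <- odbcConnect("mydsn")   # hypothetical DSN

    ## 1000 separate single-row queries: parsing, planning and the
    ## client/server round trip are paid 1000 times
    system.time(
        for (i in 1:1000)
            sqlQuery(ch, paste("SELECT col1 FROM bigtable WHERE id =", i))
    )

    ## one set-based query: those costs are paid once
    system.time(
        sqlQuery(ch, "SELECT col1 FROM bigtable WHERE id BETWEEN 1 AND 1000")
    )

    odbcClose(ch)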

For instance, in most SQL UPDATEs the time is spent primarily on 1) sorting the input, 2) performing a clustered index seek, and 3) performing a clustered index update. In a toy UPDATE example, the physical operation of sorting 2000 rows takes only about 6 times longer than sorting a little over 100 rows, and the clustered index seek and update take about 10 times longer. Both are far less than the roughly 20x we would expect from a linear row-by-row update. So even if it takes an additional 50% of the time to first load the data into a temporary table, there is still the opportunity for a large speed increase.
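To make the temp-table route concrete, here is a minimal sketch with RODBC against MS-SQL; the names are invented, and 'updates' is assumed to be an R data frame with columns id and col1 (some ODBC drivers are fussy about # temp tables, in which case an ordinary staging table works the same way):

    library(RODBC)
    ch <- odbcConnect("mydsn")   # hypothetical DSN

    ## bulk-load the new values into a temporary table
    sqlSave(ch, updates, tablename = "#updates", rownames = FALSE)

    ## one set-based UPDATE joining the big table to the temp table
    ## (T-SQL flavour of the UPDATE ... FROM idea quoted below)
    sqlQuery(ch, "
        UPDATE a
        SET    a.col1 = b.col1
        FROM   bigtable AS a
        JOIN   #updates AS b ON a.id = b.id")

    sqlQuery(ch, "DROP TABLE #updates")
    odbcClose(ch)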

A second reason to expect one large query to run faster is that it is much easier to process in parallel on multiple processors. That is, one processor can be joining the tables while a second simultaneously performs the clustered index work. With a series of single UPDATE statements, we are forced to run the operation serially. So if the above example were more complicated, we should expect an even larger cost saving per row.

From your example, a third reason is that with multiple updates the SQL server (at least my MS-SQL server) writes to the transaction log after every query (unless you wisely run in batch mode, as Duncan did with his BEGIN/COMMIT syntax), so significantly more I/O time is spent between UPDATE statements. For instance, the RODBC sqlUpdate function does not take advantage of transaction control, so to speed up long queries I've often resorted to sending over temporary tables (as I suggested here), stored procedures, or even data passed as XML tables.
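If you do stick with many individual UPDATE statements, here is a hedged sketch of batching them in one transaction through RODBC, so the log is flushed once per batch rather than once per statement; 'stmts' is assumed to be a character vector holding your UPDATE statements:

    library(RODBC)
    ch <- odbcConnect("mydsn")   # hypothetical DSN

    sqlQuery(ch, "BEGIN TRANSACTION")
    for (s in stmts)             # stmts: your UPDATE statements
        sqlQuery(ch, s)
    sqlQuery(ch, "COMMIT TRANSACTION")

    odbcClose(ch)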

Lastly, removing your indices before the update would likely only slow down the query. If the table is not indexed on the id key, the SQL server has to scan the entire table to find each matching id before it can update the row. It would be like searching through a dictionary that isn't in alphabetical order. That said, indices can slow things down when a significant number of rows is being inserted, since the table then has to be reindexed once the insert completes; but Duncan isn't doing that here.
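And if it turned out the id column was not indexed, a small sketch of adding an index before running the update (the index name is invented):

    library(RODBC)
    ch <- odbcConnect("mydsn")   # hypothetical DSN

    ## index the join key once so the UPDATE can seek instead of scanning
    sqlQuery(ch, "CREATE INDEX ix_bigtable_id ON bigtable (id)")

    odbcClose(ch)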

Best,
Robert

-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch
[mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of hadley wickham
Sent: Friday, May 19, 2006 3:20 PM
To: Duncan Murdoch
Cc: Robert.McGehee@geodecapital.com; r-help@stat.math.ethz.ch; br44114@gmail.com
Subject: Re: [R] Fast update of a lot of records in a database?

> put the updates into a temporary table called updates
>
> UPDATE bigtable AS a
> FROM updates AS b
> WHERE a.id = b.id
> SET a.col1 = b.col1

I don't think this will be any faster - why would creating a new table be faster than updating existing rows?

I've never had a problem with using large numbers of SQL update statements (in the order of hundreds of thousands) to update a table and having them complete in a reasonable time (a few minutes). How heavily indexed is the field you are updating? You may be able to get some speed improvements by turning off indices before the update and back on again afterwards (highly dependent on your database system though).

I would strongly suspect your bottleneck lies elsewhere (e.g. when generating the statements in R, or using ODBC to send them).

Hadley



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
