0 votes
2.2k views
in Technique[技术] by (71.8m points)

database - Updating a large table from another takes a long time (more than 6 hours) in Postgresql

I want to "update" 31 columns of table A (~ 22 million rows) from table B (~ 31 million rows). I created a temporary table with only the columns I was interested in (i.e. 31) and also in order to eliminate a second condition (number of rows is also reduced to ~ 22 million). Next, I created an index for two columns : "column1" of the temporary table and table A (used for join). Then I created a "multicolumn indexes" on all the columns of the temporary table except for the column "column1". Unfortunately, the update is still very long (I stopped the request after 6 hours of loading) I don't know much about the indexes. Is there any bad practice in what I have done? Are there solutions to considerably reduce the duration of the update?
Thanks in advance.
Part of the SQL script and an EXPLAIN plan of the update are below:

CREATE TEMP TABLE IF NOT EXISTS table_TMP AS (
        SELECT
        "column1",
        "column2",
        "column3",
        all other columns...
        FROM table_B
        WHERE one_column_of_table_B = TRUE
);
CREATE INDEX if not exists idx_table_TMP
    ON table_TMP USING btree
    ("column1")
    TABLESPACE pg_default;

-- multicolumn index on 31 columns
CREATE INDEX IF NOT EXISTS idx2_table_TMP ON table_TMP (
    "column2",
    "column3",
    all other columns...);

UPDATE table_A SET
    "table_A_column2" = tmp."column2",
    "table_A_column3" = tmp."column3",
    all other columns...
FROM table_TMP tmp
WHERE table_A.column1 = tmp.column1;

-- EXPLAIN of the update:
"Update on table_A  (cost=5798151.63..24891890.93 rows=21716146 width=4104)"
"  ->  Hash Join  (cost=5798151.63..24891890.93 rows=21716146 width=4104)"
"        Hash Cond: ((tmp.column1)::text = (table_A.column1)::text)"
"        ->  Seq Scan on table_TMP tmp  (cost=0.00..770061.46 rows=21716146 width=2342)"
"        ->  Hash  (cost=622658.39..622658.39 rows=22008739 width=1798)"
"              ->  Seq Scan on table_A  (cost=0.00..622658.39 rows=22008739 width=1798)"
question from: https://stackoverflow.com/questions/65904042/updating-a-large-table-from-another-takes-a-long-time-more-than-6-hours-in-pos


1 Answer

0 votes
by (71.8m points)

You should post new plans by editing your question rather than putting them in the comments; that way they can be formatted correctly.

Buckets: 16384  Batches: 4096  Memory Usage: 842kB

Your work_mem seems to be 1MB. That is low on any modern system, and especially so for this type of query. As a result of the low memory, each table is divided into 4096 batches and processed one batch at a time. In a read-only statement that should not be too bad, because all the IO is done sequentially (I have no idea why it took 3 hours, though; that does seem pretty long, and I can only guess that your hardware is really quite horrible). But for an UPDATE this is completely devastating, because it can't update the batched table files; it has to update the original table. It is essentially making 4096 passes over the table being updated.
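As a rough, illustrative sketch (the 256MB value is an assumption, not a recommendation; size it to your available RAM and the number of concurrent sessions), you could raise work_mem for the current session and check how the plan changes:

-- Check the current setting; the plan output above suggests it is around 1MB.
SHOW work_mem;

-- Raise it for this session only (illustrative value).
SET work_mem = '256MB';

-- Re-running EXPLAIN on the UPDATE should then show far fewer batches.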

Increasing work_mem should make both the select and the update faster, but the update may still be unusably slow. Making a new table from the select and then replacing the old table with the new one may be the best option.
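A minimal sketch of that approach, assuming column1 identifies rows uniquely and using illustrative names (table_A_new and table_A_old are hypothetical), could look like this:

-- Build the replacement table in a single pass instead of updating in place.
CREATE TABLE table_A_new AS
SELECT a."column1",
       COALESCE(tmp."column2", a."table_A_column2") AS "table_A_column2",
       COALESCE(tmp."column3", a."table_A_column3") AS "table_A_column3"
       -- all other columns...
FROM table_A a
LEFT JOIN table_TMP tmp ON a.column1 = tmp.column1;
-- NOTE: COALESCE keeps the old value where table_TMP has no match or holds NULL;
-- adjust this if NULLs from table_TMP are meant to overwrite the old values.

-- Recreate the indexes and constraints of table_A on table_A_new,
-- then swap the tables and drop the old one.
ALTER TABLE table_A RENAME TO table_A_old;
ALTER TABLE table_A_new RENAME TO table_A;
DROP TABLE table_A_old;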

If you do set enable_hashjoin=off; in your current session, it will force the planner to use some other join method. It might be interesting to see what that is, but there is not much reason to think it would be hugely faster.
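Purely as a diagnostic, that experiment could look like this (session-local; nothing is changed permanently, and the UPDATE shown is abbreviated from the question):

-- Disable hash joins for this session only and look at the alternative plan
-- (most likely a merge join or a nested loop using the index on column1).
SET enable_hashjoin = off;

EXPLAIN
UPDATE table_A SET "table_A_column2" = tmp."column2"   -- all other columns...
FROM table_TMP tmp
WHERE table_A.column1 = tmp.column1;

-- Restore the default afterwards.
RESET enable_hashjoin;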

