ssd – Optimizing a PostgreSQL ETL process

I am looking for suggestions on creating different versions of postgresql.conf: one optimized for writing, the other optimized for reading.

Context

I ETL dozens of very large (multi-GB) text files into PostgreSQL 11. The process is repeated twice a month for some data sets and monthly for others. I then replicate the local database tables to a cloud-based server so that subscribers can access them.

Step 01: Local Processing Server

This is an iMac 27" with 64 GB of RAM and a 1 TB SSD.
I run an instance of PostgreSQL 11 with its data directory stored on the SSD; this is the "Local SSD Instance".

In this stage, I take the raw text files and insert the records into PostgreSQL (using the fantastic pgLoader or my own custom Python code). Then there is significant post-processing: indexing fields, many UPDATEs, cross-referencing data, and geocoding records.
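To give a sense of the post-processing, here is a simplified sketch; the table and column names (parcels, zip_lookup, etc.) are hypothetical, not my actual schema:

-- Simplified illustration only; hypothetical table and column names.
CREATE INDEX idx_parcels_zip ON parcels (zip_code);

-- Cross-referencing: enrich each record from a lookup table.
UPDATE parcels p
SET    county_fips = z.county_fips
FROM   zip_lookup z
WHERE  z.zip_code = p.zip_code;

-- Crude geocoding step: attach a centroid from the same lookup table.
UPDATE parcels p
SET    centroid = z.centroid
FROM   zip_lookup z
WHERE  z.zip_code = p.zip_code;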

This process is completely hardware-bound, and I have written (what I think is) a write-optimized postgresql.conf file, shown below. My goal is to tune PostgreSQL for the fastest possible writes (minimal WAL, huge checkpoint intervals, etc.). Crash protection does not matter at this stage.

Also, during the ETL process, I use table creation commands with options such as the following (a sketch follows this list):

  • a low FILLFACTOR ("for heavily updated tables, smaller fillfactors are appropriate")
  • CREATE UNLOGGED TABLE
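A sketch of what such a table creation command looks like; the table name, columns, and fillfactor value here are illustrative only:

-- Unlogged (no WAL) staging table with a low fillfactor to leave room
-- for the heavy UPDATEs that follow; names and value are hypothetical.
CREATE UNLOGGED TABLE raw_records (
    id        bigint,
    zip_code  text,
    payload   jsonb
) WITH (fillfactor = 70);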

Is there a way to load the entire PostgreSQL instance into RAM and perform all the operations (raw data loading, indexing, field enrichment) in RAM, with the data files written back to the SSD once finished?
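For example, would something like a tablespace on a RAM-backed volume be the right approach? A minimal sketch of that idea, assuming a RAM disk is already mounted at the hypothetical path /Volumes/RAMDisk (and accepting that its contents are volatile):

-- Sketch only: /Volumes/RAMDisk is a hypothetical, already-mounted RAM disk.
-- The target directory must exist, be empty, and be owned by the postgres OS user.
CREATE TABLESPACE ram_ts LOCATION '/Volumes/RAMDisk/pg_ts';

-- Build the staging table (and later its indexes) on the RAM-backed tablespace,
-- then copy or dump the finished table back to SSD-backed storage afterwards.
CREATE UNLOGGED TABLE staging_records (
    id       bigint,
    payload  jsonb
) TABLESPACE ram_ts;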

Step 02: Transfer to the local repository

Once a large table has been loaded, indexed, and post-processed in Step 01 above, I replicate it to a second local instance of PostgreSQL 11 whose data directory is stored on an external 16 TB Thunderbolt 3 disk, the "Local Thunderbolt 3 Instance". My iMac does not have enough SSD storage to hold all the tables.

This is the command I use to transfer a table from the "Local SSD Instance" to the "Local Thunderbolt 3 Instance":

pg_dump postgresql://<user>:<password>@<host>:<port>/<db> -t '<prefix>_*' | psql postgresql://<user>:<password>@<host>:<port>/<db>

One problem here is that each index is recreated after the actual records are transferred, and rebuilding the indexes takes a long time. Is there another way to transfer tables from the "Local SSD Instance" to the "Local Thunderbolt 3 Instance" without having to re-index each table?

No, I do not want to transfer the tables directly from the "Local SSD Instance" to the remote server. I need a local repository of all the tables, since there will be multiple cloud-based instances, each loaded with different variations of the individual tables.

I read somewhere that transferring a table to a different server somehow "compacts" the data so that you do not have to run VACUUM on the target table. Is this correct? Or do I still need to run VACUUM after re-indexing the table?

Step 03: Replicate the data on a cloud-based server

I have configured a DigitalOcean Droplet with 48 GB of memory and a 960 GB disk plus 1.66 TB of attached storage, running Ubuntu 18.04.2 x64. It also runs PostgreSQL 11.

The postgresql.conf file here must be optimized for writing, since I need to transfer more than 100 tables totaling more than 1.5 TB of data as fast as possible. In this step, I take individual tables from the "Local Thunderbolt 3 Instance" and transfer them to the DigitalOcean Droplet.

The table transfer command I use is:

pg_dump postgresql://<user>:<password>@<host>:<port>/<db> -t '<prefix>_*' | psql postgresql://<user>:<password>@<host>:<port>/<db>

But again, each index is recreated after the actual records are transferred, and rebuilding the indexes takes a long time. How can I speed up the process of transferring tables to the remote server?

Step 04: Optimize the cloud-based server for reading

Once all the tables have been transferred, I need to change the postgresql.conf file to a read-optimized configuration.

The read-optimized configuration should provide the fastest possible response times for PostgreSQL queries. Any suggestions?

My questions:

  • How do I create a postgresql.conf file optimized for writing?
  • How do I create a postgresql.conf file optimized for reading?
  • How do I transfer tables to a different server without re-indexing them on the destination server?

Thank you.

# Postgresql.conf file optimized for writing:
#
# WAL:
wal_level = minimal
max_wal_size = 10GB
wal_buffers = 16MB
archive_mode = off
max_wal_senders = 0
#
# Memory:
shared_buffers = 1280MB
temp_buffers = 800MB
work_mem = 400MB
maintenance_work_mem = 640MB
dynamic_shared_memory_type = posix
#
autovacuum = off
bonjour = off
checkpoint_completion_target = 0.9
default_statistics_target = 1000
effective_cache_size = 4GB
#
synchronous_commit = off
#
random_page_cost = 1
seq_page_cost = 1