postgresql – Postgres use index to create index

I have a partial index on the rows where a certain column is not null, which is a very small percentage of the table. Thanks to this index, SELECT * FROM table WHERE column IS NOT NULL is incredibly fast (5 milliseconds), even though the table has hundreds of millions of rows.
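
To illustrate, the existing index and the one I want to add look something like this (table and column names are made up):

-- the existing partial index over the few non-null rows
CREATE INDEX idx_col_not_null
    ON big_table (col)
    WHERE col IS NOT NULL;

-- the second index I want over the same subset of rows
CREATE INDEX idx_other_col_partial
    ON big_table (other_col)
    WHERE col IS NOT NULL;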

If I want to create a second index on the same set of rows (where that same column is not null), how can I make Postgres use the first index that already exists to find those rows? Currently Postgres just scans the entire table to find them again, which takes many minutes. I can query all the rows in milliseconds, so why can't CREATE INDEX get them in the same way?

postgresql – find out why table insert is slow in postgres cluster

I recently created a PostgreSQL-HA 11 cluster using the Bitnami Helm charts.
There is one master and one replica, plus pg_pool and metrics.

I am using NFS as the file storage for the cluster. The NFS server is a VMware guest machine with plenty of memory and CPU and a fast HP hard disk drive, in the same broadcast domain on the same vSwitch with a 1 Gbps virtual port.

But inserting into the database with a batch script written in JavaScript is slow:
Insert 2000 records in 13.952 seconds.

The same script inserts very fast into the non-clustered PostgreSQL database on my local machine: 2000 records in less than a second.
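
For scale, how the rows are sent may matter a lot here: 2000 single-row INSERTs committed one at a time pay the commit latency 2000 times, while one transaction pays it once. A rough sketch of the batched form (table and column names are made up):

BEGIN;
INSERT INTO items (name, qty) VALUES   -- items(name, qty) is hypothetical
    ('a', 1),
    ('b', 2);
    -- ... remaining rows in the same statement or the same transaction ...
COMMIT;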

So, since I am not a DBA or anything like that, how can I find out why it is slow? Can the metrics, or something else, help?

postgresql – Postgres, trigger function, is it bad for performance?

I have added the following trigger function to one of my tables in my Postgres database.

CREATE OR REPLACE FUNCTION trigger_set_timestamp()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
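
For the column to be maintained, the function is attached to the table with a trigger along these lines (the table name here is a stand-in):

-- fire before every UPDATE so NEW.updated_at is rewritten on each change
CREATE TRIGGER set_timestamp
    BEFORE UPDATE ON my_table   -- my_table is a placeholder
    FOR EACH ROW
    EXECUTE FUNCTION trigger_set_timestamp();  -- EXECUTE PROCEDURE on Postgres < 11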

I am new to this, so I am wondering: is this going to slow down my database? Should I avoid trigger functions for performance reasons, and if so, what's the alternative?

(In this case, this was the only way I knew to make the updated_at column always record the time when any of the columns in the table changed.)

postgresql – Question about postgres commits to table in UNLOGGED mode

I have three tables: data, processed, and processing_status. The first one has rows coming in quite quickly. Every few minutes, I check the highest ID in data and compare it to processing_status. If enough new data points have come in, the new ones are processed and aggregated into the processed table, and processing_status is updated to the highest ID. Obviously, both of those happen in one transaction, so that if there's a crash everything stays sane and neither table is updated without the other.
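
A minimal sketch of that transaction (column names such as bucket, value, and last_id are made up):

BEGIN ISOLATION LEVEL REPEATABLE READ;  -- both statements see the same snapshot

-- aggregate the rows that arrived since the last processed ID
INSERT INTO processed (bucket, total)
SELECT d.bucket, sum(d.value)
FROM data d
JOIN processing_status ps ON d.id > ps.last_id
GROUP BY d.bucket;

-- advance the high-water mark
UPDATE processing_status
SET last_id = (SELECT max(id) FROM data);

COMMIT;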

I'd like to improve the performance of this with regard to the WAL bloat it generates. If I set processed and processing_status to UNLOGGED in Postgres, will I retain the property that when I commit the transaction, either both tables are updated or neither, even in the event of a crash?

I believe that if I set UNLOGGED, the difference is that after a crash both tables always revert, whereas in logged mode the WAL can be used to recover a few extra transactions that were in flight right at the crash. Is this understanding accurate?
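
For reference, flipping the two tables would just be the following (note that SET UNLOGGED rewrites each table):

ALTER TABLE processed SET UNLOGGED;
ALTER TABLE processing_status SET UNLOGGED;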

postgresql – How to restore a Postgres database on AMAZON RDS when the file is on another server?

I've spent around 2 days trying to simply restore my database to an AMAZON RDS server. The tutorial page from AMAZON is so bad that I can't get a simple task like this done.

I have a db_example.bak file on my local machine and on my other server, and I want to simply restore this database onto the Amazon RDS server.

Because of the size of my database, I tried to do this from the client side, but it would take the whole day to restore. I can connect to AMAZON RDS with psql, but I don't know how I can import a file that doesn't exist there.

I don't know if this is a noob question, but I will try everything to get this simple thing done, and I think my doubt may be someone else's doubt too.

Observation: I can only find tutorials for SQL SERVER, not for POSTGRES.

postgresql – Postgres NOT IN performance

I’m currently doing some exploratory work with Postgres 12, and have bumped into an issue I want to try and understand more about. My background is primarily Microsoft SQL Server, so I’m looking at where various differences lie between the two.

Now, in SQL Server the query optimizer is quite good at seeing similar patterns and generating equivalent plans for a given query. In Postgres, however, I'm getting different plans. That in itself isn't the issue, and I have no problem with it; however, one of the execution plans incurs a large performance hit, so I don't know if I've missed something obvious, or whether this is just the nature of Postgres (if so, fine, it's something we can work with).

The questions I'm trying to answer are:

  1. Is this behaviour expected?
  2. Is NOT IN generally best avoided with Postgres where data sizes can't be guaranteed?
  3. Have I missed something obvious with Postgres that would resolve the problem with the query?

I’ve created a basic table that I’ve loaded with various word lists (the scrabble TWL, SOWPODS, and a few others sourced from this repository on github):

CREATE TABLE wordlist
(
    src character varying(7) COLLATE pg_catalog."default" NOT NULL,
    word character varying(100) COLLATE public.utf8_uk_ci_ai NOT NULL,
    wordlength integer GENERATED ALWAYS AS (char_length((word)::text)) STORED,
    seqid integer NOT NULL GENERATED ALWAYS AS IDENTITY (INCREMENT 1 START 1)
);

I’ve got the following indexes on the table:

CREATE INDEX ix_words ON wordlist (word);
CREATE INDEX ix_words2 ON wordlist (word) INCLUDE (src);
CREATE INDEX ix_words3 ON wordlist (src) INCLUDE (word);
CREATE INDEX ix_srcword ON wordlist (src, word);
CREATE INDEX ix_wordsrc ON wordlist (word, src);

This is more due to experimentation/poking than any actual plan, so apologies if they don't make immediate sense. I appreciate that ix_words, ix_words2 and ix_words3 are likely superfluous to requirements given the two composite indexes.

The collation utf8_uk_ci_ai is defined as:

CREATE COLLATION public.utf8_uk_ci_ai
    (LC_COLLATE = 'en_GB@colStrength=primary', LC_CTYPE = 'en_GB@colStrength=primary');

The breakdown of word counts (SELECT src, COUNT(*) FROM wordlist GROUP BY src) in the table looks like this:

[image: word counts by source]

The three queries I've been comparing are these. They're all intended to return the same dataset, just in different ways; the first two run quickly (EXCEPT and NOT EXISTS), and it's the third one (NOT IN) that has problems:

select word from wordlist where src = 'SOWPODS'
except
select word from wordlist where src = 'TWL'

select a.word 
    from wordlist a 
    where a.src = "https://dba.stackexchange.com/SOWPODS" 
        and not exists (select 1 from wordlist b where b.src = 'TWL' and a.word = b.word)

select a.word 
    from wordlist a 
    where a.src = "https://dba.stackexchange.com/SOWPODS" 
        and a.word not in (select word from wordlist b where b.src = 'TWL')

The first two complete in a second or so; the third, however, still hadn't completed after several minutes. I killed the query, as I don't know how long it would have taken.

The EXPLAIN for the third reads as follows:

Gather  (cost=13054.03..1667897336.79 rows=136620 width=13)
  Workers Planned: 2
  ->  Parallel Bitmap Heap Scan on wordlist a  (cost=12054.03..1667882674.79 rows=56925 width=13)
        Recheck Cond: ((src)::text = 'SOWPODS'::text)
        Filter: (NOT (SubPlan 1))
        ->  Bitmap Index Scan on ix_words3  (cost=0.00..7261.72 rows=273239 width=0)
              Index Cond: ((src)::text = 'SOWPODS'::text)
        SubPlan 1
          ->  Materialize  (cost=4758.15..33614.09 rows=177254 width=13)
                ->  Bitmap Heap Scan on wordlist b  (cost=4758.15..31861.82 rows=177254 width=13)
                      Recheck Cond: ((src)::text = 'TWL'::text)
                      ->  Bitmap Index Scan on ix_words3  (cost=0.00..4713.84 rows=177254 width=0)
                            Index Cond: ((src)::text = 'TWL'::text)

Graphically, it looks like this:

[image: graphical execution plan]

However, if I do the same query with a smaller list in the NOT IN (src = 'ME', which has 1,000 rows rather than the 178,000 of src = 'TWL'), I get a much faster result and a different (and simpler) plan:

select a.word 
    from good.wordlist a 
    where a.src = "https://dba.stackexchange.com/SOWPODS" 
        and a.word not in (select word from good.wordlist b where b.src = "https://dba.stackexchange.com/ME")

EXPLAIN gives:

Bitmap Heap Scan on wordlist a  (cost=7304.08..36290.67 rows=136620 width=13)
  Recheck Cond: ((src)::text = 'SOWPODS'::text)
  Filter: (NOT (hashed SubPlan 1))
  ->  Bitmap Index Scan on ix_words3  (cost=0.00..7261.72 rows=273239 width=0)
        Index Cond: ((src)::text = 'SOWPODS'::text)
  SubPlan 1
    ->  Index Only Scan using ix_words3 on wordlist b  (cost=0.43..8.20 rows=1 width=13)
          Index Cond: (src = 'ME'::text)

which graphically looks like:

[image: graphical execution plan]

So it goes straight to one of the indexes for the smaller list, but ignores them for the larger one. Now, in MSSQL all three of these queries give the same execution plan; as a result, it has given us more flexibility in how we write SQL, as we can trust (to a point) that the query optimizer will do the right thing.
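
For comparison, a fourth way to phrase the same anti-join is a LEFT JOIN filtered to the unmatched rows. Since word is declared NOT NULL, it returns the same rows as the NOT IN version, and Postgres can plan it like the NOT EXISTS one; NOT IN has special semantics when the subquery can yield NULLs, which is what prevents that transformation:

select a.word
    from wordlist a
    left join wordlist b
        on b.src = 'TWL'
        and b.word = a.word
    where a.src = 'SOWPODS'
        and b.word is null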

With Postgres, it feels like the approach we need to take to writing SQL needs to be different, hence the three questions I posed at the start of this:

  1. Is this behaviour expected?
  2. Is NOT IN generally best avoided with Postgres where data sizes can't be guaranteed?
  3. Have I missed something obvious with Postgres that would resolve the problem with the query?

Apologies for the length of this question, felt a truncated version of it wasn’t working (I did try).

postgresql – Get partial results in Postgres in case of timeout?

Is it possible to send a query that does a sequential scan of the table – something like

SELECT content
  FROM some_big_table
 WHERE <some criteria(content) are met>
 LIMIT <...>;

with a time limit (e.g. 1000 milliseconds), so that the query terminates after the specified period and returns whatever rows it has found by that time as the result of the scan?

I’m not worried about the result being predictable – I just need to give the user the first matches the server can find, if any.

SET statement_timeout does not help here as it cancels the query altogether if it doesn’t execute within the time limit, while I need the partial results.
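
One client-side workaround is to open the query as a cursor and fetch it in batches, with the application watching the clock and stopping once the budget is spent, keeping whatever rows have arrived. A sketch (the cursor name is made up):

BEGIN;

DECLARE partial_cur CURSOR FOR
    SELECT content
    FROM some_big_table
    WHERE <some criteria(content) are met>;

-- the application loops on FETCH, checking elapsed time after each batch
-- and stopping once ~1000 ms have passed
FETCH 100 FROM partial_cur;

CLOSE partial_cur;
COMMIT;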

tls – Connecting to a Postgres Database over SSL using Sockets in Python

I am trying to connect to a Postgres database using sockets, enforcing a particular TLS version from the client, in order to verify that the database does not accept connections from a client that uses an older TLS version such as TLSv1.1. The connection fails during the handshake with the following error:

python test2.py

Traceback (most recent call last):
  File "test2.py", line 12, in <module>
    ssl_version=ssl.PROTOCOL_TLSv1_2)
  File "<>/python3.6/lib/python3.6/ssl.py", line 1232, in get_server_certificate
    with context.wrap_socket(sock) as sslsock:
  File "<>/python3.6/lib/python3.6/ssl.py", line 407, in wrap_socket
    _context=self, _session=session)
  File "<>/python3.6/lib/python3.6/ssl.py", line 817, in __init__
    self.do_handshake()
  File "<>/python3.6/lib/python3.6/ssl.py", line 1077, in do_handshake
    self._sslobj.do_handshake()
  File "<>/python3.6/lib/python3.6/ssl.py", line 689, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:852)

The following is a snippet of the code:

import socket
import ssl

hostname = "<DB_Endpoint>"  # placeholder for the real database endpoint

# create_default_context() negotiates the highest TLS version both sides
# support; on Python 3.6, pinning the client to a single version means
# building the context from a fixed protocol instead, e.g.:
#   context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_1)
context = ssl.create_default_context()

with socket.create_connection((hostname, 8200)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        print(ssock.version())

# Note: a stock PostgreSQL server does not start TLS immediately; the client
# must first send an SSLRequest message, so wrapping the raw socket like this
# can fail with SSLEOFError during the handshake.

PS: I am just trying to figure out a way to test whether the Postgres server rejects a connection from a client which only has an older TLS version such as TLSv1.1 available.

postgresql – Postgres sometimes uses sequential scan instead of an index only scan for data that doesn’t exist

We are using Postgres 12 and we have a simple table like this:

> \d live_event_segmentation
                  Table "public.live_event_segmentation"
   Column   |            Type             | Collation | Nullable | Default 
------------+-----------------------------+-----------+----------+---------
 login_id   | bigint                      |           | not null | 
 event_id   | text                        |           | not null | 
 expires_at | timestamp without time zone |           |          | 
Indexes:
    "live_event_segmentation_pkey" PRIMARY KEY, btree (login_id, event_id)
    "live_event_segmentations_event_id_idx" btree (event_id)

Size of the table is:

> \dt+ live_event_segmentation
                              List of relations
 Schema |          Name           | Type  |   Owner    |  Size  | Description 
--------+-------------------------+-------+------------+--------+-------------
 public | live_event_segmentation | table | liveevents | 171 MB | 
(1 row)

and the whole DB fits into RAM.

This table has this distribution of event_id:

> select event_id, count(*) from live_event_segmentation group by 1;
        event_id        | count  
------------------------+--------
 euro2020Test           |     67
 fevent20               | 164310
 summer2020Test         |      9
(3 rows)

And our app is executing this query a few times a second:

explain analyze select 1 as "one" where exists (select 1 as "one" from "public"."live_event_segmentation" where "public"."live_event_segmentation"."event_id" = 'summer2020' limit 1);
                                                                                 QUERY PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=2.44..2.45 rows=1 width=4) (actual time=0.023..0.023 rows=0 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Index Only Scan using live_event_segmentations_event_id_idx on live_event_segmentation  (cost=0.42..2.44 rows=1 width=0) (actual time=0.022..0.022 rows=0 loops=1)
           Index Cond: (event_id = 'summer2020'::text)
           Heap Fetches: 0
 Planning Time: 0.106 ms
 Execution Time: 0.040 ms
(8 rows)

When we run this query from psql it always uses an index-only scan, but when our Java app runs the same query it often uses a sequential scan. We saw this in the logs by using the auto_explain extension and setting log_min_duration_statement=20 (so that we could see the query parameters, which auto_explain doesn't display). The only difference between psql and our app is that the app uses a prepared statement, and indeed when we tried that in psql it would sometimes use a sequential scan:

> prepare p1(text, int) AS select 1 as "one" where exists (select 1 as "one" from "public"."live_event_segmentation" where "public"."live_event_segmentation"."event_id" = $1 limit $2);
PREPARE
> explain analyze execute p1('summer2020', 1);
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.29..0.30 rows=1 width=4) (actual time=67.093..67.093 rows=0 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..2396.72 rows=8129 width=4) (actual time=67.090..67.090 rows=0 loops=1)
           ->  Seq Scan on live_event_segmentation  (cost=0.00..23968.34 rows=81294 width=4) (actual time=67.087..67.087 rows=0 loops=1)
                 Filter: (event_id = $1)
                 Rows Removed by Filter: 163728
 Planning Time: 0.138 ms
 Execution Time: 67.171 ms
(9 rows)

What's important here is that the event summer2020 doesn't exist in the table. We noticed that only queries selecting an event_id that doesn't exist sometimes use a sequential scan.

Table live_event_segmentation grows constantly at about 1 row per second, which is very slow. Autovacuum is working and the table is correctly analyzed. Also, this happens all the time: we constantly get sequential scans on this table because of the query above, not just in some strange spikes.
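
One knob that seems relevant here: Postgres 12 added plan_cache_mode, which can stop a prepared statement from switching to a generic plan and force a custom plan built with the actual parameter values on each execution:

SET plan_cache_mode = force_custom_plan;  -- per session; the default is 'auto'

PREPARE p2(text, int) AS
    select 1 as "one"
    where exists (select 1 as "one"
                  from "public"."live_event_segmentation"
                  where "public"."live_event_segmentation"."event_id" = $1
                  limit $2);
EXECUTE p2('summer2020', 1);  -- now planned with the real value each time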

Upgrade Postgres 9.6 to 12

If I upgrade my Postgres from 9.6 to 12, could I run into issues with my queries?

I'm afraid my existing code will have problems if I upgrade my version.