Master slave replication – PostgreSQL – Replication process of clogged replica frequently (epoll_wait takes too long)

We have the configuration of Master Slave with asynchronous transmission replication (replication slots). Frequently, the whole Postgres process of our replica seems to do nothing (or is it blocked?) Since replication_slots is used for replication, this has been affecting the master with pg_xlog being left behind.

postgres = # select version ();
version
-------------------------------------------------- -------------------------------------------------- ----------------------------------
PostgreSQL 9.6.11 in x86_64-pc-linux-gnu (Ubuntu 9.6.11-1.pgdg18.04 + 1), compiled by gcc (Ubuntu 7.3.0-27ubuntu1 ~ 18.04) 7.3.0, 64 bits
(1 row)

According to the Postgres record of the slave, Wal's last receipt and the next one are processed restart point is not starting No records after this. Simply stuck. After rebooting, the replica begins to catch up, but again it crashes after some time.

2018-11-23 12:47:26,625 UTC [4193] REGISTRATION: start restart point: xlog
2018-11-23 12:49:48,258 UTC [4193] REGISTRATION: full restart point: wrote 10057 buffers (10.3%); 0 transaction log files added, 198 deleted, 0 recycled; write = 130.568 s, sync = 0.702 s, total = 141.633 s; synchronization files = 1874, longer = 0.205 s, average = 0.000 s; distance = 5435211 kB, estimate = 5435211 kB
2018-11-23 12:49:48,258 UTC [4193] REGISTRATION: recovery restart point in 1143 / 8B16D800
2018-11-23 12:49:48,258 UTC [4193] DETAIL: the last completed transaction was at the time of registration 2018-11-23 07: 00: 19.510771 + 00
2018-11-23 12: 51: 04.092 UTC [4193] REGISTRATION: start restart point: xlog
2018-11-23 12:55:34,744 UTC [4193] REGISTRATION: full restart point: wrote 3086 buffers (3.2%); 0 transaction log files added, 120 deleted, 212 recycled; writing = 269,875 s, timing = 0.029 s, total = 270,652 s; synchronization files = 4955, the longest = 0.000 s, average = 0.000 s; distance = 3238852 kB, estimate = 5215575 kB
2018-11-23 12:55:34,744 UTC [4193] REGISTRATION: recovery restart point at 1144 / 50C5EA00
2018-11-23 12:55:34,744 UTC [4193] DETAIL: the last completed transaction was at the time of registration 2018-11-23 07: 04: 51.819643 + 00

I analyzed trail of the Slave Postgres recovery process during active and stuck mode, and it appears that during stuck mode, the recovery process is performing minimal (or stuck) activity.

perf trace -p pid of the slave recovery process

                                    ? (?): ... [continued]: epoll_wait ()) = 0
0.048 (0.023 ms): close (fd: 992                                 ) = 0
0.100 (0.022 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
0.146 (0.024 ms): epoll_create1 (marks: 524288) = 992
0.192 (0.025 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
0.240 (0.023 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
0.284 (0.020 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
0.323 (5003.804 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
5004,184 (0.023 ms): close (fd: 992                                 ) = 0
5004.233 (0.023 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
5004,281 (0.025 ms): epoll_create1 (marks: 524288) = 992
5004,329 (0.025 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
5004,374 (0.021 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
5004,416 (0.033 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
5004,465 (5005,223 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
10009.736 (0.025 ms): close (fd: 992                                 ) = 0
10009.788 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
10009.838 (0.025 ms): epoll_create1 (marks: 524288) = 992
10009.887 (0.075 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
10010.004 (0.025 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
10010.114 (0.032 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
10010.205 (5005.241 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
15015.499 (0.024 ms): close (fd: 992                                 ) = 0
15015.554 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
15015.604 (0.028 ms): epoll_create1 (marks: 524288) = 992
15015.692 (0.104 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
15015.824 (0.023 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
15015.871 (0.025 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
15015.919 (5000.618 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
20016.588 (0.022 ms): close (fd: 992                                 ) = 0
20016.641 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
20016.691 (0.027 ms): epoll_create1 (marks: 524288) = 992
20016.744 (0.022 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
20016.787 (0.021 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
20016.831 (0.020 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
20016.867 (5005.231 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
25022.142 (0.023 ms): close (fd: 992                                 ) = 0
25022.196 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
25022.246 (0.028 ms): epoll_create1 (marks: 524288) = 992
25022.296 (0.023 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
25022.340 (0.023 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
25022.388 (0.020 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
25022,428 (5005,228 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
30027.705 (0.025 ms): close (fd: 992                                 ) = 0
30027.761 (0.025 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
30027,811 (0.028 ms): epoll_create1 (marks: 524288) = 992
30027.864 (0.022 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
30027.907 (0.020 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
30027.949 (0.022 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
30027.988 (5005.233 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
35033.275 (0.025 ms): close (fd: 992                                 ) = 0
35033.333 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
35033.382 (0.029 ms): epoll_create1 (marks: 524288) = 992
35033.435 (0.077 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
35033.588 (0.046 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
35033.663 (0.023 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
35033.762 (5005.255 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
40039.070 (0.026 ms): close (fd: 992                                 ) = 0
40039.129 (0.025 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
40039.179 (0.031 ms): epoll_create1 (marks: 524288) = 992
40039.233 (0.078 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
40039.384 (0.047 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
40039.464 (0.023 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
40039.562 (5005.258 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
45044.876 (0.025 ms): close (fd: 992                                 ) = 0
45044.933 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
45044.984 (0.029 ms): epoll_create1 (marks: 524288) = 992
45045.078 (0.118 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
45045.225 (0.091 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
45045.350 (0.027 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
45045.441 (3034.775 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = -1 EINTR System call interrupted
48080.277 (0.028 ms): write (fd: 13, buf: 0x7fff2ec83737, account: 1) = 1
48080.369 (0.021 ms): gettimeofday (tv: 0x7fff2ec83d60) = 0
48080.413 (0.024 ms): close (fd: 992                                 ) = 0
48080.468 (0.023 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
48080.515 (0.023 ms): gettimeofday (tv: 0x7fff2ec83e00) = 0
48080.561 (0.024 ms): lseek (fd: 3, offset: 2490368, so: SET) = 2490368
48080.607 (0.036 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48080.675 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48080.729 (0.045 ms): lseek (fd: 652, where: END) = 300539904
48080.798 (0.025 ms): lseek (fd: 652, where: END) = 300539904
48080.895 (0.053 ms): kill (pid: 11330 (postgres), sig: USR1) = 0
48081.211 (0.132 ms): writing (fd: 652, buf: 0x7f5d2e0cc500, account: 8192) = 8192
48081.377 (0.025 ms): lseek (fd: 3, offset: 2498560, so: SET) = 2498560
48081.425 (0.032 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48081.498 (0.043 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48081.576 (0.024 ms): lseek (fd: 652, where: END) = 300548096
48081.624 (0.024 ms): lseek (fd: 652, where: END) = 300548096
48081.687 (0.122 ms): writing (fd: 652, buf: 0x7f5d2e0ce500, account: 8192) = 8192
48081.853 (0.025 ms): lseek (fd: 3, offset: 2506752, where: SET) = 2506752
48081.902 (0.041 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48081.985 (0.025 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48082.036 (0.049 ms): lseek (fd: 652, where: END) = 300556288
48082.117 (0.038 ms): lseek (fd: 652, where: END) = 300556288
48082.193 (0.064 ms): writing (fd: 652, buf: 0x7f5d2e0d0500, account: 8192) = 8192
48082.286 (0.024 ms): lseek (fd: 3, offset: 2514944, where: SET) = 2514944
48082.353 (0.039 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48082.419 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48082.467 (0.024 ms): lseek (fd: 652, where: END) = 300564480
48082.540 (0.034 ms): lseek (fd: 652, where: END) = 300564480
48082.613 (0.057 ms): writing (fd: 652, buf: 0x7f5d2e0d2500, account: 8192) = 8192
48082.723 (0.044 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48082.801 (0.040 ms): lseek (fd: 652, where: END) = 300572672
48082.874 (0.025 ms): lseek (fd: 652, where: END) = 300572672
48082.936 (0.061 ms): writing (fd: 652, buf: 0x7f5d2e0d4500, account: 8192) = 8192
48083.026 (0.024 ms): lseek (fd: 3, offset: 2523136, when: SET) = 2523136
48083.073 (0.054 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48083.156 (0.040 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48083.229 (0.033 ms): lseek (fd: 652, where: END) = 300580864
48083.295 (0.038 ms): lseek (fd: 652, where: END) = 300580864
48083.385 (0.085 ms): writing (fd: 652, buf: 0x7f5d2e0d6500, account: 8192) = 8192
48083.507 (0.045 ms): lseek (fd: 3, offset: 2531328, where: SET) = 2531328
48083.597 (0.039 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48083.674 (0.033 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48083.754 (0.041 ms): lseek (fd: 652, where: END) = 300589056
48083.839 (0.048 ms): lseek (fd: 652, where: END) = 300589056
48083.955 (0.086 ms): writing (fd: 652, buf: 0x7f5d2e0d8500, account: 8192) = 8192
48084.078 (0.096 ms): lseek (fd: 3, offset: 2539520, where: SET) = 2539520
48084.213 (0.071 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48084.355 (0.041 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48084.443 (0.052 ms): lseek (fd: 652, where: END) = 300597248
48084.528 (0.056 ms): lseek (fd: 652, where: END) = 300597248
48084.642 (0.103 ms): writing (fd: 652, buf: 0x7f5d2e0da500, account: 8192) = 8192
48084.809 (0.048 ms): lseek (fd: 3, offset: 2547712, so: SET) = 2547712
48084.901 (0.048 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48085.015 (0.036 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48085.095 (0.032 ms): lseek (fd: 652, where: END) = 300605440
48085.164 (0.041 ms): lseek (fd: 652, where: END) = 300605440
48085.266 (0.103 ms): writing (fd: 652, buf: 0x7f5d2e0dc500, account: 8192) = 8192
48085.419 (0.047 ms): lseek (fd: 3, offset: 2555904, where: SET) = 2555904
48085.505 (0.064 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48085.612 (0.036 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48085.707 (0.055 ms): lseek (fd: 652, where: END) = 300613632
48085.805 (0.052 ms): lseek (fd: 652, where: END) = 300613632
48085.893 (0.093 ms): writing (fd: 652, buf: 0x7f5d2e0de500, account: 8192) = 8192
48086.027 (0.048 ms): lseek (fd: 3, offset: 2564096, where: SET) = 2564096
48086.120 (0.071 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48086.247 (0.048 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48086.342 (0.040 ms): lseek (fd: 652, where: END) = 300621824
48086.421 (0.041 ms): lseek (fd: 652, where: END) = 300621824
48086.499 (0.103 ms): writing (fd: 652, buf: 0x7f5d2e0e0500, account: 8192) = 8192
48086.641 (0.053 ms): lseek (fd: 3, offset: 2572288, where: SET) = 2572288
48086.740 (0.060 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48086.871 (0.041 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48086.966 (0.026 ms): lseek (fd: 652, where: END) = 300630016
48087.025 (0.031 ms): lseek (fd: 652, where: END) = 300630016
48087.135 (0.084 ms): writing (fd: 652, buf: 0x7f5d2e0e2500, account: 8192) = 8192
48087.281 (0.055 ms): lseek (fd: 3, offset: 2580480, so: SET) = 2580480
48087.392 (0.061 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48087.516 (0.040 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48087.594 (0.043 ms): lseek (fd: 652, where: END) = 300638208
48087.669 (0.048 ms): lseek (fd: 652, where: END) = 300638208
48087.768 (0.102 ms): writing (fd: 652, buf: 0x7f5d2e0e4500, account: 8192) = 8192
48087.915 (0.057 ms): lseek (fd: 3, offset: 2588672, where: SET) = 2588672
48088.026 (0.059 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48088.156 (0.041 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48088.244 (0.032 ms): lseek (fd: 652, where: END) = 300646400
48088.321 (0.054 ms): lseek (fd: 652, where: END) = 300646400
48088.413 (0.102 ms): writing (fd: 652, buf: 0x7f5d2e0e6500, account: 8192) = 8192
48088.554 (0.046 ms): lseek (fd: 3, offset: 2596864, when: SET) = 2596864
48088.655 (0.061 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48088.779 (0.041 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48088.865 (0.056 ms): lseek (fd: 652, where: END) = 300654592
48088.954 (0.036 ms): lseek (fd: 652, where: END) = 300654592
48089.049 (0.050 ms): writing (fd: 652, buf: 0x7f5d2e0e8500, account: 8192) = 8192
48089.153 (0.037 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48089.261 (0.049 ms): lseek (fd: 652, where: END) = 300662784
48089.355 (0.071 ms): lseek (fd: 652, where: END) = 300662784
48089.463 (0.083 ms): writing (fd: 652, buf: 0x7f5d2e0ea500, account: 8192) = 8192
48089.616 (0.054 ms): lseek (fd: 3, offset: 2605056, where: SET) = 2605056
48089.724 (0.064 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48089.847 (0.044 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48089.951 (0.051 ms): lseek (fd: 652, where: END) = 300670976
48090.046 (0.044 ms): lseek (fd: 652, where: END) = 300670976
48090.150 (0.089 ms): writing (fd: 652, buf: 0x7f5d2e0ec500, account: 8192) = 8192
48090.297 (0.045 ms): lseek (fd: 3, offset: 2613248, when: SET) = 2613248
48090.393 (0.039 ms): reading (fd: 3, buf: 0x55fb9371ca98, account: 8192) = 8192
48090.483 (0.038 ms): reading (fd: 8, buf: 0x7fff2ec84457, count: 1) = -1 The EAGAIN resource is temporarily unavailable
48090.574 (0.056 ms): lseek (fd: 652, where: END) = 300679168
48090.672 (0.046 ms): lseek (fd: 652, where: END) = 300679168
48090.786 (0.182 ms): writing (fd: 652, buf: 0x7f5d2e0ee500, account: 8192) = 8192
48091.030 (0.059 ms): epoll_create1 (marks: 524288) = 992
48091.135 (0.056 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
48091.235 (0.056 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
48091.336 (0.051 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
48091.431 (0.055 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 1
48091.542 (0.034 ms): reading (fd: 12, buf: 0x7fff2ec83d70, count: 16) = 1
48091.629 (0.044 ms): gettimeofday (tv: 0x7fff2ec83d60) = 0
48091.717 (5005.261 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
53097.027 (0.023 ms): close (fd: 992                                 ) = 0
53097.079 (0.024 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
53097.127 (0.026 ms): epoll_create1 (marks: 524288) = 992
53097.177 (0.025 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
53097.222 (0.021 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
53097.264 (0.033 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
53097.313 (5005.226 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
58102.587 (0.022 ms): close (fd: 992                                 ) = 0
58102.638 (0.030 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
58102.690 (0.024 ms): epoll_create1 (marks: 524288) = 992
58102.746 (0.025 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
58102.795 (0.033 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
58102.850 (0.022 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0
58102.905 (5005.237 ms): epoll_wait (epfd: 992, events: 0x55fb9372ae08, maxevents: 1, wait time: 5000) = 0
63108.190 (0.023 ms): close (fd: 992                                 ) = 0
63108.243 (0.025 ms): reading (fd: 8, buf: 0x7fff2ec83e07, account: 1) = -1 EAGAIN resource temporarily unavailable
63108.293 (0.026 ms): epoll_create1 (marks: 524288) = 992
63108.342 (0.027 ms): epoll_ctl (epfd: 992, op: ADD, fd: 12, event: 0x7fff2ec83d7c) = 0
63108.389 (0.021 ms): epoll_ctl (epfd: 992, op: ADD, fd: 8, event: 0x7fff2ec83d7c) = 0
63108.431 (0.032 ms): gettimeofday (tv: 0x7fff2ec83d50) = 0

It seems epoll_wait It is taking too much time from each operation. Do we have to do something to solve it? Or any other detailed stacktrace needs to be analyzed?

Infrastructure

  • AWS EC2 r4.large
  • ZFS file system
  • RAID0 – 8 x 17GB
  • Number of databases – 130+
  • Heavy ETL process