Let’s say we have a point-in-time event: on the next check, Prometheus may no longer show that the event was triggered, even though the underlying condition is still active.
What we’re finding in situations like this is that Prometheus “clears” the event and Alertmanager auto-resolves the alert. So we get alerted/paged for the event, but on the next check the alert clears. What we really need is to investigate the event ourselves and manually resolve the alert once we have either fixed the problem or determined that the incident did indeed resolve.
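To make the behavior concrete, here is a toy Python model of what we see. It is not Prometheus or Alertmanager code; the function name `alert_states` and the state labels are made up for illustration, and it ignores details such as pending durations and timeouts. Each boolean stands for whether the rule expression matched on one check.

```python
# Toy model of the observed behavior; NOT Prometheus or Alertmanager code.
# `alert_states` and the state labels are hypothetical names for illustration.

def alert_states(evaluations):
    """Return the alert state after each check, given whether the rule matched."""
    states = []
    was_firing = False
    for matched in evaluations:
        if matched:
            state = "firing"
        elif was_firing:
            # The expression stopped matching, so the alert auto-resolves,
            # even if nobody has looked at the underlying problem yet.
            state = "resolved"
        else:
            state = "inactive"
        states.append(state)
        was_firing = matched
    return states


# The event is visible on exactly one check, then drops out of the result.
print(alert_states([False, True, False, False]))
# → ['inactive', 'firing', 'resolved', 'inactive']
```

What we want instead is for the alert to stay in the firing state after that single matching check until a person resolves it.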
I understand this is counterintuitive in most cases, but it’s valid for some use cases in our environment.
So, how does one achieve this trigger-on-event-and-don’t-auto-resolve-until-manually-resolved behavior with Prometheus/Alertmanager? Is it possible? I haven’t found any documentation describing how to do this.