I am building a system that deals with an external resource that has been having issues, and I am unable to switch away from it. The resource exposes regional endpoints. This post is about handling its errors, taking actions, and alerting without duplicating either the actions or the alerts. Note that the system is being designed to work with multiple external resources, each identified by a serviceID.
Explanation of current state
Events are generated in multiple regions, and only in error cases. However, when an error occurs in one region, it is likely that another region will hit the same error on the external resource mentioned above.
For example, say us-east-1 has an error and us-west-1 has one as well. Both of these errors need to be processed by a single system, which will take actions and alert if needed based on the error.
My current setup is pretty terrible in that it is region specific. In the case above, where there are issues in both us-east-1 and us-west-1, the required action is taken twice, and so is the alert. This causes spam and also has a tendency to break things in some edge cases I’ve experienced lately.
My goal is a system where, when a single region has the issue, the correct action and alert are performed, and when two or more regions have the issue, the action is taken only once and, if needed, a single alert goes out that lists the affected regions.
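To make the goal concrete, here is a minimal sketch of the outcome I am after, written as a pure function. The event shape is an assumption: only serviceID and the region come from my actual setup, while `errorCode`, `RegionErrorEvent`, `Incident`, and `groupIntoIncidents` are hypothetical names used for illustration.

```typescript
// Hypothetical event shape. Only serviceID and region exist in the real
// system; errorCode is an assumed field describing the error.
interface RegionErrorEvent {
  serviceID: string;
  region: string;
  errorCode: string;
}

// One incident per (serviceID, errorCode), carrying every region that reported it.
interface Incident {
  serviceID: string;
  errorCode: string;
  regions: string[];
}

// Collapse per-region events into one incident per service and error,
// keeping the list of regions so it can be passed to the action and the alert.
function groupIntoIncidents(events: RegionErrorEvent[]): Incident[] {
  const byKey = new Map<string, Incident>();
  for (const e of events) {
    const key = `${e.serviceID}#${e.errorCode}`;
    let incident = byKey.get(key);
    if (!incident) {
      incident = { serviceID: e.serviceID, errorCode: e.errorCode, regions: [] };
      byKey.set(key, incident);
    }
    // Record each region only once, even if it reports the same error repeatedly.
    if (incident.regions.indexOf(e.region) === -1) {
      incident.regions.push(e.region);
    }
  }
  return Array.from(byKey.values());
}
```

With this shape, an error for the same serviceID reported from us-east-1 and us-west-1 would yield a single incident whose `regions` array holds both, which is what the action and the alert would consume.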
About the application and infrastructure
- AWS based
- Node.js micro services running in mix of EKS and Lambda
- IaC via Terraform
How can I design this system to handle the multi-region events with a single point where the actions and alerts are deduplicated, while still retaining the region of each event for inclusion in the action (super important) and in the alert?
My idea that hasn’t worked
Each region would generate events when needed, each containing a serviceID as mentioned at the top of this post. The primary region would then deduplicate the events using that serviceID. I have zero clue how to do this: maybe with an SQS queue, SNS, or my DB (MongoDB). Once deduplication is done, it would send an SNS message in a fan-out to two SQS queues, which have Lambdas attached that either take an action or alert based on the error message from the initial alert. This last part, the Lambdas performing the action(s) and alerts, is already done, as it is part of the terrible existing setup mentioned above.
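As a rough illustration of the deduplication step in the primary region, here is a minimal sketch, not a working design. It assumes events for the same serviceID that arrive within a fixed time window belong to one incident, and it uses an in-memory `Map` as a stand-in for whatever shared store would really be needed (for example my MongoDB). `IncomingEvent`, `PendingIncident`, `DedupWindow`, `receivedAt`, and the window length are all hypothetical.

```typescript
// Hypothetical shape of an event forwarded from any region to the primary region.
interface IncomingEvent {
  serviceID: string;
  region: string;
  receivedAt: number; // epoch milliseconds (assumed field)
}

interface PendingIncident {
  serviceID: string;
  firstSeen: number;
  regions: string[];
}

// Collects events per serviceID and releases each incident once, after its
// window has elapsed. The Map stands in for a shared store; a real version
// would need atomic writes so concurrent consumers do not race.
class DedupWindow {
  private pending = new Map<string, PendingIncident>();

  constructor(private readonly windowMs: number) {}

  // Record an event; repeated regions for the same serviceID are ignored.
  add(event: IncomingEvent): void {
    let incident = this.pending.get(event.serviceID);
    if (!incident) {
      incident = { serviceID: event.serviceID, firstSeen: event.receivedAt, regions: [] };
      this.pending.set(event.serviceID, incident);
    }
    if (incident.regions.indexOf(event.region) === -1) {
      incident.regions.push(event.region);
    }
  }

  // Return incidents whose window has closed and forget them, so each
  // incident is handed to the action/alert fan-out exactly once.
  flush(now: number): PendingIncident[] {
    const ready: PendingIncident[] = [];
    this.pending.forEach((incident) => {
      if (now - incident.firstSeen >= this.windowMs) {
        ready.push(incident);
      }
    });
    for (const incident of ready) {
      this.pending.delete(incident.serviceID);
    }
    return ready;
  }
}
```

The trade-off in this sketch is latency: the action is delayed by the window length so that all affected regions can be collected before the single SNS message is published. Whether that delay is acceptable depends on the action.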
Current issues with my idea
- No deduplication working
- Unable to get SQS/SNS messages to work across regions
I am open to any suggestions on how to solve this problem. Pretty much anything is possible from an approval standpoint.