I am building an application that will calculate uptime, among other things, of something I'm calling a service. I will explain a bit about what I have now and how it works, then get into what I'd like it to do.
The services loader is a Lambda function that runs once per minute, queries my database for all active services that meet certain criteria, and creates an SQS message for each one.
The service monitor is a Lambda function triggered by the SQS messages created by the services loader. It checks the service, gathers some metrics about it, and sends those to an instance of InfluxDB for further processing.
If the service is currently in a down state, it simply creates an SNS message that triggers a separate flow built with Step Functions and Lambda. That flow handles other business logic not related to uptime.
Long-running service called service-lr
This is a Node.js application running in a Kubernetes cluster that handles a lot of the user side of things. It fetches the service details for display, reads some metrics from a time-series database, handles creating and updating services, and does a bunch of other stuff.
What I want
I would like the service monitor to report an up or down status to, ideally, something that my service-lr service can query to get the uptime of the service it is monitoring.
My very rough initial thought is to have the service monitor send a message to SQS, which a new Lambda would process by writing the status and the current time into an InfluxDB bucket/instance. I would then be able to use service-lr to query that InfluxDB bucket and get back the time-series data containing the statuses.
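To make the idea a bit more concrete, here is a minimal sketch of how the new Lambda might turn one status report into an InfluxDB line-protocol record. The measurement name `service_status`, the tag `service_id`, the field `up`, and the function name are all made up for illustration; none of them come from my real schema. It also assumes the timestamp is in epoch milliseconds, so the write would need to use millisecond precision.

```typescript
// Hypothetical helper: formats one up/down report as an InfluxDB line-protocol record.
// Measurement, tag, and field names are assumptions chosen for illustration only.
function toLineProtocol(serviceId: string, up: boolean, timestampMs: number): string {
  // Commas, equals signs, and spaces are special in tag values and must be backslash-escaped.
  const tag = serviceId.replace(/([,= ])/g, "\\$1");
  // Store the status as an integer field (1 = up, 0 = down); the "i" suffix marks an integer.
  const status = up ? 1 : 0;
  return `service_status,service_id=${tag} up=${status}i ${timestampMs}`;
}

// Example usage with a made-up service id and timestamp.
const record = toLineProtocol("svc-1", false, 1700000000000);
console.log(record);
```

Storing the status as 0/1 rather than a string is just one option; it would keep the later downtime count a simple filter on the field value.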
I am now trying to figure out if this would work and how I would do the uptime calculation.
My current uptime formula plan:
- Add up time in minutes in a down state – not sure best way to do this
- Divide the down time by 43200 – which is how many minutes there are in 30 days
- Multiply the result by 100 – forget how I came to this
- Subtract the result of the previous step from 100 to get the uptime as a percentage
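The steps above can be sketched in TypeScript roughly as follows. This is only an illustration of the plan, not a settled design: the `StatusSample` shape and the function names are hypothetical, and it assumes one status sample per minute (matching the once-per-minute check), so each down sample is counted as one minute of downtime.

```typescript
// Hypothetical shape of one status sample read back from the time-series store.
interface StatusSample {
  timestampMs: number; // when the check ran, in epoch milliseconds
  up: boolean;         // true if the service was up, false if it was down
}

// Number of minutes in a 30-day window: 30 * 24 * 60 = 43200.
const WINDOW_MINUTES = 30 * 24 * 60;

// Assumption: checks run at a fixed interval, so every down sample
// represents one full interval of downtime.
function downMinutes(samples: StatusSample[], checkIntervalMinutes = 1): number {
  return samples.filter((s) => !s.up).length * checkIntervalMinutes;
}

// Uptime % = 100 - (downtime / window) * 100.
// Downtime is clamped to the window so the result stays between 0 and 100.
function uptimePercent(downtimeMinutes: number, windowMinutes = WINDOW_MINUTES): number {
  const clamped = Math.min(Math.max(downtimeMinutes, 0), windowMinutes);
  return 100 - (clamped / windowMinutes) * 100;
}

// Example usage with made-up samples: two of the three checks report "down".
const samples: StatusSample[] = [
  { timestampMs: 1700000000000, up: true },
  { timestampMs: 1700000060000, up: false },
  { timestampMs: 1700000120000, up: false },
];
console.log(uptimePercent(downMinutes(samples)));
```

Counting samples ignores missed checks; if the monitor skips a minute, that minute is silently treated as up, which may or may not be what I want.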
With all that said, I am seeking advice on how I would structure things and design them so that I could get the uptime of a service. I have no issue splitting the processing of it across multiple services or creating new ones where it makes sense.
Ideally I would be able to do most of the processing with Lambda, with just the final query and actual calculation done in service-lr so that I can send the result to my front end to be displayed.
As a sort of side note/question: is my formula/plan for calculating the uptime correct?