architecture – How to decide whether to spawn more worker instances in a node according to bandwidth utilization

We’re working on a distributed software which transfers data around. The software is composed by a central server application (called “the server”) and several workers which synchronize with the server to get work and inform it when they’re done. It is a distributed application.

Now we need to implement a network bandwidth system which will allow us to decide whether to spawn more workers or not to saturate the network bandwidth and utilize it entirely.

The first question is: how can we estimate if a worker is already utilizing all of the available network bandwidth and how can we decide whether we need to spawn more workers (or even kill some of them)?

This is not easy to answer since for one single central server application:

  • there might be multiple instances of a worker in the same machine with the same network interface
  • there might be multiple workers on different machines
  • some workers could be containerized

Something we thought could work is: let’s keep track of the maximum bandwidth every node (a machine where one or more workers are running) utilized and let’s spawn more workers on that node if we see the used bandwidth is less than that amount. This has the disadvantage of a possible ‘workers overcrowding’, i.e. the server could see that we’re using the maximum-ever-recorded amount of bandwidth but we spawned WAY too many workers and each of them is working at 1 byte/sec (I’m exaggerating here, but ideally this should not happen).

Is there a better heuristic to decide how to ‘scale’ the workers (in or out) according to the network utilization?