Versioning of data handling pipeline elements


I have developed a custom Python package that provides two classes to work with: Stage and Step. They implement the Chain of Responsibility design pattern and behave as follows:

  • Both Stage and Step expose a handle(request) method
  • A Stage holds a list of child Stages or Steps underneath it
  • A Step is an endpoint, containing the logic to be executed on the incoming request

In the case of a Stage, calling its handle(request) method makes it iterate over the child elements in its list and call each child's handle(request) method, passing the request argument along the chain (updated by each element it passes through).

In the case of a Step, there is an abstract method logic(request), which the user must override when inheriting from this class, in order to implement the logic that should be applied to the incoming request argument.
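Roughly, a minimal sketch of the two classes (simplified; the Handler base class and the children attribute name are illustrative, not the package's real API):

```python
from abc import ABC, abstractmethod


class Handler(ABC):
    """Common interface: everything in the chain exposes handle(request)."""

    @abstractmethod
    def handle(self, request):
        ...


class Step(Handler):
    """Endpoint of the chain; subclasses override logic(request)."""

    def handle(self, request):
        return self.logic(request)

    @abstractmethod
    def logic(self, request):
        ...


class Stage(Handler):
    """Composite node: passes the request through each child in order."""

    def __init__(self, children):
        self.children = children  # list of Stage or Step instances

    def handle(self, request):
        # Each child receives the request as updated by the previous one.
        for child in self.children:
            request = child.handle(request)
        return request
```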

Pipeline logic

As a bit of a background to actual application:
This pipeline will be used for data processing in an ML project: a Flask server with a REST API receives a request with data in JSON format, processes the data by merging the different data sources in the request and adding newly derived entries, and then emits it on the other end, ready to be fed into an ML model built with scikit-learn.

Steps will be set in stone for each release. However, the version of the pipeline that each case's data was pushed through before ML prediction will also be kept on record, in a DB managed by another module written in Django. Whenever a newer pipeline becomes available, the internal logic of the Django module would prompt a re-run of the data through the new pipeline, followed by a new prediction via the ML model.

Now, I would like to introduce some sort of version-control mechanism into this pipeline, so that the pipeline can identify itself with a marker unique to the combination of its version and the list of steps included in it. I need this so that if, say, I introduce new steps into the pipeline in the future, I can redo older requests that were processed with an older version of the pipeline.

I have been wondering what mechanisms I could leverage to get this done, but so far my only idea has been to tag each step and stage with a version attribute, like self._version = "1.0", and have the running script count the number of steps and stages in the pipeline from Stage 0's perspective, somehow combining that count with the versions. While simple, this would require me to remember to bump the version attribute on every step I rework, and would be error-prone.
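As a rough sketch of that idea, assuming each element carries a _version attribute and Stages expose their children in a children list (both names are placeholders for whatever the package actually uses), the per-element versions could be collected recursively and hashed into a single pipeline fingerprint:

```python
import hashlib


def collect_versions(element):
    """Walk the tree and gather (class name, version) pairs.

    Assumes a _version attribute on every element and a `children`
    list on Stages; both are placeholder names for illustration.
    """
    parts = [f"{type(element).__name__}={getattr(element, '_version', '?')}"]
    for child in getattr(element, "children", []):
        parts.extend(collect_versions(child))
    return parts


def pipeline_fingerprint(root):
    """Hash the ordered list of versions into one short identifier."""
    joined = "|".join(collect_versions(root))
    return hashlib.sha256(joined.encode()).hexdigest()[:12]
```

Hashing the joined list (rather than summing version numbers) means the fingerprint also changes when steps are reordered, added, or removed, not just when a version string is bumped.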

Another idea I came across was a mechanism that would physically read the class files each stage/step is instantiated from and compile some numeric representation of all the characters in each file, which could be combined and exposed at the Stage 0 level as the version of the pipeline. I don't know if that is worth considering, though, since it is a blunt approach to the problem.
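A less blunt variant of the same idea would be to hash each class's source with a cryptographic digest instead of summing characters (a character sum would let two edits cancel each other out). A sketch using the standard-library inspect module, again assuming Stages expose a children list:

```python
import hashlib
import inspect


def source_hash(element):
    """Digest of the source code of the element's class."""
    src = inspect.getsource(type(element))
    return hashlib.sha256(src.encode()).hexdigest()


def pipeline_source_hash(root):
    """Combine source hashes of the whole tree into one digest.

    Assumes Stages expose a `children` list (placeholder name).
    """
    digest = hashlib.sha256(source_hash(root).encode())
    for child in getattr(root, "children", []):
        digest.update(pipeline_source_hash(child).encode())
    return digest.hexdigest()
```

One caveat: inspect.getsource raises an OSError for classes whose source is not retrievable (e.g. defined in a REPL or shipped only as bytecode), and any whitespace or comment change alters the hash even when behaviour is unchanged.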

Are there any alternative methods of doing something like this?