architecture – Unifying Python Preprocessing Pipelines for Machine Learning on Time Series Data (High Throughput Batch & Low Latency Streaming)


What frameworks, design patterns, systems, etc. exist for unifying the preprocessing of time series data in Python such that high throughput is achieved on the retrospective batch data for training machine learning models while also allowing for easy model deployment with low latency on real-time streaming data for inference?

The data is raw, high frequency (50-500Hz) ICU bedside monitor signals which are recorded into a bespoke time-series database containing over 3 trillion data points that come from a RabbitMQ stream producing on the order of tens of thousands of new points per second. The database is accessible via a rather slow REST API backend in PHP or SDKs based in various languages. Researchers are really only familiar with Python-based tooling (numpy, scipy, pandas, biosppy, PyTorch, Tensorflow, various GitHub repos of domain-specific code from research papers, etc) while also not being interested or skilled in engineering custom deployment code for each of their models. We are needing high throughput while preparing large subsets (up to a couple of terabytes raw) of the database for efficiently and easily developing models from retrospective data, while also having a very low latency (<500ms) on real-time streaming data.

  1. Traditional ML Research Approach

    • Relevant retrospective data is pulled to disk, processed sequentially (usually inefficiently), and eventually, a decent performing model is made.
    • This works but is completely incompatible with the real-time systems meaning deployment requires significant custom engineering. Or worse, this processing is simply inefficient and takes days to process the dataset and/or extra engineering work to parallelize. Therefore, this is not a great approach due to how the research phase is slowed down and the excessive technical work requirements for optimizing the processing as well as deploying to real-time data.
  2. Custom Python Framework

    • I had started on developing a Python-based framework that tried to minimize the amount of systems knowledge and code required from researchers while still allowing them the freedom to perform their work however they saw fit via familiar tools (ie, Python). This ended up being quite the undertaking and has been very difficult. Worries of bugs, dealing with edge cases, performance, maintenance, adaptability, … have me looking for a better approach. Many of the problems I discovered along the way seem to be addressed by the streaming frameworks.
  3. Streaming Frameworks

    • It appears that streaming frameworks are the current leading way of feeding real-time machine learning models. In particular, Spark Streaming, but also Storm, Flink, Samza, Trill, …
    • Apache Beam is currently the most interesting though I am uncertain this is the best option
      • Unified: batch and streaming can be processed in the same way
        • I recently learned Flink offers similar functionality
      • Extensible: allows for new SDKs, IO connectors, etc
        • This is important for interfacing with our bespoke database.
        • Issue: The Python SDK of Apache Beam does not have existing IO for RabbitMQ but Java SDK does
        • Allows for the execution of any Python code on the data, not just simplistic min/max/windowing and such of other streaming engines
      • Language: Allows for everything to be written in Python
        • Only other all Python offering I have found is Streamz

Thank you for your time. If there are better venues or ways to ask this question, please let me know! Especially if I have misused terminology, please let me know as that will allow me to research these issues better on my own.