architecture – Design of a data processing pipeline

I have a use case for which I need to build a data processing pipeline

  • The client's contact data comes from different sources such as CSV files, databases, and APIs, and must first be mapped to the fields of a universal schema (a minimal sketch of this mapping follows this list). Around 100k rows need to be processed every day.
  • Then some of the fields must be cleansed, validated, and enriched. For example, the email field must be validated by calling an external API to verify that it is valid and does not bounce, and the address field must be standardized to a particular format. There are other operations such as deriving the city and state from the zip code and validating the phone number. At least 20 operations are planned, with more to come in the future.
  • These rules are not fixed and can change according to what the user wants to do with their data (saved via the user interface). For example, for a particular data source a user may choose only to standardize phone numbers but not verify that they are valid, so the operations performed on the data are dynamic.
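
For illustration, here is a minimal sketch of that first mapping step, assuming hypothetical source column names and a made-up universal schema (first_name, email, phone, zip); each real source would carry its own mapping:

    import pandas as pd

    # Hypothetical mapping from one CSV source's column names to the universal schema.
    # Each data source would have its own mapping stored alongside its configuration.
    CSV_SOURCE_MAPPING = {
        "First Name": "first_name",
        "E-mail": "email",
        "Phone Number": "phone",
        "Postal Code": "zip",
    }

    def to_universal_schema(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
        """Rename source columns to the universal schema and drop everything else."""
        df = df.rename(columns=mapping)
        return df[list(mapping.values())]

    # Usage: df = to_universal_schema(pd.read_csv("contacts.csv"), CSV_SOURCE_MAPPING)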

This is what I am currently doing:

  1. Load the data as a pandas dataframe (Spark was considered, but the data set is not that big [max ~200 MB], so it doesn't warrant Spark). Keep a list of the operations defined by the user that must be performed on each field, like:

    actions = {"phone": ["cleanse", "standardise"], "zip": ["enrich", "validate"]}

As I mentioned earlier, the actions are dynamic and vary from data source to data source, according to what the user chooses to do with each field. There are many custom business rules like this that can be applied to a specific field.

  2. I have a custom function for each operation that the user can define for a field.
    I call them according to the "actions" dictionary and pass the dataframe to each function; the function applies its logic to the dataframe and returns the modified dataframe:
    def cleanse_phone_no(df, configurations):
        # Logic
        return modified_df
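
For what it's worth, one common way to keep the "actions" dictionary and the per-field functions loosely coupled is a small operation registry that maps (field, operation) pairs to functions. This is only a sketch under assumed field names and config keys, not a full implementation:

    import pandas as pd

    # Example operation implementations; the real ones would live in their own modules.
    def cleanse_phone_no(df: pd.DataFrame, config: dict) -> pd.DataFrame:
        # Hypothetical cleanup: strip everything that is not a digit.
        df["phone"] = df["phone"].astype(str).str.replace(r"\D", "", regex=True)
        return df

    def standardise_phone_no(df: pd.DataFrame, config: dict) -> pd.DataFrame:
        # Hypothetical standardisation to a +<country code><number> format.
        df["phone"] = "+" + config.get("country_code", "1") + df["phone"]
        return df

    # Registry keyed by (field, operation); the other ~20 operations plug in here.
    OPERATIONS = {
        ("phone", "cleanse"): cleanse_phone_no,
        ("phone", "standardise"): standardise_phone_no,
    }

    def run_pipeline(df: pd.DataFrame, actions: dict, config: dict) -> pd.DataFrame:
        """Apply the user-selected operations for each field, in the order given."""
        for field, operations in actions.items():
            for op in operations:
                func = OPERATIONS.get((field, op))
                if func is None:
                    raise ValueError(f"No implementation for {op!r} on {field!r}")
                df = func(df, config)
        return df

    # Usage:
    # actions = {"phone": ["cleanse", "standardise"]}
    # df = run_pipeline(df, actions, config={"country_code": "1"})

New operations can then be added without touching the dispatch logic, which fits the requirement that the set of operations keeps growing.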

I'm not sure whether this is the right approach. Things will get complicated when I have to call external APIs to enrich certain fields in the future, so I'm considering a producer-consumer model (a minimal sketch follows the list below):

  a. Have a producer module that splits the file and publishes each row (one contact record) as a single message to a queue like AMQ or Kafka.

  b. Have the data-processing logic in the consumers: each consumer takes one message at a time and processes it.

  c. The advantage I see with this approach is that it simplifies the data-processing part: the data is processed one record at a time, with more control and granularity. The disadvantage is the computational overhead of processing records one by one, which I can overcome to some extent by using multiple consumers.
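
To make the trade-off concrete, here is a minimal in-process sketch of the producer-consumer shape using only the Python standard library (a queue plus worker threads) instead of AMQ/Kafka; process_record and the sample data are placeholders, not the real cleanse/validate/enrich logic:

    import queue
    import threading

    import pandas as pd

    SENTINEL = None  # tells a consumer to stop

    def process_record(record: dict) -> dict:
        # Placeholder for the per-record cleanse/validate/enrich logic
        # (including any external API calls).
        return record

    def producer(df: pd.DataFrame, q: queue.Queue, n_consumers: int) -> None:
        # Publish each row (one contact record) as a single message.
        for record in df.to_dict(orient="records"):
            q.put(record)
        for _ in range(n_consumers):
            q.put(SENTINEL)

    def consumer(q: queue.Queue, results: list) -> None:
        # Take one message at a time and process it.
        while True:
            record = q.get()
            if record is SENTINEL:
                break
            results.append(process_record(record))

    if __name__ == "__main__":
        df = pd.DataFrame({"phone": ["555-0100", "555-0101"], "zip": ["10001", "94105"]})
        q, results, n_consumers = queue.Queue(maxsize=1000), [], 4
        workers = [threading.Thread(target=consumer, args=(q, results)) for _ in range(n_consumers)]
        for w in workers:
            w.start()
        producer(df, q, n_consumers)
        for w in workers:
            w.join()
        print(len(results), "records processed")

With a real broker, the producer and consumers would become separate processes or services, and the queue would add durability and let consumers scale horizontally.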

Here are my questions:

  • What is your opinion of this approach? Do you have any suggestions for a better one?
  • Is there a more elegant pattern than my current one for applying the custom rules to the data set?
  • Is it advisable to use a producer-consumer model to process the data one row at a time rather than as a complete data set (taking into account all the logical complexity that will come in the future)? If so, should I use AMQ or Kafka?