data – System Design: Efficient way to import hundreds of different CSV formats


OVERVIEW

We have multiple data providers who provide us information from a large set of data sources. These end data sources provide data in CSV format.

Since, the CSV formats are not consistent across different sources, we currently have a manual process in place where a human goes through each CSV file we receive and updates the column names in each file with the names understood by our system and then those CSVs are ingested.

For Example:

Let us assume that an Address object exists in our system and is represented by Line1, Line2, City, State, Country, PostalCode columns.

We would then get CSVs having columns like these:

  1. Address1,Address2, City, County, Country, Zip
  2. Line1,Line2, Line3, City, Country, Zipcode
  3. Address, City, State, Country, Postcode

Which we would map to our representation manually as following:

  1. Address1–>Line1, Address2–>Line2, City–>City, NULL–>State, Country–>Country, Zip–>PostalCode
  2. Line1–>Line1,Line2+Line3–>Line2, City–>City, NULL–>State, Country–>Country, Zipcode–>PostalCode
  3. Address–>Line1, NULL–>Line2, City–>City, State–>State, Country–>Country, Postcode-->PostalCode`

REQUIREMENT

We need a solution which can replace the human intervention entirely, and smartly identify and map the incoming CSV to the format understood by our system. Please suggest a solution or point me in the direction I should take.