Extract, Transform, Load (ETL)
Why it matters
For AI specifically, every training dataset is the output of an ETL or ELT pipeline. Garbage at the extract step, combined with an overly aggressive transform that quietly drops or imputes rows, produces silent training-data pollution. The model evaluates well at first, then drifts months later, and the root cause sits two layers upstream in a pipeline nobody on the model team owns. The pipeline IS the AI’s input quality control, not plumbing beneath it. The operational discipline around pipeline maturity is DataOps, which is to ETL what DevOps is to application code.
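To make the "quietly drops or imputes rows" failure concrete, here is a minimal sketch of a transform step that counts what it drops and imputes, and fails loudly when too much is lost. The field names (`user_id`, `age`), the `TransformReport` class, and the 10% threshold are assumptions chosen for illustration, not part of any particular ETL tool.

```python
from dataclasses import dataclass


@dataclass
class TransformReport:
    """Counts that make drops and imputations visible to whoever owns the data."""
    rows_in: int = 0
    rows_out: int = 0
    dropped: int = 0
    imputed: int = 0


def transform(rows, default_age=0, max_drop_ratio=0.1):
    """Clean rows, but record every drop and imputation instead of hiding it.

    Hypothetical rules: a row without `user_id` is dropped; a missing `age`
    is imputed with `default_age` and flagged so downstream users can tell.
    """
    report = TransformReport(rows_in=len(rows))
    cleaned = []
    for row in rows:
        if row.get("user_id") is None:
            report.dropped += 1
            continue
        if row.get("age") is None:
            # Copy the row so the extracted input is left untouched.
            row = {**row, "age": default_age, "age_imputed": True}
            report.imputed += 1
        else:
            row = {**row, "age_imputed": False}
        cleaned.append(row)
    report.rows_out = len(cleaned)

    # Fail the run rather than ship a dataset that lost too many rows.
    if report.rows_in and report.dropped / report.rows_in > max_drop_ratio:
        raise ValueError(
            f"dropped {report.dropped} of {report.rows_in} rows, "
            f"above the allowed ratio {max_drop_ratio}"
        )
    return cleaned, report
```

The design choice is that the report travels with the data: a model team that sees `dropped` or `imputed` jump between runs has a signal before the model drifts, not months after.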
Where you’ll encounter it
Three concrete contexts. A data engineer picks between dbt (ELT, transforms inside the warehouse) and Airflow plus Python (classical ETL, transforms outside the warehouse). A model-risk review traces a production feature back to the ETL job that built it and asks whether that job is versioned and rerunnable. A procurement team asks “what ETL platform do you use?”, and the honest answer is almost always a stack, not a single product. The pitfall I keep seeing: teams build pipelines as one-time scripts instead of versioned artifacts, then cannot reproduce the training dataset six months later when a regulator asks.
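One way to picture "versioned and rerunnable" is a manifest written alongside every dataset build, recording the code version, the parameters, and a hash of the inputs and outputs. The sketch below uses only the Python standard library. The function names, the `min_age` filter, and the manifest fields are hypothetical, and `code_version` is assumed to be something like a commit hash supplied by the caller.

```python
import hashlib
import json


def fingerprint(rows):
    """Order-sensitive SHA-256 over a canonical JSON encoding of each row."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def run_pipeline(source_rows, params):
    """Stand-in transform (hypothetical): keep rows at or above `min_age`."""
    return [row for row in source_rows if row["age"] >= params["min_age"]]


def build_manifest(source_rows, output_rows, code_version, params):
    """Record what is needed to rebuild and check this dataset later."""
    return {
        "code_version": code_version,  # assumed: e.g. a commit hash
        "params": params,
        "input_sha256": fingerprint(source_rows),
        "output_sha256": fingerprint(output_rows),
        "rows_in": len(source_rows),
        "rows_out": len(output_rows),
    }


def verify_rerun(source_rows, manifest):
    """Rerun with the recorded params and confirm the same dataset comes out."""
    if fingerprint(source_rows) != manifest["input_sha256"]:
        return False  # the inputs are not the ones the dataset was built from
    output_rows = run_pipeline(source_rows, manifest["params"])
    return fingerprint(output_rows) == manifest["output_sha256"]
```

With a manifest like this stored next to the training data, a reproduction request becomes a rerun plus a hash comparison. A real pipeline would also need to check out the recorded code version before rerunning, which this sketch leaves out.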
Part of the 7wData AI Glossary.