What is Feature Engineering and Why Does It Need To Be Automated?

Artificial intelligence is becoming increasingly ubiquitous and necessary. From fraud prevention and real-time anomaly detection to customer churn prediction, enterprise customers are finding new applications of machine learning (ML) every day. What lies under the hood of ML, how does this technology make predictions, and which secret ingredient makes the AI magic work?

In the data science community, the focus is typically on algorithm selection and model training. Those are indeed important, but the most critical piece of the AI/ML workflow is not how we select or tune algorithms but what we feed into them, i.e., feature engineering.

Feature engineering is the holy grail of data science and the step that most determines the quality of AI/ML outcomes. Irrespective of the algorithm used, feature engineering drives model performance and governs the ability of machine learning to generate meaningful insights and, ultimately, to solve business problems.

Feature engineering is the process of applying domain knowledge to extract analytical representations from raw data, making it ready for machine learning. It is the first step in developing a machine learning model for prediction.

Feature engineering involves the application of business knowledge, mathematics, and statistics to transform data into a format that can be directly consumed by machine learning models. It starts from many tables spread across disparate databases that are then joined, aggregated, and combined into a single flat table using statistical transformations and/or relational operations.
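As a minimal sketch of the "many tables into one flat table" idea, the snippet below joins three small, made-up tables on a customer key and aggregates them to one row per customer. The table names, column names, and values are illustrative assumptions, not taken from any real system. In practice this step is usually done with SQL or a dataframe library rather than plain Python.

```python
from collections import defaultdict

# Hypothetical raw tables, each a list of row dicts (illustrative data only).
customers = [
    {"customer_id": 1, "plan": "basic"},
    {"customer_id": 2, "plan": "premium"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 120.0},
]
support_tickets = [
    {"customer_id": 1, "topic": "billing"},
    {"customer_id": 1, "topic": "outage"},
]


def build_flat_table(customers, transactions, support_tickets):
    """Join the child tables to customers and aggregate to one row per customer."""
    # Group transaction amounts by customer (the "join" key).
    spend = defaultdict(list)
    for row in transactions:
        spend[row["customer_id"]].append(row["amount"])

    # Count support tickets per customer.
    tickets = defaultdict(int)
    for row in support_tickets:
        tickets[row["customer_id"]] += 1

    flat = []
    for c in customers:
        cid = c["customer_id"]
        amounts = spend.get(cid, [])
        flat.append({
            "customer_id": cid,
            "plan": c["plan"],
            "txn_count": len(amounts),
            "txn_total": sum(amounts),
            "txn_mean": sum(amounts) / len(amounts) if amounts else 0.0,
            "ticket_count": tickets.get(cid, 0),
        })
    return flat


flat_table = build_flat_table(customers, transactions, support_tickets)
for row in flat_table:
    print(row)
```

Each output row is one customer with a fixed set of columns, which is the shape an ML algorithm expects as input.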

For example, predicting which customers are likely to churn in a given quarter means identifying the customers with the highest probability of no longer doing business with the company. How do you go about making such a prediction? We predict churn by looking at its underlying causes. The process is based on analyzing customer behavior and then creating hypotheses. For example, customer A contacted customer support five times in the last month, which suggests that customer A has complaints and is likely to churn. In another scenario, customer A’s product usage might have dropped by 30% over the previous two months, again suggesting that customer A has a high probability of churning. Looking at historical behavior, extracting hypothesis patterns, and testing those hypotheses is the process of feature engineering.
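The two hypotheses in this example can each be turned into a numeric feature. The sketch below is one possible way to do that; the function names, the 30-day window, and the sample dates and usage figures are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def support_contacts_last_30d(contact_dates, as_of):
    """Count support contacts in the 30 days up to and including as_of."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for d in contact_dates if window_start < d <= as_of)


def usage_change_pct(previous_usage, recent_usage):
    """Percentage change in usage between two periods (negative means a drop)."""
    if previous_usage == 0:
        return 0.0  # no baseline to compare against
    return (recent_usage - previous_usage) * 100.0 / previous_usage


# Hypothetical history for customer A.
as_of = date(2024, 3, 31)
contacts = [date(2024, 3, d) for d in (2, 9, 15, 22, 29)] + [date(2024, 1, 5)]

features = {
    "support_contacts_30d": support_contacts_last_30d(contacts, as_of),
    "usage_change_pct": usage_change_pct(previous_usage=200.0, recent_usage=140.0),
}
print(features)
```

Whether either feature actually helps predict churn is something the model and validation data decide. The code only encodes the hypothesis so that it can be tested.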

Feature engineering is about extracting business hypotheses from historical data. A business problem that involves predicting an outcome such as whether a customer will churn is a classification problem.

There are several ML algorithms that you can use, such as classical logistic regression, decision trees, support vector machines, boosting, and neural networks. Although all these algorithms require a single flat matrix as input, raw business data is stored in disparate tables (e.g., transactional, temporal, geo-locational) with complex relationships.

For example, we may join two tables first and then perform temporal aggregation on the joined table to extract temporal user behavior patterns. Practical feature engineering (FE) is far more complicated than simple transformation exercises such as one-hot encoding (transforming categorical values into binary indicators that ML algorithms can use). To implement FE, we write hundreds or even thousands of SQL-like queries and perform a lot of data manipulation as well as a multitude of statistical transformations.
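To make the "join first, then aggregate over time" pattern concrete, here is a small sketch using Python's built-in sqlite3 module. The schema, the sample rows, the as-of date, and the 30-day and 90-day windows are all assumptions made for illustration. A real pipeline would contain many such queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE usage_events (customer_id INTEGER, event_date TEXT, minutes REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO usage_events VALUES
        (1, '2024-01-10', 100), (1, '2024-02-12', 80), (1, '2024-03-05', 40),
        (2, '2024-03-20', 60);
""")

# Join customers to their usage events, then aggregate usage over two
# time windows that end at a hypothetical as-of date.
query = """
    SELECT c.customer_id,
           c.region,
           COALESCE(SUM(CASE WHEN u.event_date > date(:as_of, '-30 days')
                             THEN u.minutes END), 0) AS minutes_last_30d,
           COALESCE(SUM(CASE WHEN u.event_date > date(:as_of, '-90 days')
                             THEN u.minutes END), 0) AS minutes_last_90d
    FROM customers AS c
    LEFT JOIN usage_events AS u
           ON u.customer_id = c.customer_id AND u.event_date <= :as_of
    GROUP BY c.customer_id, c.region
    ORDER BY c.customer_id
"""
rows = conn.execute(query, {"as_of": "2024-03-31"}).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps customers with no usage events in the result, so the output still has exactly one row per customer. Comparing the 30-day and 90-day columns gives a simple view of how a customer's usage is trending.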

In the machine learning context, if we know the historical pattern, we can create a hypothesis. Based on the hypothesis, we can predict the likely outcome, such as which customers are likely to churn in a given time period. FE is all about finding the optimal combination of hypotheses.

Feature engineering is critical because if we provide the wrong hypotheses as input, ML cannot make accurate predictions. The quality of the hypotheses we provide is vital to the success of an ML model.
