Good models + Bad data = Bad analysis

by 7wData
January 23, 2017

One of the key themes in Numbersense is the relationship between models and data. Think of data as inputs to models which generate outputs (predictions, etc.). A lot of the dialog in the data science community revolves around models, or algorithms that implement underlying models (random forests, deep learning, etc.). But there are countless examples of applying good models to bad data, resulting in bad outputs.

I just finished teaching a class about Analytical Models at Columbia. Ironically, the main takeaway of the class is grasping the complete analytical process, from data gathering to interpreting model outputs, in which the analytical model itself plays only a minor role. The course revolves around a semester-long project. Students are asked to identify a real-world dataset to work on.

A number of students coalesced around a movie dataset, uploaded to Kaggle. The dataset includes data scraped from the IMDB website, plus a number of enhancements, such as counting the number of people in movie posters, and the number of Facebook likes of key actors in the movies. At first glance, the dataset is quite rich, and suggests that box office receipts may be predictable using the included variables.

This dataset is a great illustration of why one cannot get good outputs when the inputs are highly flawed. On closer inspection, most of the variables contain considerable impurities. There will be three or four posts discussing various aspects of this dataset, of which this post is the first.

One of the most interesting variables is the count of faces (people) on the movie poster. What the analyst should recognize right away is that this variable is a "computed" variable (sometimes called "modeled"). In other words, no one actually counted the number of heads on each poster. A facial recognition algorithm developed by a third party was deployed to predict the number of heads in each poster.

We have a cascade of models. The output of one algorithm generates data which are used as input to another algorithm. With computed variables, we must ask how accurate the first algorithm is. If the first algorithm is not accurate enough, we violate one of the key assumptions of data-science models!

Most standard models used in data science, for example, regression models, assume that the predictors (X) are accurately measured. This assumption is fine when raw data are used: the budget of the movie, the year it was produced, and the name of the director are all known with certainty. But here, the number of people on the poster is a prediction by the face-recognition algorithm, which does not have perfect accuracy.
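To make the consequence concrete, here is a minimal simulation sketch of what can happen when a regression predictor is itself the output of an imperfect algorithm. Everything in it is an assumption for illustration: the data are synthetic (not the Kaggle movie dataset), the upstream error is modeled as simple additive Gaussian noise (real face-counting errors are unlikely to behave this neatly), and the names `ols_slope` and `simulate` are hypothetical helpers.

```python
import numpy as np


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    x_centered = x - x.mean()
    return float(np.dot(x_centered, y - y.mean()) / np.dot(x_centered, x_centered))


def simulate(n=20000, true_slope=2.0, x_sd=1.0, error_sd=1.0, seed=0):
    """Fit the same regression twice: once on the true predictor,
    once on a noisy 'computed' version of it."""
    rng = np.random.default_rng(seed)

    # The quantity we wish we had (e.g. a hand-counted value).
    x_true = rng.normal(0.0, x_sd, n)

    # The outcome depends on the TRUE predictor, plus ordinary noise.
    y = true_slope * x_true + rng.normal(0.0, 1.0, n)

    # What we actually observe: the output of an imperfect upstream
    # algorithm, modeled here (an assumption) as truth plus random error.
    x_computed = x_true + rng.normal(0.0, error_sd, n)

    return ols_slope(x_true, y), ols_slope(x_computed, y)


slope_clean, slope_noisy = simulate()
print(f"slope using the true predictor:     {slope_clean:.2f}")
print(f"slope using the computed predictor: {slope_noisy:.2f}")
```

Under these assumptions, the slope fitted on the true predictor lands near the true value of 2, while the slope fitted on the computed predictor is pulled toward zero, to roughly half the true value when the error variance equals the predictor's variance. The regression algorithm is identical in both fits; only the quality of the input differs.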

The use of computed or modeled variables is extremely common in the business world. A big chunk of the data we use is actually computed (aka modeled).
