7 Ways to Handle Large Data Files for Machine Learning

Exploring and applying machine learning algorithms to datasets that are too large to fit into memory is pretty common.

This leads to practical questions: how do you load such a file, and how do you keep your tools from running out of memory while working with it?

In this post, I offer some common suggestions you may want to consider.

Some machine learning tools or libraries may be limited by a default memory configuration.

Check if you can re-configure your tool or library to allocate more memory.

A good example is Weka, where you can increase the memory as a parameter when starting the application.
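As a minimal sketch, assuming Weka is started from the command line, you can raise the JVM's maximum heap with the -Xmx flag; the 8 GB heap size and the path to weka.jar below are placeholders to adjust for your installation:

```python
import subprocess

# A minimal sketch: launch Weka with a larger JVM heap.
# -Xmx sets the maximum heap size; the 8g value and the jar path
# are assumptions to adjust for your machine and installation.
subprocess.run([
    "java",
    "-Xmx8g",              # allow the JVM up to 8 GB of heap
    "-jar",
    "/path/to/weka.jar",   # hypothetical install location
])
```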

Are you sure you need to work with all of the data?

Work with a smaller sample of your data, such as the first 1,000 or 100,000 rows, or a true random sample. Use this smaller sample to work through your problem before fitting a final model on all of your data (using progressive data loading techniques).
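As a sketch of what this might look like with pandas (the file name data.csv and the 1% sampling rate are assumptions for illustration):

```python
import random
import pandas as pd

CSV_PATH = "data.csv"  # hypothetical file name

# Option 1: work with just the first 100,000 rows.
head_sample = pd.read_csv(CSV_PATH, nrows=100_000)

# Option 2: an approximate 1% random sample without loading the whole file,
# skipping each data row with 99% probability (row 0 is the header).
random.seed(42)
random_sample = pd.read_csv(
    CSV_PATH,
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)

print(head_sample.shape, random_sample.shape)
```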

I think this is good practice for machine learning in general, as it gives you quick spot-checks of algorithms and fast turnaround of results.

You may also consider performing a sensitivity analysis of model skill against the amount of data used to fit the model. Perhaps there is a natural point of diminishing returns that you can use as a heuristic for the size of your smaller sample.
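One way to sketch such a sensitivity analysis is scikit-learn's learning_curve, which fits a model on increasing fractions of the training data and reports skill at each size; the synthetic dataset, logistic regression model, and accuracy metric below are stand-ins for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; in practice X, y would come from your sampled dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=1)

# Evaluate model skill at increasing amounts of training data.
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3,
    scoring="accuracy",
)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:>6} training rows -> mean CV accuracy {score:.3f}")
```

If the curve flattens well before the full dataset size, that plateau is a reasonable heuristic for how small your working sample can be.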

Do you have to work on your computer?

Perhaps you can get access to a much larger computer with an order of magnitude more memory.

For example, you can rent compute time on a cloud service such as Amazon Web Services, which offers machines with tens of gigabytes of RAM for less than a US dollar per hour.

I have found this approach very useful in the past.

Is your data stored in raw ASCII text, like a CSV file?

Perhaps you can speed up data loading and use less memory by using another data format. A good example is a binary format like GRIB, NetCDF, or HDF.

There are many command-line tools for transforming one data format into another that do not require loading the entire dataset into memory.

Using another format may allow you to store the data in a more compact form that saves memory, such as 2-byte integers or 4-byte floats.
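As a sketch, assuming a CSV of numeric columns, such a conversion to HDF5 with compact dtypes might look like this with pandas (the file names are placeholders, and writing HDF5 requires the tables package):

```python
import numpy as np
import pandas as pd

CSV_PATH = "data.csv"   # hypothetical input file
HDF_PATH = "data.h5"    # hypothetical output file

# Convert the CSV to HDF5 in chunks so the whole file never sits in memory,
# downcasting to compact dtypes along the way (4-byte floats, 2-byte integers).
with pd.HDFStore(HDF_PATH, mode="w") as store:
    for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
        for col in chunk.columns:
            if chunk[col].dtype == np.float64:
                chunk[col] = chunk[col].astype(np.float32)
            elif chunk[col].dtype == np.int64:
                # assumes the values fit in a 2-byte integer
                chunk[col] = chunk[col].astype(np.int16)
        store.append("data", chunk)

# Later, load it back much faster than re-parsing the CSV.
df = pd.read_hdf(HDF_PATH, key="data")
```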
