How to Process a DataFrame with Millions of Rows in Seconds
- by 7wData
TL;DR: process it with a new Python data processing engine in the Cloud.
Data Science is having its renaissance moment. It's hard to keep track of all the new Data Science tools that have the potential to change the way Data Science gets done.
I learned about this new Data Processing Engine only recently in a conversation with a colleague, also a Data Scientist. We had a discussion about Big Data processing, which is at the forefront of innovation in the field, and this new tool popped up.
While pandas is the de facto tool for data processing in Python, it doesn't handle Big Data well. With bigger datasets, you'll get an out-of-memory exception sooner or later.
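One common workaround before reaching for a distributed engine is pandas' own chunked reading: `read_csv` with `chunksize` streams the file in fixed-size pieces instead of loading it all at once. A minimal sketch, using a small in-memory CSV as a stand-in for a file that wouldn't fit in memory:

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a file too big to load at once.
csv_data = io.StringIO(
    "user_id,amount\n"
    "1,10.0\n"
    "2,20.0\n"
    "3,30.0\n"
    "4,40.0\n"
)

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time.
total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["amount"].sum()

print(total)  # 100.0
```

This works well for simple aggregations, but operations that need the whole dataset at once (sorts, joins) are exactly where it breaks down, which is what motivates tools like the ones below.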
Researchers were confronted with this issue a long time ago, which prompted the development of tools like Dask and Spark that try to overcome the "single machine" constraint by distributing processing across multiple machines.
This active area of innovation also brought us tools like Vaex, which try to solve this issue by making processing on a single machine more memory efficient.
And it doesn’t end there. There is another tool for big data processing you should know about …
Terality is a Serverless Data Processing Engine that processes the data in the Cloud. There is no need to manage infrastructure as Terality takes care of scaling compute resources. Its target audiences are Engineers and Data Scientists.
I exchanged a few emails with the Terality team as I was interested in the tool they've developed. They answered swiftly. These were my questions to the team:
My n-th email to the Terality team (screenshot by author)
The Terality team developed a proprietary data processing engine — it's not a fork of Spark or Dask.
The goal was to avoid the shortcomings of Dask: its syntax doesn't fully match pandas, it's asynchronous, it lacks some pandas functions, and it doesn't support auto-scaling.
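Terality's documented pattern, by contrast, is to keep the pandas API and change only the import line (`import terality as pd` in their docs). The sketch below follows that pattern but runs against pandas itself, since the hosted service needs an account and API key; the data is a made-up example:

```python
import io

# Terality's documented pattern is to swap only the import:
#   import terality as pd
# and leave the rest of the pandas code unchanged. We use pandas here.
import pandas as pd

# Hypothetical dataset standing in for a large file on disk or S3.
raw = io.StringIO(
    "city,sales\n"
    "Berlin,3\n"
    "Paris,5\n"
    "Berlin,2\n"
)

df = pd.read_csv(raw)

# The same groupby/aggregate code would run unchanged on a pandas-compatible engine.
per_city = df.groupby("city")["sales"].sum()
print(per_city["Berlin"])  # 5
```

The appeal is that existing pandas pipelines need no rewrite, unlike a port to the Spark or Dask APIs.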
Terality has a free plan with which you can process up to 500 GB of data per month. It also offers a paid plan for companies and individuals with greater requirements.
In this article, we’ll focus on the free plan as it’s applicable to many Data Scientists.
When a user performs a read operation, the Terality client copies the dataset to Terality's secured cloud storage on Amazon S3.
Terality has a strict policy around data privacy and protection. They guarantee that they will not use the data and will process it securely.
Terality is not a storage solution. Your data is deleted at most 3 days after the Terality client session is closed.
Terality processing currently occurs on AWS in the Frankfurt region.
See the security section for more information.
The user needs to have access to the dataset on their local machine, and Terality handles the uploading process behind the scenes.
The upload operation is also parallelized to make it faster.
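To illustrate the idea of a parallelized upload in principle (this is not Terality's actual implementation), a payload can be split into parts and pushed concurrently with a thread pool; `upload_part` is a hypothetical stand-in for a real network call:

```python
from concurrent.futures import ThreadPoolExecutor


def upload_part(part: bytes) -> int:
    # Hypothetical stand-in for uploading one chunk over the network;
    # returns the number of bytes "uploaded".
    return len(part)


def parallel_upload(data: bytes, part_size: int = 4, workers: int = 4) -> int:
    # Split the payload into fixed-size parts and upload them concurrently.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        uploaded = pool.map(upload_part, parts)
    return sum(uploaded)


print(parallel_upload(b"0123456789"))  # 10
```

Because uploads are I/O-bound, threads (rather than processes) are typically enough to overlap the network waits of several parts at once.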