How to “Read” A Million Novels in One Afternoon with Spark

The world communicates in text. Our work lives have us treading waist-deep in email, our hobbies often have blogs, our complaints go to Yelp, and even our personal lives play out in tweets, Facebook updates, and texts. A massive amount of information can be gleaned from all that text, as long as you know how to listen.

Learning from large amounts of text data is less a question of will and more a question of feasibility. How do you “read” the equivalent of thousands or even millions of novels in an afternoon?

Topic models extract the key concepts in a set of documents. Each concept is described by a list of keywords, ordered from most to least important. Each document can then be connected to those concepts, or topics, with a measure of how strongly the document expresses each one.

For example, given a corpus of news stories, topics emerge independent of any one document. One topic may be characterized by terms like “poll,” “campaign,” and “debate,” which an observer will quickly see is a topic about politics and elections. Another topic may be characterized by terms like “euro,” “prime minister,” and “fiscal,” which an observer immediately recognizes as a topic about the European economy. Commonly, a given document is not wholly about a single topic but rather a mix of topics. The topic model outputs a probability that the document is about each possible topic. An analyst interested in how an election may affect the European economy can now isolate the topics of interest and search directly for the documents that contain a mixture of both. In a very short period of time, what started as thousands or millions of documents can be whittled down to the most important few.

To build these topics, an algorithm called Latent Dirichlet Allocation (LDA) is employed. The algorithm must store a vector of term counts for every document, a structure that grows with both the size of the vocabulary and the number of documents. It then iterates over the corpus many times, refining each abstract topic on every pass.
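As a concrete illustration, here is a minimal sketch of fitting an LDA model with Spark's MLlib in PySpark. The input path ("docs.txt"), column names, vocabulary size, topic count, and iteration count are illustrative assumptions, not values from a real project:

```python
# A minimal sketch of fitting LDA with Spark MLlib (PySpark).
# The input path, column names, and parameter values below are
# assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("topic-model-sketch").getOrCreate()

# One document per line; "docs.txt" is a hypothetical path.
docs = spark.read.text("docs.txt").withColumnRenamed("value", "text")

pipeline = Pipeline(stages=[
    # Split raw text into lowercase word tokens.
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
    # Drop common stop words ("the", "and", ...) that carry no topical signal.
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    # Build the per-document term-count vectors LDA operates on.
    CountVectorizer(inputCol="filtered", outputCol="features",
                    vocabSize=10000, minDF=5.0),
    # Fit 20 abstract topics over up to 100 iterations (both assumed values).
    LDA(k=20, maxIter=100, featuresCol="features"),
])

model = pipeline.fit(docs)
```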

Apache Spark is well suited to building a pipeline for this task. It pools memory across many servers to break the problem into parallel pieces, and it exploits the comparative speed of RAM to iterate the algorithm quickly. Because the work can be split into an arbitrary number of smaller pieces and iteration proceeds at speed, Spark handles very large amounts of text data as easily as a single machine handles a moderate amount.
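Continuing the hypothetical pipeline from the sketch above, the learned topics and per-document topic mixtures can be inspected as follows; Spark distributes both the model fitting and this scoring step across the cluster:

```python
# Continuing the sketch above: inspect the learned topics and score documents.
lda_model = model.stages[-1]        # the fitted LDAModel
vocab = model.stages[2].vocabulary  # the CountVectorizerModel's vocabulary

# Top terms per topic: term indices into the vocabulary, most important first.
for row in lda_model.describeTopics(maxTermsPerTopic=5).collect():
    print([vocab[i] for i in row["termIndices"]])

# Per-document topic mixture: a probability vector over the k topics,
# i.e., the probability that each document is about each possible topic.
scored = model.transform(docs)
scored.select("topicDistribution").show(5, truncate=False)
```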

To see how Spark can handle a massive amount of text data, consider the case of a University of Oklahoma doctoral student in political science. She wants to investigate how international politics have changed over the last 20 years, using congressional hearing transcripts from 1995 to 2015.

 
