How to “Read” A Million Novels in One Afternoon with Spark

The world communicates in text. Our work lives have us treading waist-deep in email, our hobbies often have blogs, our complaints go to Yelp, and even our personal lives play out in tweets, Facebook updates, and texts. A massive amount of information can be gleaned from all that text, as long as you know how to listen.

Learning from large amounts of text data is less a question of will and more a question of feasibility. How do you “read” the equivalent of thousands or even millions of novels in an afternoon?

Topic models extract the key concepts in a set of documents. Each concept is described by a list of keywords, ordered from most to least important. Each document can then be connected to those concepts, or topics, with a measure of how strongly the document expresses each one.

For example, given a corpus of news stories, topics emerge independent of any one document. One topic may be characterized by terms like “poll,” “campaign,” and “debate,” which an observer will quickly see is a topic about politics and elections. Another topic may be characterized by terms like “euro,” “prime minister,” and “fiscal,” which an observer immediately recognizes as a topic about the European economy. Commonly, a given document is not wholly about a single topic but rather a mix of topics. The topic model outputs a probability that the document is about each possible topic. An analyst interested in how an election may affect the European economy can now isolate the topics of interest and search directly for the documents that contain a mixture of both. In a very short period of time, what started as thousands or millions of documents can be whittled down to the most important few.

To build these topics, an algorithm called Latent Dirichlet Allocation (LDA) is employed. The algorithm must store a vector of term counts for every document, a structure that grows with both the size of the vocabulary and the number of documents. It then iterates over the corpus many times, refining each abstract topic on every pass.
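As a concrete illustration, here is a minimal sketch of fitting an LDA model with Spark's MLlib in PySpark. The input path ("docs.txt"), column names, vocabulary size, topic count, and iteration count are illustrative assumptions, not values from a real project:

```python
# A minimal sketch of fitting LDA with Spark MLlib (PySpark).
# The input path, column names, and parameter values below are
# assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("topic-model-sketch").getOrCreate()

# One document per line; "docs.txt" is a hypothetical path.
docs = spark.read.text("docs.txt").withColumnRenamed("value", "text")

pipeline = Pipeline(stages=[
    # Split raw text into lowercase word tokens.
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
    # Drop common stop words ("the", "and", ...) that carry no topical signal.
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    # Build the per-document term-count vectors LDA operates on.
    CountVectorizer(inputCol="filtered", outputCol="features",
                    vocabSize=10000, minDF=5.0),
    # Fit 20 abstract topics over up to 100 iterations (both assumed values).
    LDA(k=20, maxIter=100, featuresCol="features"),
])

model = pipeline.fit(docs)
```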

Apache Spark is well suited to building a pipeline for this task. It pools memory across many servers to break the problem into parallel pieces, and it exploits the comparative speed of RAM to iterate the algorithm quickly. Because the work can be split into an arbitrary number of smaller pieces and iteration proceeds at speed, Spark handles very large amounts of text data as easily as a single machine handles a moderate amount.
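Continuing the hypothetical pipeline from the sketch above, the learned topics and per-document topic mixtures can be inspected as follows; Spark distributes both the model fitting and this scoring step across the cluster:

```python
# Continuing the sketch above: inspect the learned topics and score documents.
lda_model = model.stages[-1]        # the fitted LDAModel
vocab = model.stages[2].vocabulary  # the CountVectorizerModel's vocabulary

# Top terms per topic: term indices into the vocabulary, most important first.
for row in lda_model.describeTopics(maxTermsPerTopic=5).collect():
    print([vocab[i] for i in row["termIndices"]])

# Per-document topic mixture: a probability vector over the k topics,
# i.e., the probability that each document is about each possible topic.
scored = model.transform(docs)
scored.select("topicDistribution").show(5, truncate=False)
```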

To see how Spark can handle a massive amount of text data, consider the case of a University of Oklahoma doctoral student in political science. She wants to investigate how international politics have changed over the last 20 years, using congressional hearing transcripts from 1995 to 2015.

 
