Find a Dataset to Launch Your Data Science Project, and Tune Your AI Education
- by 7wData
Once you have decided to explore a career in data science and need a project to get yourself going, the first decision is which dataset to use.
Fortunately, a guide to the best datasets for machine learning has been published in edureka!, written by Disha Gupta, a computer science and technology writer based in India. She notes that without training datasets, machine-learning algorithms would not have a way to learn text mining or text classification. Five to 10 years ago, it was difficult to find datasets for machine learning and data science projects. Today the challenge is not finding data, but finding the relevant data.
Here is an excerpt covering datasets suited to Natural Language Processing projects, which need text data. She recommended:
Enron Dataset – Email data from the senior management of Enron that is organized into folders.
Amazon Reviews – It contains approximately 35 million reviews from Amazon spanning 18 years. Data includes user information, product information, ratings, and review text.
Newsgroup Classification – Collection of almost 20,000 newsgroup documents, partitioned evenly across 20 newsgroups. It is great for practicing topic modeling and text classification.
Quandl: A great source of economic and financial data that is useful to build models to predict stock prices or economic indicators.
WorldBank Open Data: Covers population demographics and many economic and development indicators across the world.
IMF Data: The International Monetary Fund (IMF) publishes data on international finances, foreign exchange reserves, debt rates, commodity prices, and investments.
Two Questions for Your Data Science Project
Once you have selected a dataset, you might need some more suggestions for getting your project off the ground. First, ask yourself two questions, suggests a recent article in Data Science Weekly: How would you make some money with it? And how would you save some money with it?
The answers will help you focus on what is important and useful when looking at your data. You will often find that before you get to the modeling or serious math, you may have to work through problems with the data, such as missing, erroneous or biased data.
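A first pass over data problems like these can be as simple as counting what is missing or implausible before any modeling. The sketch below is one minimal way to do that with Python's standard library. The sample rows, the column names, and the `audit_rows` helper are all hypothetical, made up for illustration, and the assumed valid rating range of 1 to 5 is an assumption rather than a property of any dataset named above.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real review dataset.
# Column names and values are illustrative only.
SAMPLE_CSV = """user_id,product_id,rating,review_text
u1,p1,5,Great product
u2,p1,,Stopped working after a week
u3,p2,7,Okay
u4,p2,4,
u5,p2,4,Does the job
"""


def audit_rows(rows, rating_field="rating", valid_ratings=range(1, 6)):
    """Count missing values per column, invalid ratings, and the rating distribution."""
    missing = Counter()
    invalid_ratings = 0
    distribution = Counter()
    for row in rows:
        # Missing data: empty or absent cells, counted per column.
        for field, value in row.items():
            if value is None or value.strip() == "":
                missing[field] += 1
        # Erroneous data: ratings that are not integers in the assumed range.
        raw = (row.get(rating_field) or "").strip()
        if not raw:
            continue
        try:
            rating = int(raw)
        except ValueError:
            invalid_ratings += 1
            continue
        if rating in valid_ratings:
            distribution[rating] += 1
        else:
            invalid_ratings += 1
    return {
        "missing": dict(missing),
        "invalid_ratings": invalid_ratings,
        "rating_distribution": dict(distribution),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = audit_rows(rows)
print(report)
```

The rating distribution is included because a heavily skewed one is a cheap early hint of possible bias, though it does not prove bias on its own. For a real project, the same function could be pointed at rows read from a file instead of the inline sample.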