Data Analysis using SQL
- by 7wData
SQL is a key cog in a data science professional’s armory. I’m speaking from experience – you simply cannot expect to carve out a successful career in either analytics or data science if you haven’t yet picked up SQL.
And why is SQL so important?
As we move into a new decade, the rate at which we are producing and consuming data is skyrocketing by the day. To make smart decisions based on data, organizations around the world are hiring data professionals like business analysts and data scientists to mine and unearth insights from the vast treasure trove of data.
And one of the most important tools required for this is – you guessed it – SQL!
Structured Query Language (SQL) has been around for decades. It is a programming language used for managing the data held in relational databases. SQL is used all around the world by a majority of big companies. A data analyst can use SQL to access, read, manipulate, and analyze the data stored in a database and generate useful insights to drive an informed decision-making process.
In this article, I will be discussing 8 SQL techniques/queries that will make you ready for any advanced data analysis problems. Do keep in mind that this article assumes a very basic knowledge of SQL.
I would suggest checking out the below courses if you’re new to SQL and/or business analytics:
What is the best way to learn data analysis? By performing it side by side on a dataset! For this purpose, I have created a dummy dataset of a retail store. The customer data table is represented by ConsumerDetails.
Our dataset consists of the following columns:
Note:- I will be using MySQL 5.7 for going forward in the article. You can download it from here – My SQL 5.7 Downloads.
We will begin our analysis with the simplest query, i.e, counting the number of rows in our table. We will do this by using the function – COUNT().
Great! Now we know the number of rows in our table which is 10. It may seem to be funny using this function on a small test dataset but it can help a lot when your rows run into the millions!
A lot of times, our data table is filled with duplicate values. To attain the unique value, we use the DISTINCT function.
In our dataset, how can we find the unique industries that customers belong to?
You guessed it right. We can do this by using the DISTINCT function.
You can even count the number of unique rows by using the count along with distinct. You can refer to the below query:
Aggregation functions are the base of any kind of data analysis. They provide us with an overview of the dataset. Some of the functions we will be discussing are – SUM(), AVG(), and STDDEV().
We use the SUM() function to calculate the sum of the numerical column in a table.
Let’s find out the sum of the amount spent by each of the customers:
In the above example, sum_all is the variable in which the value of the sum is stored. The sum of the amount of money spent by consumers is Rs. 12,560.
To calculate the average of the numeric columns, we use the AVG() function. Let’s find the average expenditure by the consumers for our retail store:
The average amount spent by customers in the retail store is Rs. 1256.
If you have looked at the dataset and then the average value of expenditure by the consumers, you’ll have noticed there’s something missing. The average does not quite provide the complete picture so let’s find another important metric – Standard Deviation. The function is STDDEV().
The standard deviation comes out to be 829.7 which means there is a high disparity between the expenditures of consumers!
The next type of analysis is to identify the extreme values which will help you understand the data better.
The maximum numeric value can be identified by using the MAX() function. Let’s see how to apply it:
The maximum amount of money spent by the consumer in the retail store is Rs. 3000.
[Social9_Share class=”s9-widget-wrapper”]
Upcoming Events
From Text to Value: Pairing Text Analytics and Generative AI
21 May 2024
5 PM CET – 6 PM CET
Read More