Introduction
This mini case study is part of the Data with Danny Serious SQL course. Danny Ma's designed this case study to be a lesson in debugging SQL code.
Problem
The GM of Health Analytics has asked us to answer a few questions for an upcoming board meeting and they want to find out who their most active users are.
Before answering the business questions that will be asked of us, it's best to have a look at the dataset to get an idea of its shape, column names, and the types of values that are in each column.
Based on the output from above we can see that the fields we have are:
ID: unique identifier for a client.
log_date: a date.
measure: type of measure being taken.
measure_value: the value for the measure that has been taken.
systolic: a measurement that goes into calculating blood pressure.
diastolic: another measurement involved in calculating blood pressure.
I want to know which unique values are in the measures column.
It looks like we have 3 main measures which are:
blood_glucose
blood_pressure
weight
Lastly, I want to know what the dimensions of the table are.
Now that I know the shape of the data, the type of values in each column, and the number of rows in the dataset I'm going to try answer the business questions that the GM has asked us to address.
How many unique users exist in the logs_dataset? The initial query was:
When looking at the query and the column names earlier, I noticed that the reason the query wasn't running was because there is no column called user_id in the dataset. To fix this we can run the query:
For the next few questions, we needed to create a temporary table.
2. How many total measurements do we have per user on average?
Initial Query:
Right away I noticed that the reason the query didn't execute because there is no function called MEAN, but rather the mean is calculated using the AVG() function.
OUTPUT:
This first question is helping us identify how many measurements an active user might have and we found that there were around 79 measurements per user on average.
3. What about the median number of measurements per user?
Initial Query:
The above query doesn't work because PERCENTILE_CONTINUOUS isn't the name of the function, additionally, it doesn't make sense to ORDER BY the id, but rather the measure_count. The query below will get the correct output.
OUTPUT:
As you can see the median number of measurements per user is significantly different than the mean amount of measurments.
4. How many users have 3 or more measurements?
Initial query:
This query wouldn't execute because of the presence of the HAVING clause. A having clause is used to query on groups. Using a WHERE call will filter the data as we want it.
OUTPUT:
It looks like there are 209 users who have over 3 measurements.
5. How many users have 1,000 or more measurements ?
Initial Query:
This query doesn't produce the required output because of the SUM call and additionally, what purpose is there for the ID column in this query. We want the count for the users who have over 1,000 measurements so we should use a COUNT(*) call.
OUTPUT:
In the case of this call query it looks like we have 5 users who have over 1000 measure counts.
6. How many users have logged blood glucose measurements?
Initial Query
This query doesn't run for several reasons. The DISTINCT call need to be inside parentheses and measure needs to be followed by an = sign along with the string blood_glucose.
OUTPUT
We have 325 users who have logged blood glucose measurements.
7. How many users have at least two types of measurements?
Initial Query
The COUNT(DISTINCT measures) call doesn't exist in the table. We provided that call with the lable unique_measures when we created out temporary table earlier.
OUTPUT:
It looks like we have 204 people who have had at least 2 different types of measurement on their health.
8. Have all 3 measures- blood glucose, weight, and blood pressure?
Initial Query
This query doesn't execute because there is a typo. It should be user_measure_count not usr_measure_count.
OUTPUT
50 users have recorded the 3 different types of measurements our company offers.
9. What is the median systolic/diastolic blood pressure values?
Initial Query:
This query won't run because the WITHIN clause is missing the latter half of the call which is a GROUP call. It should be WITHIN GROUP. Additionally, blood pressure should be passed in as a string like this blood_pressure.
OUTPUT:
It looks like the median systolic value was 126 and the median diastolic value was 79.
Final Findings
We found that there were 554 unique users in the dataset.
The average user had around 79 measurements.
The median amount of measurement per user was 2.
209 users had more than 3 measurements.
5 users had more than 1000 measurements.
325 users have had a blood glucose measurement.
204 users have had at least 2 different types of measurement.
50 users had all 3 measurements.
The median blood pressure values were 129 systolic and 79 diastolic.
Conclusions
This mini case study allowed for me to practically apply some of the SQL functions I've started to learn. Additionally, it gave me some practice with
Debugging code
Sorting/Filtering Data
Applying Summary Statistics
Using CTE's and TEMP TABLES
You can see the GitHub and code snippets here.
Acknowledgements
The layout for this project was largely inspired by Abe Diaz's fantastic GitHub.
Comments