Introduction
Hello everyone! As someone who has started to become more interested in using data science for social good (I currently work as a data analyst solving early childhood problems) I wanted to explore some of the questions that can be answered in the Healthcare Industry. To do this, I'm going to use MySQL and some queries that will explore some of the operational challenges for hospitals.
I'll explore demographic information, but I'll also have a look at the typical length of stay on a patient level. Length of stay is particularly important because when combined with other factors like the nature of the visit or the disposition(outcome) of the visit it provides insights on a hospital's level of efficiency and from another perspective insurances companies allocate a set amount of money per patient by type of care. If a patient overstays their projected time it could be detrimental to hospital costs and insurance claims.
As the American healthcare system continues to move towards a value-based care model, I'm going to look at the dispositions along with lab procedures, and location of visit to determine whether or not patients are going through the right channels of care. A large issue in the American healthcare system related to costs is patients going into the ER for services that wouldn't require. This stems from a lack of education on the healthcare system and is what is prompting a shift to a value-based care model.
Data
The dataset originates from the UCI Machine Learning Repository and can be downloaded there or on Kaggle.
The dataset covers 10 years of clinical data from over 130 US hospitals. It has features related to patient and hospital outcomes. The columns that I'll be honing in on are:
Patient number
Race
Time in hospital
admission_type_id (channel of care)
admission_source_id
Specialty of admitting physician
Number of lab tests
Number of visits
During the course of the project I want to answer the questions below:
How long are patients staying in the hospital?
Is there a racial bias in length of stay?
What is length of stay by way of service?
Is there a racial bias in number of lab procedures patients receive?
What is the type of visit for patients overall?
Do more lab procedures mean longer stays?
How efficiently is the hospital operating? Do patients get discharged faster than the average length of stay?
From the COO, who are the top 50 patients with the most prescribed medications, and use lab procedures to tie break.
Fast Insights:
The average length of stay was 4.4 days and there really weren't substantial significant differences between races.
Patients who entered as an inpatient at a trauma center or urgent care typically had the longest length of stays.
There didn't seem to be any racial bias on length of stay.
There was a unique interaction with the average procedures by type and the average stay length by procedure. When looking at the table there looks to be a slight positive correlation between procedures and length of stay. However, I think the type of procedure largely determines the length of stay. Surgeries typically resulted in longer stays as they usually require a longer recovery time.
When it came to stay lengths by channel of care, Trauma center stays seemed to be the longest. This makes sense as these are usually life-threatening situations.
In terms of length of stay, it can be said that there quite a few instances where the patient was discharged faster than the average length of stay, which means the hospital is using the right channels of care a lot of the time.
Deep Dive Insights (Process)
Demographics
The deep dive into the insights above will revolve around the hospital's efficiency in operations. The length of stay is an important indicator of a hospital's efficiency and a look at outcomes on a patient level as well. Earlier we said that the average length of stay for this data set was 4.4 days. To view this, I segmented the data by race and used a query grab the average length of stay by race.
Query
Output
According to the table above there weren't significant differences in length of stay by race. Additionally, the amount of medication administered and the average amount of lab procedures administered by race were roughly around the same. To add a little more granularity, Asian patients seemed to have the least amount of lab procedures, had the shortest stays, and took the least amount of medication indicating that this demographic has higher baseline for health. This could be to a better understanding of the healthcare system, a healthier lifestyle, and just better health in general.
Looking more critically at patient stays as a whole, I decided to use MySQL to visualize some data in the form of a histogram. Now, I know what you're saying MySQL isn't really a data visualization software, but we are going to use the RPAD function along with the '*' symbol to create a bar for our histograms and bucket them based on the number of days a patient stays.
Histogram query
Histogram output
Based on the histogram above we can see that most patients get out of the hospital faster than the national average of 5.5 days. Focusing more on the data within the dataset, the majority of patients stay less than 7 days. What we can gain from these insights is that overall the hospital is operating relatively efficiently.
Since the average profile of stay is on the shorter end, this could be a sign for hospital leadership to focus on the longer admissions and understanding the nature of these visits. Which is exactly what we'll look at next. I want to have a look at the type of stays and the effect those paired with number of procedures has on the overall stay time.
Length of Stay by Procedure
After looking at some demographic-level information related to the length of stay I wanted to have a look at the general length of stay by medical specialty and the counts as well for each stay.
Clearly there seem to be some issues with the way this data was collected and aggregated first of all. There are several procedures where there is only a count of 1. I want to know on a macro level the amount of procedures and the average length of stay, so I need to filter the data to include medical specialty procedures with a count of more than 50 and where the average number of procedures we had is more than 1 as well. I decided to set that additional parameter at 2.5 average procedures by medical specialty. I did this using a HAVING clause on the aggregated information. I also ordered this information by the most most average procedures and limited it to the top 20 procedures if necessary.
These parameters narrowed our findings down to surgeries and radiology labs. The surgeries were typically heart-related surgeries and these tended to have longer visits. Thoracic surgeries are surgeries that tend to deal with the chest and as you can see cardiovascular or heart related surgeries fell into this category as well.
Again, from a different view we can see the interaction between the number of procedures and the length of stay in the hospital. However, this correlation doesn't mean causation. We know that there is a correlation between the number of labs and the length of stay, so drilling down on this interaction could be useful and a point of analysis in a visualization for future projects.
Length of Stay by Channel of Care
I also wanted to have a look at the length of stay by the type of facility where clients checked in at. To do this I had to reformat the facility to codes into categories using a CASE WHEN statement.
Having a look at this information we can see that the longer stays tended to happen with people who came in to a trauma center or urgent care. There was a NULL field from which I wasn't able to decipher from the data dictionary. However, it seems like emergency procedures whether from trauma or something else resulted in longer stays.
Number of Procedures (categories) by Length of Stay
Next, let's summarize the length of stay by creating categories based on the number of procedures. We know that number of procedures typically has a positive correlation with the length of stay, but let's group this information based on the number of procedures and look at the average length of stay for those.
Using a CASE WHEN statement I was able to summarize the data based on the lab procedures meeting certain conditions. I chose the category "few" to represent when a case had at least zero and less than 25 cases, "average" to represent when a client had at least 25 procedures and less than 55 procedures, and finally used "many" to categorize instances when someone had more than 55 procedures. From here we could see that as the number of procedures increased the length of stay in the hospital increased.
Outcomes
Finally, I wanted to have a look at the readmission rates for patients. To do this, I wanted to compile a list of the hospital success stories where patients were discharged faster than the average amount of time spent in the hospital 4.4 days. I utilized a CTE along with a subquery as a filter to grab this information.
The filter on admission_type_id is for people who returned to the hospital via the ER. This means that those who did return were able to be discharged relatively quickly. A future route for hospital administrators is to dive deeper into these length entries and determine the dispositions/outcomes for the clients. Did the client use the ER properly? Should they have used a different channel of care like the urgent care? These are questions and areas that can improve client health and drive down the cost of care as well because clients are going through the right channels.
Client Profile
Finally, I wanted to use the CONCAT function, along with some case statements to build a client profile based on their demographic information, their patient id, the medication they've taken, and the number of lab procedures they've received.
Query
Client Profile Output
Now we have a clean way of grabbing client information, their readmission status, their race, the number of medications they've received, and the number of lab procedures as well.
Recommendations
Overall, this hospital is doing well with keeping the average length of stay low for the patients they serve. The patients represented in this were diabetic patients.
Further diving into the nature of their visits could help hospital admin determine the reason for longer stays, potential misuses of the emergency room, and recreate a focus on value-based care.
What systems are in place to ensure that client is using the right channel of care? Are there 24/7 ER/Urgent Care clinics in place? These types of facilities have been instrumental in driving down the cost of care.
Next Steps
Exploration with visualization in Tableau will be part 2 of this project. I'll use the client list that was created to have a look at some of the interactions between a variety of fields in Tableau.
Again, I'd like to say thank you for everyone who read this. If there are any healthcare analytic professionals in my network I'd really appreciate any feedback for future projects. Additionally, if you haven't take a look at my portfolio website which includes projects in Python, SQL, and I'll be migrating this along with the Excel Analysis of Door Dash to that website.
댓글