Introduction
I will analyze a Kaggle dataset of about 100k medical appointments in Brazil, examine its characteristics, and look for the main factors associated with patients not showing up.
I will also document the analysis process, with visualizations to explain the findings and support the conclusions drawn.
Tool: Jupyter Notebook (Python)
Programming library: pandas, numpy, matplotlib, seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The question: which variables may be related to whether a patient keeps an appointment?
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head(50)
df.info()
df.info() shows no null values in any column
df.duplicated().sum()
there are no duplicated rows
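The two checks above can be folded into one small helper so they are easy to rerun after each cleaning step. This is a minimal sketch on an invented toy frame; the helper name quality_report and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def quality_report(frame):
    """Count missing cells and fully duplicated rows in a DataFrame."""
    return {
        "missing_cells": int(frame.isnull().sum().sum()),
        "duplicated_rows": int(frame.duplicated().sum()),
    }

# Toy data for illustration only; not taken from the appointments file.
toy = pd.DataFrame({
    "Age": [23, 45, None, 45],
    "No-show": ["No", "Yes", "No", "Yes"],
})

report = quality_report(toy)
print(report)
```

On the real data, quality_report(df) would be expected to report zero for both entries if the notes above hold.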
df.describe()
Data Cleaning
We don’t need PatientId or AppointmentID, so let’s drop them,
correct the misspelled column name Hipertension,
and change the '-' in No-show to '_' to be consistent with the other column names
df.drop(['PatientId', 'AppointmentID'], axis=1, inplace=True)
df.rename(columns={"Hipertension": "Hypertension", "No-show": "No_show"}, inplace=True)
df.head()
df.groupby("No_show").Alcoholism.value_counts()
Among patients who didn’t show up, non-alcoholic patients far outnumber alcoholic ones. I’ll calculate the proportions to be more precise.
Alcoholic_Proportion = 677/(21642+677)
Non_Alcoholic_Proportion = 21642/(21642+677)
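The two counts above are copied by hand from the groupby output. A minimal sketch of getting the same kind of shares directly from the data is shown below on an invented toy frame; the toy values are assumptions for illustration, and only the column names follow the renamed dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from the appointments file.
toy = pd.DataFrame({
    "No_show": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Alcoholism": [1, 0, 0, 0, 0, 0, 1, 0],
})

# Keep only the patients who missed their appointment.
no_shows = toy[toy["No_show"] == "Yes"]

# Share of each alcoholism status among the no-shows.
# reindex guarantees both statuses are present even if one never occurs.
shares = (
    no_shows["Alcoholism"]
    .value_counts(normalize=True)
    .reindex([0, 1], fill_value=0.0)
)

non_alcoholic_share = shares.loc[0]
alcoholic_share = shares.loc[1]
print(alcoholic_share, non_alcoholic_share)
```

Replacing toy with df should give the same proportions as the hand-typed counts, without copying the numbers manually.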
now we will visualize the proportions
def bar(A, B, A_height, B_height, xlabel, ylabel, title):
    # Draw a two-bar chart: labels A and B with heights A_height and B_height.
    plt.bar([A, B], [A_height, B_height])
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel);
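# Pass the labels by keyword so the title and axis labels land in the right place.
bar("Alcoholic_Prop", "Non_Alcoholic_Prop", Alcoholic_Proportion, Non_Alcoholic_Proportion,
    xlabel="Alcoholism status", ylabel="Proportion",
    title="Share of no-show patients by alcoholism status")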
As expected, the proportions tell the same story as the raw counts: non-alcoholic patients make up about 97% of the no-shows and alcoholic patients about 3%.
Another visualization with pie chart
labels = 'Alcoholic_Prop', 'Non_Alcoholic_Prop'
sizes = [Alcoholic_Proportion, Non_Alcoholic_Proportion]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal');
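From these visualizations there is no sign of a correlation between alcoholism and missing an appointment: alcoholic patients are a small share of the no-shows, which is what we would expect if they are also a small minority of all patients.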
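The proportions above describe who makes up the no-show group. A complementary check, not part of the original analysis, is the no-show rate inside each alcoholism group, which does not depend on how large each group is. The sketch below uses pd.crosstab with row normalization on an invented toy frame; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration only; not taken from the appointments file.
toy = pd.DataFrame({
    "Alcoholism": [0, 0, 0, 0, 0, 0, 1, 1],
    "No_show": ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No"],
})

# Each row sums to 1: the "Yes" column is the no-show rate within that group.
table = pd.crosstab(toy["Alcoholism"], toy["No_show"], normalize="index")

rate_non_alcoholic = table.loc[0, "Yes"]
rate_alcoholic = table.loc[1, "Yes"]
print(table)
```

If the two rates are close on the real data, that would support the reading that alcoholism is not related to missing an appointment.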
What about the relationship between age and keeping the appointment?
df.groupby('No_show').Age.mean()
df.groupby('No_show').Age.describe()
There seems to be no big difference in age between the two groups. Let’s look at it in a visualization.
df.Age[df.No_show=="No"].hist(alpha=.5,label="show up")
df.Age[df.No_show=='Yes'].hist(alpha=.5,label="didn't show up")
plt.legend();
The age distributions of the two groups look nearly the same, except that the number of patients who showed up is larger.
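Overlaid histograms of raw counts are dominated by the larger show-up group, so small differences can be hard to see. One possible complement, sketched here on invented toy data, is to bin the ages and compare the no-show rate per bin. The bin edges, the labels and the age_group column name are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data for illustration only; not taken from the appointments file.
toy = pd.DataFrame({
    "Age": [5, 12, 25, 31, 47, 52, 68, 80],
    "No_show": ["No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes"],
})

# Hypothetical age bands; right=False makes each band include its lower edge.
bins = [0, 18, 40, 65, 120]
labels = ["0-17", "18-39", "40-64", "65+"]
toy["age_group"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# True where the appointment was missed; the mean of booleans is the rate.
missed = toy["No_show"].eq("Yes")
rate_by_age = missed.groupby(toy["age_group"], observed=False).mean()
print(rate_by_age)
```

Ages outside the chosen bin edges become missing in age_group and are left out of the rates, so the edges should be checked against df.Age.describe() before using this on the real data.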
Next we investigate whether receiving an SMS is related to showing up for the appointment.
df.groupby('No_show').SMS_received.value_counts()
It looks like a large part of the patients who didn’t show up for their appointment hadn’t received an SMS.
Let’s calculate the proportions to be more precise and then draw a visualization.
No_SMS_prop=12535/(9784+12535)
SMS_prop = 9784/(12535+9784)
# Pass the labels by keyword so the title and axis labels land in the right place.
bar("SMS_received_Prop", "NO_SMS_received_Prop", SMS_prop, No_SMS_prop,
    xlabel="SMS_receiving", ylabel="Proportion",
    title="Share of no-show patients by SMS status")
The bar chart shows that, among patients who missed their appointment, the proportion who didn’t receive an SMS (about 56%) is greater than the proportion who did (about 44%).
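These proportions are shares within the no-show group only. To judge whether an SMS is associated with attendance, it can also help to compare the no-show rate of patients who received an SMS with the rate of those who did not. The sketch below does this on an invented toy frame; the helper name no_show_rate_by and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def no_show_rate_by(frame, column, target="No_show", positive="Yes"):
    """Return the fraction of missed appointments for each value of `column`."""
    missed = frame[target].eq(positive)
    return missed.groupby(frame[column]).mean()

# Toy data for illustration only; not taken from the appointments file.
toy = pd.DataFrame({
    "SMS_received": [0, 0, 0, 0, 1, 1, 1, 1],
    "No_show": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"],
})

rates = no_show_rate_by(toy, "SMS_received")

# Positive difference: the SMS group misses more often; negative: less often.
difference = rates.loc[1] - rates.loc[0]
print(rates)
print(difference)
```

Calling no_show_rate_by(df, "SMS_received") on the real data would show in which direction, if any, the rates differ.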
Conclusions
From my analysis, showing up for an appointment does not seem to be related to age or alcoholism, but there may be a positive correlation with receiving an SMS.
Limitations
There are no obvious limitations.