Project: Medical appointments data analysis¶

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction¶

I will analyze a kaggle dataset containing 100k medical appointments in Brazil, to examine the data characteristics and deduce what are the major factors behind the no show of patients

and Documente the analysis process along with visualizations to explain the findings and support the conclusions drawn.

Tool: Jupyter Notebook (Python)

Programming library: pandas, numpy, matplotlib, seaborn

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

we are trying to know which variable may be related to appointment commitment

Data Wrangling¶

General Properties¶

In [2]:

df = pd.read_csv(noshowappointments-kagglev2-may-2016.csv')
df.head(50)

Out[2]:

	PatientId	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Scholarship	Hipertension	Diabetes	Alcoholism	SMS_received	No-show
0	2.987250e+13	5642903	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	0	1	0	0	0	No
1	5.589978e+14	5642503	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	0	0	0	No
2	4.262962e+12	5642549	F	2016-04-29T16:19:04Z	2016-04-29T00:00:00Z	62	MATA DA PRAIA	0	0	0	0	0	No
3	8.679512e+11	5642828	F	2016-04-29T17:29:31Z	2016-04-29T00:00:00Z	8	PONTAL DE CAMBURI	0	0	0	0	0	No
4	8.841186e+12	5642494	F	2016-04-29T16:07:23Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	1	1	0	0	No
5	9.598513e+13	5626772	F	2016-04-27T08:36:51Z	2016-04-29T00:00:00Z	76	REPÚBLICA	0	1	0	0	0	No
6	7.336882e+14	5630279	F	2016-04-27T15:05:12Z	2016-04-29T00:00:00Z	23	GOIABEIRAS	0	0	0	0	0	Yes
7	3.449833e+12	5630575	F	2016-04-27T15:39:58Z	2016-04-29T00:00:00Z	39	GOIABEIRAS	0	0	0	0	0	Yes
8	5.639473e+13	5638447	F	2016-04-29T08:02:16Z	2016-04-29T00:00:00Z	21	ANDORINHAS	0	0	0	0	0	No
9	7.812456e+13	5629123	F	2016-04-27T12:48:25Z	2016-04-29T00:00:00Z	19	CONQUISTA	0	0	0	0	0	No
10	7.345362e+14	5630213	F	2016-04-27T14:58:11Z	2016-04-29T00:00:00Z	30	NOVA PALESTINA	0	0	0	0	0	No
11	7.542951e+12	5620163	M	2016-04-26T08:44:12Z	2016-04-29T00:00:00Z	29	NOVA PALESTINA	0	0	0	0	1	Yes
12	5.666548e+14	5634718	F	2016-04-28T11:33:51Z	2016-04-29T00:00:00Z	22	NOVA PALESTINA	1	0	0	0	0	No
13	9.113946e+14	5636249	M	2016-04-28T14:52:07Z	2016-04-29T00:00:00Z	28	NOVA PALESTINA	0	0	0	0	0	No
14	9.988472e+13	5633951	F	2016-04-28T10:06:24Z	2016-04-29T00:00:00Z	54	NOVA PALESTINA	0	0	0	0	0	No
15	9.994839e+10	5620206	F	2016-04-26T08:47:27Z	2016-04-29T00:00:00Z	15	NOVA PALESTINA	0	0	0	0	1	No
16	8.457439e+13	5633121	M	2016-04-28T08:51:47Z	2016-04-29T00:00:00Z	50	NOVA PALESTINA	0	0	0	0	0	No
17	1.479497e+13	5633460	F	2016-04-28T09:28:57Z	2016-04-29T00:00:00Z	40	CONQUISTA	1	0	0	0	0	Yes
18	1.713538e+13	5621836	F	2016-04-26T10:54:18Z	2016-04-29T00:00:00Z	30	NOVA PALESTINA	1	0	0	0	1	No
19	7.223289e+12	5640433	F	2016-04-29T10:43:14Z	2016-04-29T00:00:00Z	46	DA PENHA	0	0	0	0	0	No
20	6.222575e+14	5626083	F	2016-04-27T07:51:14Z	2016-04-29T00:00:00Z	30	NOVA PALESTINA	0	0	0	0	0	Yes
21	1.215484e+13	5628338	F	2016-04-27T10:50:45Z	2016-04-29T00:00:00Z	4	CONQUISTA	0	0	0	0	0	Yes
22	8.632298e+14	5616091	M	2016-04-25T13:29:16Z	2016-04-29T00:00:00Z	13	CONQUISTA	0	0	0	0	1	Yes
23	2.137540e+14	5634142	F	2016-04-28T10:27:05Z	2016-04-29T00:00:00Z	46	CONQUISTA	0	0	0	0	0	No
24	8.734858e+12	5641780	F	2016-04-29T14:19:19Z	2016-04-29T00:00:00Z	65	TABUAZEIRO	0	0	0	0	0	No
25	5.819370e+12	5624020	M	2016-04-26T15:04:17Z	2016-04-29T00:00:00Z	46	CONQUISTA	0	1	0	0	1	No
26	2.578785e+10	5641781	F	2016-04-29T14:19:42Z	2016-04-29T00:00:00Z	45	BENTO FERREIRA	0	1	0	0	0	No
27	1.215484e+13	5628345	F	2016-04-27T10:51:45Z	2016-04-29T00:00:00Z	4	CONQUISTA	0	0	0	0	0	No
28	5.926172e+12	5642400	M	2016-04-29T15:48:02Z	2016-04-29T00:00:00Z	51	SÃO PEDRO	0	0	0	0	0	No
29	1.225776e+12	5642186	F	2016-04-29T15:16:29Z	2016-04-29T00:00:00Z	32	SANTA MARTHA	0	0	0	0	0	No
30	3.428156e+11	5628068	F	2016-04-27T10:24:52Z	2016-04-29T00:00:00Z	46	NOVA PALESTINA	0	0	0	0	0	No
31	3.112849e+11	5628907	M	2016-04-27T12:07:14Z	2016-04-29T00:00:00Z	12	NOVA PALESTINA	1	0	0	0	0	Yes
32	5.288356e+13	5637908	M	2016-04-29T07:30:27Z	2016-04-29T00:00:00Z	61	SÃO CRISTÓVÃO	0	1	0	0	0	No
33	7.653517e+12	5616921	F	2016-04-25T15:01:04Z	2016-04-29T00:00:00Z	38	SÃO CRISTÓVÃO	1	0	0	0	1	No
34	1.999976e+13	5637963	F	2016-04-29T07:34:30Z	2016-04-29T00:00:00Z	79	SÃO CRISTÓVÃO	0	1	0	0	0	No
35	7.816264e+13	5637968	M	2016-04-29T07:34:42Z	2016-04-29T00:00:00Z	18	SÃO CRISTÓVÃO	0	0	0	0	0	No
36	7.298459e+13	5637975	F	2016-04-29T07:35:19Z	2016-04-29T00:00:00Z	63	SÃO CRISTÓVÃO	0	1	1	0	0	No
37	1.578132e+12	5637986	F	2016-04-29T07:35:56Z	2016-04-29T00:00:00Z	64	TABUAZEIRO	1	1	1	0	0	No
38	5.873316e+12	5609446	M	2016-04-20T15:54:18Z	2016-04-29T00:00:00Z	85	SÃO CRISTÓVÃO	0	1	0	0	1	No
39	1.455623e+13	5639644	F	2016-04-29T09:21:13Z	2016-04-29T00:00:00Z	59	SÃO CRISTÓVÃO	0	0	0	0	0	No
40	9.968684e+14	5635881	F	2016-04-28T14:14:16Z	2016-04-29T00:00:00Z	55	TABUAZEIRO	0	0	0	0	0	No
41	8.224325e+14	5633339	F	2016-04-28T09:20:36Z	2016-04-29T00:00:00Z	71	MARUÍPE	0	0	1	0	0	No
42	2.596543e+13	5632906	F	2016-04-28T08:34:54Z	2016-04-29T00:00:00Z	50	SÃO CRISTÓVÃO	0	0	0	0	0	No
43	7.155896e+13	5641620	F	2016-04-29T14:02:16Z	2016-04-29T00:00:00Z	49	SÃO CRISTÓVÃO	0	1	0	0	0	No
44	2.741649e+11	5635414	F	2016-04-28T13:27:27Z	2016-04-29T00:00:00Z	78	SÃO CRISTÓVÃO	0	1	1	0	0	Yes
45	4.982379e+12	5635842	F	2016-04-28T14:11:24Z	2016-04-29T00:00:00Z	31	SÃO CRISTÓVÃO	0	0	0	0	0	No
46	1.379437e+11	5615608	M	2016-04-25T12:44:36Z	2016-04-29T00:00:00Z	58	SÃO CRISTÓVÃO	0	1	0	1	1	No
47	5.894585e+11	5633116	F	2016-04-28T08:51:26Z	2016-04-29T00:00:00Z	39	MARUÍPE	0	1	1	0	0	No
48	8.545415e+12	5618643	F	2016-04-26T07:19:49Z	2016-04-29T00:00:00Z	58	SÃO CRISTÓVÃO	0	0	0	0	1	Yes
49	9.223559e+13	5534656	F	2016-03-31T17:11:17Z	2016-04-29T00:00:00Z	27	GRANDE VITÓRIA	0	0	0	0	1	Yes

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

there are no null values

In [9]:

df.duplicated().sum()

Out[9]:

there are no duplicated entries

In [3]:

df.describe()

Out[3]:

	PatientId	AppointmentID	Age	Scholarship	Hipertension	Diabetes	Alcoholism	Handcap	SMS_received
count	1.105270e+05	1.105270e+05	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000
mean	1.474963e+14	5.675305e+06	37.088874	0.098266	0.197246	0.071865	0.030400	0.022248	0.321026
std	2.560949e+14	7.129575e+04	23.110205	0.297675	0.397921	0.258265	0.171686	0.161543	0.466873
min	3.921784e+04	5.030230e+06	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	4.172614e+12	5.640286e+06	18.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	3.173184e+13	5.680573e+06	37.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	9.439172e+13	5.725524e+06	55.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
max	9.999816e+14	5.790484e+06	115.000000	1.000000	1.000000	1.000000	1.000000	4.000000	1.000000

Data Cleaning¶

we don’t need PatientId or AppointmentID so let’s drop it
correct the misspelling in hipertension
and change ‘-‘ in No-show to ‘_’ to be more consistent

In [13]:

df.drop(['PatientId', 'AppointmentID'],axis = 1 , inplace = True)

In [14]:

df.rename(columns={"Hipertension":"Hypertension","No-show":"No_show"},inplace=True)

df.head()

Out[14]:

	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Hypertension	Diabetes	No_show
0	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	1	0	No
1	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	No
2	F	2016-04-29T16:19:04Z	2016-04-29T00:00:00Z	62	MATA DA PRAIA	0	0	No
3	F	2016-04-29T17:29:31Z	2016-04-29T00:00:00Z	8	PONTAL DE CAMBURI	0	0	No
4	F	2016-04-29T16:07:23Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	1	1	No

Exploratory Data Analysis¶

Do alcoholic patients are more likely to miss their appointment ?

In [15]:

df.groupby("No_show").Alcoholism.value_counts()

Out[15]:

No_show  Alcoholism
No       0             85525
         1              2683
Yes      0             21642
         1               677
Name: Alcoholism, dtype: int64

it seems that number of non alcoholic patients who didn’t show up is much more than alcoholic I’ll calculate the proportions to be more precisely

In [16]:

Alcoholic_Proportion = 677/(21642+677)
Non_Alcoholic_Proportion = 21642/(21642+677)

now we wil use visualization

In [20]:

def bar(A,B,A_hieght,B_hieght,xlabel,ylabel,title):
    plt.bar([A,B],[A_hieght,B_hieght])
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel);
    
bar("Alcoholic_Prop","Non_Alcoholic_Prop",Alcoholic_Proportion,Non_Alcoholic_Proportion,\
    "Proportion of alcoholic patients who didn't show up ","alcoholism status","Proportions")

as we expected, the proportions show the same results as the non_alcoholic proportion is more than the alcoholic

Another visualization with pie chart

In [13]:

labels='Alcoholic_Prop','Non_Alcoholic_Prop'
sizes=[Alcoholic_Proportion,Non_Alcoholic_Proportion]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal');

from visualizations it seems that there is no correlation between alcoholism and missing the appointment

what about the relationship between the age and appointment commitment ?

In [26]:

df.groupby('No_show').Age.mean()

Out[26]:

No_show
No     37.790064
Yes    34.317667
Name: Age, dtype: float64

In [25]:

df.groupby('No_show').Age.describe()

Out[25]:

	count	mean	std	min	25%	50%	75%	max
No_show
No	88208.0	37.790064	23.338878	-1.0	18.0	38.0	56.0	115.0
Yes	22319.0	34.317667	21.965941	0.0	16.0	33.0	51.0	115.0

it seems that there is no big difference between age in two groups let’s see it in visualization

In [24]:

df.Age[df.No_show=="No"].hist(alpha=.5,label="show up")
df.Age[df.No_show=='Yes'].hist(alpha=.5,label="didn't show up")
plt.legend();

the distribution of the age nearly looks the same except the number of patients who show up is more large

we are try to investigate SMS receiving and find if there is a relationship with appointment show up

In [27]:

df.groupby('No_show').SMS_received.value_counts()

Out[27]:

No_show  SMS_received
No       0               62510
         1               25698
Yes      0               12535
         1                9784
Name: SMS_received, dtype: int64

It looks that a great part of patients who did’t show up in appointment hadn’t recived SMS

calculate proportions to be more consistent and the drawing a visualization

In [28]:

No_SMS_prop=12535/(9784+12535) 
SMS_prop = 9784/(12535+9784)

In [30]:

bar("SMS_received_Prop","NO_SMS_received_Prop",SMS_prop,No_SMS_prop,\
    "Proportion of patients who received SMS who didn't show up ","SMS_receiving","Proportions")

the bar chart shows that proportion of patients who didn’t received SMS and didn’t go to their appointment is greater than proportion of patient who received SMS

Conclusions¶

from my analysis it seems that there is no relationship between showing up in appointment and age or alcoholism but may be there is positive correlation with SMS receiving

Limitations¶

there is no obvious limitaions

Medical Appointments

What are the major factors behind the no show of patients?