(PISA Data exploration)¶

by (mohamed abdo)¶

Preliminary Wrangling¶

PISA is a survey of students’ skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.

Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.

In [10]:

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

beacause data is too large, I only selected some variables of the data to observe

In [11]:

df = pd.read_csv(r'C:\Users\msala\Desktop\pisa2012.csv',encoding="cp1252",sep=',')

In [12]:

df = df[['CNT', 'ST03Q02', 'ST04Q01', 'ST11Q01', 'ST11Q02','ST11Q03','ST11Q04','ST11Q05',
         'ST26Q06',

         'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ', 
         'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'COBN_F',  'COBN_M',  'COBN_S']]
df

Out[12]:

	CNT	ST03Q02	ST04Q01	ST11Q01	ST11Q02	ST11Q03	ST11Q04	ST11Q05	ST26Q06	PV1MATH	…	PV4READ	PV5READ	PV1SCIE	PV2SCIE	PV3SCIE	PV4SCIE	PV5SCIE	COBN_F	COBN_M	COBN_S
0	Albania	1996	Female	Yes	Yes	Yes	Yes	NaN	No	406.8469	…	175.7053	218.5981	341.7009	408.8400	348.2283	367.8105	392.9877	Albania	Albania	Albania
1	Albania	1996	Female	Yes	Yes	NaN	Yes	NaN	Yes	486.1427	…	369.7553	396.7618	548.9929	471.5964	471.5964	443.6218	454.8116	Albania	Albania	Albania
2	Albania	1996	Female	Yes	Yes	No	Yes	No	Yes	533.2684	…	431.3938	401.2100	499.6643	428.7952	492.2044	512.7191	499.6643	Albania	Albania	Albania
3	Albania	1996	Female	Yes	Yes	No	Yes	No	Yes	412.2215	…	425.0393	471.9036	438.6796	481.5740	448.9370	474.1141	426.5573	Albania	Albania	Albania
4	Albania	1996	Female	Yes	Yes	Yes	NaN	NaN	Yes	381.9209	…	272.8495	260.1405	361.5628	275.7740	372.7527	403.5248	422.1746	Albania	Albania	Albania
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
485485	Vietnam	1996	Female	Yes	Yes	Yes	No	No	No	477.1849	…	472.1419	481.6736	559.8098	528.1052	519.7128	535.5651	538.3626	Viet Nam	Viet Nam	Viet Nam
485486	Vietnam	1996	Male	Yes	Yes	Yes	Yes	No	Yes	518.9360	…	565.5134	451.6372	538.7355	493.9761	493.0436	561.1153	535.0056	Viet Nam	Viet Nam	Viet Nam
485487	Vietnam	1996	Male	Yes	Yes	No	Yes	Yes	No	475.2376	…	457.8122	511.5425	536.8706	571.3726	488.3812	548.9929	563.9127	Viet Nam	Viet Nam	Viet Nam
485488	Vietnam	1996	Male	Yes	Yes	Yes	No	No	Yes	550.9503	…	528.5437	522.9301	511.0407	532.4879	524.0955	551.1376	514.7706	Viet Nam	Viet Nam	Viet Nam
485489	Vietnam	1996	Female	Yes	No	No	Yes	Yes	NaN	470.0187	…	459.2741	488.6635	530.6229	473.7411	477.4711	477.4711	505.4457	Viet Nam	Viet Nam	Viet Nam

485490 rows × 27 columns

In [13]:

print(df.shape)
print(df.dtypes)
print(df.head(10))

(485490, 27)
CNT         object
ST03Q02      int64
ST04Q01     object
ST11Q01     object
ST11Q02     object
ST11Q03     object
ST11Q04     object
ST11Q05     object
ST26Q06     object
PV1MATH    float64
PV2MATH    float64
PV3MATH    float64
PV4MATH    float64
PV5MATH    float64
PV1READ    float64
PV2READ    float64
PV3READ    float64
PV4READ    float64
PV5READ    float64
PV1SCIE    float64
PV2SCIE    float64
PV3SCIE    float64
PV4SCIE    float64
PV5SCIE    float64
COBN_F      object
COBN_M      object
COBN_S      object
dtype: object
       CNT  ST03Q02 ST04Q01 ST11Q01 ST11Q02 ST11Q03 ST11Q04 ST11Q05 ST26Q06  \
0  Albania     1996  Female     Yes     Yes     Yes     Yes     NaN      No   
1  Albania     1996  Female     Yes     Yes     NaN     Yes     NaN     Yes   
2  Albania     1996  Female     Yes     Yes      No     Yes      No     Yes   
3  Albania     1996  Female     Yes     Yes      No     Yes      No     Yes   
4  Albania     1996  Female     Yes     Yes     Yes     NaN     NaN     Yes   
5  Albania     1996  Female     Yes     Yes      No     Yes     Yes     Yes   
6  Albania     1996  Female     Yes     Yes     Yes      No      No      No   
7  Albania     1996    Male     Yes     Yes      No     Yes     Yes      No   
8  Albania     1996  Female     NaN     NaN     NaN     NaN     NaN     Yes   
9  Albania     1996  Female     Yes     Yes     Yes      No     Yes      No   

    PV1MATH  ...   PV4READ   PV5READ   PV1SCIE   PV2SCIE   PV3SCIE   PV4SCIE  \
0  406.8469  ...  175.7053  218.5981  341.7009  408.8400  348.2283  367.8105   
1  486.1427  ...  369.7553  396.7618  548.9929  471.5964  471.5964  443.6218   
2  533.2684  ...  431.3938  401.2100  499.6643  428.7952  492.2044  512.7191   
3  412.2215  ...  425.0393  471.9036  438.6796  481.5740  448.9370  474.1141   
4  381.9209  ...  272.8495  260.1405  361.5628  275.7740  372.7527  403.5248   
5  396.3312  ...  335.3617  219.3924  384.3156  358.2059  405.7628  403.8978   
6  438.0823  ...  338.3006  373.2503  508.1499  462.4580  499.7576  519.3398   
7  576.5773  ...  535.6811  582.9958  611.2831  596.3633  554.4013  513.3719   
8  434.1097  ...  391.7577  278.9657  373.0325  317.0832  376.7624  409.3995   
9  463.6314  ...  450.5366  323.4470  404.2708  343.6591  339.9291  335.2667   

    PV5SCIE   COBN_F   COBN_M   COBN_S  
0  392.9877  Albania  Albania  Albania  
1  454.8116  Albania  Albania  Albania  
2  499.6643  Albania  Albania  Albania  
3  426.5573  Albania  Albania  Albania  
4  422.1746  Albania  Albania  Albania  
5  325.5688  Albania  Albania  Albania  
6  466.1880  Albania  Albania  Albania  
7  519.8993  Albania  Albania  Albania  
8  382.3573  Albania  Albania  Albania  
9  287.7098  Albania  Albania  Albania  

[10 rows x 27 columns]

In [14]:

df.duplicated().sum()

Out[14]:

In [15]:

df.isna().sum()

Out[15]:

CNT             0
ST03Q02         0
ST04Q01         0
ST11Q01     24931
ST11Q02     44454
ST11Q03     85414
ST11Q04     94722
ST11Q05    137310
ST26Q06     12308
PV1MATH         0
PV2MATH         0
PV3MATH         0
PV4MATH         0
PV5MATH         0
PV1READ         0
PV2READ         0
PV3READ         0
PV4READ         0
PV5READ         0
PV1SCIE         0
PV2SCIE         0
PV3SCIE         0
PV4SCIE         0
PV5SCIE         0
COBN_F       3665
COBN_M       3647
COBN_S       3654
dtype: int64

In [16]:

df_cleaned = df.copy()

In [17]:

#replace nan value with yes and replace no,yes with 0,1
df_cleaned['ST11Q01'].fillna('Yes', inplace=True)
df_cleaned['ST11Q01'].replace({'No': 0, 'Yes': 1}, inplace=True)
df_cleaned['ST11Q02'].fillna('Yes', inplace=True)
df_cleaned['ST11Q02'].replace({'No': 0, 'Yes': 1}, inplace=True)
df_cleaned['ST11Q03'].fillna('Yes', inplace=True)

df_cleaned['ST11Q04'].fillna('Yes', inplace=True)



df_cleaned['ST26Q06'].fillna('Yes', inplace=True)

In [18]:

# get the average score
df_cleaned['Avg Math Score'] = (df_cleaned['PV1MATH'] + df_cleaned['PV2MATH'] + df_cleaned['PV3MATH'] + df_cleaned['PV4MATH'] + df_cleaned['PV5MATH']) / 5
df_cleaned['Avg Reading Score'] = (df_cleaned['PV1READ'] + df_cleaned['PV2READ'] + df_cleaned['PV3READ'] + df_cleaned['PV4READ'] + df_cleaned['PV5READ']) / 5
df_cleaned['Avg Science Score'] = (df_cleaned['PV1SCIE'] + df_cleaned['PV2SCIE'] + df_cleaned['PV3SCIE'] + df_cleaned['PV4SCIE'] + df_cleaned['PV5SCIE']) / 5
# what is the  maximum score for each subject
df_cleaned['Max Math Score'] = df_cleaned[['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH']].max(axis=1)
df_cleaned['Max Reading Score'] = df_cleaned[['PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ']].max(axis=1)
df_cleaned['Max Science Score'] = df_cleaned[['PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']].max(axis=1)
#Remove the original columns
df_cleaned.drop(columns=['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ',
                 'PV5READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'], inplace=True)

In [19]:

#rename column with more appropriate names
df_cleaned.rename({'CNT':'Country', 'ST03Q02':'Birth -Year', 'ST04Q01':'Gender', 'ST11Q01':'At Home - Mother', 'ST11Q02':'At Home - Father','ST11Q03':'At Home - Brothers','ST11Q04':'At Home - Sisters','ST11Q05':'At Home - Grandparents',
'ST26Q06':'Possessions Internet','COBN_M': 'Birth country of mother','COBN_S':'Birth country','COBN_F':'birth country of father'}, axis='columns', inplace=True)
df_cleaned

Out[19]:

	Country	Birth -Year	Gender	At Home - Mother	At Home - Father	At Home - Brothers	At Home - Sisters	At Home - Grandparents	Possessions Internet	birth country of father	Birth country of mother	Birth country	Avg Math Score	Avg Reading Score	Avg Science Score	Max Math Score	Max Reading Score	Max Science Score
0	Albania	1996	Female	1	1	1	1	1	No	Albania	Albania	Albania	366.18634	261.01424	371.91348	406.8469	406.8496	408.8400
1	Albania	1996	Female	1	1	1	1	1	Yes	Albania	Albania	Albania	470.56396	384.68832	478.12382	486.1427	406.2936	548.9929
2	Albania	1996	Female	1	1	0	1	0	Yes	Albania	Albania	Albania	505.53824	405.18154	486.60946	533.2684	431.3938	512.7191
3	Albania	1996	Female	1	1	0	1	0	Yes	Albania	Albania	Albania	449.45476	477.46376	453.97240	498.6836	547.3630	481.5740
4	Albania	1996	Female	1	1	1	1	1	Yes	Albania	Albania	Albania	385.50398	256.01010	367.15778	418.5309	311.7707	422.1746
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
485485	Vietnam	1996	Female	1	1	1	0	0	No	Viet Nam	Viet Nam	Viet Nam	486.22058	472.61846	536.31110	494.3215	481.6736	559.8098
485486	Vietnam	1996	Male	1	1	1	1	0	Yes	Viet Nam	Viet Nam	Viet Nam	529.21794	487.24356	524.37522	596.8297	565.5134	561.1153
485487	Vietnam	1996	Male	1	1	0	1	1	No	Viet Nam	Viet Nam	Viet Nam	486.29850	476.25694	541.90600	508.7319	514.7503	571.3726
485488	Vietnam	1996	Male	1	1	1	0	0	Yes	Viet Nam	Viet Nam	Viet Nam	522.90856	518.43922	526.70646	550.9503	528.5437	551.1376
485489	Vietnam	1996	Female	1	0	0	1	1	Yes	Viet Nam	Viet Nam	Viet Nam	454.43994	488.66354	492.95038	475.4713	532.3506	530.6229

485490 rows × 18 columns

In [20]:

#create a new column parent at home with 0 value if no parent , 1 if one parent and both if two parents at home
df_cleaned['Parent at home'] = df_cleaned['At Home - Father'] + df_cleaned['At Home - Mother']

df_cleaned['Parent at home'] = df_cleaned['Parent at home'].replace({0: 'None', 1: 'One', 2: 'Both'})

In [21]:

df_cleaned.to_csv('df_cleaned.csv',index = False)

In [ ]:

The structure of the dataset?¶

there are 485490 student in data and many variables I choose some of them to analyze and I created new columns for avg. and maximum scores of math, science and reading and column for parent at home

The main feature(s) of interest in the dataset?¶

I’m most interested in figuring out what features are best for predicting the Avg. scores in the dataset.

I think gender ,country and whether parent at home or not will have a great impact on scores

Univariate Exploration¶

In [22]:

#read file
df_cleaned = pd.read_csv('df_cleaned.csv')
df_cleaned

Out[22]:

	Country	Birth -Year	Gender	At Home - Mother	At Home - Father	At Home - Brothers	At Home - Sisters	At Home - Grandparents	Possessions Internet	birth country of father	Birth country of mother	Birth country	Avg Math Score	Avg Reading Score	Avg Science Score	Max Math Score	Max Reading Score	Max Science Score	Parent at home
0	Albania	1996	Female	1	1	Yes	Yes	1	No	Albania	Albania	Albania	366.18634	261.01424	371.91348	406.8469	406.8496	408.8400	Both
1	Albania	1996	Female	1	1	Yes	Yes	1	Yes	Albania	Albania	Albania	470.56396	384.68832	478.12382	486.1427	406.2936	548.9929	Both
2	Albania	1996	Female	1	1	No	Yes	0	Yes	Albania	Albania	Albania	505.53824	405.18154	486.60946	533.2684	431.3938	512.7191	Both
3	Albania	1996	Female	1	1	No	Yes	0	Yes	Albania	Albania	Albania	449.45476	477.46376	453.97240	498.6836	547.3630	481.5740	Both
4	Albania	1996	Female	1	1	Yes	Yes	1	Yes	Albania	Albania	Albania	385.50398	256.01010	367.15778	418.5309	311.7707	422.1746	Both
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
485485	Vietnam	1996	Female	1	1	Yes	No	0	No	Viet Nam	Viet Nam	Viet Nam	486.22058	472.61846	536.31110	494.3215	481.6736	559.8098	Both
485486	Vietnam	1996	Male	1	1	Yes	Yes	0	Yes	Viet Nam	Viet Nam	Viet Nam	529.21794	487.24356	524.37522	596.8297	565.5134	561.1153	Both
485487	Vietnam	1996	Male	1	1	No	Yes	1	No	Viet Nam	Viet Nam	Viet Nam	486.29850	476.25694	541.90600	508.7319	514.7503	571.3726	Both
485488	Vietnam	1996	Male	1	1	Yes	No	0	Yes	Viet Nam	Viet Nam	Viet Nam	522.90856	518.43922	526.70646	550.9503	528.5437	551.1376	Both
485489	Vietnam	1996	Female	1	0	No	Yes	1	Yes	Viet Nam	Viet Nam	Viet Nam	454.43994	488.66354	492.95038	475.4713	532.3506	530.6229	One

485490 rows × 19 columns

In [23]:

#names of countries present in data
df_cleaned['Country'].unique()

Out[23]:

array(['Albania', 'United Arab Emirates', 'Argentina', 'Australia',
       'Austria', 'Belgium', 'Bulgaria', 'Brazil', 'Canada',
       'Switzerland', 'Chile', 'Colombia', 'Costa Rica', 'Czech Republic',
       'Germany', 'Denmark', 'Spain', 'Estonia', 'Finland', 'France',
       'United Kingdom', 'Greece', 'Hong Kong-China', 'Croatia',
       'Hungary', 'Indonesia', 'Ireland', 'Iceland', 'Israel', 'Italy',
       'Jordan', 'Japan', 'Kazakhstan', 'Korea', 'Liechtenstein',
       'Lithuania', 'Luxembourg', 'Latvia', 'Macao-China', 'Mexico',
       'Montenegro', 'Malaysia', 'Netherlands', 'Norway', 'New Zealand',
       'Peru', 'Poland', 'Portugal', 'Qatar', 'China-Shanghai',
       'Perm(Russian Federation)', 'Florida (USA)', 'Connecticut (USA)',
       'Massachusetts (USA)', 'Romania', 'Russian Federation',
       'Singapore', 'Serbia', 'Slovak Republic', 'Slovenia', 'Sweden',
       'Chinese Taipei', 'Thailand', 'Tunisia', 'Turkey', 'Uruguay',
       'United States of America', 'Vietnam'], dtype=object)

In [36]:

#there ara a large number of countries so i choose some of them
countries_group = df_cleaned.loc[df_cleaned['Country'].isin(['United Arab Emirates','Argentina','Australia','United States of America','Belgium','Italy','Japan','Tunisia','United Kingdom'])]
base_color = sb.color_palette()[0]

Country_order = countries_group['Country'].value_counts().index
sb.countplot(data = countries_group, x = 'Country',color = base_color, order = Country_order)
plt.xticks(rotation = 90)
plt.title('distribution of student in each country');

Out[36]:

Text(0.5, 1.0, 'distribution of student in each country')

it seems that italy has the greatest number of students.

In [42]:

#plot histogram for distribution of scores
plt.figure(figsize = [20,8]).suptitle('Distribution of scores')
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
bin_edges = np.arange(0, df_cleaned['Avg Math Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Math Score', bins = bin_edges)
plt.title('Avg Math Score')

plt.subplot(1,3,2)
bin_edges = np.arange(0, df_cleaned['Avg Science Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Science Score',bins = bin_edges)
plt.title('Avg Science Score')

plt.subplot(1,3,3)
bin_edges = np.arange(0, df_cleaned['Avg Reading Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Reading Score',bins = bin_edges)
plt.title('Avg Reading Score')

Out[42]:

Text(0.5, 1.0, 'Avg Reading Score')

from histogram maximum average scores for Math between 400 and 500 for science and Reading between 500 and 550

Bivariate Exploration¶

In [26]:

#plot a bar plot between gender and scores 
gender=df_cleaned.groupby(['Gender']).mean()[['Avg Math Score', 'Avg Reading Score', 'Avg Science Score']]
gender.plot.bar(figsize=(10,5))
plt.ylabel('Scores')
plt.title('Avg Scores based on Gender');

gender has no impact on average scores

In [27]:

#plot a bar plot between countries and scores 
country=countries_group.groupby(['Country']).mean()[['Avg Math Score', 'Avg Reading Score', 'Avg Science Score']]
country.plot.bar(figsize=(20,10))
plt.ylabel('Scores')
plt.title('Avg Scores based on countries');

students from japan have the greatest score and students from tunisisa have the least scores

In [28]:

#plot the relation between parents at home and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
order = ['None', 'One', 'Both']

plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Math Score'],order=order)
plt.title('Parental Impact on Math')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Science Score'],order=order)
plt.title('Parental Impact on Science')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Reading Score'],order=order)
plt.title('Parental Impact on Reading');

it seems that parents at home has a great imppact on scores as student with no parents at home have low scores compared to students with parents at home

In [29]:

#plot the relation between possession internet at home and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Math Score'])
plt.title('Internet Possession Impact on Math')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Science Score'])
plt.title('Internet Possession Impact on Science')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Reading Score'])
plt.title('Internet Possession Impact on Reading');

possession internet has impact on scores student who has internet get higher scores than student who don’t have internet

Multivariate Exploration¶

In [30]:

pairplot= sb.pairplot(data = df_cleaned, vars= ['Avg Math Score',  'Avg Science Score','Avg Reading Score'])
pairplot.fig.suptitle("Correlation between Avg. Scores of Math, Science and Reading");

There is a great correlation between scores in each subject student who has high score in one subject tend to has high score in other subjects

In [31]:

#plot the relation between possesssion internet,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Reading by Gender');

looks like that female students who has internet tend to get high score in Reading than male student but in other two subjects there is no effect of gender in score

In [32]:

#plot the relation between parents at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
order = ['None', 'One', 'Both']

plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Math Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Science Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Reading Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Reading by Gender');

In [33]:

#plot the relation between brothers at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Reading by Gender');

In [35]:

#plot the relation between sisters at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Reading by Gender');

gender has no impact on scores whether the student has sisters or brothers at home

PISA Data Exploration

Programme for International Student Assessment.