PISA Data Exploration

Programme for International Student Assessment.

Posted by Mohamed Abdo on Fri 10 July 2020

(PISA Data exploration)

by (mohamed abdo)

Preliminary Wrangling

PISA is a survey of students’ skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.

Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.

In [10]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

beacause data is too large, I only selected some variables of the data to observe

In [11]:
df = pd.read_csv(r'C:\Users\msala\Desktop\pisa2012.csv',encoding="cp1252",sep=',')
In [12]:
df = df[['CNT', 'ST03Q02', 'ST04Q01', 'ST11Q01', 'ST11Q02','ST11Q03','ST11Q04','ST11Q05',
         'ST26Q06',

         'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ', 
         'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'COBN_F',  'COBN_M',  'COBN_S']]
df
Out[12]:
CNT ST03Q02 ST04Q01 ST11Q01 ST11Q02 ST11Q03 ST11Q04 ST11Q05 ST26Q06 PV1MATH PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE COBN_F COBN_M COBN_S
0 Albania 1996 Female Yes Yes Yes Yes NaN No 406.8469 175.7053 218.5981 341.7009 408.8400 348.2283 367.8105 392.9877 Albania Albania Albania
1 Albania 1996 Female Yes Yes NaN Yes NaN Yes 486.1427 369.7553 396.7618 548.9929 471.5964 471.5964 443.6218 454.8116 Albania Albania Albania
2 Albania 1996 Female Yes Yes No Yes No Yes 533.2684 431.3938 401.2100 499.6643 428.7952 492.2044 512.7191 499.6643 Albania Albania Albania
3 Albania 1996 Female Yes Yes No Yes No Yes 412.2215 425.0393 471.9036 438.6796 481.5740 448.9370 474.1141 426.5573 Albania Albania Albania
4 Albania 1996 Female Yes Yes Yes NaN NaN Yes 381.9209 272.8495 260.1405 361.5628 275.7740 372.7527 403.5248 422.1746 Albania Albania Albania
485485 Vietnam 1996 Female Yes Yes Yes No No No 477.1849 472.1419 481.6736 559.8098 528.1052 519.7128 535.5651 538.3626 Viet Nam Viet Nam Viet Nam
485486 Vietnam 1996 Male Yes Yes Yes Yes No Yes 518.9360 565.5134 451.6372 538.7355 493.9761 493.0436 561.1153 535.0056 Viet Nam Viet Nam Viet Nam
485487 Vietnam 1996 Male Yes Yes No Yes Yes No 475.2376 457.8122 511.5425 536.8706 571.3726 488.3812 548.9929 563.9127 Viet Nam Viet Nam Viet Nam
485488 Vietnam 1996 Male Yes Yes Yes No No Yes 550.9503 528.5437 522.9301 511.0407 532.4879 524.0955 551.1376 514.7706 Viet Nam Viet Nam Viet Nam
485489 Vietnam 1996 Female Yes No No Yes Yes NaN 470.0187 459.2741 488.6635 530.6229 473.7411 477.4711 477.4711 505.4457 Viet Nam Viet Nam Viet Nam

485490 rows × 27 columns

In [13]:
print(df.shape)
print(df.dtypes)
print(df.head(10))
(485490, 27)
CNT         object
ST03Q02      int64
ST04Q01     object
ST11Q01     object
ST11Q02     object
ST11Q03     object
ST11Q04     object
ST11Q05     object
ST26Q06     object
PV1MATH    float64
PV2MATH    float64
PV3MATH    float64
PV4MATH    float64
PV5MATH    float64
PV1READ    float64
PV2READ    float64
PV3READ    float64
PV4READ    float64
PV5READ    float64
PV1SCIE    float64
PV2SCIE    float64
PV3SCIE    float64
PV4SCIE    float64
PV5SCIE    float64
COBN_F      object
COBN_M      object
COBN_S      object
dtype: object
       CNT  ST03Q02 ST04Q01 ST11Q01 ST11Q02 ST11Q03 ST11Q04 ST11Q05 ST26Q06  \
0  Albania     1996  Female     Yes     Yes     Yes     Yes     NaN      No   
1  Albania     1996  Female     Yes     Yes     NaN     Yes     NaN     Yes   
2  Albania     1996  Female     Yes     Yes      No     Yes      No     Yes   
3  Albania     1996  Female     Yes     Yes      No     Yes      No     Yes   
4  Albania     1996  Female     Yes     Yes     Yes     NaN     NaN     Yes   
5  Albania     1996  Female     Yes     Yes      No     Yes     Yes     Yes   
6  Albania     1996  Female     Yes     Yes     Yes      No      No      No   
7  Albania     1996    Male     Yes     Yes      No     Yes     Yes      No   
8  Albania     1996  Female     NaN     NaN     NaN     NaN     NaN     Yes   
9  Albania     1996  Female     Yes     Yes     Yes      No     Yes      No   

    PV1MATH  ...   PV4READ   PV5READ   PV1SCIE   PV2SCIE   PV3SCIE   PV4SCIE  \
0  406.8469  ...  175.7053  218.5981  341.7009  408.8400  348.2283  367.8105   
1  486.1427  ...  369.7553  396.7618  548.9929  471.5964  471.5964  443.6218   
2  533.2684  ...  431.3938  401.2100  499.6643  428.7952  492.2044  512.7191   
3  412.2215  ...  425.0393  471.9036  438.6796  481.5740  448.9370  474.1141   
4  381.9209  ...  272.8495  260.1405  361.5628  275.7740  372.7527  403.5248   
5  396.3312  ...  335.3617  219.3924  384.3156  358.2059  405.7628  403.8978   
6  438.0823  ...  338.3006  373.2503  508.1499  462.4580  499.7576  519.3398   
7  576.5773  ...  535.6811  582.9958  611.2831  596.3633  554.4013  513.3719   
8  434.1097  ...  391.7577  278.9657  373.0325  317.0832  376.7624  409.3995   
9  463.6314  ...  450.5366  323.4470  404.2708  343.6591  339.9291  335.2667   

    PV5SCIE   COBN_F   COBN_M   COBN_S  
0  392.9877  Albania  Albania  Albania  
1  454.8116  Albania  Albania  Albania  
2  499.6643  Albania  Albania  Albania  
3  426.5573  Albania  Albania  Albania  
4  422.1746  Albania  Albania  Albania  
5  325.5688  Albania  Albania  Albania  
6  466.1880  Albania  Albania  Albania  
7  519.8993  Albania  Albania  Albania  
8  382.3573  Albania  Albania  Albania  
9  287.7098  Albania  Albania  Albania  

[10 rows x 27 columns]
In [14]:
df.duplicated().sum()
Out[14]:
0
In [15]:
df.isna().sum()
Out[15]:
CNT             0
ST03Q02         0
ST04Q01         0
ST11Q01     24931
ST11Q02     44454
ST11Q03     85414
ST11Q04     94722
ST11Q05    137310
ST26Q06     12308
PV1MATH         0
PV2MATH         0
PV3MATH         0
PV4MATH         0
PV5MATH         0
PV1READ         0
PV2READ         0
PV3READ         0
PV4READ         0
PV5READ         0
PV1SCIE         0
PV2SCIE         0
PV3SCIE         0
PV4SCIE         0
PV5SCIE         0
COBN_F       3665
COBN_M       3647
COBN_S       3654
dtype: int64
In [16]:
df_cleaned = df.copy()
In [17]:
#replace nan value with yes and replace no,yes with 0,1
df_cleaned['ST11Q01'].fillna('Yes', inplace=True)
df_cleaned['ST11Q01'].replace({'No': 0, 'Yes': 1}, inplace=True)
df_cleaned['ST11Q02'].fillna('Yes', inplace=True)
df_cleaned['ST11Q02'].replace({'No': 0, 'Yes': 1}, inplace=True)
df_cleaned['ST11Q03'].fillna('Yes', inplace=True)

df_cleaned['ST11Q04'].fillna('Yes', inplace=True)



df_cleaned['ST26Q06'].fillna('Yes', inplace=True)
In [18]:
# get the average score
df_cleaned['Avg Math Score'] = (df_cleaned['PV1MATH'] + df_cleaned['PV2MATH'] + df_cleaned['PV3MATH'] + df_cleaned['PV4MATH'] + df_cleaned['PV5MATH']) / 5
df_cleaned['Avg Reading Score'] = (df_cleaned['PV1READ'] + df_cleaned['PV2READ'] + df_cleaned['PV3READ'] + df_cleaned['PV4READ'] + df_cleaned['PV5READ']) / 5
df_cleaned['Avg Science Score'] = (df_cleaned['PV1SCIE'] + df_cleaned['PV2SCIE'] + df_cleaned['PV3SCIE'] + df_cleaned['PV4SCIE'] + df_cleaned['PV5SCIE']) / 5
# what is the  maximum score for each subject
df_cleaned['Max Math Score'] = df_cleaned[['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH']].max(axis=1)
df_cleaned['Max Reading Score'] = df_cleaned[['PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ']].max(axis=1)
df_cleaned['Max Science Score'] = df_cleaned[['PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']].max(axis=1)
#Remove the original columns
df_cleaned.drop(columns=['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ',
                 'PV5READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'], inplace=True)
In [19]:
#rename column with more appropriate names
df_cleaned.rename({'CNT':'Country', 'ST03Q02':'Birth -Year', 'ST04Q01':'Gender', 'ST11Q01':'At Home - Mother', 'ST11Q02':'At Home - Father','ST11Q03':'At Home - Brothers','ST11Q04':'At Home - Sisters','ST11Q05':'At Home - Grandparents',
'ST26Q06':'Possessions Internet','COBN_M': 'Birth country of mother','COBN_S':'Birth country','COBN_F':'birth country of father'}, axis='columns', inplace=True)
df_cleaned
Out[19]:
Country Birth -Year Gender At Home - Mother At Home - Father At Home - Brothers At Home - Sisters At Home - Grandparents Possessions Internet birth country of father Birth country of mother Birth country Avg Math Score Avg Reading Score Avg Science Score Max Math Score Max Reading Score Max Science Score
0 Albania 1996 Female 1 1 1 1 1 No Albania Albania Albania 366.18634 261.01424 371.91348 406.8469 406.8496 408.8400
1 Albania 1996 Female 1 1 1 1 1 Yes Albania Albania Albania 470.56396 384.68832 478.12382 486.1427 406.2936 548.9929
2 Albania 1996 Female 1 1 0 1 0 Yes Albania Albania Albania 505.53824 405.18154 486.60946 533.2684 431.3938 512.7191
3 Albania 1996 Female 1 1 0 1 0 Yes Albania Albania Albania 449.45476 477.46376 453.97240 498.6836 547.3630 481.5740
4 Albania 1996 Female 1 1 1 1 1 Yes Albania Albania Albania 385.50398 256.01010 367.15778 418.5309 311.7707 422.1746
485485 Vietnam 1996 Female 1 1 1 0 0 No Viet Nam Viet Nam Viet Nam 486.22058 472.61846 536.31110 494.3215 481.6736 559.8098
485486 Vietnam 1996 Male 1 1 1 1 0 Yes Viet Nam Viet Nam Viet Nam 529.21794 487.24356 524.37522 596.8297 565.5134 561.1153
485487 Vietnam 1996 Male 1 1 0 1 1 No Viet Nam Viet Nam Viet Nam 486.29850 476.25694 541.90600 508.7319 514.7503 571.3726
485488 Vietnam 1996 Male 1 1 1 0 0 Yes Viet Nam Viet Nam Viet Nam 522.90856 518.43922 526.70646 550.9503 528.5437 551.1376
485489 Vietnam 1996 Female 1 0 0 1 1 Yes Viet Nam Viet Nam Viet Nam 454.43994 488.66354 492.95038 475.4713 532.3506 530.6229

485490 rows × 18 columns

In [20]:
#create a new column parent at home with 0 value if no parent , 1 if one parent and both if two parents at home
df_cleaned['Parent at home'] = df_cleaned['At Home - Father'] + df_cleaned['At Home - Mother']

df_cleaned['Parent at home'] = df_cleaned['Parent at home'].replace({0: 'None', 1: 'One', 2: 'Both'})
In [21]:
df_cleaned.to_csv('df_cleaned.csv',index = False)
In [ ]:
 

The structure of the dataset?

there are 485490 student in data and many variables I choose some of them to analyze and I created new columns for avg. and maximum scores of math, science and reading and column for parent at home

The main feature(s) of interest in the dataset?

I’m most interested in figuring out what features are best for predicting the Avg. scores in the dataset.

I think gender ,country and whether parent at home or not will have a great impact on scores

Univariate Exploration

In [22]:
#read file
df_cleaned = pd.read_csv('df_cleaned.csv')
df_cleaned
Out[22]:
Country Birth -Year Gender At Home - Mother At Home - Father At Home - Brothers At Home - Sisters At Home - Grandparents Possessions Internet birth country of father Birth country of mother Birth country Avg Math Score Avg Reading Score Avg Science Score Max Math Score Max Reading Score Max Science Score Parent at home
0 Albania 1996 Female 1 1 Yes Yes 1 No Albania Albania Albania 366.18634 261.01424 371.91348 406.8469 406.8496 408.8400 Both
1 Albania 1996 Female 1 1 Yes Yes 1 Yes Albania Albania Albania 470.56396 384.68832 478.12382 486.1427 406.2936 548.9929 Both
2 Albania 1996 Female 1 1 No Yes 0 Yes Albania Albania Albania 505.53824 405.18154 486.60946 533.2684 431.3938 512.7191 Both
3 Albania 1996 Female 1 1 No Yes 0 Yes Albania Albania Albania 449.45476 477.46376 453.97240 498.6836 547.3630 481.5740 Both
4 Albania 1996 Female 1 1 Yes Yes 1 Yes Albania Albania Albania 385.50398 256.01010 367.15778 418.5309 311.7707 422.1746 Both
485485 Vietnam 1996 Female 1 1 Yes No 0 No Viet Nam Viet Nam Viet Nam 486.22058 472.61846 536.31110 494.3215 481.6736 559.8098 Both
485486 Vietnam 1996 Male 1 1 Yes Yes 0 Yes Viet Nam Viet Nam Viet Nam 529.21794 487.24356 524.37522 596.8297 565.5134 561.1153 Both
485487 Vietnam 1996 Male 1 1 No Yes 1 No Viet Nam Viet Nam Viet Nam 486.29850 476.25694 541.90600 508.7319 514.7503 571.3726 Both
485488 Vietnam 1996 Male 1 1 Yes No 0 Yes Viet Nam Viet Nam Viet Nam 522.90856 518.43922 526.70646 550.9503 528.5437 551.1376 Both
485489 Vietnam 1996 Female 1 0 No Yes 1 Yes Viet Nam Viet Nam Viet Nam 454.43994 488.66354 492.95038 475.4713 532.3506 530.6229 One

485490 rows × 19 columns

In [23]:
#names of countries present in data
df_cleaned['Country'].unique()
Out[23]:
array(['Albania', 'United Arab Emirates', 'Argentina', 'Australia',
       'Austria', 'Belgium', 'Bulgaria', 'Brazil', 'Canada',
       'Switzerland', 'Chile', 'Colombia', 'Costa Rica', 'Czech Republic',
       'Germany', 'Denmark', 'Spain', 'Estonia', 'Finland', 'France',
       'United Kingdom', 'Greece', 'Hong Kong-China', 'Croatia',
       'Hungary', 'Indonesia', 'Ireland', 'Iceland', 'Israel', 'Italy',
       'Jordan', 'Japan', 'Kazakhstan', 'Korea', 'Liechtenstein',
       'Lithuania', 'Luxembourg', 'Latvia', 'Macao-China', 'Mexico',
       'Montenegro', 'Malaysia', 'Netherlands', 'Norway', 'New Zealand',
       'Peru', 'Poland', 'Portugal', 'Qatar', 'China-Shanghai',
       'Perm(Russian Federation)', 'Florida (USA)', 'Connecticut (USA)',
       'Massachusetts (USA)', 'Romania', 'Russian Federation',
       'Singapore', 'Serbia', 'Slovak Republic', 'Slovenia', 'Sweden',
       'Chinese Taipei', 'Thailand', 'Tunisia', 'Turkey', 'Uruguay',
       'United States of America', 'Vietnam'], dtype=object)
In [36]:
#there ara a large number of countries so i choose some of them
countries_group = df_cleaned.loc[df_cleaned['Country'].isin(['United Arab Emirates','Argentina','Australia','United States of America','Belgium','Italy','Japan','Tunisia','United Kingdom'])]
base_color = sb.color_palette()[0]

Country_order = countries_group['Country'].value_counts().index
sb.countplot(data = countries_group, x = 'Country',color = base_color, order = Country_order)
plt.xticks(rotation = 90)
plt.title('distribution of student in each country');
Out[36]:
Text(0.5, 1.0, 'distribution of student in each country')

it seems that italy has the greatest number of students.

In [42]:
#plot histogram for distribution of scores
plt.figure(figsize = [20,8]).suptitle('Distribution of scores')
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
bin_edges = np.arange(0, df_cleaned['Avg Math Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Math Score', bins = bin_edges)
plt.title('Avg Math Score')

plt.subplot(1,3,2)
bin_edges = np.arange(0, df_cleaned['Avg Science Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Science Score',bins = bin_edges)
plt.title('Avg Science Score')

plt.subplot(1,3,3)
bin_edges = np.arange(0, df_cleaned['Avg Reading Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Reading Score',bins = bin_edges)
plt.title('Avg Reading Score')
Out[42]:
Text(0.5, 1.0, 'Avg Reading Score')

from histogram maximum average scores for Math between 400 and 500 for science and Reading between 500 and 550

Bivariate Exploration

In [26]:
#plot a bar plot between gender and scores 
gender=df_cleaned.groupby(['Gender']).mean()[['Avg Math Score', 'Avg Reading Score', 'Avg Science Score']]
gender.plot.bar(figsize=(10,5))
plt.ylabel('Scores')
plt.title('Avg Scores based on Gender');

gender has no impact on average scores

In [27]:
#plot a bar plot between countries and scores 
country=countries_group.groupby(['Country']).mean()[['Avg Math Score', 'Avg Reading Score', 'Avg Science Score']]
country.plot.bar(figsize=(20,10))
plt.ylabel('Scores')
plt.title('Avg Scores based on countries');

students from japan have the greatest score and students from tunisisa have the least scores

In [28]:
#plot the relation between parents at home and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
order = ['None', 'One', 'Both']

plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Math Score'],order=order)
plt.title('Parental Impact on Math')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Science Score'],order=order)
plt.title('Parental Impact on Science')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Reading Score'],order=order)
plt.title('Parental Impact on Reading');

it seems that parents at home has a great imppact on scores as student with no parents at home have low scores compared to students with parents at home

In [29]:
#plot the relation between possession internet at home and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Math Score'])
plt.title('Internet Possession Impact on Math')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Science Score'])
plt.title('Internet Possession Impact on Science')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Reading Score'])
plt.title('Internet Possession Impact on Reading');

possession internet has impact on scores student who has internet get higher scores than student who don’t have internet

Multivariate Exploration

In [30]:
pairplot= sb.pairplot(data = df_cleaned, vars= ['Avg Math Score',  'Avg Science Score','Avg Reading Score'])
pairplot.fig.suptitle("Correlation between Avg. Scores of Math, Science and Reading");

There is a great correlation between scores in each subject student who has high score in one subject tend to has high score in other subjects

In [31]:
#plot the relation between possesssion internet,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Reading by Gender');

looks like that female students who has internet tend to get high score in Reading than male student but in other two subjects there is no effect of gender in score

In [32]:
#plot the relation between parents at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
order = ['None', 'One', 'Both']

plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Math Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Science Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Reading Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Reading by Gender');
In [33]:
#plot the relation between brothers at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Reading by Gender');
In [35]:
#plot the relation between sisters at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]


plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Math by Gender')

plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Science by Gender')

plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Reading by Gender');

gender has no impact on scores whether the student has sisters or brothers at home