(PISA Data exploration)¶
by (mohamed abdo)¶
Preliminary Wrangling¶
PISA is a survey of students’ skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.
Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
beacause data is too large, I only selected some variables of the data to observe
df = pd.read_csv(r'C:\Users\msala\Desktop\pisa2012.csv',encoding="cp1252",sep=',')
df = df[['CNT', 'ST03Q02', 'ST04Q01', 'ST11Q01', 'ST11Q02','ST11Q03','ST11Q04','ST11Q05',
'ST26Q06',
'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ',
'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'COBN_F', 'COBN_M', 'COBN_S']]
df
print(df.shape)
print(df.dtypes)
print(df.head(10))
df.duplicated().sum()
df.isna().sum()
df_cleaned = df.copy()
#replace nan value with yes and replace no,yes with 0,1
df_cleaned['ST11Q01'].fillna('Yes', inplace=True)
df_cleaned['ST11Q01'].replace({'No': 0, 'Yes': 1}, inplace=True)
df_cleaned['ST11Q02'].fillna('Yes', inplace=True)
df_cleaned['ST11Q02'].replace({'No': 0, 'Yes': 1}, inplace=True)
df_cleaned['ST11Q03'].fillna('Yes', inplace=True)
df_cleaned['ST11Q04'].fillna('Yes', inplace=True)
df_cleaned['ST26Q06'].fillna('Yes', inplace=True)
# get the average score
df_cleaned['Avg Math Score'] = (df_cleaned['PV1MATH'] + df_cleaned['PV2MATH'] + df_cleaned['PV3MATH'] + df_cleaned['PV4MATH'] + df_cleaned['PV5MATH']) / 5
df_cleaned['Avg Reading Score'] = (df_cleaned['PV1READ'] + df_cleaned['PV2READ'] + df_cleaned['PV3READ'] + df_cleaned['PV4READ'] + df_cleaned['PV5READ']) / 5
df_cleaned['Avg Science Score'] = (df_cleaned['PV1SCIE'] + df_cleaned['PV2SCIE'] + df_cleaned['PV3SCIE'] + df_cleaned['PV4SCIE'] + df_cleaned['PV5SCIE']) / 5
# what is the maximum score for each subject
df_cleaned['Max Math Score'] = df_cleaned[['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH']].max(axis=1)
df_cleaned['Max Reading Score'] = df_cleaned[['PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ']].max(axis=1)
df_cleaned['Max Science Score'] = df_cleaned[['PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']].max(axis=1)
#Remove the original columns
df_cleaned.drop(columns=['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ',
'PV5READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'], inplace=True)
#rename column with more appropriate names
df_cleaned.rename({'CNT':'Country', 'ST03Q02':'Birth -Year', 'ST04Q01':'Gender', 'ST11Q01':'At Home - Mother', 'ST11Q02':'At Home - Father','ST11Q03':'At Home - Brothers','ST11Q04':'At Home - Sisters','ST11Q05':'At Home - Grandparents',
'ST26Q06':'Possessions Internet','COBN_M': 'Birth country of mother','COBN_S':'Birth country','COBN_F':'birth country of father'}, axis='columns', inplace=True)
df_cleaned
#create a new column parent at home with 0 value if no parent , 1 if one parent and both if two parents at home
df_cleaned['Parent at home'] = df_cleaned['At Home - Father'] + df_cleaned['At Home - Mother']
df_cleaned['Parent at home'] = df_cleaned['Parent at home'].replace({0: 'None', 1: 'One', 2: 'Both'})
df_cleaned.to_csv('df_cleaned.csv',index = False)
The structure of the dataset?¶
there are 485490 student in data and many variables I choose some of them to analyze and I created new columns for avg. and maximum scores of math, science and reading and column for parent at home
The main feature(s) of interest in the dataset?¶
I’m most interested in figuring out what features are best for predicting the Avg. scores in the dataset.
I think gender ,country and whether parent at home or not will have a great impact on scores
Univariate Exploration¶
#read file
df_cleaned = pd.read_csv('df_cleaned.csv')
df_cleaned
#names of countries present in data
df_cleaned['Country'].unique()
#there ara a large number of countries so i choose some of them
countries_group = df_cleaned.loc[df_cleaned['Country'].isin(['United Arab Emirates','Argentina','Australia','United States of America','Belgium','Italy','Japan','Tunisia','United Kingdom'])]
base_color = sb.color_palette()[0]
Country_order = countries_group['Country'].value_counts().index
sb.countplot(data = countries_group, x = 'Country',color = base_color, order = Country_order)
plt.xticks(rotation = 90)
plt.title('distribution of student in each country');
it seems that italy has the greatest number of students.
#plot histogram for distribution of scores
plt.figure(figsize = [20,8]).suptitle('Distribution of scores')
base_color = sb.color_palette()[0]
plt.subplot(1,3,1)
bin_edges = np.arange(0, df_cleaned['Avg Math Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Math Score', bins = bin_edges)
plt.title('Avg Math Score')
plt.subplot(1,3,2)
bin_edges = np.arange(0, df_cleaned['Avg Science Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Science Score',bins = bin_edges)
plt.title('Avg Science Score')
plt.subplot(1,3,3)
bin_edges = np.arange(0, df_cleaned['Avg Reading Score'].max()+50, 50)
plt.hist(data = df_cleaned, x = 'Avg Reading Score',bins = bin_edges)
plt.title('Avg Reading Score')
from histogram maximum average scores for Math between 400 and 500 for science and Reading between 500 and 550
Bivariate Exploration¶
#plot a bar plot between gender and scores
gender=df_cleaned.groupby(['Gender']).mean()[['Avg Math Score', 'Avg Reading Score', 'Avg Science Score']]
gender.plot.bar(figsize=(10,5))
plt.ylabel('Scores')
plt.title('Avg Scores based on Gender');
gender has no impact on average scores
#plot a bar plot between countries and scores
country=countries_group.groupby(['Country']).mean()[['Avg Math Score', 'Avg Reading Score', 'Avg Science Score']]
country.plot.bar(figsize=(20,10))
plt.ylabel('Scores')
plt.title('Avg Scores based on countries');
students from japan have the greatest score and students from tunisisa have the least scores
#plot the relation between parents at home and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
order = ['None', 'One', 'Both']
plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Math Score'],order=order)
plt.title('Parental Impact on Math')
plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Science Score'],order=order)
plt.title('Parental Impact on Science')
plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Reading Score'],order=order)
plt.title('Parental Impact on Reading');
it seems that parents at home has a great imppact on scores as student with no parents at home have low scores compared to students with parents at home
#plot the relation between possession internet at home and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Math Score'])
plt.title('Internet Possession Impact on Math')
plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Science Score'])
plt.title('Internet Possession Impact on Science')
plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Reading Score'])
plt.title('Internet Possession Impact on Reading');
possession internet has impact on scores student who has internet get higher scores than student who don’t have internet
Multivariate Exploration¶
pairplot= sb.pairplot(data = df_cleaned, vars= ['Avg Math Score', 'Avg Science Score','Avg Reading Score'])
pairplot.fig.suptitle("Correlation between Avg. Scores of Math, Science and Reading");
There is a great correlation between scores in each subject student who has high score in one subject tend to has high score in other subjects
#plot the relation between possesssion internet,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Math by Gender')
plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Science by Gender')
plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Possessions Internet'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Internet Possession Impact on Reading by Gender');
looks like that female students who has internet tend to get high score in Reading than male student but in other two subjects there is no effect of gender in score
#plot the relation between parents at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
order = ['None', 'One', 'Both']
plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Math Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Math by Gender')
plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Science Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Science by Gender')
plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['Parent at home'], y = df_cleaned['Avg Reading Score'],order=order,hue = df_cleaned["Gender"])
plt.title('Parental Impact on Reading by Gender');
#plot the relation between brothers at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Math by Gender')
plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Science by Gender')
plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['At Home - Brothers'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Brothers Impact on Reading by Gender');
#plot the relation between sisters at home,gender and scores on boxplot
plt.figure(figsize = [20,8])
base_color = sb.color_palette()[0]
plt.subplot(1,3,1)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Math Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Math by Gender')
plt.subplot(1,3,2)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Science Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Science by Gender')
plt.subplot(1,3,3)
sb.boxplot(x = df_cleaned['At Home - Sisters'], y = df_cleaned['Avg Reading Score'],hue = df_cleaned["Gender"])
plt.title('Sisters Impact on Reading by Gender');
gender has no impact on scores whether the student has sisters or brothers at home