PISA Data Cleanup

Introduction¶

PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.

Over 510,000 students from 62 countries took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those countries, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy. Score data was not skewed by participation level but it was scaled using a calibration data set of 31,500 students.

Countries that scored below 450 on previous PISA tests were offered an easier version of the assessment. This was designed to better assess what students at the lower end of the spectrum understood. Two countries accepted the offer and although it was at a lower difficulty, the results were still comparable to every other PISA participant.

This notebook cleans and prepares the data for visualization in Tableau.

Plan of Attack¶

Data cleaning (Python)
Domain research (PISA reference manual, website)
Data manipulation (Python)
Exploration (Tableau)
Explanatory Analysis (Tableau)

Data Cleaning¶

The first step involved manipulating the data slightly to make better sense of the variables. I changed the variable abbreviations in the CSV to more descriptive names so they were human-readable. I suspect more work will be needed here.

Consolidating US¶

The US was split into FL, CT, MA and USA. All other countries were unified, so I consolidated the US into one variable (despite the fact that Massachusetts is a cut above the rest).

import pandas as pd
pisa = pd.read_csv('data/pisa2012.csv')

pisa.head()

Change header names to descriptions¶

Read the pisadict as a pandas file
Create a list of header values from the pandas file
Make the first list item blank ("")
Append the remaining names
Read the csv as a pandas dataframe
Change the column names using the list of header values
Create a new pisa csv from the modified pisa dataframe

pisa_dict = pd.read_csv('data/pisadict2012.csv')
new_pisa_names = ['']
for row in pisa_dict['x']:
    new_pisa_names.append(row)
pisa.columns = new_pisa_names

pisa.head()

Clean up Country Codes¶

Some country codes are unrecognizable. Need to change to standard names.

country_dict = {
    'Country code 3-character': {
        'China-Shanghai': 'China',
        'Chinese Taipei': 'Taiwan',
        'Connecticut (USA)': 'United States',
        'Florida (USA)': 'United States',
        'Hong Kong-China': 'China',
        'Korea': 'South Korea',
        'Macao-China': 'China',
        'Massachusetts (USA)': 'United States',
        'Perm(Russian Federation)': 'Russia',
        'Russian Federation': 'Russia',
        'United States of America': 'United States'
    }    
}

pisa_updated = pisa.replace(to_replace = country_dict)

pisa_updated.to_csv('data/pisa_updated.csv', chunksize = 20000, index = False)

Data Manipulation¶

Reduce CSV File Size¶

pisa_reduced = pisa[[
    'Country code 3-character', 
    'Gender',
    'Mother Current Job Status',
    'Possessions - Internet',
    'Time of computer use (mins)',
    'How many - cellular phones',
    'How many - televisions',
    'How many - computers',
    'How many - cars',
    'How many - rooms bath or shower',
    'How many books at home',
    'Out-of-School Study Time - Homework',
    'Min in <class period> - <test lang>',
    'Min in <class period> - <Maths>',
    'Min in <class period> - <Science>',
    'No of <class period> p/wk - <test lang>',
    'No of <class period> p/wk - <Maths>',
    'No of <class period> p/wk - <Science>',
    'No of ALL <class period> a week',
    'Class Size - No of Students in <Test Language> Class',
    'Teacher-Directed Instruction - Sets Clear Goals',
    'Teacher-Directed Instruction - Encourages Thinking and Reasoning',
    'Teacher-Directed Instruction - Checks Understanding',
    'Teacher-Directed Instruction - Summarizes Previous Lessons',
    'Teacher-Directed Instruction - Informs about Learning Goals',
    'Highest parental education in years',
    'Learning time (minutes per week)  - <test language>',
    'Learning time (minutes per week)- <Mathematics>',
    'Learning time (minutes per week) - <Science>',
    'Plausible value 1 in mathematics',
    'Plausible value 2 in mathematics',
    'Plausible value 3 in mathematics',
    'Plausible value 4 in mathematics',
    'Plausible value 5 in mathematics',
    'Plausible value 1 in reading',
    'Plausible value 2 in reading',
    'Plausible value 3 in reading',
    'Plausible value 4 in reading',
    'Plausible value 5 in reading',
    'Plausible value 1 in science',
    'Plausible value 2 in science',
    'Plausible value 3 in science',
    'Plausible value 4 in science',
    'Plausible value 5 in science'
]]

pisa_reduced.to_csv('data/pisa_reduced.csv')

	Unnamed: 0	CNT	SUBNATIO	STRATUM	OECD	NC	SCHOOLID	STIDSTD	ST01Q01	ST02Q01	...	W_FSTR75	W_FSTR76	W_FSTR77	W_FSTR78	W_FSTR79	W_FSTR80	WVARSTRR	VAR_UNIT	SENWGT_STU	VER_STU
0	1	Albania	80000	ALB0006	Non-OECD	Albania	1	1	10	1.0	...	13.7954	13.9235	13.1249	13.1249	4.3389	13.0829	19	1	0.2098	22NOV13
1	2	Albania	80000	ALB0006	Non-OECD	Albania	1	2	10	1.0	...	13.7954	13.9235	13.1249	13.1249	4.3389	13.0829	19	1	0.2098	22NOV13
2	3	Albania	80000	ALB0006	Non-OECD	Albania	1	3	9	1.0	...	12.7307	12.7307	12.7307	12.7307	4.2436	12.7307	19	1	0.1999	22NOV13
3	4	Albania	80000	ALB0006	Non-OECD	Albania	1	4	9	1.0	...	12.7307	12.7307	12.7307	12.7307	4.2436	12.7307	19	1	0.1999	22NOV13
4	5	Albania	80000	ALB0006	Non-OECD	Albania	1	5	9	1.0	...	12.7307	12.7307	12.7307	12.7307	4.2436	12.7307	19	1	0.1999	22NOV13