PISA Data Cleanup

Courtney Ferguson Lee

Introduction

PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.

Over 510,000 students from 62 countries took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those countries, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy. Score data was not skewed by participation level but it was scaled using a calibration data set of 31,500 students.

Countries that scored below 450 on previous PISA tests were offered an easier version of the assessment. This was designed to better assess what students at the lower end of the spectrum understood. Two countries accepted the offer and although it was at a lower difficulty, the results were still comparable to every other PISA participant.

This notebook cleans and prepares the data for visualization in Tableau.

Plan of Attack

  1. Data cleaning (Python)
  2. Domain research (PISA reference manual, website)
  3. Data manipulation (Python)
  4. Exploration (Tableau)
  5. Explanatory Analysis (Tableau)

Data Cleaning

The first step involved manipulating the data slightly to make better sense of the variables. I changed the variable abbreviations in the CSV to more descriptive names so they were human-readable. I suspect more work will be needed here.

Consolidating US

The US was split into FL, CT, MA and USA. All other countries were unified, so I consolidated the US into one variable (despite the fact that Massachusetts is a cut above the rest).

In [17]:
import pandas as pd
pisa = pd.read_csv('data/pisa2012.csv')
In [18]:
pisa.head()
Out[18]:
Unnamed: 0 CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ... W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU VER_STU
0 1 Albania 80000 ALB0006 Non-OECD Albania 1 1 10 1.0 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
1 2 Albania 80000 ALB0006 Non-OECD Albania 1 2 10 1.0 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
2 3 Albania 80000 ALB0006 Non-OECD Albania 1 3 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
3 4 Albania 80000 ALB0006 Non-OECD Albania 1 4 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
4 5 Albania 80000 ALB0006 Non-OECD Albania 1 5 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13

5 rows × 636 columns

Change header names to descriptions

  1. Read the pisadict as a pandas file
  2. Create a list of header values from the pandas file
  3. Make the first list item blank ("")
  4. Append the remaining names
  5. Read the csv as a pandas dataframe
  6. Change the column names using the list of header values
  7. Create a new pisa csv from the modified pisa dataframe
In [19]:
pisa_dict = pd.read_csv('data/pisadict2012.csv')
new_pisa_names = ['']
for row in pisa_dict['x']:
    new_pisa_names.append(row)
pisa.columns = new_pisa_names
In [ ]:
pisa.head()

Clean up Country Codes

Some country codes are unrecognizable. Need to change to standard names.

In [21]:
country_dict = {
    'Country code 3-character': {
        'China-Shanghai': 'China',
        'Chinese Taipei': 'Taiwan',
        'Connecticut (USA)': 'United States',
        'Florida (USA)': 'United States',
        'Hong Kong-China': 'China',
        'Korea': 'South Korea',
        'Macao-China': 'China',
        'Massachusetts (USA)': 'United States',
        'Perm(Russian Federation)': 'Russia',
        'Russian Federation': 'Russia',
        'United States of America': 'United States'
    }    
}

pisa_updated = pisa.replace(to_replace = country_dict)
In [9]:
pisa_updated.to_csv('data/pisa_updated.csv', chunksize = 20000, index = False)

Data Manipulation

Reduce CSV File Size

In [24]:
pisa_reduced = pisa[[
    'Country code 3-character', 
    'Gender',
    'Mother Current Job Status',
    'Possessions - Internet',
    'Time of computer use (mins)',
    'How many - cellular phones',
    'How many - televisions',
    'How many - computers',
    'How many - cars',
    'How many - rooms bath or shower',
    'How many books at home',
    'Out-of-School Study Time - Homework',
    'Min in <class period> - <test lang>',
    'Min in <class period> - <Maths>',
    'Min in <class period> - <Science>',
    'No of <class period> p/wk - <test lang>',
    'No of <class period> p/wk - <Maths>',
    'No of <class period> p/wk - <Science>',
    'No of ALL <class period> a week',
    'Class Size - No of Students in <Test Language> Class',
    'Teacher-Directed Instruction - Sets Clear Goals',
    'Teacher-Directed Instruction - Encourages Thinking and Reasoning',
    'Teacher-Directed Instruction - Checks Understanding',
    'Teacher-Directed Instruction - Summarizes Previous Lessons',
    'Teacher-Directed Instruction - Informs about Learning Goals',
    'Highest parental education in years',
    'Learning time (minutes per week)  - <test language>',
    'Learning time (minutes per week)- <Mathematics>',
    'Learning time (minutes per week) - <Science>',
    'Plausible value 1 in mathematics',
    'Plausible value 2 in mathematics',
    'Plausible value 3 in mathematics',
    'Plausible value 4 in mathematics',
    'Plausible value 5 in mathematics',
    'Plausible value 1 in reading',
    'Plausible value 2 in reading',
    'Plausible value 3 in reading',
    'Plausible value 4 in reading',
    'Plausible value 5 in reading',
    'Plausible value 1 in science',
    'Plausible value 2 in science',
    'Plausible value 3 in science',
    'Plausible value 4 in science',
    'Plausible value 5 in science'
]]
In [25]:
pisa_reduced.to_csv('data/pisa_reduced.csv')