Public School Data is Great, But You Have to Clean It First
Using public school data can help school districts see the bigger picture of what is going on. So why don’t we see more districts taking advantage of this information gold mine? The datasets available from federal and state governments are often messy in their current form. This raw data has to be cleaned before it can be useful and school districts don’t often have the staff or time to clean large datasets or groups of large datasets. Here are two examples of really useful datasets, but in their raw, messy form:
- California Department of Education Public Schools and Districts dataset
- California School Dashboards datasets
Raw data is a treasure chest of information, but it takes time to clean them up in a fast and accurate manner. I wrote this post to demonstrate how I clean raw data before delivering it to a client in a more useful form.
Example of Our Data Cleanup
This example uses a dataset of special education dispute resolution data from the US Department of Education’s website.
This dataset counts the number of complaints written to state departments of education during the 2014-2015 school year. It includes related dispute resolution outcomes. All of the counts are organized by state.
Clean Data Is More Useful
Here’s the data as-is, if you imported it into your spreadsheet software without cleaning it:
Here’s the same dataset after I’ve cleaned it:
How I Use R to Clean Data
I use a programming language called R. R programming allows me to work faster by doing the same steps repeatedly to clean a lot of datasets. Instead of manually cleaning data in a spreadsheet with several steps, I have the steps automated in R. This way, if a client needs a year’s worth of special education data and decides later they want the last ten years for comparison, I can apply the recipe quickly and they can jump right to exploring their data.
A More Technical Exploration and Example Dispute Resolution Plot
If you’re interested in what this process looks like, here’s a more detailed explanation. I’ve also included an example visualization of the dispute resolution data that results from the cleaned dataset. All the code for this post can be found in its entirety on my GitHub page.
First, we’ll have a look at the 2014-2015 dispute resolution data exactly as it is when you download it from the US Department of Education. Notice that when we load the data into memory, R immediately throws a warning about the number of missing column names. Column names will frequently need cleaning because they are often out of order, missing, or not named in a way that is easy to understand.
Here are the first few lines of the dataset before we’ve done anything to clean it:
url <- "https://www2.ed.gov/programs/osepidea/618-data/state-level-data-files/part-b-data/dispute-resolution/bdispres2014-15.csv" dr_notclean <- read_csv(url)
## Warning: Missing column names filled in: 'X3' , 'X4' , 'X5' , 'X6' , ## 'X7' , 'X8' , 'X9' , 'X10' , 'X11' , 'X12' , 'X13' , ## 'X14' , 'X15' , 'X16' , 'X17' , 'X18' , 'X19' , ## 'X20' , 'X21' , 'X22' , 'X23' , 'X24' , 'X25' , ## 'X26' , 'X27' , 'X28' , 'X29' , 'X30' , 'X31' , ## 'X32' , 'X33' 
head(dr_notclean, n = 10)
## # A tibble: 10 x 33 ## `Extraction Dat… `6/6/2016` X3 X4 X5 X6 X7 X8 X9 X10 ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Updated: <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ## 2 Revised: <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ## 4 Year State Writ… WSC … WSC … WSC … WSC … WSC … WSC … WSC … ## 5 2014-2015 Alabama 7 5 4 4 1 0 0 2 ## 6 2014-2015 Alaska 10 10 5 7 3 0 0 0 ## 7 2014-2015 American … 0 0 0 0 0 0 0 0 ## 8 2014-2015 Arizona 80 52 36 52 0 0 0 28 ## 9 2014-2015 Arkansas 19 9 9 9 0 0 0 10 ## 10 2014-2015 Bureau of… 11 9 9 0 0 0 0 2 ## # … with 23 more variables: X11 <chr>, X12 <chr>, X13 <chr>, X14 <chr>, ## # X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>, X19 <chr>, X20 <chr>, ## # X21 <chr>, X22 <chr>, X23 <chr>, X24 <chr>, X25 <chr>, X26 <chr>, ## # X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>, X31 <chr>, X32 <chr>, X33 <chr>
You can download the whole dataset here so you can inspect it further. You will notice right away that the column labels are pushed down a few rows because the extraction date is at the top instead. That creates a lot of empty space, which is why you see a lot of
NA values and
Xs as column names. You can also see this in the screenshot of the raw data at the start of the post, which gives you a sense of what it would look like if you imported it into your favorite spreadsheet program.
You will notice that there are thirty one distinct categories with names that don’t immediately describe what the numbers below them mean. For example,
Written, Signed Complaints (WSC) Total (1) are the total count of written complaints submitted to the state agency, but might be easier to look at (if not more descriptive) by using something like
But let’s imagine you only want to know about the total number of complaints, the number of complaints that were ajudicated, the number of cases that were pending, and the number that were dismissed. We can make this dataset easier to look at by removing all columns except the four we want. We can also gather the counts up into two columns: one for the thing you are counting, such as total complaints, and one for the count itself. Finally, let’s say we are only interested in the states that have the ten highest counts of total complaints. Here are the first 10 lines of the dataset after we clean it this way:
dr <- read_csv(url, skip = 4) %>% select(State, Total_Complaints = `Due Process Complaints (DPC) Total (3)`, Settlement = `DPC Resolution Meetings - Written Settlement Agreements (3.1a)`, Adjudicated = `DPC Hearings (fully adjudicated) Total (3.2)`, Pending = `DPC Pending (3.3)`, Dismissed = `DPC Withdrawn or Dismissed (3.4)`) %>% filter(State != "US, Outlying Areas, and Freely Associated States") %>% arrange(desc(Total_Complaints)) %>% filter(min_rank(desc(Total_Complaints)) <= 10) %>% gather(Category, Count, -State) dr
## # A tibble: 50 x 3 ## State Category Count ## <chr> <chr> <dbl> ## 1 New York Total_Complaints 5170 ## 2 California Total_Complaints 3714 ## 3 Puerto Rico Total_Complaints 2354 ## 4 New Jersey Total_Complaints 1211 ## 5 Pennsylvania Total_Complaints 804 ## 6 Massachusetts Total_Complaints 492 ## 7 District of Columbia Total_Complaints 457 ## 8 Texas Total_Complaints 370 ## 9 Illinois Total_Complaints 320 ## 10 Connecticut Total_Complaints 250 ## # … with 40 more rows
Much easier to look at! Now you see only the data that helps you answer the questions you want. You can download the cleaned dataset here so you can explore it further.
Clean Data Means Faster Analysis
When data is cleaned in this way, you will have a much easier time sorting and building visualizations in Excel or with programming languages like R and Python. Now that the data is cleaned, here is a quick visualization that compares the counts for the states that have the ten highest total complaints submitted:
ggplot(data = dr, aes(x = State, y = Count)) + geom_point(color = "cyan4", size = 3.0) + coord_flip() + facet_wrap(~Category) + labs( title = "Exploring Dispute Resolution Counts", subtitle = "Data: 2014-2015 Dispute Resolution Data", x = "", y = "")
This was a very short explanation of what data cleaning is and why it is useful. For an infinitely more thorough explanation of data cleaning theory, I recommend Hadley Wickham’s legenday paper on tidy data cleaning.