Public School Data is Great, But You Have to Clean It First

Using public school data can help school districts see the bigger picture of what is going on. So why don’t we see more districts taking advantage of this information gold mine? The datasets available from federal and state governments are often messy in their current form. This raw data has to be cleaned before it can be useful and school districts don’t often have the staff or time to clean large datasets or groups of large datasets. Here are two examples of really useful datasets, but in their raw, messy form:

Raw data is a treasure chest of information, but it takes time to clean them up in a fast and accurate manner. I wrote this post to demonstrate how I clean raw data before delivering it to a client in a more useful form.

Example of Our Data Cleanup

This example uses a dataset of special education dispute resolution data from the US Department of Education’s website.

This dataset counts the number of complaints written to state departments of education during the 2014-2015 school year. It includes related dispute resolution outcomes. All of the counts are organized by state.

Clean Data Is More Useful

Here’s the data as-is, if you imported it into your spreadsheet software without cleaning it:

Here’s the same dataset after I’ve cleaned it:

How I Use R to Clean Data

I use a programming language called R. R programming allows me to work faster by doing the same steps repeatedly to clean a lot of datasets. Instead of manually cleaning data in a spreadsheet with several steps, I have the steps automated in R. This way, if a client needs a year’s worth of special education data and decides later they want the last ten years for comparison, I can apply the recipe quickly and they can jump right to exploring their data.

A More Technical Exploration and Example Dispute Resolution Plot

If you’re interested in what this process looks like, here’s a more detailed explanation. I’ve also included an example visualization of the dispute resolution data that results from the cleaned dataset. All the code for this post can be found in its entirety on my GitHub page.

Raw Data

First, we’ll have a look at the 2014-2015 dispute resolution data exactly as it is when you download it from the US Department of Education. Notice that when we load the data into memory, R immediately throws a warning about the number of missing column names. Column names will frequently need cleaning because they are often out of order, missing, or not named in a way that is easy to understand.

Here are the first few lines of the dataset before we’ve done anything to clean it:

url <- ""

dr_notclean <- read_csv(url)
## Warning: Missing column names filled in: 'X3' [3], 'X4' [4], 'X5' [5], 'X6' [6],
## 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12], 'X13' [13],
## 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17], 'X18' [18], 'X19' [19],
## 'X20' [20], 'X21' [21], 'X22' [22], 'X23' [23], 'X24' [24], 'X25' [25],
## 'X26' [26], 'X27' [27], 'X28' [28], 'X29' [29], 'X30' [30], 'X31' [31],
## 'X32' [32], 'X33' [33]
head(dr_notclean, n = 10)
## # A tibble: 10 x 33
##    `Extraction Dat… `6/6/2016` X3    X4    X5    X6    X7    X8    X9    X10  
##    <chr>            <chr>      <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 Updated:         <NA>       <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  2 Revised:         <NA>       <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  3 <NA>             <NA>       <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  4 Year             State      Writ… WSC … WSC … WSC … WSC … WSC … WSC … WSC …
##  5 2014-2015        Alabama    7     5     4     4     1     0     0     2    
##  6 2014-2015        Alaska     10    10    5     7     3     0     0     0    
##  7 2014-2015        American … 0     0     0     0     0     0     0     0    
##  8 2014-2015        Arizona    80    52    36    52    0     0     0     28   
##  9 2014-2015        Arkansas   19    9     9     9     0     0     0     10   
## 10 2014-2015        Bureau of… 11    9     9     0     0     0     0     2    
## # … with 23 more variables: X11 <chr>, X12 <chr>, X13 <chr>, X14 <chr>,
## #   X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>, X19 <chr>, X20 <chr>,
## #   X21 <chr>, X22 <chr>, X23 <chr>, X24 <chr>, X25 <chr>, X26 <chr>,
## #   X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>, X31 <chr>, X32 <chr>, X33 <chr>

You can download the whole dataset here so you can inspect it further. You will notice right away that the column labels are pushed down a few rows because the extraction date is at the top instead. That creates a lot of empty space, which is why you see a lot of NA values and Xs as column names. You can also see this in the screenshot of the raw data at the start of the post, which gives you a sense of what it would look like if you imported it into your favorite spreadsheet program.

You will notice that there are thirty one distinct categories with names that don’t immediately describe what the numbers below them mean. For example, Written, Signed Complaints (WSC) Total (1) are the total count of written complaints submitted to the state agency, but might be easier to look at (if not more descriptive) by using something like written_complaints.

Cleaned Data

But let’s imagine you only want to know about the total number of complaints, the number of complaints that were ajudicated, the number of cases that were pending, and the number that were dismissed. We can make this dataset easier to look at by removing all columns except the four we want. We can also gather the counts up into two columns: one for the thing you are counting, such as total complaints, and one for the count itself. Finally, let’s say we are only interested in the states that have the ten highest counts of total complaints. Here are the first 10 lines of the dataset after we clean it this way:

dr <- read_csv(url, skip = 4) %>% 
         Total_Complaints = `Due Process Complaints (DPC) Total (3)`, 
         Settlement = `DPC Resolution Meetings - Written Settlement Agreements (3.1a)`,
         Adjudicated = `DPC Hearings (fully adjudicated) Total (3.2)`, 
         Pending = `DPC Pending (3.3)`, 
         Dismissed = `DPC Withdrawn or Dismissed (3.4)`) %>% 
  filter(State != "US, Outlying Areas, and Freely Associated States") %>% 
  arrange(desc(Total_Complaints)) %>% 
  filter(min_rank(desc(Total_Complaints)) <= 10) %>% 
  gather(Category, Count, -State) 

## # A tibble: 50 x 3
##    State                Category         Count
##    <chr>                <chr>            <dbl>
##  1 New York             Total_Complaints  5170
##  2 California           Total_Complaints  3714
##  3 Puerto Rico          Total_Complaints  2354
##  4 New Jersey           Total_Complaints  1211
##  5 Pennsylvania         Total_Complaints   804
##  6 Massachusetts        Total_Complaints   492
##  7 District of Columbia Total_Complaints   457
##  8 Texas                Total_Complaints   370
##  9 Illinois             Total_Complaints   320
## 10 Connecticut          Total_Complaints   250
## # … with 40 more rows

Much easier to look at! Now you see only the data that helps you answer the questions you want. You can download the cleaned dataset here so you can explore it further.

Clean Data Means Faster Analysis

When data is cleaned in this way, you will have a much easier time sorting and building visualizations in Excel or with programming languages like R and Python. Now that the data is cleaned, here is a quick visualization that compares the counts for the states that have the ten highest total complaints submitted:

ggplot(data = dr, 
       aes(x = State, y = Count)) + 
  geom_point(color = "cyan4", size = 3.0) + 
  coord_flip() + 
  facet_wrap(~Category) + 
    title = "Exploring Dispute Resolution Counts",
    subtitle = "Data: 2014-2015 Dispute Resolution Data", 
    x = "", 
    y = "")


This was a very short explanation of what data cleaning is and why it is useful. For an infinitely more thorough explanation of data cleaning theory, I recommend Hadley Wickham’s legenday paper on tidy data cleaning.

Ryan Estrellado
Ryan Estrellado
Executive Consultant at South County SELPA, Educator, and Data Scientist

I work at the San Diego South County SELPA, where I design solutions to help SELPAs build equitable systems for students with disabilities.