California School Dashboards Part 1: Cleaning the Data

This is part one of a three part series where I’ll be working with California School Dashboard data by cleaning, visualizaing, and exploring through modeling.

Introduction: It’s Ok to Skip Around

I’m writing this series for data scientists, public school educators, and data scientists who are also public school educators. Turns out, this is pretty hard to do. Most data scientists want to see reprodicible code because it explains the analysis explicitly. On the other hand, public educators may be less interested in the code and more interested in the process, questions, and answers.

So to keep things as interesting as I can for all parties, I broke up the post with descriptive headers so readers can skip sections or jump around to where their curiosity takes them. There’s code scattered throughout the post, but if that’s not interesting to you feel free to jump right over it and into the results. These usually appear right after the code. If code is your thing, you are always welcome to view all of it on my GitHub profile and to help me by making contributions and improvements.

There’s More to the Data Than Just the Dashboard Website

The California School Dashboard website was built to show and analyze public school metrics such as English language arts test scores and graduation rates. The website delivers the information using icons that show scores and growth from the previous year. They also use a color coded grid so you can compare scores and growth across schools.

For a lot of folks the interface and visuals on the website will be enough to answer the questions they are asking about a school district. Others may find that the natural progression of curiosity leads to questions that require stepping outside of the California School Dashboard website to ask more specific questions and to compare across districts. If you want to start answering questions outside of what’s available in the dashboard website, you’ll need to get the data out of the website and work with it directly.

This series of posts will work through the analysis process, covering the main steps I take when exploring public education datasets:

  • Importing and cleaning the data
  • Exploring the data through visualization
  • Exploring the data through modeling

We’ll cover the first step here and will do the rest in later posts.

Importing the Raw Data

I picked a school district and did some cleaning to hide the district and school names. Every California school district’s data is available on the California School Dashboard website. This demonstration will work on any of the dashboard math datasets.

The California Department of Education makes this data available on the five-by-five grid pages. First we’ll bring the data into R. Here’s how you get to the dataset:

  • From the district’s dashboard page, navigate to the metric you want to explore
  • Click on the five-by-five link
  • Click on the View Detailed Data link
  • Download the text file by clicking on the Download Data link

Clearer Column Names Helps Readability

Here’s what the math dataset looks like without any changes:

url <- ""

# Show the whole CDS code instead of scientific format
math <- read_tsv(url, col_types = cols(cds = col_character()))
## # A tibble: 232 x 26
##    cds   rtype schoolname districtname countyname charter_flag coe_flag
##    <chr> <chr> <chr>      <chr>        <chr>      <lgl>        <lgl>   
##  1 14    D     <NA>       district_1   county_1   NA           NA      
##  2 14    D     <NA>       district_1   county_1   NA           NA      
##  3 14    D     <NA>       district_1   county_1   NA           NA      
##  4 14    D     <NA>       district_1   county_1   NA           NA      
##  5 14    D     <NA>       district_1   county_1   NA           NA      
##  6 14    D     <NA>       district_1   county_1   NA           NA      
##  7 14    D     <NA>       district_1   county_1   NA           NA      
##  8 14    D     <NA>       district_1   county_1   NA           NA      
##  9 14    D     <NA>       district_1   county_1   NA           NA      
## 10 14    D     <NA>       district_1   county_1   NA           NA      
## # … with 222 more rows, and 19 more variables: studentgroup <chr>,
## #   currdenom <dbl>, currstatus <dbl>, priordenom <dbl>, priorstatus <dbl>,
## #   change <dbl>, statuslevel <dbl>, changelevel <dbl>, color <dbl>, box <dbl>,
## #   caa_denom <dbl>, caa_level1_num <dbl>, caa_level1_pct <dbl>,
## #   caa_level2_num <dbl>, caa_level2_pct <dbl>, caa_level3_num <dbl>,
## #   caa_level3_pct <dbl>, ReportingYear <chr>, X26 <lgl>

Notice that the column names aren’t immediately clear about what they’re describing. For example, the sizes of the student subgroups are kept in a column called currdenom. The average distance from score level of three is currstatus.

It’s super important to take the time to review the file layout, which provides definitions for every column name. It’s not the most fun reading, but you’ll find that you can help people answer questions better when you are confident you understand precisely what the numbers mean. Once you have a good understanding of what each column means, you can drop the column names you don’t need for your analysis and rename the ones that you keep. Let’s drop all columns except the school name, subgroup name, size of the subgroup, average distance from three, and the change from last year:

math <- math %>% 
  # Filter on just schools, not whole district counts 
  filter(rtype == "S") %>% 
  select(school = schoolname, 
         subgroup = studentgroup, 
         subgroup_size = currdenom, 
         avg_dfrom3 = currstatus, 
## # A tibble: 215 x 5
##    school    subgroup subgroup_size avg_dfrom3 change
##    <chr>     <chr>            <dbl>      <dbl>  <dbl>
##  1 school_13 AA                   2       NA     NA  
##  2 school_13 ALL                217       57.3    0.7
##  3 school_13 AS                  45       73.8    2.4
##  4 school_13 EL                  44       26.1    8.9
##  5 school_13 ELO                 18       -8     14.5
##  6 school_13 EO                 158       60.6   -3.7
##  7 school_13 FI                   1       NA     NA  
##  8 school_13 HI                  23      -22     -6.3
##  9 school_13 MR                  31       62.5  -17.3
## 10 school_13 RFP                 26       49.7   -2.6
## # … with 205 more rows

Much better! Now we have only the columns we want, renamed so they’re clearer:

  • school: name of the school
  • subgroup: name of the subgroup
  • subgroup_size: number of students in the subgroup
  • avg_dfrom3: average distance from test score level three for each subgroup
  • change: change in distance from three score compared to last year

There’s one more thing I want to do here. It’s hard to see right away which subgroups I’m looking at because the dataset uses subgroup codes instead of actual names. We want to use our brain power more to ask and answer questions and less to remember what codes mean, so let’s change that. I built a handy tool in my spedtools R package that will quickly rename the codes to actual words.

(math <- math %>% mutate(subgroup = convert_subgroup(subgroup)))
## # A tibble: 215 x 5
##    school    subgroup                   subgroup_size avg_dfrom3 change
##    <chr>     <chr>                              <dbl>      <dbl>  <dbl>
##  1 school_13 Black/African American                 2       NA     NA  
##  2 school_13 All Students                         217       57.3    0.7
##  3 school_13 Asian                                 45       73.8    2.4
##  4 school_13 English Learner                       44       26.1    8.9
##  5 school_13 English Learners Only                 18       -8     14.5
##  6 school_13 English Only                         158       60.6   -3.7
##  7 school_13 Filipino                               1       NA     NA  
##  8 school_13 Hispanic                              23      -22     -6.3
##  9 school_13 Multiple Races/Two or More            31       62.5  -17.3
## 10 school_13 RFEPs Only                            26       49.7   -2.6
## # … with 205 more rows

Exploring a Clean Dataset

Now that we have just the columns we want to explore and actual subgroup names instead of codes, we can start exploring the dataset faster with a series of quick exploratory questions. We just want to indulge a little curiosity and ask quick questions and answers to get the analytic juices going. My favorite way to do this is to encourage lots of questions and to provide rapid answers so the process feels casual and conversational, rather than writing up a list of questions and taking days to respond with the data. Having a clean dataset ready and being fast at the keyboard helps us do this. Here are some examples:

How different are the student subgroups in size?

Sometimes you want to get a sense of how big the student subgroups are, especially when dealing with metrics like “average distance of three”, the results of which are influenced in some part by how big the group is. Here are the top ten biggest subgroup student counts in this dataset, along with their school.

math %>% 
  select(-c(avg_dfrom3, change)) %>% 
  filter(subgroup != "All Students") %>% 
  filter(min_rank(desc(subgroup_size)) <= 10) %>% 
## # A tibble: 10 x 3
##    school    subgroup     subgroup_size
##    <chr>     <chr>                <dbl>
##  1 school_10 English Only           765
##  2 school_07 English Only           628
##  3 school_10 White                  509
##  4 school_07 Asian                  474
##  5 school_12 English Only           404
##  6 school_07 White                  380
##  7 school_10 Asian                  331
##  8 school_12 Asian                  254
##  9 school_12 White                  251
## 10 school_04 English Only           190

It’s important to note here that not all of these groups are mutually exclusive. For example, it’s possible for a student to be counted both in the White category and in the English Only category.

Where were the highest performing English Learner groups?

Two questions that seem to come up a lot when I work with school staff are

  • Who are the students behind the numbers?
  • Who do we know that’s doing a great job at this?

One way to answer the second question is to transform the data so it shows you the schools with the top ten average distance from 3 for the subgroup you’re interested in. Here’s that transformation for English learners:

math %>% 
  select(-c(subgroup_size, change)) %>% 
  filter(subgroup == "English Learner") %>% 
  filter(min_rank(desc(avg_dfrom3)) <= 10) %>% 
## # A tibble: 10 x 3
##    school    subgroup        avg_dfrom3
##    <chr>     <chr>                <dbl>
##  1 school_09 English Learner       90.5
##  2 school_07 English Learner       72.1
##  3 school_14 English Learner       61.2
##  4 school_04 English Learner       54  
##  5 school_02 English Learner       50.7
##  6 school_15 English Learner       49.7
##  7 school_01 English Learner       48.9
##  8 school_03 English Learner       43.5
##  9 school_06 English Learner       37.6
## 10 school_05 English Learner       32.8

Now school staff can call colleagues at other schools to get a conversation going about what’s been working well.

Which schools had the highest increases from last year in each subgroup?

We can keep the conversation going by looking at the top two positive change scores for every subgroup. These are the schools that had the best increases in the distance from three score from last year. Now schools can get a sense of where to start calling if they want to hear about strategies, interventions, or initiatives that their colleagues are using to be successful.

math %>% 
  select(-subgroup_size) %>% 
  # Remove the All Students category so we're just looking at subgroups
  filter(subgroup != "All Students", change > 0) %>%
  group_by(subgroup) %>% 
  filter(min_rank(desc(change)) <= 2) %>% 
  select(subgroup, school, change) %>% 
## # A tibble: 21 x 3
## # Groups:   subgroup [11]
##    subgroup               school    change
##    <chr>                  <chr>      <dbl>
##  1 Asian                  school_08   12.4
##  2 Asian                  school_12   11.6
##  3 Black/African American school_12   58.4
##  4 English Learner        school_13    8.9
##  5 English Learner        school_04    3.4
##  6 English Learners Only  school_13   14.5
##  7 English Learners Only  school_04   34.1
##  8 English Only           school_09    9.1
##  9 English Only           school_12    6.3
## 10 Hispanic               school_09   13.1
## # … with 11 more rows


The California School Dashboard website makes much of its underlying data available for us to explore. It’s worthwhile to import and clean that data so when the exploration starts, school staff can rapidly ask questions and get answers in a way that organically moves the conversation towards the important questions. Next we’ll look at ways to explore the data further through data visualization.

Ryan Estrellado
Ryan Estrellado
Executive Consultant at South County SELPA, Educator, and Data Scientist

I work at the San Diego South County SELPA, where I design solutions to help SELPAs build equitable systems for students with disabilities.