Turning Dataset Codes to Words With R
Note: I include a lot of code in this post so my fellow data scientists can either learn from it or give me feedback about how to make it better. It’s totally ok to skip over all that and just check out the data. The data immediately follow the code and have ##
marks in them.
Introduction: Labels Are Nicer Than Codes
I do a lot of work with the same public education datasets, like the CASEMIS dataset and the California School Dashboard datasets. These both have columns that use codes instead of labels. This means that you will never see the words “autism” in a CASEMIS dataset. Instead, you’ll see 320
among a sea of other disability codes. I’m a big advocate of sharing as much information and insight as we can about our school data, but that’s hard if your audience has to refer to their technical assistance guide to figure out what a code means. Consider the following plot, which takes a count of disabilities from a fake dataset:
I can just feel the insight slipping away as your audience flips through their manuals to figure out which disabilities 210
and 320
are!
So what can we do about this? In the past, I’ve tried one of three solutions in my spreadsheets:
- Copy and paste a disability label over each disability code
- Find and replace each disability code with a disability label
- Just leave the disability codes as they are and deal with it
These solutions are ok, but they introduce a lot of human error and tend to be really slow, particularly if you’re working with datasets that have thousands and thousands of records. Once I started complimenting my spreadsheet work with programming in R, I learned how to swap those codes for labels pretty quickly. Here’s how that works!
1. Use a Named Vector to Transform Codes to Words
First we’ll make a pretend dataset of students, primary disability, and secondary disability. It’s the same pretend dataset we used to make the plot above. We’ll make a thousand records, since it’s a little easier to see the benefit of this technique over a larger dataset.
library(tidyverse)
# Names for disability vector
dis_codes <- c(seq(200, 280, by = 10), 281, seq(290, 330, by = 10))
# Make dataset
set.seed(200)
df <- tibble(
disability_1 = as.character(sample(dis_codes, 1000, replace = TRUE)),
disability_2 = as.character(sample(dis_codes, 1000, replace = TRUE)))
df
## # A tibble: 1,000 x 2
## disability_1 disability_2
## <chr> <chr>
## 1 250 210
## 2 210 230
## 3 330 330
## 4 270 220
## 5 260 220
## 6 300 200
## 7 300 250
## 8 240 290
## 9 281 290
## 10 210 300
## # … with 990 more rows
Next we’ll make a vector of our disability names. These are the names we’ll be replacing the codes with later. These labels are the exact ones you’ll see in the California Education Code and will correspond with the disability codes described in the CASEMIS technical assistance guide.
# Elements of disability label vector
dis_desc <- c(
"none", "intellectual disability", "hard of hearing", "deafness",
"speech and language impairment", "visual impairment",
"emotional disturbance", "orthopedic impairment","other health impairment",
"established medical disability", "specific learning disability",
"deaf-blindness", "multiple disabilities", "autism", "traumatic brain injury"
)
Then we’ll use set_names
from purrr
to name each of the disability labels with their corresponding code. Now we have a named vector that looks like this:
# Create named vector
(dis_lookup <- set_names(dis_desc, nm = dis_codes))
## 200 210
## "none" "intellectual disability"
## 220 230
## "hard of hearing" "deafness"
## 240 250
## "speech and language impairment" "visual impairment"
## 260 270
## "emotional disturbance" "orthopedic impairment"
## 280 281
## "other health impairment" "established medical disability"
## 290 300
## "specific learning disability" "deaf-blindness"
## 310 320
## "multiple disabilities" "autism"
## 330
## "traumatic brain injury"
This is the point where things start to feel like magic! We’ll take our new named vector and use mutate
to transform the codes in disability_1
into actual disability labels.
df %>%
mutate(disability_1 = dis_lookup[disability_1])
## # A tibble: 1,000 x 2
## disability_1 disability_2
## <chr> <chr>
## 1 visual impairment 210
## 2 intellectual disability 230
## 3 traumatic brain injury 330
## 4 orthopedic impairment 220
## 5 emotional disturbance 220
## 6 deaf-blindness 200
## 7 deaf-blindness 250
## 8 speech and language impairment 290
## 9 established medical disability 290
## 10 intellectual disability 300
## # … with 990 more rows
Now we can actually look at the disability_1
column and understand what it means. But what if you have more than one disability column that needs to be transformed?
2. Use a Function to Convert More Than One Column of Codes
The CASEMIS dataset, like other public education datasets, has more than one column that uses codes instead of labels. For example, there are two disability code columns: DISABILIT1
and DISABILIT2
. One solution is to change each column separately, but let’s push for a more efficient solution. We can take our named vector and build a function out of it that takes a vector of disability codes and converts it into a column of disability labels.
convert_dis <- function(x) {
# Names for disability vector
dis_codes <- c(seq(200, 280, by = 10), 281, seq(290, 330, by = 10))
# Elements of disability vector
dis_desc <- c(
"none", "intellectual disability", "hard of hearing", "deafness",
"speech and language impairment", "visual impairment",
"emotional disturbance", "orthopedic impairment","other health impairment",
"established medical disability", "specific learning disability",
"deaf-blindness", "multiple disabilities", "autism", "traumatic brain injury"
)
# Create named vector
dis_lookup <- set_names(dis_desc, nm = dis_codes)
# Create a new vector of disability descriptions
as.vector(dis_lookup[as.character(x)])
}
Now let’s use mutate_at
to call that function on both of the disability columns.
df %>%
mutate_at(vars(disability_1:disability_2), funs(convert_dis))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 1,000 x 2
## disability_1 disability_2
## <chr> <chr>
## 1 visual impairment intellectual disability
## 2 intellectual disability deafness
## 3 traumatic brain injury traumatic brain injury
## 4 orthopedic impairment hard of hearing
## 5 emotional disturbance hard of hearing
## 6 deaf-blindness none
## 7 deaf-blindness visual impairment
## 8 speech and language impairment specific learning disability
## 9 established medical disability specific learning disability
## 10 intellectual disability deaf-blindness
## # … with 990 more rows
3. Collect Them All in A Package
I had so much fun making that work that I started taking notice anytime there were chances to create functions that convert codes to labels. As soon as I had a second function, I built an R package called spedtools
to house them so I could just load the package at the top of any CASEMIS analysis I do. Now I don’t need to rebuild the functions over and over again and I can get my data transformed for analysis with some compact and readable code. Here’s what that looks like:
library(spedtools)
df %>%
mutate_at(vars(disability_1:disability_2), funs(convert_dis))
## # A tibble: 1,000 x 2
## disability_1 disability_2
## <chr> <chr>
## 1 visual impairment intellectual disability
## 2 intellectual disability deafness
## 3 traumatic brain injury traumatic brain injury
## 4 orthopedic impairment hard of hearing
## 5 emotional disturbance hard of hearing
## 6 deaf-blindness none
## 7 deaf-blindness visual impairment
## 8 speech and language impairment specific learning disability
## 9 established medical disability specific learning disability
## 10 intellectual disability deaf-blindness
## # … with 990 more rows

Build an R package around all your renaming functions and load it up before every analysis
Conclusion
Using programming as a compliment to my work in spreadsheets changed my mindset on tasks I find myself doing over and over again. If there’s something I know I’ll be doing many times, like converting disability codes to disability labels, I prefer to do a little more work up front if it helps me move faster and more accurately in the long run. That way I’m spending more time sharing useful information to data consumers and less time cleaning data.
PSA: Consider submitting your R related blog posts to rweekly.org