Visualizing Opal Card Travel Data

A few months ago, Transport for NSW released Opal Card travel data to the public through its open data portal.

For those unfamiliar, the Opal Card is Sydney’s equivalent of London’s Oyster Card. My initial idea was to build a visualization similar to Uber’s Viz to show bus and train commutes. I assumed the data would be de-identified but would still include individual tap-on/tap-off locations for each card. Instead, I was disappointed to find that it consisted only of tap counts aggregated into 15-minute intervals per station.

Despite the limitations, I decided to visualize the flow of commuters throughout the day. I was keen on experimenting with D3.js for this project. While I initially struggled with raw shapefiles, I soon discovered Leaflet and Mapbox. I chose Mapbox for its superior map customization capabilities.

Data Sample

Here is a look at the raw data from the portal:

Mode	Date	Tap	Time	Location	Count
bus	20160730	on	02:30	2000	415
bus	20160730	on	02:30	2135	18
train	20160730	on	02:30	Jannali Station	31
bus	20160730	on	13:30	2095	64

Since bus data is aggregated at the postcode level, it felt too coarse for a meaningful visualization. I decided to focus exclusively on train data, as station locations are more precise. I used R to filter and geocode the station names. My geocoding script is available here.
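To make the sample concrete, the sketch below rebuilds the four rows above as a data frame, keeps the train row, and combines the date and time fields into a single timestamp. It is an illustration rather than the original script: the lowercase column names are borrowed from the R code later in this post, and the Australia/Sydney timezone is an assumption.

```r
# The four sample rows above, using the lowercase column names
# that the scripts in this post refer to (an assumption).
sample_rows <- data.frame(
  mode  = c("bus", "bus", "train", "bus"),
  date  = c("20160730", "20160730", "20160730", "20160730"),
  tap   = c("on", "on", "on", "on"),
  time  = c("02:30", "02:30", "02:30", "13:30"),
  loc   = c("2000", "2135", "Jannali Station", "2095"),
  count = c(415, 18, 31, 64),
  stringsAsFactors = FALSE
)

# Keep only train rows; bus locations are postcodes, not stations.
train_rows <- sample_rows[sample_rows$mode == "train", ]

# Combine date and time into one timestamp per 15-minute window.
# The timezone is assumed, not taken from the portal data.
train_rows$datetime <- as.POSIXct(
  paste(train_rows$date, train_rows$time),
  format = "%Y%m%d %H:%M",
  tz = "Australia/Sydney"
)

print(train_rows)
```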

Pre-processing with R

First, I geocoded the unique station names:

library(dplyr)
library(stringr)
library(jsonlite)

data <- read.csv("Data/time-loc_20160725-31.csv") %>%
    filter(mode=="train")

# Get unique station names
stations <- data %>%
    select(loc) %>%
    unique()

# Append ", New South Wales" for better geocoding accuracy
stations$loc2 <- as.character(stations$loc) %>%
    paste(", New South Wales", sep="")

# Geocode each station.
# getGeoDetails() is not defined here (see the geocoding script mentioned above).
stations$geo <- NA
for(i in seq_len(nrow(stations))){
  print(paste("Working on index", i))
  stations$geo[i] <- getGeoDetails(stations$loc2[i])
}

# Split each ";"-separated result into three new columns (V4, V5, V6),
# then keep the first two as lat/lon
stations[,c(4:6)] <- str_split_fixed(string = stations$geo, pattern = ";", n=3)
stations <- stations %>%
    select(-c(V6, geo, loc2)) %>%
    rename(lat=V4, lon=V5)

data <- left_join(data, stations, by="loc")
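The split step above implies that each geocoding result is a single string with three fields separated by semicolons, of which the first two are latitude and longitude. A minimal sketch of that step on made-up strings (the values and the third field are placeholders, not real geocoder output) shows one detail worth knowing: str_split_fixed() returns a character matrix, so the coordinates need converting before they can be used as numbers.

```r
library(stringr)

# Made-up strings in the "lat;lon;<third field>" shape; not real geocoder output.
geo <- c("-34.0;151.1;ok", "-33.9;151.2;ok")

# Split into exactly three columns; the result is a character matrix.
parts <- str_split_fixed(geo, pattern = ";", n = 3)

# Convert the first two columns to numeric coordinates.
coords <- data.frame(
  lat = as.numeric(parts[, 1]),
  lon = as.numeric(parts[, 2])
)

print(coords)
```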

With the stations geocoded, I used D3.js to build the visualization. Because I spend a lot of time in Tableau, I initially approached D3.js with a similar mindset. I soon realized, however, that a station with zero tap-ons in a given 15-minute window had no row at all in the dataset. To keep the animation smooth, I had to expand the data so that every station has a record for every window, with a count of zero where none existed.

library(dplyr)
library(lubridate)

opal_data <- read.csv("Full Data.csv") %>%
    select(-X, -mode) %>%
    filter(loc != "UNKNOWN")

train_tap_on <- opal_data %>%
    filter(tap == "on")
rm(opal_data)

# Build the unique station and timestamp lists
stations <- train_tap_on %>% select(loc, lat, lon) %>% unique()
timestamp <- train_tap_on %>% select(datetime, date, time) %>% unique()

# Cross join via a constant dummy key: every station paired with every timestamp
# (dplyr >= 1.1.0 also provides cross_join(), which needs no dummy key)
stations$j <- 1
timestamp$j <- 1
data <- inner_join(stations, timestamp, by='j') %>% select(-j)
data <- arrange(data, datetime, loc)

# Attach the observed counts; combinations with no record come back as NA
train_tap_on <- train_tap_on %>% select(-tap, -lat, -lon, -date, -time)
data <- left_join(data, train_tap_on, by=c("datetime", "loc"))

# Replace those NAs with zero counts
data$count[is.na(data$count)] <- 0
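For comparison, tidyr's complete() can do the same expansion in one call. This is an alternative sketch on toy data, not the approach used above; the station names and counts are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data: station B has no row for the 08:15 window.
taps <- data.frame(
  loc      = c("A", "B", "A"),
  datetime = c("2016-07-25 08:00", "2016-07-25 08:00", "2016-07-25 08:15"),
  count    = c(12, 5, 7),
  stringsAsFactors = FALSE
)

# complete() adds every missing loc x datetime combination and
# fills the count of the added rows with 0.
filled <- taps %>%
  complete(loc, datetime, fill = list(count = 0)) %>%
  arrange(datetime, loc)

print(filled)
```

On the real data, the station coordinates would need to travel with the station name, for example by passing nesting(loc, lat, lon) and nesting(datetime, date, time) to complete() instead of the bare columns.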

The Final Visualization

With a complete dataset, I created an animated visualization showing commuting patterns over an entire week. You can see the result here.

Opal-Vis

This project is still in its early stages, but it’s a solid foundation for future NSW transport data analysis. The source code for the visualization is available on GitHub.