Monday, May 30, 2016

Strava and a curious tale of WW2 bombers

The city of Waterloo published an apparently innocent tweet that stopped me in my tracks.

Now, this is a really interesting move on Waterloo's part, but it left me feeling concerned. I'd like to talk about why, in the hopes that it helps Waterloo and other places avoid falling into a common trap.

First, though, credit where credit is due: the city of Waterloo has been doing some excellent stuff over the last few years, by connecting and improving its trails, continuing to extend bike lanes, and supporting a new uptown streetscape with protected bike infrastructure (which will be joined by more trails along Caroline and Erb as part of a regional reconstruction). There are signs of progress in finally making the Lexington overpass bike-friendly. It's really encouraging and I'd like to see other cities (like Kitchener) take a closer look at their initiatives.

Also positive: Waterloo has taken an interest in data. Loop counters are now installed on several trails across town, counting travelers on foot and on bike year round. Waterloo wants feedback on its improvements, and wants to let real data help drive its decisions. Maybe they can also show that data live at some point!

Now, it seems Waterloo has taken up with Strava to use Strava's health and fitness tracking data to show where people are jogging and biking, using their Metro planning service, which looks like an attempt by Strava to monetize their users' data. (Update: I've been informed that Waterloo isn't using Metro, just looking at the publicly available information on Strava.)

If this is the case, Waterloo had best tread very carefully and realize what they are getting, and what they are not.

There's no arguing Strava's cycling heat map is a beautiful thing. Who doesn't like seeing where Strava users are biking? How couldn't this help us make decisions on cycling infrastructure?

It's so beautiful.
Have you spotted the problem yet? You should ask who Strava's users are, and what they are doing. Are they actually representative of the people you're trying to encourage to use a bicycle?

A cautionary tale

Here's a lesson from history as to why knowing where your data comes from is so important.

During World War II, Allied bombers would fly out over Europe to bomb targets, and sometimes they would be shot down. The odds of crews surviving the many missions of their "tour" were depressingly low because, mission after mission, the chance of falling victim to guns or flak caught up with many of them.

In an effort to increase the survivability of their bombers, military leaders and engineers decided they needed to add armour to their bombers. They couldn't add much: armour is heavy, and bombers need to fly. So, where would it help the most?

They came up with a clever idea: look at all of the surviving bombers and where they were shot up. Where they were being hit most, it was argued, was where the aircraft should be strengthened.

Statistician Abraham Wald turned this argument on its head. You're only looking at the bombers that came back, he said. They don't need more armour where they've been hit. They need armour where they haven't been hit. It turns out this was around the pilots and in the tail.

All the bombers that had been hit around the pilots and in the tail hadn't returned, so nobody had good data on them. But bringing those bombers and their crew home was the goal. And until Wald pointed this out, nobody realized the mistake they were making.

The black areas mark hits on the surviving bombers

This is called Survivorship Bias and it happens all over the place.
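To see how strongly this filter can distort the picture, here is a minimal sketch with made-up numbers. It assumes every plane takes a single hit, that each of four sections is hit equally often, and that hits to the cockpit and tail are far more likely to bring the plane down. The section names, the loss probabilities and the function name are all my own assumptions for illustration; none of this is Wald's actual data or method.

```python
# Hypothetical numbers chosen for illustration only; they are not Wald's data.
# Assumption: each plane takes one hit, in a section chosen uniformly at random.
SECTIONS = ["wings", "fuselage", "cockpit", "tail"]

# Assumed probability that a plane is lost after a hit in each section.
LOSS_GIVEN_HIT = {"wings": 0.1, "fuselage": 0.1, "cockpit": 0.8, "tail": 0.7}


def observed_hit_share(loss_given_hit):
    """Share of hits per section as counted on planes that made it home."""
    hit_prob = 1 / len(loss_given_hit)  # every section is hit equally often
    # Probability that a plane is hit in a section AND still returns.
    survived = {s: hit_prob * (1 - p) for s, p in loss_given_hit.items()}
    total = sum(survived.values())
    return {s: v / total for s, v in survived.items()}


shares = observed_hit_share(LOSS_GIVEN_HIT)
true_share = 1 / len(SECTIONS)
for section in SECTIONS:
    print(f"{section:9s} true share {true_share:.0%}, "
          f"seen on survivors {shares[section]:.0%}")
```

Under these assumptions the cockpit and tail take half of all hits, yet they account for well under a quarter of the hits counted on the planes that return. The sections that look safest in the data are exactly the ones that are most dangerous.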

Survivorship Bias and Self-selection Bias

It's not the only kind of statistical bias that Waterloo needs to worry about here, but it's a big one. People who don't bike but could, don't use Strava. The only people Strava can collect data on are those who have managed to make themselves able to tolerate riding on our streets and roads. Everyone who can't has been filtered out.

There's more.

Strava calls itself a "social network for athletes". Its cycling userbase has a lot of athletic cyclists who use Strava to compare their performance to others. In other words, Strava's data is heavily weighted towards the enthusiastic, confident and athletic cyclists who look for places they can ride fast for long distances, and who are more willing to ride in traffic.

This presents a second source of bias: people choose whether to use Strava. This is Self-selection Bias. The people who ride bikes and choose to use Strava may not reflect the people who ride in general.

(Look ma, I'm finally putting my math degree to good use!)
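Here is a rough sketch of how self-selection can skew a map. The two rider groups, their sizes, how likely each is to record a ride, and how much of their riding happens on arterial roads are all invented for illustration, and I'm assuming every rider covers the same distance. Nothing here comes from Strava or from Waterloo's counters.

```python
# Hypothetical rider groups; every number here is an assumption for illustration.
# Each entry: (share of all riders, chance of recording rides in the app,
#              share of that group's riding done on arterial roads)
GROUPS = {
    "athletic": (0.2, 0.50, 0.8),
    "casual": (0.8, 0.02, 0.2),
}


def arterial_share(groups, weight_by_app_use):
    """Fraction of riding on arterial roads, overall or as seen through the app."""
    weighted = 0.0
    total = 0.0
    for share, app_use, arterial in groups.values():
        # Through the app, a group only counts as much as it records.
        weight = share * app_use if weight_by_app_use else share
        weighted += weight * arterial
        total += weight
    return weighted / total


actual = arterial_share(GROUPS, weight_by_app_use=False)
seen = arterial_share(GROUPS, weight_by_app_use=True)
print(f"arterial share of all riding:    {actual:.0%}")
print(f"arterial share seen in the app:  {seen:.0%}")
```

With these made-up numbers, about a third of all riding happens on arterials, but the app would suggest it is closer to three quarters, simply because the riders who record their trips are the ones who favour those roads.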

So what does this mean for Waterloo, which, like many cities, wants to grow cycling? The question here is whether the people they want to attract to cycling are similar to Strava users. Do these users' choices about where to ride reflect where the hypothetical 60% Interested But Concerned want to ride, or the trips they'd make?

I would say there are very significant differences between these groups. Strava's cyclists are the "survivors" of bicycle-unfriendly city design that keeps all but the bravest of us off our bikes. They are making very different choices about their cycling than the majority of people who say they would like to ride a bike but don't. The first such choice is that these cyclists are already riding on streets and roads that would make most of us very nervous.

This is not to say the data is useless. Some of it reflects overall truths: the great success that is the Iron Horse Trail, for instance. But let's take a closer look at these maps. They can lead you into some strange interpretations that don't make sense.

1. Students on bikes are missing

University of Waterloo

Around UW, there is heavy bike traffic on the Laurel trail, and on Ring Road, and on University, Columbia and Westmount. There are virtually no traces within Ring Road.

This is highly surprising. The interior UW campus has many bike racks littered with bikes. Why aren't paths to these points showing? Could it be that students on bikes aren't using Strava? If so, what does that say about how Strava data represent their travel patterns off UW campus?

This is a sign that Strava is completely silent on a major bike-using demographic. Yikes.

2. It takes a lot of nerve to cross 85 on University Avenue

University between Weber & Bridge

According to Strava, there's little difference in cycling on University Ave. where bike lanes exist (west of Lincoln Road) and where they don't (over the expressway). In fact, it looks like there are plenty of people on bikes crossing the expressway on University Avenue. My own experience has been that anyone with a choice avoids crossing here, unless they are supremely confident about mixing with high-speed traffic merging on and off the highway.

Someone might look at this and interpret it as a case where bike lanes aren't really having an effect, and that usage by people on bikes is just fine on University Ave. For a certain kind of athlete, that is completely true. When considering the general public, that would be a mistake.

3. Where are the neighbourhood riders?

Eastbridge neighbourhood

If you look at this map, you'd be led to believe that virtually everyone is biking on arterial roads. My experience is that there are a lot of casual riders on neighbourhood streets within this area, and younger students who traverse the quieter streets. These are clearly not being captured in Strava's data set.

What's more, this image doesn't really capture who wants to ride, but can't. The southeast portion of the image is Conestoga Mall, with a supermarket and a transit terminal. Does the lack of traces to here mean that nobody bikes to Conestoga Mall? Or just that people who use Strava aren't going shopping or connecting to transit? We can't know. Nor can we determine if the neighbourhood would appreciate a little more bike accessibility at the mall.

The map says nothing.

Approach with Caution

These are just a few examples of where Strava as a planning tool comes up short. As I said before, the data is not without value, but if your goal is to make cycling attractive to the majority of people, then at best this information is incomplete, and at worst it can be deceptive. And it won't always be obvious when that happens, either.

What Strava shows is where a very particular kind of cyclist rides. It doesn't show where improvements would do the most good for the people who could ride but choose not to, which is where cycling growth will come from. Strava wants to make money with their data, so I don't trust them to be forthcoming about this, nor would I say that their agenda matches that of the cities hiring them.

The city of Waterloo should be commended for seeking out data to base their decisions on. I trust that they know Strava is just one tool among many in their toolbox. It can help them visualize how the city looks from two wheels, but they still need to take a step back from these maps and analyze Waterloo's bike network with careful thought and direct observation.

Let's not lose sight of that.

Meanwhile, I guess I should install Strava. If nothing else, I want my own trips to be represented!


  1. I can't see most people firing up the Strava app, or starting their Garmin, before going for a quick ride around campus or around their neighbourhood. I agree that Strava isn't a great source for information on all rides.

    Also, the data is likely weighted heavily towards users on road bikes, so dirt/gravel trails are likely underrepresented.

    I think Strava data can be useful, but definitely needs to be used as part of the whole set of data and not by itself.