Finding Similar Strava Activities
As a fairly serious runner, I am a huge fan of Strava, which I use to track my training.
Strava has a feature called matched runs, which shows whether you've run the same route before and lets you track your progress over time. However, I often want to find similar activities (potentially, but not necessarily, with a similar route). For example, if I've just done a 5 mile tempo run, I might want to find other similar workouts.
I therefore set out to try to do this using the Strava API and some machine learning techniques.
Extracting and Cleaning Data
I extracted my activity data (including distance, activity type, speed, etc.) from the Strava API, along with the latitude and longitude coordinates for each run. I took my last 600 runs (roughly the last 18 months). This covers over 3,500 miles and 30 races, with the majority of runs in Cambridge, UK and San Francisco, CA. The Strava heatmap for this period can be found here.
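As a rough illustration of the extraction step, here is a minimal sketch of paging through activities. It assumes the Strava v3 endpoints `/athlete/activities` (with `page` and `per_page` parameters) and `/activities/{id}/streams` (with `keys=latlng`), plus an OAuth access token you have already obtained. The helper names (`activities_url`, `streams_url`, `fetch_json`, `collect_activities`) are made up for this example and are not necessarily how the original project did it.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://www.strava.com/api/v3"


def activities_url(page, per_page=200):
    """URL for one page of the authenticated athlete's activity summaries."""
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return f"{API_ROOT}/athlete/activities?{query}"


def streams_url(activity_id):
    """URL for the latitude/longitude stream of a single activity."""
    query = urllib.parse.urlencode({"keys": "latlng", "key_by_type": "true"})
    return f"{API_ROOT}/activities/{activity_id}/streams?{query}"


def fetch_json(url, access_token):
    """GET a URL with a bearer token and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_activities(fetch_page, limit=600):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty
    or `limit` activities have been collected."""
    activities = []
    page = 1
    while len(activities) < limit:
        batch = fetch_page(page)
        if not batch:
            break
        activities.extend(batch)
        page += 1
    return activities[:limit]


# Usage (needs a real token, so it is left commented out):
# token = "..."  # hypothetical placeholder
# runs = collect_activities(lambda page: fetch_json(activities_url(page), token))
```

Passing the page fetcher in as a function keeps the paging logic separate from the network call, which makes it easy to try out with canned data.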
Creating a Similarity Metric
There are two main ways in which running activities can be similar: they could follow a similar route, or they could be similar types of run with similar statistics. For example, the first two runs below are on a very similar route but are run at slightly different paces, whereas the latter two runs are at a similar pace but on obviously very different routes.
Finding Activities with Similar Routes
Rather than only finding activities with exactly the same route (and hence a very similar distance), I also wanted to find activities where the routes matched up for part of the run. I could not find a huge amount of information on the internet but, inspired by the Hausdorff distance, I decided to calculate the average minimum distance from one path to the other. More formally, given paths $ X = \{x_1, \cdots, x_n\}, Y = \{y_1, \cdots, y_m\} $ I calculated
$$ dist(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le m} d(x_i, y_j) $$
where $ d(x_i, y_j) $ is the distance between two coordinates.
Note that this is not symmetric: in general $dist(X, Y) \neq dist(Y, X) $. Hence a true similarity measure will need to take both values into account.
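To make the asymmetry concrete, here is a minimal sketch of the average minimum distance in plain Python. It assumes the point-to-point distance is the great-circle (haversine) distance in metres, which the post does not specify. The function names and coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def directed_route_distance(path_x, path_y):
    """Average, over the points of path_x, of the distance to the
    nearest point of path_y. This is O(n * m) in the path lengths."""
    if not path_x or not path_y:
        raise ValueError("both paths must contain at least one point")
    total = 0.0
    for x in path_x:
        total += min(haversine_m(x, y) for y in path_y)
    return total / len(path_x)


# Made-up coordinates: a short path that is the first part of a longer one.
short_path = [(52.2000, 0.1200), (52.2010, 0.1200)]
long_path = short_path + [(52.2100, 0.1200), (52.2200, 0.1200)]

# Every point of the short path lies on the long path, so this is zero.
d_short_to_long = directed_route_distance(short_path, long_path)
# The far end of the long path has no nearby point on the short path.
d_long_to_short = directed_route_distance(long_path, short_path)
```

Here the short-to-long direction is zero while the long-to-short direction is several hundred metres, which is exactly the kind of partial overlap the one-directional value captures.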
Calculating this distance metric for a pair of Strava activities takes $ O(nm) $ time. My activities have around 500-1000 coordinate pairs each and the full calculation was taking a little too long, so I kept only every tenth coordinate of each path, which speeds it up by roughly 100x. While this increased the metric, it did so systematically for each pair of activities.
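The 100x figure comes from thinning both paths: a tenth of the points on each side means a hundredth of the point comparisons. A small sketch with synthetic paths (the names and coordinates are illustrative only):

```python
def subsample(path, step=10):
    """Keep every `step`-th coordinate of a path, starting with the first."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return path[::step]


def comparison_count(path_x, path_y):
    """Number of point-to-point distances an O(n * m) comparison evaluates."""
    return len(path_x) * len(path_y)


# Synthetic paths with 800 and 600 coordinate pairs.
path_x = [(52.0 + i * 1e-4, 0.1000) for i in range(800)]
path_y = [(52.0 + i * 1e-4, 0.1005) for i in range(600)]

full = comparison_count(path_x, path_y)
reduced = comparison_count(subsample(path_x), subsample(path_y))
speedup = full / reduced
```

Thinning the target path means that, for any given point, the nearest retained point can only be as close or farther away than before, which fits the observation that the metric increased.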
The following map is a demo of the calculation.
Given a point on the red line, we look at the distance from the point to all the points on the blue line (grey lines) and find the shortest one (black line).
Finding Activities with Similar Statistics
The main goal of this project was to find similar activities, not just activities with similar routes. To find activities that are similar in a statistical sense, we need to create a similarity metric. From the Strava API we have various details about runs. The main details I wanted to take into account were distance, speed and activity type. However, there are also other details, such as elevation gain, time of day and day of week, that were less important to me but could be added to tune the metric.
A fairly classic similarity metric used in machine learning (for example in Spectral Clustering) is the Gaussian ($ e^{-\|s_1 - s_2\|^2/(2\sigma^2)}$). However, for this to work well, the distance and pace data for my runs would need to be roughly normally distributed.
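For reference, the Gaussian similarity is only a few lines of Python. This is a minimal sketch in which `s1` and `s2` are equal-length feature sequences and `sigma` is a bandwidth you would need to tune. The default of 1.0 is an arbitrary choice for the example.

```python
import math


def gaussian_similarity(s1, s2, sigma=1.0):
    """exp(-||s1 - s2||^2 / (2 * sigma^2)): 1.0 for identical inputs,
    decaying towards 0 as the inputs move apart."""
    if len(s1) != len(s2):
        raise ValueError("feature sequences must have the same length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(s1, s2))
    return math.exp(-squared_distance / (2 * sigma ** 2))


# The same pair of feature vectors under two bandwidths.
narrow = gaussian_similarity([0.0], [1.0], sigma=1.0)
wide = gaussian_similarity([0.0], [1.0], sigma=2.0)
```

A larger `sigma` is more forgiving: the same difference in features gives a higher similarity, so `wide` is greater than `narrow`.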
Therefore I plotted a histogram of the distances for each run:
There is a fair amount of right skew, as expected - the mean length of a run is around 6 miles but I've run up to 18 miles. Therefore I tried to transform the data to make it normal. With a log transform the data was still skewed, but a square root transform gave very nice looking data:
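Besides eyeballing histograms, one way to compare transforms is to compute the sample skewness before and after. The sketch below uses a made-up list of distances rather than my real data, so the numbers it produces do not reproduce the histograms in this post. `sample_skewness` is a helper written for this example.

```python
import math


def sample_skewness(values):
    """Third standardized moment: about 0 for symmetric data,
    positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up run distances in miles with a long right tail.
distances_miles = [3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 8, 10, 13, 18]

raw_skew = sample_skewness(distances_miles)
sqrt_skew = sample_skewness([math.sqrt(d) for d in distances_miles])
log_skew = sample_skewness([math.log(d) for d in distances_miles])
```

Both transforms pull the long tail in, and the log does so more aggressively than the square root. Which one lands closest to zero depends on the data, so it is worth checking on your own runs.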
The square rooted distance data can therefore be used with the Gaussian similarity measure. Applying similar techniques to other variables gives us a run statistics similarity measure.
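Putting the pieces together, a statistics similarity might look like the sketch below. It makes several assumptions that are not from the original analysis: only distance and pace are used, pace is left untransformed, each feature is standardized across all runs so the two are on a comparable scale, and `sigma` is 1.0. The runs are made up.

```python
import math
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]


def stats_similarity(features_a, features_b, sigma=1.0):
    """Gaussian similarity between two standardized feature tuples."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(features_a, features_b))
    return math.exp(-squared_distance / (2 * sigma ** 2))


# Made-up runs as (distance in miles, pace in minutes per mile).
runs = [(5.0, 6.5), (5.2, 6.6), (12.0, 7.8), (3.0, 8.5)]

# Square-root the distances before standardizing, as described above.
sqrt_distance = standardize([math.sqrt(d) for d, _ in runs])
pace = standardize([p for _, p in runs])
features = list(zip(sqrt_distance, pace))

sim_two_tempos = stats_similarity(features[0], features[1])
sim_tempo_vs_long = stats_similarity(features[0], features[2])
```

The two 5 mile runs at a similar pace come out far more similar to each other than either does to the 12 mile run.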
Combining the Measures
The final step in the process is to combine the route and statistics measures. This is arguably more art than science (for example, which two of the three runs above are most similar is debatable). I therefore combined them in a few different ways to see the effect. Note that this is also fairly specific to my runs and may vary between different people.
The final result I came up with was to first require the run statistics measure to be within a certain threshold, so that only runs with broadly similar statistics were compared. This prevents 2 mile runs from being matched to 10 mile runs even if they were on the same route. I then took the minimum of the route measure and the statistics measure after normalizing. Because the route measure could be huge (e.g. comparing runs in the UK and USA), the measures could not just be added or multiplied.
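One possible reading of this rule, as a sketch rather than the exact code I used: treat both measures as dissimilarities (lower means more alike), squash the unbounded route distance into the range 0 to 1, reject pairs whose statistics are too different, and otherwise take the smaller of the two values. The squashing function, the `scale_m` of 500 metres and the `stats_cutoff` of 0.8 are all assumptions made for this example.

```python
import math


def normalize_route_distance(route_m, scale_m=500.0):
    """Map a route distance in metres onto [0, 1]: 0 for identical routes,
    approaching 1 for routes that are very far apart."""
    return 1.0 - math.exp(-route_m / scale_m)


def combined_dissimilarity(route_m, stats_dissimilarity,
                           stats_cutoff=0.8, scale_m=500.0):
    """Gate on the statistics measure, then take the minimum of the two
    normalized measures. Returns math.inf for pairs that fail the gate."""
    if stats_dissimilarity >= stats_cutoff:
        return math.inf
    return min(normalize_route_distance(route_m, scale_m), stats_dissimilarity)


# Same route, moderately different statistics: matched via the route.
same_route = combined_dissimilarity(0.0, 0.5)
# Routes on different continents, near-identical statistics: matched via stats.
far_apart = combined_dissimilarity(8_000_000.0, 0.05)
# Same route but very different statistics: rejected by the gate.
rejected = combined_dissimilarity(0.0, 0.9)
```

Because the normalized route value is capped at 1, a run on another continent can still match on statistics alone, which adding or multiplying the raw values would not allow.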
This is not hugely scientific but I found it gave good results for my runs. With runs from more athletes this could be improved.
Results
Below are some selected examples this technique produces:
Here we get three activities that are fairly similar in terms of statistics, but two are on similar routes and one is completely different (though almost identical in terms of pace and distance).
Here we match races, which have very distinct statistics from other runs.
Finally here we match very similar workouts - the first two match by route and statistics whereas the latter two match by the statistics.
Further Work
There is a lot more I would like to do with this - initially I would like to create a web app to allow others to sign in using the Strava API and find their own similar routes.
In addition, I think there could be a fair amount more analysis of the similarity measures and how to combine them. Something that could be interesting is looking at the various paces run throughout the run rather than just the average, as this would be a good way to find very similar activities.
Code
All code for this project is located on GitHub.