For this blog post, I’m going to take a step back from data visualization best practices. Instead, I’m going to explore what you can do with your data before arriving at a final visualization, which I like to call “re-expressing” your data. More specifically, we are going to examine how to transform a measure (a quantitative value) before visualizing it, for example by turning sales into accumulated sales.
As usual, I’m not going to be able to go through everything, but, hopefully, this can get you started and maybe give you some ideas about how to transform your data. Also, in all examples, I’m going to use bicycle rides in New York from 2014 as a basis.
Accumulation
Let’s start off with accumulation, as it’s probably the easiest transformation to understand. Most accumulations are calculated over time. That way, you can see the running total up to each time step, instead of only the individual value for that time step.
In the examples below, the first data visualization does not include accumulation as a measure. As you can see, it’s a representation of the number of bicycle rides per month, starting with January and running through December. The second visualization includes accumulation, showing total bicycle rides across time. We can compare total rides with a target annual utilization number, which we may predefine. Alternatively, we could compare the total number of rides to the number of rides of another year, which would let us know if bicycle rides have increased or decreased.
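For readers who want to try this themselves, here is a minimal sketch of an accumulation in Python. The monthly counts are made-up numbers for illustration, not the actual 2014 ride data, and the variable names are my own.

```python
from itertools import accumulate

# Hypothetical monthly ride counts (made-up numbers, not the 2014 data).
monthly_rides = [120, 90, 150, 200]

# Running total: each entry is the sum of all months up to and including it.
cumulative_rides = list(accumulate(monthly_rides))

print(cumulative_rides)  # [120, 210, 360, 560]
```

The last entry of the accumulated series is the total for the whole period, which is the number you would compare against an annual target.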
Normalization
Another method to transform your data is to normalize it. In doing so, instead of looking at an absolute value we can look at a relative one, seeing how it contributes to a total value.
In the example below, we are again looking at rides per month; however, the area colored blue represents males and the area colored red represents females. In the first data visualization, we can compare how many rides were taken by male and female riders each month. But, if we try to determine whether the share of female riders has increased over time, we’d be forced to manually calculate the percentages.
Instead, we can normalize the data so that each area in our stacked chart shows its share of the monthly total. As you can see in the second visualization, it’s easy to spot the percentage of female riders at the beginning and end of the year and notice the increase in female ridership during the summer months.
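As a rough sketch, normalizing to a share of the total could look like the following in Python. The ride counts per gender are made-up numbers for illustration, and the variable names are my own.

```python
# Hypothetical rides per month split by gender (made-up numbers).
male_rides = [80, 60, 90, 120]
female_rides = [20, 20, 60, 80]

# For each month, express the female rides as a percentage of all rides.
female_share = [
    100 * female / (male + female)
    for male, female in zip(male_rides, female_rides)
]

# With only two groups, the male share is whatever remains of the 100%.
male_share = [100 - share for share in female_share]

print(female_share)  # [20.0, 25.0, 40.0, 40.0]
print(male_share)    # [80.0, 75.0, 60.0, 60.0]
```

Note that the absolute numbers are lost in the process: 20% of a quiet month and 20% of a busy month look the same after normalization.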
Index
Transforming your data to use an index is great if you have multiple measures of either different magnitudes or units. The index can then be used to see how much a value increases or decreases over time.
In the example below, I’ve plotted a couple of measures to see how they relate to the total number of rides over time. Because the measures either have different units or differ too much in magnitude, we end up with some flat lines, as shown in the first visualization. This can potentially be solved by using a dual axis chart, but today I’m going to go with an index instead.
In the second visualization, which includes an index, I’m now calculating each value as a percentage of the first value. All measures therefore start at the same point in January. Each following monthly value is compared to that initial value, and we can begin to see trends in the data.
With this visualization, we can see that as the number of rides increases, so do the temperature and the duration of each trip. What doesn’t seem to change that much is the average age of the riders. There is also an interesting pattern in February where the temperature hasn’t changed much, but the number of rides has decreased and the duration has increased.
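To make the index calculation concrete, here is a small sketch in Python. The function name `to_index`, the choice of 100 as the starting value, and all the numbers are my own assumptions for illustration, not the actual ride data.

```python
def to_index(values, base=100.0):
    """Express each value relative to the first value, which becomes `base`."""
    first = values[0]
    if first == 0:
        raise ValueError("cannot index against a first value of zero")
    return [base * value / first for value in values]


# Hypothetical monthly measures with very different magnitudes (made-up numbers).
rides = [1000, 800, 1500, 2500]
avg_temperature = [40.0, 41.0, 50.0, 60.0]

print(to_index(rides))            # [100.0, 80.0, 150.0, 250.0]
print(to_index(avg_temperature))  # [100.0, 102.5, 125.0, 150.0]
```

Both series now start at 100, so they can share one axis. One caveat: for a measure like temperature, the indexed values depend on the unit you start from (Fahrenheit and Celsius give different percentages), so the index is best read as a trend rather than as an exact amount.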
Moving Average
When it comes to having many data points with a spread of values, the normal method is to aggregate them. In the first visualization below, we are looking at rides per day, and we can see that there is a lot of spread in our data. As a result, it becomes quite hard to read the visualization and find patterns, as there are hundreds of data points being represented. If we aggregate the data to rides per week, as in the second visualization, we can see that there is some seasonality to the rides; however, we lose significant detail using this method.
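As a quick illustration of the aggregation step, here is a sketch in Python that sums daily values into weekly totals. The helper name `weekly_totals` and the daily counts are made up for this example, and it simply groups every seven consecutive days rather than aligning to calendar weeks.

```python
def weekly_totals(daily_values, days_per_week=7):
    """Sum consecutive groups of `days_per_week` daily values."""
    return [
        sum(daily_values[start:start + days_per_week])
        for start in range(0, len(daily_values), days_per_week)
    ]


# Two hypothetical weeks of daily ride counts (made-up numbers).
daily_rides = [10, 20, 30, 40, 50, 60, 70, 5, 15, 25, 35, 45, 55, 65]

print(weekly_totals(daily_rides))  # [280, 245]
```

Fourteen data points become two, which is exactly the loss of detail described above.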
To retain both the detail and the readability of the visualization, we can use a moving average, which smooths the data by calculating an average across multiple days. In the third visualization, I’ve done this using a three-day rolling average. We can see how this reduces the appearance of too much data spread while still allowing us to see both seasonal and local patterns. The length of the rolling average, of course, depends on your data and should be carefully considered. If the window is poorly chosen, you could remove valuable information from your data.
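Here is a minimal sketch of a three-day moving average in Python. I’m assuming a trailing window (each value is averaged with the two days before it); a centered window is an equally valid choice. The function name and the daily counts are made up for illustration.

```python
def moving_average(values, window=3):
    """Trailing moving average: mean of each value and the window - 1 values before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    # Start at the first position that has a full window behind it.
    for end in range(window - 1, len(values)):
        window_values = values[end - window + 1:end + 1]
        averages.append(sum(window_values) / window)
    return averages


# Hypothetical daily ride counts (made-up numbers).
daily_rides = [30, 60, 90, 120, 30]

print(moving_average(daily_rides))  # [60.0, 90.0, 80.0]
```

The result is shorter than the input by `window - 1` values, because the first days do not have a full window yet. A longer window gives a smoother line but hides more of the day-to-day variation.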
Hopefully, these transformations give you some inspiration to try out other ones. Being able to re-express your data allows you to more deeply explore it, prompting questions and providing insights you may not have thought to seek out or were unable to see.