When we discuss working with data we often talk about teasing out the 'signal' from the 'noise': essentially, filtering out the distractions so that we can work with only the most valuable or useful data.
We do it in lots of ways and for lots of different reasons. Sometimes we are cleaning out errors that occurred during data collection, or stripping out outliers and anomalies; sometimes we are simply transforming the data so that it's easier to compute with, or easier for us to read. At other times we might be deliberately excluding data to help support our hypothesis or agenda (though I realize no one reading this would ever dream of doing such a thing). We often talk about these activities as an essential part of getting value from a data set. But what if all that 'noise' held hidden riches?
What started me thinking was one particular article about digital versus analogue music recording. In it, the author discussed how the intensity of a recording's background tape hiss shed some light on the track layering and overdubbing required, which in turn went some way to revealing the techniques and processes that created the track. Another example was how, on some tracks, studio machinery such as a background air conditioner can be heard, or, if you listen closely enough, snippets of conversation. Studios have been fighting to remove this noise from their recordings for many years, but for some people it has become a rich vein of insight. The 'signal' may be the track laid down, but the 'noise' expands the scene, adding context and depth beyond the idealized rendition.
This idea, that there is more to data than the rows you see in the spreadsheet or the data points on a chart, is essential to keeping the potential of your data alive. Jer Thorp calls this the "data system" and suggests that we always think of data as a system and not simply an artefact: that we view it as a series of processes and activities around collection, computation and representation. Each of these in turn filters the 'noise' and magnifies specific 'signals'. When we engage with data we have to be conscious of the effects of those processes: what policies were applied during collection, what was missed, what was removed, what unknown errors lurk in the precision and 'truth' of those numbers? When presented with a formatted chart or a cleaned data set, it can be very hard to find ways of answering those questions.
The answers lie in lineage and history. It's well known that over time we humans have a tendency to change our stories, to fit the 'facts' to our preferred narratives. Journalists and researchers know that contemporary secondhand records can be more reliable than firsthand recollections (you can thank Ebbinghaus's "forgetting curve" for that). Data has a little of that too: each transformation, cleaning or interpretation moves it further along a preferred narrative. When we keep the rawness, noise and mess of that first point of collection, we retain more of that history. As we find new ways of seeing the noise, through machine learning or new techniques, we may well start finding fresh insights. It could be that we will be able to infer the policies influencing the collection, or uncover hidden signals that reframe the entire data set, stretching its use beyond anything imagined when it was first collected. The more we maintain of those contemporary records, the more possibilities and potential are stored in that data system. The more we record, alongside the data, of the influences, approaches and policies that drove its collection, the more we will be able to understand what else it might be telling us. We could even start to hear beyond the hiss to reveal the participants and processes that formed it.
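To make that a little more concrete, here is one small way of letting the raw records, the cleaning policy and the collection notes travel together, rather than letting only the cleaned output survive. It is a minimal, purely illustrative sketch; the sensor names, the 0-50 threshold and the file name are hypothetical, not drawn from any real project.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw readings as they arrived, warts and all
raw_readings = [
    {"sensor": "A1", "value": 21.4},
    {"sensor": "A1", "value": -999.0},  # looks like a collection error
    {"sensor": "A2", "value": 58.2},    # a suspicious outlier
]

# Keep the collection context and cleaning policy alongside the data,
# instead of discarding them once the 'clean' set is produced.
dataset = {
    "collected_at": datetime.now(timezone.utc).isoformat(),
    "collection_notes": "Sensors polled every 60s; A2 sits next to the air conditioner.",
    "cleaning_policy": "Drop values outside 0-50; keep the originals for later re-analysis.",
    "raw": raw_readings,
    "clean": [r for r in raw_readings if 0 <= r["value"] <= 50],
}

with open("readings_with_history.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

Nothing sophisticated, but the point stands: the 'noise' and the record of how it was handled remain available for whoever comes along later with better questions.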
Photo credit: Laser Burners via Foter.com / CC BY-NC-ND