Identifying Statistical Outliers in Your Survey Data

Marketers often undertake large survey projects with little forethought about their approach to data analysis. Compounding this problem is their general lack of interest in cleaning the data they collect.

Data cleaning isn’t really optional. Without it, your quantitative data may be tainted, and your actions may be based on inaccurate information.

Identifying statistical outliers is a key part of data cleaning, and that’s what we’re going to cover here. We’ll discuss how we identify an outlier in relation to the study’s goals and the kind of data collected, and what to do with an outlier once identified (to omit it or leave it in your results).

Data points that lie outside of the trend set by the majority of other values are typically easy to distinguish when the data is represented visually in a graph.

For example, the day you get 139 trial signups on your marketing site when the daily median is closer to 60 would be an obvious outlier, right?

Well, maybe.

But it’s tough to say without doing a little simple math first. [Notice that we used the median, not the average, in the example. That’s because an average can be skewed by an outlier, and heavily so if the sample is small.]
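To make that point concrete, here is a minimal Python sketch that compares how far the average and the median move when the 139-signup day is dropped. It assumes the 13 daily signup counts from the example used in this article; the variable names are illustrative only.

```python
from statistics import mean, median

# Daily trial signups from the article's 13-day example.
signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

# The same sample with the suspected outlier (139) left out.
without_spike = [x for x in signups if x != 139]

# How far each summary statistic moves because of that one day.
mean_shift = mean(signups) - mean(without_spike)
median_shift = median(signups) - median(without_spike)

print(f"mean:   {mean(signups):.1f} with the spike, {mean(without_spike):.1f} without")
print(f"median: {median(signups)} with the spike, {median(without_spike)} without")
print(f"mean moved by {mean_shift:.1f}, median moved by {median_shift}")
```

In this sample the single spike pulls the average up by roughly six signups, while the median moves by only half a signup, which is why the median is the safer reference point here.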

How to Calculate the Median

Start by taking your sample and ordering the observations from lowest to highest. As an example, we’ll stick with the trial signup hypothetical. In this case, we have a sample of 13 days and the signups from those days. Rearranged from smallest to largest, they look like this:

Day 1: 32
Day 2: 45
Day 3: 49
Day 4: 52
Day 5: 59
Day 6: 62
Day 7: 63 <-median
Day 8: 67
Day 9: 68
Day 10: 71
Day 11: 72
Day 12: 74
Day 13: 139

The median of this data set is the seventh of the 13 sorted values (Day 7): 63 trial signups. If you happen to have an even number of observations, the median is the average of the two values closest to the middle. Now that we have the median for this sample, we’ll label it Q2. It sits between Q1 and Q3, which define the lower and upper quartiles.
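The steps above can be sketched in a few lines of Python. This is an illustrative helper, not a prescribed method: the function name median_of and the variable q2 are assumptions made for this example, and Python’s built-in statistics.median would give the same result.

```python
def median_of(values):
    """Return the median: the middle value of the sorted sample, or the
    average of the two middle values when the count is even."""
    ordered = sorted(values)  # lowest to highest
    n = len(ordered)
    if n == 0:
        raise ValueError("cannot take the median of an empty sample")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two values closest to the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Daily trial signups from the example (order does not matter here).
signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

q2 = median_of(signups)
print("Q2 (median):", q2)
```

With the 13 values above, the helper picks the seventh sorted value, 63. With an even-sized sample such as the first four values (32, 45, 49, 52), it would average 45 and 49 instead.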
