Good Data and Outliers
by Harry J. Foxwell
I sometimes note that I am an "outlier" among my academic colleagues, most of whom had their doctorates by the age of 30 or even younger. I earned mine at age 55 — not a rarity, of course, but a bit outside the typical range for that degree. So a table or graph of "age at doctorate" would highlight my "odd" data point, worthy of a second look as to whether it was an error or real data. And what of a new doctorate's reported age of 102? Clearly a mistake! Well, maybe not.
What, then, is an outlier? Informal definitions use words like "unusual", "different" in some respect from the "typical" data points, "abnormal", "oddity", "artifact", and, of course, "error". More technical definitions rely on statistical characteristics, such as "an observation which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile" — a rule best understood with a visualization such as a boxplot.
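The 1.5×IQR rule quoted above is easy to apply directly. Here is a minimal sketch using NumPy, with a small hypothetical "age at doctorate" sample (the values are invented for illustration):

```python
import numpy as np

# Hypothetical ages at doctorate, including two atypical entries.
ages = np.array([27, 28, 29, 29, 30, 31, 32, 33, 34, 55, 102])

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# The 1.5 * IQR fences on either side of the middle 50% of the data.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)  # both 55 and 102 fall above the upper fence
```

Note that the rule only *flags* the 55 and the 102; it cannot tell you which one is real data and which one is a recording error — that still takes human judgment.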
Identifying outliers first requires a sense of what your "typical" dataset values are, and then a definition of how far or how different a value must be from the typical to count as unusual. You then need to find them (how?), interpret them (good data or error?), and perhaps fix them (keep or delete?).
The temptation is often to simply dismiss such atypical data values, perhaps even trimming them from your dataset to make your predictive models look pretty. But inspection of outliers can lead to surprising insights when some statistical expectation is challenged. Isaac Asimov said it best: "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" A detailed examination of unusual data can also cause you to reevaluate your measurement methods.
Should you attempt to directly (visually) locate and inspect outliers in your data, or instead trust some automated process to decide whether to delete, modify, or include them? If we're talking "Big Data", effective human inspection of massive datasets is essentially impossible. Python's scikit-learn library offers helpful anomaly (outlier) detection functions based on machine learning algorithms that can be applied to large datasets to find not only individual extreme values but also groups of measurements that are collectively unusual.
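One such scikit-learn function is `IsolationForest`, which scores points by how easily random splits isolate them. The sketch below uses synthetic data (a dense cluster plus a few invented extremes) purely for illustration; the `contamination` parameter is our guess at the outlier fraction, a tuning choice, not something the algorithm discovers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical 2-D measurements: 200 typical points plus 3 extremes.
typical = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extremes = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([typical, extremes])

# contamination = assumed fraction of outliers; random_state for repeatability.
clf = IsolationForest(contamination=0.02, random_state=0)
labels = clf.fit_predict(X)  # -1 marks predicted outliers, 1 marks inliers

# Indices of the points the model considers anomalous.
flagged = np.where(labels == -1)[0]
print(flagged)
```

Because the method works on whole feature vectors, it can flag a point whose individual coordinates look ordinary but whose *combination* of values is unusual — exactly the "collectively unusual" case that eyeballing a single column would miss.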
Dealing with outliers is only one aspect of creating good datasets; choosing appropriate representations, avoiding basic recording errors, and eliminating various forms of bias are equally critical to your data analytics process, interpretation, and conclusions.
This article was contributed by Harry J. Foxwell, author of Creating Good Data.