The aphorism “If you can’t measure it, you can’t manage it” is common in contemporary life. It is often attributed to business guru Peter Drucker, and, even if he did not say it, the notion has become a slogan for the quantified, big-data world in which we live. In boardrooms, non-profits, and universities, we are fixated on quantifiable measures. Otherwise, how do you know what to improve? Another aphorism I find equally compelling is Goodhart’s law, which, in Marilyn Strathern’s words, states: “When a measure becomes a target it ceases to be a good measure” (Strathern, 1997: 308). Why? Because measures that become targets are soon subject to manipulation. I refer to this as the 3-M’s paradox (measure/manage/manipulate). I first thought about this in research on ratings and rankings at an online photography sharing site. I concluded that evaluation in the digital age is characterized by the following.
- It’s hard to quantify the qualitative: there was much experimentation with rating and ranking systems.
- Quantitative mechanisms beget their manipulation: people “mate” rated friends, “revenge” rated enemies, and inflated their own standing.
- “Fixes” to manipulation have their own, often unintended, consequences and are also susceptible to manipulation: non-anonymous ratings led to rating inflation.
- Quantification (and how one implements it) privileges some things over others: nudes were highly rated, more so when measured by number of comments, not so with photos of flowers.
- Any “fixes” often take the form of more elaborate, automated, and meta quantification: such as making some users “curators” or labeling them as “helpful.”
Of course, this extends beyond online ratings communities. When politicians sought to manage primary schools on the basis of measures of student achievement, cheating soon followed. My favorite example of this is in Texas, where administrators “disappeared” poorly performing students so that they could not take the standardized tests. Colleges can be measured with respect to class size and selectivity; this too can be “gamed.”
What is most interesting about ranking systems that reduce multiple variables into a single index is how arbitrary they often are. In a classic paper, Richard Becker and his colleagues showed how the rankings of the best places to live could be manipulated. While the methods used to construct the rankings showed fairly good agreement at the top and bottom ends, the choice of ranking method and how the variables were weighted made significant differences in order (Becker et al., 1987). Malcolm Gladwell summed up the problem: “A ranking can be heterogeneous … as long as it doesn’t try to be too comprehensive. And it can be comprehensive as long as it doesn’t try to measure things that are heterogeneous” (Gladwell, 2011). Yet many schemes try to do both, including U.S. News’ college rankings. (To get a feel for this, you can play Jeffrey Stake’s ranking game of law schools.)
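The sensitivity of a composite index to its weights is easy to see in miniature. The sketch below uses entirely hypothetical places and attribute scores (not Becker et al.’s actual data or methods): the same three places, scored on the same three attributes, finish in different orders depending solely on which attribute the weighting scheme favors.

```python
# A minimal sketch, with made-up data, of how the choice of weights
# changes a composite ranking, in the spirit of Becker et al. (1987).

def rank(places, weights):
    """Rank places by a weighted sum of their attribute scores (higher is better)."""
    scores = {name: sum(w * v for w, v in zip(weights, attrs))
              for name, attrs in places.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical attribute scores: (climate, jobs, housing affordability)
places = {
    "Alphaville": (9, 4, 3),
    "Betatown":   (4, 9, 5),
    "Gammaburg":  (6, 6, 8),
}

print(rank(places, (0.6, 0.2, 0.2)))  # weight climate heavily
# → ['Alphaville', 'Gammaburg', 'Betatown']
print(rank(places, (0.2, 0.6, 0.2)))  # weight jobs heavily
# → ['Betatown', 'Gammaburg', 'Alphaville']
```

No place changed; only the weights did, yet each weighting crowns a different “best place to live.” Since any choice of weights is ultimately a judgment call, so is the resulting order.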
Honestly, I’m confused by all of this. Clearly, we need to measure some things, but we also need to be highly skeptical of what we choose to measure, how we do so, and what we do with the resulting data.
Becker RA, Denby L, McGill R, et al. (1987) Analysis of data from the places rated almanac. American Statistician, 41(3), 169–186, Available from: http://www.jstor.org/pss/2685098 (accessed 19 August 2011).
Gladwell M (2011) The order of things. The New Yorker, Available from: http://www.newyorker.com/magazine/2011/02/14/the-order-of-things (accessed 18 December 2014).
Strathern M (1997) ‘Improving ratings’: audit in the British University system. European Review, 5(3), 305–321, Available from: http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=5299904.