Before companies can profit from big data, they often must deal with bad data. There may indeed be gold in the mountains of information that firms collect today, but there also are stores of contaminated or “noisy” data. In large organizations, especially financial institutions, data often suffer from mislabeling, omissions, and other inaccuracies. In firms that have undergone mergers or acquisitions, the problem is usually worse.

Contaminated data are a fact of life in statistics and econometrics. It is tempting to ignore or throw out bad data, or to assume that they can be “fixed” (or even identified) somehow. In general, that assumption does not hold.

I have been studying and writing about how to clean up and optimize data for big data analysis since the early 1990s. At this point I am forced to concede that figuring out how to clean, transform, and recast real-world data to make it informative and actionable is as much an art as a science. It turns out that a big chunk of the time we spend on data analytics goes to cleaning and recoding the data we work with so that our algorithms and queries can give us sensible clues to what might really be going on under the surface. (The other big chunk of time goes to problem formulation – a subject for another posting.)

Data corruption is particularly concerning when noisy data are used to test a predictive model. Data problems here can be especially acute because we rely on the test data to determine, for example, how far we can trust a new predictive model, or whether we should recalibrate an existing model or take it off-line entirely. If the test is flawed, our conclusions will be flawed too.
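To make the point concrete, here is a minimal simulated sketch in Python. Everything in it is an assumption chosen for illustration rather than drawn from real data: the 10,000-record test set, the 20% event rate, a hypothetical model that is right about 90% of the time, and a 15% rate of randomly mislabeled outcomes. It simply scores the same predictions twice, once against the true outcomes and once against the contaminated ones.

```python
import random


def measured_accuracy(recorded_labels, predictions):
    """Fraction of predictions that match the recorded labels."""
    matches = sum(t == p for t, p in zip(recorded_labels, predictions))
    return matches / len(recorded_labels)


def flip_labels(labels, noise_rate, rng):
    """Return a copy of binary labels with roughly noise_rate of them flipped."""
    return [1 - y if rng.random() < noise_rate else y for y in labels]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical test set: 10,000 binary outcomes with a 20% event rate.
clean_labels = [1 if rng.random() < 0.2 else 0 for _ in range(10_000)]

# Hypothetical model that agrees with the true outcome about 90% of the time.
predictions = [y if rng.random() < 0.9 else 1 - y for y in clean_labels]

# The same test set after about 15% of recorded outcomes are mislabeled
# (an assumed, purely random form of contamination).
noisy_labels = flip_labels(clean_labels, noise_rate=0.15, rng=rng)

acc_clean = measured_accuracy(clean_labels, predictions)
acc_noisy = measured_accuracy(noisy_labels, predictions)
print(f"accuracy against clean labels: {acc_clean:.3f}")
print(f"accuracy against noisy labels: {acc_noisy:.3f}")
```

The model is identical in both runs; only the test data changed. Under this simple random-flip assumption, a model with true accuracy a scored against labels flipped at rate e should show a measured accuracy of roughly a(1 - e) + (1 - a)e, so a good model can look mediocre purely because of the test set. Real contamination is rarely this tidy, which is part of why it is so hard to correct for.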

Roger M. Stein is a Senior Lecturer in Finance at the MIT Sloan School of Management, a Research Affiliate at the MIT Laboratory for Financial Engineering, and Chief Analytics Officer at State Street Global Exchange.

