Before companies can profit from big data, they often must deal with bad data. There may indeed be gold in the mountains of information that firms collect today, but there are also stores of contaminated or “noisy” data. In large organizations, especially financial institutions, data often suffers from mislabeling, omissions, and other inaccuracies. In firms that have undergone mergers or acquisitions, the problem is usually worse.
Contaminated data is a fact of life in statistics and econometrics. It is tempting to ignore or throw out bad data, or to assume that it can somehow be “fixed” (or even identified). In general, that assumption does not hold.