02 February 2009

Murphy's Laws for Data

I've had the privilege of digging through some of Murphy's papers and it transpires that there is a whole collection of lesser-known variants of Murphy's Law specifically for data.

Murphy's handwriting leaves a little to be desired, and my access was fairly limited, but from what I can gather the following laws are inviolate:

Murphy's Laws for Data (ML4D)

  1. If data can be wrong, it will be.
  2. If data can be misinterpreted, it will be.
  3. If data can be biased, it will be.
  4. If data can be misformatted, it will be.
  5. If data can be incomplete, it will be.
  6. If errors in data can pass silently, some will.
  7. If data formats are ambiguous, all interpretations will be used.
  8. If data formats are unambiguous, they will be ignored.
  9. If summarization can destroy meaning, it will.
  10. If patterns can be non-linear, they will be.
  11. If data items can contain separators, they will.
  12. If data can be destroyed, it will be (except when the goal is data destruction).
  13. The life expectancy of any datum is inversely proportional to its utility and correctness.
  14. The likelihood of data being correct is inversely proportional to the importance of the decisions it will be used to inform.
  15. Some of the data will be case sensitive.
  16. If input and output encodings can be different, they will be.
  17. Representative samples aren't.
  18. If data can be encoded in EBCDIC, it will be.
  19. If escape conventions can differ, they will.
  20. If the data is correct, then the checksum will be incorrect; and vice versa.
  21. Encryption will render the data unreadable by the encryptor and transparent to others.
  22. Dates are subject to their own special versions of Murphy's Laws for Data.
  23. Passwords are also subject to their own special versions of Murphy's Laws for Data.
  24. Data that demands to be graphed won't be.
  25. Excel will obscure all meaning in data with a combination of chart-junk and inappropriate defaults.
  26. Causal relationships change immediately after detection.
  27. The likelihood that a confidence test on data has been applied correctly is less than the stated confidence level.
  28. Backups become corrupted/missing at exactly the same time as their corresponding master.
  29. The obvious interpretation is incorrect.
  30. The correct interpretation is implausible.
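
Law 11 is perhaps the easiest of these to watch in action. The following is only a minimal sketch in Python, using a made-up record (the name and city are illustrative assumptions, not anything from Murphy's papers): a field that contains a comma quietly breaks naive comma-splitting, while the standard csv module, which quotes such fields on writing, reads it back intact.

```python
import csv
import io

# Hypothetical record: the second field contains the separator itself.
row = ["42", "Smith, John", "London"]

# Write the record with the csv module, which quotes any field
# containing the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue().strip()

# Naive splitting on commas ignores the quoting and miscounts the fields.
naive_fields = line.split(",")

# csv.reader honours the quoting and recovers the original fields.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), "fields from naive splitting")
print(len(parsed_fields), "fields from csv.reader")
```

Of course, this only helps when both the writer and the reader agree on the quoting convention, which Law 19 suggests they won't.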

Summary

Data will at best be incorrect, misinterpreted, misformatted, biased, incomplete, non-linear, misgraphed and quickly lost.

Footnote (Data and Plurals)

I am aware that there is a school of thought that maintains that the word data is plural and that on this basis we should say things like "the data are wrong". Neither Murphy nor I attended that school, but it is our opinion that the data supporting this view is questionable and that such usage, in this twenty-first century, is at best archaic and possibly even affected. Those of a different and more delicate sensibility are respectfully requested to pass over these laws quickly to avoid undue distress.
