be clear, replicable, and precise procedures for collecting data so that each person who collects it does it in the same way. Each person who is counting has to count in the same way. Take Gleason grading of tumors: it is only relatively standardized, meaning that you can get different Gleason scores, and hence cancer stage labels, from different pathologists. (In Gleason scoring, a sample of prostate tissue is examined under a microscope and assigned a score from 2 to 10 to indicate how likely it is that a tumor will spread.) Psychiatrists differ in their opinions about whether a certain patient has schizophrenia or not. Statisticians disagree about what constitutes a sufficient demonstration of psychic phenomena. Pathology, psychiatry, parapsychology, and other fields strive to create well-defined procedures that anyone can follow and obtain the same results, but in almost all measurements, there are ambiguities and room for differences of opinion. If you are asked to weigh yourself, do you do so with or without clothes on, with or without your wallet in your pocket? If you’re asked to take the temperature of a steak on the grill, do you measure it in one spot or in several and take the average?
Measurement Error
Participants may not understand a question the way the researcher thought they would; they may fill in the wrong bubble on a survey, or in a variety of unanticipated ways, they may not give the answer that they intended. Measurement error occurs in every measurement, in every scientific field. Physicists at CERN reported that they had measured neutrinos traveling faster than the speed of light, a finding that would have been among the most important of the last hundred years. They reported later that they had made an error in measurement.
Measurement error turns up whenever we quantify anything. The 2000 U.S. presidential election came down to measurement error (and to unsuccessfully recording people’s intentions): Different teams of officials, counting the same ballots, came up with different numbers. Part of this was due to disagreements over how to count a dimpled chad, a hanging chad, etc.—problems of definition—but even when strict guidelines were put in place, differences in the count still showed up.
We’ve all experienced this: When counting pennies in our penny jar, we get different totals if we count twice. When standing on a bathroom scale three times in a row, we get different weights. When measuring the size of a room in your house, you may get slightly different lengths each time you measure. These are explainable occurrences: The springs in your scale are imperfect mechanical devices. You hold the tape measure differently each time you use it, it slips from its resting point just slightly, you read the sixteenths of an inch incorrectly, or the tape measure isn’t long enough to measure the whole room so you have to mark a spot on the floor and take the measurement in two or three pieces, adding to the possibility of error. The measurement tool itself could have variability (indeed, measurement devices have accuracy specifications attached to them, and the higher-priced the device, the more accurate it tends to be). Your bathroom scale may only be accurate to within half a pound, a postal scale within half an ounce (one thirty-second of a pound).
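One common way to cope with this kind of variability is to repeat the measurement and report both the average and the spread. The short Python sketch below illustrates the idea; the three bathroom-scale readings are made-up numbers chosen for illustration, and the helper name `summarize_readings` is hypothetical.

```python
import statistics


def summarize_readings(readings):
    """Return (mean, sample standard deviation) of repeated measurements."""
    mean = statistics.mean(readings)
    spread = statistics.stdev(readings)  # sample standard deviation
    return mean, spread


# Hypothetical readings, in pounds, from stepping on a scale three times
readings = [150.5, 151.0, 150.0]
mean, spread = summarize_readings(readings)
print(f"mean = {mean:.1f} lb, spread = {spread:.1f} lb")
```

If the spread is about as large as the scale's stated accuracy, the differences among the readings are probably just the instrument's normal variability.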
A 1960 U.S. Census study recorded sixty-two women aged fifteen to nineteen with twelve or more children, and a large number of fourteen-year-old widows. Common sense tells us that there can’t be many fifteen- to nineteen-year-olds with twelve children, and fourteen-year-old widows are very uncommon. Someone made an error here. Some census-takers might have filled in the wrong box on a form, accidentally or on purpose to avoid having to conduct time-consuming interviews. Or maybe an impatient (or impish) group of responders to the survey made up outlandish stories and the census-takers didn’t notice.
In 2015 the New England Patriots were accused