Wednesday, February 3, 2016

Broken Glass, Mangled Statistics

The motto of ASTM International is “Helping Our World Work Better.” This internationally recognized standards development organization contributes to the world of forensic science by promulgating standards of various kinds for performing and interpreting chemical and other tests.

By mid-August 2015, five ASTM Standards were up for public comment to the Organization of Scientific Area Committees. OSAC “is part of an initiative by NIST and the Department of Justice to strengthen forensic science in the United States.” [1] Operating as “[a] collaborative body of more than 500 forensic science practitioners and other experts,” [1] OSAC is reviewing and developing documents for possible inclusion on a Registry of Approved Standards and a Registry of Approved Guidelines. 1/ NIST promises that “[a] standard or guideline that is posted on either Registry demonstrates that the methods it contains have been assessed to be valid by forensic practitioners, academic researchers, measurement scientists, and statisticians ... .” [2]

Last month, OSAC approved its first Registry entry (notwithstanding some puzzling language), ASTM E2329-14, a Standard Practice for Identification of Seized Drugs. Another standard on the list for OSAC’s quasi-governmental seal of approval is ASTM E2926-13, a Standard Test Method for Forensic Comparison of Glass Using Micro X-ray Fluorescence (μ-XRF) Spectrometry (available for a fee). It will be interesting to see whether this standard survives the scrutiny of measurement scientists and statisticians, for it raises a plethora of statistical issues.

What It Is All About

Suppose that someone stole some bottles of beer and money from a bar, breaking a window to gain entry. A suspect’s clothing is found to contain four small glass fragments. Various tests are available to help determine whether the four fragments (the “questioned” specimens) came from the broken window (the “known”). The hypothesis that they did can be denoted H1, and the “null hypothesis” that they did not can be designated H0.

Micro X-ray Fluorescence (μ-XRF) Spectrometry involves bombarding a specimen with X-rays. The material then emits other X-rays at frequencies that are characteristic of the elements that compose it. In the words of the ASTM Standard, “[t]he characteristic X-rays emitted by the specimen are detected using an energy dispersive X-ray detector and displayed as a spectrum of energy versus intensity. Spectral and elemental ratio comparisons of the glass specimens are conducted for source discrimination or association.” Such “source discrimination” would be a conclusion that H0 is true; “association” would be a conclusion that H1 is true. The former finding would mean that the suspect's glass fragments did not come from the crime scene; the latter would mean that they came either from the broken window at the bar or from another piece of glass somewhere with a similar elemental composition.

Unspecified "Sampling Techniques"
for Assessing Variability Within the Pane of Window Glass

One statistical issue arises from the fact that the known glass is not perfectly homogeneous. Even if measurements of the ratios of the concentrations of different elements in a specimen are perfectly precise (the error of measurement is zero), a fragment from one location could have a different ratio than a fragment from another place in the known specimen. This natural variability must be accounted for in deciding between the two hypotheses. The Standard wisely cautions that “[a]ppropriate sampling techniques should be used to account for natural heterogeneity of the material, varying surface geometries, and potential critical depth effects.” But it gives no guidance at all as to what sampling techniques can accomplish this and how measurements that indicate spatial variation should be treated.

The Statistics of "Peak Identification"

The section of ASTM E2926-13 on “Calculation and Interpretation of Results” advises analysts to “[c]ompare the spectra using peak identification, spectral comparisons, and peak intensity ratio comparisons.” First, “peak identification” means comparing “detected elements of the questioned and known glass spectra.” The Standard indicates that when “[r]eproducible differences” in the elements detected in the specimens are found, the analysis can cease and the null hypothesis H0 can be presented as the outcome of the test. No further analysis is required. The criterion for when an element “may be” detected is that “the area of a characteristic energy of an element has a signal-to-noise ratio of three or more.” Where did this statistical criterion come from? What is the sensitivity and specificity of a test for the presence of an element based on this criterion?

The Statistics of Spectral Comparisons

Second, “spectral comparisons should be conducted,” but apparently, only “[w]hen peak identification does not discriminate between the specimens.” This procedure amounts to eyeballing (or otherwise comparing?) “the spectral shapes and relative peak heights of the questioned and known glass specimen spectra.” But what is known about the performance of criminalists who undertake this pattern-matching task? Has their sensitivity and specificity been determined in controlled experiments, or are judgments accepted on the basis of self-described but incompletely validated “knowledge, skill, ability, experience, education, or training ... used in conjunction with professional judgment,” to use a stock phrase found in many an ASTM Standard?

The Statistics of Peak Intensity Ratios

Third, only “[w]hen evaluation of spectral shapes and relative peak heights do not discriminate between the specimens” does the Standard recommend that “peak intensity ratios should be calculated.” These “peak intensity ratio comparisons” for elements such as “Ca/Mg, Ca/Ti, Ca/Fe, Sr/Zr, Fe/Zr, and Ca/K” “may be used” “[w]hen the area of a characteristic energy peak of an element has a signal-to-noise ratio of ten or more.” To choose between “association” and “discrimination of the samples based on elemental ratios,” the Standard recommends, “when practical,” analyzing “a minimum of three replicates on each questioned specimen examined and nine replicates on known glass sources.” Inasmuch as the Standard emphasizes that “μ-XRF is a nondestructive elemental analysis technique” and “fragments usually do not require sample preparation,” it is not clear just when the analyst should be content with fewer than three replicate measurements—or why three and nine measurements provide a sufficient sampling to assess measurement variability in two sets of specimens, respectively.

Nevertheless, let’s assume that we have three measurements on each of the four questioned specimens and nine on the known specimen. What should be done with these two sets of numbers? The Standard first proposes a “range overlap” test. I’ll quote it in full:
For each elemental ratio, compare the range of the questioned specimen replicates to the range for the known specimen replicates. Because standard deviations are not calculated, this statistical measure does not directly address the confidence level of an association. If the ranges of one or more elements in the questioned and known specimens do not overlap, it may be concluded that the specimens are not from the same source.
Two problems are glaringly apparent. First, statisticians appreciate that the range is not a robust statistic. It is heavily influenced by any outliers. Second, if the statistical properties of the "ratio ranges" are unknown, how can one know what to conclude—and what to tell a judge, jury, or investigator about the strength of the conclusion? Would a careful criminalist who finds no range overlap have to quote or paraphrase the introduction to the Standard, and report that "the specimens are indistinguishable in all of these observed and measured properties," so that "the possibility that they originated from the same source of glass cannot be eliminated"? Would the criminalist have to add that there is no scientific basis for stating what the statistical significance of this inability to tell them apart is? Or could an expert rely on the Standard to say that by not eliminating the same-source possibility, the tests "conducted for source discrimination or association" came out in favor of association?
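
The range-overlap rule is simple enough to sketch, and doing so makes the outlier problem concrete. The code below is only an illustration (it is not part of the Standard); the numbers are hypothetical elemental-ratio measurements, and a single aberrant replicate in the known specimen is enough to flip the conclusion.

```python
# Hypothetical illustration of the range-overlap rule described in the
# Standard. One outlier among the known-specimen replicates flips the result.

def ranges_overlap(questioned, known):
    """True if the [min, max] ranges of the two sets of replicate
    measurements of an elemental ratio overlap."""
    return max(questioned) >= min(known) and max(known) >= min(questioned)

q = [1.30, 1.32, 1.31]  # 3 replicates on a questioned fragment
k = [1.20, 1.21, 1.22, 1.20, 1.23, 1.21, 1.22, 1.20, 1.21]  # 9 on the known

print(ranges_overlap(q, k))          # ranges [1.30, 1.32] vs [1.20, 1.23]: False

k_outlier = k[:-1] + [1.31]          # one aberrant replicate widens the range
print(ranges_overlap(q, k_outlier))  # now True: "cannot be eliminated"
```

Nothing in the rule weighs how close the ranges are or how many replicates produced them; the binary answer swings on a single extreme value.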

The Standard offers a cryptic alternative to the simplistic range method (without favoring one over the other and without mentioning any other statistical procedures):
±3s—For each elemental ratio, compare the average ratio for the questioned specimen to the average ratio for the known specimens ±3s. This range corresponds to 99.7 % of a normally distributed population. If, for one or more elements, the average ratio in the questioned specimen does not fall within the average ratio for the known specimens ±3s, it may be concluded that the samples are not from the same source.
The problems with this poorly written formulation of a frequentist hypothesis test are legion:

1. What "population" is "normally distributed"? Apparently, it is the population of measurements of the elemental ratios in the known specimen. What supports the assumption of normality?

2. What is "s"? The standard deviation of what variable? It appears to be the sample standard deviation of the nine measurements on the known specimen.

3. The Standard seems to contemplate a 99.7% confidence interval (CI) for the mean μ of the ratios in the known specimen. If the measurement error is normally distributed about μ, then the CI for μ is approximately the known specimen's sample mean ±4.3s. This margin of error is larger than ±3s because the population standard deviation σ is unknown and the standardized sample mean therefore follows a t-distribution with eight degrees of freedom. The desired 99.7% is the coverage probability for a ±3σ CI. Using ±3 with the estimate s rather than the true value σ yields a confidence coefficient below 99%. One would have to use a multiplier greater than 4 rather than 3 to achieve 99.7% confidence.
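
The shortfall is easy to check numerically. The simulation below is a sketch under the stated assumption of normally distributed measurement error with nine replicates: it estimates how often the t-statistic from a sample of nine falls within ±3, that is, the actual coverage of a ±3 multiplier when σ must be estimated by s. The answer comes out near 98%, not 99.7%.

```python
# Sketch: with sigma estimated by s from n = 9 replicates, a +/-3
# multiplier no longer delivers 99.7% coverage (normal errors assumed).
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)
n, reps, inside = 9, 100_000, 0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    t = mean(x) * sqrt(n) / stdev(x)  # t-statistic, 8 degrees of freedom
    if abs(t) <= 3:
        inside += 1

print(round(inside / reps, 3))                 # roughly 0.98, short of 0.997
print(round(2 * NormalDist().cdf(3) - 1, 4))   # 0.9973 holds only if sigma is known
```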

4. The use of any confidence interval centered on the sample mean of the measurements on the known specimen is misguided. Why ignore the variance in the measured ratios in the questioned specimens? That is, the recommendation tells the analyst to ask whether, for each ratio in each questioned specimen, the miscomputed 99.7% CI covers “the average ratio in the questioned specimen.” But this “average ratio” is not the true ratio. The usual procedure (assuming normality) would be a two-sample t-test of the difference between the mean ratio for the questioned sample and the mean for the known specimen.
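
For readers who want the shape of that usual alternative, here is a minimal Welch-style two-sample t statistic. This is a sketch, not the Standard's procedure; the data are hypothetical, and a full test would refer the statistic to a t distribution with the appropriate degrees of freedom.

```python
# Sketch of the conventional two-sample comparison: Welch's t statistic
# for the difference between the questioned-sample mean and the
# known-sample mean, using the variance of BOTH sets of replicates.
from math import sqrt
from statistics import mean, variance

def welch_t(questioned, known):
    m_q, m_k = mean(questioned), mean(known)
    se = sqrt(variance(questioned) / len(questioned) +
              variance(known) / len(known))
    return (m_q - m_k) / se

q = [1.305, 1.312, 1.308]  # hypothetical ratios, 3 replicates
k = [1.301, 1.304, 1.310, 1.302, 1.307, 1.305, 1.303, 1.306, 1.304]

print(round(welch_t(q, k), 2))  # modest t value: no basis for exclusion here
```

Unlike the ±3s recipe, the standard error in the denominator reflects the sampling variability of both specimens, which is the author's point in item 4.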

5. Even with the correct test statistic and distribution, the many separate tests (one for each ratio Ca/Mg, Ca/Ti, Fe/Zr, etc.) cloud the interpretation of the significance of the difference in a pair of sample means. Moreover, with multiple questioned specimens, the probability of finding a significant difference in at least one ratio for at least one questioned fragment is greater than the significance probability in a single comparison. The risk of a false exclusion for, say, ten independent comparisons could approach ten times the nominal value of 0.003.
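
The inflation is simple arithmetic. Under independence (an assumption for illustration; elemental ratios sharing a numerator such as Ca are in fact correlated), the familywise false-exclusion probability across k comparisons is 1 − (1 − α)^k:

```python
# Familywise false-exclusion probability for k comparisons, each run at
# the nominal alpha = 0.003. Independence is assumed for illustration;
# ratios sharing an element (e.g., Ca/Mg and Ca/Ti) are correlated.
alpha = 0.003
for k in (1, 6, 10):
    fwer = 1 - (1 - alpha) ** k
    print(k, round(fwer, 4))   # k = 10 gives about 0.0296, nearly ten-fold
```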

6. Why ±3 as opposed to, say, ±4? I mention ±4 not because it is clearly better, but because it is the standard for making associations using a different test method (ASTM E2330). What explains the same standards development organization promulgating facially inconsistent statistical standards?

7. Why strive for a 0.003 false-rejection probability as opposed to, say, 0.01, 0.03, or anything else? This type of question can be asked about any sharp cutoff. Why is a difference of 2.99σ dismissed as not useful when 3σ is definitive? Within the classical hypothesis-testing framework, an acceptable answer would be that the line has to be drawn somewhere, and the 0.003 significance level is needed to protect against the risk of a false rejection of the null hypothesis in situations in which a false rejection would be very troublesome. Some statistics textbooks even motivate the choice of the less demanding but more conventional significance level of 0.05 by analogizing to a trial in which a false conviction is much more serious than a false acquittal.

Here, however, that logic cuts in the opposite direction. The null hypothesis H0 that should not be falsely rejected is that the two sets of measurements come from fragments that do not have a common source. But 0.003 is computed for the hypothesis H1 that the fragments all come from the same, known source. The significance test in ASTM E2926-13 addresses (in its own way) the difference in the means when sampling from the known specimen. Using a very demanding standard for rejecting H1 in favor of the suspect’s claim H0 privileges the prosecution claim that the fragments come from the same source. 2/ And it does so without mentioning the power of the test: What is the probability of reporting that fragments are indistinguishable — that there is an association — when the fragments do come from different sources? Twenty years ago, when a National Academy of Sciences panel examined and approved the FBI's categorical rule of "match windows" for DNA testing, it discussed both operating characteristics of the procedure—the ability to declare a match for DNA samples from the same source (sensitivity) and the ability to declare a nonmatch for DNA samples from different sources (specificity). [3] By looking only to sensitivity, ASTM E2926-13 takes a huge step backwards.
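
The missing operating characteristic can be estimated by simulation. The sketch below is hypothetical throughout: it assumes normal measurement error with unit standard deviation and applies the ±3s rule exactly as quoted, asking how often the rule fails to exclude a questioned specimen whose true mean ratio differs from the known specimen's by two measurement standard deviations. The answer is: most of the time.

```python
# Sketch: false-association rate of the +/-3s rule when the questioned
# fragment's true mean ratio differs from the known's by delta = 2
# measurement standard deviations (normality and unit sigma assumed).
import random
from statistics import mean, stdev

random.seed(7)
delta, reps, missed = 2.0, 20_000, 0
for _ in range(reps):
    known = [random.gauss(0.0, 1.0) for _ in range(9)]
    questioned = [random.gauss(delta, 1.0) for _ in range(3)]
    if abs(mean(questioned) - mean(known)) <= 3 * stdev(known):
        missed += 1   # rule fails to exclude: a false "association"

print(round(missed / reps, 2))  # well over half of truly different sources pass
```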

8. Whatever significance level is desired, to be fair and balanced in its interpretation of the data, a laboratory that undertakes hypothesis tests should report the probability of differences in the test statistic as large or larger than those observed under the two hypotheses: (1) when the sets of measurements come from the same broken window (H1); and (2) when the sets of measurements come from different sources of glass in the area in which the suspect lives and travels (H0). The ASTM Standard completely ignores H0. Data on the distribution of the elemental composition of glass in the geographic area would be required to address it, and the Standard should at least gesture toward how such data should be used. If such data are missing, the best the analyst can do is to report candidly that the questioned fragment might have come from the known glass or from any other glass with a similar set of elemental concentrations and, for completeness, to add that how often other glass like this is present is unknown.

9. Would a likelihood ratio be a better way to express the probative value of the data? Certainly, there is an argument to that effect in the legal and forensic science literature. [4-8] Quantifying and aggregating the spectral data that the ASTM Standard now divides into three lexically ordered procedures and combining them with other tests on glass would be a challenge, but it merits thought. Should not the Standard explicitly acknowledge that reporting on the strength of the evidence rather than making categorical judgments is a respectable approach?
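
A toy version of the likelihood-ratio idea for a single elemental ratio shows the shape of the alternative. Everything here is hypothetical: the same-source density is centered on the known specimen's mean with the measurement standard deviation, and the different-source density stands in for the (unavailable) distribution of that ratio in the relevant glass population.

```python
# Toy likelihood ratio for one measured elemental ratio x (all numbers
# hypothetical). LR = density under same-source / density under the
# population of other glass; values above 1 favor a common source.
from statistics import NormalDist

mu_known, sd_meas = 2.00, 0.05  # known-specimen mean; measurement error
mu_pop, sd_pop = 2.50, 0.50     # stand-in for the glass-population distribution
x = 2.02                        # measured ratio on the questioned fragment

lr = NormalDist(mu_known, sd_meas).pdf(x) / NormalDist(mu_pop, sd_pop).pdf(x)
print(round(lr, 1))             # comfortably above 1: supports a common source
```

A graded number of this kind, rather than a categorical include/exclude, is what the cited literature recommends; the hard part, as the text notes, is supplying a defensible population distribution for the denominator.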

* * *

In sum, even within the framework of frequentist hypothesis testing, ASTM E2926 is plagued with problems — from the wrong test statistic and procedure for the specified level of “confidence,” to the reversal of the null and alternative hypotheses, to the failure to consider the power of the test. Can such a Standard be considered “valid by forensic practitioners, academic researchers, measurement scientists, and statisticians”?

Notes
  1. The difference between the two is not pellucid, since OSAC-approved standards can be a list of “shoulds” and guidelines can include “shalls.”
  2. The best defense I can think of for it is a quasi-Bayesian argument that by the time H1 gets to this hypothesis test, it has survived the qualitative "peak identification" and "spectral comparison" tests. Given this prior knowledge, it should require unusually surprising evidence from the peak intensity ratios to reject H1 in favor of the defense claim H0.
References
  1. OSAC Registry of Approved Standards and OSAC Registry of Approved Guidelines, http://www.nist.gov/forensics/osac/osac-registries.cfm, last visited Feb. 2, 2016
  2. NIST, Organization of Scientific Area Committees, http://www.nist.gov/forensics/osac/index.cfm, last visited Feb. 2, 2016
  3. National Research Council Committee on Forensic DNA Science: An Update, The Evaluation of Forensic DNA Evidence (1996)
  4. Colin Aitken & Franco Taroni, Statistics and the Evaluation of Evidence for Forensic Science (2d ed. 2004)
  5. James M. Curran et al., Forensic Interpretation of Glass Evidence (2000)
  6. ENFSI Guideline for Evaluative Reporting in Forensic Science (2015)
  7. David H. Kaye et al., The New Wigmore: Expert Evidence (2d ed. 2011)
  8. Royal Statistical Soc'y Working Group on Statistics and the Law, Fundamentals of Probability and Statistical Evidence in Criminal Proceedings: Guidance for Judges, Lawyers, Forensic Scientists and Expert Witnesses (2010)
Disclosure and disclaimer: Although I am a member of the Legal Resource Committee of OSAC, the views expressed here are mine alone. They are not those of any organization. They are not necessarily shared by anyone inside (or outside) of NIST, OSAC, any SAC, any OSAC Task Force, or any OSAC Resource Committee.

Friday, January 29, 2016

The First OSAC-approved Standard for Forensic Science

"All results for every forensic science method should indicate the uncertainty in the measurements that are made, and studies must be conducted that enable the estimation of those values." Source: National Research Council Committee on Identifying the Needs of the Forensic Science Community, Strengthening Forensic Science in the United States: A Path Forward 184 (2009)

"It is expected that in the absence of unforeseen error, an appropriate analytical scheme effectively results in no uncertainty in reported identifications." Source: Standard Practice for Identification of Seized Drugs (ASTM E2329-14, § 4.2), added to the National Institute of Standards and Technology OSAC Registry of Approved Standards on Jan. 27, 2016

The response to comments from within OSAC on this and related text in the ASTM standard stated: "Editorial. The Seized Drug subcommittee intends to clarify the quoted language pertaining to uncertainty and error during the next ASTM revision of this document." E2329-14 Seized Drugs Response to LRC Comments FINAL.pdf (277K) SAC Chemistry/Instrument Analysis, Jan. 11, 2016.

Saturday, January 23, 2016

Beyond the Higgs Boson: The “Prosecutor’s Fallacy” Does Not Explain Why Experimenters Are Cautious About Announcing New Discoveries

A posting of July 6, 2012, "The Probability that the Higgs Boson Has Been Discovered," mentioned the transposition of a p-value in stories in the popular press about the discovery of the Higgs Boson. Reporting on the excitement last month over a highly tentative announcement of a far more massive relative of the Higgs particle, the New York Times did not learn from its earlier error (or else physicist Kyle Cranmer misspoke). The Times stated that
When all the statistical effects are taken into consideration, Dr. Cranmer said, the bump in the Atlas data had about a 1-in-93 chance of being a fluke — far stronger than the 1-in-3.5-million odds of mere chance, known as five-sigma, considered the gold standard for a discovery. That might not be enough to bother presenting in a talk except for the fact that the competing CERN team, named C.M.S., found a bump in the same place.
One perceptive science writer, Faye Flam, spotted the transposition error in the Times article and promptly called attention to it in her science column for Bloomberg View. However, she then argued that 1/93 is too high a significance level to use in particle physics because of “a problem of statistics called the prosecutor’s fallacy” (referring to UCLA physicist Robert Cousins as the source of this argument).

That argument is wrong, or at least incomplete. The “prosecutor’s fallacy” is just a special case of the ubiquitous practice of transposing the terms in a conditional probability and thinking that the value stays the same. The usual name for this in statistics is the transposition fallacy. It is a fallacy because P(A | B) ≠ P(B | A) unless P(A) = P(B). In other words, P(A | B) actually could be higher — or lower — than P(B | A).
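
A two-line Bayes calculation makes the point concrete. The numbers are hypothetical: B is the observed bump, A is "no new particle (a fluke)," P(B | A) = 1/93, and a bump is taken as certain if a particle exists; only the prior changes between the two cases.

```python
# Transposition in numbers (hypothetical priors). With P(B|A) fixed at
# 1/93, the transposed quantity P(A|B) can land far above it or right
# next to it, depending entirely on the prior P(A).
p_b_given_a = 1 / 93       # chance of a bump this big from a statistical fluke
p_b_given_not_a = 1.0      # bump assumed certain if the particle is real

def p_a_given_b(prior_a):
    num = p_b_given_a * prior_a
    return num / (num + p_b_given_not_a * (1 - prior_a))

print(round(p_a_given_b(0.99), 2))   # skeptical prior: fluke about as likely as not
print(round(p_a_given_b(0.50), 3))   # even prior: posterior close to 1/93 itself
```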

This inequality does not reveal why particle physicists normally choose a much more demanding significance level than 1/93. Flam uses two examples with low prior probabilities P(A) to give the impression that P(A | B) must be greater than P(B | A), so that a higher value for P(B | A) is necessary to achieve some desired value for P(A | B). The argument works only in situations in which the prior probability of a new particle is small. That is a fair claim here, because the well-entrenched Standard Model does not predict the super-massive particle. However, it is not clear why that always would be the case. Thus, the question of why particle physicists used the stringent 5-sigma rule for the discovery of the Higgs Boson remains.

One standard argument for setting an especially demanding significance level is that the cost of a false positive is far greater than the cost of a false negative (which can be tolerated while one waits for more data). Flam and Cousins mention error costs in this way, but this consideration provides a very different motivation than does correcting the errors that might arise from computing a probability by naive transposition.

Another common argument is that even if the costs of errors are not so disparate, and a moderate significance level such as 1/93 is acceptable, multiple opportunities to find significance make it too easy to find a “significant” result. Such data mining goes on a bit in hunting for new particles. See More on Statistical Reasoning and the Higgs Boson, July 11, 2012. Again, however, this valid reason for being more demanding with regard to a significance level does not flow from the transposition fallacy, the prosecutor’s fallacy, or whatever else it might be called. It applies regardless of the significance level that the experiment is seeking. It is a correction designed to make the declared significance level applicable to the mined data. The corrected level is still subject to the transposition fallacy.

Sunday, January 17, 2016

“Statistical Lawyer’s Tricks” with DNA Mixtures in the Trial of Tommy Whack

Maryland seems to produce more than its share of significant judicial opinions on DNA evidence. Perhaps this is partly the result of having an expert Forensics Division in the Maryland Office of the Public Defender. In any event, several years ago, the state’s high court unanimously agreed that the head of the homicide unit for the Prince George's County State's Attorney's Office badly misrepresented the meaning of DNA evidence. The case, Whack v. State, 1/ is instructive in several respects:
  • the state should have undertaken further DNA testing;
  • the police department’s DNA analyst's assessment of the “statistical significance” of the DNA evidence was incomplete;
  • the defense committed the so-called “prosecutor’s fallacy” of transposing conditional probabilities; and
  • the prosecutor's closing argument included a statement about probabilities that (according to the state's own lawyers on appeal) was so “preposterous” it could not have misled the jury (because no juror could have believed it).
Shooting a Dude

Early in the morning of October 21, 2008, police received a 911 call about a shooting in Landover, Maryland. They found George Jerome White, Jr., lying on the ground next to a pick-up truck. Barely able to speak, White told police he had been robbed by an approximately six-foot-tall black male, with light or medium skin complexion, about 20 years old, with long hair or dreadlocks. White later died from two gunshot wounds to his torso.

Not long before the shooting, eight calls were exchanged between White’s cell phone and one registered to Bryant Whack. Bryant, who lived in Virginia, was in town for a funeral and staying with his cousin, Thomas (Tommy) Whack, Jr.

Bryant told the police that Tommy used the phone to call a "chat line" and that sometime after midnight, Tommy said that he planned to meet a woman. The two cousins set off together, but Bryant lost sight of Tommy. Then he heard gunshots. A few minutes later, Tommy showed up, said "it was a dude," and they went back home.

Touch DNA for the State

The police did not have much other evidence connecting Tommy to the murder. Swabs from various places on and in the truck established that White’s DNA was in the sample from the passenger seat headrest. As the Court of Appeals described it:
The chance of the major DNA profile on the headrest coming from an African American other than White was one in 212 trillion; in other words, White's DNA matched the DNA profile extracted from the passenger seat headrest.
Arguably, this represents the transposition fallacy and is a questionable use of the term “match.” A more precise statement would be that the victim’s DNA matched the major profile at every locus, and the chance of a single, randomly selected African American doing so would be only 1/(212 trillion). The probability that some African American other than White contributed the profile is probably quite small, but it could be larger (or even smaller) than the estimated frequency of unrelated African Americans with the major profile, which is what 1/(212 trillion) is.

But whatever the probability that the victim, White, was the source of the DNA on the headrest, the match between the major profile and White does not incriminate Tommy. In fact, the state’s DNA analyst concluded that none of "the DNA profiles of other, unknown people" in the sample "could have been [the defendant's]." The link to Tommy came from a different swab. This sample, from a passenger armrest,
contained a mixture of DNA from at least four people, with White's DNA being consistent with 14 of 15 tested locations in the DNA sample and [Tommy]'s DNA consistent with 11 of the 15 tested locations in the sample. In addition, the sample disclosed the DNA of at least two additional "unknown contributors."
With this extra evidence to buttress Bryant's thin story, Tommy was indicted on charges of first-degree murder, robbery, theft, and use of a handgun in the commission of a crime of violence. At trial, the state’s DNA analyst, Jessica Charak, testified to the meaning of this partial match:
     When it comes to mixtures, in saying that someone could potentially be included as a source of the mixture, we develop a statistic just as to how strong is that statement, what does it really mean. In this particular case what we do with mixtures is we have already made the statement that all of the DNA types of the victim [White] are accounted for at 14 of the 15 locations. That's a factual statement based on the results. It is also a factual statement that [at] 11 of the 15 locations all of the DNA types of [Petitioner] are accounted for.
     Now, the statistic that we do is on the mixture as an entire whole. So we ignore the fractions that we can say that those types are accounted for and we will calculate a statistic on everything, on every single DNA type that I was able to recover in the mixture. In this case what this probability says is what are the chances that another random person may also have a DNA profile that could also be included as a potential source of the mixture? In this case it was one in 172 individuals in the African American population would also have potentially have a DNA profile that I would have to say that they also could have contributed to that mixture.
This description of the implications of the data is hardly ideal. Why did Tommy match at only 11 loci? A juror might wonder why the fact that the alleles at four of Tommy’s 15 loci were not observed in the mixture did not exclude him as a potential contributor. Presumably, the analyst believed that it was possible that four alleles “dropped out” of the profile so that the 11-locus match was sufficient for an inclusion. And if four could have dropped out, why not five? Would a 10-locus match also have led to an inclusion? A 9-locus match? What was the analyst’s criterion for the minimum number of matching loci for “another random person [to] have a DNA profile that could also be included as a potential source of the mixture?” (In theory, it could be zero. However, that low a threshold would make every person in the world a potential source. With a random match probability of 1, the DNA evidence would have been irrelevant.)

A more complete and comprehensible analysis would compare (1) the probability of the data (including the heights rather than just the locations of all the peaks) on the assumption that White, Tommy, and two unknown individuals contributed to the sample and (2) the probability of the data on the assumption that White and three unknown individuals contributed. (Still more hypotheses could be considered if the number of contributors was uncertain.) Such “likelihood ratios” have been admitted in many cases, but they were not used here.

Transposition and Other Statistical Errors of the Defense

Let’s put these worries about the analysis and presentation of the partial match to the side — apparently, they were not raised at trial or on appeal. Let’s assume that 1/172 is the probability of an inclusion in the mixture for a randomly selected African American using whatever (perhaps arbitrary) criteria the analyst here employed. Even taking this figure at face value, both defense counsel and the state’s expert clearly committed the transposition fallacy (indicating that the common synonym of "prosecutor’s fallacy" is not always apt). On cross-examination, the defense lawyer asked:
Q: So you identified possibly that [Tommy] contributed to that sample, right?
A: Right.
Q: But the possibilities that some other African American contributed to that sample would be one in 172?
A: Right.
The correct answer would have been “No, I cannot tell you the probability that some other African American contributed to the sample. I am only saying that if some other, unrelated African American contributed and the defendant did not, then the probability that the defendant would have been included is 1/172.”

In closing argument, defense counsel made another statistically fallacious claim — that “[e]very 173rd person would have their DNA alleles in that car. So out of 1700 people, you have ten people.” Of course, it would be very odd if exactly every 172d person would be identified as consistent with the mixture — just as it would be very odd if a fair coin always came up heads every other toss. But this error is minor. It can be corrected by adding the words “on average” and “approximately” to the sentences.
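
The correction is simple binomial arithmetic. Under the (illustrative) assumption that each of 1,700 unrelated African Americans is independently included with probability 1/172, the expected count is about ten, but the realized count varies:

```python
# Expected number of coincidental inclusions among N unrelated people,
# each included with probability p = 1/172 (independence assumed for
# illustration). The count follows a binomial distribution.
from math import sqrt

N, p = 1700, 1 / 172
expected = N * p
sd = sqrt(N * p * (1 - p))
print(round(expected, 1), round(sd, 1))  # about 9.9 on average, give or take ~3
```

So "on average, approximately ten of 1,700" is defensible; "every 172d person" is not.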

“A Statistical Lawyer’s Trick” and the Prosecution’s “Preposterous” Reply

The big problem resided in the prosecutor’s rebuttal argument. For some reason, the state neglected to determine whether Bryant Whack, like his cousin Tommy, had a DNA profile that was consistent with the alleles observed in the mixture. Naturally, the defense tried to exploit this glaring gap in the state's case. Counsel pointed out that if 1 in 172 unrelated African Americans would be included, then the chance of including a cousin would be even higher. Bryant might have been the murderer! This DNA evidence, the defense emphasized, was nothing like the match of the victim to the headrest sample, which had an infinitesimal random match probability of 1/(212 trillion). A figure of 1/172 or more leaves plenty of room for doubt.

The prosecution responded by calling the comparison “a statistical lawyer trick”:
Remember, [Tommy] is the fourth sample, the only place where we have four samples. So if you remember, as Ms. Charak testified, that we know George White, one hundred percent George White, and [defense counsel] agrees is in that headrest sample, when I add two more 50 million times less. When we add [Tommy], and we know it's [Tommy], when we add [Tommy] that is why the number is 172. It is statistics again. It is a statistical lawyer trick. [Defense counsel] wants you to say don't believe the statistics because the science says he is there, but this 172 is no less strong than that 212. [Tommy] is there. [Tommy] left that DNA.
The jury acquitted Tommy of most of the charges — first-degree premeditated murder, first-degree felony murder, robbery with a dangerous weapon, robbery, and use of a handgun in the commission of a crime of violence. But the jury convicted him of second-degree murder, and he appealed.

The first appellate court, Maryland's Court of Special Appeals, apparently did not see any problem with the prosecution's presentation. It rejected the defendant’s claim that the state’s closing argument — to which counsel objected — was unfair, without bothering to write an opinion for publication.

Again, Tommy appealed, this time to the court of last resort in Maryland. Before the Court of Appeals, the state adopted an intriguing position. It admitted that “the prosecutor's statement regarding the probability statistics was factually incorrect,” but it contended that it was so astonishingly incorrect that the mistake did not matter. The Court of Appeals summarized this argument:
The State argues, however, that the prosecutor's statement, that "this 172 is no less strong than that 212," was "preposterous" and no jury would ever draw the conclusion that the two statistics were the same. The State contends that this statement was meant to suggest only that the science underlying both statistics was equally strong. The State acknowledges that the prosecutor's rebuttal argument may have been "inartfully argued," but the State maintains that the prosecution did not mischaracterize the evidence. Moreover, the State notes that the expert's report was admitted into evidence, allowing the jury to examine it further if there was any confusion.
The Court of Appeals’ Analysis

The Court of Appeals discerned two fatal flaws in the closing argument. First, it held that “the prosecutor went too far in stating emphatically that Petitioner's DNA was present in the truck.” But should not a party be allowed to portray its theory of the events that unfolded as fact when it has presented evidence to support that theory? Emphatically asserting that Tommy leaned on the armrest and shot White is the state's theory of the case. To be sure, the evidence as a whole seems thin, but as the Court of Appeals acknowledged, "[t]he prosecutor is allowed liberal freedom of speech and may make any comment that is warranted by the evidence or inferences reasonably drawn therefrom."

Still, the statement that “the science says he is there” is far stronger than what "the science," as represented by Ms. Charak, said. Her testimony was that Tommy was among the "one in 172 individuals in the African American population [who] would also have potentially have a DNA profile that I would have to say that they also could have contributed to that mixture."

Even so, had the prosecutor asserted that the science proved that Tommy was there -- because it was unlikely (only a 1/172 chance) that a randomly selected African American would be included as a contributor -- there would have been no reversible error. The real problem is the second flaw that the Court of Appeals identified -- “overstating the statistical significance of the DNA evidence by equating the odds of one in 172 with one in 212 trillion.” The Court of Appeals conceded that “no jury was likely to believe that one in 172 was literally the same as one in 212 trillion,” but it still felt that “the prosecutor's statement could have seriously misled the jury” because “[t]he declaration that this one in 172 figure was ‘no less strong’ than the one in 212 trillion figure suggests that Petitioner's DNA ‘matched’ the DNA taken from the armrest to the same extent that White's DNA ‘matched’ a sample taken from the headrest.”

It may not be that unusual for prosecutors to push as hard as they can at trial, figuring that a conviction will stick on appeal (unless an appellate court perceives that the defendant might well have been innocent). In this case, however, the court explicitly announced a policy of DNA exceptionalism:
[C]ounsel have a responsibility to take extra care in describing DNA evidence, particularly when it comes to statistical probabilities. ... The prosecutor wrongly asserted that Petitioner's DNA was definitely on the armrest when the evidence demonstrated only that it might be present. The prosecutor also suggested that the statistical analysis backed up this assertion, urging jurors to draw an equivalency between the [near] mathematical certainty that White's DNA was in the truck with the probability that Petitioner's DNA was located there. These remarks were highly improper because the statements misrepresented complicated scientific evidence that was a key part of the prosecution's case.
Prosecutors in Maryland and elsewhere should heed this warning. They should not portray moderately small probabilities as overwhelmingly small. They should present their evidence for what it is worth.

The Rest of the Story

Although the court remanded for a new trial, that never came to pass. Instead, the prosecutor, Wes Adams, entered into a plea agreement. Tommy Whack was sentenced to 11 years in prison. 2/

Adams then entered the race for state's attorney in nearby Anne Arundel County. In a bitter election, he unseated the incumbent by a wide margin. 3/ When his performance in the Whack case became an issue, Adams insisted that he never misled the jury and that he decided against a new trial only because of changes in the way DNA evidence is reported. "Hogwash," said his opponent. 4/

As the current County State's Attorney, Adams is “employing the highest ethical standards in vigorously prosecuting the guilty, protecting the innocent, and representing the interests of the State of Maryland in court.” 5/

Notes
  1. 433 Md. 728, 73 A.3d 186 (2013).
  2. Tim Pratt, State's Attorney Race Pits Leitess Against Adams, Capital Gazette, Oct. 6, 2014.
  3. Kelcie Pegher, Adams Ousts Incumbent Leitess for State's Attorney, Capital Gazette, Nov. 5, 2014.
  4. Id. 
  5. Wes Adams, Anne Arundel County State's Attorney, http://www.statesattorney-annearundel.com

Friday, January 15, 2016

Alaska Court of Appeals Deems Polygraph Evidence Admissible (or Not?)

In unrelated cases, Thomas Henry Alexander and James Griffith were charged with sexual abuse of a minor. 1/ They each hired David Raskin, an emeritus professor of psychology at the University of Utah, to conduct a polygraph examination. Dr. Raskin determined that “there is a high likelihood” that Alexander and Griffith were truthful when they denied committing the crimes. He stated that his “confidence in these conclusions exceeds 90 percent.”

At a consolidated hearing on whether polygraph evidence meets the state's standard for the admissibility of scientific evidence, 2/ the court heard competing expert testimony:
"Dr. Raskin testified that if polygraph examinations are properly conducted using the “control question” technique, one would “conservatively” expect polygraph examinations to be 90 percent accurate (or more) in assessing truth-telling and lying. More specifically, Dr. Raskin pointed to studies which apparently demonstrated that the accuracy rate of polygraph examinations was between 89 and 98 percent."
Of course, there is more than one “accuracy rate” that affects the probative value of a polygraph finding. Both the sensitivity and specificity of the classifications must be considered. Presumably, Dr. Raskin testified that both the probability that a subject will be classified as deceptive when engaged in deception (the sensitivity) and the probability that a subject will be classified as truthful when not engaged in deception (the specificity) were in the 0.89 to 0.98 range.
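One rough way to gauge the probative value of a "passed" examination is the likelihood ratio: the probability of a truthful result given truthfulness, divided by the probability of a truthful result given deception, or specificity / (1 - sensitivity). The sketch below plugs in round figures loosely corresponding to the two experts' numbers, purely for illustration.

```python
def lr_truthful(sensitivity, specificity):
    """Likelihood ratio for a 'truthful' polygraph result.

    sensitivity: P(deceptive result | subject is deceptive)
    specificity: P(truthful result  | subject is truthful)
    """
    return specificity / (1 - sensitivity)

# With both rates at 0.90 (Raskin-style figures), a pass is nine
# times more probable for a truthful subject than a deceptive one.
print(round(lr_truthful(0.90, 0.90), 2))   # 9.0

# With both rates at 0.70 (near Iacono's average), the ratio is
# far weaker.
print(round(lr_truthful(0.70, 0.70), 2))   # 2.33
```

The gap between a likelihood ratio of 9 and one of about 2.3 is one way to see why the choice between the two experts' accuracy figures matters so much.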

The state countered with testimony from another psychologist, William Iacono, Distinguished McKnight University Professor at the University of Minnesota:
"Dr. Iacono testified that the better-conducted studies of polygraph examinations showed that these examinations had accuracy rates of between 51 percent (essentially, a coin flip) and 98 percent, with average results being about 70 percent accurate."
The two superior court judges decided to admit the evidence, but only if each defendant submitted to an additional polygraph examination by an examiner of the State's choosing and testified, subject to cross-examination, at trial. They
"concluded that even if Dr. Iacono's figures were closer to the truth, the accuracy rate for the “control question” form of polygraph examination was still in line with the accuracy rates of other commonly admitted forms of scientific evidence—evidence such as fingerprint analysis, handwriting analysis, and eyewitness testimony."
Treating eyewitness testimony as a form of scientific evidence is odd, and one would hope that the specificity (true negative) rate of fingerprint and handwriting analysis is closer to Dr. Raskin’s understanding of polygraph specificity than to Dr. Iacono’s. And, when Griffith took the state-administered polygraph examination, he “apparently failed the exam” and then pleaded guilty.

The court of appeals upheld the admission of Dr. Raskin’s testimony. But it complained that under the “quite deferential” abuse-of-discretion standard it was required to use,
"the two judges in this case might easily have reached differing conclusions regarding the scientific validity of polygraph examinations, even though they heard exactly the same evidence. And if the two judges had reached different conclusions, we apparently would have been required to affirm both of the conflicting decisions ... ."
Because this result would be “illogical and unfair,” the opinion urged the Alaska Supreme Court to revisit the issue of the standard of review for Daubert rulings—and to declare that on appeal, a method’s scientific validity must be determined de novo rather than only for an abuse of discretion.

Although the de novo rule is appealing and has a substantial following in the state courts, the problem of conflicting rulings on polygraph evidence can be handled another way. Even though there is at least a modicum of probative value to some polygraph testing,
"opening up the matter to the discretion of the trial courts—without providing more detailed standards than the usual balancing prescription—could lead to untoward results. ... Whether polygraph testimony should be admitted is doubtful, but if it is to be received, clear standards should be developed as to whether such testimony is admissible solely for impeachment purposes, how important the testimony must be in the context of the other evidence in the case for admissibility to be warranted, what level of training and competence examiners should have, what precautions should be taken against deceptive practices on the part of examinees, and what procedures would be best to give an independent or opposing expert a meaningful opportunity to view or review the examination and analysis. When all is said and done, the game simply does not seem worth the candle. A categorical rule of exclusion for polygraph results is a logical and defensible corollary to the general principles of relevancy." 3/
Notes
  1. State v. Alexander, Nos. A–11423, A–11433, 2015 WL 9257270 (Alas. Ct. App. Dec. 18, 2015).
  2. In State v. Coon, 974 P.2d 386, 395–98 (Alaska 1999), Alaska adopted the scientific-soundness standard articulated in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993).
  3. McCormick on Evidence § 206(A) (7th ed. 2013). See also State v. Porter, 698 A.2d 739, 768-69 (Conn. 1997) ("admission of the polygraph test would be highly detrimental to the operation of Connecticut courts, both procedurally and substantively. ... [A]ny limited evidentiary value that polygraph evidence does have is substantially outweighed by its prejudicial effects. We therefore reaffirm our per se rule against the use of polygraph evidence in Connecticut courts."); Rathe Salvage, Inc. v. R. Brown & Sons, Inc., 46 A.3d 891, 901 (Vt. 2012) (because the "limited, if not absence of, probative value is substantially outweighed by risks of confusion, delay, and time wasted on collateral issues related to variables in administration of the polygraph[, t]here was no error in the trial court's per se exclusion of polygraph evidence under Rule 403[, and] the trial court was not required to conduct a Daubert hearing to assess its reliability under Rule 702."). Being founded in part on the reasonable judgment that the evidence uniformly is low in probative value, the "per se" rule constitutionally can be applied to exclude evidence that a defendant in a criminal case passed a polygraph test. United States v. Scheffer, 523 U.S. 303 (1998); Porter, 698 A.2d at 777-79; People v. Richardson, 183 P.3d 1146, 1194-95 (Cal. 2008).

Friday, January 8, 2016

Massachusetts Supreme Court Demands a Witness from the Same DNA Laboratory -- But Not Because of the Confrontation Clause

The Sixth Amendment to the Constitution specifies that "in all criminal prosecutions, the accused shall enjoy the right … to be confronted with the witnesses against him." Does this mean that a prosecutor who wants to introduce evidence of a match between the defendant's DNA profile and a potentially incriminating sample (from a crime scene, for example) must produce a witness from the laboratory that ascertained the profile? After the factious opinions in Williams v. Illinois, 132 S.Ct. 2221 (2012), the law on whether and when a defendant has a Sixth Amendment right to confront the personnel at such a laboratory is a mess. 1/

In Williams, the U.S. Supreme Court upheld the state's reliance on the results of certain DNA testing even though the state failed to produce a single witness from the laboratory that did the crucial testing. But every theory put forth for skipping over a witness with some direct knowledge of the laboratory's work in the case was unacceptable to a majority of the Justices. This paradoxical result occurred because Justice Thomas -- using a theory that every other Justice repudiated -- reached the same result as did a plurality of four Justices, who relied on two different (and rather inventive) theories. Given these wildly disparate opinions, lower court decisions on who must testify about a laboratory report remain in disarray. 2/

Westlaw, the huge searchable legal database owned by the Thomson Reuters Corporation, reports that the Massachusetts Supreme Judicial Court recently held in Commonwealth v. Tassone, 3/ that the
trial court's admission of testimony from expert witness that deoxyribonucleic acid (DNA) profile generated from a known saliva sample of defendant matched a DNA profile obtained from a swab taken from eyeglasses that were left at scene of robbery violated defendant's Confrontation Clause rights. 4/
That's an amazing conclusion to draw from the opinion. What the Massachusetts court actually held is that the state's common law of evidence requires the prosecution to produce an expert from the laboratory that performed the DNA test to testify to a match to the defendant. The opinion could hardly be clearer. It states (with emphasis added):
The more challenging question, given the “significant confusion” that has been left in the wake of the Williams decision, 132 S.Ct. at 2277 (Kagan, J., dissenting), is whether the United States Supreme Court would conclude that evidence of this type, admitted under these circumstances, would violate the confrontation clause. Fortunately, we need not resolve that question because, regardless of the answer, we conclude that Roy's opinion was not admissible under our common law of evidence.
and
Because the defendant here had no meaningful opportunity for cross-examination, the admission of Roy's opinion violated the right to confrontation provided by our common law of evidence. ... The prosecution may not admit powerful evidence of a DNA match against a defendant and deny the defendant a meaningful opportunity to challenge the reliability of the facts or data on which the opinion rests by failing to call an expert witness affiliated with the laboratory that tested the sample connected to the crime scene.
and again
Regardless of whether the Supreme Court ultimately interprets the confrontation clause to permit the admission of such an opinion under circumstances that effectively deny the defendant any meaningful opportunity for cross-examination, its admission in our courts is barred by the right of confrontation in our common law of evidence. In other words, if the Commonwealth sends the crime scene DNA to Cellmark for analysis, and seeks to offer in evidence an opinion that the crime scene DNA matches the DNA of the defendant, it will need, at a minimum, to call an expert witness from Cellmark. (Note omitted)
In considering the impact of Tassone on other jurisdictions, one should keep in mind that the court relies partly on an unusual feature of Massachusetts law:
[A]lthough the Sixth Amendment's confrontation clause and Fed.R.Evid. 703 permit the prosecution to elicit on direct examination from an expert witness the underlying facts or data on which the expert's opinion is based for the limited purpose of explaining the basis for the expert's opinion, our common-law rules of evidence do not.
Federal Rule of Evidence 703 departed from the common law by allowing an expert to offer an opinion based on facts or data not admitted into evidence -- and that might be inadmissible -- as long as experts in the field reasonably rely on that type of information. The trial judge in Williams allowed a DNA analyst at the state police laboratory to refer to the testing at another laboratory (Cellmark) and to present her opinion of the probability that a randomly selected individual's DNA would have the incriminating profile that Cellmark reported to her.

The Illinois Supreme Court upheld the admission of this testimony despite the absence of any witness from Cellmark, on the theory that Illinois Rule of Evidence 703 allowed the police analyst to rely on data that, the analyst indicated, was likely to be accurate. Because the state never admitted Cellmark's report into evidence, the Illinois courts concluded that there was no requirement of confrontation. A plurality of four U.S. Supreme Court Justices endorsed this loophole. This Rule 703 reasoning correctly supports the prosecution's dispensing with testimony from the laboratory that did the testing -- but only in rare cases (and probably not in Williams itself). 5/

In any event, Tassone neither applies nor departs from Williams. It avoids resolving the Confrontation Clause issue -- an entirely understandable approach considering (1) the usual judicial preference for not answering constitutional questions except when doing so is necessary, and (2) the post-Williams chaos.

Notes
  1. David H. Kaye & Jennifer L. Mnookin, Confronting Science: Expert Evidence and the Confrontation Clause, 2012 Sup. Ct. Rev. 99 (2012).
  2. David H. Kaye, David Bernstein & Jennifer L. Mnookin, The New Wigmore on Evidence: Expert Evidence (Cum. Supp. 2016).
  3. 468 Mass. 391, 11 N.E.3d 67 (2014).
  4. That is the Westlaw summary of the holding in the case. The first West headnote on the case restates it:
    Admission of expert witness testimony that deoxyribonucleic acid (DNA) profile generated from a known saliva sample of defendant matched a DNA profile obtained from a swab taken from eyeglasses that were left at scene of robbery, where expert had no affiliation with laboratory that conducted DNA testing of eyeglasses swab, violated defendant's Confrontation Clause rights; defendant did not have a meaningful opportunity to cross-examine expert regarding the laboratory work, procedures, or protocols performed by testing laboratory or as to reliability of testing laboratory's data on which her opinion of a match with defendant's DNA profile rested. U.S.C.A. Const. Amend. 6.
  5. See supra notes 1 & 2.

Wednesday, December 30, 2015

Higher math in a Kansas case

The diagram of a car crash is drawn at a scale of 1 inch to 20 feet. The distance between two points on the diagram is 3 and 3/16 inches. How far apart are the two locations shown in the diagram?

You would think that an expert in the field of "accident reconstruction" could answer this question correctly with a pencil and paper or a calculator (if not in his head). But today's online New York Times hosts a re-enactment of the deposition testimony of an expert accident reconstructionist who refused to try without his "formula sheets" and computer.

Here is a small part of the transcript:
A. Three and three-sixteenths inches.
Q. And that is, when you convert that from the scale, what does that convert to?
A. Sixty-eight feet, approximately, sir.
Q. What are the numbers?
A. Three and three-sixteenths.
Q. OK, well here, run it out for me (handing the witness a pocket calculator).
A. Run it out?
Q. Yeah, calculate it for me.
A. (Working on calculator) And again, I'd do this on the computer.
Q. You can't do it, can you?
A. Not without my formulas in front of me, no sir. I can't do it from my head.
Q. You're not able to do a simple scaling problem with a calculator?
A. I don't wish to. I don't wish to make any mistakes. I use instrumentation that does it exact [sic].
Q. You can't show us, based on the numbers you just gave me, that will spit out the 68-foot distance, can you?
A. Not here today I can't, no.
This colloquy suggests an extra-credit problem: Multiply 3 and 3/16 by 20. Do you obtain 68?
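For anyone without formula sheets handy, the scaling problem takes two lines of Python with exact fractions:

```python
from fractions import Fraction

scale_ft_per_in = 20                 # 1 inch on the diagram = 20 feet
measured_in = 3 + Fraction(3, 16)    # 3 and 3/16 inches

distance_ft = measured_in * scale_ft_per_in
print(distance_ft, "=", float(distance_ft), "feet")   # 255/4 = 63.75 feet
```

The answer is 63.75 feet -- not the "approximately" 68 feet the expert testified to.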

Film-maker and comic writer Brett Weiner dramatized this and more of the transcript without changing a word to achieve this surreal video, Verbatim: Expert Witness. Last year, a similar film, Verbatim: What Is a Photocopier?, won the audience award for best short film at the 2014 Dallas Film Festival. There, an IT guy in Ohio struggles with yet another deeply technical issue -- the meaning of the term "photocopier."