Thursday, November 26, 2015

Cell Phones, Brain Cancer, and Scientific Outliers Are Not the Best Reasons to Abandon Frye v. United States

Two days ago, the District of Columbia Court of Appeals (the District’s highest court) heard oral argument 1/ on whether to discard the very test that its predecessor introduced into the law of evidence in the celebrated — and castigated — case of Frye v. United States. 2/ That was 1923, and the evidence in question was a psychologist’s opinion that a systolic blood pressure test showed that James Alphonso Frye was telling the truth when he recanted his confession to a notorious murder in the District. With nary a citation to any previous case, the Court of Appeals famously wrote that
[W]hile courts will go a long way in admitting expert testimony deduced from a well-recognized scientific principle or discovery, the thing from which the deduction is made must be sufficiently established to have gained general acceptance in the particular field in which it belongs. 3/
Now it is 2015, the case is Murray v. Motorola, Inc., 4/ and the proffered evidence is expert testimony that cell phones cause (or raise the risk of) brain cancer. The methods used to form or support this opinion or related ones range from what the court calls “WOE” (the expert says, I thoroughly assessed the “weight of evidence”), to “PDM” (I considered the evidence of causation pragmatically, with the “Pragmatic Dialog Method”), to “a literature review” (I read everything I could find on the subject), to “laboratory experiments” (I conducted in vitro exposure of cells, with results that may not have been replicated), and to “experience as a toxicologist and pharmacologist” to show that “it is generally accepted to extrapolate findings from in vitro studies in human and mammalian cells to predict health effects in humans.”

The trial judge, Frederick H. Weisberg, ruled much of this testimony admissible on the theory that regardless of the extent to which the conclusions are within the mainstream of scientific thinking, the “methods” behind them were generally accepted in ascertaining carcinogenicity. He chastised the defense for “repeatedly challeng[ing] plaintiffs' experts on the ground that their conclusions and opinions are not generally accepted.” As he construed Frye, “[e]ven if 99 out of 100 scientists come out on one side of the causation inference, and only one comes out on the other, as long as the one used a ‘generally accepted methodology,’ Frye allows the lone expert to testify for one party and one of the other ninety-nine to testify for the opposing party.” Having placed himself in this box, Judge Weisberg asked the Court of Appeals to let him out, writing that “most, if not all, of Plaintiffs' experts would probably be excluded under the Rule 702/Daubert standard based on the present record” and granting the defendants' request to allow them to appeal his ruling immediately.

Defendants then convinced the Court of Appeals to jump in. Normally, the appellate court would review only the final judgment entered after a trial. In Murray, it granted an interlocutory appeal on the evidentiary ruling. Not only that, but it agreed to sit en banc, with all nine judges participating, rather than acting through a normal panel of three randomly selected judges.

The question before the en banc court is thus framed as whether to replace the jurisdiction’s venerable Frye standard with the approach sketched in Daubert v. Merrell Dow Pharmaceuticals. 5/ Daubert changes the focus of the judicial inquiry from whether a theory or technique is generally accepted to whether it is scientifically valid (see Box: What Daubert Did).

But does Frye really require Judge Weisberg to accept evidence that Daubert excludes in this case? The case, I shall argue, is not about Daubert versus Frye. It is about methodology versus conclusion. The judge's construction of Frye as sharply confined to “methodology” is what makes it impossible for him to reject as inadmissible the theory that cell phones cause brain cancers even if it is plainly not accepted among knowledgeable scientists. And that is just as much a problem under Daubert as it is under Frye. Daubert specifically states that the subject of the inquiry “is the scientific validity ... of the principles that underlie a proposed submission. The focus, of course, must be solely on principles and methodology, not on the conclusions that they generate.” 6/ Judge Weisberg decided that principles or putative methodologies like WOE, PDM, literature review, extrapolation from in vitro experiments, and experience are all generally accepted among scientists as a basis for inferring carcinogenicity. But if this is correct, and if it insulates claims of general causation from scrutiny for general acceptance under Frye, then it does the same under Daubert (as originally formulated). 7/ Surely, weighing all the relevant data, being pragmatic, studying the literature, considering experiments, and using experience is what scientists everywhere do. They do it not out of habit, but because these things tend to lead to more correct conclusions (and less criticism from colleagues) than the alternatives of not weighing all the data, being doctrinaire, ignoring the literature, and so on.

In Daubert, the U.S. Supreme Court did not rule that Frye was antiquated or not up to the job of screening out dangerous and dubious scientific evidence. Rather, the Court reasoned that Congress, in adopting the Federal Rules of Evidence in 1975, had implicitly dropped a strict requirement of general acceptance. The Court then read Federal Rule 702 as requiring scientific evidence to be, well, “scientific,” as determined by district courts that could look to various hallmarks of scientifically warranted theories. One important criterion, the Court observed, was general acceptance. But such acceptance was no longer dispositive. It was only an indicator of the scientific validity that courts had to find in order to admit suitably challenged scientific evidence.

A majority of U.S. jurisdictions (41 according to the trial court order in Murray), either by legislation or judicial decision, follow the Daubert approach for filtering out unvalidated or invalid scientific evidence (although they still place great weight on the presence or absence of general acceptance in the relevant scientific community). At least one state, Massachusetts, still clings to Frye while embracing Daubert.

The problem with the toxic tort cases like Murray is that the line between “method” and “conclusion” is difficult to draw, and Judge Weisberg draws it in the wrong place. Although his opinion cites to (the first edition of) Wigmore on Evidence: Expert Evidence, it ignores the warning (in § 6.3.3(a)(1) of the second edition and § 5.2.3 of the first edition) that
Occasionally, however, courts define the theory or method at so high a level of abstraction that all kinds of generally applicable findings can be admitted without attending to whether the scientific community accepts them as well founded. For example, in Ibn-Tamas v. United States, [407 A.2d 626 (D.C. 1979),] the District of Columbia Court of Appeals reasoned that a psychologist's theory of the existence and development of various characteristics of battered women need not be generally accepted because an overarching, generally accepted methodology — clinical experience — was used to study the phenomenon. The problem, of course, is that such reasoning could be used to obviate heightened scrutiny for virtually any scientific development [citing, among other cases, Commonwealth v. Cifizzari, 492 N.E.2d 357, 364 (Mass. 1986) (“to admit bite mark evidence, including an expert opinion that no two people have the same bite mark, a foundation need not be laid that such identification technique has gained acceptance in the scientific community. What must be established is the reliability of the procedures involved, such as X-rays, models, and photographs.”)]. Indeed, in developing the lie-detection procedure used in Frye, Marston applied generally accepted techniques of experimental psychology to test his theory and equipment. Thus, an exclusively “high-level” interpretation of Frye is untenable. 8/
The opinion in Murray also overlooks the more extended analysis in Wigmore of why causation opinions in toxic tort cases should be considered theory rather than conclusions within the meaning of Frye. 9/ It would make no sense to ask whether psychologists generally accept the proposition that Marston correctly measured the defendant's blood pressure or correctly applied some formula or threshold that indicated deception. Such case-specific facts do not appear before any general scientific community for scrutiny. On the other hand, whether elevated blood pressure is associated with deception, how it can be measured, and whether there is a formula or threshold that justifies concluding that the defendant is deceptive or truthful are trans-case propositions that should be part of normal scientific discourse.

The same is true of claims of carcinogenicity. Whether cell phones can cause brain cancer at various levels of exposure is a trans-case question that stimulates scientific dialog. The Frye test can function just as well (or as poorly) in vetting expert opinions that exposure can cause cancer as in screening a psychologist's opinion that deception can cause a detectable spike in blood pressure. In sum, denominating trans-case conclusions that have been or could be the subject of scientific investigation and controversy as "conclusions" that are beyond the reach of either Frye or Daubert is a category mistake.

There is another way to make this point. Given all the usual reasons to subject scientific evidence to stricter-than-normal scrutiny, courts in Frye jurisdictions need to consider whether it is generally accepted that the body of scientifically validated findings on which the expert relies is sufficient to justify, as scientifically reasonable, the trans-case conclusion. Thus, the Ninth Circuit Court of Appeals in Daubert originally reasoned — on the basis of Frye — that in the absence of some published, peer-reviewed epidemiological study showing a statistically significant association, the causal theories (whether they are labelled general premises or specific conclusions) of plaintiffs’ expert were inadmissible. The court determined that the body of research, namely, “the available animal and chemical studies, together with plaintiffs' expert reanalysis of epidemiological studies, provide insufficient foundation to allow admission of expert testimony to the effect that Bendectin caused plaintiffs' injuries.” 10/ It was appropriate — indeed, necessary — to consider all the “available ... studies,” but under Frye, there still had to be general acceptance of the proposition that drawing an inference of causation from such studies was scientifically valid. Gussying up the inferential process as a WOE analysis (or anything else) cannot alter this requirement.

Whether or not the Court of Appeals switches to Daubert, it should correct the trial court's blanket refusal to consider whether the theory that cellphones ever cause brain cancer at relevant exposure levels is generally accepted. General acceptance may not be determinative under Daubert, but it remains important. Whether the inquiry into this factor is compelled and conclusive under Frye or inevitable and influential under Daubert, it should not be skewed by a misconception of the scope of that inquiry. In the end, the courts in Murray should realize that
the choice between the general-acceptance and the relevancy-plus standards may be less important than the copious quantities of ink that courts and commentators have spilled over the issue would indicate. [O]ne approach is not inherently more lenient than the other—the outcomes depend more on how rigorously the standards are applied than on how the form of strict scrutiny is phrased. 11/
  1. Ann E. Marimow, D.C. Court Considers How To Screen Out ‘Bad Science’ in Local Trials, Wash. Post, Nov. 24, 2015.
  2. 293 F. 1013 (D.C. Cir. 1923).
  3. Id. at 1014.
  4. No. 2001 CA 008479 B (D.C. Super. Ct.), available at
  5. 509 U.S. 579 (1993).
  6. Id. at 594–95 (emphasis added).
  7. In General Electric Co. v. Joiner, 522 U.S. 136 (1997), the Supreme Court blurred the distinction between methodology and conclusion, and Congress later amended Rule 702 to incorporate this shift. The result is that in federal courts, it is less important to draw a better line than the one in Murray and Ibn-Tamas. See David H. Kaye, David A. Bernstein, and Jennifer L. Mnookin, The New Wigmore: A Treatise on Evidence: Expert Evidence § 9.2.2 (2d ed. 2011).
  8. Id. § 6.3.3(a)(1).
  9. Id. § 9.2.3(b).
  10. Daubert v. Merrell Dow Pharms., Inc., 951 F.2d 1128, 1131 (9th Cir. 1991).
  11. Kaye et al., supra note 7, § 7.2.4(a).

Tuesday, November 24, 2015

Public Comment Period for Seven National Commission on Forensic Science Work Products To Close on 12/22

Public Service Announcement
The comment period for seven National Commission on Forensic Science work products will close on 12/22/15.  The documents can be found, and comments can be left, at this location on The documents that are the most interesting (to me, at least) are as follows:
  • Directive Recommendation on the National Code of Professional Responsibility DOJ-LA-2015-0009-0002 (calls on the Attorney General to require forensic science service providers within the Department of Justice to adopt and enforce an enumerated 16-point “National Code of Professional Responsibility for Forensic Science and Forensic Medicine Service Providers”; to have someone define “steps ... to address violations”; and to “strongly urge” other groups to adopt the code)
  • Views Document on Establishing the Foundational Literature Within the Forensic Science Disciplines DOJ-LA-2015-0009-000 (asks for unspecified people or organizations to prepare “documentation” or “compilation” of “the literature that supports the underlying scientific foundation for each forensic discipline” “under stringent review criteria” and, apparently, for courts to rely on these compilations in responding to objections to admitting forensic science evidence)
  • Views Document on Using the Term Reasonable Degree of Scientific Certainty DOJ-LA-2015-0009-0008 (“legal professionals should not require that forensic discipline testimony be admitted conditioned upon the expert witness testifying that a conclusion is held to a ‘reasonable scientific certainty,’ a ‘reasonable degree of scientific certainty,’ or a ‘reasonable degree of [discipline] certainty’ [because] [s]uch terms have no scientific meaning and may mislead factfinders ... . Forensic science service providers should not endorse or promote the use of this terminology.”)
  • Views Document on Proficiency Testing in Forensic Science DOJ-LA-2015-0009-0007 (“As a recognized quality control tool, it is the view of the Commission that proficiency testing should ... be implemented [not only by accredited forensic science service providers, but also] by nonaccredited FSSPs in disciplines where proficiency tests are available from external organizations”)
I won’t discuss the substance of these documents here, but I can't help noting that the Commission lacks a professional copy editor. The dangling modifier in the sentence on proficiency testing is a sign of the absence of this quality control tool for writing.

Saturday, November 21, 2015

Latent Fingerprint Identification in Flux?

Two recent articles suggest that seeds of change are taking root in the field of latent fingerprint identification.

I. The Emerging Paradigm Shift in the Epistemology of Fingerprint Conclusions

In The Emerging Paradigm Shift in the Epistemology of Fingerprint Conclusions, the chief of the latent print branch of the U.S. Army Criminal Investigation Laboratory, Henry J. Swofford, writes of “a shift away from categoric conclusions having statements of absolute certainty, zero error rate, and the exclusion of all individuals to a more modest and defensible framework integrating empirical data for the evaluation and articulation of fingerprint evidence.” Mr. Swofford credits Christophe Champod and Ian Evett with initiating “a fingerprint revolution” by means of a 2001 “commentary, which at the time many considered a radical approach for evaluating, interpreting, and articulating fingerprint examination conclusions.” He describes the intense resistance this paper received in the latent print community and adds a mea culpa:
Throughout the years following the proposition of this new paradigm by Champod and Evett, the fingerprint community continued to respond with typical rhetoric citing the historical significance and longstanding acceptance by court systems, contending that the legal system is a validating authority on the science, as the basis to its reliability. Even the author of this commentary, after undergoing the traditional and widely accepted training at the time as a fingerprint practitioner, defensively responded to critiques of the discipline without fully considering, understanding, or appreciating the constructive benefits of such suggestions [citing Swofford (2012)]. Touting 100% certainty and zero error rates throughout this time, the fingerprint community largely attributed the cause of errors to be the incompetence of the individual analyst and failure to properly execute the examination methodology. Such attitudes not only stifled potential progress by limiting the ability to recognize inherent weaknesses in the system, they also held analysts to impossible standards and created a culture of blame amongst the practitioners and a false sense of perfection for the method itself.
The article by Champod and Evett is a penetrating and cogent critique of what its authors called the culture of “positivity.” They were responding to the fingerprint community’s understanding, as exemplified in guidelines from the FBI’s Technical Working Group on Friction Ridge Analysis, Study and Technology (TWGFAST), that
"Friction ridge identifications are absolute conclusions. Probable, possible, or likely identification are outside the acceptable limits of the science of friction ridge identification" (Simons 1997, p. 432).
Their thesis was that a “science of friction ridge identification” could not generate “absolute conclusions.” Being “essentially inductive,” the reasoning process was necessarily “probabilistic.” In comparing latent prints and exemplars in “an open population ... probabilistic statements are unavoidable.” (I would go further and say that even in a closed population — one in which exemplars from all the possible perpetrators have been collected — any inferences to identity are inherently probabilistic, but one source of uncertainty has been eliminated.) Although the article referred to “personal probabilities,” their analysis was not explicitly Bayesian. Although they wrote about “numerical measures of evidential weight,” they only mentioned the probability of a random match. They indicated that if “the probability that there is another person who would match the mark at issue” could be calculated, it “should be put before the court for the jury to deliberate.”

Mr. Swofford’s recent article embraces the message of probabilism. Comparing the movement toward statistically informed probabilistic reasoning in forensic science to the development of evidence-based medicine, the article calls for “more scientifically defensible ways to evaluate and articulate fingerprint evidence [and] quantifiable, standardized criterion to support subjective, experience-based opinions, thus providing a more transparent, demonstrable, and scientifically acceptable framework to express fingerprint evidence.”

Nonetheless, the article does not clearly address how the weight or strength of the evidence should be expressed, and a new DFSC policy on which he signed off is not fully consistent with the approach that Champod, Evett, and others have developed and promoted. That approach, Part II of this posting will indicate, uses the likelihood ratio or Bayes factor to express the strength of evidence. In their 2001 clarion call to the latent fingerprint community, however, Champod and Evett did not actually present the framework for “evidential weight” that they have championed both before and afterward (e.g., Evett 2015). The word “likelihood” appears but once in the article (in a quotation from a court that uses it to mean the posterior probability that a defendant is the source of a mark).

II. Fingerprint Identification: Advances Since the 2009 National Research Council Report

The second article does not have the seemingly obligatory words “paradigm shift” in its title, but it does appear in a collection of papers on “the paradigm shift for forensic science.” In a thoughtful review, Fingerprint Identification: Advances Since the 2009 National Research Council Report, Professor Christophe Champod of the Université de Lausanne efficiently summarizes and comments on the major institutional, scientific, and scholarly developments involving latent print examination during the last five or six years. For anyone who wants to know what is happening in the field and what is on the horizon, this is the article to read.

Champod observes that “[w]hat is clear from the post NRC report scholarly literature is that the days where invoking ‘uniqueness’ as the main (if not the only) supporting argument for an individualization conclusion are over.” He clearly articulates his favored substitute for conclusions of individualization:
A proper evaluation of the findings calls for an assignment of two probabilities. The ratio between these two probabilities gives all the required information that allows discriminating between the two propositions at hand and the fact finder to take a stand on the case. This approach is what is generally called the Bayesian framework. Nothing prevents its adoption for fingerprint evidence.
[M]y position remains unchanged: the expert should only devote his or her testimony to the strength to be attached to the forensic findings and that value is best expressed using a likelihood ratio. The questions of the relevant population—which impacts on prior probabilities—and decision thresholds are outside the expert’s province but rightly belong to the fact finder.
I might offer two qualifications. First, although presenting the likelihood ratio is fundamentally different from expressing a posterior probability (or announcing a decision that the latent print comes from the suspect’s finger), and although the Bayesian conceptualization of scientific reasoning clarifies this distinction, one need not be a Bayesian to embrace the likelihood ratio (or its logarithm) as a measure of the weight of evidence. The intuition that evidence that is more probable under one hypothesis than another lends more support to the former than the latter can be taken as a starting point. (But counter-examples to and criticisms of the “law of likelihood” have been advanced. E.g., van Enk (2015); Mayo (2014).)

Second, whether the likelihood-ratio approach to presenting results is thought to be Bayesian or to rest on a distinct "law of likelihood," what stands in the way of its widespread adoption is conservatism and the absence of data-driven conditional probabilities with which to compute likelihood ratios. To be sure, even without accepted numbers for likelihoods, the analyst who reaches a categorical conclusion should have some sense of the likelihoods that underlie the decision. As subjective and fuzzy as these estimates may be, they can be the basis for reporting the results of a latent print examination as a qualitative likelihood ratio (NIST Expert Working Group on Human Factors in Latent Print Analysis 2012).  Still, a question remains: How do we know that the examiner is as good at judging these likelihoods as at coming to a categorical decision without articulating them?
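To make concrete the difference between reporting a likelihood ratio (or its logarithm) and reporting a posterior probability, here is a minimal Python sketch. The likelihood ratio of 1,000 and the prior probabilities are invented for illustration, and the function names are my own. The sketch assumes the standard odds form of Bayes' rule and nothing specific to fingerprints.

```python
import math

def weight_of_evidence(likelihood_ratio):
    """Base-10 logarithm of the likelihood ratio (one common scale for evidential weight)."""
    return math.log10(likelihood_ratio)

def posterior_probability(prior_probability, likelihood_ratio):
    """Odds form of Bayes' rule: posterior odds = likelihood ratio * prior odds."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

HYPOTHETICAL_LR = 1000.0  # assumed value; not derived from any fingerprint data

print("weight of evidence (log10 LR):", weight_of_evidence(HYPOTHETICAL_LR))
for prior in (0.0001, 0.01, 0.5):  # hypothetical priors, which are the fact finder's to choose
    posterior = posterior_probability(prior, HYPOTHETICAL_LR)
    print(f"prior {prior} -> posterior {posterior:.4f}")
```

The same likelihood ratio yields very different posterior probabilities as the prior changes, which is why the expert can report the former without taking any position on the latter.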

Looking forward to less opaquely ascertained likelihoods, Champod presents the following vision:
I foresee the introduction in court of probability-based fingerprint evidence. This is not to say that fingerprint experts will be replaced by a statistical tool. The human will continue to outperform machines for a wide range of tasks such as assessing the features on a mark, judging its level of distortion, putting the elements into its context, communicating the findings and applying critical thinking. But statistical models will bring assistance in an assessment that is very prone to bias: probability assignment. What is aimed at here is to find an appropriate distribution of tasks between the human and the machine. The call for transparency from the NRC report will not be satisfied merely with the move towards opinions, but also require offering a systematic and case-specific measure of the probability of random association that is at stake. It is the only way to bring the fingerprint area within the ethos of good scientific practice.
Acknowledgement: Thanks to Ted Vosk for telling me about the first article discussed here.

Thursday, November 19, 2015

Marching Toward Improved Latent Fingerprint Testimony at the Army's Defense Forensic Science Center

The U.S. Army’s Defense Forensic Science Center (DFSC) has announced a change in its practice of reporting a positive association between a latent fingerprint and an exemplar. (The full notice of November 3, 2015, is reproduced below.)

The notice seems to say that it is no longer appropriate to “use the terms ‘identification’ or ‘individualization’ in technical reports and expert witness testimony to express the association of an item of evidence to a specific known source” because “these terms imply absolute certainty of the conclusion to the fact-finder which has not been demonstrated by available scientific data.” The DFSC “recognizes the importance of ensuring forensic science results are reported to the fact-finder in a manner which appropriately conveys the strength of the evidence, yet also acknowledges that absolute certainty should not be claimed based on currently available scientific data.”

All this sounds forward-looking, but are the words “based on currently available scientific data” meant to imply that the lack of “absolute certainty” is just a temporary deficiency, soon to be cured by more research? If so, the statement is mistaken. Inasmuch as all science is contingent (potentially subject to revision), no amount of research can deliver “absolute certainty.” But some propositions are nearly certain. Although we cannot be absolutely certain that the earth is the third planet orbiting the sun, we can be darned sure of it.

So what is the DFSC’s understanding of the current data? Are fingerprint analysts not allowed to say that they have made an “identification” because they cannot be third-planet-from-the-sun sure of it? Or is the policy change a reflection of substantially greater uncertainty than this?

The notice skates above the surface of these questions. However, three years ago, its author wrote an article entitled “Individualization Using Friction Skin Impressions: Scientifically Reliable, Legally Valid” in which he insisted on “the validity of testimonial claims of individualization.” At that time, he maintained that even though “[n]othing in science can ever be proven in the most absolute sense,”
It can be well agreed that the fundamental premise of friction ridge skin uniqueness has withstood considerable scrutiny since the late 17th century. Furthermore, ... friction ridge skin uniqueness is well within the bounds to be considered a scientific law that will occur invariably as a natural phenomenon, and it should be recognized, as such ... .
(Swofford 2012, p. 75). Sounds like an assertion that individualization via latent prints is third-planet-from-the-sun science. In a perceptive 2015 article, however, Mr. Swofford repudiated this traditional view of the current level of fingerprint-identification certainty as unduly defensive and detrimental to the field.*

In any event,
[T]he DFSC has modified the language which is used to express “identification” results on latent print technical reports. The revised languages [sic] is as follows: “The latent print on Exhibit ## and the record finger/palm prints bearing the name XXXX have corresponding ridge detail. The likelihood of observing this amount of correspondence when two impressions are made by different sources is considered extremely low.”
This is a step forward from an assertion that it is 100% certain that the latent print comes from XXXX’s finger. But the details of the move from absolute to partial certainty are not perfect. (What is?)

Exactly what “is considered extremely low” and by whom?  First, is the examiner saying that some number such as 0.00001 is known and that the DFSC considers this number to be extremely low? Or is the testimony that no specific figure is known, but the DFSC believes that it is within an otherwise unspecified range that is extremely low? Although asking the expert on cross-examination for the “extremely low” number should reveal that no specific number is known, would it be better to make this clear at the outset?

Second, what is the nature of the quantity that someone considers extremely low? Is it a “likelihood” or instead a probability? Colloquially, the words are synonyms, but technically, “likelihood” pertains to the hypothesis, not to the evidence. The probability of the evidence E given the hypothesis H (written P(E|H)), when summed or integrated over all possible E, must equal 1. The hypothesis is fixed, the evidence varies, and the probability attaches to the evidence. For example, if the probability of the observed “amount of correspondence” is 0.00001 “when two impressions are made by different sources,” then the probability of all other amounts must be 0.99999.

The concept of likelihood, however, treats the evidence as fixed and asks how strongly the fixed evidence supports possible hypotheses. The hypotheses vary, and there is no reason to believe that the sum or integral over all possible hypotheses “will be anything in particular” (Edwards 1992, p. 12). If the probability of the observed “amount of correspondence” is 0.00001 “when two impressions are made by different sources,” then the probability of the same correspondence when the two impressions are made by the same source could be 0.01. Or it could be 0.05, or many other values between 0 and 1. Mathematically, H’s likelihood is proportional to E’s conditional probability, but conceptually, “the distinction between probability and likelihood is vital ... .” (Ibid.)
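The contrast between probability and likelihood can be sketched in a few lines of Python. The table below is a toy model: only the 0.00001 and 0.01 figures echo the examples in the text, the remaining numbers are filled in so that each row is a proper distribution, and the labels are invented for illustration. None of it is measured data.

```python
import math

# Hypothetical conditional probabilities P(E | H); a toy model, not measured data.
P_E_GIVEN_H = {
    "H0_different_sources": {"observed_amount": 0.00001, "any_other_amount": 0.99999},
    "H1_same_source": {"observed_amount": 0.01, "any_other_amount": 0.99},
}

def probability_total(hypothesis):
    """Fix the hypothesis and vary the evidence: the probabilities must sum to 1."""
    return sum(P_E_GIVEN_H[hypothesis].values())

def likelihoods(evidence):
    """Fix the evidence and vary the hypothesis: L(H; E) is proportional to P(E | H)."""
    return {h: dist[evidence] for h, dist in P_E_GIVEN_H.items()}

for hypothesis in P_E_GIVEN_H:
    total = probability_total(hypothesis)
    print(hypothesis, "sums to 1 over E:", math.isclose(total, 1.0))

fixed_evidence = likelihoods("observed_amount")
print("likelihoods for the fixed evidence:", fixed_evidence)
print("their sum:", sum(fixed_evidence.values()))
```

Each row of the table sums to 1 because it is a probability distribution over the evidence. The column for the observed evidence does not, because the likelihoods of different hypotheses are not a distribution over anything.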

Apparently the DFSC is not using “likelihood” as statisticians do when they are thinking about the logic of statistical inference, but is just referring to the garden variety probability of the latent print examiner’s observations (the evidence E) conditional on the hypothesis H0 of “different sources.” This sounds a lot like traditional null hypothesis testing or like the use of a Fisherian p-value (Kaye 2015).

Such discourse is fine as far as it goes. Talking about the low probability of the level of correspondence when the prints come from different fingers is much better than asserting that the observed correspondence is utterly inconceivable under the different-source hypothesis H0 or that the same-source hypothesis is absolutely certain to be true.

Nevertheless, this kind of testimony is still incomplete. As has been discussed many times in the forensic science and statistics literature (see Can Forensic Pattern Matching be Validated?), it is necessary to consider the evidence probability under the alternative hypothesis H1. How probable is it that the latent print would have the same degree of observed correspondence to the exemplar if it originated from the finger that produced the exemplar?

The ratio of these two probabilities, P(E|H1) to P(E|H0), equals the likelihood ratio, L(H1; E) to L(H0; E). Unless the numerator P(E|H1) is 1 (a conclusion that also is not “based on currently available scientific data”), giving only the denominator is problematic. It ignores the limitation in the evidence emphasized in two of the three references provided in favor of the new policy (the 2009 NRC Report and the 2012 NIST Expert Working Group Report).
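To see why the denominator alone leaves the weight of the evidence undetermined, the following sketch holds P(E|H0) at 0.00001 and varies P(E|H1) over 0.01, 0.05, and 1. The function name is hypothetical, and the numerators are illustrative possibilities rather than measured values.

```python
from fractions import Fraction


def likelihood_ratio(p_e_given_h1, p_e_given_h0):
    """Return P(E | H1) / P(E | H0), the weight of E for H1 relative to H0."""
    if p_e_given_h0 == 0:
        raise ValueError("P(E | H0) must be positive")
    return p_e_given_h1 / p_e_given_h0


denominator = Fraction(1, 100000)  # P(E | H0) = 0.00001
for numerator in (Fraction(1, 100), Fraction(1, 20), Fraction(1)):
    # Same denominator, different numerators, very different ratios.
    print(float(numerator), "->", likelihood_ratio(numerator, denominator))
```

The same "extremely low" denominator is compatible with ratios that differ by a factor of 100, depending on how variable impressions from a single finger are assumed to be.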

Hopefully, the DFSC and other organizations will continue to refine their method of reporting associations. The Center's Information Paper promises that “[t]he next step will be to quantify both the amount of corresponding ridge detail and the related likelihood calculations.” But the DFSC need not wait for “likelihood calculations” to acknowledge in its reports and testimony that there is some variability in latent prints from the same finger.


* Henry J. Swofford, The Emerging Paradigm Shift in the Epistemology of Fingerprint Conclusions, 65 J. Forensic Identification 201, 203 (2015). Postscript: This reference and the accompanying text were added on 11/20/15 10:35 pm EST. This article is discussed briefly in a posting of 11/21/15.

  • A. W. F. Edwards, Likelihood: Expanded Edition (1992).
  • David H. Kaye, Presenting Forensic Identification Findings: The Current Situation, in Communicating the Results of Forensic Science Examinations 12–30 (C. Neumann et al. eds. 2015) (Final Technical Report for NIST Award 70NANB12H014).
  • Henry J. Swofford, Individualization Using Friction Skin Impressions: Scientifically Reliable, Legally Valid, 62 J. Forensic Identification 65 (2012).

Thanks to Ted Vosk for calling the Information Paper discussed here to my attention.


4930 N 31ST STREET

03 November 2015


SUBJECT: Use of the term “Identification” in Latent Print Technical Reports

1. Forensic science laboratories routinely use the terms “identification” or “individualization” in technical reports and expert witness testimony to express the association of an item of evidence to a specific known source. Over the last several years, there has been growing debate among the scientific and legal communities regarding the use of such terms within the pattern evidence disciplines to express source associations which rely on expert interpretation. Central to the debate is that these terms imply absolute certainty of the conclusion to the fact-finder which has not been demonstrated by available scientific data. As a result, several well respected and authoritative scientific committees and organizations have recommended forensic science laboratories not report or testify, directly or by implication, to a source attribution to the exclusion of all others in the world or to assert 100% certainty and state conclusions in absolute terms when dealing with population issues.

2. The Defense Forensic Science Center (DFSC) recognizes the importance of ensuring forensic science results are reported to the fact-finder in a manner which appropriately conveys the strength of the evidence, yet also acknowledges that absolute certainty should not be claimed based on currently available scientific data. As a result, the DFSC has modified the language which is used to express “identification” results on latent print technical reports. The revised language is as follows:
"The latent print on Exhibit ## and the record finger/palm prints bearing the name XXXX have corresponding ridge detail. The likelihood of observing this amount of correspondence when two impressions are made by different sources is considered extremely low."
3. This revision to the reporting language is not the result of changes in the examination methods and does not impact the strength of the source associations. Instead, it simply reflects a more scientifically appropriate framework for expressing source associations made when evaluating latent print evidence. The next step will be to quantify both the amount of corresponding ridge detail and the related likelihood calculations. In the interim, customers should continue to maintain strong confidence in latent print examination results.

4. References:
a. National Research Council (2009). Strengthening Forensic Science in the United States: A Path Forward. National Research Council, Committee on Identifying the Needs of the Forensic Science Community. National Academies Press, Washington, D.C.

b. National Institute of Standards and Technology (2012). Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach. Expert Working Group on Human Factors in Latent Print Analysis, U.S. Department of Commerce, National Institute of Standards and Technology.

c. Garrett, R. (2009). Letter to All Members of the International Association for Identification, Feb. 19, 2009.
5. Questions regarding this information paper may be directed to Mr. Henry Swofford, Chief, Latent Print Branch, USACIL, DFSC, 404-469-5611 and

Sunday, November 8, 2015

Can Forensic Pattern Matching Be Validated?

An article in the latest issue of the International Statistical Review raises (once again) fundamental questions for forensic scientists: 1/ How can one establish the validity of human judgment in a pattern recognition task such as deciding whether two samples of fingerprints or handwriting emanate from the same source? How can one estimate error probabilities for these judgments?

The message I get reading between the lines is that convincing validation is barely possible and the subjective assessments that are today’s norm will have to be replaced by objective measurements and statistical decision rules. This conclusion may not sit well with practicing criminalists who are committed to the current mode of skill-based assessments. At the same time, the particular statistical perspective of the article (null hypothesis testing) stands in opposition to a movement in the academic segment of the forensic science world that importunes criminalists to get away from categorical judgments — whether these judgments are subjective or objective. Nevertheless, the author of the article on Statistical Issues in Assessing Forensic Evidence, Karen Kafadar, is a leading figure in forensic statistics, 2/ and her perspective is traditional among statisticians. Thus, an examination of a few parts of the article seems appropriate.

The article focuses on “forensic evidence that involves patterns (latent fingerprints, firearms and toolmarks, handwriting, tire treads and microscopic hair),” often comparing it to DNA evidence (much as the NRC Committee on Identifying the Needs of the Forensic Science Community did in 2009). Professor Kafadar emphasizes that in the pattern-matching fields, analysts do not make quantitative measurements of a pre-specified number of well-defined, discrete features like the short tandem repeat (STR) alleles that now dominate forensic DNA testing. Instead, the analyses “depend to a large extent on the examiner whose past experience enables some qualitative assessment of the distinctiveness of the features.” In other words, human perception and judgment establish how similar the two feature sets are and how discriminating those feature sets are. Such “pattern evidence is ... subjective and in need of quantitative validation.”

I. Validating Expert Judgments

How, then, can one quantitatively validate the subjective process? The article proceeds to “define measures used in quantifying error probabilities and how they can be used for pattern evidence.”

A. Validity

The first measure is

Validity (accuracy): Given a sample piece of evidence on which a measurement is made, is the measurement accurate? That is, if the measurement is ‘angle of bifurcation’ or ‘number of matching features’, does that measurement yield the correct answer? For example, if a bifurcation appears on an image with an angle of 30°, does the measurement technology render a result of ‘30’ [degrees], at least on average if several measurements are made? As another example, if a hair diameter is 153 μm, will the measurement, or average of several measurements, indicate ‘153’?

This is only a rough definition. Suppose a measuring instrument always gives a value of 30.001 when the angle is actually 30. Are the measurements “valid”? Neither individually nor on average is the instrument entirely accurate. But the measurements always are close, so maybe they do qualify as valid. There are degrees of validity, and a common measure of validity in this example would be the root mean squared error, where an error is a difference between the true angle and the measurement of it.
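The root mean squared error mentioned here can be sketched in a few lines. The instrument that always reads 30.001 comes from the example above; the second, "noisy" instrument and its readings are hypothetical, added only to show that being right on average is not the same as being close.

```python
import math


def root_mean_squared_error(true_values, measurements):
    """Return the square root of the mean squared difference between
    each measurement and the corresponding true value."""
    if len(true_values) != len(measurements) or not true_values:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(m - t) ** 2 for t, m in zip(true_values, measurements)]
    return math.sqrt(sum(squared) / len(squared))


true_angles = [30.0] * 5
biased_but_close = [30.001] * 5                 # always off by 0.001 degree
unbiased_but_noisy = [29.0, 31.0, 30.5, 29.5, 30.0]  # hypothetical; mean is 30

print("biased instrument RMSE:", root_mean_squared_error(true_angles, biased_but_close))
print("noisy instrument RMSE: ", root_mean_squared_error(true_angles, unbiased_but_noisy))
```

On this measure the slightly biased instrument is far more valid than the one that is correct on average but scattered, which is one way of expressing degrees of validity.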

But fingerprint analysts do not measure alignments in degrees. Their comparisons are more like that of a person asked to hold two objects, one in each hand, and say which one is heavier (or whether the weights are practically the same). Experiments can validate the ability of test subjects to discriminate the masses under various conditions. If the subjects rarely err, their qualitative, comparative, subjective judgments could be considered valid.

Of course, there is no specific point at which accuracy suddenly merits the accolade of “valid,” and it can take more than one statistic to measure the degree of validity. For example, two forensic scientists, Max Houck and Jay Siegel, interpret a study of the outcomes of the microscopic comparisons and mitochondrial DNA testing as establishing a 91% “accuracy,” 3/ where “accuracy” is the overall “proportion of true results.” 4/ Yet, the microscopic hair analysts associated a questioned hair with a known sample in 1/5 of the cases in which DNA testing excluded any such association (and 1/3 in cases in which there was a DNA exclusion and a definitive result from the microscopy). 5/ In dealing with binary classifications, “validity (accuracy)” may require attention to more than being correct “on average.”

Moreover, whether the validity statistic derived from one experiment applies more widely is almost always open to debate. Even if one group of “weight analysts” always successfully discriminated between the heavier and the lighter weights, the question of generalizability or “external validity,” as social scientists often call it, would remain. A rigorous, double-blind study might show that the analysts did superbly with one set of weights under particular, controlled conditions. This study would possess high internal validity. But it might not tell us much about the performance of different analysts under different conditions; its external validity might be weak. Indeed, it has been said that “[i]t is axiomatic in social science research that there is an inverse relationship between internal and external validity.” 6/

Plainly, the quick definition of “validity” in Statistical Assessments does not exhaust the subject. (Nor was it intended to.) Things get even more complicated when we think of validity as relating to the purpose of the measurement. The data from a polygraph instrument may be valid measurements of physiological characteristics but not valid measures of conscious deception. The usual idea of validity is that the instrument (human or machine) accurately measures what it is supposed to measure. This aspect of “validity” is closely related to the requirement of “fit” announced in the Supreme Court's majority opinion in Daubert v. Merrell Dow Pharmaceuticals, Inc. 7/

B. Consistency

The article indicates that it takes more than “validity” to validate a measurement or inference process. The second requirement is

Consistency (reliability): Given the same sample, how consistent (or variable) are the results? If the measurement is repeated under different conditions (e.g. different fingers, different examiners, different analysis times, different measurement systems and different levels of quality in evidence), is the measurement the same? ... Under what conditions are the measurements most variable? That is, do measurements vary most with different levels of latent print quality? Or with different fingers of the same person? Or with different times of day for the same examiner? Or with different automated fingerprint identification systems (AFIS)? Or with different examiners? If measurements are found to be most consistent when the latent print quality is high and when AFIS system type A is used, but results vary greatly among examiners when the latent print quality is low or when other AFIS systems are used, then one would be in a good position to recommend the restriction of this particular type of forensic evidence under only those conditions when consistency can be assured. ... Notice that a measurement can be highly consistent around the wrong answer (consistent but inaccurate). ...

The critical definitional point here is that “reliability” concerns consistency, but there is room for argument over whether the measuring process has to be consistent under all conditions to be considered “reliable.” If one automated system is consistent in a given domain, it is reliable in that domain. If one skilled examiner reaches consistent results, her reliability is high even if inter-examiner reliability is low. In these examples, the notion of “reliability” overlaps or blurs into the idea of external validity. Likewise, all our weight analysts might be very reliable when comparing 10-pound weights to 20-pound ones but quite unreliable in distinguishing between 15- and 16-pound ones. This would not prove that subjective judgments are ipso facto unreliable — only that reliability is less for more difficult tasks than for easy ones.

These ruminations on terminology do not undercut the important message in Statistical Assessments that research that teases out the conditions under which reliability and validity are degraded is vital to avoiding unnecessary errors: “[M]any observational studies are needed to confirm the performance of latent print analysis under a wide array of scenarios, examiners and laboratories.”

C. Well-determined Error Probabilities

The final component of validation described in Statistical Assessments is "well-determined error probabilities." When it comes to the classification task (differentiating same-source from different-source specimens), the error probabilities indicate whether the classifications are valid. A highly specific test has relatively few false positives — when confronted with different-source specimens, examiners conclude that they do not match. A highly sensitive test has relatively few false negatives — when confronted with same-source specimens, examiners conclude that they do match.

Tests that are both sensitive and specific also can be described as generating results that have a high “likelihood ratio.” If a perceived positive association is much more probable when the specimens truly are associated, and a negative association (an exclusion) is much more probable when they are not, then the likelihood ratio LR has a large numerator (close to the maximum probability of 1) and a small denominator (close to the minimum of 0):
LR = Pr(test + | association) / Pr(test + | no association)
      = sensitivity / [1 – Pr(test – | no association)]
      = sensitivity / (1 – specificity)
      = large (almost 1) / small (a little more than 0)
      = very large
But a high likelihood ratio does not guarantee a high probability of a true association. It signals high “probative value,” to use the legal phrase, because it justifies a substantial change in probability that the suspected source is the real source compared to that probability without the evidence. For example, if the odds of an association without knowledge of the evidence are 1 to 10,000 and the examiner’s perception is 1,000 times more probable if the specimens are from the same source (LR = 1000), then, by Bayes' rule, the odds given the evidence rise to 1000 × 1:10000 = 1:10. Odds of only 1 to 10 cannot justify a conviction or even a conclusion that the two specimens probably are associated, but the examiner’s evidence has contributed greatly to the case. With other evidence, guilt may be established; without the forensic-science evidence, the totality of the evidence may fall far short. Therefore, if the well-defined error probabilities are low (and, hence, the likelihood ratio is high), it would be a mistake to dismiss the examiner’s assessment as lacking in value.
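The arithmetic of this example can be checked with a short sketch of the odds form of Bayes' rule. The function names are hypothetical; the prior odds of 1 to 10,000 and the likelihood ratio of 1,000 are the figures used in the paragraph above.

```python
from fractions import Fraction


def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' rule: posterior odds = likelihood ratio x prior odds."""
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds in favor (for example 1/10 for '1 to 10') to a probability."""
    return odds / (1 + odds)


prior = Fraction(1, 10000)          # odds of 1 to 10,000 before the evidence
posterior = posterior_odds(prior, 1000)

print("posterior odds:", posterior)
print("prior probability:", odds_to_probability(prior))
print("posterior probability:", odds_to_probability(posterior))
```

The posterior probability is still well under one half, yet it is roughly a thousand times larger than the prior probability, which is the sense in which the evidence is highly probative without being conclusive.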

Yet, the standard terminology of positive and negative “predictive value” used in Statistical Assessments suggests that much more than this is required for the evidence to have “value.” For example, the article states that

In the courtroom, one does not have the ‘true’ answer; one has only the results of the forensic analysis. The question for the jury to decide is as follows: Given the results of the analysis, what is the probability that the condition is present or absent? For fingerprint analysis, one might phrase this question as follows:

PPV = P{same source | analysis claims ‘same source’}.

If PPV is high, and if the test result indicates ‘same source’, then we have some reasonable confidence that the two prints really did come from the same person. But if PPV is low, then, despite the test result (‘same source’), there may be an unacceptably high chance that in fact the prints did not come from the same person—that is, we have made a serious ‘type I error’ in claiming a ‘match’ when in fact the prints came from different persons.

Yes, the fingerprint analyst who asserts that the defendant is certainly the source when the PPV is low is likely to have falsely rejected the hypothesis that the defendant is not the source. But why must fingerprint examiners make these categorical judgments? Their job is to supply probative evidence to the judge or jury so as to permit the factfinder to reach the best conclusion based on the totality of the evidence in the case. 8/ If experiments have shown that examiners like the one in question, operating under comparable conditions with comparable prints, almost always report that the prints come from the same source when they do (high sensitivity) and that they do not come from the same source when they do not (high specificity), then there is no error in reporting that the prints in question are substantially more likely to have various features in common if they came from the same finger than if they came from fingers from two different individuals. 9/ This is a correct statement about the weight of the evidence rather than the probability of the hypothesis.
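The difference between the weight of the evidence and the PPV can be illustrated with a sketch in which the examiner's error rates stay fixed while the prevalence changes. The sensitivity of 0.99, the specificity of 0.999, and both prevalence values are assumptions chosen for illustration, not estimates from any study.

```python
from fractions import Fraction


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(same source | examiner reports 'same source') by Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical error rates for an examiner (assumed, not measured).
sensitivity = Fraction(99, 100)
specificity = Fraction(999, 1000)

# Pr(test + | same source) / Pr(test + | different sources): fixed by the
# error rates alone, whatever the prevalence.
weight_of_evidence = sensitivity / (1 - specificity)
print("likelihood ratio:", weight_of_evidence)

for prevalence in (Fraction(1, 10001), Fraction(1, 2)):
    ppv = positive_predictive_value(sensitivity, specificity, prevalence)
    print("prevalence", float(prevalence), "-> PPV", float(ppv))
```

The likelihood ratio is the same in both scenarios; only the PPV moves, because the PPV folds in a prior probability that comes from the rest of the case rather than from the examiner.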

Indeed, one can imagine expanding the range of evaluative conclusions that fingerprint examiners might give. Instead of thinking “it’s either an identification or an exclusion” (for simplicity, I am ignoring judgments of “insufficient” and “inconclusive”), the examiner might be trained to offer judgments on a scale for the likelihood ratio, as European forensic science institutes have proposed. 10/ A large number of clear and unusual corresponding features in the latent print and the exemplar should generate a large subjective probability for the numerator of LR and a small probability for the denominator. A smaller number of such features should generate a smaller subjective ratio.

Although this mode of reporting on the evidentiary weight of the features is more nuanced and supplies more information to the factfinder, it would increase the difficulty of validating the judgments. How could one be confident that the moderate-likelihood-ratio judgments correspond to less powerful evidence than the high-likelihood-ratio ones?

II. Validating an Objective Statistical Rule

Statistical Assessments does not seriously consider the possibility of moving from categorical decisions on source attribution to a weight-of-evidence system. Instead, it presents a schematic for validating source attributions in which quantitative measurement replaces subjective impressions of the nature and implications of the degree of similarity in the feature sets. The proposal is to devise an empirically grounded null hypothesis test for objective measurements. Development of the test would proceed as follows (using 95% as an example):

    (1) Identify a metric (or set of metrics) that describes the essential features of the data. For example, these metrics might consist of the numbers of certain types of features (minutiae) or the differences between the two prints in the (i) average distances between the features (e.g. between ridges or bifurcations), (ii) eccentricities of identified loops or (iii) other characteristics on the prints that could be measured.
    (2) Determine a range on the metric(s) that is ‘likely to occur’ (has a 95% chance of occurring) if ‘nothing interesting is happening’ (i.e. the two prints do not arise from the same source). For example, one could calculate these metrics on 10,000 randomly selected latent prints known to have come from different sources.
    (3) Identify ‘extreme range’ = range of the metric(s) outside of the ‘95%’ range. For example, one can calculate ranges in which 95% of the 10,000 values of each metric lie.
    (4) Conduct the experiment and calculate the metric(s). For example, from the ‘best match’ that is identified, one can calculate the relevant metrics.
    (5) If the metric falls in the ‘expected’ range, then data are deemed consistent with the hypothesis that ‘nothing interesting is happening’. If the metric falls in the ‘extreme’ range, the data are not consistent with this hypothesis and indicate instead an alternative hypothesis.
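Steps (2) through (5) can be sketched as follows. This is a minimal illustration under loud assumptions: a single hypothetical similarity score stands in for the metric(s), the 10,000 different-source scores are simulated from a standard normal distribution rather than measured from prints, and only large scores are treated as "extreme" (a one-sided range).

```python
import random


def upper_cutoff(reference_scores, level_percent=95):
    """Return the smallest reference score such that at least level_percent
    percent of the reference scores are at or below it."""
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("need at least one reference score")
    # Integer ceiling of n * level_percent / 100, avoiding float rounding.
    rank = max(1, (level_percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def classify(score, cutoff):
    """Step (5): 'expected' is consistent with different sources; 'extreme' is not."""
    return "extreme" if score > cutoff else "expected"


# Step (2): simulated stand-in for scores from known different-source pairs.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
different_source_scores = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Step (3): the boundary between the '95%' range and the 'extreme' range.
cutoff = upper_cutoff(different_source_scores)
n_extreme = sum(s > cutoff for s in different_source_scores)
print(f"cutoff = {cutoff:.3f}; different-source scores beyond it: {n_extreme}")

# Steps (4) and (5): classify a hypothetical casework score.
print(classify(cutoff + 1.0, cutoff))
```

By construction, no more than 5% of the reference scores lie beyond the cutoff, which is all that the procedure controls.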

This type of approach keeps the risk of a false rejection of the null hypothesis (that the suspect is not the source) to no more than 5% (ignoring the complications arising from the fact that not one but many variables are being considered separately), but it is subject to well-known criticisms. First, why 5%? Why not 1%? Or 9.3%?

Second, whatever the level of the test, does it make sense to report an association when the measurements barely make it into the “extreme range” but not when they are barely shy of it?

Third, what is the risk of a false acceptance — a false exclusion of a truly matching print? To estimate that error probability, a different distribution would need to be considered — the distribution of the measured values of the features when sampling from the same finger. The 2009 NRC Report refers to this issue. In a somewhat garbled passage on the postulated uniqueness of fingerprints, 11/ it observes that
Uniqueness and persistence are necessary conditions for friction ridge identification to be feasible, but [u]niqueness does not guarantee that prints from two different people are always sufficiently different that they cannot be confused, or that two impressions made by the same finger will also be sufficiently similar to be discerned as coming from the same source. The impression left by a given finger will differ every time, because of inevitable variations in pressure, which change the degree of contact between each part of the ridge structure and the impression medium. None of these variabilities — of features across a population of fingers or of repeated impressions left by the same finger — has been characterized, quantified, or compared. 12/
To rest assured that both error probabilities of the statistical test of the quantified feature set are comfortably low, the same-finger experiment also needs to be conducted. Of course, Dr. Kafadar might readily concede the need to consider the alternative hypothesis (that prints originated from the same finger) and say that replicate measurements from the same fingers should be part of the experimental validation of the more objective latent-print examination process (or that in ordinary casework, examiners should make replicate measurements for each suspect). 13/

Still, the question remains: What if a similarity score on the crime-scene latent print and the ten-print exemplar falls outside the 95% range of variability of prints from different individuals and outside the 95% range for replicate latent prints from the same individual? Which hypothesis is left standing — the null (different source) or the alternative (same source)? One could say that the fingerprint evidence is inconclusive in this case, but would it be better to report a likelihood ratio in all cases rather than worrying about the tail-end probabilities in any of them? (This LR would not depend on whether the different-source hypothesis is rejected. It would increase more smoothly with an increasing similarity score.)
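The scenario in this paragraph can be made concrete with a sketch that assumes, purely for illustration, that similarity scores are normally distributed with mean 0 for different-source pairs and mean 4 for same-source pairs, each with standard deviation 1. Nothing about real fingerprint scores is implied by these parameters.

```python
from statistics import NormalDist

# Hypothetical score distributions; all parameters are assumptions.
DIFFERENT_SOURCE = NormalDist(mu=0.0, sigma=1.0)
SAME_SOURCE = NormalDist(mu=4.0, sigma=1.0)


def score_likelihood_ratio(score):
    """Return the density of the score under the same-source model divided
    by its density under the different-source model."""
    return SAME_SOURCE.pdf(score) / DIFFERENT_SOURCE.pdf(score)


# Edges of the central 95% ranges that face each other.
upper_edge_different = DIFFERENT_SOURCE.inv_cdf(0.975)
lower_edge_same = SAME_SOURCE.inv_cdf(0.025)
print(f"different-source 95% range ends at {upper_edge_different:.3f}")
print(f"same-source 95% range starts at   {lower_edge_same:.3f}")

# A score of 2.0 falls in the gap, outside both ranges.
for score in (1.0, 2.0, 3.0):
    print(score, "-> LR", round(score_likelihood_ratio(score), 3))
```

Under these assumptions a score of 2.0 lies outside both 95% ranges, and the likelihood ratio there is 1: the evidence favors neither hypothesis. Moving the score up or down changes the ratio smoothly, with no jump at any cut-off.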

III. Human Expertise and Statistical Criteria

A major appeal of objectively ascertained similarity scores and a fixed cut-off is that the system supplies consistent results with quantified error probabilities and reliability. But would the more objective process be any more accurate than subjective, human judgment in forensic pattern recognition tasks? The objective measures that might emerge are likely to be more limited than the many features that the human pattern matchers might evaluate. And, it can be argued that the statistical evaluation of them may not be as sensitive to unusual circumstances or subtleties as individual “clinical” examination would be.

Thus, the preference in Statistical Assessments for objectively ascertained similarity scores and a fixed cut-off is reminiscent of the arguments for “actuarial” or “statistical” rather than “clinical” assessments in psychology and medicine. 14/ The work in those fields of expertise raises serious doubt about claims of superior decisionmaking from expert human examiners. Nonetheless, more direct data on this issue can be gathered. Along with the research Dr. Kafadar proposes, studies of whether the statistical system outperforms the classical, clinical one in the forensic science fields should be undertaken. The burden of proof should be on the advocates of purely clinical judgments. Unless the less transparent and less easily validated human judgments clearly outperform the algorithmic approaches, they should give way to more objective measurements and interpretations.

1. Karen Kafadar, Statistical Issues in Assessing Forensic Evidence, 83 Int’l Stat. Rev. 111–34 (2015).

2. Dr. Kafadar is Commonwealth Professor and chair of the statistics department at the University of Virginia, a member of the Forensic Science Standards Board of the Organization of Scientific Area Committees of the National Institute of Standards and Technology, and a leading participant in the newly established “Forensic Science Center of Excellence focused on pattern and digital evidence” — “a partnership that includes Carnegie Mellon University (Pittsburgh, Penn.), the University of Virginia (Charlottesville, Va.) and the University of California, Irvine (Irvine, Calif.) [that] will focus on improving the statistical foundation for fingerprint, firearm, toolmark, dental and other pattern evidence analyses, and for computer, video, audio and other digital evidence analyses.” New NIST Center of Excellence to Improve Statistical Analysis of Forensic Evidence, NIST Tech Beat, May 26, 2015.

As Dr. Kafadar has observed, “[s]tatistics plays multiple roles in moving forensic science forward, in characterizing forensic analyses and their underlying bases, designing experiments and analyzing relevant data that can lead to reduced error rates and increased accuracy, and communicating the results in the courtroom.” U.Va. Partners in New Effort to Improve Statistical Analysis of Forensic Evidence, UVAToday, June 2, 2015.

3. Max M. Houck & Jay A. Siegel, Fundamentals of Forensic Science 310 (2015).

4. Id.

5. David H. Kaye, Ultracrepidarianism in Forensic Science: The Hair Evidence Debacle, 72 Wash. & Lee L. Rev. Online 227 (2015).

6. E.g., Allan Steckler & Kenneth R. McLeroy, The Importance of External Validity, 98 Am. J. Public Health 9 (2008).

7. See David H. Kaye et al., The New Wigmore: A Treatise on Evidence: Expert Evidence (2d ed. 2011).

8. As Sir Ronald Fisher reminded his fellow statisticians, “We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions.” Ronald A. Fisher, Statistical Methods and Scientific Induction, 17 J. Roy. Statist. Soc. B 69 (1955).

9. Yet, Statistical Assessments insists that by virtue of Bayes’ rule, “low prevalence, high sensitivity and high specificity are needed for high PPV and NPV ... [there is a] need for sensible restriction of the suspect population.” This terminology is confusing. Low prevalence (guilt is rare) comes with a large suspect population rather than a restricted one. It cuts against a high PPV. Conversely, if “low prevalence” means a small suspect population (innocence is relatively rare), then it is harder to have a high NPV.

10. ENFSI Guideline for Evaluative Reporting in Forensic Science, June 9, 2015.

11. The assertion below that “[u]niqueness and persistence are necessary conditions for friction ridge identification to be feasible” ignores the value of a probabilistic identification. A declared match can be immensely probative even if a print is not unique in the population. If a particular print occurred twice in the world’s population, a match to the suspect still would be powerful evidence of identification. DNA evidence is like that — the possibility of a genetically identical twin somewhere has not greatly undermined the feasibility of DNA identifications. The correspondence in the feature set still makes the source probability higher than it was prior to learning of the DNA match. The matching alleles need not make the probability equal to 1 to constitute a useful identification.

12. NRC Committee on Identifying the Needs of the Forensic Science Community, Strengthening Forensic Science in the United States: A Path Forward 144 (2009) (footnote omitted).

13. Statistical Assessments states that fingerprint analysts currently compare “latent prints found at a [crime scene] with those from a database of ‘latent exemplars’ taken under controlled conditions.” Does this mean that latent print examiners create an ad hoc databank in each case of a suspect’s latent prints to gain a sense of the variability of those prints? I had always thought that examiners merely compare a given latent print to exemplars of full prints from suspects (what used to be called “rolled prints”). In the same vein, giving DNA profiling as an example, Statistical Assessments asserts that “[a]nalysis of the evidence generally proceeds by comparing it with specimens in a database.” However, even if CODIS database trawls have become routine, the existence and use of a large database has little to do with the validity of the side-by-side comparisons that typify fingerprint, bullet, handwriting, and hair analyses.

14. See, e.g., R.M. Dawes et al., Clinical Versus Actuarial Judgment, 243 Science 1668 (1989) (“Research comparing these two approaches shows the actuarial method to be superior.”); William M. Grove & Paul E. Meehl, Comparative Efficiency of Informal (Subjective, Impressionistic) and Formal (Mechanical, Algorithmic) Prediction Procedures: The Clinical–Statistical Controversy, 2 Psych., Pub. Pol’y & L. 293 (1996) (“Empirical comparisons of the accuracy of the two methods (136 studies over a wide range of predictands) show that the mechanical method is almost invariably equal to or superior to the clinical method.”); Konstantinos V. Katsikopoulos et al., From Meehl to Fast and Frugal Heuristics (and Back): New Insights into How to Bridge the Clinical—Actuarial Divide, 18 Theory & Psych. 443 (2008); Steven Schwartz & Timothy Griffin, Medical Thinking: The Psychology of Medical Judgment and Decision Making (2012).

Thanks are due to Barry Scheck for calling the article discussed here to my attention.

Sunday, October 25, 2015

SWGDAM Guidelines on "Probabilistic Genotyping Systems" (Part 2)

What makes a "Probabilistic Genotyping System" probabilistic? That a computer program delivers a probability related to a DNA profile does not make it a PGS. After all, traditional, manual analysis of DNA data leads to probabilities. Here, I present a toy example of a single-source sample to convey a sense of the nature of probabilistic genotyping.

I do so with some trepidation. Neither the SWGDAM Guidelines nor the articles that I have located supply a simple and clear exposition of the actual workings of any modern forensic PGS. The Guidelines state that
A probabilistic genotyping system is comprised of software, or software and hardware, with analytical and statistical functions that entail complex formulae and algorithms. Particularly useful for low-level DNA samples (i.e., those in which the quantity of DNA for individuals is such that stochastic effects may be observed) and complex mixtures (i.e., multi-contributor samples, particularly those exhibiting allele sharing and/or stochastic effects), probabilistic genotyping approaches can reduce subjectivity in the analysis of DNA typing results.
That sounds great, but what do these "complex formulae and algorithms" do? Well,
probabilistic approaches provide a statistical weighting to the different genotype combinations. Probabilistic genotyping does not utilize a stochastic threshold. Instead, it incorporates a probability of alleles dropping out or in. In making use of more genotyping information when performing statistical calculations and evaluating potential DNA contributors, probabilistic genotyping enhances the ability to distinguish true contributors and noncontributors.
Moreover, "[t]he use of a likelihood ratio as a reporting statistic for probabilistic genotyping differs substantially from binary statistics such as the combined probability of exclusion."

This sounds good too, but what is "a statistical weighting," and how is a probability of exclusion, which is not confined to the values 0 and 1, a "binary statistic"? To gain a clearer picture of what might be going on, I thought I would start with the simplest possible situation (a crime-scene sample with a single contributor) to surmise how a probabilistic analysis might operate. My analysis is something of a guess. Corrections are welcome.

Two Peaks, One Inferred Genotype, One Likelihood Ratio of 50: Not a PGS!

In "short tandem repeat" typing via capillary electrophoresis, the laboratory extracts DNA from a sample and uses the PCR (polymerase chain reaction) to make millions of copies of a short stretch of DNA between a designated starting point and a stopping point (a "locus"). These fragments vary in length among different individuals (although none are unique). The laboratory runs the sample fragments through a machine that measures the quantity of the fragments as a function of the length of the fragments. For example, a plot of the quantity on the y-axis and the fragment length on the x-axis might show two prominent peaks, which I will call A and B, of roughly equal height rising above a noisy baseline. This AB pattern at a single locus is exactly what one would expect for DNA from an individual who inherited a fragment of length A from one parent and a fragment of length B from the other parent. Starting with roughly equal numbers of maternally and paternally inherited DNA molecules in the original sample, PCR should generate about equal quantities of the maternal and paternal length variants ("STR alleles") of the two distinct lengths. These produce the two peaks in the graph (the electropherogram).

The analyst then could compute the “random match probability” or “probability of inclusion” (PI) — that is, the probability P(RAB) that a randomly selected individual would be type AB. Even if the analyst used a computer program to do the calculation, no “probabilistic genotyping” would be involved. The “genotype” AB would be regarded as known to a certainty (for the purpose of the computation), and the probability PI pertains to something else — to the chance of coincidentally finding an individual with a matching profile: PI = P(RAB). If 1 in 50 people have the profile AB, then PI = 1/50.

The evidentiary value of the inclusion can be computed as a “likelihood ratio” (LR). If the hypothesis (Hp) that the suspect, who also is type AB, is the contributor of the DNA in the sample is correct, and if the sample has plenty of undegraded DNA, the probability of the data DAB (an A and a B peak detected in the sample) is P(DAB|Hp) = 1. On the other hand, if someone unrelated to the suspect is the contributor (Hd), then P(DAB|Hd) is the probability of inclusion PI = 1/50. Thus, the evidence (the A and B peaks) is 1/PI = 50 times more probable when the suspect is the contributor than when an unrelated person is. This ratio of the probabilities of the evidence conditional on the hypotheses is the likelihood ratio. It measures the support the evidence lends to Hp as opposed to Hd. LRs greater than 1 support Hp over Hd (e.g., Kaye et al. 2011).
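The arithmetic behind this likelihood ratio can be checked with a few lines of Python. This is only an illustrative sketch: the function and variable names are my own, and the figures are the hypothetical ones used in the text (the AB peaks are certain under Hp, and 1 in 50 unrelated people are type AB).

```python
from fractions import Fraction

def likelihood_ratio(p_data_given_hp, p_data_given_hd):
    """Probability of the evidence under Hp divided by its probability under Hd."""
    return Fraction(p_data_given_hp) / Fraction(p_data_given_hd)

# Hypothetical figures from the text.
p_dab_given_hp = Fraction(1)      # AB suspect is the contributor: AB peaks are certain
p_dab_given_hd = Fraction(1, 50)  # unrelated contributor: probability of inclusion PI

lr_simple = likelihood_ratio(p_dab_given_hp, p_dab_given_hd)
print(lr_simple)  # 50
```

Exact fractions are used rather than floating-point numbers so that the result matches the hand calculation digit for digit.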

Two Peaks, Two Inferred Genotypes with Probabilities for Each Genotype: A PGS?

This much is straightforward, conventional thinking. But an AB contributor is not the only conceivable explanation for the two peaks. Maybe they reflect DNA from an AA individual (one who inherited the fragment of length A from both parents), and the B is just an artifact known as “stutter” (Brookes et al. 2012). If this possibility cannot be dismissed as wildly improbable (as it could be if, for example, the putative stutter peak were far from the A peak), then the analysis should take into account both AA and AB as possible contributor profiles.

One way to do so would be to study the detection probability P(DAB) in experiments with samples from AA and AB contributors. Suppose that a large number of such experiments showed that when the contributor is AA, the probability of detecting AB is P(DAB|CAA) = 1/10 and that when the contributor is AB, the probability is P(DAB|CAB) = 1. Sometimes, AA contributors produce AB peaks; AB contributors always do.

In a case in which the suspect is type AB, what is the evidentiary value of the two peaks A and B? The suspect is still AB, so P(DAB|Hp) is unchanged at 1. But the denominator of the LR, P(DAB|Hd), requires us to consider the probability that the contributor’s profile is AA as well as the probability that it is AB. Imagine that the laboratory receives crime-scene samples with DNA profiles that are representative of a population in which 1 in 100 people are AA and (as stated before) 1 in 50 are AB. Because only 1 in 10 DNA samples from AA contributors will appear to be AB, about 1 in 1000 samples will have the AB peaks and come from AA contributors:

P(CAA & DAB) = P(CAA) ⋅ P(DAB|CAA) = (1/100) ⋅ (1/10) = 1/1000.

More samples, about 20 per 1000, will have the AB peaks and come from AB contributors:

P(CAB & DAB) = P(CAB) ⋅ P(DAB|CAB) = (1/50) ⋅ (1) = 20/1000.

Thus, in about 20 out of 21 detections of AB peaks, the contributor is AB. (Most readers who have borne with me this far will recognize this result as a simple application of Bayes' rule for the posterior probability: P(CAB|DAB) = 20/21.)
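For readers who prefer to see the Bayes' rule step spelled out, here is a minimal Python sketch. The names are mine, the inputs are the hypothetical frequencies and detection probabilities given above, and the sketch assumes (as the toy example implicitly does) that genotypes other than AA and AB never produce the AB peaks.

```python
from fractions import Fraction

# Hypothetical population frequencies of the contributor genotypes (from the text).
prior = {"AA": Fraction(1, 100), "AB": Fraction(1, 50)}

# Hypothetical probability of detecting A and B peaks given each contributor genotype.
detect_ab = {"AA": Fraction(1, 10), "AB": Fraction(1)}

# Joint probabilities P(C & DAB) = P(C) * P(DAB|C).
joint = {g: prior[g] * detect_ab[g] for g in prior}

# Bayes' rule: divide each joint probability by the total probability of DAB.
total = sum(joint.values())
posterior = {g: joint[g] / total for g in joint}

print(joint["AA"], joint["AB"])          # 1/1000 1/50
print(posterior["AA"], posterior["AB"])  # 1/21 20/21
```

The joint probability for AB prints as 1/50 because the fraction 20/1000 is reduced to lowest terms.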

A PGS thus could assign probabilities of P(CAA|DAB) = 1/21 and P(CAB|DAB) = 20/21 to the two possible contributor genotypes. The hypothesis Hd is now that the peaks come either from an unrelated person who is AA or, as before, from an unrelated person who is AB. If the suspect is not the source and if the apparent AB profile really is AA (which has probability 1/21), Hd requires that a random, unrelated person be type AA (an event that has probability P(RAA) = 1/100). Likewise, if the suspect is not the source and the apparent AB profile really is AB (which has probability 20/21), then Hd requires that a random, unrelated person be type AB (an event that has probability P(RAB) = 1/50). Consequently, the probability of the evidence DAB given Hd is

         P(DAB|Hd) = P(RAA) ⋅ P(CAA|DAB) + P(RAB) ⋅ P(CAB|DAB)
                    = (1/100) (1/21) + (1/50) (20/21) = 41/2100 = 0.0195,

which is very close to the previous denominator of 1/50 = 0.020. The resulting LR is 2100/41 = 51.2.
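The weighted denominator and the resulting likelihood ratio can be reproduced as follows. This sketch simply follows the formula in this post with the same hypothetical numbers; the variable names are mine, and it is not a description of how any actual PGS is implemented.

```python
from fractions import Fraction

# Genotype probabilities a toy PGS would assign given the AB peaks (from the text).
posterior = {"AA": Fraction(1, 21), "AB": Fraction(20, 21)}

# Hypothetical probabilities that a random, unrelated person has each genotype.
random_match = {"AA": Fraction(1, 100), "AB": Fraction(1, 50)}

# P(DAB|Hd) = P(RAA) * P(CAA|DAB) + P(RAB) * P(CAB|DAB)
p_dab_given_hd = sum(random_match[g] * posterior[g] for g in posterior)

# P(DAB|Hp) = 1 because the suspect is type AB.
lr = Fraction(1) / p_dab_given_hd

print(p_dab_given_hd)       # 41/2100
print(round(float(lr), 1))  # 51.2
```

Compared with the likelihood ratio of 50 obtained when AB was treated as certain, allowing for the AA-plus-stutter explanation moves the figure only slightly under these assumed inputs.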

The Probability in PGS

This toy model of a PGS only used information about peak location and only mentioned a stutter peak as a source of uncertainty in the contributor's genotype. A more sophisticated PGS would use peak heights as well and would attend to allele drop-in and drop-out and other complicating features. The most complete models dispense with the rules of thumb (“analytical thresholds,” “stochastic thresholds,” and “peak-height ratios”) that human examiners employ to decide whether a peak is high enough to count as real, what to do with it in computing a likelihood ratio, and what potential genotypes to cross off the list of possibilities when confronted with a mixture of DNA from several contributors (Kelly et al. 2014).

I do not propose to explain these matters any better than SWGDAM has. My purpose here has been to clarify just what is “probabilistic” about a PGS. The key point is not that the system produces a likelihood ratio as opposed to a probability of exclusion or inclusion. Likelihood ratios also apply to categorical inferences as to what profiles are present in a mixed sample. A PGS is distinctive because it assigns probabilities to the possible profiles and uses more information to arrive at what, one hopes, is a better likelihood ratio for the hypotheses about whether a suspect is a contributor.

  • C. Brookes, J.A. Bright, S. Harbison, J. Buckleton, Characterising Stutter in Forensic STR Multiplexes, 6 Forensic Sci. Int’l: Genetics 58-63 (2012)
  • David H. Kaye et al., The New Wigmore on Evidence: Expert Evidence (2d ed. 2011)
  • Hannah Kelly, Jo-Anne Bright, John S. Buckleton, James M. Curran, A Comparison of Statistical Models for the Analysis of Complex Forensic DNA Profiles, 54 Sci. & Justice 66–70 (2014)

Thursday, October 22, 2015

SWGDAM Guidelines on "Probabilistic Genotyping Systems" (Part 1)

In June, the Scientific Working Group on DNA Analysis Methods (SWGDAM) approved new “Guidelines for the Validation of Probabilistic Genotyping Systems.” 1/ They begin,
Guidance is provided herein for the validation of probabilistic genotyping software used for the analysis of autosomal short tandem repeat (STR) typing results. These guidelines are not intended to be applied retroactively. It is anticipated that they will evolve with future developments in probabilistic genotyping systems.
These three sentences raise four questions. First, is the phrase “probabilistic genotyping system” (PGS) the best label? I will get to the question of what “probabilistic” means a little later, but given the perception of segments of the public and the legal community that “autosomal short tandem repeat (STR) results” are “very likely” “to reveal predispositions to diseases in the individuals being profiled as well as their siblings and offspring,” 2/ is “genotyping” the right word to use for identifying DNA variations that are not genes? A more neutral term such as “probabilistic typing systems” might be less suggestive.

Second, why do the drafters of standards and guidelines prefer stilted writing—“guidance is provided herein”—as opposed to plain English sentences such as “This document offers guidance”? I know this kind of criticism is small potatoes, but scientists are smart enough to be good writers.

Third, what are the drafters trying to say with the doubly passively voiced sentence, “These guidelines are not intended to be applied retroactively”? Who should not apply these standards retroactively? One would think that the guidelines are for laboratories, but how could a laboratory apply a recommendation retroactively? It cannot go back in time to validate software that it has been using even though neither it nor the developer had validated the software in the manner that SWGDAM now recommends. The only thing the laboratory could do to give retroactive effect to the new advice would be to use some better validated software on data from old cases and advise prosecutors, defendants, or defense lawyers of major discrepancies. Is SWGDAM saying that looking back at past cases (for research or other purposes) would be wrong? Or merely that SWGDAM is taking no position on the desirability of undertaking such retrospective analyses? Or is this part of the guidelines written for a different audience: courts that might be asked to grant postconviction relief? But unless every PGS was adequately validated, surely courts should consider what these guidelines have to say as relevant to (but not necessarily dispositive of) whether the laboratory’s earlier report was scientifically acceptable. Most courts can be expected to appreciate the fallacy of the argument that "because the world gets wiser as it gets older, therefore it was foolish before." 3/

Fourth, why does SWGDAM anticipate that “future developments in probabilistic genotyping systems” will cause these standards to “evolve”? The principles of good software development and validation do not depend on the specific programs. Those principles may evolve whether or not PGSs improve over time. Of course, the guidelines could change if the programs become so superior that SWGDAM would reconsider its view (expressed in the next paragraph) that the only permissible use of a PGS is “to assist the DNA analyst in the interpretation of forensic DNA typing results.” Is SWGDAM envisioning that it could reverse its opinion that “Probabilistic genotyping is not intended to replace the human evaluation of the forensic DNA typing results” because of “future developments in [PGS]”? In light of current problems with human interpretations of mixtures of minute quantities, there are observers who would welcome replacing the current protocols for interpreting these samples with valid and reliable automated expert or probabilistic systems.


1. Scientific Working Group on DNA Analysis Methods, Guidelines for the Validation of Probabilistic Genotyping Systems, June 15, 2015

2. Gary R. Skuse & Anne M. Burger, Justice as Fairness: Forensic Implications of DNA and Privacy, Champion, Apr. 2015, at 24. For a more authoritative assessment, see Henry T. Greely & David H. Kaye, A Brief of Genetics, Genomics and Forensic Science Researchers in Maryland v. King, 53 Jurimetrics J. 43 (2013).

3. Hart v. Lancashire & Yorkshire Ry. Co., 21 L.T.R. N.S. 261, 263 (1869).