Efforts to Improve the Accuracy of Our Judgments and Forecasts

A passage excerpted for the Targa heategija (“wise giver”) short course.

Our grantmaking decisions rely crucially on our uncertain, subjective judgments — about the quality of some body of evidence, about the capabilities of our grantees, about what will happen if we make a certain grant, about what will happen if we don’t make that grant, and so on.

In some cases, we need to make judgments about relatively tangible outcomes in the relatively near future, as when we have supported campaigning work for criminal justice reform. In others, our work relies on speculative forecasts about the much longer term, as for example with potential risks from advanced artificial intelligence. We often try to quantify our judgments in the form of probabilities — for example, in the former case we estimated a 20% chance of success for a particular campaign, while in the latter we estimated a 10% chance that a particular sort of technology will be developed in the next 20 years.

We think it’s important to improve the accuracy of our judgments and forecasts if we can. I’ve been working on a project to explore whether there is good research on the general question of how to make accurate forecasts, and whether there are specialists in this topic who might help us do so. Some preliminary thoughts follow.

I first discuss credence calibration training, since I think it is a good introduction to the kinds of tangible things one can do to improve forecasting ability.

Calibration training

An important component of accuracy is called “calibration.” If you are “well-calibrated,” then statements (including predictions) you make with 30% confidence are true about 30% of the time, statements you make with 70% confidence are true about 70% of the time, and so on.

Without training, most people are not well-calibrated, but instead overconfident. Statements they make with 90% confidence might be true only 70% of the time, and statements they make with 75% confidence might be true only 60% of the time.2 But it is possible to “practice” calibration by assigning probabilities to factual statements, then checking whether the statements are true, and tracking one’s performance over time. In a few hours, one can practice on hundreds of questions and discover patterns like “When I’m 80% confident, I’m right only 65% of the time; maybe I should adjust so that I report 65% for the level of internally-experienced confidence I previously associated with 80%.”
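
As a rough illustration of the bookkeeping this kind of practice involves, here is a minimal Python sketch that bins a log of answered questions by stated confidence and compares each bin’s stated confidence with its observed hit rate. The records below are invented for illustration, not real training data.

```python
from collections import defaultdict

# Each record is (stated confidence, whether the statement turned out to be true).
# These are made-up examples of the kind of log one might keep during practice.
answers = [
    (0.9, True), (0.9, False), (0.9, True), (0.9, True), (0.9, False),
    (0.7, True), (0.7, True), (0.7, False), (0.7, False),
    (0.5, True), (0.5, False),
]

# Group answers by the confidence level at which they were stated.
by_confidence = defaultdict(list)
for confidence, was_true in answers:
    by_confidence[confidence].append(was_true)

# Compare each stated confidence level with its observed hit rate.
for confidence in sorted(by_confidence):
    outcomes = by_confidence[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {confidence:.0%}: correct {hit_rate:.0%} of the time "
          f"({len(outcomes)} statements)")
```

On this made-up log, statements tagged 90% were correct only 60% of the time, which is the sort of overconfidence pattern the practice is meant to reveal and correct.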

I recently attended a calibration training webinar run by Hubbard Decision Research, which was essentially an abbreviated version of the classic calibration training exercise described in Lichtenstein & Fischhoff (1980). It was also attended by two participants from other organizations, who did not seem to be familiar with the idea of calibration and, as expected, were grossly overconfident on the first set of questions.3 But, as the training continued, their scores on the question sets began to improve until, on the final question set, they both achieved perfect calibration.

For me, this was somewhat inspiring to watch. It isn’t often that a cognitive skill as useful and domain-general as probability calibration can be trained, with such dramatic, objectively measured improvement, in so short a time.

The research I’ve reviewed broadly supports this impression. For example:

  • Rieber (2004) lists “training for calibration feedback” as his first recommendation for improving calibration, and summarizes a number of studies indicating both short- and long-term improvements in calibration.4 In particular, decades ago, Royal Dutch Shell began to provide calibration training for its geologists, who are now (reportedly) quite well-calibrated when forecasting which sites will produce oil.5
  • Since 2001, Hubbard Decision Research has trained over 1,000 people across a variety of industries. Analyzing the data from these participants, Doug Hubbard reports that 80% of people achieve perfect calibration (on trivia questions) after just a few hours of training. He also claims that, according to his data and at least one controlled (but not randomized) trial, this training predicts subsequent real-world forecasting success.6

I should note that calibration isn’t sufficient by itself for good forecasting. For example, you can be well-calibrated on a set of true/false statements, for which about half the statements happen to be true, simply by responding “True, with 50% confidence” to every statement. This performance would be well-calibrated but not very informative. Ideally, an expert would assign high confidence to statements that are likely to be true, and low confidence to statements that are unlikely to be true. An expert who can do so is not just well-calibrated, but also exhibits good “resolution” (sometimes called “discrimination”). Combining calibration and resolution yields overall accuracy, which is typically measured with a “proper scoring rule.”7 The calibration trainings described above sometimes involve proper scoring rules, and likely train people to be well-calibrated while exhibiting at least some resolution, though the main benefit they seem to have (based on the research and my observations) pertains to calibration specifically.
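
To make the relationship between calibration, resolution, and a proper scoring rule more concrete, here is a small Python sketch of the Brier score (one standard proper scoring rule) along with its classic Murphy decomposition into a calibration (reliability) term, a resolution term, and an irreducible uncertainty term. The example data reproduce the “always answer 50%” forecaster described above and are otherwise invented.

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Brier score and its Murphy decomposition:
    brier = reliability - resolution + uncertainty.
    Lower is better: good calibration shrinks the reliability term, and good
    resolution (discrimination) grows the resolution term, which is subtracted."""
    n = len(forecasts)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    base_rate = sum(outcomes) / n

    # Group observed outcomes by the probability that was assigned to them.
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = sum(len(obs) * (f - sum(obs) / len(obs)) ** 2
                      for f, obs in groups.items()) / n
    resolution = sum(len(obs) * (sum(obs) / len(obs) - base_rate) ** 2
                     for obs in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return brier, reliability, resolution, uncertainty

# The "always answer 50%" forecaster: perfectly calibrated on these invented
# outcomes (half are true), but with zero resolution.
forecasts = [0.5] * 10
outcomes = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(brier_decomposition(forecasts, outcomes))  # (0.25, 0.0, 0.0, 0.25)
```

The reliability term of 0.0 reflects perfect calibration, while the resolution term of 0.0 shows that these forecasts carry no information beyond the base rate.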

The primary source of my earlier training in calibration was a game intended to automate the process. The Open Philanthropy Project is now working with developers to create a more extensive calibration training game for our staff; we will also make the game available publicly.

Further advice for improving judgment accuracy

Below I list some common advice for improving judgment and forecasting accuracy (in the absence of strong causal models or much statistical data) that has at least some support in the academic literature, and which I find intuitively likely to be helpful.8

  1. Train probabilistic reasoning: In one especially compelling study (Chang et al. 2016), a single hour of training in probabilistic reasoning noticeably improved forecasting accuracy.9 Similar training has improved judgmental accuracy in some earlier studies,10 and is sometimes included in calibration training.11
  2. Incentivize accuracy: In many domains, incentives for accuracy are overwhelmed by stronger incentives for other things, such as incentives for appearing confident, being entertaining, or signaling group loyalty. Some studies suggest that accuracy can be improved merely by providing sufficiently strong incentives for accuracy such as money or the approval of peers.12
  3. Think of alternatives: Some studies suggest that judgmental accuracy can be improved by prompting subjects to consider alternative hypotheses.13
  4. Decompose the problem: Another common recommendation is to break each problem into easier-to-estimate sub-problems (a toy example appears after this list).14
  5. Combine multiple judgments: Often, a weighted (and sometimes “extremized”15) combination of multiple subjects’ judgments outperforms the judgments of any one person (a pooling sketch also follows this list).16
  6. Correlates of judgmental accuracy: According to some of the most compelling studies on forecasting accuracy I’ve seen,17 correlates of good forecasting ability include “thinking like a fox” (i.e. eschewing grand theories for attention to lots of messy details), strong domain knowledge, general cognitive ability, and high scores on “need for cognition,” “actively open-minded thinking,” and “cognitive reflection” scales.
  7. Prediction markets: I’ve seen it argued, and find it intuitive, that an organization might improve forecasting accuracy by using prediction markets. I haven’t studied the performance of prediction markets yet.
  8. Learn a lot about the phenomena you want to forecast: This one probably sounds obvious, but I think it’s important to flag, to avoid leaving the impression that forecasting ability is more cross-domain/generalizable than it is. Several studies suggest that accuracy can be boosted by having (or acquiring) domain expertise. A commonly-held hypothesis, which I find intuitively plausible, is that calibration training is especially helpful for improving calibration, and that domain expertise is helpful for improving resolution.18
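
As a toy illustration of item 4, here is a Fermi-style decomposition in Python: instead of guessing a hard quantity directly, one estimates easier sub-quantities and combines them. The question, the sub-quantities, and every number below are invented placeholders, not real estimates.

```python
# Hypothetical question: "How many grant applications will we receive next year?"
# All figures are invented placeholders for illustration only.
eligible_orgs = 5_000          # rough guess at organizations in scope
share_aware_of_program = 0.10  # fraction that hear about the program
apply_rate_if_aware = 0.20     # fraction of those that actually apply

applications = eligible_orgs * share_aware_of_program * apply_rate_if_aware
print(f"Point estimate: {applications:.0f} applications")

# A simple refinement: carry low/high guesses through the same decomposition
# to see how uncertainty in each sub-estimate compounds.
low = 3_000 * 0.05 * 0.10
high = 8_000 * 0.20 * 0.35
print(f"Rough range: {low:.0f} to {high:.0f}")
```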
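
And as a sketch of item 5, one common way to pool several forecasters’ probabilities is to average them in log-odds space and then “extremize” the result by multiplying the averaged log-odds by a factor greater than one, which pushes the pooled forecast away from 50%. The forecasts and the extremizing factor below are arbitrary illustrations, not a recommended procedure or setting.

```python
import math

def pool_and_extremize(probs, a=2.0):
    """Average probability forecasts in log-odds space, then 'extremize' by
    multiplying the averaged log-odds by a factor a > 1, which pushes the
    pooled forecast away from 50%. The value a=2.0 is an arbitrary example."""
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-a * pooled))

# Three hypothetical forecasters lean the same way; the pooled, extremized
# forecast ends up more confident than any individual forecast.
print(pool_and_extremize([0.65, 0.70, 0.75]))  # roughly 0.85
```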

Another interesting takeaway from the forecasting literature is the degree to which, and the consistency with which, some experts exhibit better accuracy than others. For example, tournament-level bridge players tend to show reliably good accuracy, whereas TV pundits, political scientists, and professional futurists seem not to.19 A famous recent result in comparative real-world accuracy comes from a series of IARPA forecasting tournaments, in which ordinary people competed with each other and with professional intelligence analysts (who also had access to expensively collected classified information) to forecast geopolitical events. As reported in Tetlock & Gardner’s Superforecasting, forecasts made by combining (in a certain way) the forecasts of the best-performing ordinary people were repeatedly more accurate than those of the trained intelligence analysts.