The top left panel has a lot of pdfs for the equilibrium climate sensitivity, generated by various authors, with their 90% probability ranges presented as bars on the right. In the bottom right panel, we have the range of sensitivity values from the CMIP3 ensemble (pale blue dots). Clearly, the spread of the latter is narrower than most/all of the former, which is the basis for the IPCC statement. There are numerous examples of similar statements in the literature, too (not exclusively restricted to climate sensitivity).

Many of the pdfs are based on some sort of Bayesian inversion of the warming trend over the 20th century (often, both surface air temp and ocean heat uptake data are used). This calculation requires a prior pdf for the sensitivity and perhaps other parameters. And herein lies the root of the problem.

Consider the following trivial example: We have an ensemble of models, each of which provides an output "X" that we are interested in. Let's assume that this set of values is well approximated by the standard Gaussian N(0,1) (writing N(mean, standard deviation) throughout). Now, let's also assume we have a single observation which takes the value 1.5, and which has an associated observational uncertainty (standard deviation) of 2. The IPCC-approved method for evaluating the ensemble is to perform a Bayesian inversion on the observation, which in this trivial case will (assuming a uniform prior) result in the "observationally-constrained pdf" for X of N(1.5,2).

It seems that the model ensemble is a bit biased, and substantially too narrow, and therefore does not cover the "full range of uncertainty" according to the observation, right?

No, actually, this inference is dead wrong. Perhaps the most striking and immediate way to convince yourself of this is to note that if this method were valid, then it would not matter what value was observed - so long as it had an observational uncertainty of 2, we would automatically conclude that the ensemble was too narrow and (with some non-negligible probability) did not include the truth. Therefore, we could write down this conclusion without even bothering to make this inaccurate observation at all, just by threatening to do so. And what's worse, the more inaccurate the (hypothetical) observation is, the worse our ensemble will appear to be! I hope it is obvious to all that this state of affairs is nonsensical. An observation cannot cause us to reject the models more strongly as it gets more inaccurate - rather, the limiting case of a worthless observation tells us absolutely nothing at all.
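To make that concrete, here is a minimal sketch of the inversion for a few hypothetical observed values and uncertainties. It assumes the uniform prior is wide enough that the posterior is simply N(obs, sigma_obs); the function and variable names are mine, not from any published analysis.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# 90% range of the N(0,1) model ensemble.
ENSEMBLE_90 = (-1.64, 1.64)


def mass_inside_ensemble_range(obs, sigma_obs):
    """Posterior probability that X lies in the ensemble's 90% range.

    Assumes a wide uniform prior, so the posterior is N(obs, sigma_obs).
    """
    lo, hi = ENSEMBLE_90
    return norm_cdf((hi - obs) / sigma_obs) - norm_cdf((lo - obs) / sigma_obs)


# Hypothetical observations: the verdict "ensemble too narrow" never changes.
for sigma_obs in (2.0, 4.0, 8.0):
    for obs in (-1.5, 0.0, 1.5):
        p = mass_inside_ensemble_range(obs, sigma_obs)
        print(f"obs={obs:+.1f}  sigma={sigma_obs:.0f}  P(X in ensemble 90% range)={p:.2f}")
```

Even an observation sitting exactly on the ensemble mean leaves well under 90% of the posterior inside the ensemble range, and the figure only drops as sigma_obs grows.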

That's all very well as a theoretical point, but it needs a practical application. So we also performed a similar sort of calculation for a more realistic scenario, more directly comparable to the IPCC situation. Using a simple energy balance model (actually the two-box model discussed by Isaac Held here, which dates at least to Gregory if not before), we used surface air temperature rise and ocean heat uptake as constraints on sensitivity and the ocean heat uptake efficiency parameter. The following fig shows the results of this, along with an ensemble of models (blue dots) which are intended to roughly represent the CMIP3 ensemble (in that they have a similar range of equilibrium sensitivity, ocean heat uptake efficiency, and transient climate sensitivity).

The qualitative similarity of this figure to several outputs of the Forest, Sokolov et al group is not entirely coincidental, and it should be clear that if we integrate out the ocean heat uptake efficiency, the marginal distributions for sensitivity (of the Bayesian estimate, and "CMIP3" ensemble) will be qualitatively similar to those in the IPCC figure, with the Bayesian pdf of course having a greater spread than the "CMIP3" proxy ensemble. Just as in the trivial Gaussian case above, we can check that this will remain true irrespective of the actual value of the observations made. Thus, we have another case where it may seem intuitively reasonable to state that the ensemble "may not represent the full range of uncertainty", but in fact it is clear that this conclusion could, if valid, be stated without the need to trouble ourselves by actually making any observations. Therefore, it can hardly be claimed that this result was due to the observations actually indicating any problem with the ensemble.

So let's have another look at what is going on.

The belief that the posterior pdf correctly represents the researchers' views depends on the prior also correctly representing their prior views. But in this case, the low confidence in the models is imposed at the outset, and is not something generated by the observations. In the trivial Gaussian case, the models represent the prior belief that X should (with 90% probability) lie in [-1.64,1.64], but a uniform prior on [-10,10] only assigns 16% probability to this range. The posterior probability of this range, once we update with the observation 1.5±2, has actually tripled to 47%. Similarly, in the energy balance example, the prior we used only assigns 28% probability to the 90% spread of the models, and this probability doubles to 56% in the posterior. So the correct interpretation of the results is not that the observations have shown up any limitation in the model ensemble, but rather that, if one starts out with a strong prior presumption that the models are unlikely to be right, then although the observations actually substantially increase our faith in the models, they are not sufficient to persuade us to be highly confident in them.
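The 16% and 47% figures for the trivial Gaussian case can be checked in a few lines. This is just a sketch using the numbers quoted above (uniform prior on [-10,10], observation 1.5 with standard deviation 2); the variable names are mine.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


LO, HI = -1.64, 1.64                 # 90% range of the N(0,1) ensemble
PRIOR_LO, PRIOR_HI = -10.0, 10.0     # support of the uniform prior
OBS, SIGMA_OBS = 1.5, 2.0            # observation and its standard deviation

# Under a uniform prior, probability is just relative interval length.
prior_mass = (HI - LO) / (PRIOR_HI - PRIOR_LO)


def likelihood_mass(a, b):
    """Integral of the Gaussian likelihood N(OBS, SIGMA_OBS) over [a, b]."""
    return norm_cdf((b - OBS) / SIGMA_OBS) - norm_cdf((a - OBS) / SIGMA_OBS)


# Posterior is the likelihood truncated to the prior's support and renormalised.
posterior_mass = likelihood_mass(LO, HI) / likelihood_mass(PRIOR_LO, PRIOR_HI)

print(f"prior P(X in ensemble range)     = {prior_mass:.0%}")
print(f"posterior P(X in ensemble range) = {posterior_mass:.0%}")
```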

Fortunately, there is an alternative way of looking at things, which is to see how well the ensemble (or more generally, probabilistic prediction) actually predicted the observation. This is not new, of course - quite the reverse, it is surely how most people have always evaluated predictions. There is a minor detail which is important to be aware of, which is that if the observation is inaccurate, then we must generate a prediction of the observation, rather than the truth, in order for the evaluation to be fair. (Without this detail, a mismatch between prediction and observation may be due to observational error, and it would be incorrect to interpret this as a predictive failure). One important benefit of this "forward" procedure is that it takes place entirely in observation-space, so we don't need to presume any direct correspondence between the internal parameters of the model, and the real world. It also eliminates the need to perform any difficult inversions of observational procedures.

For the trivial numerical example, the predictive distribution for the observation is given by N(0,2.2) (with 2.2 being sqrt(1^2+2^2), since the predictive and observational uncertainties are independent and add in quadrature). That is the solid blue curve in the following figure:

The observed value of 1.5 obviously lies well inside the predictive interval. Therefore, it is hard to see how this observation can logically be interpreted as reducing our confidence in the models. We can also perform a Bayesian calculation, starting with a prior that is based on the ensemble, and updating with the observation. In this case, the posterior (magenta dotted curve above) is N(0.3,0.9) and this assigns a slightly increased probability of 92% to the prior 90% probability range of [-1.64,1.64]. Thus, the analysis shows that if we started out believing the models, the observation would slightly enhance our confidence in them.
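Both calculations for the trivial example (the forward predictive check, and the Bayesian update starting from the ensemble as prior) are short enough to sketch here. The code assumes independent Gaussian errors, as in the text, and uses the standard conjugate-Gaussian update; the variable names are mine.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


ENS_MEAN, ENS_SD = 0.0, 1.0      # model ensemble, N(0,1)
OBS, SIGMA_OBS = 1.5, 2.0        # observation and its standard deviation

# Forward check: predict the *observation*, not the truth, so the
# ensemble spread and observational error add in quadrature.
pred_sd = sqrt(ENS_SD**2 + SIGMA_OBS**2)
z_score = (OBS - ENS_MEAN) / pred_sd

# Bayesian update with the ensemble as prior (precisions add).
post_var = 1.0 / (1.0 / ENS_SD**2 + 1.0 / SIGMA_OBS**2)
post_mean = post_var * (ENS_MEAN / ENS_SD**2 + OBS / SIGMA_OBS**2)
post_sd = sqrt(post_var)


def mass(lo, hi, mean, sd):
    """Probability that a N(mean, sd) variable lies in [lo, hi]."""
    return norm_cdf((hi - mean) / sd) - norm_cdf((lo - mean) / sd)


post_mass = mass(-1.64, 1.64, post_mean, post_sd)

print(f"predictive sd = {pred_sd:.2f}, observation z-score = {z_score:.2f}")
print(f"posterior = N({post_mean:.1f}, {post_sd:.1f})")
print(f"posterior P(X in prior 90% range) = {post_mass:.0%}")
```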

For the more realistic climate example, the comparison is performed between the actual air temperature trends of the models, and their ocean heat gains. The red dot in the figure below is the pair of observed values:

This shows good agreement for the energy balance models (blue dots - the solid contours are the predictive distribution accounting for observational uncertainty), and also for the real CMIP3 models (purple crosses), so again the only conclusion we can reasonably draw from these comparisons is that these observations fail to show any weakness in the models.

The take-home point is that observations can only conflict with a probabilistic prediction (such as that arising from the simple "democratic" interpretation of the IPCC ensemble) through being both outside (in the extreme tail of) the model range, and also precise, such that they constrain the truth to lie outside the predictive range. While this may seem like a rather trivial point, I think it's an important one to present, in view of how the erroneous but intuitive interpretation of these Bayesian inversions has come to dominate the consensus viewpoint. It was a pleasant surprise (especially after this saga) that it sped through the review process with rather encouraging comments.
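As a rough illustration of the take-home point, the sketch below scores four hypothetical observations against the N(0,1) ensemble from the trivial example. The observed values, the uncertainties and the 5% threshold are all my own illustrative choices, not taken from the analysis itself.

```python
from math import erf, sqrt

ENS_SD = 1.0  # spread of the N(0,1) model ensemble


def two_sided_p(obs, sigma_obs):
    """Two-sided tail probability of obs under the predictive distribution.

    The predictive distribution for the observation is
    N(0, sqrt(ENS_SD^2 + sigma_obs^2)).
    """
    pred_sd = sqrt(ENS_SD**2 + sigma_obs**2)
    z = abs(obs) / pred_sd
    return 1.0 - erf(z / sqrt(2.0))  # equals 2 * (1 - Phi(z))


# Hypothetical cases: (observed value, observational standard deviation).
cases = {
    "inside range, imprecise": (1.5, 2.0),
    "outside range, imprecise": (3.0, 2.0),
    "inside range, precise": (1.5, 0.1),
    "outside range, precise": (3.0, 0.1),
}

for label, (obs, sigma_obs) in cases.items():
    p = two_sided_p(obs, sigma_obs)
    verdict = "conflicts" if p < 0.05 else "no conflict"
    print(f"{label:26s} p={p:.3f}  {verdict}")
```

Only the observation that is both far out in the tail and precise ends up conflicting with the prediction.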