**Santer, B. D.,** K. Taylor and L. Corsetti, 1995: Statistical evaluation of AMIP model performance. Abstracts of the First International AMIP Scientific Conference, Monterey, California, 11.

This investigation explores the usefulness of different statistical measures of model performance in the context of the AMIP experiment. We employ a suite of univariate and multivariate statistics to compare monthly mean output from the AMIP simulations with observed data. The statistics provide information on model errors in simulation of the mean state, temporal and spatial variability, the time-mean spatial pattern, and the time evolution of spatial patterns. Model versus observed comparisons are shown for sea level pressure (SLP), total cloud cover, and surface air temperature, using results from 25-30 AMIP models. We consider the practical significance of model versus observed differences, as opposed to the purely statistical significance of these differences. Formal statistical significance is often difficult to judge: its determination depends on basic assumptions regarding the data being tested (e.g., their spatial and temporal autocorrelation structure and equality of variances). These assumptions are frequently violated. In assessing practical significance, however, one is not directly concerned with these issues, and instead considers whether model versus observed differences are large enough to indicate real model deficiencies. Here we use two "yardsticks" to evaluate whether model errors are of practical importance:

The differences between two independently-derived observational data sets. This is a measure of observational uncertainty.

The differences between individual pairs of "initial condition realizations" of the AMIP experiment. This measures uncertainties due to inherently unpredictable atmospheric variability.

For both yardsticks, the degree of separation between different observed data sets or initial condition realizations is determined by computing the same statistics used in model-data intercomparisons. The key findings of our study are as follows:

Model errors are complex, and better-than-average performance in simulating the mean state does not necessarily translate to better-than-average performance in simulating the time-mean spatial field or temporal variability. This emphasizes the point that there is no "universal" statistic that quantifies all aspects of model errors. Validation studies should therefore use a suite of statistics to characterize these errors.

For the three variables considered (SLP, total cloud cover, and surface air temperature), the model versus observed differences, as characterized by a suite of univariate and multivariate statistics, were generally much larger than the differences between two independent data sets or between different initial-condition realizations. The implication is that for these three fields, model errors in the AGCMs participating in AMIP are currently larger than our observational uncertainties and larger than the differences that we would expect due to inherently unpredictable atmospheric variability.

The dimensionless statistics applied allow one to compare the fidelity with which the AMIP models simulate different climate variables. For the three variables examined here, the smallest errors in the simulation of the climatological annual mean state are for surface air temperature and the largest errors are for cloud cover.
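
The yardstick comparison described above can be illustrated with a minimal sketch. The abstract does not name the individual statistics, so the normalized RMS difference used here is an assumption chosen only as one example of a dimensionless measure. The function names and the unweighted (no area weighting) averaging are likewise hypothetical simplifications, not the authors' method.

```python
import numpy as np


def normalized_rms_difference(field_a, field_b):
    """RMS difference between two fields, scaled by the standard deviation
    of field_b so the result is dimensionless.

    Assumption: an illustrative statistic, not one named in the abstract.
    No area weighting is applied.
    """
    a = np.asarray(field_a, dtype=float)
    b = np.asarray(field_b, dtype=float)
    rms = np.sqrt(np.mean((a - b) ** 2))
    return rms / np.std(b)


def exceeds_yardsticks(model, obs_1, obs_2, realization_1, realization_2):
    """Return True if the model-versus-observed difference is larger than
    both yardsticks, each computed with the same statistic."""
    # Model error relative to one observational data set.
    model_error = normalized_rms_difference(model, obs_1)
    # Yardstick 1: separation between two independent observational data sets.
    obs_uncertainty = normalized_rms_difference(obs_2, obs_1)
    # Yardstick 2: separation between two initial-condition realizations.
    internal_variability = normalized_rms_difference(realization_2, realization_1)
    return model_error > max(obs_uncertainty, internal_variability)
```

Because the statistic is scaled by the variability of the reference field, values computed for different variables (for example, surface air temperature and cloud cover) can be compared directly, which is the property the dimensionless statistics in the abstract rely on.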