Some Earth Systems Cautions

Before diving into CMIP6 data, it helps to have a quick primer on a few caveats in the data related to the way the Earth itself works. Everything we’re about to say here was already said better in this Nature Climate Change article by Fiedler et al., so if you’re intending to make serious money decisions off the back of this data, the subscription is well worth the price.

First, this is all just a simulation

The most important caveat is that all this data is just a simulation. Actually, lots of simulations. Even the “historical” data is a simulation, not actual measurements. What climate researchers do is develop a model that tries to represent how various Earth systems work, and then they run it over and over and over again with slightly different initial conditions and random variations. Most of those variants produce roughly the same result. A few outliers produce weird results in weird places thanks to one random feedback loop or another. Which one of them is right? We have absolutely no way of knowing. So what do we do? We average them all and hope for the best.

If you want to consume this data in the most statistically honest way, then you need to actually look at each of those variant runs and consider them all, collectively, as an “ensemble,” keeping fully in mind the variability and confidence intervals that reflect their real uncertainty.
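To make that concrete, here’s a minimal sketch of ensemble statistics in Python with xarray, assuming you’ve already stacked the variant runs along a “member” dimension of one DataArray (the function and dimension names here are ours, for illustration, not a standard API):

```python
import xarray as xr

def ensemble_summary(tas: xr.DataArray) -> xr.Dataset:
    """Summarize the spread across variant runs instead of trusting any single one."""
    return xr.Dataset({
        "mean": tas.mean(dim="member"),    # the usual headline number
        "spread": tas.std(dim="member"),   # how much the runs disagree
        # a rough 90% envelope across the runs
        "p05": tas.quantile(0.05, dim="member").drop_vars("quantile"),
        "p95": tas.quantile(0.95, dim="member").drop_vars("quantile"),
    })
```

The point isn’t the exact statistics; it’s that whatever number you quote should travel with its spread.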

The best use of this data is directional, indicative trends, not ironclad decisions

This data is best used as an informative, “directional” indicator, not an absolute number. As in, if the data says it’s going to be 67.432 degrees F in May 2045 in the 100km square around your house, you shouldn’t assume it’s actually, literally going to be 67.432F in May 2045. What you should do is average all the months of May from 2040-2060 and compare them to, say, the average of all the months of May from 2000-2020, and then get a feel for what the difference will be. Maybe the 2040-2060 average is 67.6654F and the 2000-2020 average is 64.322F. What does that mean? It means it will be “noticeably hotter”. It doesn’t mean it will be 3.3434F hotter, precisely.
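Here’s what that kind of before/after comparison might look like in Python with xarray, assuming a monthly-mean temperature series with a datetime “time” coordinate (the names are ours, for illustration):

```python
import xarray as xr

def may_warming_signal(tas: xr.DataArray) -> float:
    """Compare average May temperatures across two 20-year windows."""
    mays = tas.where(tas["time.month"] == 5, drop=True)
    past = mays.sel(time=slice("2000", "2020")).mean()
    future = mays.sel(time=slice("2040", "2060")).mean()
    # read the result as "noticeably hotter/cooler", not as an exact delta
    return float(future - past)
```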

Speaking of which, don't be fooled by the data's precision

This data is the direct output of a computer simulation. When a computer divides two numbers, it might get a result out to seven decimal places. Upon seeing a temperature prediction of “32.431434535 degrees F”, one might reasonably assume “wow, that’s a super precise number! They must be really sure about it!”

Well, they’re not.

That precision is an indication of how computers work, not an indication of how confident you should be in the numbers. If you find yourself parsing differences of a few hundredths of a degree between two places, then you’re probably assuming this data has more precision than it really has, and you should rethink what you’re doing with it.
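One rough, illustrative rule of thumb (our own, not from the paper): don’t read anything into a difference that’s small compared to the run-to-run spread from the ensemble sketch above.

```python
def is_meaningful(delta_f: float, ensemble_spread_f: float) -> bool:
    """Return True only if a difference stands out from the ensemble noise."""
    return abs(delta_f) > 2.0 * ensemble_spread_f  # 2x spread is an arbitrary cutoff
```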

This data does not model extreme events well

There are a bunch of reasons why this data doesn’t model extreme weather events (like hurricanes) well. This is a known problem and an active topic of discussion in the scientific community.

For one, because we’re averaging across a bunch of runs of the same simulation, extreme events are going to wash out, even if the models modeled them well in the first place.

For another, extreme weather events actually develop at scales much smaller than the nominal 100km grid these simulations run at. As such, the models themselves don’t capture them well anyway.

For another, extreme weather events are simply uncommon. Even if the models try to throw in random “once in a thousand years” floods, almost by definition those floods won’t land inside the 100-year window the simulation covers on every run of every variation. Some variants will show them and some will not, even from the same model.

And, lastly, some extreme weather events are so unknowable they’re basically acts of God. For example, will there be a major volcanic eruption in the year 2035 that throws the climate out of whack for a decade? Who knows. It’s totally unknowable. But, if it does happen, its influence for the following decade will fully eclipse the influence of just about everything else.

So, in short, if your business use of this data depends most critically on knowing the probability of extreme weather events, this dataset might not be for you.

Don't assume anything at resolutions less than 100km

You might be tempted to look at the 100km square that contains your house and say “this is what will happen to my house!” But that’s not an entirely appropriate way to think about this data. Maybe your house sits on the edge of a lake. Maybe elevation in that 100km square varies from sea level to 5,000ft, depending on exactly where you are. If you’re a resident of the city of San Francisco, for example, imagine how dramatically different the weather is from one street to another, depending on whether you’re inside or outside the fog belt. At resolutions below 100km, local microclimates have a huge influence on what actually happens in your daily life. And climate models do not account for microclimates at all.

But there is still an appropriate way to use this data to think about your house, as long as you combine it with other data. First, take what you know about the geography inside your house’s 100km square. Is it wildly different from one end to the other, or is it pretty much the same? Are there features like mountains or lakes or oceans that will influence the weather? Are there known local phenomena, like SF’s fog belt, to take into consideration? Now, with that local knowledge in mind, look at the climate data and use it as a directional indication, like we discussed earlier. If you live in a cold belt, it will probably still be cold, but will it be less cold? If you live in a rainy place, will it get generally rainier? Those are better uses of this data with regards to your actual house.

Near-term predictions, ironically, are less indicative than long-term ones

There’s a lot of variability in the weather and the climate. Ironically, climate simulations actually do their best job (the scientific community thinks, at least) beyond the year 2050. Why? Because between now and 2050, the influences of normal weather variability, CO2 emissions, and just basic randomness are all roughly comparable in size. It’s not until 2050 or so that the influence of CO2 starts to drown out everything else in a clear, undeniable way.

So, if you’re using this data to make predictions less than a couple of decades out, that’s best done by combining it with other weather data, not by using the climate data alone.

Ok, enough for general caveats. Now for a few specific ones…

Near-surface temperature isn't soil temperature

Our temperature data is what’s specifically called the “near-surface temperature,” which is the air temperature at 2 meters above the Earth’s surface. This is not the same thing as soil temperature, so be aware. This is especially relevant if you’re trying to use this data to understand anything about how agriculture is going to work.
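For reference, CMIP6 stores near-surface (2m) air temperature as the variable “tas”, in Kelvin; soil temperature is a separate variable (“tsl”). Since we’ve been talking in Fahrenheit, here’s a tiny conversion helper (ours, for illustration):

```python
def kelvin_to_fahrenheit(tas_kelvin: float) -> float:
    """Convert a CMIP6 'tas' value from Kelvin to Fahrenheit."""
    return (tas_kelvin - 273.15) * 9.0 / 5.0 + 32.0

print(kelvin_to_fahrenheit(288.15))  # 59.0F, a mild spring day
```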

Some variables, especially humidity, are "diurnal"

Beware that humidity, especially, varies throughout the day, and in some regions can be significantly higher in the morning (think: fog) than in the afternoon. All of our numbers are the average across the day, so that’s easy to miss.
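A toy example of why that matters (the numbers here are made up): a foggy morning and a dry afternoon can average out to a humidity the air never actually held at any point that day.

```python
morning_rh = 95.0    # % relative humidity at dawn, socked in with fog
afternoon_rh = 55.0  # % relative humidity in the dry mid-afternoon
daily_mean_rh = (morning_rh + afternoon_rh) / 2.0
print(daily_mean_rh)  # 75.0 -- a value that describes neither time of day
```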