Data Needs Continuum

Two months ago, I embarked on a journey to make climate data more accessible for non-climate-scientists. Step one of that journey is trying to find a good, productive place in the business and academic landscape for us as a company to live. Specifically, this means pondering the question: "who needs this, anyway, and how do we best serve them?"

So, in thinking about our place as a company, it occurred to me there's actually a hidden structure in the way that professionals use all kinds of data, not just climate data. There's a continuum of uses for data in any given field (economics, sociology, climate, medicine, chemistry, etc.), and users in each section of that continuum have different, even contradictory, needs from their data and their tools. Let's say that continuum looks like this, with basically six sections:

[Figure: the data needs continuum]

Let's define each section on that continuum like so:

  • Direct Science:
    The edge of human knowledge in a particular field. Where physicists do physics and climate scientists study climate and biologists sequence DNA.
  • Adjacent Science:
    Scientists who use data from other fields to inform research in theirs. Where economists use psychology to inform studies of purchasing behavior, or political scientists use climate projections to study the influence of drought on political stability.
  • Sophisticated Almost-Science:
    Real science with hypothesis testing, but in a mostly-commercial setting for internal use without peer review. Where physicists extract market signals from the noise for Wall Street firms, or climate consultants do yield projections for Big Ag companies. This is the last section you'll likely find PhDs in.
  • Advanced Business Intelligence:
    Mining largely-internal data for commercial ends. Where data-savvy technologists find hidden profit drivers, operational risks, and money sinks. This is the last section you'll find fluent statisticians and coders in.
  • Basic Business Intelligence:
    Creating interactive reports and dashboards for team leaders across the business to monitor their own KPIs. Where SQL-fluent analysts create cashflow reports and conversion funnel reports. This is the last section you'll find people who know SQL in.
  • Mass Storytelling:
    Highly-specific, lightly-interactive tools and widgets for the mass-market public to use. Where journalists and TED-talkers create their graphs to change the world.

If you're a data tools vendor, you can really only make users wildly happy in one section of this continuum at a time. If you're lucky, maybe you can make one other, adjacent section reasonably happy, too. But if you try to make more than one or two sections happy at the same time, you'll end up mangling the tradeoffs that your customers care about and creating a least-common-denominator product that basically nobody is wildly happy with.

I've become obsessed with finding the right section for Pollen to focus on.

Tradeoff Triangle

Satisfying every part of the data continuum at the same time is difficult because there's a hidden tradeoff triangle that users in each section value differently. Imagine that tradeoff triangle looks like the one below, with money, time, and flexibility each sitting in its own corner, in tension with the other two:

[Figure: the data tradeoff triangle]

You can optimize for time at the expense of flexibility by, for example, having a minimal number of configuration options and letting the tool guess sensible defaults for all the rest. Or you can optimize for cost at the expense of time by making the user cobble together a bunch of different open source tools and configure them themselves. But, generally speaking, it's infeasible to create a tool that is simultaneously super-cheap, super-flexible, and super-quick to set up and use.

As you can see in the illustration, different sections pick different tradeoffs in this triangle. Users in the Direct Science section are, by definition, doing things no human has ever done before. So they need flexibility above all else. Sure, they'd also like to save cost and time, but never at the expense of flexibility. At the other end of the continuum, Mass Storytelling users have a very specific outcome in mind and will probably use a given tool once and then never again. So they want it to be cheap (preferably free) and, a close second, very easy to use.

Other Important Concerns

Each section of the continuum also has unique needs that drive their tool choices.

For example, both of the first two science sections need to survive the brutal process of peer review. This means, among other things, that users need to be borderline obsessive about citing the provenance and exact meaning of every dataset they use. And they need to be able to provide radical transparency into the methods they used to transform and analyze the data. Everything has to be repeatable.

As you start to pass into the Sophisticated Almost-Science and the Advanced Business Intelligence sections, ongoing operational concerns start to take hold, like "will my data show up cleanly, every day, in the same place at the same time?" Security and economy-of-scale concerns also start to rear their heads. That peer-review thing, though? No longer an issue.

Finally, as you pass into the Basic Business Intelligence and the Mass Storytelling sections, usability concerns take a front seat, like "would a casual user even know what they're looking at here, without any training?" and "will this graph be so ugly it distracts executives from the underlying point in the data?"

These other important concerns will sometimes entirely drive a user's choice of tools. Users are often much more consciously aware of these concerns than they are of their own position in the tradeoff triangle I mentioned before. For example, if your tool makes citation and tracing data provenance difficult, then scientists won't use it. If your tool requires tweaking XML configuration files of any kind, then mass storytellers won't use it.

Tool Tastes

Not surprisingly, each section has different tastes in tools and technologies. Those tastes are driven by their position in the tradeoff triangle, by the other concerns they have to fulfill, and by their relative comfort both with coding and with hardcore math.

Both the science sections have a lot of freedom in their choice of tools and also a lot of comfort with math. Most likely, each user there will choose whatever tool they learned to use in graduate school: FORTRAN, MATLAB, SAS or SPSS if they're on the older side, and Python, R, and Jupyter if they're on the younger side. What's most important, at the end of the day, is that their tools speak "math" and that they have enough horsepower to get the computations done.

Both the Sophisticated Almost-Science and the Advanced Business Intelligence sections need the flexibility to code when they have to, but each user's personal preferences start to bend to the operational, security, or economy-of-scale needs of their company at large. Python, R, the cloud, and the data science libraries around all of the above rule the day.

Moving toward the last two sections, SQL and drag-and-drop tools like Tableau and Looker rule the day. And attractive design becomes much more important. Tools that can produce dashboards with simple dropdowns for parameters will enjoy a lot of success.

How This Relates to Climate, Specifically

To make all this concrete, let's complement all of the above with some climate-specific examples.

Starting at the leftmost end, in the Direct Science section, climate scientists are simulating the climate, asking interesting questions based on the results (like "does the change in reflectivity of the Greenland ice sheet as it melts have an appreciable effect on global average temperature in the year 2100?"), and then simulating the climate again, but with different assumptions.

Sitting right next to those climate scientists in the Adjacent Science section, economists, sociologists, epidemiologists and political scientists are taking the results of those climate simulations and then running analyses of their own to see how economies, societies and political systems have responded over time to changes in rainfall or temperature or air pollution.

In these first two science-based sections, most users are currently very well served by a Python-and-Jupyter-oriented open source project called Pangeo and by an academically-oriented indexing and hosting service called the Earth System Grid Federation (ESGF).
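
To make that concrete, here's a rough sketch of the kind of workflow Pangeo enables, assuming the intake-esm / xarray / Zarr stack it's built around. The catalog URL, model, and scenario below are illustrative (check the Pangeo docs for current details), so treat this as the shape of the workflow rather than a recipe:

```python
import intake

# Pangeo publishes a public CMIP6 catalog on Google Cloud Storage.
# The URL shown here is illustrative; see the Pangeo docs for the current one.
cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# Search for monthly near-surface air temperature ("tas") from one model
# under a high-emissions scenario.
subset = cat.search(
    source_id="CESM2",
    experiment_id="ssp585",
    table_id="Amon",
    variable_id="tas",
    member_id="r1i1p1f1",
)

# Lazily open the matching Zarr stores as xarray Datasets (backed by Dask).
dsets = subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
ds = next(iter(dsets.values()))

# A quick (unweighted, deliberately naive) global-mean temperature series.
tas_global = ds["tas"].mean(dim=["lat", "lon"])
print(tas_global)
```

The point isn't the specific calls; it's that this whole stack assumes you're comfortable living in Python, Jupyter, and the scientific ecosystem around them.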

Moving further to the right, we have the Sophisticated Almost-Science section, where you get third parties like ClimateAi helping companies predict climate change's impact on their supply chains and BlackRock helping investors predict climate change's impact on their portfolios. These organizations are staffed by data scientists, PhDs, and grad students who are applying rigor to their work, even if it's not being published. Generally, these places are probably rolling their own solutions, which starts with wrestling with ESGF (or, if they've heard of it, Google Earth Engine). Really, though, they'd prefer that someone else did this boring data engineering work for them so they can spend more time in their area of expertise, where they add the most value. Saving a few bucks and having a few more customizable knobs isn't nearly as important as getting to market early and accurately. I consider this to be a major opportunity.

Moving slightly more to the right, into the Advanced Business Intelligence section, you have teams of not-PhDs trying to do the same thing the PhDs are doing in the Sophisticated Almost-Science section. Smaller companies or scrappier startups are struggling (and probably failing) to answer the same questions: "what will happen to my fields?", "what will happen to my supply chains?", or "what will happen to my portfolio?" Or they are paying consultants a lot of money to answer questions they suspect they could answer for themselves, if only they had a head start. As you might suspect, I consider this to be a major opportunity, too.

In the last two sections, you have people at the New York Times who want to make a home page widget where you can enter your zip code and find out how much your wildfire risk increases over the next 30 years. Or people at climate advocacy organizations who want to create a well-researched mailer describing how much hotter and drier each state will get (especially those swing states!). Or small water district managers who just want to know what's going to happen to projected rainfall over the next decade. For these people, a quick Tableau dashboard or a pretty D3 widget on a web page will do just fine. They neither need nor want to know what Jupyter is or how to work with Pandas. If you're lucky, they know SQL and can hammer together relatively formulaic JavaScript. If you're unlucky, they only know how to drag and drop. This market, once the data is made available in simple, tabular formats, is already very well served by generic tools like Tableau, Looker, and even Excel.
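
As an aside, here's a rough sketch of what "made available in simple, tabular formats" might look like in practice. The file and variable names are hypothetical and the unweighted decadal average is deliberately naive, but the shape is the point: gridded data goes in, a flat CSV comes out, and from there any of those generic tools can take over.

```python
import xarray as xr

# Hypothetical gridded precipitation projection (dims: time, lat, lon).
# "pr" follows the CMIP naming convention for precipitation.
ds = xr.open_dataset("precip_projections.nc")

# Bucket the time axis into decades, then average within each decade.
decade = (ds["time"].dt.year // 10 * 10).rename("decade")
decadal_mean = ds["pr"].groupby(decade).mean(dim="time")

# Flatten to one row per (decade, lat, lon) cell and write a plain CSV.
df = decadal_mean.to_dataframe().reset_index()
df.to_csv("precip_by_decade.csv", index=False)
```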

Where Does Pollen Fit?

Which section is Pollen.io going to focus on? I'm still working that out by talking to potential customers. But, stay tuned. You can probably guess what general direction we're heading. Subscribe to our mailing list below to stay up to date.