What is Data Mesh?

[Image: cover of the O'Reilly Data Mesh book]

Why Does Data Mesh Exist?

Zhamak Dehghani, creator of the data mesh concept, spent years as a consultant working with many companies on the operational side of engineering. Data mesh is her solution to a pattern she saw among large customers: spending ever-increasing amounts on big data platforms (such as data lakes) but failing to see a return on the investment.

As Zhamak dug deeper, she found issues that were common to almost all organizations: data quality was poor; time to ad hoc analysis of data was abysmal (at large companies, it was usually 6+ months); and data engineers were badly overworked. Those engineers owned and managed all the data for the analytics plane in a large centralized data lake, yet they lacked the domain knowledge to make that data truly valuable.

Zhamak knew that the issues with scale had been mostly addressed, if not totally solved, on the operational side. So she began to apply the same approach to the analytics side.

The first step was to change the approach to data: treat it like a product and create data products. Treating data like an asset means hoarding it and amassing it for no real reason. Treating it as a product means honing it and making it useful and easy to consume.

The second piece was to do away with the hyper-centralized data lake, where everything was simply shoved in with no SLAs or quality guarantees and the data often lost its context. This is often referred to as the distributed architecture piece of data mesh.

What is Data Mesh?

The shortest summary: treat data as a product, not a by-product. By driving data product thinking and applying domain-driven design to data, you can unlock significant value from your data. Data needs to be owned by those who know it best.

Data mesh is a set of read-only products made of data. They are general purpose in the sense that they are not designed to answer one specific question or set of questions; they exist to share a domain's data with the outside in a way others can consume. Each data product is nonetheless designed specifically for non-real-time consumption and analytics, and is owned by the domain team. The overall mesh setup, whether at the data product or data platform level, works to make data products interoperable so you can combine data from multiple domains.

Almost every expert (and non-expert) in the data management space agrees: data mesh is not for everyone. Data mesh overcomes challenges faced by large organizations, and while parts of the concept might be useful for all companies, the costs of implementing it will outweigh the benefits for many smaller companies.

Centralized data infrastructure combined with centralized ownership of data by a data engineering team, a team that lacks true insight into the data, creates complexity without much value for most companies. Data quality is poor, context is lacking, data pipelines and ETL processes are brittle, and agility around data is nearly non-existent. Data producers (services) treat data as a by-product instead of a product.

To get the most value from your data, those who understand it best must own the data and compose it into data products for others to consume. And then those data product owners must be free, within some rational bounds, to store the data in places that make sense rather than putting all data into a data lake.

To derive value from data, companies must have:

1) distributed data ownership
2) distributed data architecture / technology

The other pillars of the data mesh framework solve the problems created by these two “must haves.”

The Four Pillars of Data Mesh

Domain-Driven Data Ownership
Data as a Product
Self-Service Infrastructure Platform
Federated Computational Governance

Domain-driven data ownership means no more data ownership by a data warehouse/lake team. Instead, give ownership to those who know the data best (domain-driven design, but for data) and who can package it best for others, including strong documentation, SLAs, and freshness/quality information. The data producers are not always the data product owners or those who know the domain best. Data product owners must also own the data pipelines to ensure SLAs are met and changes don’t create cascading failures. Distribute the data architecture too, so there isn’t a single place to store the data or only one way to store it.

Data as a product means creating data products with true data owners. This pillar is often phrased as “think of your data as a product,” but not all data should be productized. Data product owners need SLAs/SLOs/SLIs for their data products, just like any other product owner would. There also needs to be documentation and easy discovery and consumption.
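To make the idea of SLAs/SLOs/SLIs on a data product more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `DataProduct` and `FreshnessSLO` classes, their fields, and the `orders_daily` example are hypothetical and are not part of any data mesh standard or tool. It shows a product that carries its owner, its documentation, and a freshness objective that can be checked against a measured value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FreshnessSLO:
    """Hypothetical objective: the data must be no older than max_age."""
    max_age: timedelta


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team might publish for discovery."""
    name: str
    owner_domain: str            # the domain team accountable for the product
    description: str             # documentation a consumer reads first
    freshness_slo: FreshnessSLO  # the objective the owner commits to
    last_updated: datetime       # the indicator: when the data was last refreshed

    def meets_freshness_slo(self, now: datetime) -> bool:
        """Compare the measured age of the data (SLI) with the objective (SLO)."""
        return now - self.last_updated <= self.freshness_slo.max_age


# Illustrative values only: a product refreshed at midnight, checked at noon.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
orders = DataProduct(
    name="orders_daily",
    owner_domain="checkout",
    description="One row per completed order, refreshed daily.",
    freshness_slo=FreshnessSLO(max_age=timedelta(hours=24)),
    last_updated=datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc),
)
print(orders.meets_freshness_slo(now))
```

In a real setup the last-updated timestamp would presumably come from the pipeline itself rather than being set by hand, and a product would likely carry more objectives than freshness alone.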

The self-service infrastructure platform supports creating data products, serving them to consumers, and consuming them. It is often referred to as the overall data platform, consisting of one to many data “nodes” where each domain may choose to store its data products.

Federated computational governance means keeping decision making as local as possible while retaining some centralized control, so that not everyone has to reinvent the data access security rules.
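One way to picture "local decisions within central bounds" is a shared policy that domains may adjust except for a few locked rules. The Python sketch below is only an illustration under that assumption: `GLOBAL_POLICY`, `LOCKED_KEYS`, `effective_policy`, and the specific settings are all hypothetical names, not an API from any governance product.

```python
# Hypothetical organization-wide defaults, defined once by the governance group.
GLOBAL_POLICY = {
    "pii_columns_masked": True,
    "max_retention_days": 365,
}

# Hypothetical rules that no domain is allowed to override.
LOCKED_KEYS = {"pii_columns_masked"}


def effective_policy(domain_overrides: dict) -> dict:
    """Merge a domain's local choices over the global defaults.

    Raises ValueError if the domain tries to change a locked rule.
    """
    illegal = LOCKED_KEYS.intersection(domain_overrides)
    if illegal:
        raise ValueError(f"domain may not override: {sorted(illegal)}")
    return {**GLOBAL_POLICY, **domain_overrides}


# A domain shortens retention locally but inherits the masking rule as-is.
print(effective_policy({"max_retention_days": 90}))
```

The point of the design is that access and security rules are written once and enforced by code, while each domain still controls the settings that only it can judge well.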

If you’d like to read more, please visit the Data Mesh Resources page.