Quick Data Mesh Overview
The shortest summary: treat data as a product, not a by-product. By adopting data product thinking and applying domain-driven design to data, you can unlock significant value from your data. Data needs to be owned by those who know it best.
A data mesh is a set of read-only data products that are general purpose, in the sense that they are not designed to answer a specific question or set of questions. They exist to share a domain’s data “on the outside” in a form others can consume. Each data product is designed specifically for non-real-time consumption and analytics, and is owned by the domain team. The overall mesh setup, whether at the data product or data platform level, works to make data products interoperable so you can combine data from multiple domains.
We wrote a short blog post on the topic as well here.
Almost every expert (and non-expert) in the data management space agrees: data mesh is not for everyone. Data mesh addresses challenges faced by large organizations, and while parts of the concept might be useful for all companies, the costs of implementation will outweigh the benefits for many smaller companies.
The data mesh concept was created by Zhamak Dehghani when she was at ThoughtWorks (LinkedIn; Twitter). Data mesh is her answer to large customers spending ever-increasing amounts on big data platforms (such as data lakes) but failing to see a return on the investment.
For her, centralized data infrastructure combined with centralized ownership of data by a data engineering team, a team that lacks true insight into the data, creates complexity without much value for most companies. Data quality is poor, context is lacking, the data pipelines and ETL processes are brittle, and agility around data is nearly non-existent. Data producers (services) treat data as a by-product instead of a product.
To get the most value from your data, those who understand it best must own the data and compose it into data products for others to consume. Those data product owners must then be free, within some rational bounds, to store the data in places that make sense rather than putting all data into a data lake.
To derive value from your data, there are two things companies must have:
1) distributed data ownership
2) distributed data architecture / technology
The other pieces of the data mesh framework Zhamak talks about are included to solve the problems created by these two “must haves.”
Data mesh has 4 pillars:
1. Domain-oriented, decentralized data ownership and architecture. No more data ownership by a data warehouse/lake team; give ownership to those who know the data best (domain-driven design, but for data) and who can best package it for others, including strong documentation, SLAs, freshness/quality information, etc. The data producers are NOT always the data product owners or those who know the domain best. Data pipelines must also be owned by the data product owners to ensure SLAs are met, changes don’t create cascading failures, etc.
Distribute the data architecture so there isn’t a single place to store the data or a single way to store it. There are obviously rational bounds (no, 50 teams should not all invent new ways to store their data), but trying to force everyone to put every bit of data into the data lake doesn’t work. At all.
2. Creating data products with true data owners. This one is often phrased as “think of your data as a product,” but not all data should be productized. Data product owners need to have SLAs/SLOs/SLIs for their data products, just like any other product owner would. There needs to be documentation and easy discovery/consumption as well.
3. A self-service data infrastructure platform for creating data products, serving them to consumers, and consuming them. This is often referred to as the overall data platform, consisting of one to many data “nodes” where each domain may choose to store data products.
4. Federated computational governance. Basically, keep decision making as local as possible but retain some centralized control so that not every team has to reinvent things like the data access security rules.
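To make pillars 2 and 4 a little more concrete, here is a minimal Python sketch. It is not taken from Zhamak’s work or from any particular platform: the DataProduct descriptor, its field names, and the two global rules in governance_violations are all hypothetical, chosen only to show the split between what a domain decides locally (freshness SLO, allowed roles) and what a central rule set might enforce for everyone (documentation exists, PII is not open to all).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor a domain team publishes for one data product."""
    name: str
    owner_domain: str              # the team accountable for the product
    description: str               # documentation shown to consumers
    freshness_slo: timedelta       # maximum acceptable age, chosen by the domain
    last_refreshed: datetime
    contains_pii: bool = False
    allowed_roles: frozenset = frozenset()  # access decided locally by the domain

    def meets_freshness_slo(self, now: datetime) -> bool:
        """True if the data is no older than the SLO the owner committed to."""
        return now - self.last_refreshed <= self.freshness_slo


def governance_violations(product: DataProduct) -> list[str]:
    """Apply the (assumed) global rules; everything else is left to the domain."""
    violations = []
    if not product.description.strip():
        violations.append("missing documentation")
    if product.contains_pii and "everyone" in product.allowed_roles:
        violations.append("PII exposed to everyone")
    return violations


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)

orders = DataProduct(
    name="orders_daily",
    owner_domain="checkout",
    description="One row per completed order, refreshed daily.",
    freshness_slo=timedelta(hours=24),
    last_refreshed=datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc),
    allowed_roles=frozenset({"analyst"}),
)

customers = DataProduct(
    name="customer_profiles",
    owner_domain="accounts",
    description="",
    freshness_slo=timedelta(hours=6),
    last_refreshed=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    contains_pii=True,
    allowed_roles=frozenset({"everyone"}),
)

for product in (orders, customers):
    print(product.name,
          "fresh" if product.meets_freshness_slo(now) else "stale",
          governance_violations(product))
```

In this sketch the checkout team is free to pick a 24-hour freshness target while the accounts team picks 6 hours; only the two shared rules are checked centrally. A real mesh would likely run such checks automatically on the platform rather than in a script.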
Zhamak has been, for good reason, reluctant to make default architectural recommendations. A data mesh needs a strong technological basis, but organizational buy-in and structure are far more important than any one technology or product. If you do not have true data product ownership, no piece of software will make your data mesh deliver the value you expect from it.
If you’d like to read more, please see our recommended content here.