We wrote a short blog post (which links back to this page) as an intro to data mesh here.
Before jumping in, it is crucial to understand why data mesh exists at all. Without that context, it is very difficult to understand the other pieces of data mesh. Also, data mesh is not for everyone; it solves problems that many companies simply do not have. There is a scoring system in Barr Moses’ intro to the data mesh concept here.
Why Does Data Mesh Exist?
Zhamak covers this much better than we can in some of her videos; her talks at Datanova (talk 1 | talk 2) are especially good. A short version is below.
Zhamak Dehghani – the creator of the data mesh concept – worked as a consultant with many companies on the operational side of engineering for years. She started to see a pattern where companies were investing more and more money into their data lakes but failing to see returns. Data swamp comes to mind.
So she dug deeper and found issues that were common to almost all organizations. Data quality was poor. Time to ad hoc analysis of data was abysmal, if analysis was possible at all (at large companies, it was usually 6+ months). Data engineers were badly overworked and owned all the data for the analytics plane in one large centralized data lake, even though they lacked the domain knowledge to make that data truly valuable.
Zhamak knew that the issues with scale had been at least addressed, if not totally solved, on the operational side. So she decided it was well past time to apply the same approach from the operational side to the analytics plane.
The first step was to change the approach to data: treat it like a product and create data products. Treating data like an asset means hoarding it and amassing it for no real reason. Treating it as a product means honing it and making it useful and easy to consume.
The second piece was to do away with the hyper-centralized data lake, where everything was just shoved in with no SLAs or quality guarantees and the data often lost its context. This is often referred to as the distributed architecture piece of data mesh.
So, that should help you understand the genesis of data mesh. Without that context, data mesh can seem like a solution in search of a problem because the problem is as vast as “the way we’ve handled data for analysis internally at companies is completely bonkers and needs a wholesale change”. Yes, it’s that big.
Quick Data Mesh Overview
The shortest summary: treat data as a product, not a by-product. By driving data product thinking and applying domain-driven design to data, you can unlock significant value from your data. Data needs to be owned by those who know it best.
A data mesh is a set of read-only products made of data. They are general purpose in the sense that they are not designed to answer a specific question or set of questions; they exist to share a domain’s data on the outside in a way others can consume. Each data product is, however, designed specifically for non-real-time consumption/analytics and is owned by the domain team. The overall mesh setup, whether at the data product or data platform level, works to make data products interoperable so you can combine data from multiple domains.
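As a rough illustration of "read-only" and "interoperable", here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `DataProduct` shape, the `make_product` and `combine` helpers, and the idea that two domains agree on a shared key such as `customer_id`. It is not part of any data mesh specification or tool.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Iterable, List, Mapping, Tuple


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical shape for a read-only, domain-owned data product."""
    name: str
    owner_domain: str
    key: str  # identifier shared across domains so products can be joined
    rows: Tuple[Mapping[str, object], ...]


def make_product(name: str, owner_domain: str, key: str,
                 rows: Iterable[Mapping[str, object]]) -> DataProduct:
    # Wrap each row in a read-only view so consumers cannot mutate it.
    frozen = tuple(MappingProxyType(dict(row)) for row in rows)
    return DataProduct(name, owner_domain, key, frozen)


def combine(left: DataProduct, right: DataProduct) -> List[dict]:
    """Inner-join two products on their shared key.

    Assumes the key is unique within the right-hand product.
    """
    if left.key != right.key:
        raise ValueError("products do not share an interoperable key")
    index = {row[right.key]: row for row in right.rows}
    return [
        {**row, **index[row[left.key]]}
        for row in left.rows
        if row[left.key] in index
    ]


# Two domains publish products that share the customer_id key.
orders = make_product("orders", "sales", "customer_id", [
    {"customer_id": 1, "total": 40.0},
    {"customer_id": 2, "total": 15.5},
])
customers = make_product("customers", "crm", "customer_id", [
    {"customer_id": 1, "region": "EMEA"},
])
combined = combine(orders, customers)
```

The consumer combines data from the sales and CRM domains without either domain handing over ownership, and an attempt to change a published row fails because the rows are read-only views.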
We wrote a short blog post on the topic as well here.
Almost every expert (and non-expert) in the data management space agrees: data mesh is not for everyone. Data mesh overcomes challenges faced by large organizations, and while parts of the concept might be useful for all companies, the cost of implementing it will outweigh the benefits for many smaller companies.
The data mesh concept was created by Zhamak Dehghani (LinkedIn; Twitter) when she was at ThoughtWorks. Data mesh is her solution to large customers spending ever-increasing amounts on big data platforms (such as data lakes) but failing to see a return on the investment.
For her, centralized data infrastructure combined with centralized ownership of data by a data engineering team – a team that lacks true insight into the data – creates complexity without much value for most companies. The data quality is poor, context is lacking, the data pipelines and ETL processes are brittle, agility around data is nearly non-existent, etc. Data producers (services) treat data as a by-product instead of a product.
To get the most value from your data, those who understand it best must own the data and compose it into data products for others to consume. And then those data product owners must be free, within some rational bounds, to store the data in places that make sense rather than putting all data into a data lake.
To derive value from your data, there are two things companies must have:
1) distributed data ownership
2) distributed data architecture / technology
The other pieces of the data mesh framework Zhamak talks about are included to solve the problems created by these two “must haves.”
Data mesh has 4 pillars:
1. Domain-oriented, decentralized data ownership and architecture. No more data ownership by a data warehouse/lake team; give ownership to those who know the data best (domain-driven design but for data) and can package it best for others, including strong documentation, SLAs, freshness/quality information, etc. The data producers are NOT always the data product owners or those who know the domain best. Data pipelines must also be owned by the data product owners to ensure SLAs are met, changes don’t create cascading failures, etc.
Distribute the data architecture so there isn’t only one single place to store the data or only one way to store it. There are obviously rational bounds – no, 50 teams should not all invent new ways to store their data – but trying to force everyone to put every bit of data into the data lake doesn’t work. At all.
2. Creating data products with true data owners. This one is often phrased as “think of your data as a product,” but not all data should be productized. Data product owners need to have SLAs/SLOs/SLIs for their data products, just like any other product owner would. There needs to be documentation and easy discovery/consumption as well.
3. A self-service data infrastructure platform for creating data products, serving data products to consumers, and consumption of data products – often referred to as the overall data platform consisting of one to many data “nodes” where each domain may choose to store data products.
4. Federated computational governance. Basically, keep decision making as local as possible but have some centralized control so everyone doesn’t have to reinvent the data access security rules.
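To make pillars 2 and 4 a little more concrete, here is a minimal Python sketch of a data product contract with a freshness SLO, checked against a few policies expressed as code. All names here (`DataProductContract`, `GLOBAL_POLICIES`, `governance_violations`, the individual rules) are hypothetical and chosen only for illustration; Zhamak does not prescribe any particular implementation, and a real platform would likely look quite different.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional


@dataclass
class DataProductContract:
    """Hypothetical metadata a data product owner publishes with the product."""
    name: str
    owner: str
    documentation_url: str
    freshness_slo: timedelta  # SLO: maximum acceptable age of the data
    last_updated: datetime
    contains_pii: bool = False
    access_roles: List[str] = field(default_factory=list)

    def freshness_sli(self, now: datetime) -> timedelta:
        # SLI: the measured age of the data right now.
        return now - self.last_updated

    def meets_freshness_slo(self, now: datetime) -> bool:
        return self.freshness_sli(now) <= self.freshness_slo


Policy = Callable[[DataProductContract], bool]

# Central rules that every domain inherits, so nobody reinvents them.
GLOBAL_POLICIES: Dict[str, Policy] = {
    "has_owner": lambda p: bool(p.owner),
    "has_documentation": lambda p: bool(p.documentation_url),
    "pii_requires_access_roles": lambda p: (not p.contains_pii) or bool(p.access_roles),
}


def governance_violations(
    product: DataProductContract,
    local_policies: Optional[Dict[str, Policy]] = None,
) -> List[str]:
    """Return the names of failed policies: global rules first, then local ones.

    Local policies are added on top of the global ones and cannot replace them.
    """
    violations = [name for name, check in GLOBAL_POLICIES.items() if not check(product)]
    for name, check in (local_policies or {}).items():
        if not check(product):
            violations.append(f"local:{name}")
    return violations


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
orders = DataProductContract(
    name="orders",
    owner="sales-domain-team",
    documentation_url="https://example.invalid/orders",  # placeholder URL
    freshness_slo=timedelta(hours=24),
    last_updated=datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),
    contains_pii=True,
)
violations = governance_violations(
    orders,
    local_policies={"daily_or_fresher": lambda p: p.freshness_slo <= timedelta(days=1)},
)
```

In this example the product meets its freshness SLO and the domain's own local rule, but fails the global `pii_requires_access_roles` rule because no access roles were declared. That split is the point of the sketch: the domain keeps its own decisions, while a small set of shared rules is enforced everywhere.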
Zhamak has been, for good reason, reluctant to make default architectural recommendations. A data mesh needs a strong technological basis, but organizational buy-in and structure are far more important than any one technology or product. If you do not have true data product ownership, no piece of software will make your data mesh drive the value you expect/want from it.
If you’d like to read more, please visit the Data Mesh Resources page.