
DML Blog: Data quality – prevention is better than the cure

Many of us spend a lot of our time dealing with poor quality data, and any solution we put in place tends to be reactive or yet another workaround implemented in an increasingly complex data pipeline. But this is proving ineffective and expensive, and preventing organizations from realizing the value of their data, when that’s becoming more important than ever.

What if we could prevent data quality issues, at source, where it is cheapest and most effective?

We can, and in this talk you’ll learn how. You’ll leave this talk with knowledge of how to use data contracts as your first step towards data mesh.

Speakers: Andrew Jones, Principal Data Engineer at GoCardless


Read the Transcript

Speaker 1: 00:01 There it goes. Alright.

00:12 Alright, well thanks everybody for joining us. My name is Paul, I'm the head of community here at Data Mesh Learning. Today we have an exciting talk with Andrew Jones, principal data engineer at GoCardless, and his talk is going to be "Data quality: prevention is better than the cure". Before we get into the talk, I just have some community updates to go over. Coming up next week on Tuesday we'll have the monthly Data Mesh Roundtable as usual, and that's going to be with Andrew Sharp, Amy Regata, Karen Hawkinson and Juan. We also have a guest, Ole Olesen-Bagneux, Chief Evangelist at Zeenea. The topic is data catalogs, so we hope you can join us for that. It's an interactive session: Ole will talk for a little bit, but then we'll go over questions from people in the audience. It's a discussion, so yeah, hope you can make it for that.

01:09 We also have Mar Diaz, product manager for the data platform at Vinta, on June 26th at 9:00 AM San Francisco time, or 18:00 CET, and that'll be all about their data mesh journey in Spain. So yeah, we're excited to have Mar come talk with us later in the month. And then today, if you're in Belgium, there's a meetup happening right now, or in a little bit: you can join Tom de Wolf, Steve Lowe and Jean-Georges Perrin at the Data Mesh Belgium meetup. You can scan the QR code or follow the link there. So maybe you can catch the end of it; if you're in Belgium right now, you should stay through this talk and then you can go to the Belgium meetup. Of course we post all of our past events on YouTube, where we're streaming live right now, so if you miss any of our events they're always available there to check out later.

02:12 If you're not already a part of the DML Slack group, you should join. You can scan the QR code, and there you can interact with your peers, ask data mesh questions, or help answer other people's questions. It's just a place for people to interact with each other and talk about data mesh, so scan the QR code and join the Slack group if you haven't already. And then of course, as always, on our Data Mesh Learning website we have a ton of resources, case studies and blog posts, and we list all of our events there, so you can check that out. Okay, I'm going to stop sharing before I make any more introductions. If you have questions, feel free to put them in the chat; we'll have Andrew answer them at the end of the meeting. So yeah, without further ado, I'll hand it over to Andrew to take it away.

Speaker 2: 03:08 Great, thanks Paul. Hi everyone. So I'm going to talk a bit about data quality and why prevention is better than the cure. If you're joining this livestream, you probably think about data quality a lot. You see the impact and the cost of poor data quality in your day-to-day work and you understand the need to improve data quality, but many organizations are not able to improve the quality of their data. They're really only able to work around the poor quality data, and they're doing so at great expense and with poor outcomes. So today I want to show you how we can prevent data quality issues at the source, because when it comes to data quality, prevention is better than the cure. So why is data quality important? Well, as I said, if you're joining this presentation you probably understand why: you see it all the time.

04:04 This is backed up by a recent survey from dbt, who found that 57% of practitioners highlight data quality as one of their chief obstacles in preparing data for analysis. That's up from 41% the last time they ran the survey, so it's getting worse. So we all understand why data quality is an issue, why it's getting in our way, why it's preventing us delivering value. But on the other hand, we find that business leaders also aren't getting what they want from their data. There's another survey, this time from Nash Squared, which found that 64% of organizations think that big data and analytics are a way to deliver competitive advantage, yet only one in five are using it to deliver increased revenue. This was a survey of business leaders, and I think what's interesting about it is that it shows this massive gap between the aspirations of business leaders, what they want to get from the data, and the reality they're seeing in what is actually being delivered from the data and from the data teams.

05:16 I think a lot of this really comes down to how we deal with data and the quality of our data, because we know that working with poor quality data is difficult, it's time consuming and it's expensive, and it gets more expensive the later you leave it. Gartner have a stat, as always, and they found that poor data quality costs organizations an average of $12.9 million a year. I don't know how true that is, but I think having a way to demonstrate how expensive it is, is important, because if we can do that then we can get the resources we need and, more importantly, the alignment we need to improve the quality of data at the source, so we as data practitioners can deliver greater value for the organization. And one way I've found quite useful for articulating the cost of poor data quality is the 1-10-100 rule of data quality, which looks a bit like this.

06:25 We'll go through it all in a bit more detail in a minute, but what it's saying is that the earlier you manage something, the cheaper it is, and as a rule that feels almost universal. The obvious example is healthcare. Think about healthcare: it's much better for you, and also cheaper for the health provider, if people don't get sick in the first place, so they invest a lot in preventing illnesses. The next step would be, when you do get sick, getting treated quickly and treated locally; again, a better outcome for you and cheaper for the healthcare provider. If it goes beyond that, then the costs go up and up and up. So with healthcare we think about prevention a lot. Maybe an example a bit closer to us would be software engineering, where we spend a lot of time trying to prevent bad quality code from making its way into production.

07:25 So we do a lot of things like unit testing, we do CI checks, we do code reviews, you might do pair programming. All these things slow down the delivery of code, but we do them because we understand that the cost of delivering poor quality code into production and causing incidents is high. So we invest in prevention. Obviously we don't prevent every bug from making its way into production, so we still do remediation and we still invest a lot there. In software engineering that's observability, having a good incident management process, rolling back quickly, feature flagging; all those things help with remediation.

08:08 Failure in this case would be just going and writing code in production on a server, which obviously no one's doing, unless you're a student. So with software engineering we invest a lot in prevention. With data quality we don't tend to invest that much in prevention, and that's what I want to talk about over the next few slides: how we can start thinking a bit more about prevention when it comes to data quality. So let's go through this 1-10-100 rule in a bit more detail as it applies to data quality. What the dollar figures mean is the cost per record. If you fail to do anything about data quality, you're looking at $100 per record. Remediation is $10 per record, and prevention is $1 per record. So let's go through the pyramid now, bottom up, from failure up to prevention.
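To make the rule concrete, here is a minimal sketch combining the talk's $1/$10/$100 per-record figures with a purely hypothetical volume of bad records; the absolute numbers are invented, and the point is only how quickly the cost multiplies the later an issue is handled.

```python
# Illustrative only: a hypothetical record volume combined with the 1-10-100
# rule's per-record costs ($1 prevention, $10 remediation, $100 failure).
COST_PER_RECORD = {"prevention": 1, "remediation": 10, "failure": 100}

bad_records = 50_000  # hypothetical number of affected records

for stage, unit_cost in COST_PER_RECORD.items():
    print(f"{stage}: ${bad_records * unit_cost:,}")

# prevention: $50,000
# remediation: $500,000
# failure: $5,000,000
```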

09:04 So, looking at failure, which is $100 per record: this is where you have data but it's not accessible, it's not usable, you do nothing to try and organize it or improve its quality, and then maybe you give up. That $100 includes the opportunity cost of not being able to deliver what you need to deliver for the organization, and then maybe your organization can't meet its strategic goals and your company becomes less competitive over time; the $100 includes all of that as well. So not a great place to be. Again, if you're joining this call you're probably doing something about data quality, so I'd be surprised if anyone's in failure. I think most of us will be thinking about remediation, and this is where we're spending a lot of our focus at the moment as an industry.

09:55 This is things like observability, things like using something like Soda or Great Expectations to run checks and send alerts when data quality issues are found. All of those kinds of things come under remediation. This is good: it helps you notice issues quicker, and maybe helps you recover quicker from incidents. But there's still a cost, and we're still talking about $10 per record here. The reason it's still a high cost is because these observability alerts often only fire when poor quality data is already in production. It's already causing an incident, it's already affecting users, and maybe the data has already spread to many different places and gone into other systems, and that all adds to the cost of dealing with that incident.
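As a rough illustration of this remediation layer, here is a generic, hedged sketch of a detect-and-alert check written in plain pandas; it is not the actual API of Soda or Great Expectations, and the threshold and column names are invented. Note that by the time it fires, the bad data has already landed.

```python
# A generic remediation-style check: it can only react after poor quality
# data has already reached the warehouse and possibly downstream systems.
import pandas as pd

def check_null_rate(df: pd.DataFrame, column: str, max_null_rate: float = 0.01) -> bool:
    """Return True if the column's null rate is within the allowed threshold."""
    null_rate = df[column].isna().mean()
    if null_rate > max_null_rate:
        # In a real setup this would page someone or post to a channel.
        print(f"ALERT: {column} null rate {null_rate:.0%} exceeds {max_null_rate:.0%}")
        return False
    return True

orders = pd.DataFrame({"customer_id": ["c1", None, "c3", None]})
check_null_rate(orders, "customer_id")  # fires: 50% of values are null
```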

10:49 Also, once we notice an issue and we've worked out what went wrong, it tends to be too late to implement the fix upstream in the source system where it occurred. So what we end up doing is implementing workarounds downstream in our ETL. As I'm sure you know, that's another coalesce statement, another if-else statement, all those kinds of things where we're working around the poor quality of our data. The problem with that is it looks okay if you do it once or twice, but if you keep doing it your ETL becomes increasingly expensive and increasingly complex, and that complexity on its own leads to more data issues. And if you keep having all these data issues, your users will start to lose trust in your data, and once they lose that trust it's very difficult to win it back. So remediation is good, but still costly.
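Here is a hedged sketch of what those accumulating downstream workarounds tend to look like in practice; the column names and the specific fixes are invented, but each one papers over an upstream issue that could have been prevented at the source.

```python
# Each workaround quietly adds complexity and assumptions to the pipeline.
import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Workaround 1: the source sometimes sends nulls for country.
    df["country"] = df["country"].fillna("unknown")
    # Workaround 2: a past incident left some amounts in pence instead of pounds.
    df["amount"] = df["amount"].where(df["amount"] < 10_000, df["amount"] / 100)
    # Workaround 3: the source renamed a column without telling anyone.
    if "customer_ref" in df.columns:
        df = df.rename(columns={"customer_ref": "customer_id"})
    return df
```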

11:51 So let's talk about prevention, where it's cheapest and most effective to deal with and improve the quality of data. As I say, this is probably where many of us aren't focusing much at the moment; most of our effort goes into remediation and that's where we operate, and few of us are moving towards prevention. So what does prevention look like? It's all about things like shifting left and catching issues at the source. It's about preventing issues from occurring in the first place, through good change management processes and things like that. And it's about ensuring that when an incident does happen, we reduce its impact, maybe by preventing that poor data from leaving the source system. That means it can be fixed at the source, and the data can be replaced from the source, before that poor data ends up in the ETL and in all the downstream services that are powered from it.

12:52 Like I said though, most of us aren't thinking too much about prevention at the moment; most of us are focused on remediation. So for the next few slides, let's focus on prevention and what it would take for us to start preventing more data quality problems at the source. If we start by taking this pyramid, this 1-10-100 rule of data quality, and overlaying it on a simple data pipeline, it might look a bit like this. $100 per record is failure: that's when issues make it into applications and affect end users, potentially paying users of your product. Remediation is what we do in our pipelines, like I spoke about already, all the work we have to do in our ETL to try and manage poor data quality. Prevention is at the source, on the source data, and that's where it's cheapest and most effective to deal with poor data quality.

13:55 But how do you do that? How do you actually prevent data quality issues at the source, and how do we incentivize data producers to put a bit more effort and a bit more discipline into the production of data, so that we can prevent some of these common data quality issues from occurring at all? Well, I think first of all we need to correctly assign responsibility for that data to the right people. For many of us, that will mean assigning responsibility for source data to the product engineering or software engineering functions in our organizations. Ultimately they should be responsible for source data. They may not feel that way today in your organization, but think about it: they're really the only ones who can be responsible for it, because they're the only ones who impact the quality of that data. There's nothing we can do downstream to make it more timely, nothing we can do downstream to make it more complete or more accurate. The best we can do is work around those issues, maybe infer data or use techniques like that, but we can never really do much more than that to improve the quality of the data. So responsibility has to lie with the people who own the source of the data, which typically is a product engineering team.

15:26 So if that's the case, and they don't feel responsible at the moment, then what do we do next? How do we incentivize them to take on that responsibility? How do we make them care about data quality? How do we provide them with the right tools to help them produce data of the right quality, the quality we need? I think there are a few things we need to do, and in particular there are two problems we need to solve. The first is that currently there's a huge gap between those producing the data and those consuming the data and actually driving value from it. Those coming in at the last mile to create value, the data science teams, the BI analysts, the people using the data to drive real value for the business, are so far away from where this data is actually being generated. Often they're getting data through another team, or through many different teams in the middle.

16:31 Typically that's a data engineering team, maybe a platform team, it could be anyone, but there's a huge gap between those providing data and those consuming data, and that means as the consumer you're kind of stuck with what you're given. If the data isn't what you need, isn't the right quality for you, or breaks regularly, you can talk to the data engineering team, who might be able to do some work around that, but you might not even know who actually generates the data at the source. It's very difficult to find out, to speak to them and have a conversation about why you need it to be better quality and how that impacts the business as a whole, how that improves outcomes for the business. On the other side, if you're a product engineer or software engineer or a product manager in a product engineering team, you don't really know how the data's being used to create value; you're too far away from it.

17:33 And then why would you care about its quality? If you don't see the value it provides for the company, you wouldn't. You've got many other things you need to do, so why would you care? So I think we need to really encourage greater collaboration between these two areas, between those consuming data and those producing data, to bring them a lot closer together. And to do that, we as data practitioners, as the people actually getting value from data, need to be really good at articulating the value of the work we provide. Why do we need this data? What value does it give the company? Why would better quality data help us? If we can do that, and do it well, there's no reason why we can't incentivize the data producers to provide data that meets our requirements and is the right quality for us to deliver what we're delivering.

18:31 The second problem I think we need to solve is the lack of interfaces. Often we're getting data through some kind of ELT or change data capture style process, and what that's doing is just grabbing data out of the source databases or other third-party tools, chucking it into a data warehouse or data lake, and then we're building on top of that. One of the main problems with that is we are effectively building directly on top of that source database, and that source database is going to change. The engineers need to change their database as they add features, as they improve performance, or as they do any of the other things they need to do to deliver what they need to deliver. So we should be expecting it to change, and yet we are so often caught out by change, because we're building directly on top of the database and because there's no interface in between that database and the data we're building on.

19:36 So that's the other thing I think we need to solve: we need a more explicit interface through which data can be provided from those databases to us, so we can build on it with confidence. Now, interfaces are a simple idea but also very powerful, and that's why we see them everywhere in software engineering. For example, think about a Python library you're using. What you're doing there is using its public interface, and you know it doesn't matter what's going on behind the scenes in the private methods; you can rely on that interface not changing through different minor versions. One day there will be a breaking change and they'll produce a new major version of the library, and probably advise some sort of migration path from, say, version one to version two.
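As a small, hedged illustration of that public-versus-private distinction (everything here is invented for the example), consumers build against the public method, while the underscore-prefixed helper carries no stability guarantee.

```python
# A minimal sketch of the interface idea: the public method is the contract
# with consumers; the private helper can change in any release.
class PaymentsClient:
    """Public interface: stable within a major version (e.g. 1.x)."""

    def monthly_total(self, customer_id: str) -> int:
        return sum(self._fetch_raw_amounts(customer_id))

    def _fetch_raw_amounts(self, customer_id: str) -> list[int]:
        # Private helper (leading underscore): callers who reach in here
        # have no guarantees if it changes or disappears.
        return [1200, 850, 430]  # stubbed data for the sketch

client = PaymentsClient()
print(client.monthly_total("cust_123"))  # rely on this, not on _fetch_raw_amounts
```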

20:28 So that's a great example of an interface: something you can build on with confidence. Now, if you're using Python or a similar language, you can technically go and access the private methods of that library, but if you do that you have none of those same expectations. The library can change them anytime they like, and if that breaks your thing, that's on you. But if you go through the interface, there's an almost silent agreement you're making, where the provider of the library commits to providing that interface until they create a new major version. So libraries are a good example; APIs are another good example. Think about a company like Stripe or Slack, or us, I work for GoCardless, so it could be us. We provide an API that allows businesses to build entire businesses on top of our APIs. That's a platform, and they can do so with confidence, knowing that we will have good change management on the API, that we won't break it without telling them, and that if we do change it there will be a deprecation path. They know that we are committed to certain SLAs, and that gives them a certain level of confidence in terms of the uptime, the latency, all of those kinds of things, and that's what allows them to build on it with confidence.

21:52 And also for us as an API provider, an interface provider, it's very clear what our responsibilities are. So again, interfaces are very simple, but you see them everywhere in software engineering because they're very powerful. They help you assign responsibility, they help enable greater collaboration, you've got something to talk around, something to discuss, something to agree on, and most importantly they allow things to change on one side without breaking things on the other side. Take the API example again: Stripe, Slack, GoCardless. We're changing things all the time internally, and it doesn't affect anyone downstream of us building on top of our interface. So interfaces are really important, but we don't see them often in data engineering. For data, though, I think we can use the data contract as the interface.

22:48 So what is a data contract, and apart from being the interface, how does it help us prevent data quality issues? Well, I think the easiest way to think about a data contract is that it is a set of metadata about your data. Again, a simple idea, but very powerful once you start describing your data, and you can describe it in any way you need to match your requirements. That could be things like having a schema, things like having categorization, like whether it's personal data or not personal data, anything you need really. When you start describing your data in a standard way, you can then do a lot with that data contract to help prevent data quality issues at the source. For example, we can use the contract to encourage collaboration. As I mentioned earlier, it's kind of on us as data consumers to really feed into the data contract: provide our requirements, articulate the value, maybe quantify the ROI, and encourage and foster that collaboration, put the effort in. If we can do that, then we can get producers to really take on responsibility: responsibility for the quality of the data, for how the schema evolves, applying good change management around that, and themselves providing expectations and SLOs around the data that meet our requirements, or get as close as they can to our requirements, with the contract as the common thing to discuss and collaborate around. So once you have something describing what you need and what's being provided, you can use that contract to encourage greater collaboration.
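To make that less abstract, here is a hedged sketch of the kind of metadata a data contract might capture, written as plain Python purely for illustration; the field names, categorisation flag and SLO wording are all invented, and real implementations vary (YAML, a DSL, or code).

```python
# An invented, minimal data contract: schema, ownership, versioning and an SLO.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    type: str
    personal_data: bool = False   # categorisation, e.g. for privacy handling

@dataclass
class DataContract:
    name: str
    owner: str                    # the producing team responsible for the data
    version: str                  # breaking changes require a new major version
    freshness_slo: str            # what consumers can expect, e.g. update cadence
    schema: list[Column] = field(default_factory=list)

payments_contract = DataContract(
    name="payments",
    owner="payments-engineering",
    version="1.2.0",
    freshness_slo="updated within 1 hour",
    schema=[
        Column("payment_id", "STRING"),
        Column("amount_pence", "INTEGER"),
        Column("customer_email", "STRING", personal_data=True),
    ],
)
```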

24:34 You can also use that contract to create the interface I've been talking about. The way most people define data contracts, and the best way to define them, is in a format that's machine readable. Whether it's YAML or code in Python doesn't really matter, but if it's machine readable, it's then very easy to create interfaces from that data contract. And once you have that interface, you can allow or enable data producers to provide data through it. In practical terms this interface will be something like a table in a data warehouse, or a stream, or whatever; again, whatever you need really, it's not prescribed. But once you have this interface you can start providing change management around it, and now you're no longer building on top of the database, you're building on top of that interface instead. That allows you to catch issues at the source and prevent many common issues from occurring.
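Continuing the hypothetical contract sketch above, because the contract is machine readable it can drive the interface itself. Here it is rendered as warehouse DDL text for illustration; a real platform might instead call the warehouse's API or generate Terraform, and none of this is prescribed by the talk.

```python
# Generate the warehouse-side interface (a table definition) from the contract
# defined in the earlier sketch, so the interface and the contract cannot drift.
def contract_to_ddl(contract: DataContract, dataset: str = "contracts") -> str:
    columns = ",\n  ".join(f"{col.name} {col.type}" for col in contract.schema)
    return f"CREATE TABLE IF NOT EXISTS {dataset}.{contract.name} (\n  {columns}\n)"

print(contract_to_ddl(payments_contract))
# CREATE TABLE IF NOT EXISTS contracts.payments (
#   payment_id STRING,
#   amount_pence INTEGER,
#   customer_email STRING
# )
```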

25:37 The nice thing about this is that if we do this well, what we have is software engineers producing quality data through a well-defined, stable interface, and maybe that data doesn't need to go through data engineering anymore; maybe it's good enough to use directly. That brings the producers and consumers a lot closer together, aiding collaboration again, and it also removes that bottleneck of data engineering. I'm sure some of you work in a data engineering team: you always have a backlog, and you never get to many of the things on that backlog. It's a bottleneck, and bottlenecks are there to be removed, so this is one way to try and remove that bottleneck. It doesn't mean we're not going to do any more data engineering, but it shouldn't be the default; not all data needs to be refined through data engineering to become useful.

26:32 Now, this is a session for the Data Mesh Learning community, so I imagine many of you are at least a bit familiar with data mesh, or have heard of it, and some of you are probably thinking that a lot of this sounds like data mesh, like the things we want to achieve with data mesh. And you're right, it does, and that's not an accident. From my point of view, data contracts can be a step towards data mesh, because there's a strong overlap between the goals: what we're trying to achieve through data contracts and what we're trying to achieve through data mesh. We're trying to solve the same problems. Actually, when I started thinking about data contracts, originally around four or five years ago, I was trying to solve these problems at my organization, thinking about why data kept breaking and causing all these outages and all these incidents downstream, and thinking about interfaces and things like that.

27:41 It was around the same sort of time that the original data mesh articles came out from Zhamak, and I was reading those articles; they were definitely one of the inspirations behind my ideas for data contracts. I really liked how they articulated the problems we want to solve, and I liked the solution they were describing, but at the time I was just the tech lead of a small data platform team. I felt like I couldn't go and try to change the whole organizational structure to help us improve the quality of our data; that was too big a problem for me to solve. But providing a common new language, data contracts, providing a bit of tooling that helped enable that, talking with certain teams and working on specific problems using this common new language, and supporting a move towards a more decentralized organization, that was something I could do from where I was in the organization. So it was a bottom-up approach towards data mesh, and data contracts helped us do that.

28:48 So if you think about data mesh, it's got its four principles, and each of these can be delivered through a data contract. The first one, if you're not familiar, is domain ownership of data. I won't go into much detail today, but basically in data mesh, data is owned by the domain that generates it, because that's where the most knowledgeable people about the data are. The second principle of data mesh is data as a product, and this is around responsibility, partly around data owners taking responsibility for the quality of that product and making sure the product meets the requirements of its users. The third principle of data mesh is the self-serve data platform: providing infrastructure that supports these autonomous teams so they can build and manage their own data products, and do so independently. And the fourth one is federated governance: governance implemented through a federated model, balancing the need for central oversight with the autonomy of the data domain teams.

29:52 And like I said, each of these principles can quite easily be delivered through data contracts, so let's go through them in turn. With domain ownership, we've spoken a lot about how through data contracts we want to define ownership and define responsibilities, and use them to encourage greater collaboration between the owner of the data, who might be in a certain domain, and the users of the data, who might be in other domains; data contracts help with that. In terms of treating data as a product, if you create a product you need some sort of interface through which to get that product, and I've already spoken a lot about interfaces; data contracts deliver those interfaces, and they do so in a standard way, which helps enable accountability and interoperability between data products if you're all using the same tooling, the same language, the same APIs, the same interfaces.

30:55 In terms of the self-serve data platform, I haven't got time to talk about that today, but I talk about it in other places, like my book and my newsletter. This idea of building a platform around data contracts is something that we've done at GoCardless, where I work, and I've seen other organizations do it as well; you really can use data contracts to drive a complete data platform, built around data products and data contracts. And finally, data governance: again, not much time to talk about it here today, but I've covered it in other places. You can really embed, and even automate, a lot of data governance through the platform, by capturing what you need in the data contract and building the right tooling and platform to either embed it or automate it away completely. And that's what we do here at GoCardless.

31:50 So I really like to think of data contracts as a step on the ladder towards data mesh. One of the drawbacks, or criticisms, of data mesh is that it can be a big change for an organization, and depending on your organization it can be too big a leap. Data contracts can help you get started on that journey almost by themselves. Then maybe you want to introduce a bit more language around data products, and maybe you eventually get to data mesh, but you're on a journey and you're delivering value all the way through, at each step. And who knows, maybe data contracts, or maybe data products, might be all you need for now; you might not need to go all the way to data mesh to deliver what you need at your organization. So, to summarize what we spoke about today: we started off talking about the gap between the aspirations of business leaders and the reality of what they're seeing in what they actually achieve through their data. We also saw from the dbt survey that data quality is one of the major contributors preventing us from delivering value through data.

33:11 I think data quality is one of the things that's really preventing organizations from getting value from their data. To solve that problem, we need to think again about how we're dealing with data and how we're trying to improve data quality, and really move beyond the remediation step to prevention: how can we start addressing data quality at the source? To do that, we need to improve collaboration between those creating and those consuming data, and we need interfaces through which data can be provided that are stable and of the right quality for their users. Over the last 10 to 15 years or so, data engineering has been too difficult and too expensive, and because of that we're not delivering what we need to deliver. To articulate those problems, the 1-10-100 rule is a great way to describe that and to get some buy-in, because what it does is talk about costs, which is a good argument to make when you're trying to make a big change. And when it comes to data quality, the highest costs are when you're doing nothing to address data quality. It's much cheaper to do what many of us are doing now, the observability, the remediation step, the alerting, but it's much cheaper still if we can prevent these issues from occurring in the first place, and data contracts help with that. Because when it comes to data quality, prevention is better than the cure.

34:50 Thank you.

Speaker 1: 34:54 Alright, thanks Andrew. Yeah, I think that last graphic you showed is what everybody should use to help drive organizational change towards data mesh, and it's a good example of why data contracts can be useful, because it all comes down to saving money. I'm sure that's a huge draw. Yeah, thanks a lot. We have some questions here, so I will read them off. So, from Diana Bik (oops, what happened to the question? There we go): what would be the definition of data quality and data quality issues, specifically a definition useful for collaboration between engineering teams and data engineers?

Speaker 2: 35:50 Yeah, that's a good question. For me, data quality is one of those terms that gets used in a lot of different contexts. I think it's really about the expectations that you set for the data, more than hitting a particular level of quality. So if you're telling me your data is going to be available daily and that you won't make breaking changes until you create a new version, then to me you're defining some data quality measures, and as long as you're telling me what to expect, I can build on that data with some confidence. It might not be exactly what I want, maybe I'd like it hourly, but at least I know, and I can build on it with some level of confidence. I think what we typically have at the moment is that none of that is set.

36:43 You don't know how the schema's going to evolve; it might change overnight and break everything downstream. You don't know how timely it's going to be; it's not really being defined or set. So what you end up doing is making assumptions based on what it currently is: well, it hasn't broken in months, so I think it's very stable, and it seems to be up to date to the hour, so I'll assume that. But if you're assuming that, and that's not what the people providing the data at the source intend, then one day something's going to break your assumptions. You might have built something that's critical to the business, and now it's falling down and everyone's asking you why, and you're saying, I thought it was going to be up to date, I thought it was going to be reliable. So I think it's really about expectations when it comes to quality; that's the most important part, and the actual measures themselves are less important. So yeah, I hope that answers your question, Diana. It's a bit of a broad answer, but it really depends: if you know what to expect, then you have some confidence when building on it, and if you don't know, that's when big problems happen.

Speaker 1: 38:05 Okay, cool. And then Diana actually had a follow-up: the definition is supposed to help set the line between which issues are to be fixed by engineering teams and what data engineers may tolerate and implement cleansing procedures for.

Speaker 2: 38:20 Oh yeah, that's a good follow-up question. I think the language in the question is interesting, actually, words like tolerate and cleansing. That speaks to how we've been dealing with data for probably the last 15 years or so, since Hadoop and HDFS really, when it became cheaper to store data and we had all these ideas around schema-on-read, all those kinds of things. Once we started thinking in those architectures, we really focused on making it too easy to collect data and too hard to use data, and that's why data engineering is so hard. We have to tolerate poor quality data, we have to cleanse it, we have to refine it, using that whole "data is the new oil" language, which again is problematic. You could argue that's maybe okay if we're just doing analytics; I'm not sure that's true, but you can argue it. But when you're using data to provide key product features or to drive key business processes, it's not okay for us to tolerate those things anymore. If we want these things to be reliable, if we want to provide a product feature that drives revenue, whether it's based on the data directly or on ML built on that data or whatever it might be, then things change.

39:43 We need to have that quality from the start, and that reliability from the start; otherwise it can't be reliable downstream. Obviously this doesn't apply to all data, but if you are using data for those kinds of business-critical things, then you shouldn't be tolerating these kinds of issues. You should be having those conversations, like any other conversation at a company where teams depend on each other: okay, to deliver this critical thing for the company, I need this data with these expectations set, otherwise I can't deliver it. Have those conversations with them, and move away from this idea of just taking whatever we're given and feeling lucky to have it.

Speaker 1: 40:31 Yeah, I guess that’s also part of the shift in organizational thinking.

Speaker 2: 40:37 Yeah, I spoke to someone about this earlier, actually, on a different Zoom call. Things do seem to be moving in this direction; it will happen more quickly in some areas and at a different pace in others, as it's a big change. But where I work, we have changed that culture, so we don't expect the data engineers to spend their time cleaning and cleansing. We expect them to be delivering business value, and if they need better quality data upstream to do that, then we'll go and ask for it, we'll make the case for it, and it'll get prioritized in the same way as anything else that's prioritized in the company, so it goes through OKRs and planning and all those kinds of things. It just becomes part of what we do; it's no different from any other dependency that a team has. I think that's the way many organizations are moving.

Speaker 1: 41:26 Yeah, and what you just said there ties into what Martin Harrison was asking: are any organizations actually delivering the data quality standards you described, so they don't get these issues? It sounds like the organization you're working for now is doing that.

Speaker 2: 41:43 Certainly where I'm working; I'm talking mainly from firsthand experience at GoCardless. Aside from GoCardless, I sometimes speak to other organizations, and there are many that are looking at this. I don't want to name them, but they're often quite large companies, quite tech-heavy companies to be fair, though not always; sometimes they're media companies, sometimes they're banks, even banks and financial organizations. I've seen them slowly moving this way. They're also using data contracts to try to solve the problems I spoke about earlier, to try to improve quality and drive standards around that. So it's certainly becoming a more widespread idea, for sure, and we're seeing it more and more.

Speaker 1: 42:35 Well we’ll try and find a company who’s willing to talk about implementing data contracts. Yeah, I know it’s probably a little sensitive about who you can

Speaker 2: 42:45 Actually say. Yeah, I speak about them directly so I don’t want to, but there are some who spoke about public already. So HelloFresh, I think you take contracts and they spoke about it publicly before. Obviously you might have heard of many different companies building their contract solutions. So Anderson Building Gable has many customers, again, I don’t name ’em, but if you follow him on 18, he talks about some of the big customers or the people he knows of using their contract. He’s posted about recently name with some various logos on there. So there are public examples to PayPal. So obviously JDP is a friend of CML, so there are the examples out there that are public as well. If you touch for him.

Speaker 1: 43:33 Yeah, maybe we should have JGP come talk about his experience at PayPal implementing data contracts. I'm sure he would love to do that. Okay, the last question we have is from Richard Atkins: please give an example of how we'd use the codified contract, such as YAML. Is this just a representation of what's at the source, or is it used by the ELT, or is it a definition of the source itself?

Speaker 2: 44:02 Yeah, I see a question. I think

44:07 I would say it's not. What I'm reading from the way the question is phrased, the idea of it being a representation of what's at the source, is that you're maybe thinking of the contract describing a table in a database and codifying that in some way. I see people do this, and there are pros and cons to it, but I don't like that approach so much. I think you're still tying yourself to that database, to that table, to how it's described in that upstream service, and what you really want is some sort of abstraction there, some separation. So when we talk about data contracts, and when I talk about data contracts in the book and in what we've done at GoCardless, it's separate from the table, separate from the database. It's not a representation of what the data looks like in the source database; it's a representation of what that service is providing to downstream services, to downstream consumers.

45:12 So it might look a bit like a table, but it might look a bit different, and it probably will look different, because what we're talking about here is providing data in the right format that meets the requirements of the people downstream. Often that will look different to what it looks like in the transactional database. It might be designed more for analytical use cases, or it might be more of an event stream if you're doing event streaming in your organization. But it shouldn't look like the source database itself; it should be different. And that allows the database to change over time without changing the data contract, without changing the data product, if you like. So I think that's one thing. And then, on whether it's the same representation used by the ELT, as the question asks: reinforcing what I was saying before, it's not about using data contracts with ELT or change data capture, those kinds of processes that basically make copies of a database and bring them into the data warehouse

46:16 or the data lake. You can do that, but I think there are limited benefits to that, very limited benefits. What we really want to do is move away from building on top of those databases directly and have a different kind of interface. To be fair, that is more work, particularly on the data producer side, but we can help with that. We can provide libraries to help, and there are different architectural patterns you can use, which I haven't got time to go into now, but it's things like the outbox pattern. If you're not familiar with it, I cover it in the book and there are good resources online about it. These patterns aren't new to data contracts; they come from event-driven microservice architectures. But I think we need to adopt the same kinds of patterns, because what we're building with data is as important as another microservice or another service somewhere else. So why shouldn't we have the same expectations of the data? Why shouldn't we have an interface, an API, that doesn't look like the database? I think it's all the same really. So yeah, I wouldn't codify the source system or source database; I'd think about how you can codify a new interface for that data.

Speaker 1: 47:34 So do you mean the API to access that data?

Speaker 2: 47:38 Yeah, so it could be an API. Typically, because we're looking at datasets, it will be a table in the data warehouse. What we do is have people define the contract in a form of code; for us it's not YAML, it's code. That was the most natural thing for us to do: it's the same way engineers define their infrastructure as code, and their APIs, and we deliberately do it the same sort of way because we want it to feel the same for them. Then we take that code and we generate a BigQuery table, BigQuery from Google, and that's the interface, and we say: you as a service need to write to that BigQuery table directly. We also use this with streaming, so with Google Pub/Sub we say, okay, your service now

48:27 streams that data through Pub/Sub, and often we use the two together, so you stream through Pub/Sub into BigQuery. What we're doing there is creating that abstraction, creating a different interface, and then on the producer side we make it as easy as possible. Like I said, we provide libraries for people, and we make use of patterns like the outbox pattern and things like that, to make it easy to get data into Pub/Sub and into BigQuery in a way that matches the platform's requirements. It is still more work for the producers than something like CDC or ELT, they're doing some work, but we've made the case that overall that's a better outcome for the company.
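For readers unfamiliar with the outbox pattern mentioned here, below is a hedged, self-contained sketch of the idea using SQLite as a stand-in for the service's own database; the table names, topic name and payload shape are invented, and a real setup would publish to Pub/Sub and on into the contract-backed BigQuery table rather than printing.

```python
# Transactional outbox sketch: the service writes its business row and an
# outbox row in one transaction; a separate relay later publishes the events.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the service's database
conn.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, amount_pence INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def record_payment(payment_id: str, amount_pence: int) -> None:
    # One transaction: the state change and the event both commit, or neither does.
    with conn:
        conn.execute("INSERT INTO payments VALUES (?, ?)", (payment_id, amount_pence))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payments.v1", json.dumps({"payment_id": payment_id, "amount_pence": amount_pence})),
        )

def relay_once() -> None:
    # In production this would publish to Pub/Sub, matching the contract's schema.
    rows = conn.execute("SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        print(f"publish to {topic}: {payload}")
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

record_payment("pay_123", 4200)
relay_once()
```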

Speaker 1: 49:07 Okay, thanks. Well, are there any more questions? I'll wait a second, there's a little bit of delay. If there are any more questions, go ahead and add them to the chat now.

Speaker 2: 49:20 We’ll just hold. Yeah, really good questions. Thank you everyone.

Speaker 1: 49:22 Yeah, thank you everybody for your questions so far. I think they were great. Well I believe we are done with the questions, so yeah, I just want to say special thank you to Andrew. Really appreciate you taking the time to speak with us. Yeah, this is fascinating. And of course we’ll post this online, so if you want to watch it again, you can watch it on our YouTube channel. But yeah, thank you Andrew for joining us today.

Speaker 2: 49:49 Cool. Thanks Paul. Thanks for having me.

Speaker 1: 49:50 Yeah. All right, well we’ll see you later. Alright, thanks everybody. Bye.

 

 Data Mesh Learning Community Resources

 

Ways to Participate

Check out our Meetup page to catch an upcoming event. Let us know if you’re interested in sharing a case study or use case with the community.