Data Chronicles – From Mesh to AI and Everything In Between: Data Contracts

In this episode of “Data Chronicles,” host Amy Raygada explored the role of “Data Contracts” with expert Andrew Jones. They delved into how these agreements ensure data integrity and consistency across complex systems. Andrew shared valuable insights on establishing clear expectations between data producers and consumers, highlighting the importance of data quality in AI initiatives and decision-making. The episode offered key strategies for leveraging data contracts to enhance data reliability and drive organizational success.


Speaker 1: 00:01 Okay. Alright, thanks for joining us, everybody, for episode two of Data Chronicles: From Mesh to AI and Everything In Between with Amy Raygada. Today we have special guest Andrew Jones. Before we get into the conversation, I always need to go over what's going on in the community, so I'll do that. I'm Paul, Head of Community. First, I want to say thank you to our community sponsors, Starburst, Reltio, and Nextdata, our platinum and gold sponsors. Thank you for enabling us to do these live streams and meetups with everybody. Here are some upcoming meetups: our monthly Data Mesh Roundtable in September will cover budget ownership. That's on September 10th; we hold those on the second Tuesday of each month, bright and early on the West Coast at 5:00 AM, which is 8:00 AM in New York and 14:00 Central European Time.

01:04 And of course, the next episode with Amy will feature Kendale El Omari on September 17th, so I hope you can join us for those. You can always go to the Meetup page and register for the events there. I also wanted to highlight that earlier this month, on August 6th (seems like a long time ago), we had a Data Mesh Learning panel on data governance and data mesh, so you should check that out. It features some of our Data Mesh Learning MVPs, Karin Hawkinson, Kendale, and Andrew Sharp, as well as Martin Lan, CEO and co-founder of Soda. It was a really great conversation on data governance. We have all of our other live streams on YouTube as well: go to the Data Mesh Learning YouTube channel and you can see all of our past meetups and catch the sessions you weren't able to see in person.

02:03 And coming soon (we're still working on this), we plan to put our streams on Spotify so you can consume the material that way as well, so stay tuned; we'll let you know when that's live. If you want to get into the conversation: sometimes on these streams we aren't able to talk about everything we wanted to because we run out of time, so you can always continue those conversations in the Data Mesh Learning Slack group. Scan the QR code to join the Data Mesh Learning Slack, ask your questions, help answer questions, and meet other people in the community. And of course there's the Data Mesh Learning website at datameshlearning.com. We have hundreds of resources, we list our events there, and there are lots of use cases, so it's definitely a great resource whether you're just starting your data mesh journey or already on it. And that's it for me, so I'll hand things over to Amy.

Speaker 2: 03:05 Hi everyone. Thank you so much for tuning in today. My name is Amy Raygada. I've been working in IT for around 20 years, and in data for around 13 at this point. I'm happy to have Andrew Jones here today, who's also been around 20 years in the industry, with the last 10 in data. I'm really happy that we're here together to discuss the importance of data contracts, a topic that I think is still a little bit cloudy for many people. So thanks so, so much for coming, Andrew. How are you today?

Speaker 3: 03:36 Hi, I am very good, thank you. Thanks for having me.

Speaker 2: 03:39 Great. So Andrew, first of all, I'd like you to tell people a little bit about your experience with data contracts: what is a data contract, and how did you come up with the idea? Because you're basically the person who started all this.

Speaker 3: 03:56 I came up with the idea around six years ago. What we had then was a quite typical data platform: a change data capture (CDC) service replicating data from our upstream databases into our data warehouse, with BI and data science built on top of that. It's quite typical of how many people build data platforms, and we had many of the same problems many people have: the data wasn't that reliable, schemas kept changing upstream, the data itself kept changing, and there was no change management around any of that. The quality of the data wasn't great, because we were basically building on top of the internal models of a transactional database, which isn't really well suited for analytics. Those models were useful and made sense in the context of the service, but not to the rest of the business. All the kinds of problems many people were having.

04:50 So I started thinking about how to solve that, because at the time one of our organization's key strategies was to build more on this data, particularly using machine learning (these days we'd call it AI, but it's the same kind of thing). We were trying to use our data to drive revenue and competitive advantage, but it didn't seem like we could do that on top of the data we had at the time. I felt the root cause was that we were building on top of the upstream database directly. CDC sends it to you directly: same schema, same data, same everything. And as you mentioned, we've both got a bit of a software engineering background, so I started thinking: as a software engineer, that would be a bit of a red flag.

05:32 I wouldn't build on top of someone else's database. Of course it's not going to be reliable; it's not going to make much sense to me, I'm not going to have much context on it, and it's going to change quite frequently. I'd want some kind of interface, which in software engineering might be an API: an interface abstracted away from the database, one I can build on with confidence and that is better quality. The idea didn't come overnight, but eventually I got to: well, why don't we have the same sort of thing for data? That's really where data contracts started, and where the word "contract" came from. An API is a bit like a contract between a provider and a consumer; this is the same thing for data, so it's called a data contract. It's a very simple idea, but once you start thinking about interfaces and describing your data, it becomes a very powerful idea as well.
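To make the idea concrete, here is a minimal sketch of what such a contract might look like: a small, machine-readable document plus a check that data actually matches it. The field names, types, and layout are illustrative assumptions, not a format described in the episode.

```python
# A hypothetical minimal data contract: a document that is both
# human- and machine-readable, describing the data an upstream
# team promises to provide. All names here are illustrative.
contract = {
    "name": "orders",
    "owner": "payments-team",
    "version": "1.0.0",
    "schema": {
        "order_id": "string",
        "amount_cents": "int",
        "created_at": "string",
    },
}

# Map the contract's type names to Python types for checking.
TYPE_CHECKS = {"string": str, "int": int}

def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of ways a record violates the contract's schema."""
    errors = []
    for field, type_name in contract["schema"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], TYPE_CHECKS[type_name]):
            errors.append(f"wrong type for {field}: expected {type_name}")
    return errors

# A conforming record produces no errors; a partial one is flagged.
print(validate({"order_id": "o-1", "amount_cents": 499, "created_at": "2024-09-01"}, contract))
print(validate({"order_id": "o-2"}, contract))
```

The point is not the format (real implementations use YAML, Jsonnet, or a spec like ODCS) but that the same document serves both humans and tooling.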

Speaker 2: 06:24 Cool. Yeah, that's just what we spoke about: in the end, everyone has a lot of issues with data quality because it comes from upstream, and in software engineering we already have a kind of data contract. When I was reading about data contracts in your blog, back in 2021 or so, I thought: why has no one done this before? I really need this in my life. But there wasn't really much information, nothing that could give me the whole picture, because there are many types of systems and many things you can put a data contract on top of, and it was difficult at that time to pull all that information together. Then you came out with the book, and other people started to write about it too, like Jean-Georges Perrin at PayPal and others talking about data contracts, and that made it a little easier. So what are your thoughts on how data contracts have evolved over the years, from when you started until 2024? Have you seen many updates, or anything that has changed quite a lot?

Speaker 3: 07:37 I think it's certainly getting more attention over the years, and more organizations are adopting it, and they're adopting it for a variety of problems. There's a variety of problems people try to solve with data contracts, because data contracts are quite flexible: they can solve the problem you have. For us it was a lot around change management of schemas; for other organizations it might be around data quality they want to improve; maybe data governance as well. So we're seeing more and more uptake of these ideas, and I think that's accelerating these days. Like I mentioned, we had a strategy to do more with our data, and that created a need for better data quality. Because it aligned with our strategic goals, it allowed us to adopt a new idea, shift things a bit left, and ask people to do a bit more work upfront for better data.

08:36 Ultimately that led to a better outcome for the business, and a lot of companies are doing the same with their data at the moment: looking at how they can use their data, maybe with AI as well, to generate value, and realizing that in order to do that you need to invest in data quality, and the best and cheapest place to invest in data quality is at the source. It does require some work from software engineering teams, or whoever looks after the upstream systems, but we're starting to realize it's worth the effort if the outcomes align with business goals.

Speaker 2: 09:13 And one question for you, because I see a lot of people mixing up the concept of a data contract. Yes, it's about data quality, but it's much more than that. How do you see it helping not only to maintain data quality, but also data platform reliability? The other day I was speaking with Yulia in a past episode about data platform reliability, and I mentioned to her that, for me, data contracts are part of that, because it's not only about data quality but also, as you said, governance and ownership. It's so much more. Can you elaborate on how data contracts help maintain data platform reliability and quality?

Speaker 3: 09:59 Yeah, sure. As I mentioned, a data contract is really quite a simple idea: you describe the data in a document that is both human- and machine-readable, and that document is the data contract. That's all it is. And when I say describe your data, exactly how you describe it, what you choose to describe, and what kind of labels you put in there is up to you. For us, from the start, we put in things like categorizations of our data: is it personal data or not? Is it secret, confidential, or public information? Is it about a customer or an employee? Those kinds of simple categorizations allowed our platform layer to automate a lot of things, such as access controls. We could start applying role-based access controls based on those labels, implementing the policy rules centrally, without everyone having to do their own access management on their own data.

10:54 And because, with data contracts, all our data is described in the same standard format, we can quite easily build tooling that applies that automation, those policies, to the data no matter where it resides or who owns it, as long as the contract carries sufficient information. We started doing that even before we implemented the tooling: we had the categorization there because we knew it was going to be important. We then applied that same pattern to a lot of our platform capabilities, really all of them; we haven't found a capability we couldn't implement in the same sort of way. It could be something as simple as backups: you want your data backed up, so you have something in the data contract that says, I want it backed up, I want it backed up daily, and I want it retained for three months. Then suddenly you can easily back up data no matter where it is or who owns it; it's very easy to build tooling for. And it just went from there. Everything we wanted to add as a platform capability, we were able to add through the data contracts, using them as the source of truth describing the data, and the tooling itself was very easy to build.
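As a sketch of the automation described here, a platform layer might derive access roles and backup policy directly from the contract's labels. The label names (`privacy`, `contains_personal_data`, `backup`) and the role names are assumptions for illustration, not the actual schema his team used.

```python
# Hypothetical sketch: a platform reads each contract's categorisation
# labels and derives policy centrally, instead of every team doing its
# own access management and backup configuration by hand.

def access_roles(contract: dict) -> list[str]:
    """Map the contract's data classification to reader roles."""
    if contract.get("privacy") == "public":
        return ["all-staff"]
    if contract.get("contains_personal_data"):
        return ["privacy-approved-analysts"]  # stricter role for personal data
    return ["analysts"]

def backup_policy(contract: dict) -> dict:
    """Derive a backup schedule from what the contract asks for."""
    backup = contract.get("backup", {})
    return {
        "enabled": backup.get("enabled", False),
        "frequency": backup.get("frequency", "daily"),
        "retention_days": backup.get("retention_days", 90),
    }

c = {
    "name": "orders",
    "privacy": "internal",
    "contains_personal_data": True,
    "backup": {"enabled": True, "frequency": "daily", "retention_days": 90},
}
print(access_roles(c))
print(backup_policy(c))
```

Because every contract uses the same format, the same two functions cover every dataset, wherever it lives.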

Speaker 2: 12:02 Yeah, yeah, totally. By the way, for the audience: if you have any questions, just throw them in the chat and I'll put them to Andrew while we're talking about this topic. Andrew, now that you're also doing some independent consultancy, what have you seen in the industries or companies you've consulted for? What are the most difficult things they're dealing with, and what kind of solutions can you propose to those just trying to start with data contracts?

Speaker 3: 12:43 I think the most difficult thing I'm seeing most often is around communication and the people side. The data teams who are looking at this understand why it's important; they've maybe been struggling with data quality for a while, they feel the impact firsthand, and they're looking at how to solve it. But many organizations, not all, have trouble getting buy-in from other parts of the organization: why is data quality important, what impact is this having on the data teams, the applications they're building, and the value we're trying to deliver through data? How do we incentivize other parts of the business to provide better-quality data, how do we get them to care about data quality, how do we communicate in terms they understand, and how do we link this back to the business strategy? So a lot of it is about communication; a lot of it is the people side. Tooling-wise it's still a bit of a challenge too, though it's more mature than when I started; at least there are patterns and prior art you can look at and get inspired by. But the people side of organizational change is unique to every organization: different people, different structures, different characters. It's all very different, and how you navigate that is probably the biggest challenge many organizations face with data contracts.

Speaker 2: 14:13 Yeah, and there's always this kind of Game of Thrones where you have to go and preach a little bit here, preach a little bit there, and it takes forever. How long did the whole implementation take until you felt comfortable? And how did you do it: all at once or little by little? How was your experience with that?

Speaker 3: 14:36 The actual tooling bit didn't take that long. Once we had a design, and because we were building on top of things we already had, like our infrastructure-as-code platform, it was quite easy; we did an MVP in a few weeks, really. But before that we had to be very clear about what problems we were trying to solve and why, and get buy-in from different stakeholders. It was a bottom-up approach for us: I was the leader of a platform team, not the CTO or anything like that, so I couldn't enforce change on the business. I had to really explain it, and it wasn't just me: I had a great product manager at the time who was really helpful, and the people on the team really helped too. But we had to do a lot of talking to people, a lot of explaining why this is important, what problems we're trying to solve, and why it would help the company reach its goals.

15:29 All that talking probably took maybe a year from start to end; I wasn't doing it day to day, but it took a while before we had enough buy-in and felt quite certain this was a good journey to start on. The tool itself didn't take that long, and we identified some clear places where we could add value quite quickly through data contracts. It wasn't necessarily the most important data set in the company, but it was one where the team was already quite data-driven; maybe they had data scientists embedded there, they were using data to create value and build products that would drive revenue, and they were working quite closely with the other team producing the data. That's quite a good situation to be in, because everyone involved was incentivized for this to work.

16:23 There was cross-platform, cross-team collaboration there anyway, and we were just enabling it through a bit of tooling and a slightly different mindset. So we did that quite quickly, and from there it became more of an iterative process: finding the next problem we could solve, adding a bit of value there, adding a bit more to our platform capabilities, then on to the next problem. It took many more years, but eventually it became part of our culture. Now when we talk about how we do data, we just do it through data contracts; it's part of our technical strategy from the CTO. It's just how we do data now.

Speaker 2: 17:01 And did you find a lot of resistance to this?

Speaker 3: 17:06 Not as much as I expected, really. I get asked this quite a lot: how would you get, say, a software engineer to care about data quality? Won't they never care, aren't they too busy, things like that. It probably helps that I have a software engineering background, but software engineers like solving problems. If you explain the problem you're having, they'd love to help solve it. And if the solution is a data contract and a bit of their work, but they understand why it's valuable and why their work matters, they're actually quite receptive. Eventually it becomes a game of prioritization, and every company has issues with prioritization: there's always more you could do, things have to get prioritized, and then again you're going back to the value. What's the value?

17:58 How do I get the other team to do the work? Ultimately it's a better outcome for the company, that's the value, but that's just the reality in every organization: you try to get involved in how the organization prioritizes, whether that's quarterly planning or OKRs or whatever process you use. So that's its own challenge. But I wouldn't say there was massive resistance. People were quite open to it, open to talking about it, and once they understood the problems and felt they were involved in choosing the solution, they were quite happy to get involved.

Speaker 2: 18:35 Yeah. And do you think this increased trust in the data team? In my case specifically, I saw that having data contracts, an automated data platform, and more efficiency really changed the way people see the data team: there's more trust. Before, it was "the data team broke the pipelines, we don't have the reports," and this and that, when the changes were coming from upstream and it wasn't even our fault. We spent a lot of time trying to figure out where the problem was, because it was a big company with a lot of people working on different things, and the backend was a monolith, so it was complicated. Was it the same in your company? Did you get more trust, and were people happier to use the data you were providing?

Speaker 3: 19:26 Yeah, I think it did help to increase trust. As we matured and had the contracts, fewer pipeline steps were necessary; the data was better quality upfront and could maybe even be used directly. And because there was less complexity in the pipelines, we had fewer incidents as well, which is a good sign: fewer things went wrong, and that improved trust. I think it also improved the empathy people felt for the data teams. One good way of doing this is to encourage your data teams to follow an incident process when something goes wrong in a pipeline and do a root cause analysis, getting people involved in that, including the software engineers who own the upstream data. That really helped bring them closer together, because now it wasn't just "our pipeline failed, must be a data problem." It was: okay, the pipeline failed, and it affected this key process that is really important to the business.

20:20 Why did the pipeline fail? We play the five whys game; there are different ways you can do this. Well, why did the pipeline fail? This thing changed upstream. Why did it change upstream? The engineering team needed to make a change to the database to improve performance. Why was there no change management? You keep asking why, and eventually you get to the hard questions, to the root cause, and you start questioning some of the assumptions you made before: why can't we have more change management upfront? Why are we building on top of a database? So that also helps to improve not just communication but empathy, and that in turn reduces resistance and builds buy-in, particularly towards the data teams, which is really important if you want to roll this out to the whole organization, not just within your data team.

Speaker 2: 21:11 Yeah. And talking about the other parts, like having ownership in the data contract: did it really help you maintain clear ownership and accountability for the data across different teams or departments? And I mean not only technical owners; were you also putting business owners in there, so people understand who to contact if something happens?

Speaker 3: 21:38 Yeah, ownership was another big problem we were trying to solve. We also had a monolithic architecture, so there was a lot of shared ownership of tables in the database, and obviously shared ownership means no clear owner. That's the kind of problem we were trying to solve. It's not immediately solved just by data contracts; it requires a bit of organizational change as well. But every data contract had to have an owner upfront, a single owner, set by whoever created it, and that helped. We kept ownership quite simple: normally it's a team that owns it, the team generating the data, particularly our software engineering teams. Most of our data contracts were around data that comes from engineering; we haven't really applied it so much to third-party data, which comes with different challenges, mainly because for us the most important data is our own.

22:25 It's our internal data that we're producing. So it certainly helps to have owners. We did it at the team level, but we made them accountable, and not just by telling them they were accountable. If there was an alert saying the contract was breached, for lack of a better word, meaning they were producing data that didn't match the contract and it was throwing an error somewhere, the people who put themselves down as the owner when they created the data contract would get the alert, and they would deal with it. So immediately they become accountable: they're the ones getting the alerts, they're responsible for maintaining that data, and it's clear what it means to be the owner of that data contract. It's a small change in technical terms; routing alerts is quite easy. But we were able to do it because we had enough buy-in to set the expectation that if you own a data contract, you're accountable and responsible for it, you're going to get the alerts, and if you're providing 24/7 support for a data contract, you're the one who's going to get paged at the weekend if it's broken. So yes, a lot more accountability as well.
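The alert routing described here can be sketched in a few lines: look up the owner declared in the contract and send the breach alert there. The `notify` function and the channel naming convention are stand-ins for whatever paging or chat integration an organization actually uses.

```python
# Hypothetical sketch: route a contract-breach alert to the owning
# team declared in the contract itself, so accountability follows
# automatically from the ownership field.

def notify(channel: str, message: str) -> str:
    # Stand-in: a real system would page on-call or post to chat here.
    return f"[{channel}] {message}"

def route_breach_alert(contract: dict, violation: str) -> str:
    """Send the alert to the owner named in the contract."""
    # Fall back to the platform team only if no owner is declared.
    owner = contract.get("owner", "data-platform")
    return notify(f"#{owner}-alerts",
                  f"contract '{contract['name']}' breached: {violation}")

c = {"name": "orders", "owner": "payments-team"}
print(route_breach_alert(c, "field 'amount_cents' missing"))
```

The technical change is small, as he says; the hard part is the agreed expectation that owners act on what arrives in their channel.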

Speaker 2: 23:40 Yeah, I think that's one of the hardest parts: getting accountability, and not only by telling someone "hey, you're the owner" and that's it. It's more than that. It's a lot of conversations that need to happen, and also setting boundaries around what you're in charge of, and why there needs to be communication, whether through a data product owner or a project manager between teams, or the technical lead, or whoever, to understand these changes and when they're coming. For us, one way we reinforced the contracts was in every pull request from the backend of that specific service: they couldn't push unless we had given some kind of approval, if we saw they were changing something in the schema. That was one way to do it, and then we'd run some quality tests and see whether something would break.

24:32 So we'd say: hey, stop, we can't continue like that. You need to let us know before you make changes, so we can adapt and tell the consumers what's going to happen. Because if something breaks, the data team is the one who's going to be blamed, pretty much. Sometimes you need to enforce these rules technically, in a very strict way, because spoken agreements alone don't really work at the beginning, not until people make that mental shift and it comes naturally. And it takes time to get to that point.
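A pull-request gate like the one described above can be sketched as a comparison between the schema under contract and the schema a change would ship, failing the build on breaking changes. The schema representation and the rules (removals and type changes break; additions are fine) are illustrative assumptions, not the actual check Amy's team ran.

```python
# Hypothetical CI check: compare the schema a service is about to
# ship against the published data contract, and block the merge on
# breaking changes (removed fields, changed types). New fields are
# additive and therefore allowed.

def breaking_changes(contract_schema: dict, proposed_schema: dict) -> list[str]:
    problems = []
    for field, type_name in contract_schema.items():
        if field not in proposed_schema:
            problems.append(f"removed field under contract: {field}")
        elif proposed_schema[field] != type_name:
            problems.append(
                f"type change for {field}: {type_name} -> {proposed_schema[field]}")
    return problems

contract_schema = {"order_id": "string", "amount_cents": "int"}
proposed = {"order_id": "string", "amount_cents": "string", "note": "string"}

problems = breaking_changes(contract_schema, proposed)
for p in problems:
    print("BLOCKED:", p)
# A real CI job would exit non-zero here when `problems` is non-empty,
# forcing the producer to talk to consumers before merging.
```

Wiring this into the pull-request pipeline turns the spoken agreement into a mechanical one, which is exactly the enforcement step being described.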

Speaker 3: 25:09 Yeah, it takes time, and it does require the human side first, the communication side. You can't just tell someone they're the owner without explaining why they need to be the owner and what it actually means, and without giving them the tools to own that data effectively. For example, in this case we're sending them alerts, but we're not sending alerts from a system they don't understand: they've been involved in setting it up, they've done it themselves using our tooling, and they've got runbooks we've written and provided for how to act on the alerts. We've got quite good tooling now, so being a data owner is not that hard; it's not a scary thing anymore. It's just something like owning an API or owning your database: there's infrastructure as code behind it and cloud resources behind it, and you also own a data contract. We've got the tooling in place so that you can own your data without having to be an expert in GDPR or data regulations.

26:06 You don't have to know how to do all the backups and such; you know all of that is done for you. And if something were to break, there are clear runbooks, and you're supported in that ownership. I think that's a lot better than being told "you're the owner" without understanding what it means, which starts you off with friction and a bit of reluctance. So we really try to support those owners.

Speaker 2: 26:37 And I think it's a shared responsibility at the end of the day, because it's not only the producer. If you're one domain just getting info from upstream, running the pipeline, and then someone else is consuming it in another form, the consumers also need to stay on top of it, like I told you last year in London. I'm the producer of this domain, but I don't know what you're using the data for, or how you're really using it. So you also need to stay on top of things if something is going to affect you. This communication needs to flow in all directions, not just from one side or the other, because then you share the sense of ownership and accountability we just spoke about.

27:21 So that's also quite important: it's a shared responsibility for everyone if you really want to make it work. I've also seen some people, in a previous company I worked at, where I just wanted to throw myself out the window: they were trying to do data contracts in Notion. I asked, why are you doing that? "Because we put the table here, we put who owns it here, we put the schema here," and so on. But no one is really maintaining it. This person did it once, it took them forever, and I don't think they're going to come back and check it every time there's something new, because they won't know until something is broken. So what's the point of having a written data contract? And speaking with people at conferences, some told me they were doing the same but in Excel, for example, and I thought: why are you even putting yourself through all that work? It loses the purpose of the data contract, the automation, the efficiency it creates. And I think that happens a lot because people are still a little bit lost.

28:28 So I wanted to get into the nitty-gritty of the technical aspects. What are some best practices for implementing data contracts within an organization, and what are the common pitfalls to avoid?

Speaker 3: 28:43 Yeah, I think one of the best practices is to think about who's going to be writing these data contracts. You mentioned people writing them in Notion or Excel, and I used to think, at least you get something written down, so it's kind of a good start. But really, like you said, on its own it doesn't solve any problems; it just gets written once and forgotten about, like most documentation. So think about who's going to be writing these data contracts and how they'll be used, and tailor everything around that. For example, we knew we wanted them written by software engineers, and therefore we wanted to make them as easy for engineers to write as possible. So we chose to write our data contracts in a somewhat niche language called Jsonnet, which is a configuration language, and we chose it not because it's a great way to define data contracts.

29:33 I wouldn't recommend it if you're not using it already, but it's what our software engineers were used to: we used it to define infrastructure as code, we used it to define our APIs, and we wanted them to be able to define their data contracts the same way, in the same repo, right next to the service, with full autonomy to change it, no review from me or from the central team. It's their data contract. So I'd say the best practice is: focus on who's going to create the data contracts. It might not be, and probably isn't, the data team; it's probably the owner of the upstream data. How would they most want to define the data contracts, in a way that's most comfortable for them, and what tooling do you need to provide to support them in managing the data contract and the data underneath it? I think that's the best practice.

30:22 In terms of common pitfalls, it probably goes back to what I said a minute ago about forgetting the people side: you just build tooling and hope they'll come. If people don't really know why it's important, they're not going to put the extra effort in. You're asking people to put more effort in upfront, and it's not about being lazy: if we don't know why, we're not going to put the effort in, we're not incentivized to, and we'll find other things to do, or do the minimal amount of work needed. That's a problem we had. Although it's been a relatively long journey, we've got about as far as we wanted to go, but we did try to go a bit fast at one point. We had this idea of a big migration project, migrating everything to data contracts by a certain date, but there was no real reason for that date, and we chose the wrong metric.

31:23 We said we were going to measure how many data sets had migrated to data contracts. And if your team is being told you have to migrate four data contracts this quarter and you don't really know why, what are you going to do? You go and choose the four easiest ones and do those. That's not really the intention we had. We wanted to prioritize the most important ones, but because people didn't understand the value, they did the minimum amount of work to check the box. So that's a mistake people make: we lost sight of the value we were trying to achieve with data contracts, and it became a numbers game, a target game. So yeah, losing track of value and not bringing everyone with you is probably the biggest pitfall to avoid.

Speaker 2: 32:13 Thank you, Andrew. I have a couple of questions here. From Richard Atkins: one aspect that I'm struggling with is how to centrally host the data contract. Do you recommend using a schema registry for this?

Speaker 3: 32:27 Yeah, it's a good question. A schema registry can be used to host data contracts. What we did, and again this goes back to how you want data contracts to be created, and then you can think about how they're managed: we wanted them created by software engineers, and we wanted them version controlled in git, because that gives us the usual git workflow with pull requests and review. So the actual source of the data contracts for us was in git, and from there we pushed them to other places. We pushed them to a data catalog so we could view those data contracts and search for them. We also pushed them to a schema registry so that tools could access the data contracts and use them to encode data, or whatever they need to do with the actual schema itself. So that's how we did it. I don't know if the schema registry should be the source of truth for data contracts; I think it's more a representation of the data contract. The source of truth should most likely be in git or some other version control system that humans can more easily interact with.

Speaker 2: 33:45 Thank you. Another question from the same person: our company is looking to adopt a tool such as Liquibase for OLTP database change management and schema evolution. Do you recommend data contracts as a replacement for such a tool, or do they fit alongside it?

Speaker 3: 34:04 Okay, so I don't know Liquibase personally, but it sounds like it's trying to apply change management on top of the OLTP database itself, and schema evolution around that. And that's an interesting solution, because if I were a software engineer and that was my database, I'm not sure I'd want someone else putting change management on top of it. I'd want to be able to change that database as I need to, to deliver the features I need to deliver, or to improve performance, for whatever reason. High-performing software engineering teams are changing their databases regularly, and they don't want to be told they can't, or to have to get approval from a central data team. It also doesn't scale very well, because you've then got a central data team, for example, reviewing schema migrations in a database.

34:56 If I don't own those databases, I don't understand their full context, so I'm not sure that scales well. What we really want to do is move away from just moving databases around, and bring the data up to a separate, more explicit interface, like an API; and that's what we use data contracts for. We don't apply data contracts to the database to try to manage change or improve schema evolution; we explicitly decided not to do that, and to create a separate interface instead. So people push data through the data contract to, in our case, a BigQuery table or a Pub/Sub topic; it could be Kafka, of course, it doesn't really matter; to somewhere else where it can be consumed from. So we're now breaking that link between the upstream database and the data consumers. That means the data consumers don't have visibility into anything like the upstream database.

35:52 The upstream database is, in a way, messy: it's designed for transactional use cases, it's got terms in there that make sense to product engineering but not to the business, and it's not great quality on its own. What we can do instead is have a separate interface, and rather than it just being a copy of the database, it can represent the data in a way that's much easier to consume. On its own, that is a data product that can be consumed by anyone in the business, not just someone who knows that particular database and its schema or can read the ERD diagram. Anyone in the business can understand it and use it to create some analytics or make a decision. So to answer your question, Richard: I'm not familiar with Liquibase, but my assumption is that it's trying to put change management on top of a transactional OLTP database, and I think that's a pattern we want to move away from.
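The separation Andrew describes, a messy transactional schema behind a clean contract-shaped interface, can be sketched as a simple projection. This is an illustrative sketch only; the internal column names and mapping are hypothetical, not from any real system discussed here.

```python
# Hypothetical sketch: the upstream table keeps internal, transactional
# column names; the data-contract interface exposes business-friendly
# ones. Consumers only ever see the contract shape, never the raw row.
COLUMN_MAP = {
    "usr_id": "customer_id",
    "amt_c": "amount_cents",
    "crtd_ts": "created_at",
}

def to_contract_shape(raw_row: dict) -> dict:
    """Project an internal DB row onto the contract's public schema,
    dropping any internal-only columns."""
    return {public: raw_row[internal]
            for internal, public in COLUMN_MAP.items()
            if internal in raw_row}
```

The upstream team can rename or add internal columns freely, as long as this projection keeps producing the contract shape, which is exactly the decoupling a CDC pipeline doesn't give you.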

Speaker 2: 36:55 Thank you, Andrew. A reminder to the audience, if you arrived late: you can ask questions in the chat and I will pass them to Andrew. Let's talk about automation and scalability. How can data contracts be automated in large organizations with vast numbers of data pipelines, and what role do they play in scaling data governance?

Speaker 3: 37:18 Yeah, I think in large organizations it's going to be a bit of a longer journey, but what you want to do with data contracts is try to standardize your data platform into a finite number of ways you can move data around and make data available. If you can do that, it's much easier for anyone consuming data. In our case we're a smaller company, so there are just a couple of ways to get data through: Pub/Sub and BigQuery, and those are the only ways we support, though we could easily add a few more if we needed to. We try to keep it quite minimal. In a large company, maybe with growth through acquisition and things like that, it might be more difficult. But if you can standardize it, then anyone consuming data knows it's got data contracts and knows exactly how to consume it.

38:05 So: this data contract is called this, so I know it's going to be over here in BigQuery or over here in Pub/Sub, it's going to look like this, it's going to be in this schema, and the tables are going to be named consistently. It just becomes obvious where to find the data, and ultimately you can make it available via a reporting tool and it all looks the same; it's all standardized. I think that's one of the benefits of data contracts: you can standardize on them. And once you have a bit more standardization, it's quite easy to build the automation, and it's worth investing in it because you're applying it to many data contracts, hundreds, maybe thousands of them. So you can invest more in automation; the returns are greater.

Speaker 2: 38:44 Yeah. One follow-up question here to the one we answered before, and then we'll go back to scalability. Do I understand from what you're saying that the data contract is supplementary to the source database's schema, rather than looking to replace it?

Speaker 3: 39:02 Yeah, that's right. It supplements it in terms of how you consume data. We are basically moving away from our change data capture architecture and moving to data contracts. So in the future we will not have change data capture, we'll not have ELT, we won't have any of that sucking data out of databases; we'll only be providing data through data contracts, and those data contracts need look nothing like the database at all. It's a bit like going back to the API analogy: think about using the Stripe API. Behind the scenes, Stripe is surely changing a lot; their microservices have surely changed a lot over time. But people can build businesses on top of Stripe with confidence, knowing the API won't change. And that's what I think we want for data. Maybe not quite to the same level internally, but it should be the same aspiration: I want to be able to build with confidence on top of this interface, the data contract. I don't really care what's happening behind the scenes upstream; that's not my problem. As long as the interface is stable, I can invest more in building something of value for the company, because I've got confidence in the data contract. You don't get that confidence if you're building on top of a database itself.

Speaker 2: 40:19 Thank you. Going back to scalability: how do you prioritize which tables or APIs to work on first, in order to scale, as you said, incrementally but without disrupting too much?

Speaker 3: 40:41 Yeah, at the start I think you focus on the ones where you can deliver value quite quickly, and maybe where you've already got good buy-in with the teams: you've got good personal relationships there, or they're quite data-driven already, or whatever it is; they're ready for this change, they've bought in already. So start small, with something of value, but not necessarily the most important thing, because you're still proving the concept at that stage. Then, as you do that a few times and start to prove the concept, I think it's best to prioritize other ones. Again, you're looking at business value, but maybe greater business value. So maybe there's a key business process where, if it fails, you have to tell the regulator about it, and it's been failing quite a lot recently, and the VP of that particular area is not happy.

41:32 Maybe that's the one you start focusing on, and you ask: why does this process keep failing? It becomes root cause analysis, the five whys or whatever you want to use. Why is it failing? You find out it's because your data is not dependable enough, because of upstream changes. You've then got a clear use case for data contracts, a clear value you can add there. By that point you're more mature: you've got the capabilities in place, you've proved it a few times, there's a pattern your company seems comfortable with, and you've got proven success behind you, so you can start going after bigger problems.

Speaker 2: 42:08 Yeah. Do you have an example of how you prioritized things in your company? How did you choose one before the other: because of pain points, or maybe business value?

Speaker 3: 42:22 Yeah, there are a few different ways we prioritize. In some cases we have the VP of a particular area coming to us saying: I'm concerned that my process is failing over time, and my team is telling me it's the data that keeps breaking. Okay, well, let's dig into that more. And normally it's not the data team breaking the data; normally the cause is upstream, and then we can start applying data contracts there. So sometimes people come to us. Other times you can do some sort of business mapping exercise and identify the processes that are important to the business, then go through them and maybe score them in terms of how reliable they are. Maybe you've got some data from your incident management process, if you're using one: some data that says this particular process has broken ten times in the last year or something, so it seems like a good candidate to fix.

43:13 That's nearly one a month. If you haven't got that, then maybe you can do some sort of survey, or whatever it might be, but try to find something to back it up, and then use that to prioritize: use the mapping of business processes, plus the risk associated with each business process and the data it depends on, as a means to prioritize which ones you look into. And for those processes, going back a bit to what I mentioned earlier, it's not necessarily just about changing the upstream data from CDC to data contracts in the same structure. If you look at the whole business process, you might be thinking: well, our business process could be automated completely if we did a bit more work up here. What if we provided different data up front? Could we do a bit less ETL, and does that help with reliability, or at least reduce costs, or whatever metrics you really care about? So really look at it from first principles and say: okay, we have this new approach to data; how would we solve this problem using this approach today? Then solve that problem as best you can, using data contracts where it makes sense to use them.

Speaker 2: 44:33 Yeah, I think in this case, because I do a lot of this, mostly because we attach a data contract to each data product, I do a lot of event storming just to understand all these connections, how they map to other data products, and where we need to do things, just to keep track. I know that for really large companies this might get complicated at some point, because it becomes very, very large, but there are some tools out there that help map these data contracts and show the data products in them. But yeah, understanding the process and which one brings the most value at that moment, because it's critical for the business, is definitely one of the smartest things to do. Regarding multi-cloud environments: many organizations now are in BigQuery, well, in Google, in Amazon, in Snowflake, you name it; they have a lot of things happening right now. What would a multi-cloud strategy look like, and how do data contracts help manage data consistency across all these cloud environments?

Speaker 3: 45:41 Yeah, I think that question goes more to how I see we should be building data platforms, or platforms more generally. A lot of my ideas in this area come from the topic of platform engineering, which is a topic lots of software engineers and people who build platforms in general are interested in. It comes from software engineering and software platforms, but it applies equally to data platforms too. The most recent idea there that I find interesting, and that we've done with data contracts, though not as explicitly as this, but with some of the same intention, is that your data platform can abstract away some of the multi-cloud differences. So for example, you say, using my data contract: I want a table in a data warehouse in Europe, with these backups.

46:44 And maybe your company, if it's in Europe, always uses Snowflake; that's the thing you use. But as a user, you didn't have to make a decision about Snowflake; that decision was made centrally for you. You're just using the tooling, and you get the best practice there. Your intention is to have data in a data warehouse, and the platform takes care of that; it's abstracted away from you by good central decision-making. It allows you as a software engineer, or data engineer, or whoever is creating the data, to focus on solving your problem: why are you creating this data? And from a governance angle, you're essentially defining these policies and saying: this is how we do things. If anyone uses data contracts, they are using the best practices and following the policies.

47:37 So I think it goes to how you build your data platform, or your platform more generally: data contracts are a great thing to build a data platform around. I sometimes call it a contract-driven data platform. And it's about the right level of abstraction. Getting the level of abstraction right is very hard, a topic for a whole other episode maybe, but I think that should be the aim if you're building a data platform: provide a level of abstraction that allows people to get on with their work, with the complexity hidden behind it, whether that's your multi-cloud strategy, how you use different clouds for different geos, how you back things up in different clouds for redundancy, or whatever else is happening behind the scenes. Try to hide that from people as best you can; not so completely that when something goes wrong they can't debug it and you're always on call for them, but yeah, try to abstract that in your platform.

Speaker 2: 48:38 Yeah. And now that we're talking about how this evolution has gone, what do you see in the future of data contracts? How do you think they are going to evolve, maybe adding things for security, because now security and regulatory compliance are a big thing, and maybe also in the governance space? How will data contracts keep up with emerging trends and technologies to enhance their importance and effectiveness?

Speaker 3: 49:14 Yeah, I think that's a good question, because the basic idea of data contracts is quite simple. So whatever you need to put in a data contract, I think you can build tooling to apply those policies to the data contract and to the data underneath it. For example, I've seen companies state in their data contracts that this data shouldn't leave the EU, and then codify that in a policy, in response to regulation. They didn't have to change their implementation of data contracts for that; they had to build some tooling around it and define something new in the data contract, but the actual idea of the data contract is the same; it hasn't really changed. So I think the data contract idea, the pattern of it, is flexible enough to handle any of those emerging requirements. And I think that's why we'll see them used more; not just for the main problems people are trying to solve with them, which are around getting people to talk to each other more, having agreements around data, and improving quality, those kinds of things.
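The "data shouldn't leave the EU" example can be codified as a small policy check run over contract metadata, for instance in CI. This is a hedged sketch: the `residency` and `regions` keys and the region names are hypothetical, not part of any specific standard or platform mentioned in the episode.

```python
# Sketch of codifying a residency policy against contract metadata:
# contracts marked "eu-only" may only target EU regions.
# The contract keys and region names here are hypothetical.
ALLOWED_EU_REGIONS = {"europe-west1", "europe-west4"}

def check_residency(contract: dict) -> list:
    """Return policy violations for one contract's declared regions."""
    violations = []
    if contract.get("residency") == "eu-only":
        for region in contract.get("regions", []):
            if region not in ALLOWED_EU_REGIONS:
                violations.append(
                    f"{contract['name']}: data in {region} violates eu-only policy"
                )
    return violations
```

The point Andrew makes holds here: the contract format didn't change to support the regulation, only a new field and a new check were added around it.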

50:14 Those are still important, but I think we'll see data contracts used even more in how we manage data and governance, because regulation is changing, it's changing quite quickly, and there's going to be more of it in the future. So I think we do need to be prepared for that, and I think data contracts are a good foundation to build upon. Another thing I think, or hope, we might see in the future is a bit more standardization around data contracts. Not necessarily that every organization will have to define them in the same way, but what I see a lot is each organization defining them in its own way. I mentioned we define ours in a bit of a weird language, Jsonnet, but people often define them in YAML; it could be anything, it could even be in Python code. I've seen that, and it will be fairly specific to each organization.

51:02 But what we end up doing is converting that to many different representations. So we convert ours, coming back to the schema registry again, so that Pub/Sub can use it to enforce schemas. We convert it to a JSON schema format for BigQuery to create tables. We convert it to JSON Schema for software libraries. We end up converting it to many different formats. That's not hard to do, but it's a bit of a pain. But what if we could convert it to one standard format, and then it just plugs into any data catalog you want, or any governance tool, or any access management tool, or whatever tooling is out there? What if we could just have a standard contract interchange format for that? As a platform builder, if I can represent my data contract in that format, I get access to all this tooling, and the onboarding is very easy; and the same for offboarding.
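The fan-out Andrew describes, one contract definition converted into several tool-specific representations, can be sketched as a pair of small transforms. The input contract format below is hypothetical, and the outputs only mimic the shape of a BigQuery-style field list and a JSON-Schema-style object; they are not complete or official mappings.

```python
# Sketch: one contract definition fanned out to two representations.
# The input format and the exact output shapes are illustrative only.
def to_bigquery_schema(contract: dict) -> list:
    """Render the contract's fields as a BigQuery-style field list."""
    type_map = {"string": "STRING", "integer": "INTEGER"}
    return [{"name": n,
             "type": type_map[s["type"]],
             "mode": "REQUIRED" if s["required"] else "NULLABLE"}
            for n, s in contract["fields"].items()]

def to_json_schema(contract: dict) -> dict:
    """Render the contract's fields as a JSON-Schema-style object."""
    return {
        "type": "object",
        "properties": {n: {"type": s["type"]}
                       for n, s in contract["fields"].items()},
        "required": [n for n, s in contract["fields"].items()
                     if s["required"]],
    }
```

A standard interchange format, as Andrew hopes for, would mean writing each of these converters once per tool rather than once per organization.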

51:59 If you want to change, there's no vendor lock-in. So that's one thing I would like to see in the future, and there is one: a friend of Data Mesh Learning, Jean-Georges Perrin, and myself and many others are involved in the Open Data Contract Standard, and I'm part of a committee on that; there are quite a few of us involved. I've seen quite good momentum behind it, at least internally, in terms of how it's progressing. We still need much more adoption from vendors and from organizations before it becomes the format I would like to see it become, but in the future I think the value of that would be very high. So it'd be great if that's something we see in the future.

Speaker 2: 52:47 Oh, for some reason I was muted. I think that's actually happening right now, because I have seen some data catalogs that already have the feature to show a data contract, or are working on it, because with all that metadata we can just bring things in from there and surface them for people. One of the concerns I always had about data contracts is how we make them visible and readable for business people as well, and I think this is going to be a game changer when these kinds of things become a trend, because for business people it's going to be much easier to understand and to have the accountability we spoke about, as well as when things change or when they need something else. So it's coming, and I'm really hopeful about the future of data contracts, because I'm a fan; I'm your biggest fan. And just one more thing, because we have seven minutes left: I would like you to give advice to any data leaders who are just beginning to explore the implementation of data contracts in their organizations.

Speaker 3: 53:59 So I think what we've discussed here has been quite a good summary of data contracts. The main thing with data contracts is to make sure you know what problems you're trying to solve, because they can be applied in quite a few different ways. Keep in mind what problem you're trying to solve, and make sure you implement data contracts in a way that solves the problems you've got. So yeah, my biggest advice to data leaders is: be clear on what problem you're solving; speak to everyone involved in solving that problem, which includes the people producing data, who might not be in the same org chart as you. It might require going through organizational barriers, which is always more of a challenge, but it's a necessity, and if you're a data leader you're probably used to that. Just be clear on the problems you're trying to solve with data contracts, and on how you want to solve them.

Speaker 2: 54:52 But you've also written a new book, right? It's a 101 on implementation?

Speaker 3: 54:56 Yeah, yeah. I'm just writing a short ebook at the moment called Data Contracts 101, and it really summarizes what we've spoken about today, for data leaders. So if you've heard about data contracts, maybe today, and you're thinking, well, this sounds like it could be useful for us, some of these problems sound familiar, then this ebook will give you a good view of how data contracts actually solve those problems, and something you can refer back to. From that, you can make a decision about going deeper: maybe you want to read my full book, which goes into a lot more detail but obviously takes longer to read, or maybe you want to go in another direction; but at least you can make an informed decision. That ebook should be released in the next week or so. You can get it from dc101.io; that's DC for data contracts, and the number 101, .io. It's a free book you can download, and if you sign up now you'll get an email as soon as it's released.

Speaker 2: 56:03 Yeah, actually, our community managers are adding that to the chat right now, so feel free to sign up and get your free copy. Next week, you said, Andrew, right?

Speaker 3: 56:14 I hope so. Yeah. That’s the plan.

Speaker 2: 56:16 Yeah. And now that you are becoming an independent consultant as well, where can we find you and how can people connect with you?

Speaker 3: 56:28 Yeah, you can find me and all my details on my website, andrew-jones.com, and that links to my LinkedIn. It also describes some of the services I'm providing, which include workshops; I've got a couple of workshops coming up in the next month or so, one in London and one in the Netherlands. I also provide various other services for helping organizations that are trying to use data contracts to really change the organization, to transform it into one where you can use your data to provide real value, perhaps revenue-generating value, and you can do that because you can build on it with confidence. That's what I'm trying to do: help organizations reach that goal in the same way that I did at my previous organization.

Speaker 2: 57:16 Perfect. Well, that's all the time we have for today. Andrew, thank you so much for joining us and sharing your insights, and thank you to the audience for tuning in. Remember, next month I will be with Kendale El Omari for another episode of Data Chronicles, about data ownership in data mesh: another important, and quite painful, topic. So if you want to hear about it firsthand, please join us on the 17th of September, and join us next time for more insights into the data world. Until then, stay data driven.

Data Mesh Learning Community Resources

Ways to Participate

Check out our Meetup page to catch an upcoming event. Let us know if you’re interested in sharing a case study or use case with the community.