A summary of our Data Mesh Learning community roundtable discussion on September 7.
The Data Mesh Learning community hosted a roundtable discussion with data mesh practitioners to discuss data contracts and what goes inside them.
Some of the questions posed during the discussion included:
- What should you put inside a data contract?
- Who should own and govern the data contract?
- How do you format and manage data contracts at your company?
- How do you include both structured and unstructured data in a contract?
- Should we collectively build an open standard for data contracts?
The discussion was facilitated by:
- Jean-Georges Perrin, Senior Data, AI, and Software Consultant & President and Co-founder AIDA User group
- Scott Hirleman, Founder and CEO of Data Mesh Understanding
Read the Transcript
Scott Hirleman (00:00):I’m saying recording now.
Jean-Georges Perrin (00:02):Recording in progress to the cloud and beyond. Okay. Admit, let’s admit everybody. Hello people. Oh no, Eric is here again.
Scott Hirleman (00:19):Hey everybody. So JGP, as we’re letting more and more people kind of flow in, why don’t you start a little bit with what you were thinking about. We talked about data contracts a couple of weeks ago, so what were you thinking about for this session? Why did you want to dig even deeper into this, since this is such a complex topic?
Jean-Georges Perrin (00:43):I am getting more and more passionate about data contracts. I think they are really great for a lot of things and we had a very interesting discussion a couple of weeks ago, three weeks ago, well anyway, in our series there, and I didn’t get enough of it, so I just wanted more. So I wanted to put people’s brain here to understand, well what they put in the data contract and what their goal is with the information they put in the data contract. So that’s really what I want and the shape of it will conclude from that. You asked the question, here’s a complex one. Yeah,
Scott Hirleman (01:35):But I mean, you’ve been building an open source standard around this. You’ve been building stuff specifically around this. You’re deeper in this conversation. Is it that you think that people are missing something? One thing that Zhamak Dehghani has talked about is that what we learned from API contracts especially was they started overly rigid and now they’re much more flexible. And she talked about, I can’t remember what law it is, it’s like Postel’s law or something like that, about being very liberal in what you accept from others and very conservative in what you do so that you don’t change a bunch of things for people. So where do you think that there’s a gap? Is it that people are going for the easy things first because there is value in moving from pre-data contracts to data contracts, even if they’re not fully complete? I just want to understand where you’re coming from before we open it up to everybody.
Jean-Georges Perrin (02:46):Well, so I think there are some similarities between API contracts and Swagger and data contracts. Okay. I think it’s a good parallel between the two. Whether we should be stricter or actually more loose on what the content of the data contract is, this is where Zhamak Dehghani has a position on it, and it’s great because she shared a position on it. I don’t know. I also wonder if it’s a position of the contract or if it’s a position of the tooling that is using the contract. And as you said, I’ve invested quite a bit of time, while I was at PayPal and now that I’m not at PayPal anymore, in open sourcing the data contract. And I think that building an open standard for data contracts would be a great thing. I’m feeling very strongly about it and I welcome anybody to join. It’s an open standard. It’s not JG’s standard, so join the community in building that. But this is where it’s interesting to know: when you’re thinking about something like Postel’s law, do we keep this information at the level of the contract, or is it the tooling that is actually being more tolerant of the contract?
Scott Hirleman (04:34):And the thing about Zhamak Dehghani is that she understands where we need to go, and I think more and more people are seeing where we need to go, but it’s also okay to start with mediocre until we get there because it’s better as long as we understand that we’re not locking ourselves in. But yeah, I’d love to open it up to folks. And I see Jay, we’ve got on here as well. I know Jay’s been doing a lot of stuff around data contracts as well, but I just kind of want to open it up to folks with what JGP has been saying, which is what are you tying to the tooling versus the standard? What are you putting into a data contract? How are you communicating to people about that data contract? I’d love to just hear how people are doing things. Go ahead Paul.
Jean-Georges Perrin (05:26):And you’re on mute, Paul. There we go. You’re not on mute anymore, but we can’t hear you.
Scott Hirleman (05:34):Yeah, your computer is muting you but not your Zoom. While Paul’s figuring that out, Andrea, do you want to go? We still can’t hear you, Paul.
Andrea Gioia (05:52):Sorry. I think that if you define a data contract, the first thing is to have a specification that describes what you want to put into your contract, but the specification alone is not enough, because it makes no sense to specify a data contract if there is not a platform that is able to enforce and monitor that contract in some way. So you cannot ask the data product team to provide information in the form of a data contract and then be unable to enforce the data contract, both at deployment time (a static check on the contract when you deploy your data product, working as a quality gateway) and at runtime (when the product is running in production, monitoring that the contract is still respected). It’s a very important thing. So a contract without enforcement is not a contract, it’s just a piece of paper, let me say. So for me, having a platform and asking the data product team to enter information in the contract that I am able to enforce, and maybe using the platform to automate some kind of process, for example the lifecycle of the data product, is very important. Otherwise it is just a piece of documentation.
Scott Hirleman (07:11):Who is putting in that right? Because there’s consumer driven and there’s producer driven. So what are your views on that consumer driven versus producer driven? Andrea?
Andrea Gioia (07:23):My point of view is that the data contract is basically an agreement between the producer and the consumer. But if someone needs to have the last word on what to put in or how to define and shape the contract, it should be the producer, because the producer is the owner of the contract and so is accountable for making sure the contract is respected. And so at the end, if a decision has to be made on what to put in the contract or on how to form the contract, it is the producer that should have the last word in the discussion.
Scott Hirleman (08:03):I like that because talking about it’s collaboration, but there is somebody who has the final say but that there is collaboration, that communication. So JGP, I know you’ve got your hands up, but Paul attempted earlier, so I wanted to give Paul some space too.
Speaker 4 (08:18):Can you hear me now? Yeah. Okay, perfect. No idea What was fine? Alright. So one of the things that we’ve recently started doing, which I think is kind of interesting and it all fits into what JGP was saying and what Andrea was just saying about collaboration, and that I haven’t really seen before, is in our data contract we’re adding basically a way to put in semantic meaning. When I say that, the semantic meaning can really be anything. It could just be a random textual string where somebody says this means blah, blah, blah, blah, blah, whatever it is. So what’s happening is in the output ports, at a table or at a particular column, it might say the range for this is 1 to 5, and that’s just the semantic meaning for that particular one. Then in the input port of another data product that might be using that, somebody will say, okay, I’m actually making a dependency on that column and depending on the range of one to five, and they’ll put that into the input port.
(09:29):So now what happens is when a deployment happens to data product A, that first data product, they can actually check all of the dependent data products that have built off of that. And if you’re changing it and now it’s one to 10, it can say, hey, you’re going to break this other consumer. So now it not only has a textual description of what the semantic meaning is, but it also has tooling in there so that if you’re going to break some downstream data product, it’ll notify you and you can figure out what to do from there.
Scott Hirleman (10:00):Is it both ways though, where a 10 would break it because it was one to five, but if you were at one to 10 and now it’s going to one to five, that’s the silent breakage that a lot of people don’t detect. Are you seeing that or not? Great Expectations doesn’t do that well.
Speaker 4 (10:16):That’s why I say the string that you put in for whatever the semantic meaning is has no machine capabilities of actually doing anything. You can just type in whatever you want. And when I take a dependency on that, I’m basically taking that same string and putting it into the input port of the second data product. So if I change that, so if it said one to 10 and now it’s one to five, that would be a change. So the things wouldn’t match, like the semantic meaning wouldn’t match any longer, and it would say, hey, you’re going to break this person, this downstream consumer, this downstream data product,
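A minimal sketch of the string-matching check described above, assuming a hypothetical in-memory representation of ports. The function name `find_broken_consumers`, the column names, and the semantic strings are illustrative and do not come from any tool mentioned in the discussion.

```python
# Hypothetical sketch: semantic meanings are opaque strings, compared for equality only.

def find_broken_consumers(output_semantics, input_dependencies):
    """Return (consumer, column) pairs whose recorded semantic string
    no longer matches what the producer's output port declares."""
    broken = []
    for consumer, deps in input_dependencies.items():
        for column, expected in deps.items():
            if output_semantics.get(column) != expected:
                broken.append((consumer, column))
    return broken


# Producer's output port after a proposed change: "1 to 5" became "1 to 10".
proposed_output = {"rating": "range is 1 to 10", "country": "ISO 3166 alpha-2 code"}

# Each downstream input port copied the string it took a dependency on.
consumers = {
    "product_b": {"rating": "range is 1 to 5"},
    "product_c": {"country": "ISO 3166 alpha-2 code"},
}

broken = find_broken_consumers(proposed_output, consumers)
print(broken)
```

Because the comparison is plain string equality, a narrowing change (from "1 to 10" to "1 to 5") is flagged in exactly the same way as a widening one, which is the behaviour the speaker describes.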
Scott Hirleman (10:52):Which it’s not the most sophisticated way of doing it, but it’s also one of those very simple, this is just going to break. And so it’s like, hey, we have a much better chance of catching this than trying to do all of this sophisticated math versus just, hey, this thing changed, therefore. Yeah, I like that a lot.
Speaker 4 (11:11):The other piece of what we’re doing, which kind of fits in this whole way of allowing the first data product to change, is in the input port of the downstream products: it’s declaring dependencies on, okay, I’m depending on this particular SLA that’s defined in the first contract in these three fields. So if there’s 20 fields in the first contract and somebody changes 17 of the other fields, but they don’t touch the three that I’m depending on, then the first data product can continue to evolve without affecting the downstream data product.
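The field-level dependency idea can be sketched as a small impact check. This is an illustration under assumptions: the schema is modelled as a plain mapping of field name to type, and `impacted_consumers` and all field and consumer names are hypothetical.

```python
def impacted_consumers(old_fields, new_fields, dependencies):
    """Map each consumer to the changed fields it actually depends on.

    A field counts as changed if it was added, removed, or its type differs.
    Consumers that depend only on untouched fields are left out.
    """
    all_names = old_fields.keys() | new_fields.keys()
    changed = {n for n in all_names if old_fields.get(n) != new_fields.get(n)}
    result = {}
    for consumer, fields in dependencies.items():
        hit = sorted(changed & set(fields))
        if hit:
            result[consumer] = hit
    return result


old = {"id": "string", "amount": "decimal", "status": "string", "note": "string"}
new = {"id": "string", "amount": "decimal", "status": "integer"}  # status retyped, note dropped

# Each downstream input port lists only the fields it depends on.
deps = {"billing": ["id", "amount"], "reporting": ["id", "status"]}

impacted = impacted_consumers(old, new, deps)
print(impacted)
```

Here the producer changed two fields, but only the consumer that declared a dependency on one of them is reported, so the producer can keep evolving the rest of the contract.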
Scott Hirleman (11:47):You’re the first person that’s implemented that idea that I’ve had for a long time. So I’m really excited about that. I love that
Speaker 4 (11:53):It’s not a hundred percent implemented yet, but that’s what I’m building on. Awesome. Pieces of it are starting and it’s coming together. It will be over the next weeks.
Scott Hirleman (12:03):JGP, you’ve been very, very patient, which
Jean-Georges Perrin (12:06):For once. A few comments. The thing is I want to stay focused on where we’re going. So I know that a lot of people use data contracts in the context of data products or even this weird word of a data mesh. But the thing is, for me, a data contract is not completely linked to that. So the thing is, Paul, it’s great you’re mentioning that and the role of what it does, and I know you’re seeing the contract and the data product, but what I wanted to focus on much more today is what you really put inside the contract. I’m not focusing too much on the usage of it,
Scott Hirleman (12:53):What do you mean by put inside? Do you mean what are the actual things that it’s contracted around or do you mean the schema and the things like that? Or
Jean-Georges Perrin (13:06):Well, it could be the schema, and is it going to be one schema? Is it going to be multiple versions of a schema? Is it going to be security? Are you going to put an SLA in it or do you put data quality rules, et cetera. So that’s really where I’d like to bring the topic. And I wanted to go back to something that Andrea said as well, and it’s okay to not agree, because for me the data contract is a document. So whether you print it and it’s becoming a piece of paper or whether it’s going to be a file or whatever. But the thing is, for me at least, it’s not the role of the data contract to enforce itself. It’s the role of the tooling that uses the data contract. So you can have common tooling that can say, okay, hey, this is a library that consumes the data contract and then it gives you the threshold that you were saying, like, oh, it’s not between one and five, it’s now between five and 10, blah blah blah, all these rules.
(14:06):But the thing is it’s not the contract itself, it’s a file. It’s not dynamic in a way. And regarding the owner, for me at least, it’s starting to be a little bit of a challenge to know who’s really the owner in the context. And then I’m going back to what Paul was saying: when you’re using a data contract within the scope of a data product, then it makes more sense to me that the owner is the data product owner because he owns this role as well. And it’s not only a question of the producer because the thing is, and you said that many times in your podcast, Scott, it’s really difficult to motivate the owner, the producer of the data, to do anything, because finding the incentive to make them do something is really difficult. Hey, I’m already managing this Workday thing, I don’t want to, on top of it, be responsible for the data products which are source aligned to my Workday, blah blah blah blah blah blah. So the thing is if we ask them to be the owner of the data contract as well on top of that, or the data product, then I think this is going to be, okay, it’s going to be a no-no. But that’s a feeling I have there.
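To illustrate the separation described in this turn (a passive contract document, with separate tooling that reads it and applies the rules), here is a minimal sketch. The contract layout, the `check_rows` helper, and the `rating` column are assumptions made for the example; JSON is used only to stay within the Python standard library.

```python
import json

# The contract is just text: it declares rules but does nothing by itself.
CONTRACT_TEXT = """
{
  "dataset": "reviews",
  "columns": {
    "rating": {"min": 1, "max": 5}
  }
}
"""


def check_rows(contract, rows):
    """Tooling side: read the rules from the contract and report violations."""
    violations = []
    for index, row in enumerate(rows):
        for name, rule in contract["columns"].items():
            value = row.get(name)
            if value is None:
                violations.append(f"row {index}: missing {name}")
            elif not rule["min"] <= value <= rule["max"]:
                violations.append(
                    f"row {index}: {name}={value} outside {rule['min']}..{rule['max']}"
                )
    return violations


contract = json.loads(CONTRACT_TEXT)
violations = check_rows(contract, [{"rating": 3}, {"rating": 7}])
print(violations)
```

Swapping the checker for a stricter or more tolerant one changes the enforcement behaviour without touching the contract text, which is the distinction being drawn between the contract and the tooling.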
Scott Hirleman (15:35):So in your view, is the data contract simply a file? Is it like a YAML file that then is enforced by tooling or what are you actually, so you do think it’s just maybe even just you got a little simple UI that you just drop some stuff in. Is that what you’re thinking? Because then that makes it simple for folks. And one other thing would be is there a data contract for the data product or is there a data contract per consumer? So it might be different things for each consumer.
Jean-Georges Perrin (16:13):Andrea answer Scott, I’m running
Scott Hirleman (16:16):Away.
Andrea Gioia (16:18):The data contract is associated with the data product; there is not one data contract for each consumer, because consumers can come and go. So basically you start to define your contract, you take the accountability to respect that contract, and if the contract is okay for the consumer, the consumer can attach their input port to your output port and start consuming. Negotiating a contract for each consumer creates a lot of dependencies in the mesh. It’s very complex to make the contract evolve. So I want to have only one version of the contract for the data product and manage the lifecycle of that data contract, not have 10 data contracts because I have 10 consumers. It becomes complex to negotiate them and especially to evolve them over time
Scott Hirleman (17:11):Or maybe via those input ports. You have the consumers say, this is what I care about relative to this contract. So exactly what Paul said of if something changes on that output port or if something is going to change, the ones that don’t have to care don’t care.
Andrea Gioia (17:29):That’s a great idea. That’s a great idea. That’s a great idea because it allows you to do impact analysis. So when I have to change my data contract, I know that I have 10 consumers, but maybe five of them are not impacted because the change is not impacting them. So this is a great idea, and it allows you to do impact analysis before the change.
Scott Hirleman (17:49):Changing that testing analysis would be really helpful to say, is this going to break things? So Jay, you had your hand up for a bit. Did everything get answered or
Speaker 5 (18:02):No? I think yeah, he covered it well actually, Andrea, so I believe in the same that there have to be multiple contracts per consumer because everybody’s needs are different. In fact, they can sometimes have contradicting rules. And this is where somebody mentioned previously who owns the contract and who kind of governs the conflicts. And I believe that should be the data stewards, who are the governance body of the company. I mean Jean knows this, Jean can talk about that at great length. But I think one of the confusing points here, the contradicting point we can talk about, is because of the environment we operate in. So if your scale is different, you will have different challenges. You may not even have a governance body in the organization. In that case, yeah, the producer always is the owner, but usually the consumer is the one demanding the rules. So it really depends on how I do it, but I’m loving the conversation the way it’s going. So just enjoying that. Thank you.
Scott Hirleman (19:11):Yeah. Andre, you’ve had your hand up for a while.
Speaker 6 (19:14):Yeah, thank you. Actually it’s another hand so it’s not for a while. Yeah, I have a little story on that. We actually have quite a peculiar implementation of data contracts. In fact, we have only one team using data contracts right now in our company. And these data contracts, they’re very special. I think they’re different from what is usually understood by data contract. First, they’re driven by data producers, and second, data producers make a contract with consumers on business terms. Like, consumers ask them to provide, I dunno, some information about a lead, for example, for marketing. And producers say, okay, I don’t know what that means. I dunno what a lead is. So they describe a lead in formal terms. So they describe every attribute that a lead has to have. So it’s not about a table, it’s not about a data product, it’s some object. And one table can have a lot of different objects. And for example, we have this lead class described and then they have property testing on it. So they describe every attribute of a lead and they run property tests that in fact generate thousands of unit tests for this class, and they see what a lead can be. They see all the corner cases at the borders of that contract, the contract that they were given by consumers or by management or by whatever.
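The speaker does not say which tool the team uses, so the following is only a rough sketch of the idea: attribute rules for a business object drive the generation of many test cases, with an emphasis on boundary values. Everything here (the `LEAD_SPEC` rules, `generate_lead`, the `priority` function under test) is hypothetical, and the standard `random` module stands in for a dedicated property-testing library.

```python
import random

# Hypothetical attribute descriptions for a "lead" business object.
LEAD_SPEC = {
    "score": {"min": 0, "max": 100},
    "source": {"allowed": ["web", "event", "referral"]},
}


def generate_lead(rng):
    """Draw a lead that satisfies the spec, favouring boundary values."""
    spec = LEAD_SPEC["score"]
    score = rng.choice([spec["min"], spec["max"], rng.randint(spec["min"], spec["max"])])
    source = rng.choice(LEAD_SPEC["source"]["allowed"])
    return {"score": score, "source": source}


def priority(lead):
    """Code under test: a consumer-side function that must handle every valid lead."""
    if lead["score"] >= 80:
        return "hot"
    if lead["score"] >= 40:
        return "warm"
    return "cold"


def check_property(cases=1000, seed=42):
    """Property: every lead allowed by the spec maps to a known priority."""
    rng = random.Random(seed)
    failures = []
    for _ in range(cases):
        lead = generate_lead(rng)
        if priority(lead) not in {"hot", "warm", "cold"}:
            failures.append(lead)
    return failures


failures = check_property()
print(len(failures))
```

The point of the sketch is the shape of the workflow: the formal description of the object is the input, and the generated cases explore the corner cases at the edges of the allowed ranges.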
Scott Hirleman (21:25):Interesting.
Speaker 6 (21:25):That’s a real story.
Scott Hirleman (21:28):And have you seen that that’s working well or is that you’re still kind of testing it out or what’s going on with that?
Speaker 6 (21:39):Actually? Yeah, it’s working well for us, but I believe that it’s driven by different factors. And first of all by the current organizational situation, because in marketing they have a lot of management directors and different departments which want to use data which is actually provided by a central team, which is not data mesh at all. So this team, they don’t know about data mesh but they have to. And when this thing started with these contracts, they really started to formulate what business objects they need to deliver, say not lead, let’s say opportunity. So what is an opportunity? What attributes should it have? What is it actually? Or, I dunno, user: who is the user, what are the user’s attributes.
Scott Hirleman (22:58):Yeah. And I think one little thing that was hidden in there that you talked about, which is when people want to use data, the whole point of a data contract is to drive trust. How does that work? So Yulia, you’ve had your hand up for a while.
Speaker 7 (23:16):Yeah, thank you. Hi everyone. I just want to reflect on our discussion for the past 20 minutes. And it sounds really complex because data is complex eventually, right? We are talking about the possibility of having multiple contracts for one single data product, and that could be all around use cases, business use cases, business requirements, team requirements. And also we can have different forms, like JGP suggested, if you can put it just on paper, have it there, start from there. And I was wondering, given this complexity and flexibility that teams require on that, do you guys actually see any tool in the future that can provide these data contracts at scale to secure all the use cases?
Scott Hirleman (24:19):Yeah. Or is there a blanket of multiple things, which is kind of, I mean, you’re in the observability space. We’re seeing people want a single observability tool and nothing can give you overarching observability of absolutely everything. It’s the same thing on engineering observability. Your Datadog doesn’t solve every aspect of observability. So I think it’s a good question. I’d love to hear, let’s jump to Salman. He’s had his hand up but I want other people to be answering that as well as we’re going forward. I think it’s a really good question of are we asking for too much for the kind of God tooling, right? The single one to rule them all, the single rule,
Speaker 7 (25:01):At least one that will have covered at least 50% of the use cases we just talked through. Yeah, good point.
Scott Hirleman (25:08):So Salman, you’ve had your hand
Jean-Georges Perrin (25:10):Because all the use cases was really wow,
Speaker 7 (25:15):They were different. In fact very much different.
Jean-Georges Perrin (25:19):I’m just
Scott Hirleman (25:20):Kidding. Well he’s just saying you just took it from a hundred percent to 50% so he’s feeling a little bit more comfortable that we can get to 50 on one thing. But Salman, you’ve had your hand up for a while.
Speaker 8 (25:28):Yeah, hi everyone, it’s a really great discussion. Thank you for arranging that. So I have a question as well as a comment on the last question. We have been trying to figure data contracts out for some time now, I would say a couple of months, and the information available on the internet is overwhelming. Like scoping it exactly and tailoring it to the organization’s needs, that’s kind of a challenge, because looking at this data mesh data product kind of approach, we are basically a decentralized organization and we have to convince other teams that this is something that is really useful. So this is indeed a challenge and that’s a great point. If we have a central tool or an easy tool where the integration can happen, that would be great. That’s a great idea. My question is basically about the enforcement of this. So I’ll just try to quickly summarize this.
(26:35):So basically we have a real-time framework and we use proto for defining the schemas. So the ingestion is real time into the databases. And I see some folks, there are articles that use even Protobuf in itself to define some validation rules. But some folks use it as a service, like ingest that in a schema registry or somewhere into a database and then use it as a service. So what is the right approach? I’m a bit confused there. Should this be some kind of centralized service, or just keeping it in a file, as was mentioned earlier, and committing it to Git? Basically the code generates on top of these proto files, so is this good enough? What is the approach that is usually followed in the industry or used as a practice?
Scott Hirleman (27:33):And I think the one thing that I would say is what I’m finding for driving adoption is just make it as easy as possible for folks to get started. Even if it’s not perfect. The fact that we can give tooling to software engineers so they can do this stuff is the thing that’s going to get them to do it if we make it overly complex. But Matias, you’ve had your hand up for a while. I would love to hear more from what’s going on in your head.
Speaker 9 (28:00):So I have some thoughts on a few things that were discussed. So the first thing is a lot of the data contract discussion is about obligations, commitments from the data product owner. What is that person or team going to deliver to data consumers? What I’ve seen with some companies that I’ve been working with is that it can also be about the obligations that the data product owner wants to put on the data consumer, because the data producer may want to enforce, okay, I give you access to my data product if you agree to use it for only a specific purpose, not for another purpose. It can be important for GDPR. There may be things like you may use that data product for one year or maybe for three months and then you need to ask me again, or you can use my data product but you need to confirm you’re not going to share it with other people outside of your organization, and so on.
(29:08):So there’s that other side to data contracts. In my opinion, it can be bi-directional, right? And maybe it can be together in one data contract for the same data product, maybe it could be two separate ones. That’s then a part of an implementation, but it’s commitments on the usage. And yes, there is a legal aspect to that, not just what’s the content and the quality of the data product itself. That’s right. The second thing is I think Yulia asked about tools. As far as I can see, and I’m working with various tool vendors and data governance vendors and so on, there’s probably no perfect tool at the moment that can cover probably not even 50% of the use cases that we are talking about here. Keeping it simple, Scott. Absolutely, that is key. I’ve seen some companies exploring workflow engines or data catalog tools that have customizable workflows to kind of mimic some data contract capabilities. Is that the perfect answer? I don’t know yet, but that’s what I have seen. And the last point is on the person who spoke before me: enforcement is really interesting. There’s probably no automatic enforcement, but at least a company needs to have a clearly defined procedure. What is going to happen if a data contract is breached, if it’s not fulfilled? If that is not defined, why have the contract to begin with?
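The consumer-side obligations mentioned in this turn (purpose limitation, a time limit, no onward sharing) could be captured as structured terms and checked when a consumer asks for access. The sketch below is an assumption-heavy illustration: `UsageTerms`, `AccessRequest`, `review_request`, and all values are made up for the example, and a real process would involve legal review rather than a function call.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class UsageTerms:
    """Obligations the producer places on consumers (hypothetical shape)."""
    allowed_purposes: frozenset
    max_duration_days: int
    sharing_allowed: bool


@dataclass
class AccessRequest:
    purpose: str
    start: date
    end: date
    shares_externally: bool


def review_request(terms, request):
    """Return the reasons a request needs a conversation with the producer."""
    issues = []
    if request.purpose not in terms.allowed_purposes:
        issues.append("purpose not covered")
    if (request.end - request.start).days > terms.max_duration_days:
        issues.append("duration exceeds limit")
    if request.shares_externally and not terms.sharing_allowed:
        issues.append("external sharing not permitted")
    return issues


terms = UsageTerms(frozenset({"fraud_detection"}), max_duration_days=90, sharing_allowed=False)
request = AccessRequest("marketing", date(2023, 9, 1), date(2024, 9, 1), shares_externally=False)

issues = review_request(terms, request)
print(issues)
```

An empty list would mean the request fits the declared terms; a non-empty list is a prompt for the defined procedure, not an automatic rejection.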
Scott Hirleman (30:53):And that bi-directional use, Eric Broda has talked about that a few different times, especially important in financial services. And Andre just did something, he kind of scooped what I was going to talk about: I actually heard somebody talking about blockchain with NFTs and things like that, and smart contracts, and I’m not an NFT guy at all. I think blockchain and all that stuff is kind of goofy, but they were talking about that the systems themselves, like the original person who sold an NFT or whatever, when that NFT gets sold on additionally they get a percent of the revenue. So you’re able to track that usage through the system. And if we can have that same kind of context and that you do have that exact thing of where it is encapsulated about what it can be used for downstream and that somebody is like, I want to put this into my data product, that’s a different conversation and that’s a violation of the contract.
(31:56):And that violation doesn’t necessarily mean it’s bad, it’s just it generates a conversation. It generates a can I do this? Can I exfiltrate this? I am giving you the place where you can deal with this data. Can you exfiltrate it? You need to come back to me. We’ve got to have that legal conversation. And if you go outside of that, it’s on you. It’s not on me. This was something Sarita Bakst talked about when I talked with her at JP Morgan Chase, that the consumer takes on the risk once they make an agreement, they have taken on that risk. So yeah, I think there’s a lot of really, really good things you said in there. Jay, you’ve had your hand. Sorry, I’m going off, Jay, it’s pretty usual. I’m talking too much.
Speaker 5 (32:41):No worries, no worries. I wanted to answer Salman’s question actually, and I completely agree that there is a lot of misinformation about data contracts and it is like everybody has their own definition. So definitely there is a lot of misinformation about that. And second is he asked about whether there should be a Protobuf-specific contract or not. I think the contract should be technology agnostic. It should not be tied to anything specific to that technology. It cannot leverage a specificity of how Protobuf defines, let’s say, a data type versus how the human contract would have that: okay, I want numbers and it should be a floating point. Something like that. So just wanted to quickly answer on that.
Scott Hirleman (33:30):I think that’s a great color. And then Josh, you’ve had your, how the hell do I say your name? I get it wrong. I think every single time.
Jochen Christ (33:38):Yeah, well actual pronunciation is Jochen (Yo-hen).
Scott Hirleman (33:41):Jochen. Okay.
Jochen Christ (33:42):Which might be pretty hard for a non-German speaker. Yeah, but everything’s fine. So no, my point is I think we are talking about data contracts a lot because we are starting more and more to exchange data between different teams or different organizational units. So this is something that has started with the data mesh approach, shifting away from one single data organization to different teams, different ownerships. And my point with data contracts is that, if we see them as a document or a specification, they are a great tool for collaboration, to bring people together to talk about their data exchange format. So when you go into a YAML and you look at the definition of columns and the naming of the column and the actual type of the column, you start a lot of discussions. Also when you talk about the semantics, if you say, what is this article ID or this customer ID? What is it if a customer is just a guest customer, is it still a customer? If you go very specific and formalize the details, I think data contracts are really, really a good tool to bring people or teams together in talking about their actual requirements even before the first data product is implemented, right? Yeah.
Scott Hirleman (35:24):Andrew Jones has talked about this too. I’ve talked about a data sharing agreement where you have something that’s wrapped around the contract. The contract is at the computer to computer level, it’s at that machine-to-machine level. So you have to have that enforcement capability, you have to have that stuff at the machine level, but you need to have that conversation to actually share context with each other. That’s the thing, just why can’t we get people to talk to each other? I don’t get it. So JGP, you never have to raise your hand in these things, but go ahead.
Jean-Georges Perrin (36:00):I know I don’t have to raise my hands, but the thing is if I want to say something then I’ve got to raise my hand anyway. I like the discussion, but I still don’t know what I’m going to put in my data contract. That’s maybe a detail, but it was kind of the focus of today’s session details,
Scott Hirleman (36:22):The where do you think data contracts. Oh okay. I was going to ask a question, but Paul and Yulia both jumped up so you can rock paper scissors for who jumps first. I don’t know,
Speaker 7 (36:37):I just have a short comment. I just saw a team of nine people, just data engineers, debating for two months what kind of tools they should use for data contracts. Not even starting a conversation with production teams. That’s it.
Scott Hirleman (36:55):Names Paul. Oh Paul, we can’t hear you again. I don’t know what’s going on. It was the same thing when Andrea raised his hand last time. So Andrea, do you want to go while Paul’s figuring out his computer? Maybe it’s just when Andrea raises his hand, it causes Paul’s audio to cut.
Andrea Gioia (37:20):Okay. Okay, no problem. Just to start to answer the question of Jean-Georges, what to put into the data contract. My idea is to start with a minimum viable data contract. So to put inside the data contract the minimum information that you need to create the discussion between consumer and producer and to do basic enforcing. Usually I divide the information within the data contract into different blocks: the information related to the promises, so what the data product owner promises to its consumers, and to the expectations, so what it expects from its consumers. And for me the most important information to put in at the beginning of the definition of a data contract is for sure the owner of the contract, and the policy that you will follow when the contract is changing. So for example, how long in advance you communicate to your consumers that you are going to change the contract, and whether there is a grace period in which you keep the old interface, the old contract, online using a deprecated output port, for example.
(38:40):This is important information. And then usually we put in, as I think everybody does, at least the schema of the data that is exposed. So to know the structure, the columns, the type of each column, and maybe some constraints over this schema. This is the very basic information to start with. And then you can put in basically everything, because the contract is a service agreement. So you can put in service-level agreements, security constraints, and the definition of the API used to access the data. You can put in whatever you want: all the information that describes the interfaces and the agreement over the service used to expose the data can be put inside the data contract. But in general, I suggest starting with a minimum viable data contract, not starting too big, and starting to use it. So it’s better to start with something that is sustainable in the organization and then grow the information that you put in the contract as your culture around data contracts improves, and as the platform that does the enforcing becomes able to use the information that you’re going to put into the data contract.
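As an illustration of the minimum viable data contract Andrea describes, here is a small Python sketch. It is not any participant’s actual format and not a standard: the class names (`DataContract`, `ChangePolicy`, `Column`), the field names, the example owner address, and the 30-day notice and 60-day grace values are all assumptions made up for the example. It only shows the three pieces named in the discussion (an owner, a change policy, and a schema) together with very basic enforcement.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Column:
    name: str
    dtype: type            # expected Python type of each value (hypothetical field)
    required: bool = True


@dataclass
class ChangePolicy:
    notice_days: int       # how long in advance consumers are told about a change
    grace_days: int        # how long the old interface stays online after the change


@dataclass
class DataContract:
    owner: str
    version: str
    change_policy: ChangePolicy
    schema: list[Column] = field(default_factory=list)

    def violations(self, row: dict) -> list[str]:
        """Return human-readable schema violations for one row (empty if none)."""
        problems = []
        for col in self.schema:
            value = row.get(col.name)
            if value is None:
                if col.required:
                    problems.append(f"missing required column '{col.name}'")
            elif not isinstance(value, col.dtype):
                problems.append(
                    f"column '{col.name}' expected {col.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
        return problems

    def old_version_retirement(self, announced: date) -> date:
        """Earliest date the deprecated interface may be switched off."""
        policy = self.change_policy
        return announced + timedelta(days=policy.notice_days + policy.grace_days)


# Hypothetical contract for an "orders" data product.
contract = DataContract(
    owner="orders-team@example.com",
    version="1.0.0",
    change_policy=ChangePolicy(notice_days=30, grace_days=60),
    schema=[
        Column("customer_id", str),
        Column("order_total", float),
        Column("coupon_code", str, required=False),
    ],
)

good_row = {"customer_id": "c-42", "order_total": 19.99}
bad_row = {"customer_id": 42}

print(contract.violations(good_row))   # no violations expected
print(contract.violations(bad_row))    # wrong type plus a missing column
print(contract.old_version_retirement(date(2023, 9, 7)))
```

Starting this small leaves room to grow: service-level agreements, security constraints, or API definitions could later be added as further fields once the platform is able to enforce them.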
Jean-Georges Perrin (39:59):Eric?
Speaker 11 (40:00):Yeah, just a few quick questions. I suppose there are some things like OpenAPI specifications and JSON Schema that are relatively rich and robust, and they have things like versioning capability, which I think are fundamental to data contracts. So the first question is I wanted to get the group’s perspective on how they’re using those, if at all. The second question is, we talk about metadata and providing some of that in a data contract, whether it’s table structures or otherwise, but a lot of data, at least in my world, is not just in databases. So it’s about being able to communicate things in a fashion that would be amenable to a table structure when the stuff is in a file, and some of the stuff may be hierarchical in a file as opposed to flat or table oriented. So the second question is how is the group dealing with the vast myriad of different types of structures that you want to put contracts on, and perhaps even images are in some of the stuff that’s going back and forth in communication. So how do you deal with the variety of different formats when you think about a data contract? So I’ll pause there and listen.
Jean-Georges Perrin (41:25):Matt? Yes, I think it’s you and
Speaker 9 (41:28):Okay. Yeah, so just very briefly on your first question, Eric. Yes, absolutely. I’ve seen a couple of companies use JSON Schema as their first approach to data contracts. So yes, that is being used. They’re then storing this in a version control system like GitHub, and that is a key piece of what they are starting with. Does that cover all of the use cases? No, but yes, it’s being used. On the second part, of course, the more diverse your data product, the harder it is to have suitable data contracts, especially when you talk about unstructured data, videos and stuff. Then it depends on what you want to cover. Then maybe it’s about metadata, about the video, about the image. And that again could be in a JSON document, right? If it’s about, I don’t know, the resolution of the image, et cetera, maybe that’s data that is attached to this unstructured format, which then might as well be in a semi-structured format.
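To make the idea of describing an image through a semi-structured metadata document concrete, here is a hedged Python sketch. The schema title, the property names (`uri`, `width_px`, `height_px`, `format`), and the `check_metadata` helper are assumptions invented for this example. The helper is not a JSON Schema validator: it only understands the `required` keyword and two `type` values, which is enough to show the shape of the approach. A real setup would more likely use an existing JSON Schema library.

```python
import json

# A JSON Schema style description of the metadata attached to an image.
# All property names here are hypothetical.
IMAGE_METADATA_SCHEMA = json.loads("""
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "ImageMetadata",
  "type": "object",
  "required": ["uri", "width_px", "height_px"],
  "properties": {
    "uri": {"type": "string"},
    "width_px": {"type": "integer"},
    "height_px": {"type": "integer"},
    "format": {"type": "string"}
  }
}
""")

# Only the two JSON types used above are mapped to Python types.
_JSON_TYPES = {"string": str, "integer": int}


def check_metadata(doc: dict, schema: dict = IMAGE_METADATA_SCHEMA) -> list[str]:
    """Check a metadata document against the tiny schema subset handled here."""
    errors = []
    for name in schema["required"]:
        if name not in doc:
            errors.append(f"missing: {name}")
    for name, rule in schema["properties"].items():
        if name not in doc:
            continue
        value = doc[name]
        expected = _JSON_TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it for integers explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type: {name}")
    return errors


ok_doc = {"uri": "s3://bucket/cat.png", "width_px": 1920, "height_px": 1080}
bad_doc = {"uri": "s3://bucket/cat.png", "width_px": "1920"}

print(check_metadata(ok_doc))
print(check_metadata(bad_doc))
```

Because the schema is plain JSON, it can be stored and versioned in a version control system, as described above, independently of the images themselves.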
Speaker 11 (42:37):Perfect. Thank you Matthias.
Jean-Georges Perrin (42:38):Sure.
Scott Hirleman (42:43):J G P, did you want to wrap up? We’re kind of hitting that 45 minutes where we typically wrap up, and Yulia asked if we should all have some homework. Is there anything that you want to give everybody as homework?
Jean-Georges Perrin (42:56):Well, I would like everybody to have a look, and I’m going to share that, at the open source data contract, if you haven’t looked at it yet, that I’ve been working on, not only myself, but quite a few people. And it would be great to have some feedback and of course a few likes on the GitHub repository. Always good. But I think it was an interesting discussion. I still don’t know exactly what everybody puts in. I think it’s only Andrea that actually answered my question, but it is good. It’s a story. Okay. I think we also had a bit of discussion about what the format of that is, because I’ve seen things in Excel, I’ve seen things in Word, and I think JSON or YAML seems to be more usable. But that’s a good start. Why I’m asking all this is really that I think everybody here is interested in the outcome of the data contract. What’s going to be the data contract, how are we going to use it? But the thing is really, what are the needs? For example, my needs as a user are probably different from Yulia’s needs from a tool vendor perspective. And that’s what I’m kind of interested in, or Matthias’s as well, because should it be an open standard or should it be owned by Databricks? Okay, I’m just provoking Matthias here.
(44:37):So those are the kind of questions that we also need to answer as a collective group, because we have brain power that we can put towards that and create something that is actually beneficial to the industry, and be ambitious about it. And Yulia has some tools as well around data contracts and is humble enough to not mention them. So that’s a little bit where the discussion can go. And I’ll share a few links with the audience of this group. I don’t know if we’re all friends on LinkedIn. We can also be friends on Facebook, but it’s really not as interesting, and on LinkedIn we can continue this discussion a little bit as well. But I think there’s always a lot on and we’re already over time, so I think we’ll have to continue all this discussion. And as you see, my background is different this week. It’s going to be different next week again because I don’t like virtual backgrounds, and it’s going to be different the week after, where Scott and I are going to be in
Scott Hirleman (45:52):London. Big Data London?
Jean-Georges Perrin (45:54):Yes. Okay. Yeah. So we’re going to do a live event in London. Okay. A scoop there. Okay.
Scott Hirleman (46:02):We have to figure out what time that is, just because I don’t know that I want to do that at the end of day two at 6:30, but we’ll figure that out. We’ll let people know. We’ll make it work. We’ll make it
Jean-Georges Perrin (46:11):Work.
Scott Hirleman (46:11):Awesome.
Jean-Georges Perrin (46:12):Only five in London anyway. Okay.
Scott Hirleman (46:16):And for anybody watching this later, we’ll have some links in the show notes as well on YouTube.
Jean-Georges Perrin (46:23):Alright, thank you everybody.
Scott Hirleman (46:25):Thanks next week.
Jean-Georges Perrin (46:26):Thanks.