Implementing Data Mesh at PayPal

Jun 22, 2023 by Melissa Logan

A summary of the discussion during our Data Mesh Learning community panel held in April

In April, the Data Mesh Learning community hosted a panel discussion to explore how PayPal implemented a data mesh architecture. The conversation covered best practices, challenges, and insights to help any data mesh practitioner better implement their organization’s data products and data mesh.

The Data Mesh Learning Community hosts bimonthly virtual meetups to discuss data mesh topics.


Panelists:

Laveena Kewlani, System Architect, PayPal
Kruthika Potlapally, Lead Software Engineer, Clairvoyant AI/PayPal
Jean-Georges Perrin, Intelligence Platform Lead, PayPal

Moderator:

Eric Broda, President, Broda Group Software Inc.

Watch the Replay

Read the Transcript

Download the PDF or scroll to the bottom of this post.

Ways to Participate

To catch an upcoming event, check out our Meetup page.

Let us know if you want to share a case study or use case with the community.

Data Mesh Learning community resources

Engage with us on Slack
Organize a local meetup
Attend an upcoming event
Join an end-user roundtable
Help us showcase data mesh end-user journeys
Sign up for our newsletter
Become a community sponsor

You can follow the conversation on LinkedIn and Twitter.

Transcript

Eric Broda (00:00:00):
Welcome to the Data Mesh Learning Community’s panel on implementing data mesh at PayPal. Before we get started, let me set the stage for today’s context. Clearly, we’re all here for a reason. We know that data is the fuel of the modern enterprise. And in financial services, and in particular at PayPal, data truly does power the business. It’s a foundation that enables better business decisions, improves operations, creates innovations, and delivers outstanding customer outcomes. As we all know, data mesh is a modern approach that helps us drive these outcomes more effectively and efficiently, and it advocates for several principles. They’re well documented, but I’ll go through them at a very high level ’cause I’m sure we’re gonna talk about these during the discussion. First off, there’s domain ownership, where data has clear boundaries, with empowered and knowledgeable data product owners.

(00:00:52):
Second, data is treated as a product, making it discoverable, understandable, trustworthy, and interoperable. Third, data is self-serve, where platforms are available to make the data easy to find, consume, and share, all with minimal manual intervention. Last but not least, there’s federated governance, which lets data product owners have the local autonomy that they need to deliver the services that they desire. Now, obviously, as I mentioned, as a data mesh community we probably know all these things very well, so I’m not gonna be talking about those. But what will we be talking about today? Today we’ll be focused on data mesh implementation details. We’ll be talking about how PayPal has implemented their data mesh, what techniques worked, and perhaps even what didn’t, I suppose. And along the way, our hope is that you, as a data mesh practitioner, gain some insights to help you better implement your organization’s data products and data mesh. Now, today, we have several members of the PayPal data mesh team with us, and they’re gonna share their insights based on real-world experiences. So if I could ask each of you to introduce yourself and tell me a little bit about what you do for a living, I suppose. So, Laveena, why don’t we start with you.

Laveena Kewlani (00:02:09):
Hi, my name is Laveena, and it’s been almost a year that I’ve been associated with PayPal, working on the data mesh project and its implementation as a solution architect. There have been a lot of challenges, learnings, and adventures, I would say, in doing so. But it was fun, and I’m looking forward to sharing my experience of the different aspects of building it, from my perspective at PayPal, with everyone here.

Eric Broda (00:02:46):
Excellent. Thank you, Laveena. Kruthika, tell us about yourself.

Kruthika Potlapally (00:02:50):
Hi, everyone. This is Kruthika. I’ve been working as a software engineer as part of Clairvoyant, and I’ve been working with PayPal over the past years on some of their projects. For the past one and a half years, I’ve been working with Laveena and JG on building a data mesh platform at PayPal. As Laveena said, a lot of learnings, a lot of experiences, and it’s been a fun journey. All of us are looking forward to sharing our experiences with all of you.

Eric Broda (00:03:25):
Excellent. Last but not least, we have Jean-Georges Perrin, or JG as they call him. Tell us about yourself, my friend.

Jean-Georges Perrin (00:03:32):
Yeah, so Jean-Georges or JG, depending on your level of French or your comfort level with it. You have no excuse, my friend, because as a Canadian you should be completely bilingual. So even if there’s data in my title, I self-identify as a software person. That’s probably why I think I found my niche a little bit with data mesh. From my accent, and probably from my name, you probably got that I’m French. I moved to the US a little less than 10 years ago now. I used to be an entrepreneur, and I joined PayPal a little over a year ago now, a little bit before Laveena. Last year, at the peak of our project, I led about five teams, 40 people, and we designed, implemented, and deployed data mesh. That’s been an exciting experiment, an exciting project, and we are pretty proud of all of that. On the side, I’m a Lifetime IBM Champion, and I love knowledge sharing. I wrote a few books, including this one. This is a Chinese translation of Data Mesh for All Ages. And I’m also a speaker at many conferences.

Eric Broda (00:04:56):
Fantastic. Thank you very much, Laveena, Kruthika, Jean-Georges. On an administrative note, I should say that each of the panel members here is talking on their own behalf and not for their employers or clients. As for me, my name is Eric Broda. I’m the president of Broda Group Software. It’s a boutique consulting company focused on accelerating financial firms’ enterprise data mesh journeys, and I have the wonderful opportunity of being the panel host and asking the great questions of this fantastic panel. So with that in mind, why don’t we get started. I’ll start with you, Kruthika. Perhaps tell us a little bit about PayPal and, more importantly, focus on the data. How important is data analytics to PayPal?

Kruthika Potlapally (00:05:45):
PayPal is a data-driven company and is, in fact, moving towards becoming a super data-driven company. Everyone knows PayPal as a product, but there are so many other business units as well, like Venmo, Xoom, Honey, Braintree, and many more, all contributing, all creating, I wouldn’t say the numbers, but large volumes of data on a daily basis. So it’s very important that we work with the data, manage our data, and keep our data secure.

Eric Broda (00:06:26):
Awesome. Now, Laveena, perhaps you can tell us a little bit: what does the PayPal analytics landscape look like? What platforms are used? How much data do you actually have, at least in the environment that you’re working in? Oh, you’re on mute, Laveena.

Laveena Kewlani (00:06:45):
Yeah. On a daily basis, there are something like billions of transactions. The visualization usually happens on Looker, Qlik, and Tableau, and in general the analytics is done in notebooks. As we mentioned, there are applications tracking these transactions from different directions, so that divides the data, whether the same data or the collective data from different sources, into different domains. To give you an estimate of how much data is rolling in on a daily basis: just imagine that each transaction creates data points, and how many terabytes of data will be collected on the basis of billions of transactions. So it’s a lot to handle, and that created a need for something that automates the data-driven approach for analyzing it and making it much more visible to the stakeholders at a domain level, which is more useful.

Eric Broda (00:07:58):
Perfect. Awesome. Now we’re gonna wind things back a bit. Jean-Georges, you’ve been there the longest. Tell us about PayPal’s data mesh journey. Where did it actually start?

Jean-Georges Perrin (00:08:17):
I thought our agreement was you would make Laveena and Kruthika talk and not me. Okay. So now I’ve got to talk as well. I’ve got to think about that again. No, but as they both said, and as anybody can guess, data is really critical to our business. We were already pretty much state of the art when I joined, but the landscape and the requirements around data are constantly evolving, right? The thing is, there are more regulations, there are more compliance requirements, there is more traceability of the data, all that. So it’s not that we were in a bad position and needed to get to a good position. It was that we were in a good position and needed to be prepared for the future, okay?

(00:09:16):
And anticipate that. So the idea, when I joined PayPal, was not to say, hey, let’s build a data mesh, okay? I didn’t say, let’s do that, and please give me billions of dollars to build it. It was more like a traditional IT or data project, okay? We analyzed what the need was, and we said, okay, well, we actually had self-service. So we didn’t need to build self-service, but we needed to keep self-service. We needed to increase our governance and build a tighter relationship with our enterprise data governance team. We needed to think a little bit differently about data, right? Like a lot of companies, we were thinking that the critical part is the pipeline and everything revolves around the pipeline. So let’s think a little bit differently about that part, maybe thinking about data as a product.

(00:10:24):
Okay. And then we also needed to limit the scope of the data projects, and we were part of a business unit, okay? So we were pretty horizontal, but we didn’t care about all the different verticals of the data. When you’re thinking about all these things, okay, it just relates to the four principles that Zhamak put forward as the data mesh. I wouldn’t say we found out the hard way, but we found out, based on what we were doing, that, hey, this thing we are building was very, very close to a data mesh. We started the project back in January. I joined in December; in January, we were all ready to start and produce lines of code, but it was only mid-March that we realized that, hey, this is a data mesh.

(00:11:20):
We are building a data mesh. Okay? And the analogy I like to keep is a bit like the human and the monkey. We’ve got 2% DNA difference or something like that. At that time, we were probably the monkey, and we saw the human and said, okay, well, maybe we should do the extra 2% to become a human. So that’s kind of what we did. We realized that mid-March, and since then, you know, you’d think, oh, wow, you switch in the middle of the project, but it actually made it easier for us, because Zhamak had already published so much that it was kind of easy. You had a roadmap by reading the book, if you understand the book. So yeah, that’s how it started.

Eric Broda (00:12:09):
Perfect. So the dates you mentioned, you’ve been at it now for a year and a bit; the March and such was last year, so there’s quite a bit of experience here. Laveena, let’s go to you next. Data mesh at its simplest is really about an ecosystem of data products. But what components are in PayPal’s data mesh ecosystem? Is there a data fabric? Are there key platforms? Tell me what the architecture looks like.

Laveena Kewlani (00:12:38):
So when it comes to the platform, we tried to keep everything as a microservice, so that one service connects to the other in order to create a quantum, which, as Zhamak mentioned, has different utilities to fulfill, like dictionary, observability, et cetera. So we curated a strategy where we take the requirement, basically serve the person who is going to define the data product, and see the end result. Like in a general ETL pipeline, where you see the input and the output, we did the same thing and curated different components that would cater to us at different levels, be it at a mesh experience level, a data product experience level, or an infrastructure level, one that correlates with PayPal infrastructure. So we designed those components and went for feedback from the users. We rolled out, and we again got the feedback. And that’s how the components came about. Some essential components, for example: you get the notifications, you get the checks on the data points, et cetera. These are some very basic things that we defined at different levels of our quantum and mesh.

Eric Broda (00:14:08):
Okay. Now, I know at PayPal I think it’s called a data quantum, which is the equivalent of a data product, if I’m not mistaken. But let me just ask a follow-up question. You mentioned you use microservices extensively. Is there a microservice for every data product? And if so, what would that microservice be responsible for? Tell me about the microservice architecture and how it’s aligned to your data products and data mesh.

Laveena Kewlani (00:14:32):
So the concept is not to repeat or do redundant work, like creating a microservice for each data product, but rather to create microservice components that can be utilized by all the data products. That was the approach. We have pipelines that are curated from different microservices, so it’s a decentralized approach: every aspect that we need to touch when it comes to data and data products is laid out in a different component. For each component, we defined a microservice, and that entire orchestra of microservices is triggered when a requirement for a data product comes in, from the input to the output. And then we also have a feedback loop that, again, is a microservice.
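
The idea of one shared set of stateless services that every data product request flows through, with a feedback loop at the end, can be sketched roughly as below. This is only an illustration under stated assumptions, not PayPal's implementation: the step names (`validate_contract`, `run_quality_checks`, `notify_owner`), the request fields, and the `Pipeline` class are all hypothetical, and plain functions stand in for real microservices.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: each "microservice" is modeled as a stateless function
# that takes a request dict and returns a new, enriched request dict.
Step = Callable[[Dict], Dict]


def validate_contract(request: Dict) -> Dict:
    # Hypothetical step: reject requests that do not reference a contract.
    if not request.get("contract_id"):
        raise ValueError("missing contract_id")
    return {**request, "validated": True}


def run_quality_checks(request: Dict) -> Dict:
    # Hypothetical step: a real service would evaluate data quality rules here.
    return {**request, "quality_checked": True}


def notify_owner(request: Dict) -> Dict:
    # Hypothetical step: record who would be notified about this request.
    return {**request, "notified": request.get("owner", "unknown")}


# The same list of steps is reused for every data product; nothing here is
# specific to one product.
SHARED_STEPS: List[Step] = [validate_contract, run_quality_checks, notify_owner]


@dataclass
class Pipeline:
    steps: List[Step]
    feedback: List[str] = field(default_factory=list)

    def run(self, request: Dict) -> Dict:
        state = dict(request)
        for step in self.steps:
            state = step(state)
            # Feedback loop: each completed step reports back.
            self.feedback.append(f"{step.__name__}: ok")
        return state


def build_pipeline() -> Pipeline:
    return Pipeline(steps=list(SHARED_STEPS))
```

A request for any data product would go through `build_pipeline().run({...})`; adding a new data product requires a new request, not a new service.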

Eric Broda (00:15:39):
Fantastic. I love the architecture. But Kruthika, let me ask you a question. We’ve talked about data products and data quanta. How does PayPal define a data quantum or a data product? What is the definition that rings true to PayPal?

Kruthika Potlapally (00:15:56):
So at a very basic level, a data quantum is based on a domain, a business unit that we are trying to solve a problem for. It contains a collection of data sets that solve a business problem. That is with respect to the data. From a platform perspective, it encompasses the different planes, the observability plane, the discovery plane, and the control plane, and an interoperable data model. As I mentioned, at a very basic level, these are the four different components, and this is how we come up with the entire data product.

Eric Broda (00:16:49):
Okay. Now, if you don’t mind me asking, in general terms, how many data products or data quanta has PayPal been able to implement so far?

Kruthika Potlapally (00:16:58):
We began with six data products. Now we are expanding to about 40, and a lot more are scheduled to be added this year.

Eric Broda (00:17:12):
Wow. So safe to say that the benefits of data mesh are loud and clear at PayPal, and it’s taking PayPal by storm, in effect. That’s fantastic news. Laveena, let me ask you the next question. We’ve got six data products or data quanta to start, and we have 40 coming up very soon. How does PayPal go about choosing their data products?

Laveena Kewlani (00:17:36):
So the choosing is basically done by the data product team, which is another team that takes care of it, like what is a priority. We build the platform, and they utilize the platform. The prioritization is done by another team, which is more of a business team. So it’s not on us. We just provide them the service: okay, this is how you do it.

Eric Broda (00:18:06):
Okay, perfect. So, having chosen your data products, Kruthika, maybe give me an overview of how they’re delivered. What does a data quantum or data product life cycle actually look like?

Kruthika Potlapally (00:18:22):
So it all starts with the creation of a data contract. I think everyone has a bit of an idea of what a data contract is: it contains all the information that drives the data product. It could be information about the source tables, the target tables, the schema documentation, the SLAs, the data quality rules, who the ownership lies with, and the usage policies. So the data contract acts as the single source of truth that drives a data quantum. We begin there, and then, since a quantum or a product encompasses the different data sets, the data engineering teams work on building the transformation logic. And we add the different planes of observability, control, and discovery to it. So in a nutshell, it starts off with the data contract, building that contract between the different teams involved, and then the transformation logic interacts with the platform that provides these different planes. All of this together is implemented as a data product.
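
To make the notion of a data contract more concrete, here is a minimal sketch of one as a Python data structure. It is not PayPal's data contract template; every class and field name (`DataContract`, `QualityRule`, `sla_hours`, and so on) is an assumption chosen only to mirror the elements listed above: source and target tables, schema, service levels, data quality rules, ownership, and usage policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class QualityRule:
    column: str
    check: str  # hypothetical rule name, e.g. "not_null" or "unique"


@dataclass(frozen=True)
class DataContract:
    name: str
    owner: str
    source_tables: List[str]
    target_tables: List[str]
    schema: Dict[str, str]  # column name -> type name
    sla_hours: int = 24  # assumed meaning: maximum data freshness in hours
    usage_policy: str = "internal"
    quality_rules: List[QualityRule] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the contract is usable."""
        issues: List[str] = []
        if not self.owner:
            issues.append("owner is required")
        if not self.source_tables:
            issues.append("at least one source table is required")
        if not self.target_tables:
            issues.append("at least one target table is required")
        for rule in self.quality_rules:
            if rule.column not in self.schema:
                issues.append(f"quality rule references unknown column: {rule.column}")
        return issues


# Example contract with made-up table and column names.
payments_contract = DataContract(
    name="payments_daily",
    owner="payments-domain-team",
    source_tables=["raw.transactions"],
    target_tables=["curated.payments_daily"],
    schema={"transaction_id": "STRING", "amount": "NUMERIC"},
    quality_rules=[QualityRule("transaction_id", "not_null")],
)
```

Because the contract is the single source of truth, downstream steps (transformation, observability, discovery) would read from an object like this rather than from separate configuration.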

Eric Broda (00:19:45):
Now, just a follow-up question: who actually builds the data product? Where do the software engineers, if you will, or data engineers actually reside? Are they in Jean-Georges’s or your team, or Laveena’s team? Or are they in a separate team?

Kruthika Potlapally (00:20:00):
So the platform engineers are the software engineers building the platform and providing the integration between all of these planes, whereas the data engineers are the ones actually working on the business logic and the transformation logic.

Laveena Kewlani (00:20:18):
Yeah, that’s another team, one that gets the business priorities from the different domain owners, who have to see different aspects of the data from different perspectives. And then they curate the data products. And, as I mentioned, the life cycle, the process that we have curated, is quite unique and interesting, and that’s the reason it’s patented. Yeah, the life cycle. That was the entire hard work around the microservices, the integration, the triggers and notifications, everything. So that’s the reason it’s patented.

Eric Broda (00:21:06):
Wow. Congratulations. Jean-Georges, you have not told me about that. I think that’s a marvelous milestone. I’m definitely gonna have to look that one up.

Jean-Georges Perrin (00:21:17):
So, just to add to that, we’ve got this policy about, you know, patents and open source, and PayPal is very open about those things. I think that patents are actually a help to innovation because they guarantee investment, but open source is also a major help to innovation. And I’m at liberty to announce that we are going to open source our data contract template, coming very soon to a GitHub near you. We hope that some people will take it and own it and modify it and fork it and bring feedback, so we can have, hopefully, almost an industry standard around data contracts.

Eric Broda (00:22:07):
Well, folks, for the audience, we just heard a potential news release being created.

Jean-Georges Perrin (00:22:13):
Well, CNN is on it. Okay.

Eric Broda (00:22:16):
Yeah. Well, this is big news because, you know, PayPal is a luminary in the industry, and when they move, a lot of people follow. So that’s a huge achievement. And I, for one, am absolutely, positively looking forward to seeing how you’ve implemented that data contract. Now let me go to Laveena on the tech stack for a typical data product. There are kind of two schools of thought on this. One is that the data product owner is the king of the castle, if you will, and is able to select a tech stack that works for them, which may be different from the enterprise’s. There are some downsides to that, but there are also upsides. On the other side of the continuum, we have enterprise standard products and kind of an edict that says you’re gonna use the enterprise standard approach or tech stack. What does a typical data product or data quantum stack look like, and how much flexibility is there in that architecture?

Laveena Kewlani (00:23:19):
So PayPal has majorly used the GCP tech stack. And as you mentioned, the team of data engineers is more inclined towards a different tech stack, and then there’s this enterprise platform that is trying to keep things safe with respect to security. I think in this matter we were a little lucky, because there was quite a good sync between the different types of users. One type of user was basically the data engineers, who are curating the products, using the business requirements and creating data contracts. And then there are people who are going to utilize that on a visual basis, using Tableau, Looker, and so on. So, in general, our product strategy with respect to architecture and services was based on whom we were going to help.

(00:24:23):
So every quarter we used to decide: okay, this time we are going to help a bunch of engineers who are going to create data contracts. How are we going to help them? Do we give them a platform where they’re able to create the data platform, or do we help them create the entire pipeline together? So every quarter we had a role, and after consulting with them, we came up with the tech stack, also in accordance with the PayPal strategy. Things had been quite in sync. It was a little bit challenging to get things in order with respect to everyone, but eventually it happened. And as I mentioned, we also have an end user who’s going to use that, you could say a business analyst. They also integrated the end product with different tools, just like you integrate your Redshift database with Tableau or Looker. So the same strategy was applied to the end product.

Eric Broda (00:25:29):
Okay. Now, you mentioned...

Kruthika Potlapally (00:25:32):
Yeah, in addition to what Laveena said, we actually integrated with a lot of enterprise-based frameworks as well. And the platform that we finally developed is actually platform agnostic: although we have used GCP-based tools, the code base can easily be changed to any other cloud-based service or on-prem data platform.
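
One common way to keep a code base platform agnostic is to have the platform logic depend on a small interface and keep anything cloud-specific in an adapter behind it. The sketch below shows that general pattern only; it is not a description of PayPal's code. `TableStore`, `InMemoryStore`, and `publish` are hypothetical names, and the in-memory adapter stands in for what would be a GCP, other cloud, or on-prem implementation.

```python
from typing import Dict, List, Protocol


class TableStore(Protocol):
    """Assumed minimal interface that the platform code depends on."""

    def write_rows(self, table: str, rows: List[dict]) -> int:
        ...


class InMemoryStore:
    """Stand-in adapter; a real one would wrap a specific cloud or on-prem client."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[dict]] = {}

    def write_rows(self, table: str, rows: List[dict]) -> int:
        self.tables.setdefault(table, []).extend(rows)
        return len(rows)


def publish(store: TableStore, table: str, rows: List[dict]) -> int:
    # Platform logic only sees the interface, so swapping the backend
    # means writing a new adapter, not changing this function.
    return store.write_rows(table, rows)
```

With this shape, moving to another backend would mean supplying a different object that has a compatible `write_rows` method.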

Eric Broda (00:26:00):
So it’s very interesting you mention that. In my experience building data mesh, one of the hardest challenges is reconciling data product owner flexibility, and the ability for them to do what they need to optimize for speed and agility, with meeting the needs of the enterprise, which is trying to have some measure of enforcement and governance around consistency and standardization. Now, Kruthika, perhaps your perspective. Laveena mentioned there’s always a balancing act in trying to do that. Tell me how you did that balancing act. What worked and what didn’t work? How did you convince folks of the approach, and perhaps the contracts or the standardization, when perhaps they were doing something different or had a different vision? How did you reconcile that?

Kruthika Potlapally (00:26:49):
Oh, it’s definitely been a challenge. Initially, when we began, data mesh was a very new terminology, and so was the kind of architecture, even for PayPal, because most of the data platforms are streaming or batch. They’re more batch driven. Whereas when we came to implementing data mesh, it was more of a stateless architecture using microservices, which was very different from what most data engineers, big data engineers, or cloud engineers had implemented so far. So initially it was challenging, but the approach that we took, as Laveena mentioned, was that every PI, every quarter, we focused on certain problems to solve for our users. For example, if we were catering to solving problems for data scientists, we would discuss the strategy and the approach, discuss their problems, come up with solutions with respect to data mesh, and get their feedback.

(00:28:03):
So we had a constant UAT sort of thing in evolving how the platform worked. The strategy that we took was more product thinking, wherein we cater to what our users require, be it data engineers, data scientists, or data analysts. And how we got them involved is that at every step we used to discuss their needs, their issues, and how we would solve the problems using data mesh. It required a change in thinking and strategy, definitely. But over a period of time, with a lot of training and user manuals, the engineers and the analysts got onboarded to the idea. It’s been a journey. It didn’t happen overnight, definitely.

Eric Broda (00:29:04):
Okay. Now, Laveena, as a... oh, sorry. Go ahead.

Laveena Kewlani (00:29:07):
Yeah, so to add just a simple example. You asked, how did the two worlds come together, right? A very classic example is that I saw that some people here were very UI oriented, and some people, like data end users, are very Python oriented; they have a very Pythonic way of doing stuff. And we catered to everyone using the same microservices, the same stateless architecture. I mean, I don’t wanna brag, but we were able to make the majority happy because we were able to come up with products that fulfilled everyone’s requirements.

Eric Broda (00:29:52):
So ultimately, if I were to characterize it simplistically, you came up with a very flexible set of tools that solve very practical problems. And that’s how you actually got your end users and other engineers, et cetera, on board. Fantastic. By the way, that’s been my experience, and I think that’s the only way to actually do this. The older style, as I characterize it, where there’s an enterprise standard and thou shalt use it, I don’t think that works, at least in the data mesh world. And I think even in the traditional IT world, it is working less and less. But I wanna continue just a little bit on the life cycle side. So perhaps, Laveena, I’ll start with you. Describe the DevSecOps capability, or more specifically the continuous integration and continuous deployment or delivery, where you take a data product from ideation, or more specifically from development, something that’s kind of code, and move it through test and then actually move it into production. Walk me through the CI/CD or the DevSecOps process that you’ve adopted for data mesh and data products at PayPal.

Laveena Kewlani (00:31:10):
Okay. So, put very simply, the entire architecture, the entire orchestra that was built at the different levels of data mesh, be it infrastructure level, mesh experience level, or data product level, was developed at dev as well as prod level, in different environments. Once the tests were passing in the dev environment, we moved on to prod. So this was the strategy: we had our entire setup hosted not just in one place but in different places, so that we can do a very thorough health check, because, as JG said, we are planning to make our data products basically public, so we don’t want any mess-ups to happen at that level, at the public level.

(00:32:12):
And when it comes to DevOps and SecOps, there are some PayPal limitations due to which I cannot tell you how the internal security passes were taken care of. But at a general level, the setup that was built, as I mentioned, was decentralized, and that’s one of the reasons we were able to adapt to the orchestra that we have: the internal security setup, the governance setup, and the compliance that we have for data and the enterprise data platform within PayPal. We were able to align with them. If it had been a centralized structure, which is usually practiced in data centers, I think it would have been very difficult. So in order to... yeah.

Kruthika Potlapally (00:33:10):
Sorry, go ahead.

Laveena Kewlani (00:33:11):
No, go ahead.

Kruthika Potlapally (00:33:14):
Yeah, and to add on, we integrated with the existing tools that PayPal has. We didn’t really have to build any custom CI/CD strategy for our platform. So it made things simple. Everything was governed according to the standards of PayPal.
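
The general rule described in this exchange, that a change only moves from the development environment to production once its tests pass, can be sketched as a simple promotion gate. This illustrates the rule only and says nothing about PayPal's internal CI/CD tooling, which the panelists did not describe; the environment names and the `promote` and `failed_checks` functions are assumptions.

```python
from typing import Callable, Dict, List

# Assumed environment order; a real setup may have more stages.
ENVIRONMENTS: List[str] = ["dev", "prod"]

# A check receives the environment name and returns True when it passes.
Check = Callable[[str], bool]


def failed_checks(checks: Dict[str, Check], env: str) -> List[str]:
    """Names of the checks that do not pass in the given environment."""
    return [name for name, check in checks.items() if not check(env)]


def promote(checks: Dict[str, Check], current_env: str) -> str:
    """Return the next environment, or raise if any check fails in the current one."""
    failures = failed_checks(checks, current_env)
    if failures:
        raise RuntimeError(f"cannot promote from {current_env}: {', '.join(failures)}")
    position = ENVIRONMENTS.index(current_env)
    if position == len(ENVIRONMENTS) - 1:
        return current_env  # already in the last environment
    return ENVIRONMENTS[position + 1]
```

In practice the checks would be health checks and test suites run by the existing pipeline; here they are plain callables so the gate itself stays easy to read.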

Eric Broda (00:33:36):
Okay. Now I’m gonna switch topics a little bit, but you mentioned the word governance. I would say all too often, maybe far too often, governance is used as a bad word, and in some respects perhaps fairly, because in the old world it was, you know, you build something, I’ll look at it and tell you if it meets the standards, and inevitably it doesn’t, and then you have to retrofit stuff. So the old approach, and I’m dramatically simplifying, mind you, wasn’t the right approach for data mesh. In fact, data mesh suggests a federated governance approach. In other words, every data product may have a unique, hopefully standardized but somewhat bespoke, governance approach. So I’ll start with you, Kruthika. How do you govern the data products? How do you know that they’re doing the right thing? How do you know that they’re secure? How do you know that the data quality is appropriate? Walk me through how PayPal actually does data product or data quantum governance.

Kruthika Potlapally (00:34:47):
So for data governance specifically, we collaborated with our enterprise data governance team, and they have become our best friends. Through the entire journey, we became good friends with many different teams, but most specifically data governance. So we took in their standards and their compliance and governance rules, and we set those up as part of our platform for securing the data products. And as I mentioned before, most of the governance techniques or policies, and the ownership, come through the data contract: the usage policies, the SLA standards, the data quality checks. All of these are given to the product owners, whoever is owning the data product, and are curated and approved by them. And their approval is the basis on which we create a data product. So, in a nutshell, it is driven by the data contracts, but in close collaboration with the standards set by our enterprise data governance team.
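
To illustrate the contract-driven approach Kruthika describes, here is a minimal sketch in Python. It is not PayPal’s actual schema or platform code: the class name, field names, and check names are all hypothetical, chosen only to show a contract that carries a usage policy, an SLA, and quality checks, and that must be approved by the product owner and cover the governance team’s required checks before a data product is created.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataContract:
    """Hypothetical data contract; every field name here is illustrative."""
    product_name: str
    owner: str
    sources: list[str]
    usage_policy: str
    freshness_sla_hours: int
    quality_checks: list[str] = field(default_factory=list)
    approved_by_owner: bool = False


def can_create_data_product(
    contract: DataContract, required_checks: set[str]
) -> tuple[bool, list[str]]:
    """Return (ok, problems) for a contract against governance requirements.

    `required_checks` stands in for the rules supplied by an enterprise
    governance team; the contract must declare all of them.
    """
    problems: list[str] = []
    if not contract.approved_by_owner:
        problems.append("contract not approved by product owner")
    missing = sorted(required_checks - set(contract.quality_checks))
    if missing:
        problems.append("missing governance checks: " + ", ".join(missing))
    if not contract.sources:
        problems.append("no source datasets")
    return (not problems, problems)


# Example: an unapproved contract that lacks one required check.
draft = DataContract(
    product_name="merchant_payments",
    owner="payments-team",
    sources=["payments.transactions"],
    usage_policy="internal-analytics-only",
    freshness_sla_hours=24,
    quality_checks=["not_null:transaction_id"],
)
ok, problems = can_create_data_product(
    draft, {"not_null:transaction_id", "freshness"}
)
print(ok, problems)
```

The point of the sketch is the ordering: the governance rules are inputs to the platform, and the owner’s approval gates creation, so nothing has to be reviewed and retrofitted after the product is built.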

Eric Broda (00:36:07):
Well, that’s a great news story. Laveena, I’m gonna have a follow-up question related to governance. My experience, anyway, is that if you can make data governance real time and if you can automate it, that’s kind of the key to success. And it sounds like that is kind of the spirit of what happened at PayPal. But tell me, how did you automate or create real-time governance, and to what degree did you get there at PayPal?

Laveena Kewlani (00:36:38):
So just to add something on the governance side to what Kruthika mentioned, that we had a very close and friendly collaboration with the enterprise governance team: you will be surprised to know that we were able to use their services in our architecture as well. So it’s not that the data contract was getting manual feedback from them, or that they were coming in and tweaking it. It was all automated, because we were able to use the rule setup and everything they have from a governance perspective, like how the data should look, et cetera. And they were providing the health check; they had already established a strategy for checking the health of the data from the different sources, in different aspects.

(00:37:35):
And we were able to utilize that result as well. So that saved us a lot of computation, because we were already getting a curated report card from them, and we were using that report card, because that’s optimal. And they had already established a setup for that real-time system, so we just integrated that. Now, how we created the real-time aspect: as we mentioned, we created the services as stateless microservices, and the entire service was being utilized by different products at different times. So we didn’t do any batch processing. It was more of a trigger-based system.

Kruthika Potlapally (00:38:31):
Yeah, in addition, for most of the data use cases, a lot of them were batched, and onboarding all of these rules or data quality checks and having them scheduled was, again, done through a batch process. And a lot of it was manual too. So in our approach, we automated the end-to-end data quality, observability, and governance aspects, including the scheduling strategy. And I think we helped our data engineers reduce their time by about 20 to 30%.

Eric Broda (00:39:21):
Fantastic. You mentioned the word observability. So Kruthika, I’m just gonna ask a follow-up question in that regard. How does PayPal monitor their data mesh, their data products? How did you implement observability, and what does observability actually mean to PayPal? Maybe start there.

Kruthika Potlapally (00:39:38):
Yeah. Observability, in the way we have implemented it, means we track and look at how consistent and reliable the data is. Does it match the requirements? I mean, basically, the quality of the data: is it meeting the standards? Can we trust this data for our use cases? Is it according to the enterprise standards, and can any user use these data sets for their business use cases? So how we implemented observability is we have many different checks. Some of them are data quality checks, some of them are system-level checks, some of them are at the control level. As Laveena mentioned, it’s at the mesh level, the data product level, as well as the infrastructure level.

(00:40:42):
So some of these tools were custom built by our team, and some were already part of the PayPal ecosystem; all we had to do was integrate with them. So we combined all of this, and what was interesting is we came up with a statistical measure, a rating score, just like what we have with Amazon or any of these services, to tell how reliable and consistent the data is. All the users have to do is look at the ratings, and they will easily get to know if it is meeting the standards or the reliability levels.
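
The panel does not say how the rating score is computed. As one possible sketch, assuming the score is a weighted pass rate over the check levels Kruthika lists (data quality, system, control) scaled to a five-star range, it could look like this in Python. The `CheckResult` type, the level names, and the weights are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CheckResult:
    """One observability check outcome (hypothetical structure)."""
    name: str
    level: str  # e.g. "data_quality", "system", or "control"
    passed: bool


def rating(
    results: list[CheckResult],
    weights: dict[str, float] | None = None,
    max_stars: float = 5.0,
) -> float:
    """Weighted share of passed checks, scaled to 0..max_stars.

    Levels missing from `weights` default to a weight of 1.0.
    Returns 0.0 when there are no checks to score.
    """
    weights = weights or {}
    total = sum(weights.get(r.level, 1.0) for r in results)
    if total == 0:
        return 0.0
    earned = sum(weights.get(r.level, 1.0) for r in results if r.passed)
    return round(max_stars * earned / total, 1)


checks = [
    CheckResult("null_rate", "data_quality", True),
    CheckResult("row_count_drift", "data_quality", False),
    CheckResult("pipeline_uptime", "system", True),
    CheckResult("access_policy", "control", True),
]
# Data quality checks count double in this illustrative weighting.
print(rating(checks, weights={"data_quality": 2.0}))  # 3.3
```

A single number like this hides which check failed, so a real UI would presumably let users drill down from the rating to the individual results.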

Eric Broda (00:41:28):
Perfect. Now, I’m just gonna pause here for a minute. For the audience: we would love to get some of your questions. You are able to submit your questions and they will arrive to us. So we’ll try and spend some time near the end of the session to answer questions. I haven’t received any yet, but I’m sure there will be some. So that’s a quick administrative note. Now I’m gonna continue, Laveena, on the observability side or, more specifically, the data quality side. What I’ve seen in the past and what I’ve done in the past is, in any data pipeline, we use products like Great Expectations, for example. And we use some of the fantastic capability in that tool to actually understand the profile of the data, and you can obviously ascertain quality metrics from that. That’s how I’ve seen it done, and done it, in the past. Is that a similar approach to what you are using at PayPal? Or different tools? Does that resonate?

Laveena Kewlani (00:42:30):
So, as Kruthika mentioned, some of them already existed, from our governance team or the platform. We simply integrated those health checkups directly from the database. For example, as Kruthika mentioned previously, a data contract can cover a single data source or multiple data sources. Now, getting a health checkup, a data quality health checkup, on a combined product was something that we curated. And we had different statistics that were, again, approved by the governance team, because they had already created a standard setup. And of course, we also integrated with the product owners and the end users. Think of how a data analyst looks at the data: some data is coming from source A, some data is coming from source B, and it is combined. It’s like you have 50 students in a class, and you are getting the report card of each and every student. And then when you look at the report at the school level, you look at a particular class. So the approach that was applied to a small section, a particular dataset, was also applied at the bigger, combined domain level. So there was not much difference in that approach.

Kruthika Potlapally (00:44:05):
Yeah. In addition, PayPal already has existing data quality frameworks, very similar to Great Expectations. And yeah, we looked at both, and because these are enterprise-accepted in the organization, we went with the existing frameworks that PayPal already has.

Laveena Kewlani (00:44:25):
Okay, perfect. And also, to further add on whether we are doing it right or not: it’s not just the user of that particular domain who has access to that health information. It’s not only the owner who has access to the entire health or data quality checks. It’s all the users, who can not only interact with the different data products but can also see the usage history of different users who are using a particular data product.

Eric Broda (00:45:02):
Okay. Perfect. Now I’m just gonna have one more question, and then there’s a variety of questions that I have now from the audience. So I’m just gonna ask one more question, and we’ll start with Kruthika. Clearly you’ve been on the data mesh journey for a while. Any lessons learned or advice on what worked and perhaps what didn’t work? And then I’ll ask the same of Laveena, and then we’ll go to some audience questions.

Kruthika Potlapally (00:45:28):
So, one of the lessons that we learned is that when there is a change, it is not accepted overnight. It requires a lot of back and forth with the users and, at the same time, with the enterprise teams. So starting this conversation early on, bringing awareness as to why this change could be useful to the organization, and to a small team as well, and getting that acceptance, faith, and confidence becomes very crucial. If you do not have the approval of your users or the enterprise teams, it is going to become a big hurdle to pass. So this is something we tried bringing in early on, and it’s still in progress. That’s one of the lessons that we learned as part of our journey.

Eric Broda (00:46:23):
Well, Laveena, what do you think? Any words of advice, lessons learned?

Laveena Kewlani (00:46:28):
Yeah. In a general neural network or AI, you have this forward propagation and backward propagation. What I realized is that forward propagation is like when you design the product, et cetera, and you get feedback from everyone for whom you are getting the product ready. But once the product is ready, people think that their job is done. But no, you have to have that back propagation; that’s what makes the neural network very intelligent. So we also used that strategy, and that helped us curate a lot of additional products surrounding the data mesh that are going to help our users at different levels: a data engineer who is creating a data contract, or a data visualization engineer who is visualizing the data. So we don’t just have a data mesh. We also have tiny, tiny products that help in creating a better experience of the entire data mesh era that we are living in.

Eric Broda (00:47:49):
Perfect. Well, I’ll tell you what, there’s a variety of questions, so I’m gonna kind of summarize one or two. First one here is from Jerome, Jerome rather Zuma. And I think there are others that are related to this, but how do users interact with the data mesh? Kruthika, why don’t we start with you?

Kruthika Potlapally (00:48:10):
Yeah. As Laveena mentioned, we built multiple user experience tools for our users. Some of them are more into business analysis and visualization, so we have a UI created for them for discovery and basic analysis. We also cater to our data scientists or data analysts who predominantly use notebooks; we have libraries created for the notebooks. And for visualization engineers or BI engineers, we have integrations with Tableau, Looker, these visualization tools. So we have tools catering to each of our end users. It is not one deliverable for all users. It’s multiple tools customized to fit different users.

Eric Broda (00:49:09):
Okay. Laveena, I’ll ask you the next question, from Martin Harrison. What do you use to create a catalog for your community, I presume a data product catalog? How does PayPal manage that? Obviously you have six of them now, going to 40, perhaps hundreds in the future. Do you use a catalog? How do you catalog the data products?

Laveena Kewlani (00:49:31):
So cataloging, when it comes to the creation of the data product, is basically very business driven. We have the different business stream teams who prioritize the data domains, the data checking, and all the visualization at a domain level. They come to us with a requirement, and a data engineer from a business team helps them follow the outlined process that we have created for creating a data product. Now, when the data product is created using a single data set or different data sets with numerous data types, the entire pipeline creates an end product. That end product, again, is cataloged using the same tools that Kruthika mentioned, especially the UI that we have created. It’s marvelous; it helps you in listing all the catalogs and the products in the catalogs. So at the end, we have control of the visualization, but the front-end request is mostly business oriented.

Eric Broda (00:50:40):
Okay. Now, a related question. This one’s from Raul Das, and it may be related to the catalog: how do you manage to make data products, or data quanta, discoverable? Kruthika, we’ll start with you.

Kruthika Potlapally (00:50:56):
Yeah. So our UI has a powerful search functionality. It is very Google-like, very user-friendly, wherein you can use Boolean operators or wildcards to search through the data. In addition to this, we are working with our data cataloging team to either integrate with the existing frameworks that we have, or evolve those frameworks based on the lessons learned here. We are figuring that strategy out right now. But our data cataloging and our discovery is based on a search engine, just like Google.
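
As a rough illustration of keyword discovery with Boolean operators and wildcards, here is a small Python sketch. It is an assumption-laden toy, not PayPal’s search engine: the catalog entries are invented, the query grammar (uppercase `AND`, `OR`, `NOT`, with `*` and `?` wildcards) is hypothetical, and matching is done per word with the standard library’s `fnmatch` module rather than a real search index.

```python
from __future__ import annotations

import fnmatch
import re

# Invented catalog: data product name -> description.
CATALOG = {
    "merchant_payments": "Daily merchant payment transactions",
    "customer_risk": "Customer risk scores for fraud detection",
    "merchant_disputes": "Merchant dispute and chargeback history",
}


def _words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _matches(query: str, words: set[str]) -> bool:
    """Evaluate a query of OR-separated groups of AND/NOT terms."""
    for group in re.split(r"\s+OR\s+", query.strip()):
        terms = group.split()
        if not terms:
            continue
        ok, negate = True, False
        for term in terms:
            if term == "AND":
                continue
            if term == "NOT":
                negate = True
                continue
            hit = any(fnmatch.fnmatchcase(w, term.lower()) for w in words)
            if hit == negate:  # missing required term, or excluded term found
                ok = False
                break
            negate = False
        if ok:
            return True
    return False


def search(query: str) -> list[str]:
    """Return the names of catalog entries matching the query, sorted."""
    return sorted(
        name
        for name, description in CATALOG.items()
        if _matches(query, _words(name + " " + description))
    )


print(search("merchant AND pay*"))       # ['merchant_payments']
print(search("fraud OR chargeback"))     # ['customer_risk', 'merchant_disputes']
print(search("merchant NOT dispute*"))   # ['merchant_payments']
```

Terms within a group are implicitly ANDed, so `merchant pay*` behaves the same as `merchant AND pay*`; a production search engine would add ranking, stemming, and an index instead of scanning every entry.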

Eric Broda (00:51:46):
Okay. Now, just, oh, sorry, Laveena, go ahead.

Laveena Kewlani (00:51:50):
Simply, you just have to write your keywords and you’ll get all the products related to those keywords, and you can browse through them.

Eric Broda (00:52:00):
Okay. Now, there’s a question from Hugo around what tools you use for your transformations or your pipelines. So is dbt, for example, the go-to product, or do you have other tools? Are they custom built? Laveena, we’ll start with you. Perl scripts, Perl scripts, old school. Jean-Georges is old school. Hey, come on, man. I mean, this is reliable. Laveena, do you agree with Jean-Georges?

Laveena Kewlani (00:52:29):
Yeah, it’s basically very old school, and as we have to abide by the enterprise levels, we have to use a little bit opo,

Eric Broda (00:52:38):
Really. Okay.

Kruthika Potlapally (00:52:39):
No, I think we’re all kidding. But yeah, GCP, all the technologies, yeah, Popa, BigQuery, Spark, all the big data technologies.

Laveena Kewlani (00:52:58):
Everything is there.

Eric Broda (00:53:00):
Okay. Okay.

Laveena Kewlani (00:53:02):
Go ahead. Everything is there. So,

Eric Broda (00:53:05):
Yeah, I was gonna say, everything I’ve heard is that PayPal is very, very advanced, and you almost had me there. I could not figure out why you would use Perl for the life of me, but I wasn’t gonna put you on the spot on this forum. But you had me fooled. One of the questions is related to pipelines, and I’m kind of summarizing or reading between the lines. There’s a school of thought that says, I have data products, and there’s a pipeline that joins the two. And then there’s another school of thought that says there are no pipelines external to data products; they’re all inside a data product. Which school of thought has PayPal implemented? Laveena, we’ll start with you.

Laveena Kewlani (00:53:49):
I think it’s very data product oriented here. Even at the mesh level, it’s basically at the data product level of a data mesh. It’s very product oriented. It’s not like the same pipeline can cater to everything.

Eric Broda (00:54:15):
Okay. It’s within the

Laveena Kewlani (00:54:17):
Data product

Eric Broda (00:54:18):
So it’s within. Now, that’s a very different school of thought relative to what’s out in the industry. So let me kind of build on that. In the industry, there’s a lot of discussion around dbt, not so much around data products. And even dbt, if I read the most recent announcement, which I think is coming out in late April or May, very soon, is moving towards having explicit data contracts and even the notion of a data mesh. But the old school says, I’ve got data products, but I also have a central group that manages my data pipelines, or dbt. And I would argue that while it can get the job done, it does reinforce the whole centralization approach, which has its own set of problems, not necessarily insurmountable. But you’ve chosen to do the opposite. So, Laveena, walk me through how you got the team to go from, I’m gonna have centralized pipelines, to this new data mesh construct where I can wrap the pipelines inside the data product. How did you evolve the thinking towards that?

Laveena Kewlani (00:55:26):
So it was not an overnight thing to get them to migrate their thinking from a centralized to a decentralized approach. But consider what happens when you create a domain view of a product, combining three or four different data products together to create a curated report for an end user, who is usually a management person who has to make decisions based on that data. We showed them the amount of time it takes when a centralized team is taking care of everything, compared with single data product pipelines that are then combined. It’s like you have different pipes through which water is flowing, and you have a nozzle which combines them together. So we took that approach. When we are creating the data products, it’s like meshing the quanta, I would rather say: you have different quanta and they’re interacting with each other. You have water flowing from two pipes, but when they combine, the force, the power, and the entire statistics change, right? So we showed them how it changes, how they were spending so much time managing those changes, and how the decentralized approach was going to save them 30 to 40% of their time.

Eric Broda (00:57:02):
Outstanding.

Kruthika Potlapally (00:57:03):
In addition, we again used the same principles of ownership. The enterprise data sets are owned by the respective teams. And we were trying to eliminate the data lake, or a centralized data layer that brings data from different sources into one location. So, yeah.

Laveena Kewlani (00:57:30):
And there was a lot of redundancy also. We saw that different data products were being consumed by different teams in different manners. Why can’t they just use the same product in a different way? They just have to submit or create a data product that utilizes the same aspect of the data. So that saves a lot of time, computation, and money, of course: the computation and the storage that is needed to keep a domain, or to keep the same data product, on a simple Redshift cluster. The prices are just bombastic.

Eric Broda (00:58:17):
On that note, we are rapidly running out of time, so I’m just gonna close here very quickly. First off, obviously, thank you, Laveena, Kruthika, and Jean-Georges. What has been absolutely clear to me here is, you know, Jean-Georges, while you’re the leader of the team, you have some outstanding engineers and architects who have built everything that I can see. Your team is the epitome of the best practices implemented within data mesh. And I think you folks are a well-deserved leader in the data mesh field. So congratulations for that. Real quickly here, if you want to hear more from Kruthika, Laveena, and Jean-Georges, we’re gonna have another meetup on becoming a data mesh engineer. So Kruthika and Laveena, perhaps less so Jean-Georges, are gonna tell you how they became these outstanding data mesh engineers. That’s in two weeks’ time, sorry, four weeks’ time, May 25th, my apologies. So we welcome you back to that. So now we’re at the close. I just wanna say once again, thank you very much to our eminent data mesh practitioners, and as always, I’ll give Jean-Georges the last word. Anything that you wanna say in closing? You should be very proud of your team here, my friend.

Jean-Georges Perrin (00:59:47):
I am, I am super proud of my team. They did amazing work. I look forward to more work and more deliverables with them. It’s been an incredible journey. It’s been a great learning experience. And behind the scenes, you know, I had people all over the US, India, and China. So the human aspect of that combined with the richness of culture. And of course, you know, you put a Frenchman in the middle of that, it’s a bit like you put an elephant in the middle of a porcelain store. Okay? So I did some gaffes here and there, cultural gaffes. But fortunately, everybody’s still there. And we had a really good time and a great moment all together. Thank you, Eric. I know we have other projects; we will share those as we go. Thank you for the podcast. Thank you for inviting my team. They’re great. I told you they were great. So yeah,

Eric Broda (01:01:06):
They are. Now, on that note, to the audience, thank you very much for your time and participation. I’m very hopeful that you’ve learned a little bit, and hopefully enough that you’ll be able to accelerate your organization’s data mesh journey. So, with that in mind, I’m gonna close off and look forward to having further meetups in the future. Bye now. Thank you.

Jean-Georges Perrin (01:01:32):
Bye-bye. Thank you.