A summary of the presentation during our Data Mesh Learning community session on Life Sciences held in June
In June, the Data Mesh Learning community hosted a presentation about data products within precision medicine, featuring two speakers from Tag.bio.
Our first speaker, Jesse Paquette, presented challenges in the life sciences industry related to data management, and the potential benefits of applying data mesh principles, specifically decentralized domain-driven data products, to improve data accessibility, interoperability, and automation in research and development processes.
Our second speaker, Sanjay Padhi, explained how Tag.bio harnessed the power of the mesh in its platform. He highlighted several use cases, including the deployment of the Tag.bio platform in pharmaceutical industries for tasks like cleaning up and harmonizing clinical trials and analyzing real-world evidence. Overall, Sanjay emphasized the power of data products, their connectivity, and the various capabilities of the platform for data analysis, machine learning, and AI applications.
Multimodal Data Products for Precision Medicine
Presenters:
- Jesse Paquette, Chief Science Officer, Tag.bio
- Sanjay Padhi, Chief Technologist and Executive Vice President, Tag.bio
Watch the Replay
Read the Transcript
Download the PDF or scroll to the bottom of this post
Ways to Participate
To catch an upcoming event, check out our Meetup page.
Let us know if you want to share a case study or use case with the community.
Data Mesh Learning community resources
- Engage with us on Slack
- Organize a local meetup
- Attend an upcoming event
- Join an end-user roundtable
- Help us showcase data mesh end-user journeys
- Sign up for our newsletter
- Become a community sponsor
You can follow the conversation on LinkedIn and Twitter.
Transcript
Melissa Logan (00:00):
So thanks, everybody, for joining today's Data Mesh Learning community meetup. Data Mesh Learning is an independent community of over 7,000 data leaders who are on their data mesh journey, looking to make connections with other people going through the same journey, ask questions, and find support and resources along the way. We host regular virtual meetups. We have a library of content on the website that we will be expanding over the course of this year. We have one-to-one end-user roundtables for people in the community, a way for you to connect confidentially with other end users on their data journey, along with other programs to help people connect and learn. Everything we do is made possible by our sponsors, which include Agile Lab, Nextdata, Reltio, Starburst, and 10 others.
(00:52):
And today we have special guests from Tag.bio here: Sanjay Padhi and Jesse Paquette will discuss how to build, deploy, and maintain multimodal data products for precision medicine using a purpose-built data product engine and scalable data mesh architecture. We are very excited to have them here to talk about this topic. If you have questions for them, please ask throughout the event; we'll be answering questions at the end. Just type them in the chat window on YouTube, and we will get them asked. Without further ado, I'll turn it over to you folks to share your knowledge with us. Thanks, and welcome.
Jesse Paquette (01:28):
Thank you, Melissa. Really happy to be here. I'll go right into it: I'll briefly describe my background as well as Tag.bio, start the presentation, and then halfway through switch over to Sanjay. You've heard the title and seen the bios, so I'll quickly describe my background in a bit more detail so you know where I'm coming from. I'm not entirely sure of the audience here; I'm presuming it's more centered around data mesh, but I'm guessing there are some folks here who, like me, have been in the life sciences space for quite a while, and those folks might recognize the organizations I've worked for in the past. I have about 20 years of experience in the industry.
(02:19):
You can't click on the video to follow these links to LinkedIn and Medium, but they can be shared afterwards, and I'd appreciate connecting with anyone for follow-up; I'm sure there will be a lot of interesting questions that come out of this. I started at Genedata in Switzerland, where I was a Java developer, and then I got a master's degree in bioinformatics in Denver. Then I went to the UCSF Cancer Center, where I was a bioinformatician, so I worked very much on the services side of bio data and biomedical data. After that I went to Ayasdi in Palo Alto, where I led the bioinformatics effort, helping that software company target its product toward life sciences.
(03:03):
And then in 2014 I co-founded Tag.bio along with Tom Covington, our CEO. It's very warm right here; I'm in Brussels right now, and it's one of those hot days in Europe and I'm in a stuffy room, so I might start to sweat. I hope you don't mind. I want to highlight that in 2015 we filed a patent, before Zhamak Dehghani and the landmark data mesh paper at ThoughtWorks in 2019, but it ended up looking rather similar to what was later described as data products, or thinking about data as a product. One of the main diagrams in the patent we filed in 2015 is what we called a dataset hypervisor. The idea was that it would sit on top of a data source and provide access to algorithms and API methods in a way that was very harmonized, so that each dataset presented its own layer of applications and API just like all of the other datasets.
(04:05):
So the concept of decentralized data was something we had from the beginning in 2015, and we've spent about eight years building it out. For those who aren't as familiar with life sciences and its data challenges: I'm going to divide life sciences into two main data domains, although it can really be subdivided into hundreds of data domains; it's a really diverse industry. If you take a major pharmaceutical company, they have tens of thousands, if not hundreds of thousands, of internal datasets from all corners of the business. There are other use cases that have been described in life sciences that focus more on the business side, and by the business side I'm talking about manufacturing of the compounds and therapeutics, marketing of those therapeutics, and sales.
(05:01):
That side tends to be similar to other industries that do manufacturing, marketing, and sales, and those datasets tend to be similar, although there are always differences. I have focused more in my career on research and development. R&D is different, and it's distinct for life sciences. It involves really specialized data types: DNA sequencing, RNA sequencing, compound screening. These produce very specialized datasets in very specialized formats, and they require specialized individuals to analyze them. What's also very important is that it's regulated: there's personal health information, and there are consent forms around the clinical trials governing how you can share the data and what you can use it for.
(05:58):
And "evolving rapidly" is really important, because any technology that was implemented, say, five years ago is now obsolete with regard to the new emerging types of omics sequencing. Single-cell omics, for example, has emerged in the last five years as one of the most important things these organizations are doing, because immunotherapy is proving to be one of the most promising therapeutic areas in the entire industry for almost every type of disease. I've listed a bunch of different data types here, and there are hundreds more specialized types. Making a single data lake for all of that is a big challenge because of the specialization in file formats, sources of the data, silos, ownership, consent, and so on. I'll get into a bit more about the use cases.
(06:52):
So we have the business side again; the business side looks like other industries, where you have data lakes, warehouses, typical ETL tools, and BI dashboards for the business users. For research and development, the end users are researchers; they have PhDs in chemistry or biochemistry or cancer biology. And there are a lot of things you need to do with the data beyond storing and analyzing it. DNA sequencing, for example, has a major processing pipeline component. When the DNA is sequenced, it's sequenced in a bunch of little sliced-up pieces of DNA that all need to be aligned and mapped to the genome, and then you need to be able to call where the variants are, where the patient's genome is different from the reference genome.
(07:39):
That's DNA sequencing, and there's RNA sequencing and all sorts of proteomics. Every complex type of R&D data in life sciences has a typical processing pipeline that needs to happen, and processing pipelines are relatively mature in the industry; there are a bunch of vendors, a bunch of open source tools, and the major cloud providers, AWS, Microsoft, and Google, all provide these tools as well. Clinical trials is a different domain. Clinical trials is heavily influenced by the fact that SAS, the statistics software company, has had a de facto monopoly on clinical trial submissions to the FDA and the European Medicines Agency. So a lot of that is a legacy process for organizations, where these pharma companies have a bunch of biostatisticians who are only trained in SAS, and all they do is clinical trials analysis.
(08:37):
When you get to analyzing data and asking questions in the research and development process, what's very important is that most analysis use cases are ad hoc. It's really hard to automate things into, say, a BI dashboard, because the researcher might have a question about a different gene, or about how this gene behaves in a different dataset. And because these cases are mostly ad hoc, you have to do these things again and again. Once you're analyzing data in life sciences, you find that most of the time you have to redo a lot of things. So this is the problem statement I'm addressing here: we want to build a system that is really useful for lots of diverse multimodal data sources, and that enables reproducibility and very quick automation of these common analysis use cases.
(09:25):
Even though a question is ad hoc, it might fit a common pattern that we can automate. And I don't want to criticize bioinformaticians and statisticians too much, because I am one, and my peers are all bioinformaticians and computational biologists. But there's a challenge: they end up representing a bottleneck in the process precisely because they're so important to it. They understand the data, they understand the analysis methods for the data, they can perform all of these methods, and they have to do this again and again. Because it's a human component in what could otherwise be automated by software or systems, the challenge is figuring out how to leverage the skills of those folks while automating much of what they do.
(10:17):
Here's a typical data story. The researcher might be a cancer biologist, and they have a multimodal data question. They found a specific gene expression biomarker to be significant in their clinical trial, but they want to ask: what about the other, similar clinical trials that we did? Was this gene important in those trials as well? They have to go to a team or an individual who can help them do that, a bioinformatician or biostatistician. They end up having multiple meetings to clarify the question and the scope of the data sources they want to ask about. Then those folks have to go locate the data, re-query, remodel, and reanalyze each of those data sources, and then iterate with the researcher to make sure that's the right question they wanted to ask.
(10:59):
And that the answer makes sense. Then there are usually follow-up questions. The time required in that process is usually about one to three months. That's a long time to answer one question with data, and again, it's the human element; even just scheduling meetings can take most of the time. You might have to wait a week while someone is on vacation, for example. Then another researcher comes in with a similar question about a different type of biomarker, and that tends to make everyone do everything all over again, because the intermediate steps were not well preserved in the first process and need to be redone. And finally, if you see this at the bottom, this is really important to the lifeblood of pharmaceuticals, because they're spending hundreds of millions of dollars on each of these clinical trials.
(11:49):
Even if a trial fails, they've generated a lot of high-value data that should be reusable. I'm going to go through this slide briefly; let me do a quick time check. There's a concept called FAIR. FAIR is an acronym meaning findable, accessible, interoperable, and reusable. FAIR predates data mesh; it was driven, I think, by a lot of the same needs that drove Zhamak Dehghani and the ThoughtWorks folks to present their manifesto in 2019. Each of these concepts is of course important individually: you have to be able to locate the data; it has to be accessible to you, with authorization of course; and then the data, as you access it, needs to be interoperable.
(12:39):
If you access one clinical trial, it should be just as easy to work with as another clinical trial. And reusability is of course very important. This mostly covers storage and presentation, and as such, if you look at the next point, it's easy to achieve at the metadata layer. You can attach metadata tags to a bunch of different data sources and put them into a data catalog, and this has been done in life sciences for the last ten-plus years. These organizations let people search the data catalogs to see whether datasets are applicable to their question, but they don't actually get to drill into those datasets at that point.
(13:23):
Then they have to go to a person and say, okay, I found the dataset I want to use and I have authorization for it; now can you go and manually talk to the right people and get access to the data, and so on. So FAIR has always been easy to achieve at the metadata layer and hard at the data layer. But now we have the era of data mesh and the idea of decentralized, domain-driven data products, and we've found this to be very effective in pharma. The idea of ownership is very important, so that the researchers on a clinical trial, or on similar types of clinical trials, help design the data product to make it accessible for their questions and the questions of their peers.
(14:06):
The data has to be modeled specially for each type of data, for each different multimodal data source, and this is why data lakes and data warehouses aren't a very good idea in life sciences. Oftentimes there are very precise, very specific algorithms that need to be used on each data type, not just in the processing step but also in the analysis step, so the data product should also have pluggability for these algorithms. Additionally, and I think this is critically important, and it's what I'll cover in the rest of my talk: the API of each of these data products needs to be harmonized, so that you can talk to each data product, not just to its metadata layer, but also access the data and the algorithms the data product contains, in a harmonized way, so that you don't have to read a manual to understand how each data product works.
(14:55):
And in life sciences, and I'm sure it's critical in other industries as well, it's absolutely paramount that everything is versioned: the data is versioned, the algorithms are versioned, the data modeling is versioned, and all of the API methods are versioned. Not just so that everything works in a software system, but so that you can go to the FDA, or another regulatory agency, or your own peers within your organization, and say: this thing I did before, I can do it again, and I can do it again on new data, for example, and it will do exactly the same thing it did before. So I'll briefly describe the Tag.bio data product. What I just described, the ideal data product that maps to the FAIR model as well as to data mesh, is what we have developed.
(15:42):
A Tag.bio data product is a software system, a software application that sits on top of long-term data storage; it is not itself long-term data storage. It represents a single or multimodal dataset, and it's domain specific. So one clinical trial, with all of the patient records, all of the biomarkers measured over time, all of the outcomes, and all of the gene expression or other types of omics data, would be a good example of one data product. That data product is deployed with our Tag.bio software layer, and therefore it gets a harmonized API, which serves up metadata about the data product as well as the data itself. And all of the algorithms are invoked via this third point here, which are these API methods.
(16:29):
Each of these API methods we call a protocol, and the protocol, once called via the API, is versioned and testable. It represents robust software that you can trust in the process to run the algorithm on the data and send back a response containing the answers to your questions. Data contracts have emerged more recently, and what we see at Tag.bio is that the signature of the API methods available within each data product represents the data contract. You can in fact change the schema of the data underneath, because you access the data, and the questions you would ask of it, through the API methods of the data product.
(17:15):
I'll get to some diagrams now for the folks who were wondering how this might look. We use this triangle diagram to represent a single data product. The data product contains a mapping side, where the source data is mapped in and then modeled; we use this little logo here in the middle to represent the data model within a Tag.bio data product. The data is mapped in in a way that simplifies the schema of the data and makes it more useful. If you've got 30 tables in your source data warehouse, this brings the data into a more unified view, where you're looking at a single entity type and potentially mapping data onto those entities. The goal is that the data in the Tag.bio data product is modeled such that you can easily run algorithms, because algorithm developers will be able to get the data immediately in a data frame, which would be a pandas data frame or an R tibble in the tidyverse model.
(18:16):
That's really important: every time someone wants to build a new algorithm or a new visualization on top of the data, they don't have to query the 30 tables and figure out how to join all of the information to do the right thing. And the API, of course, is harmonized so that all of the data products speak the same language. If you think about it, it's very similar to the World Wide Web model, where every web server is able to communicate with a browser; you can point your browser at any web server in the world that you have access to, and your browser will properly communicate with that web server. We do the same thing with our data products, so that any client, whether it's a code client, a front-end client, or just a data scientist, can talk to any of these data products in the exact same way.
(19:03):
Even though each data product has different data, in a different data model, with different algorithms under the hood, the design principle is that we want the right people and the right systems working with the different components. The data mapping layer tends to be a data engineering exercise, so we've segregated that process from the folks who might be doing data visualization, or the data scientists who might be bringing in ML or other very sophisticated life science algorithms for the data products. Those tend to be two different skill sets, so we've separated the data mapping layer from the algorithms layer, and then the API layer is also covered as well; I'll describe it on the next slide. We have a JSON configuration for mapping the data in, kind of similar to dbt: the idea is that you define the source schema, how all of the tables join, how they map to entities, and how the different columns from the source tables are parsed into the new data model.
(20:07):
A lot of this tends to be auto-generated by our system. Then on the algorithm side you can plug in algorithms, but this tends to be the data scientists or the bioinformaticians taking R or Python code they've already written and plugging it in as an algorithm in each data product. Finally, each of those algorithms is formulated as a question that someone can ask via the API, through our JSON configuration layer for the API. I won't talk so much about the clients at the moment, but we do have a web client, which I'll be demoing, and we have a developer studio, and folks can interact with each data product via their own IDE for R or Python. We have R and Python libraries which you would install to be able to talk to the data products.
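To make that concrete, here is a minimal sketch of what talking to a data product from a Python client might look like. The endpoint, token handling, helper name, and response shape are all illustrative assumptions, not Tag.bio's actual library or API.

```python
# Hypothetical sketch: querying a data product via a harmonized API.
# Module, endpoint, and field names are illustrative, not Tag.bio's real library.
import pandas as pd
import requests

DATA_PRODUCT_URL = "https://dataproducts.example.org/clinical-trial-001"  # assumed endpoint
API_TOKEN = "..."  # issued per user / per data product

def run_protocol(protocol_id: str, params: dict) -> pd.DataFrame:
    """Call a versioned API method ("protocol") on a data product and
    return the result as a pandas data frame."""
    response = requests.post(
        f"{DATA_PRODUCT_URL}/protocols/{protocol_id}/run",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"parameters": params},
        timeout=60,
    )
    response.raise_for_status()
    payload = response.json()
    # A harmonized contract means every data product returns results
    # in the same envelope, regardless of its internal schema.
    return pd.DataFrame(payload["result"]["rows"])

# Example: ask whether a gene expression biomarker is associated with survival.
df = run_protocol("survival-by-biomarker", {"gene": "EGFR", "endpoint": "overall_survival"})
print(df.head())
```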
(20:55):
I'm going to go through this slide briefly because I don't think I have a lot of time, but we've really worked with a ton of different types of life sciences data. We have a bunch of turnkey data mappings for things like OMOP, clinical trials in the CDISC formats, electronic medical records, and lots of different types of omics datasets. Those turnkey mappings will automatically produce a data product from a data source that is in one of those schemas, but we still see a lot of ad hoc or novel schemas coming through, so we ingest a lot of data from organizations' data warehouses and data lakes. If you think about where Tag.bio fits within a data ecosystem, we sit on top of the data storage layer, on top of the data marts or data warehouses, for example Databricks or just a relational database; that's where the Tag.bio system would go.
(21:54):
I'm just going to cover this briefly, but this is a data mapping schema that we have. You can see it's JSON, so it's similar to dbt; we also have a YAML option if you want. It defines tables and joins and parsing. We can cover that more if there are questions, because I don't have a lot of time. This is an API method: it defines a protocol, with metadata about the protocol, what people can expect, and what parameters the users or the clients can choose from. And then once the protocol, the API method, executes, what is it going to do? In this case, it's going to execute this plugin function, written by a bioinformatician, which exists at this file path here.
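As a rough illustration of the kind of configuration being described, here is a hedged sketch written as Python dictionaries that mirror a JSON structure. Every key and value below is invented for illustration; it is not Tag.bio's actual schema.

```python
# Hypothetical sketch of the two JSON configurations described above,
# expressed as Python dicts. Keys and values are illustrative only.

# 1. Data mapping config: how source tables join and map onto entities.
data_mapping = {
    "source": {"type": "warehouse", "connection": "trial_db"},
    "entities": {
        "patient": {
            "table": "subjects",
            "key": "subject_id",
            "joins": [
                {"table": "outcomes", "on": "subject_id"},
                {"table": "expression", "on": "subject_id"},
            ],
            "columns": {
                "age": {"source": "subjects.age_years", "type": "numeric"},
                "os_months": {"source": "outcomes.overall_survival", "type": "numeric"},
            },
        }
    },
}

# 2. Protocol (API method) config: a question users can ask, backed by a plugin.
protocol = {
    "id": "survival-by-biomarker",
    "version": "1.3.0",
    "description": "Kaplan-Meier analysis stratified by a chosen biomarker",
    "parameters": [
        {"name": "gene", "type": "string", "choices_from": "expression.gene_symbol"},
        {"name": "endpoint", "type": "enum", "values": ["overall_survival", "pfs"]},
    ],
    "execute": {"plugin": "plugins/survival_by_biomarker.Rmd"},
}
```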
(22:37):
This is the example of the R Markdown plugin that was written by the bioinformatician, and here's a screenshot of what it produces after the API method is run. The idea is also that we want this to be robust software, so we bring in standard software development processes. We've got JSON configurations at the data mapping layer and again at the API layer, and we have custom algorithms that are written; the JSON and the R and the Python are all code. All of that code for every data product goes in its own Git repository, so the data product is essentially defined by the versioning system of the Git repository. And then our CI/CD process, or any CI/CD process you might want to use, can very quickly automate the deployment and testing of these from the Git repository layer.
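For readers curious what such a plugin might look like in shape, here is a hedged Python sketch of an algorithm function that receives the modeled data frame plus user-chosen parameters and returns a serializable result. The talk's actual example is an R Markdown plugin; the function name, columns, and logic below are purely illustrative.

```python
# Hypothetical sketch of a data product "plugin" algorithm: a function that
# receives the modeled data frame and parameters and returns a result the
# API layer can serialize. Column names (e.g. "os_months") are assumptions.
import pandas as pd

def survival_by_biomarker(data: pd.DataFrame, gene: str, cutoff: float = 0.5) -> dict:
    """Split patients into high/low expression groups for `gene` and
    summarize a survival column for each group."""
    threshold = data[gene].quantile(cutoff)
    high = data[data[gene] >= threshold]
    low = data[data[gene] < threshold]
    return {
        "gene": gene,
        "n_high": len(high),
        "n_low": len(low),
        "median_survival_high": float(high["os_months"].median()),
        "median_survival_low": float(low["os_months"].median()),
    }
```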
(23:30):
The developer simply has to push to the master branch or open a pull request, and that kicks off the process. Here's a quick diagram of that process: the trigger occurs, and then all of this automation runs, with testing and deployment health checks, and you can deploy to a master environment in production or to a staging environment. I'm kind of rushing through because I want to get to a demo, and I know Sanjay has things to talk about. We've got a Kubernetes cluster that these data products are deployed into, and data products can of course talk to each other; in the data mesh paradigm, that works very well. But also any client that's authorized on a data product can talk to that data product.
(24:15):
This is all, of course, deployed within an organization's network. It's not software as a service, so the data never leaves the network; the Tag.bio system is deployed within our customer's network. Now I'm going to jump over to a quick demo, but Sanjay, if you could keep me to time; I think I'm doing okay, but I don't want to overlap and take too much of your time. This is an example of our front end. This web application allows a researcher, or a physician, or a business user to click to ask and answer their questions. Internally we call them clickers, and we distinguish between two types of user groups: clickers and coders. I'm going to show the clicker interface, but we also have the coder interface, and we provide a JupyterHub.
(25:06):
But coders can also bring their own development environment, their own IDE, for example. Our coders would use our studio, and they would simply import the Tag.bio libraries and start talking to data products from their own comfortable environment. One thing we've realized is that coders, as skilled as they are, don't typically want to learn new technologies or tools, so you have to bring your technology to their environment. There's a whole lot I've discussed so far and I can't go into everything in great detail, but if you have any questions, please feel free to reach out to me or Sanjay afterwards. From the main page, I'm going to go into a new analysis. I'm going to go to, not that one, I want to go to clinical trials.
(25:57):
This is a published clinical trial from Pfizer, so it's not Pfizer proprietary data. I've clicked into the data product, and we're starting to see the apps shown here. These are the API methods, the protocols, that I can start to use to answer questions. Some of them analyze the outcome endpoints of this cancer clinical trial with regard to survival; we can interrogate specific biomarkers. You can see there's an R icon on these: they use custom R plugins that were written specifically for this type of dataset, this type of data product, by a bioinformatician. We have clustering with UMAP, t-SNE, or PCA. If you're from the life sciences omics industry like I am, you'll be familiar with a lot of these methods.
(26:45):
Just to quickly go in and show one example: I'm going to choose EGFR, and this is going to query across two different data products to ask the question. Every data product tends to come with its own cohort builder. This cohort builder, you can see, is very specialized for this data product, but every data product has its own cohort builder that it delivers to a front end, or to a web application like this one, which uses it. And while I run that, it may take a bit, I'm going to show another component of the system. Once we have the versioning of all of the data in each data product, and all of the code within each data product, we have all of these version numbers recorded.
(27:36):
And every time I do an analysis in one of these front ends, my analysis history is retained and saved. The analysis I did today, which is probably just the one I ran just now, yes, the result already shows up here, and I can recall the results exactly as they were produced in the past. I can rerun this analysis, or reconfigure it if I want to change my question. Or perhaps the data has been updated in the data product and the versioning information has changed, so I want to rerun the same analysis I did before on new data, and I can do that here in almost one click; I have to click twice. To go back to what the results look like: this is a very specialized report, very specific to this type of cancer clinical trial, utilizing TCGA, The Cancer Genome Atlas, data.
(28:30):
If you scroll through, I'll just come down to the hazard ratios for EGFR, and you can see we have the Kaplan-Meier curves. All of these visualizations are things you wouldn't be able to use in a typical BI dashboard, because they're highly specialized to this industry and to this type of data product. Just briefly before Sanjay goes, Sanjay, you're almost ready, I'll show that we have a completely different type of data product here. I'm going to go to this one, which is healthcare costs. Another side of our business is to work with another side of the industry, healthcare, on the payment side and the patient care side: looking at patient medical histories, looking at the appropriate treatment for a patient, or the appropriate treatment given cost for a patient.
(29:23):
So this is a different data product. We're accessing it using the same harmonized API, but it has completely different apps because it's not related to a clinical trial. It's an 80,000-patient dataset with MS-DRGs, the healthcare claim categories, and all of the costs and charges associated with them. Our engine, which was developed a long time ago, works with any type of tabular data source, and in the past we've worked with data in many different industries. There was a time when we, as a startup, were going to try to challenge the entire world of data and show every industry that they could build their own data products this way, but we've found that the use cases in life sciences right now, and the business opportunities there, present a distinct enough challenge for us and for the growth of the company.
(30:23):
So we're seeing that in order to eventually serve all different industries, we're going to do really well in life sciences first, and that's where we're growing right now; we're really excited to be there. It's also my specialty, as well as that of a number of my colleagues, like Sanjay, for example. Normally I'm not presenting to a silent audience, and usually people interrupt with questions. I don't know whether there have been questions so far, but I'm happy to answer questions at the end. I'll go and browse what the questions might be while Sanjay is presenting his side. Sanjay, is there anything you'd like me to add before I switch over to you?
Sanjay Padhi (31:04):
No, no, that’s fine. Thanks.
Jesse Paquette (31:07):
Okay. All right. I’m gonna stop sharing now, and you’re on.
Sanjay Padhi (31:12):
Hi, I'm Sanjay. I'm the chief technologist at Tag.bio. I will present how we harness the power of the mesh. What Jesse just described was the data product and how to get insights directly from data products; today I'll talk a little bit about how we interface with the rest of the system. If you look at this particular diagram, the platform provides single sign-on. We also have logging, tracing, and other management, so that we can analyze the logs themselves. We ingest data from all sources: Amazon Redshift, Databricks, data warehouses, and so on. What you saw is the API layer connected to the mapping layer as well as the algorithms, as a single unit, and once you define that unit, you define a set of data products associated with it.
(32:08):
But now, since it's harmonized and served as an API layer, you can interrogate it using machine learning, and we support almost all machine learning frameworks from the cloud, like SageMaker or AutoML. One can also interrogate it through business intelligence and visualization mechanisms. The data product publishes itself as a product, so one can search for it using their respective search and discovery platform; for example, if you're looking for TCGA, you can find it. And, as Jesse said, we usually deploy at the customer site, in their VPC, with their cloud. A few more details on the use cases: for example, we are deployed in many pharmaceutical companies for things like cleaning up or harmonizing clinical trials, or real-world evidence and adverse effects. There are many datasets.
(33:08):
We support, for example, FAERS, the FDA dataset for adverse events, patient outcome data products, and so on. Like I said, we have deployed at Pfizer, Regeneron, Carex Notch, INS Bio, and several other biotech companies. Here is another study I'd like to highlight. This was a study of renal cancer; renal cancer, or kidney cancer, is one of the most common kinds of cancer found in adults, and it's a really fast-growing cancer. One of the ways to detect it is biomarker analysis, which is really useful for finding the presence of this cancer. There is a protein called PD-L1 that acts like a brake to keep the body's immune response under control. So you can study PD-L1 with respect to a given drug; in this particular case, it's a public study.
(34:06):
As you can see, there is the survival analysis. This particular study took several years to build, but using data products, where we have baked in the algorithms, the harmonized data, the data mapping layer, and the API, one can do the analysis correctly here. Also, for anyone else to do it, we just need to send you the URL and you will know all the parameters; I don't have to send you the data or the code or anything. That is one of the major powers of building a no-code or low-code data mesh system with baked-in algorithms like this. The next topic I'll try to show is this one, TCGA; I think Jesse already mentioned it. It uses UMAP, and as you know, if you use UMAP clustering on the expression data, those clusters correspond closely to the cancer type.
(35:01):
So not only can you automatically cluster it, given the baked-in algorithms, but you can also do follow-up analysis: can I do a cohort comparison? Can I study a gene expression comparison? Can I do pathway analysis? And here the data product can talk to some other data product to get, for example, annotation data. The other area is healthcare. We are deployed at UCSF for value improvement; this technology has saved more than $10 million there. The data product consists of more than 5 million inpatient and outpatient billing records, EMR data, and much of the comparative data. We are also deployed at, for example, the Parkinson's Foundation. Here, not only the clinical part but also the genomic variant data is mapped into the data products.
(36:01):
So the data products not only have the mapping, there is a model there, and it's like a multidimensional data frame. Whereas the clinical parameters form a 2D frame, each patient can also have variant information, so in this particular case it is modeled as a 3D structure where both clinical and omic variant information are present. That way one can study whether a particular mutation, in LRRK2 for example, is responsible for Parkinson's disease, and verify it against clinical parameters. So far we have seen the power of data products; now we'll start seeing how we can harness the power by connecting data products. In this particular example, one can not only study within a data product itself, but also across interconnected data products, and also distributed data products.
(37:01):
And I'll show you a couple of examples. So how do we connect them? One of the ways we do it at Tag.bio is via the API. Think of each data product as a node, which has its own internal IP but also a gatekeeper, so data products can talk to each other by addressing that IP plus your token, your authorization token. Each data product talks to another the same way applications talk to servers: with an API key or auth. And the same mechanism is used to publish a data product. For example, this is a data product; it has its own metadata information and it publishes it: what the commit version was, what the description is, what kind of data product it is, what the start time is, the versioning of the container, and so on.
(37:53):
It can then publish to a general search and discovery index, where you can search for that data product, which goes back to the platform; one can use a JSON payload for that. The same mechanism is also used for data products to talk to each other. For example, if you want to ask a question like, how many people have diabetes in a hospital in Boston and in San Francisco, I don't think there is any platform in the world where you can answer that question without bringing the two datasets together in a single place, and neither the San Francisco folks nor the Boston folks are going to give you the data or move it across. One of the powers of this is that you can actually execute at each of their systems and combine the outcomes, like the total number of people with diabetes.
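To sketch that federated pattern: each hospital's data product answers the count locally and only the aggregate leaves the site. Everything below (endpoints, protocol name, token handling, response shape) is an illustrative assumption rather than the platform's actual interface.

```python
# Hypothetical sketch of a federated count across two data products.
# Only aggregate counts cross the network; patient-level data stays in each product.
import requests

PRODUCTS = {
    "boston": "https://boston-hospital.example.org/products/ehr",
    "san_francisco": "https://sf-hospital.example.org/products/ehr",
}
TOKENS = {"boston": "...", "san_francisco": "..."}  # per-site authorization tokens

def count_patients(site: str, condition: str) -> int:
    """Ask one data product to run a cohort-count protocol locally."""
    r = requests.post(
        f"{PRODUCTS[site]}/protocols/cohort-count/run",
        headers={"Authorization": f"Bearer {TOKENS[site]}"},
        json={"parameters": {"condition": condition}},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["result"]["count"]

# Combine only the aggregates returned by each site.
total = sum(count_patients(site, "diabetes") for site in PRODUCTS)
print(f"Total diabetic patients across sites: {total}")
```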
(38:42):
That's what the mesh allows: you can do a kind of federated analysis directly at their data products and combine the results without exposing anything. Each data product is tightly controlled; the ownership always remains with the data product owner, and the owner can allow other organizations access based on authentication and authorization. Here is one example where we explore the power of the data mesh. Consider three data products: clinical trials, FDA adverse effects, and health economics.
(39:26):
The way we connect them is, again, the same API layer, and here's an example. Let's look at Regeneron. Regeneron has several marketed products, like Eylea, Dupixent, and others; this is from public information. If you consider Dupixent, it's one of the major drugs from Regeneron, and one of its indications is atopic dermatitis. You can get this information directly by interrogating the clinical trial data product, combining it with the FDA data product, and combining that with the health outcomes data product. The question would be: how many clinical trials does Regeneron have, for example? And you can ask the FDA data: what are the adverse effects due to Dupixent, or what are the indications Dupixent treats, and for any given adverse effect, what is the outcome, or how much does it cost to treat?
(40:25):
In this particular example, if you go to our site, these are all public data products, so anyone can use them. This is the FDA data product, where you're asking a simple question: for Dupixent, Eylea, and others, what are the major indications? As you can see, Dupixent is the largest, but it can also be used for asthma and more. Then your next question is: for atopic dermatitis, or eczema, if there are adverse effects, what are the costs? You can use the third data product by connecting to it, and here it's about $2,000 if surgery is used for asthma. So there is intrinsic power there once you start building a mesh and using the power of the mesh to ask various questions.
(41:23):
And that's what the platform allows, essentially. We also talked a little bit about AI and machine learning: the platform provides a mechanism for machine learning and AI training. One of the advantages is that the data is now harmonized; it's in the data model we talked about, like a data frame, a multidimensional data frame, so you can feed it into training, inference, and generative AI. I'll talk a little bit about those three. The platform looks like this: you have a set of data products, which was building block one, where one creates a set of data products. Then the platform allows you to use analysis apps, the ones Jesse showed, but you can also build your AI training model and use inference at the app level, and you can use generative AI, where you can ask questions and it will give you results based on that particular data product or a combination of data products.
(42:24):
In this case we allow, for example, all the machine learning models one can use from public sources, but also your own container, your own model, which we can bake inside the container in order to do the training. Here is an example: the platform allows you to write your own notebook code and submit it to run the training, or schedule it to run at a given time using the compute type you like. This uses the notebook system, but the inference is done at the app level. On the right side you can see KNN accuracy as a function of K for a given data product, and as you can see, the maximum value comes at K = 12. So you can use inference in the app layer and use the notebook system, with built-in connectivity to the cloud, for training.
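As a generic illustration of that kind of notebook experiment (not Tag.bio's actual code or data), here is a small scikit-learn sketch that sweeps K for a K-nearest-neighbors classifier and reports the K with the best cross-validated accuracy, using a public dataset as a stand-in for a data product's data frame.

```python
# Hypothetical notebook sketch: KNN accuracy as a function of K.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Public dataset standing in for the data frame a data product would return.
X, y = load_breast_cancer(return_X_y=True)

ks = range(1, 31)
accuracies = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this choice of K
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    accuracies.append(scores.mean())

best_k = ks[int(np.argmax(accuracies))]
print(f"Best K = {best_k}, accuracy = {max(accuracies):.3f}")
```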
(43:19):
We also started looking into generative AI. As you know, generative AI enables users to quickly generate new content based on a variety of inputs, like images, text, sound, and so on. What we looked into is taking a large foundation model, a public one like BLOOM or LLaMA, whichever large public model, or even a private model, and then, if you want to use a custom or private data product, given that it's in a data frame, you can use prompt tuning, soft prompts, in-context learning, or selective fine-tuning, which means changing just a few layers, or reparametrization, to build your own model. Then we allow that particular model to be hosted on the same platform.
(44:12):
So the data, the model, and all of the fine-tuning mechanics do not leave the customer site, your own site, and you can also do the inference within the same site. There are various ways to build a private model: for example, there are additive methods, adapters being one of them, soft prompt mechanisms, selective parameter-efficient tuning, and so on. Here is one example I can demo: instead of building apps, you can ask a question, "create a pair plot in seaborn of the data product, colored by species." Remember, I didn't even say which data product. Since it is connected to that particular data product, the Iris data product, it uses LangChain to connect to a foundation model in order to plot it.
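For reference, the code that such a natural language request would roughly correspond to might look like the following sketch. It uses the public Iris dataset bundled with seaborn as a stand-in for the data product; this is not the code the platform actually generated.

```python
# Hypothetical sketch: the kind of code the generative AI request above would produce.
# "Create a pair plot in seaborn of the data product, colored by species."
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for the Iris data product: seaborn ships a copy of the Iris dataset.
iris = sns.load_dataset("iris")

# Pair plot of every numeric column, colored by the species label.
sns.pairplot(iris, hue="species")
plt.show()
```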
(45:12):
You can ask other questions in the same way. This is another data product, a COVID registry; it's synthetic data in the OMOP format. Here you're asking how many types of drugs were used, and to plot the top 10 drugs with plt.barh, that is, as a horizontal bar chart, with the y-axis labels in a small font. As you can see, it uses an agent executor chain and plots it for you. In summary, I think I'm running out of time; we should maybe have a dedicated talk at some point on generative AI and data mesh, but these are all implemented in our platform, and we can also play around with them. Tag.bio in general helps customers fine-tune or rebuild generative AI models with private biomedical data, like clinical trials, within the mesh, and it ensures privacy, ownership, and governance. We support two approaches. One is AI or generative AI applications using proprietary or public data; these are apps, so you can do fine-tuning with system prompt design and build apps. But you can also build on your own proprietary model or even a public model; we provide model registries, both private and public, to be used by scientists and ML developers.
(46:34):
So in summary, Tag.bio has three main entities. One is the data product we talked about, where we convert a static file or a static set of files on storage into a data product, which is the application layer; a data product is a domain-driven, harmonized, decentralized application layer. Once we have data products, we connect each data product to gain knowledge, as in the examples I talked about: what are the clinical trials from a given organization, what drugs are used, what are the indications, what are the side effects, and what is the cost for those side effects? That's powerful: you can actually build a data mesh exchange where you allow people to do analysis any number of times, given that the algorithms and harmonized data models are part of the system.
(47:29):
And the third part is how you interrogate the data product. We provide a generative way, where you can ask a question and it builds the plot for you; you can click through the system; or you can also code, and we provide a developer studio, a notebook-based system. It is available in both the AWS and Microsoft marketplaces. If you wish to try it, please use tag.bio/try, or let us know and we can help. I think I'll stop now and see if there are questions. Are there any questions we can answer?
Jesse Paquette (48:45):
Yeah, I think, Sanjay, we could continue by talking about challenges we've experienced. For most of the folks in the audience who are familiar with implementing data mesh, you'll know that, as we described, there's the micro layer of the data mesh, which tends to be at the data product level, and building one data product is often a useful way to get into building the data mesh. But there are also challenges at the macro level. I covered the micro level, the individual data product, and Sanjay covered the macro. Once you have a critical mass of data products in the data mesh, you can start to achieve the things Sanjay was discussing: leveraging all of these different data sources to ask a really complex suite of questions.
(49:36):
And this, I think, is where the truly transformative power is. There are two major advantages to the data mesh in this industry. One is scaling up, because the data warehouse and data lake paradigms did not scale well in pharmaceuticals, at least not on the research and development side, because of the diversity and the differences across all of these data sources, whether that has to do with silos, locations, or ownership, but typically schema and technology are major factors. So the data mesh paradigm applies very nicely to pharma. It helps these organizations scale to use all their data by bringing it into data products. But then, once you have that critical mass of data products in the data mesh, you can start to ask questions you never thought you'd be able to ask before.
(50:41):
Because of the technical challenge of even just organizing all the data to be able to ask that question. Say you want to ask a question across all of your clinical trials, or across 28 billion patient claims, and combine that with patents and with adverse events, as Sanjay described; these are massive undertakings that take pharmaceutical companies years with very large teams, and they can be achieved very quickly with the data mesh model. And then when it comes to the new emerging technology, the hot technology of the day, generative AI, you could use it for building data products, which is something we're working on right now: using generative AI to automatically infer and build data products. All it has to do is produce these JSON mapping layers or the JSON API modules.
(51:35):
But what Sanjay and the team are working on right now is... oh, we have a question. Okay, I'll take a stab at that, and then Sanjay, maybe you can cover it also. We're working with organizations that have their own bureaucratic structures, and we as a company can't typically break down the organizational silos that might exist. But I can give an example of one customer we worked with on clinical trials data. The clinical trials data is owned by a certain group, so for one clinical trial, all of the CDISC data for that trial is in a certain data store, but all of the omics data related to that trial is in a different location, owned by a different group, in a different technology, and used for different purposes.
(52:34):
That's a big challenge, but what we're able to do is integrate at the data product level, so that the result ends up being owned not by the group that owns the clinical trial data, and not by the group that owns the CRO omics data, but instead by the oncology group that has the questions about the data. It ends up transferring the ownership away from the centralized, or even decentralized, IT organization and toward the research organization that actually cares about the usage of that data. It has to do with the data model that's built and the snapshots of the data that are taken, but specifically the data product, the code base, the Git repository, all of that is owned by the oncology group, the domain owners. That's where the API methods, the algorithms, the visualizations, all of the questions are defined, and they have complete ownership over that.
Sanjay Padhi (53:35):
And just a quick remark, Jerry: each data product has an owner. For example, this is a data product in the system; you can go there and see it. Let's pick one of the public ones. Right now it's a public data product, but you can also create a private data product, where these folks are the owners; anytime anyone other than these folks wants to use it, they will not be able to. And if I go back to the ClinicalTrials.gov data product, you can see that for anyone other than those three folks, the clinical trial dataset is locked; there's a lock here, for example. Now, if I want, I can add a third person as a single user or as a group, and then with their API token they can connect to this particular data product and get insights. So each data product has an owner, and the owner can allow or disallow whomever they like, on top of the single sign-on mechanism; authentication and authorization come first, and this comes after that. This is how we keep them separate. As for the second question, by the way, Jerry, please tell me if I missed anything; feel free to ask again.
Jesse Paquette (55:01):
The second question is about when you have different data products owned by different organizations, maybe in different secure networks or in different cloud providers. Yes, that's a question for you.
Sanjay Padhi (55:12):
Yeah. For the second question, assuming these data products are, let's say, in different cloud providers, I have a diagram for that. Here, for example, one data product can be in Azure, another in AWS, and a third in Oracle, and they are all connected by the API layer, for example this one. You see this particular API layer being called: we have a gatekeeper which translates the API call plus your token, so that you are secured. You not only have to be authorized by the owner to access it, you also provide your own token for accessing the data products. This way you can ask multiple questions; it is all done through the single API layer.
(56:00):
Sometimes people use private links, so that these connections through the API layer don't even touch the internet. So you can use a private link, but each of the data products stays within its own secure network, connected by this API layer with the respective token. And it can be on-premises or in the cloud, so multiple data products can be in multiple different places. That's why, in one of the examples I gave, imagine somebody at Boston Children's Hospital has a dataset, and somebody in San Francisco has another. The way we combine them in our scheme is actually a kind of federated approach: execute the command at both of them, and the results are combined, instead of bringing data from one site to the other. Bruma, please let me know if this answers your question; if not, I can dive deeper into it.
(57:11):
And in fact, just to add: for a really long time we were running systems at the same time in AWS, in Microsoft Azure, and on-premises at, let's say, UCSF, and in most cases we were doing interrogation between the AWS and Microsoft Azure data products. So this allows it in a much more secure way with respect to authentication and authorization, plus the API token. Okay, thanks, Tappa. So thank you everyone for joining this meeting, and if you have any other questions, please feel free to ask me or Jesse, or send us an email; our email addresses are here. I'm more than happy to talk to you or help in any way we can.
Jesse Paquette (58:22):
Thanks, everyone, for attending. And thanks to Melissa and Janice for organizing this with the Data Mesh Learning group.
Sanjay Padhi (58:29):
Thank you. Bye.