Microsoft Community Insights

Episode 12 - Navigating Data Platform with Michael Tobin

June 10, 2024 Nicholas Chang


What if your raw data could transform into powerful insights that drive your business forward? Join us for an engaging discussion with Azure consultant Michael Tobin as we unpack the essential components of modern data platforms. From the intricacies of data ingestion to the importance of robust infrastructure setups like hybrid connectivity and role-based access control, Michael shares expert advice on navigating the complexities of data management. We also explore best practices for creating landing zones using Terraform and Bicep, ensuring a structured environment for development, testing, and production.

Speaker 1:

Thank you, hello. Welcome to the Microsoft Community Insights podcast, where we share insights from community experts on Azure. My name is Nicholas and I'll be your host today. In this podcast we will dive into data platforms, but before we get started, I just want to remind you to follow us on social media so you don't miss an episode, and to help us reach more amazing people like yourself. Today we have a special guest called Michael Tobin. Sorry if I pronounce it wrong. Can you start by introducing yourself, please?

Speaker 2:

Yeah, hi everyone, my name is Michael Tobin. I'm an Azure consultant at ANS, so I'm primarily responsible for the delivery of infrastructure on Azure. Pretty much all things infrastructure, so that's landing zones, infrastructure, PaaS services, migrations and data platforms specifically as well, which is what we're going to be talking about today a little bit.

Speaker 1:

Okay, brilliant. So, before we get started, for those who don't know what a data platform is, can you briefly explain what it is?

Speaker 2:

Yeah, of course. So a data platform is essentially a set of technologies. It's not specific to Azure or anything like that. You could have a data platform in AWS, you could have one in GCP. But it's a set of technologies that are designed to manage, process, store and analyze large volumes of data. That data could be structured, it could be semi-structured, it could be unstructured, and the aim is really to get analytics out of the data set that you're ingesting into the platform.

Speaker 1:

Okay, so how does that differ from a traditional database?

Speaker 2:

Yes. So the difference with a platform end-to-end is that a database might be a source for your data platform, and then you want to take that data and run it through a process called ETL, which stands for extract, transform and load. Sometimes your data, if it's raw or unstructured, might not be useful to report on, so you might not be able to get the analytics you want out of it. With ETL, extract is taking your data from source systems and putting it into a staging area for your data platform. Transformation is running logic and business rules against the data. That could be SQL queries, or it could be checks that ensure the data is in the correct structure, because there might be issues with the data you're ingesting. And then load is where you move that transformed data into a target repository. If you're looking at something like a data lake, that could be the next layer in your data lake, for example.

Speaker 1:

Okay, great. So in your experience, what are the key components that are required in data platforms?

Speaker 2:

Yeah, absolutely. So your first key area is ingestion. You need some kind of ingestion, and typically that's using what's called an Azure integration runtime or a self-hosted integration runtime, which allows you to pull your data either from Azure or from an on-premise source. It doesn't have to be on-premise; with a self-hosted integration runtime it could also be other cloud providers. Another component is storage. In Azure that's Blob Storage, and specifically something called Azure Data Lake Storage Gen2, which is essentially large-scale data storage that's based on Blob Storage and allows you to separate data into containers for different layers. Now, there are different concepts around that. Day-to-day I generally see the medallion architecture, which is a Databricks model where you've got bronze, silver and gold, typically as three containers in a storage account. Bronze is your raw data ingestion. Silver is your first level of ETL, so you've filtered, cleaned and augmented the data. And then gold is where you've got business-level reporting.
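To make the bronze, silver and gold layers a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the field names (`region`, `amount`), the cleaning rules and the function names `to_silver` and `to_gold` are hypothetical and were not discussed in the episode, and plain in-memory lists stand in for the storage containers.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Filter and clean raw rows: drop rows with no amount, tidy the region name."""
    silver = []
    for row in bronze_rows:
        if row.get("amount") is None:
            continue  # filtered out: nothing to report on
        silver.append(
            {
                "region": row["region"].strip().title(),  # cleaned
                "amount": float(row["amount"]),  # typed
            }
        )
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into a business-level summary per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


# Bronze: raw data exactly as it was ingested (hypothetical sample rows).
bronze = [
    {"region": " north ", "amount": "10.5"},
    {"region": "North", "amount": "4.5"},
    {"region": "south", "amount": None},
    {"region": "SOUTH", "amount": "3"},
]

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a real platform each layer would be persisted in its own container rather than held in a variable, and the transformations would run in a tool such as Synapse or Databricks.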

Speaker 1:

Okay. So before, when you introduced yourself, you said you work on landing zones. So what's the difference between a typical landing zone and a data platform landing zone?

Speaker 2:

Yeah, so there are some data landing zone architectures, but what we generally recommend is having a landing zone in place before you look to do a data platform.

Speaker 2:

Especially when you're ingesting data from on-premise, it's really important that you've got your hybrid connectivity set up and your private DNS set up correctly, whether that's a private DNS resolver or domain controllers. We strongly recommend having a landing zone. There's a big focus on data at the minute, obviously with AI coming up, so it's really important to have that data processed to what we call a business level. And some people come into projects like this without a landing zone and say, oh, I just want a data platform, but it's really important to have that key landing zone deployment done in the first place, not just for the hybrid connectivity element, but to have a good model of role-based access control, your firewall and things like that. So it is really important to have a landing zone in place when you come to look at deploying a modern data platform and the elements of it.

Speaker 1:

Okay, brilliant. So, for example, when you create the landing zone, do you just create it with infrastructure as code, like Terraform and Bicep, and then put all those data resources, like data lakes and so on, within the landing zone?

Speaker 2:

Yeah, so there's no one size fits all, but what we generally recommend is that you have a dev, test and prod environment, and we typically recommend separating them out into different subscriptions. They would probably sit under the corp management group, and then you'd have a subscription per environment for your data platform. So it fits into the landing zone, but in terms of the resource deployment, we usually keep that separate from the landing zone.

Speaker 1:

Okay, brilliant. So security is very important for a data platform as well. What are the best practices for securing data in your data platform when you create those?

Speaker 2:

Yeah, absolutely. So it splits into two parts for me, and obviously there's a lot more than just two parts, but at a very high level: networking security and a role-based access control strategy. When it comes to networking security, pretty much all the resources that I see in a typical data platform can be connected up with private endpoints, so a storage account can have a private endpoint. Generally we see key vaults used in data platforms to store things like connection strings to databases, and that can be secured with a private endpoint too. Azure Synapse has four different private endpoints attached to it: three for the back end and one for the front-end UI. So if you want, you can have that only accessed from your internal network, which we strongly recommend. Synapse can be exposed to the internet, and if you have a good RBAC strategy and things like that, it's not too much of a worry, but generally we recommend locking that down. Private endpoints are really the key here, making sure that they're configured correctly and public access is turned off. That comes in with policy and your governance strategies as well, so generally we recommend having the Azure policy on for denying public access to PaaS resources. That's a really important one here, because a lot of data platforms are based on these PaaS resources. Storage accounts, Key Vault, Synapse: they're all PaaS resources and they all have public access enabled by default. So locking that down is really key.
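As a rough illustration of the checks described above (public access turned off, private endpoints attached), the following Python sketch audits a hand-written inventory. Everything in it is an assumption for illustration: the `PaasResource` type, its fields, the resource names and the `audit` function are hypothetical, and this is not the Azure Policy definition format or an Azure SDK call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaasResource:
    """Hypothetical, simplified view of a PaaS resource's network settings."""

    name: str
    kind: str
    public_network_access: bool
    private_endpoints: int


def audit(resources):
    """Return one finding per resource setting that leaves the platform exposed."""
    findings = []
    for resource in resources:
        if resource.public_network_access:
            findings.append(f"{resource.name}: public network access is enabled")
        if resource.private_endpoints == 0:
            findings.append(f"{resource.name}: no private endpoint attached")
    return findings


# Hypothetical inventory for a small data platform.
inventory = [
    PaasResource("stdatalake", "storage account", True, 0),
    PaasResource("kv-data", "key vault", False, 1),
    PaasResource("syn-data", "synapse workspace", False, 4),
]

findings = audit(inventory)
for finding in findings:
    print(finding)
```

In practice a deny policy would block the non-compliant deployment in the first place; a report like this only shows the kind of condition such a policy evaluates.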

Speaker 2:

The second thing is thinking about your audience and who's going to use your data platform. Do you have data engineers? Do you have data architects? It's about defining personas and then making sure you build out a role-based access control strategy that fits those different personas. So, does a data engineer need access to a database, for example? Probably. Do they need access to Synapse? Probably. Do they need access to a key vault? Probably not. They might not need to add credentials in, whereas an administrator might need to do that. So it's just about making sure the roles are defined, along with what everyone's doing, and building out RBAC groups in Entra. And then, further on top of that, if you've got P2 licenses, integrating that with PIM, access packages and things like that. So it's really important to have a good role-based access control strategy, as well as your networking strategy.

Speaker 1:

Yeah, because I know that a data platform involves a large amount of data, and I think it's very crucial that you secure your data. You also need scaling features to scale it up on demand, according to the organization.

Speaker 2:

Yeah. So in terms of scaling, depending on what tools you're using, you've got a lot of different options. When it comes to Synapse, you've got dedicated SQL pools and you've got serverless ones. The serverless ones will scale; for your dedicated ones, you generally need to define the SKU yourself. When it comes to other tools like Databricks, you've got clusters, and you can set a minimum and maximum number of workers, which are just virtual machines essentially, and you can define the SKU as well. As for ETL processes, like I mentioned at the start, some smaller ETL processes probably don't need too much compute, so the serverless options are probably the best way to go in terms of cost saving. Databricks have just brought serverless compute into public preview in Azure. That's getting rolled out pretty soon, so that'll really help. It'll probably save a lot of costs when it comes to scaling.
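The idea of a minimum and maximum number of workers can be sketched in a few lines of Python. This is not how Databricks decides when to scale; the `AutoscaleBounds` type, the `workers_needed` function and the tasks-per-worker figure are hypothetical and only show how a requested size is kept within configured bounds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscaleBounds:
    """Hypothetical cluster setting: never fewer than min, never more than max."""

    min_workers: int
    max_workers: int


def workers_needed(pending_tasks, tasks_per_worker, bounds):
    """Size the cluster for the current load, clamped to the configured bounds."""
    wanted = math.ceil(pending_tasks / tasks_per_worker)
    return max(bounds.min_workers, min(bounds.max_workers, wanted))


bounds = AutoscaleBounds(min_workers=2, max_workers=8)
small_job = workers_needed(5, 10, bounds)    # a small ETL run stays at the minimum
medium_job = workers_needed(45, 10, bounds)  # scales within the range
large_job = workers_needed(500, 10, bounds)  # capped at the maximum
print(small_job, medium_job, large_job)
```

The minimum is what a quiet cluster still costs, which is why serverless options can be cheaper for small ETL jobs.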

Speaker 1:

Yeah, so for those who don't know what ETL is, do you want to explain what it is for the viewers?

Speaker 2:

Yeah, of course. So I did touch on it briefly before, but it's essentially a process which stands for extract, transform and load. Extract is pulling data from a source system, so that could be ingesting tables from a database into a storage account. The transformation process is usually defined by data engineers, and they'll typically work with different use cases to shape that data in a certain way.
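A minimal Python sketch of the three ETL steps, using in-memory SQLite databases as stand-ins for the source system and the target store. The table names, the sample rows and the postcode clean-up rule are illustrative assumptions rather than anything prescribed in the episode.

```python
import sqlite3


def extract(source):
    """Extract: pull every row of the source table into a staging list."""
    return source.execute("SELECT name, postcode FROM customers").fetchall()


def transform(staged):
    """Transform: collapse repeated whitespace and upper-case each postcode."""
    return [(name, " ".join(postcode.split()).upper()) for name, postcode in staged]


def load(target, rows):
    """Load: write the cleaned rows into the target table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS customers_clean (name TEXT, postcode TEXT)"
    )
    target.executemany("INSERT INTO customers_clean VALUES (?, ?)", rows)
    target.commit()


# Hypothetical source system with untidy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (name TEXT, postcode TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "ls1  4ap"), ("Grace", "S1 2HE ")],
)

# Hypothetical target store.
target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))

clean = target.execute(
    "SELECT name, postcode FROM customers_clean ORDER BY name"
).fetchall()
print(clean)
```

Keeping each step as its own function mirrors how the stages are usually separated in a pipeline, so a transformation rule can change without touching extraction or loading.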

Speaker 2:

So it could be doing things like validation. An example could be that you pull in a table which has loads of people's postcodes in, but someone might have put two spaces in the postcode, which gives incorrect data. So it's taking data like that and just making sure it's in a fit shape. And then load is taking that data that's been cleaned up and storing it somewhere else. That could be a data warehouse, it could be a data mart, it could be a different type of storage system. Again, usually what we see is that it comes into the storage account in a container, it then moves to a different container, and then a third time, after the last level of ETL, it moves into another container. So three containers in the storage account, bronze, silver and gold, and as it goes through that ETL process, each time it progresses up from bronze to silver to gold.

Speaker 1:

Yeah, because I know where I work, we currently use Purview. Could you use Purview in a data platform?

Speaker 2:

Yeah, so Purview and data platforms go hand in hand. There are connections from Purview to things like Synapse, and connections from Purview to data storage as well. Obviously, it's a massive thing these days, and what I see used the most when it comes to Purview is the data classification feature, so you're classifying the data that's been ingested into the platform. Purview is a massive product and it encompasses a lot of other things. You've got data loss prevention and things like that.

Speaker 2:

But yeah, we see a lot of Purview deployments go hand in hand with data platforms, and you can also hook Purview up to your source systems that are on-premise, so it's not just limited to Azure. That's really handy. I touched on it briefly earlier, but you've got these virtual machines that are called self-hosted integration runtimes, and when it comes to Data Factory and Synapse, their function is to pull data into your platform. Now, Purview has self-hosted integration runtimes as well, but they have a slightly different job. Rather than pulling data in, they're used to scan on-premise data so you can categorize your on-premise data as well. So we do see those go hand in hand with data platforms, and that's becoming really useful for organizations who need to categorize their on-premise databases and source systems as well.

Speaker 1:

Yeah, because I know that when you categorize your resources or databases, you always use labels in Purview. Yeah, exactly, very helpful. OK, so what's the best way to monitor a data platform, performance-wise?

Speaker 2:

Yes, so with these being PaaS services, they are pretty scalable, but there's a lot of monitoring you can do, especially with things like Databricks. There's a lot of logs you can ingest into Log Analytics workspaces, which we generally recommend managing centrally by having a single Log Analytics workspace in your landing zone and ingesting your data in there. Generally, we see logs like pipeline triggers. Who's running pipelines? Have they been triggered automatically? That can be really useful to monitor who's kicking off what pipelines, and whether anyone is manually kicking off ETL pipelines or processes. Like I said, in terms of scalability, there isn't too much you need to do, because a lot of these are based on PaaS resources. But when it comes to monitoring, I definitely recommend having a good Log Analytics strategy, making sure you're pulling in the right audit logs, especially from a security perspective as well, making sure your users have the right access, and things like that.

Speaker 1:

That's brilliant. How can someone new learn about data platforms? Are there any resources that you recommend?

Speaker 2:

There's some really good resources out there, and I think it's worth noting that it's not limited to Azure. There's loads of tools out there. Amazon Redshift is the AWS equivalent of Synapse on Azure, for example. I think that's probably the hardest thing about learning about data platforms: trying to narrow down the tool set that you use, especially with so many products out there. There's a lot of SaaS products out there now as well, like Snowflake, which is warehousing and database rolled into one SaaS product. So yeah, there's a lot of good resources out there.

Speaker 2:

Definitely MS Learn, especially around Synapse and Azure Data Factory. They're the two most common products we see used. It's worth noting it's not a one-size-fits-all, but the common architectures I see are Data Lake and Synapse, and obviously Key Vault as well. They're the three core parts of a data platform. But then you've also got Data Lake, Data Factory and Databricks, for example. Now, the key difference between Synapse and Data Factory is that Data Factory is more of an orchestration tool, so it doesn't do ETL processes like Synapse does. It doesn't have the analytics part, but it does do orchestration, so essentially data movement. You can use it to pull your data in from on-premise, and you can use it to push your data into something like Databricks or into a different system. So there are some key differences there that are definitely worth looking into.

Speaker 1:

Okay, that's brilliant. So, as this episode is almost coming to an end, we would love to hear about you as an individual. Are you going to any events, whether that's technical events or internal events?

Speaker 2:

Yes, so I'm hoping to go to the next Yorkshire Azure User Group. I think that's coming up next month now. I went to the last one in Sheffield and that was really good. I like to try and get to them as much as possible. The Azure user groups are really good. At ANS we have an internal event called TechCom every year, which is where we do talks from everyone across the business, and that's really useful to learn about what everyone's doing across the business. But yeah, that's pretty much it for me. What about you, Nicholas? Any events coming up for you?

Speaker 1:

Yeah, so I'll give myself a little plug. So I'm one of the organizers for Experts Live UK, and it's coming to London. We're going to start a digital group, so it's coming to London next month, on the 20th. So if you're free, you're welcome to join.

Speaker 2:

Yeah, that'd be good. I hope to get down there.

Speaker 1:

Yeah, so it's just that one, and I might be going to the Netherlands for Experts Live. It's quite a big one.

Speaker 2:

Yeah, I'm familiar with the Netherlands one.

Speaker 1:

It's quite a big group. Yeah, so how can someone get in touch with you with any questions regarding data platforms?

Speaker 2:

Yeah, absolutely. So I'm available primarily on LinkedIn. Just Michael Tobin, that's M-I-C-H-A-E-L T-O-B-I-N, and then I've got my blog as well. That's just hosted on mtobinuk. So, yeah, LinkedIn is probably the best place to get in touch with me, and feel free to connect or reach out if you don't have me already.

Speaker 1:

Okay, brilliant. Thank you for joining.
