Microsoft Community Insights Podcast
Welcome to the Microsoft Community Insights Podcast, where we explore the world of Microsoft technologies. We interview experts in the field to share insights, stories, and experiences in the cloud.
If you would like to watch the video version, you can watch it on YouTube below:
https://youtube.com/playlist?list=PLHohm6w4Gzi6KH8FqhIaUN-dbqAPT2wCX&si=BFaJa4LuAsPa2bfH
Hope you enjoy it
Episode 45 - LLM on K8s with Seif Bassem
We start by weighing the trade-offs: managed AI gives you speed, safety, and a deep model catalog, but steady high-volume workloads, strict compliance, or edge latency often tilt the equation. That’s where AKS shines. With managed GPU node pools, NVIDIA drivers and operators handled for you, and Multi-Instance GPU to prevent noisy neighbours, you get reliable performance and better utilisation. Auto-provisioning brings GPU capacity online when traffic surges, and smart scheduling keeps pods where they need to be.
The breakthrough is Kaito, the Kubernetes AI Toolchain Operator that treats models as first-class cloud native apps. Using concise YAML, we containerise models, select presets that optimise vLLM, and expose an OpenAI-compatible endpoint so existing clients work by changing only the URL. We walk through a demo that labels GPU nodes, deploys a model, serves it via vLLM, and validates responses from a simple chat UI and a Python client. Tool calling and MCP fit neatly into this setup, allowing private integrations with internal APIs while keeping data in your environment.
Hello, welcome to the Microsoft Community Insights Podcast, where we share insights from community experts to stay up to date with Microsoft. In this episode, we'll dive into LLMs on AKS, and today we have a special guest, Seif Bassem. Could you please introduce yourself?
SPEAKER_00:Yeah, hi Nicholas. So my name is Seif Bassem. I'm a cloud solution architect at Microsoft, based out of Egypt, and I've been at Microsoft for around eight years. My main focus is really Azure infrastructure, the Cloud Adoption Framework, the Well-Architected Framework, and a bunch of other stuff. But now I'm diving a little bit more into AI infrastructure and AI in general.
SPEAKER_01:So today's theme is all about LLMs on AKS. Before we get started, what inspired you to talk about LLMs on AKS? Have you always been involved with cloud native, like Kubernetes?
SPEAKER_00:Sure. So to answer your question, no, I've not been involved a lot in cloud native, but I've been working a lot with Azure Arc in the past years, and I've worked with customers who have specific requirements, like regulatory compliance, security, and latency requirements. So they were looking into having a piece of Azure on-premises or on other clouds, and now with AI there are still similar requirements, which I'm going to talk about. So I got really interested in how to host large language models on Kubernetes, and specifically on AKS. I thought this would be an interesting area to look at, to see what kind of technology is available and how to help our customers.
SPEAKER_01:Okay. Yeah, because I remember it's very similar to Azure Arc, like with Azure Local bringing on-prem to the cloud. So Kubernetes is pretty much how you can actually integrate AI into your cluster.
SPEAKER_00:Indeed. Yeah, that's what we're gonna figure out today.
SPEAKER_01:Okay, so what have you done so far with AI on Kubernetes?
SPEAKER_00:Yeah, sure. So with Kubernetes there are lots of things going on in the open source world in terms of inferencing engines. Of course, open source models have been skyrocketing in terms of innovation and catching up with the frontier models. You would also see the AI gateway effort and how to properly route AI requests, LLM requests. And there are lots of customers even doing training on Kubernetes, and they're doing fine-tuning. So this area has been really booming. What I'm gonna do today is give an intro for those who are new to this area, explain the landscape, when to think about hosting LLMs on Kubernetes versus using the managed services, give a high-level idea of how it works, and show a demo of how to actually deploy an LLM on Kubernetes.
SPEAKER_01:Yeah, because I recently did a hackathon where we were trying to investigate how to make migration easier, like if you want to migrate to different clusters, because if you're dealing with different versions it can be a lot. So we tried to see how AI can simplify that process.
SPEAKER_00:Yeah, definitely. And if you've seen the migration sessions at the last Ignite, you'll see that there has been some innovation there in terms of helping migration and modernization using Copilot, where instead of figuring out all the dependencies yourself, you let the Copilot modernization agent look at your code, come up with a plan, and help you do that. So definitely in the coming months this is gonna be much, much easier and more streamlined.
SPEAKER_01:Yeah, because another use case we could think of is upgrading, like moving between different AKS versions, which can sometimes be time-consuming. So we'd use it to check the services and make sure everything is upgraded correctly. We could look further into that as an option as well.
SPEAKER_00:Yes, indeed. So let me... I have a couple of slides just to do a quick intro.
SPEAKER_02:Um sorry, so let me share my screen better.
SPEAKER_01:I think you should have permission. Just share the screen that looks like a computer.
SPEAKER_00:Yeah, it's because my slides are a little bit big, so it said I cannot share them, so I'll try to share the screen instead. Just let me know once you can see it.
SPEAKER_02:Got it.
SPEAKER_00:Okay, perfect. So before getting into hosting on Kubernetes and open source, etc., let's examine what the PaaS landscape looks like in terms of LLMs.
SPEAKER_01:With PaaS, you mean the managed services? So PaaS could be like Container Apps or Container Instances?
SPEAKER_00:No, even before that. So I'm talking about the OpenAIs, the Anthropics, Microsoft Foundry, where you just want to consume AI models and you don't want to think about GPUs; you just want a very quick method of consuming large language models. Ideally, with those platforms you go there, you get a big catalog of frontier and open source models, you select the models you need, and then you get an endpoint and an API key. You just put this in your code, and now you have an LLM integrated into your applications. That's what most customers did once the AI boom started. This approach has a lot of benefits and some challenges. Obviously the main benefit is convenience: as I showed, it's just get an API key, select the model, get an endpoint, and then you have immediate access to your LLM. It's very low touch in terms of operations: no GPUs, you don't see GPUs, you don't deploy GPUs, you just get an endpoint. It's easy to scale, of course, because if you find you're hitting the limits, you can just go and move a slider and say I want more tokens, higher limits, and so on, assuming your subscription has those limits available. You get built-in monitoring, built-in evaluation, safety controls and responsible AI, all of which you can use. And of course you get a lot of open source and frontier models; at Ignite, for example, we announced that Anthropic models are available in Foundry, so you're always getting options. Lastly, security. On Azure, for example, you can configure a completely isolated instance, you can use private endpoints, you can do isolation using APIM, and so on. So it has lots of benefits, and it makes sense for a lot of customers and a lot of scenarios. But of course there are some challenges with that approach. If you have consistent, high-volume usage of your LLM, the per-token pricing can sometimes be very, very expensive, and you have to evaluate it.
SPEAKER_01:If your infrastructure is based on Kubernetes and you have lots of logs, and you want to use an LLM to troubleshoot something like a pod restart, it will cost a lot because you'll be paying per token as well.
SPEAKER_00:Exactly, exactly. And those are the kinds of situations where this model might introduce more cost, a high variable cost, because we're looking at tokens, as you said. Flexibility as well: having a big catalog of models can be great, but sometimes it might not have the models you want. For example, if you want to bring your own model, or you want a specific open source model that is not available in the catalog. It can be a limited scenario, but some customers train their own models, so it would be challenging for them to use the managed platforms. One of the big ones, of course, is data and compliance. As I said about the security piece, there are great security capabilities on the managed providers, but again, some customers have specific regulatory standards; they require, for example, full control over the whole tech stack down to the hardware, they have privacy concerns, and so on. So sharing or sending the data to a managed provider or an LLM provider might not be suitable for them. And of course latency. Imagine a scenario where you have a factory or a retail store and you want to provide an intelligent AI application at the edge. Maybe that place has no internet connectivity at all, or very poor connectivity, so you're gonna have huge latency. So having something at the edge would also be important in that situation.
SPEAKER_01:Yeah, they do have something called Foundry Local: you can pull your models locally and then just use them. But I'm not sure if you can use it for production-grade infrastructure if it's local.
SPEAKER_00:Yeah, Foundry Local is a great option, but if you're talking about an enterprise app with retrieval-augmented generation, tool calling, agents, I think it's not there yet. It has a lot of great use cases, though: I think they said at Ignite you're gonna be able to run it on Android, and it has an SDK, so you can develop applications that use Foundry Local. So it has its scenarios and it's gonna evolve, but for enterprise applications I think it's not suited at the moment. Okay, so given those benefits and challenges, there is now a trend towards hosting large language models on Kubernetes. Let's first see why Kubernetes is something a lot of customers looked at to host large language models. Of course, the first thing is scale and portability. You can scale to millions of pods and you can do parallel processing. It's infrastructure agnostic, so you can take workloads on Azure Kubernetes Service and then deploy them on another cloud's Kubernetes service or even an on-prem Kubernetes cluster. It allows you to properly manage GPUs and the nodes that have GPUs: automated installation of the drivers, management of the different NVIDIA software that runs. You can restrict the GPU nodes so the scheduler only places LLM applications on them. You have observability, you have centralized control. So Kubernetes has a lot of benefits that make it a very performant platform to host large language models.
SPEAKER_02:Yeah.
SPEAKER_00:So given that, let's see when self-hosting a large language model would be a good option. We looked at the benefits and challenges of using managed services; now let's look at the benefits and challenges of self-hosting on Kubernetes. First, cost efficiency for high-volume workloads. If you're running lots of inference, like lots of chatbots, lots of agents, lots of batch jobs, you have a very consistent, high-volume workload. If you do the calculation and look at the cost of the GPUs you own, for example using reserved instances or spot instances on Azure, and compare that with the per-token pricing, you will sometimes find that for those high-volume workloads, hosting your own LLM is more economical. But again, you need to do the math. Flexibility and choice: you can deploy whatever open source model you have, whether it's built in-house or on Hugging Face or wherever it's hosted; you can simply host it there. In terms of privacy and compliance, if you have, as we said, strict regulatory or sovereignty requirements and need full control, then this would be a great option. Latency, as we said. Rate limits: I'm sure once ChatGPT was announced on Azure, everyone was very frustrated with the rate limits you get. So if that is not working for your application, you can host the model on your Kubernetes cluster and define your own rate limits for your consumers and users. You own the hardware, so it gives you more room for usage.
SPEAKER_01:But you still have token limits, it's very similar.
SPEAKER_00:Yeah, and you can configure them. When we look at the inference engine we're gonna use today, you can configure that, and you can even configure API keys for your open source model, so it's not open to everyone. You still can have controls like token limiting and rate limiting; you can apply those. But you're in control and you specify what you need. And of course, edge AI deployments: like we said, if you're in a factory or a remote location, then this would be a great option. So when we talk about hosting on Kubernetes, I'm mainly talking about open source models. And as you know, in the past months open source models have improved a lot, and there has been lots of innovation: DeepSeek came and caused a big bang, the open source model from OpenAI, GPT-OSS, and others. So in most scenarios you might find that an open source model would be great. Obviously there are scenarios where you need the powerful frontier models, but the open source models are really catching up. So if you're hosting your own LLM, you're not falling behind; you're actually getting more benefits in terms of saving on inference cost. As we were saying, you get more customization, and you get portability: you can take an open source model, deploy it on-prem on a Kubernetes cluster, and have the same thing on AKS, on Azure, and on other clouds. So it gives you multi-cloud options, some of them are multimodal, and you get data privacy and governance. So open source models really fit the scenario of self-hosting large language models.
SPEAKER_01:Yeah, at the moment I'm trying the Claude models in Copilot, and they're quite amazing compared to GPT-4 and 5, like Claude Sonnet and those.
SPEAKER_00:Yes, and definitely try Opus 4.5 from Anthropic.
SPEAKER_01:Yeah, that's the one I'm using. Yeah, it's very specific.
SPEAKER_00:Yeah, yes, it's amazing. And as you know, the models are only gonna get better. So using open source models for the right scenarios can give you that speed and flexibility where you can innovate faster. If a new model comes out, you can test it very fast on your cluster and deploy it if you have a pipeline. So this gives you a lot of options if the scenario is right.
SPEAKER_01:Yeah, but I guess the only concern is that when a new model is released, you also have to test it first before it goes to production clusters as well.
SPEAKER_00:Definitely, yeah. And that's why lots of customers now are building their AI or LLM pipelines. You see terms like LLMOps, you see terms like MLOps, and just like CI/CD and DevOps in the cloud, where if you introduce a change you have to test it, do unit testing, deploy to a dev environment, and so on, in the LLM world it's somewhat similar. Of course, instead of unit testing you need to do evaluations, you need a human in the loop, and you need to do a lot of other stuff. So it's definitely a different skill, but the same concept is there. Okay, so given what we said, it seems like self-hosting is really great. But before we get too excited, there are some challenges with hosting LLMs on Kubernetes and some considerations you need to think about. First, of course, GPUs. As you know, GPUs are very expensive. Once you have a GPU, you need to install the right drivers, plugins, and operators, and there is software for CUDA, if you're familiar with it. Managing all of that and maintaining the lifecycle and versions of that software is a very big challenge; it's not straightforward. Another big challenge is efficient utilization of the GPUs. As they are very expensive, you don't want them sitting in your data center at 5% utilization; that's a very bad waste of money. So how to use the GPUs efficiently and make sure they're always working, always handling inference or training requests, is a very important thing to think about. Then there is how to schedule and scale the AI workloads and how those workloads use the GPU. There are techniques where you can slice a physical GPU into logical, isolated GPUs, so there are a lot of approaches to consider, and depending on the GPU models and the capacity you have, you have to think about things like time slicing and multi-instance GPU profiles, etc.
SPEAKER_01:I still think getting someone like Microsoft, using AKS, or like EKS on AWS for example, is a lot cheaper than doing it all yourself, isn't it? In terms of cost, all those recommendations you have to consider as well, like the GPUs.
SPEAKER_00:Yes, indeed, and that's like a big turnoff when you have to think about all of that stuff. But the good thing is, for example on Azure with AKS, you don't have to worry about most of it. On Azure it's very easy to have all of the bootstrapping of the NVIDIA GPU software done for you and have the whole lifecycle managed; it's very easy to do the GPU slicing I mentioned and do the proper scheduling. So there are features I'm gonna talk about that make this easier. If you're weighing hosting on your on-prem Kubernetes cluster versus AKS, AKS is gonna make your life much, much easier and abstract all of this complexity. Another big challenge, of course, is that the models are not small files. If you go to Hugging Face, or if you use Foundry Local or Ollama and try to download a model locally, the smallest ones are gonna be two or three gigs at minimum, not a few megabytes. Quite a lot.
SPEAKER_01:Yeah, that's why it takes a while to download, a few minutes.
SPEAKER_00:Exactly, exactly. And the very powerful open source ones can be 10 gigs or even more. So on your cluster, how do you download them, how do you store them, how do you manage their lifecycle? Again, this is another challenge that we're gonna talk about. Then there is selecting the right inference engine. When working with large language models there are usually two types of activities: training, where you're still building an LLM, and inferencing, where you're actually sending requests or messages to the LLM so that it can generate the most probable tokens and give you back the magic that it does. In the world of Kubernetes there has been an explosion in the inference engines available. Today we're gonna talk about just one, called vLLM; there are many other inference engines out there, but today just vLLM. Those are the engines you can tweak to use the GPU efficiently by adjusting their options, and I'm gonna show a very simple example of how to do that. Observability, of course, is important. If you deploy on Microsoft Foundry you already have observability using App Insights; you see things like the number of tokens and the time to first token, a lot of metrics, with just a couple of clicks. In Kubernetes you also need to think about how you're actually gonna do that: how you're gonna monitor the performance of your GPU as hardware, and how you're gonna look at the performance of the inference engine, how much time it takes to generate the first token, how many requests you get, and so on. So again, this is something you need to think about as well. Finally, the AI gateway piece. What I mean by AI gateway is the first layer where your users send their requests. The AI gateway usually implements things like rate limiting, counting tokens, maybe blocking a user if they exceed their number of tokens, similar to when you use Microsoft Foundry and get a 429 response. So which implementation do you use for that?
SPEAKER_01:I think at the moment they lean towards APIM, correct? Like the APIs for AI?
SPEAKER_00:Correct, exactly. And at Ignite we announced that APIM is actually gonna be integrated into Microsoft Foundry, so you can even deploy it from there to make things easier. So finding a similar open source solution that you can deploy on a Kubernetes cluster, and configuring it, is also another challenge you need to think about. But again, if you're using AKS, you can of course use the capabilities that you need.
Okay, so let's look at a diagram I've seen that really summarizes what a workload connecting to an LLM hosted on Kubernetes would look like. You would, for example, find an ingress controller, usually a managed gateway or a similar open source alternative. Then on your worker nodes you have an engine like vLLM or TGI, and NVIDIA has its own as well. You need storage for the models you're gonna download, you're gonna need observability, usually using Prometheus and Grafana, and if you're doing RAG you have to have a vector database deployed, and so on. So as you can see, it's not a very simple architecture. As I mentioned, AKS makes things much easier, but if you're doing vanilla Kubernetes you have to think about all of this and how to stitch things together using open source software. Now let's transition to Azure and see what we're gonna do today. In the next steps and the demo, I'm gonna show at a high level how to use AKS, and why to use AKS, to host an LLM. In Azure there are mainly three ways to host or consume a large language model: Foundry, which we're not gonna talk about today, and then Azure Kubernetes Service and Azure Container Apps. We're not gonna talk about Container Apps either, but both are platforms where you can host an LLM; both have their scenarios, but for today let's just focus on AKS.
SPEAKER_01:Okay, yeah, because if you want to pull a model from storage, you want to cache it somewhere, right? Otherwise you'd be pulling it again every time, and that would still take time.
SPEAKER_00:Exactly, yeah. And Container Apps really has its scenarios. For example, if it's not a very complex LLM workload, you actually get the benefit of scaling to zero, and you get billed by the second that you're actually using the GPU, so you get something like serverless GPU. It has some very good benefits, but again, it depends on your scenario and what you're trying to build.
unknown:Okay.
SPEAKER_00:So let's talk about AKS and the main AI capabilities it provides. I talked about the complexity of managing GPUs. With AKS there is a feature, I think still in preview, called managed GPU node pools, where you can configure your node pool as a managed GPU pool, and once you do that, AKS provisions a VM with a GPU, deploys the driver, the GPU operator, the plugins, absolutely everything, and manages its lifecycle. So you don't need to worry about that; it's a very, very cool capability. You also have simplified GPU operations, and by GPU operations I mean efficient utilization of the GPU. I mentioned one of the most famous techniques, called multi-instance GPU, which lets you take one physical GPU and divide it into multiple logical GPUs, as if you had four GPUs when you only have one, and those instances are truly isolated. Each instance has its own memory, its own caching, everything, so they're isolated. That gives you the advantage of not having a noisy neighbour: if one app is very chatty with the LLM and another is not, you don't get into situations where one consumes the whole GPU. AKS makes those kinds of operations really simple.
SPEAKER_01:Okay, so it will be balancing the GPU workload, like when you pull a model it uses one of them?
SPEAKER_00:Yes. You would say, for example, you have one NVIDIA card in your cluster, and you can decide to divide it into four, and each application uses just one of those logical GPUs. It has its own memory, there is no communication between two logical instances, so it gives you peace of mind. Then there is advanced scheduling and placement: AKS provides AI-aware scheduling of the workloads, and it also has a capability to auto-provision GPU nodes. For example, if you're deploying a very resource-intensive AI workload, AKS can actually go and provision a new node or multiple nodes with GPUs, with the driver and everything, so that your workload can keep getting scheduled and getting the resources it needs. That's also a very cool capability. And of course inference traffic management: how to do rate limiting, token-based routing, and so on depending on usage. There are lots of implementations that AKS supports for that kind of smart routing between your different nodes. Before we dive into the demo, there is one more thing I want to talk about that I'm gonna use in my demo, which is the Kubernetes AI Toolchain Operator, Kaito for short. This is an open source project; it's not specifically for AKS, it's for Kubernetes, but there is very tight integration with AKS, and AKS makes it very easy to deploy. What Kaito does is automate the whole AI workflow: downloading the models, managing their lifecycle, deploying the models with the inference engine, and helping you with monitoring. They even added RAG recently, fine-tuning, and more. So it really helps you streamline the software piece of the LLM. We talked about the hardware with AKS; Kaito lets you manage the models in a very easy way, and it even allows you to do that GPU node auto-provisioning. I'm gonna show you an example of how this looks, but to give you some of the highlights:
One highlight is that Kaito treats the models as containers. Remember when we were talking about the big files you get on Hugging Face: if you go to something like Llama, you're gonna find 50 files or so. Kaito lets you containerize those models into container images, so it's very easy to manage their lifecycle; instead of plain files, they're converted into container images. It also has built-in templates called presets. For example, it has a preset for DeepSeek, which allows you to very easily deploy DeepSeek, and it configures it using the best configuration based on your cluster, so you don't have to tweak the GPU utilization or the swap memory or whatever; it does that for you for a specific set of models. Also, as I mentioned, it uses vLLM, but you can switch to a different inference engine like the Hugging Face one. So if you want to go deeper into configuring the inference engine, you can switch between the two. We also talked about auto-provisioning: it can auto-provision GPU nodes, and I'm gonna show that.
SPEAKER_01:Is the inference engine like the model, like Llama from Hugging Face?
SPEAKER_00:So the inference engine is what allows you to send requests and get responses back. Kaito containerizes the model and then serves it using the inference engine, and in the demo I'll show you how this looks in the actual pod so that you understand how it works. Okay, so the final thing before we go into the demo: the cool thing about Kaito is that it makes the LLM a cloud native application. It treats Kubernetes not just as a compute or GPU provider; it makes your large language model a cloud native app. This, for example, is how I would deploy the Falcon 7-billion-parameter model. You see it's a YAML manifest: I just specified the model and what type of instance I want it deployed on, and that's it. Once you apply this manifest, Kaito goes and downloads the model, containerizes it, provisions the GPU with the drivers and everything deployed, and then your model is ready to be served. So this is how easy it is to use, and that's why I wanted to use Kaito. Of course, we could just deploy vLLM and do everything ourselves, but I wanted to show the easy path for folks who want to get started. Okay, I think enough slides; let's jump to the demo and see how all of this looks.
SPEAKER_01:But did you say Kaito is integrated with AKS, or is it about to get integrated?
SPEAKER_00:It is fully integrated with AKS, and as you're deploying a new cluster you can actually say deploy Kaito with it. It's already in the Azure CLI and PowerShell, so you can deploy it that way. Okay, so here, let me just zoom in.
SPEAKER_01:You may need to share your other screen if you're on the other screen; you stopped sharing.
SPEAKER_00:Sorry for that, let me share my screen again. Okay, I hope you can see it now. Okay, perfect. So as you can see here, I have an AKS cluster with one GPU node; this is the node we're gonna deploy everything on. And if I describe this node, you'll find that I applied a specific label to it. If I go a little bit up, you can see I added this label to the node so that when I do the Kaito deployment and select a model, I can say target any node that has that label; that way I'm very specific.
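For readers following along, the kind of Kaito Workspace manifest described above, the Falcon 7B example, looks roughly like the sketch below. This is modeled on the public Kaito workspace samples rather than the exact file shown in the demo; the apiVersion, GPU VM size, and preset name are assumptions that may differ by Kaito release.

```yaml
# Minimal sketch of a Kaito Workspace, modeled on the public Kaito examples.
# The apiVersion, instance type, and preset name are assumptions; check the
# Kaito docs for the release you are running.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC24ads_A100_v4"   # GPU VM size Kaito can auto-provision
  labelSelector:
    matchLabels:
      apps: falcon-7b                        # matches the label applied to the GPU node(s)
inference:
  preset:
    name: falcon-7b                          # preset = containerized model + tuned vLLM defaults
```

Applying it with `kubectl apply -f` is essentially the whole workflow: Kaito pulls the containerized model, selects or provisions a GPU node, and exposes the model behind a service.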
The first thing we talked about is the GPU software. So if I look at what got deployed into my AKS cluster in terms of the GPU, you'll find that AKS helped me deploy a couple of pods related specifically to NVIDIA. I have, for example, the node feature discovery pod. This allows the GPU to be advertised to my pods, so when I deploy a pod that needs a GPU, it can discover those nodes and get scheduled onto them.
SPEAKER_01:Can you be specific about which node to target?
SPEAKER_00:Yeah, you can be specific. You can say any GPU node, or with Kaito you can say my preferred nodes are this and this, so you can be very, very specific. You also have a bunch of other software, like the CUDA software, something called the device plugin, and something called the DCGM exporter, which exports the metrics of the GPUs; we're gonna see a dashboard that shows the performance of our GPUs, like the temperature, the memory, the GPU utilization, and so on. AKS helped me deploy all of those, so as I said, it makes things very, very easy. So before we deploy a model, let me show you something.
SPEAKER_01:You get those pods from installing a GPU operator, right?
SPEAKER_00:Correct, yes. And AKS can automate the GPU operator: you can either install it manually, or if you use the managed node pools, it installs it for you.
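To make the device-plugin piece concrete, this is a minimal sketch of how any pod asks for one of those GPUs once the NVIDIA stack is in place. The `nvidia.com/gpu` resource name is what the NVIDIA device plugin advertises; the node label below is hypothetical, standing in for whatever label was applied to the GPU node in the demo.

```yaml
# Sketch: a pod that requests one GPU and pins itself to a labeled GPU node.
# The label key/value is hypothetical; nvidia.com/gpu is the resource exposed
# by the NVIDIA device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  nodeSelector:
    apps: llm-inference            # hypothetical label on the GPU node
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]        # prints the GPU visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1          # one physical GPU or one MIG slice
```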
SPEAKER_03:Okay, okay.
SPEAKER_00:So this is the YAML file for Kaito, where I want to deploy the Phi-4-mini model, the Microsoft model. As you can see here, I specifically said this is my preferred node, so I want it deployed on this specific node. And this is a preset: when I deploy this, Kaito finds the best configuration for my engine to serve this model. I don't have to be an expert; I just use this and it does the best configuration for this specific model. In most situations you would need to configure more, so here is another example where I deploy the GPT-OSS model, and as you can see, the YAML is much bigger because I'm doing more here: I'm really configuring the engine. For example, I'm setting the swap space to four, I'm setting the maximum GPU memory utilization to 0.85, I'm specifying the port where I want to serve the model, how many GPU instances, how much memory it should use. So I can be very specific and customize my model deployment however I want. But I didn't want to go much deeper into that, so I'm just gonna do the Phi model to keep things easy. Okay, let's get back to our cluster, and let me show the Kaito pods I have. Kaito has a concept of workspaces: it creates a workspace for my model and then deploys my model there.
SPEAKER_01:The workspace would be like storage to store the model?
SPEAKER_00:Yeah, you can think about it like that. It stores the model and it has the inferencing engine that serves the LLM in that case. So if I run, for example, kubectl get workspace (another typo, I cannot type today), I just have one workspace, which has Phi-4-mini, but you can deploy multiple models. So let's see what's in the model; you asked how this would look. If I look at the logs of this workspace, you'll find the API that serves the model using vLLM. You can see all of the endpoints exposed by vLLM; if you're familiar with the OpenAI API, you can see, for example, the chat completions endpoint, the embeddings endpoint, and tokenize, which tokenizes my prompt into tokens. So this is the server that serves my large language model, and I'll show you how things look once I send it a prompt. So we have the model deployed; let's try to consume it. What I'll do is deploy a simple chat application. First I need to get the endpoint for my workspace, my model, and this is the IP address. So I'm gonna deploy a very simple container, and as you can see here, I'm providing the IP and the port of my Phi-4 model. This is an open source chat application that lets you chat with any model that you host locally or in the cloud.
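The more customized deployment described above, where swap space, GPU memory utilization, and the serving port are tuned, might look something like the sketch below. It follows the ConfigMap-based vLLM configuration documented for Kaito, but the exact field names, preset name, and node name are assumptions to verify against the Kaito version you run.

```yaml
# Sketch of a customized Kaito deployment: a ConfigMap carrying vLLM runtime
# options, referenced from the Workspace. Field names follow the Kaito docs
# but should be treated as assumptions; the preset and node names are made up.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-vllm-params
data:
  inference_config.yaml: |
    vllm:
      gpu-memory-utilization: 0.85   # cap how much GPU memory vLLM claims
      swap-space: 4                  # CPU swap space (GiB) per GPU
---
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-custom
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  preferredNodes:
  - aks-gpunp-12345678-vmss000000    # hypothetical node name to pin the deployment
  labelSelector:
    matchLabels:
      apps: llm-inference
inference:
  preset:
    name: phi-4-mini-instruct        # preset name is an assumption; check the Kaito model catalog
  config: custom-vllm-params         # point the workspace at the tuning ConfigMap
```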
SPEAKER_01:But then that's pulling the image and running it in Docker, right? So you'd need it on your machine. I guess you could do it through ACR or something.
SPEAKER_00:Yeah, you can do it through ACR. It's a Docker image, so you can deploy it wherever you want. And as you can see, it immediately detects that this endpoint has this model, which is Phi-4; remember, I just put the IP address into the Docker container. So I can ask, why host LLMs on Kubernetes? Before I send that, let me actually show you how the endpoint looks. If I send this, and you look on the right, you can see there is one request that got sent to the LLM, to the engine, and this is the GPU cache and the CPU cache. So the request has definitely been sent to my LLM hosted on the Kubernetes cluster. That's just from the interface, of course; normally you'd have an application consuming that large language model. So I'll also show you how to do this, for example, using Python. Let me zoom in a little and get the IP of my model. This is a very, very simple Python application, and if you're familiar with the OpenAI SDK, you'll see it's exactly the OpenAI SDK: I'm using the OpenAI library in Python. This is one of the features of vLLM: it has an OpenAI-compatible API. So if all of your applications are using OpenAI directly today, you can just switch the endpoint and you don't have to change any of your code.
SPEAKER_01:So the endpoint will be the front-end IP, right? The external IP address.
SPEAKER_00:Correct, and that's what I did: I just put the IP here. And as you can see, the API key is empty, because I hosted this myself and haven't configured an API key. I could, but here I don't have one. Let's run this.
SPEAKER_01:I guess we could just use that to get all the logs, right? Like get all the logs and all the pods from an AKS cluster and stuff, which would be good for troubleshooting some issues.
SPEAKER_00:Yeah, exactly. Like I showed with the vLLM logs, at the end of the day it's a pod, so you can go into the pod and see its logs, and if you have Container Insights enabled, you can use that too. As I showed in the slide, it makes the LLM a cloud native application.
SPEAKER_01:So how do you track the token usage from it, then? I saw some of the pods have a tokenizer there already.
SPEAKER_00:Yes, I'll show you that in just a second. So, we got a response. The other thing I wanted to show is that if you're gonna build agents, you can do tool calling. Tool calling is a capability that allows your AI model to call a function. Obviously MCP is now the big thing, so you can do tool calling and you can call an MCP server, whether it's hosted on your Kubernetes cluster or even a remote MCP server. In this demo, I'm gonna ask it to get the weather for a specific location, and it will show that it would call this get_weather function. Of course, I'm not executing it; I'm just demonstrating that even when self-hosting you can do tool calling. Here you can see that it successfully understood it needs to call this function, and those are the parameters it passed. I asked it about the weather in Cairo, so it says location Cairo and the unit is Celsius.
SPEAKER_01:So it would be good if you create a self-hosted MCP server and just call that server, because you don't want to make external calls, for security purposes.
SPEAKER_00:Absolutely. And now in Azure you can host your MCP server on Kubernetes, you can use Azure Functions, you can even use Logic Apps now, I think, and if you have APIs behind APIM, you can expose them as MCP. So you have a lot of options to do that. Okay, and you asked me about tokens; quite a lot of people would probably ask about that.
SPEAKER_01:Yeah, they'd want to see how much it would cost and track the usage.
SPEAKER_00:Sure. So as part of my deployment, I deployed Grafana and Prometheus. Prometheus collects the metrics and all of the insights from Kaito, from vLLM specifically, and from the GPU (remember the DCGM exporter pod), and that lets you visualize all of it in Grafana. I've already imported a couple of dashboards. The first one looks at GPU performance: I can see the temperature of my GPU, the power usage, the utilization, and a bunch of other metrics. This is just one dashboard; there are many more available in the Grafana catalog, and you can even build your own. You just need to understand which metrics are exposed by the GPU, and then you can visualize them here. In terms of vLLM, in terms of model inferencing, I've also deployed another open source dashboard that lets you track the performance of your inference engine. Here, for example, I can see the token throughput, which is basically tokens per second, the latency of my prompts, the time it takes to generate a token, and the time to first token. Usually the time to first token takes most of the time, and then all of the following tokens come much faster, so that's a very important metric to measure. You can also see the finish reason for your LLM: some responses stop because you specify a max token limit, some stop because the model thinks it has reached the right answer, or because the user stopped the conversation, and so on. So this is just one dashboard; again, there are many more, and you can select the different models that you deploy. Observability is crucial, just to make sure you're getting value out of the investment you've made in the GPUs, so having the right dashboards and collecting the right metrics is very important here. So yeah, I think that's mainly it. I didn't want to go much deeper; I just wanted to give a high-level idea of the concept of hosting LLMs and how you can do a quick POC and understand the different moving parts, from preparing your cluster, to configuring Kaito, downloading the models, observability, and so on. So this is how it looks at a very high level.
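For the Prometheus side of that setup, wiring GPU metrics into dashboards typically comes down to telling Prometheus to scrape the DCGM exporter. The sketch below uses the Prometheus Operator's ServiceMonitor resource; the service labels, port name, and `release` selector are assumptions that depend on how your DCGM exporter and Prometheus stack were installed.

```yaml
# Sketch: scrape the NVIDIA DCGM exporter with the Prometheus Operator.
# The selector labels, port name, and release label are assumptions and must
# match how the exporter and Prometheus were deployed in your cluster.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus            # must match the Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: dcgm-exporter           # label on the DCGM exporter service (assumption)
  endpoints:
  - port: metrics                  # named port exposing GPU metrics (assumption)
    interval: 30s
```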
SPEAKER_01:Yeah, that's quite amazing. I never thought you could do it with Kaito, so that's a good idea; it's good to know about that. Are there other ways to do it, or can you only do it through Kaito?
SPEAKER_00:No, definitely, there are harder ways. Actually, I think I might have an example here. So you can do it without Kaito: basically, what you need to do is deploy vLLM yourself. Here, for example, I have a manifest that deploys vLLM, and you can see it's very similar YAML; the only difference is that I'm deploying vLLM natively, I'm not using Kaito. Kaito abstracts that complexity for me, so I don't have to worry about vLLM; it does this for me. But if I want to do it without any abstraction layer, I can just deploy this and it will deploy vLLM. In that scenario, I need to download the model myself, for example from Hugging Face, so I need to provide the API key, it downloads the model, and I need to specify a volume so that the model is persistent, and so on. There is more to do, but you can definitely do it.
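As a rough illustration of that "without Kaito" path, a bare vLLM deployment might look like the sketch below. The vLLM server flags shown are standard ones, but the image tag, Hugging Face model id, secret name, and PVC name are illustrative assumptions rather than what was shown in the demo.

```yaml
# Sketch of running the vLLM OpenAI-compatible server directly, without Kaito.
# Image tag, Hugging Face model id, secret name, and PVC name are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi4-mini
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-phi4-mini
  template:
    metadata:
      labels:
        app: vllm-phi4-mini
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=microsoft/Phi-4-mini-instruct    # downloaded from Hugging Face at startup
        - --gpu-memory-utilization=0.85
        - --swap-space=4
        - --port=8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN             # token for gated/private models
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface      # persist downloaded weights across restarts
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc               # PVC created separately
```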
SPEAKER_01:Yeah, I guess using Kaito is the most simplified way, because when you do this deployment yourself it takes more effort and resources.
SPEAKER_00:You'll have to tweak it, so you'll have to specify the type of resources it will consume. Kaito makes it very seamless; as you've seen, it's just that small manifest if I want to deploy Phi, so it's much, much easier. But again, if you need more customization, you can still use Kaito without the presets; you'll just have to provide the tuning you need yourself. So both options are available.
SPEAKER_01:Brilliant. Yeah, I will have a look at that; it looks amazing. What are some of the use cases you would need it for, like AI with AKS? Would it just be troubleshooting, like looking at logs?
SPEAKER_00:You mean for Kaito specifically?
SPEAKER_01:Yeah, if you were to use Kaito, what are some practical use cases we could use it for?
SPEAKER_00:Yeah, so the keyword here is having LLMs as a cloud native application. As you were saying earlier, you need to create pipelines and make sure you have a very solid LLMOps or DevOps pipeline, whatever you want to call it. Kaito makes this easier because the model is containerized and everything is described in YAML files; you can even have it automatically provision nodes. So it allows you to have a cloud native implementation much more easily: everything can be on GitHub, you can use GitOps on Kubernetes so that everything is just a YAML file, and you don't have to do things manually; you provision everything from the YAML files and it all gets deployed into your Kubernetes clusters. It also helps when you have a fleet of AKS or Kubernetes clusters: think about how hard it would be to tweak vLLM individually on clusters in Europe versus clusters in the US, with different GPU cards in them. Having it in a YAML file, with the complexity abstracted away, can make your life much, much easier at a larger scale.
SPEAKER_01:Yeah, I like it, it looks very awesome. So, Seif, as this episode comes to an end, is there any last-minute word you want to share with people about using Kaito and LLMs on AKS?
SPEAKER_00:I would definitely say that the Ignite sessions are still available to watch. There are a couple of sessions that talk specifically about Kaito, show very cool demos, and even have customers explaining how they use Kaito in real scenarios. So definitely go watch those sessions before they're removed. And it's very, very easy to just test it: after you watch the demos, go to the Kaito documentation, spin up a Kubernetes cluster, provision one with a GPU if you can, and just try it. I'm sure you'll be amazed at how easy it is.
SPEAKER_01:That's brilliant. So, as always, we love to get to know our guests. Apart from learning about LLMs on AKS, what do you normally do? Do you have any hobbies? What do you do in your spare time?
SPEAKER_00:Yeah, I usually play a lot of sports, mainly football, or soccer, depending on where you're watching from. I'm a big soccer fan and I also play a lot, so I'm usually either working or playing soccer.
SPEAKER_01:Like you join a football club and play after work?
SPEAKER_00:Yeah. And when I was young I even played professionally for around eight years here in Egypt, but then I stopped and started playing just casually.
SPEAKER_01:Okay, that's cool. You can play online as well, soccer games?
SPEAKER_00:Yeah, I prefer the real thing, but I do play a lot of FIFA too.
SPEAKER_01:Yeah, it's good. So, as always, thanks for joining this episode, Seif, and giving some of your time. Hopefully everyone gets to learn more about how easily you can spin up Kaito, put an LLM on it, and track your usage, or whatever you want to do; you just deploy the use case that adds value to your business. One other thing, Seif: are you going to any other events, any tech events lately?
SPEAKER_00:I think in the next year there are a couple of events. I have submitted a couple of sessions on this specific topic, so hopefully, if they get accepted, I'll go and talk about hosting LLMs on AKS there, but not before January.
SPEAKER_01:Hopefully, yeah. There are a few events in the Netherlands, in Amsterdam, which is quite close to Egypt, so it's not that far.
SPEAKER_00:Yeah, like a couple of hours.
SPEAKER_01:Yeah, which is quite simple.
SPEAKER_00:Definitely. And thanks for having me, it has been a great discussion.
SPEAKER_01:Yeah, okay, thanks a lot. Bye.
SPEAKER_00:Thank you.