Microsoft Community Insights Podcast
Welcome to the Microsoft Community Insights Podcast, where we explore the world of Microsoft technologies and interview experts in the field who share insights, stories, and experiences in the cloud.
If you would like to watch the video version, you can find it on YouTube below:
https://youtube.com/playlist?list=PLHohm6w4Gzi6KH8FqhIaUN-dbqAPT2wCX&si=BFaJa4LuAsPa2bfH
Hope you enjoy it
Episode 50 - Adopting AI Product Development with Kate Catlin
Choosing the right AI model shouldn’t feel like roulette. We sit down with Kate Catlin, a product manager at GitHub, who explains where each model’s strengths lie and how auto mode aims to pick the right model for the job so developers can focus on outcomes.
We dig into practical tactics that cut through hype: start with a golden dataset, run evaluations early, and keep refining with real user prompts once your AI is live. If you’re overwhelmed by weekly model releases, you’re not alone—Kate outlines how to compare new options with scoring and selective manual review.
She also tackles the enterprise challenge: slow model approvals that leave teams on outdated systems. With a disciplined eval pipeline, organisations can safely adopt newer, faster, and often cheaper variants that deliver better results.
Hiya everyone, welcome to the Microsoft Community Insights Podcast, where we share insights from community experts to stay up to date in Microsoft. In this podcast episode, we will dive into adopting AI product development with guest Kate Catlin. Could you please introduce yourself?
SPEAKER_00:Yeah, thanks so much. My name's Kate Catlin. I'm the product manager for Copilot Model Release. So which models make it into your Copilot model picker, which ones we deprecate, which ones we make your fallback model, and everything else in between.
unknown:Okay.
SPEAKER_03:So what's a model picker? Is it just choosing a different type of model within Copilot, like a range of models, and then you just pick according to the usage, or exactly?
SPEAKER_00:Yeah. So if you're using Copilot in VS Code or Visual Studio or any other IDE, or even in uh GitHub.com in our chat feature, um, you're able to select which model you actually want to work with. Um, and the curation of those models um happens through my team. Uh so we make sure that the minute, literally the minute that the model is released from the provider, it's also available in your Copilot model picker.
SPEAKER_03:Okay. So I'm curious to know, what's your day-to-day job at GitHub? What does it comprise, inside of picking models? Because it sounds like a cool job.
SPEAKER_00:It's so fun. Um, the first word that comes to mind is chaos. Um, but we just have the funnest job in the industry. So essentially we work really closely with all of the major model providers behind Copilot. So your, you know, OpenAIs and Anthropics and the Gemini team, et cetera. Um, and then we work with them on which models are coming up. And once they know of a model coming up, they usually give us early access to that model for us to evaluate it and start to test it. Um, and we run like seven different evals on these models. So we're testing how it works in VS Code, we're testing how it works in Copilot Coding Agent, we're testing how it works in other places. Um, and we report all of those results back to the model provider in the hopes that they can actually tweak the model training so that it works even better in Copilot once it finally releases. And then we'll work with them as well to launch it on the day that it launches. And so, you know, again, the minute that the major provider releases that model, you have access to it in Copilot. Um, and then we're looking at a lot of data around like who is using the models, how are they using the models, what are the results from these models? And based off of that, we'll choose which models to deprecate and take out of your model picker as well.
SPEAKER_03:Oh, okay. So you'll be doing lots of testing, like evaluations on the model, and then provide feedback to, like, Anthropic, the model provider?
SPEAKER_00:Lots of evaluations, lots of talking to our revenue team about how enterprises are perceiving models, and then lots of writing product docs about how we do things better. Like, how do we deprecate better such that our Copilot users aren't caught by surprise when we take a model away? What changes can we make in the UI to make it more obvious that it's happening? What changes can we make in our docs, et cetera? So I, you know, I think on any given day I have about 12 different tasks going on, um, about 40 different Slack channels that I'm jumping in and out of. It's quite the spread of activity.
SPEAKER_03:Oh, okay. So if, for example, a model provider gave you one model, and it's only good for maybe certain tasks, do you need to keep testing it to see how accurate it is?
SPEAKER_00:Yeah, we'll start to work with them and start a conversation first to make sure that we're prompting correctly and that we've got it configured correctly for us. Um, but then if it's only good for certain tasks, we have a decision to make. Um, and that decision is, do we want to launch this in the model picker for everyone? Or do we want to say, like, hey, this model is really only good for this specific feature and product, and only launch it in that specific feature and product? Um, so a lot of my job also is pinging all of the different feature teams and seeing how they feel about the model. Is this right for you? What are your evals saying about this specifically, et cetera? Um, and then on launch day, we know exactly where we should and should not launch the model.
SPEAKER_03:Okay. I know that currently there's lots of models on GitHub Copilot. So what are some of your favorite ones? Because for me, my favorite one is, I think, the Opus one, because sometimes it does a lot of thinking and it's quite fast to answer. So what are some of your best models currently?
SPEAKER_00:Those models are really smart. Um, I would say, I know this is a cop-out answer, but I work so closely with all of these model providers that I have learned to appreciate all of the unique strengths and weaknesses of each of them. Oh, okay. So on any given day, I'm probably using a model from every major model provider, depending on the part of my day. Like there's a different model I would use for a complete back-end overhaul than there is for something like, hey, I just want a UI tweak on my little side project, and a different model again for when I'm like, take these three to ten sources of information and make me a draft PowerPoint. Um, and so it's just so dependent on the use case. And I highly encourage everyone to experiment and play, like really embrace a mindset of play with these models, so that they can discover as well what works in each specific instance.
SPEAKER_03:Yeah, because I can't remember if there's a specific list of which model for which use case, because at the moment it's too much. You just have to test one, and then if you like it, test the rest.
SPEAKER_00:Yeah, and that's why we've actually launched something called auto mode in Copilot. Okay. Um that is basically going to choose the correct model for each use case for you. Um, we're working on it actively right now, uh, and we're trying to make it smarter and smarter so that um we really are fine-tuned on um exactly which model we choose for you.
SPEAKER_03:If we were to implement AI fully into production, you'd have to implement evaluation, maybe in the workflows, before it goes into production?
SPEAKER_00:That is a starting point, yeah. Before production for sure. But then even in production, I think the best teams who are building AI apps are doing it as like this full life cycle loop. So you launch the app based on the evals that you already did, and then you look through your history of like, well, what did users actually enter in? And then you see the results that they get as well. Um, and you're combing through that data. I heard of a team doing it every Friday. They get together and spend two hours combing through every single input and output. And then from those real users' inputs and outputs, you're gonna pull out really good examples of where the AI crushed it and did a great job, um, and also where it totally failed. And you're gonna use both of those to feed back into what you call your golden data set for evaluations. So that list of like 10 or 100 or a thousand user inputs that you started with for your evaluations, um, you're gonna add them to that list and make sure that there's an idealized output as well. Uh, and so then you're continuing to evaluate, you're continuing to test new models, you're continuing to test tweaks in the system prompt, on and on and on, until you have something really ideal. Um, and if you want to get super fancy, there are also teams that are adding like a thumbs up or a thumbs down into their production AI products. And so they're not even manually doing that. It's just like an automated feedback loop of like, oh, this answer was good, this answer was bad, and then you can tweak your user or system prompts based off of that to continue making the product better in a totally automated way.
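For listeners who want to try this at home, here is a minimal sketch of the golden-dataset loop described above, in Python. The GoldenExample shape, the generate callable, and the word-overlap scorer are illustrative assumptions, not GitHub's actual tooling; in practice you would plug in your own model client and a proper metric or an LLM-as-judge.

```python
# Minimal sketch of a golden-dataset evaluation loop (illustrative, not GitHub tooling).
from dataclasses import dataclass

@dataclass
class GoldenExample:
    prompt: str        # a real user input pulled from production logs
    ideal_output: str  # the idealized answer the team agreed on

def score(candidate: str, ideal: str) -> float:
    """Toy scorer: word overlap with the ideal answer.
    Swap in a real metric or an LLM-as-judge for anything serious."""
    cand, gold = set(candidate.lower().split()), set(ideal.lower().split())
    return len(cand & gold) / max(len(gold), 1)

def run_eval(golden_set: list[GoldenExample], generate) -> float:
    """Run every golden prompt through the model under test and average the scores.
    `generate` is any callable that takes a prompt string and returns the model's reply."""
    scores = [score(generate(ex.prompt), ex.ideal_output) for ex in golden_set]
    return sum(scores) / max(len(scores), 1)

# Each "Friday review" adds new wins and failures from production back into the
# golden set, and the eval re-runs before any model or system-prompt change ships.
```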
SPEAKER_03:Okay. You mean like having the emoji on a PR or something? Or do you have it in a GitHub workflow?
SPEAKER_00:Oh, uh this is kind of separate from a GitHub workflow. This is more so stuff people are tying together to create AI products out in the wild. There are some really, really cool tools available through Microsoft Foundry to do some of this. Um and so I would highly encourage folks to check out the Microsoft Foundry tooling to assist with uh creating AI-powered apps. Okay.
SPEAKER_03:So, in your experience from working at GitHub, where do you see teams mostly get stuck, like in startups, when implementing AI? Is it just the beginner phase, or do they have difficulty moving to production?
SPEAKER_00:Um, again, like using Copilot to write software, or like creating AI production apps?
SPEAKER_03:Okay, let me rephrase it. So, where do you see teams begin the AI journey? Whether it's prototyping or just lots of evaluations, like you mentioned?
SPEAKER_00:So, um, if a team is just starting out with building their first ever AI-powered product, um, I find a lot of people start with just being a bit lost. Um, there's a lot of overwhelm, there's a lot of confusion, and there's a lot of like, well, where do I even get started? And like, what's a good model to use? And my friends said to use this OpenAI model, but I don't know. Um, and what's RAG, et cetera. Um, so I think that is sort of the starting point. And then they typically throw something together, because, I mean, you can even ask Copilot at this point to create you an AI-powered app and it'll just wire it all up for you. Um, all you have to do is put in your API key for your GitHub Models API or your Microsoft Azure API or your OpenAI API, like whatever your API key happens to be. Um, and it can wire up that first draft. Um, and then once people have that first draft, they're like, I'm good, and they ship it out to the internet and they think it's fine. Um, and then over time they realize, like, oh my god, this AI isn't just a tool as part of my project. If I've deployed this AI to be something customer-facing, it is the product. It has to be good because it is the product. And that's when they come back and they start to say, like, wait, what was that evals thing that someone told me about at one point? And that's when they'll start to explore evaluations tooling. But it's typically not until something goes wrong. So we're trying to do a lot of education and be like, what if you brought evals sooner into the process, so that you were evaluating things before you shipped them to the internet? Uh, and I think that is hopefully where we're seeing the industry go at this point.
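As a rough illustration of that "first draft wired up with an API key" step, here is a minimal Python sketch assuming an OpenAI-compatible chat endpoint. The base URL, model id, and environment variable names are placeholders, not the exact values for GitHub Models, Azure, or OpenAI; check each provider's docs for the real ones.

```python
# Minimal "first draft" of an AI-powered app: one chat call behind an API key.
# Assumes an OpenAI-compatible endpoint; base_url and model id are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("MODEL_BASE_URL", "https://example-inference-endpoint/v1"),
    api_key=os.environ["MODEL_API_KEY"],  # your GitHub / Azure / OpenAI key
)

def ask(prompt: str) -> str:
    """Send a single user prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=os.environ.get("MODEL_NAME", "gpt-4.1-mini"),  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Summarise what this repository does in two sentences."))
```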
SPEAKER_03:Yeah. Because I think, for me, I often see people getting stuck in the PoC phase and having difficulty trusting AI in a production environment. Whether it's any AI, whether we create an application with Copilot or Foundry, they're trying to make sure it's a hundred percent, especially if you're working in the public sector or financial sector.
SPEAKER_01:Yeah.
SPEAKER_03:Because unless they know that it's a hundred percent trustworthy, they'll hold off. That's when you need a lot of evaluation, security boundaries, yeah, guardrails.
SPEAKER_00:Yeah. I've also seen some bigger teams start with only allowing the AI to have a certain format of output. Like, you know, maybe you have some kind of chat app that your sales team talks to when they're on a call, and it reminds them of what they said last time or something. And so the AI will only spit out information in a certain format, or you can only ask it certain questions or prompts. And that's another way to help people get more comfortable with a starting point, because, you know, there is risk with AI that it will go off the rails and return something that you weren't expecting. Um, and so, you know, you can build in formats and specific outputs to your AI if you would like to, and if it makes the team feel more safe.
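A rough sketch of that "only one output format" idea: the app asks the model for a small JSON object and refuses to display anything that doesn't match. The field names and the call_model stand-in are hypothetical; your model provider's structured-output features could enforce this more strictly.

```python
# Sketch of constraining an AI feature to one output shape before users ever see it.
import json

REQUIRED_FIELDS = {"customer", "last_call_summary", "suggested_next_step"}  # hypothetical schema

SYSTEM_PROMPT = (
    "Reply ONLY with a JSON object containing exactly these keys: "
    "customer, last_call_summary, suggested_next_step. No prose."
)

def safe_answer(call_model, user_prompt: str) -> dict:
    """Ask the model, then enforce the format before displaying anything.
    `call_model` is any callable taking (system, user) prompts and returning raw text."""
    raw = call_model(system=SYSTEM_PROMPT, user=user_prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("Model reply was not valid JSON; refusing to display it.")
    if set(data) != REQUIRED_FIELDS:
        raise ValueError(f"Unexpected keys in model reply: {set(data)}")
    return data
```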
SPEAKER_03:What kind of use cases do you often see for companies that are now adopting AI, or anything really, like Copilot?
SPEAKER_00:Um, I think the most important use case of any company adopting AI right now is using AI for actual code generation. So rather than, you know, creating an AI-powered app, it's like, how do we power our developers to work faster with AI? And that is blowing people away. Um, I think that I have seen, you know, of course, here at GitHub, where we're obsessed with AI, everyone is using AI, and everyone's work lives are different than they were even two years ago, which is incredible. Um, but even with my software developer friends across the industry, they're now saying, like, of course we're all using AI. Of course we're doing it. Um, and so I think the biggest adoption push a company can make is helping their software developers use AI better to build all of their products. And I can speak more about some of the hangups and problems with that if you're interested.
SPEAKER_03:Yeah, so that probably pushes into the next point: the challenges you see when companies go with AI, implementing AI.
SPEAKER_00:Yeah, you know, I think the biggest challenge with enterprises who want their software developers to use more AI really comes down to model access. Um, a lot of these large enterprises have very long review cycles to give their developers access to the newest models. And for these large institutions that are critically important to my day-to-day life, I absolutely do not want them getting hacked. And so I'm very glad that they have these extensive, responsible AI review processes. Let's not take that away entirely, but we do need them to speed it up a little bit, because the models that we have today are a different world than the models that we had six months ago, even. Even one month ago, they've improved. Like at this point, we're releasing a new model every single week. So, like, you know, if they keep getting better and better and you're still using a model from 2023, like, of course, you don't think it's going well. Um, and so they're not getting the results that they wanted to see from AI. And so I think the biggest blocker that we have with these large enterprises is, like, we need to get you using a more modern model. That's a funny thing to say fast, modern model. Um, but it's just a different world and it's a different level of results that they're gonna see.
SPEAKER_03:So you're saying that someone can't just keep using auto, the auto fallback model, because they won't know which one to use? So they'll just have to keep using the old one because it's cheaper, like cheaper tokens to consume, like the different Sonnet and Opus ones and stuff.
SPEAKER_00:Yeah, it really depends on the model that you choose. Like a lot of the more modern models, they'll release the base model, right? Like your, you know, GPT 5.2 or whatever we're on now. Uh, and then they'll also release like a mini version of it or a nano version of it, um, or like a low-thinking version of it, or, depending on which model brand you're talking about, they have different words and different ways that you can change the outputs. Uh, but there are still a lot of models that are smarter than they used to be and are still lower token than the older models. So I just highly encourage folks to get on the most recent model they possibly can. And I know the pushback is gonna be like, but Kate, I already built a product with, you know, a model from two years ago. Don't make me change it. But that's why you need evaluations, so that you can truly see that changing the model won't break your product or will even improve the results from your product.
SPEAKER_03:But I would think the challenge is keeping up to date on the latest models, because every week a new model releases, but you won't know which one to try, which one to use, because we don't know whether it's good for certain use cases and stuff.
SPEAKER_00:Yep, yep, that's the biggest problem. Um, and I talked to a startup at one point that, you know, every single week was running every single new model through their evaluations pipeline and saying, like, here's what we're using right now, here's the new model outputs, and then they're manually reviewing, in their case, which one is better. But you can also set up quantitative evaluations to just give you a score of like, oh, this model performed better objectively and quantitatively versus the model that I was using last week.
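Here is a minimal sketch of that weekly bake-off, assuming a run_eval(model_name) helper in the spirit of the golden-dataset scorer sketched earlier that returns one quality score per model. The threshold and the "only auto-switch on a clear win, otherwise keep manual review" rule are illustrative choices, not the startup's actual process.

```python
# Sketch of a weekly model review: score every new model against the incumbent.
def weekly_model_review(current_model: str, new_models: list[str],
                        run_eval, margin: float = 0.02) -> str:
    """Return the model to use next week. `run_eval(model)` scores a model
    against the golden dataset; switch only on a clear quantitative win."""
    if not new_models:
        return current_model

    baseline = run_eval(current_model)
    results = {model: run_eval(model) for model in new_models}
    best_model, best_score = max(results.items(), key=lambda kv: kv[1])

    print(f"current   {current_model}: {baseline:.3f}")
    for model, s in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"candidate {model}: {s:.3f}")

    # Close calls still go to manual review, as the startup above did.
    return best_model if best_score > baseline + margin else current_model
```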
SPEAKER_01:Okay, so you just say evals on this podcast.
SPEAKER_00:All I say is eval. All I say is evaluations.
SPEAKER_03:Yeah, because it's important to test it. Even for any AI development, you will still need a lot of evaluations.
SPEAKER_00:Yeah, exactly. And if it's to write your code, then trust GitHub Copilot and our auto mode. I think that's the way to go, if all you're using AI for is your code generation.
SPEAKER_03:Yeah, because I think it's similar to the Foundry model picker. It's very similar, the auto one, because it just chooses the right model for you, and then when a new model comes out, you can still interchange it, edit it.
SPEAKER_00:Exactly.
SPEAKER_03:Yeah. Okay, so is there any advice for new starters that want to implement AI or Copilot, or get started on the journey of developing AI solutions?
SPEAKER_00:Hmm. Um, yeah, I would say Foundry and Copilot are both really great places to start. Um, Foundry has a whole AI agent builder now that can help you out, and it is so powerful. Um, and it'll even write the prompt for you if you wanted to. You can enter in a sample prompt, and then there's like a special button that is like, make this prompt better. And you describe how you want it to be better, and it'll improve the prompt for you. So look into the Foundry AI toolkit for generating your first ever AI agents. Uh, and then on the GitHub side, trust Copilot. Um, so if you are developing software with Copilot, um, you're gonna want to trust our auto mode, trust our default models, um, because we have done a ton of testing on which models work best and which models tend to lead to the highest rates of code actually being saved. Um, and so the more you can use our defaults and our suggestions, the better luck you're gonna have with your actual results.
SPEAKER_03:Yeah. I remember one time I was using the coding agent, but you can't really tell which model it's actually using behind the scenes, like, up front. You can't see which model it's using.
SPEAKER_02:Oh, it's actually no, no, no, no.
SPEAKER_03:The GitHub one, you know, on a GitHub repo. You turn on the coding agent and you get to review the code, and once it does the code, you can't see which model it's actually using, whether it's the smartest newest model or whether it's just 4.1 or so.
SPEAKER_00:Oh yeah, okay. So you're talking about the workflow where you create an issue and you describe the whole issue. And then, for those who don't know, you can assign an entire GitHub issue now to Copilot to take care of from start to end in like a truly agentic development workflow. I love this feature. I use it all the time. I was just using it last weekend and sending feedback to the PM who owns it, um, who is absolutely brilliant. Um, and I happen to know that they've done really, really extensive testing on which model they use for that as well. And so trust, um, they're using the best.
SPEAKER_01:Yes, trust your process.
SPEAKER_00:It does do a fantastic job.
SPEAKER_03:But I guess if someone has lots of code, like in Python or C, to go through, or gets Copilot to do it, it may take a while, but we don't know whether it's actually using a thinking model or a different kind of model behind the scenes. So we just have to trust the process that's happening.
SPEAKER_00:I promise you, we we have entire teams devoted to uh just running evaluations on how the models work all day long.
SPEAKER_03:Okay. So as this episode is coming to an end, I would like to ask, what do you normally do in your spare time, Kate?
SPEAKER_00:Oh, um, at the moment I've been really having a lot of fun with like mini game development. So just making silly little games. Um, and I had one that, well, it trended on the front page of Hacker News, which is the most success I've had yet. Um, I was really excited about that.
SPEAKER_03:So you've been playing with making games with the GitHub SDK or with GitHub Copilot, yes?
SPEAKER_00:So, um, I am a code bootcamp grad, but these days I mostly just vibe code, um, and I flatter myself that I still understand every single line of code. But, you know, I trust the process. And I do a ton of vibe coding. So I have a lot of fun with that.
SPEAKER_03:Oh, yeah, that's good. Because I remember, for example, the term vibe coding, it could be like you create something and you implement AI with it. When I was having a discussion with someone, it was like, is vibe coding fit for production code? You wouldn't vibe code production, but I think there are different levels to it.
SPEAKER_00:Um, the engineers that I see operating at the top levels, and there are so many engineers I love, there are so many engineers I love at every level, but the engineers that I see operating at the very, very top now kind of have a workflow where they tend to have multiple different Git branches going at the same time, and they're kind of kicking the AI in each one of those branches towards different feature changes and tasks. Um, however, you are still responsible for being the tastemaker of your code, right? So you can call it vibe coding if, like, you just merge it without reviewing, but the best developers are not doing that. They're letting the AI do its thing, and then they're combing through the code and being like, okay, I like this, I like this, I like this. This was excessive. Get rid of that. Uh, and then they'll merge it. So they're still tripling or quadrupling the amount of work that they're able to do at once, but they're still kind of more responsible for having that systems thinking mindset and making sure that all of the code is good before they're sending it in and merging it. So I think it's gotten a bit of a bad name, because most people are not just merging without thinking about it.
SPEAKER_03:Yeah, because you still need to review, you still need a human in the loop, rather than an agent or Copilot doing everything straight through, like just raising a PR and merging the PR or pushing everything to production. So that's why I kind of asked the question about that.
SPEAKER_00:Yeah, I'm really glad you followed up on that. Like, even when you're using the magical powers of copilot coding agent and having it tackle an entire issue from start to finish, like you still are responsible for you know bringing that uh the PR that it creates down locally and like deploying the app and testing it locally and making sure everything still works.
SPEAKER_03:Yeah, so there will always be a human in the loop anyway. So no matter what kind of AI you're using or implementing, there will always be something, whether it's reviewing a PR or approving something, you'll still need that.
SPEAKER_00:I completely agree with that. You will always need a human um to review um all of this code.
SPEAKER_03:Okay, so thanks a lot for joining this episode, Kate. So one last thing is, are you going to any events?
SPEAKER_00:So, still determining my event schedule for 2026. So no exciting conferences to report yet, but I hope to get out there and meet a lot of our users and get some more feedback as the year goes on.
SPEAKER_01:Bye.
Podcasts we love
Check out these other fine podcasts recommended by us, not an algorithm.
The Azure Podcast
Cynthia Kreng, Kendall Roden, Cale Teeter, Evan Basalik, Russell Young and Sujit D'Mello
The Azure Security Podcast
Michael Howard, Sarah Young, Gladys Rodriguez and Mark Simos