AIP Podcast

The AIP Podcast, by AI Partnerships, a Railtown company, showcases the companies and leaders within the AI Partnerships network. Through conversations with founders, CEOs, and technology innovators, we explore real-world AI solutions, industry trends, implementation insights, and the business impact of artificial intelligence across industries.

AIP Podcast EP 65 - Prove ROI to Clients with RagMetrics' LLM Judge

April 08, 2025 • AI Partnerships Corp. • Episode 65

This episode's guest, Alon Bochman, CEO of RagMetrics, discusses how his decades spent working with AI teams in Azure and Google Ads brought him to start RagMetrics and his journey to help companies prove the ROI on LLMs. Today, RagMetrics helps LLM builders prove ROI and optimize performance through tailored, scalable, automated evaluations.

The AIP Podcast is hosted by the CEO and Founder of Supercharge Lab, Anne Cheng.

Full Video: https://youtu.be/oS7sbop7n84

Learn more by checking out the links below.

Follow AIP Affiliate, RagMetrics
Website: https://www.ragmetrics.com/
LinkedIn: https://www.linkedin.com/company/ragmetrics/

Follow AI Partnerships
Website: https://www.aipartnershipscorp.com/
LinkedIn: https://www.linkedin.com/company/aipartnershipscorp/
X: https://twitter.com/AIPartnerships

The AIP Podcast is hosted by Anne Cheng on behalf of AI Partnerships, a Railtown company.

SPEAKER_00 0:30

And we're not even keeping credit. And so the True and Google. Google DL and Zero and Google and Clinton Testing WhatsApp and something that was primarily. Testing on it in a priority simply. And then people we've come to the building of RagMetrics, but it's clean about it to solve the problem, helping more AI proofs of concept, or POCs, make it straight into the world of production. In a world where the graveyards of AI things to come line the roads of good intention. Alon, it is so good to have you on the show with us today.

SPEAKER_01 1:30

Thank you, Anne. It's my favorite thing to talk about.

SPEAKER_00 1:34

Let's dive right in, Alon. Tell us a little bit about your backstory and why you founded RagMetrics.

SPEAKER_01 1:40

Of course. As you mentioned, I've been doing AI for a long time, 20-plus years. My last role was head of AI for a company called FactSet. I joined them after Google. FactSet is a Fortune 500 company based in Connecticut. They compete with Bloomberg; they sell market data, financial data. And I had a pretty big department there, about 60 engineers, product managers, etc. We had 100-plus LLM use cases, which is a lot. Everybody was excited about AI. Everybody wanted a use case, and nobody was testing anything. It was really frustrating, scary, upsetting. Think of it like this: if you're a bridge-building engineer and you have to build a bridge, and you can't run any trucks on it, you just have to see what happens when the first person steps on. It's pretty scary, right? That's kind of what it felt like. We tried a lot of different things to get tests to happen. We tried to build a tool ourselves. We tried very hard to buy a tool. I even gave the team a blank check and a list of companies to shop from, and I realized that I wasn't alone. I called a few friends. I have some good friends who are doing pretty well in AI, and everybody kind of shrugged and said, "Evals are hard. What are you gonna do?" That was like waving a red flag in front of me. I'm like, I think I can do something. And that's what RagMetrics is. We're trying to make it a couple of orders of magnitude easier, with tools that are ready out of the box.

SPEAKER_00 3:38

I'll admit, in my past life, so many of us just dove right into developing AI applications without testing them. And that probably leads to a whole bunch of hallucinations. Tell us a little bit about how your solution works.

SPEAKER_01 3:56

Absolutely. I think in the first generation of chat applications, AI applications, we're just so excited that AI can keep the grammar straight that we don't expect anything else. And a lot of the early applications don't really ask much of AI. Maybe it's "write a poem where every line starts with an A," or maybe it's "count the R's in strawberry." It's basically parlor tricks, right? It's like watching a dog play piano. We're so excited that it can get a couple of notes out. But now it's time to actually enjoy some music. When you want to do something serious, something in the enterprise, it's not enough. It's not funny when AI mixes something up, sends the wrong message to a customer, misinterprets a policy. And we've all heard these horror stories. There was this lawyer who brought a case that was completely hallucinated and tried to argue from that case. That lawyer had a really bad day. There was a Chevy dealer somewhere, I think in the Midwest, that sold a Chevy Tahoe for a dollar. That's a $78,000 car. That's before the tariffs; that's just how much the car was. And it got sold for a dollar because the chatbot got hacked, basically, by a user, and then that deal got upheld by a court. So, like you said, graveyards. There are a lot of careers and applications that can really end prematurely when software's not tested.

So the way that our software works is, it's a one-stop shop. I don't think we have enough time to go through all the features, but I'll try to keep it at a very high level. You have one experience that you go through as you're developing, and another one when you're in production. When you're developing, you run experiments. Experiments are like A/B tests; they help you make data-driven decisions about every aspect of your code. And when you are in production, it's all about processing the incoming stream of conversations that the copilot is having with your users and evaluating them, both by humans and by an LLM judge. And the humans and the LLM judge make each other better by reviewing each other's scores. That's the shortest way that I can explain it.
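The idea of humans and an LLM judge reviewing each other's scores can be sketched roughly as follows. This is a minimal illustration, not RagMetrics' implementation: the `Scored` record, the pass/fail scoring, and the sample data are all hypothetical. It measures how often the two reviewers agree and surfaces the conversations where they disagree, which are the ones worth a second look.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    """One production conversation scored by both reviewers (hypothetical shape)."""
    conversation_id: str
    human_pass: bool   # verdict from a human reviewer
    judge_pass: bool   # verdict from the LLM judge


def agreement_rate(items):
    """Fraction of conversations where human and judge gave the same verdict."""
    if not items:
        raise ValueError("need at least one scored conversation")
    agree = sum(1 for s in items if s.human_pass == s.judge_pass)
    return agree / len(items)


def disagreements(items):
    """IDs of conversations where the verdicts differ, in input order."""
    return [s.conversation_id for s in items if s.human_pass != s.judge_pass]


# Illustrative data only; not taken from the episode.
sample = [
    Scored("c1", True, True),
    Scored("c2", False, True),
    Scored("c3", False, False),
    Scored("c4", True, True),
]

print(agreement_rate(sample))
print(disagreements(sample))
```

In a setup like this, the disagreement list would go back to a domain expert, and the resolved verdicts could then be used to refine the judge's instructions.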

SPEAKER_00 6:36

That's amazing. You just mentioned the judge. You've also mentioned to me in our pre-show that RagMetrics has three solutions: the library, the synthetic data generator, and the judge. But I'm sure everybody's just dying to ask you about the judge. Tell us more about how the judge judges.

SPEAKER_01 6:59

Of course. Everybody has this dream that we can offload our work to an automated little minion helper. And a lot of the work in testing AI applications is just reading what the model says and deciding if it's good or bad. And figuring out if it's good or bad is not personal, if we're talking about it in the sense of the enterprise, but it's very different from one business to another. If we're dealing with a copilot that is a compliance copilot, then the judgments we want are very different than if it's a personal injury law firm, for example, right? Even though the domain might be very similar, they have very different views and very different best practices. The secret sauce is different. So to make a judge work, it's really important to make the judge an expert in the domain and an expert in your business, so that the judgment is appropriate for you and your application. And that expertise is not available out of the box with ChatGPT or with Gemini or with any of those models. It can only be acquired as the application and the judge interact with your domain experts and your users. And so the judge learns just like we learn, from constant feedback and course corrections. Our goal is to make sure that the performance of the judge improves over time based on that feedback. So in the beginning, it might be no better than an intern, and over time it becomes the partner, the managing director.

SPEAKER_00 8:46

That's amazing. You've also mentioned that you've incorporated the use of agentic AI in the solution. The term agentic AI has become a buzzword in the world of AI recently. What do you think will be the next evolution of AI?

SPEAKER_01 9:02

Yeah, it's a really good question. I'm so bullish and excited about AI. Obviously, I'm biased. I dedicated my life to it. But there have been so many discoveries made, in my opinion, in the last three years that even if we do nothing else but scale what we already know, we're going to reach something that's massively consequential for our civilization, in the sense of changing societies, changing the nature of work, changing economics for the average person, massively for the better, in my opinion. And that assumes no further breakthroughs in fundamental science, just taking what we know and doing more of it. More compute, more inference-time compute, more interactions, more training, more reinforcement learning, all things that we know how to do, just doing more of them. So I'm pretty excited about just that. But beyond that, we're just at the tip of the iceberg. If you dig into the way these models are built, it's incredibly inefficient, incredibly redundant. If you just look at the linear algebra that goes into training a model, think of millions of calculations that are each 99% identical with each other. There's just a little tiny difference, and you keep changing that difference. So you're redoing a calculation that has, let's say, a thousand steps, and you change two of those steps and you redo the calculation over and over again, millions of times. That's kind of the way a lot of the training works and a lot of the inference works. And when you look at it, if you just take a couple of steps back, there must be ways to do that calculation that are millions of times more efficient, where you're just not redoing almost the same calculation over and over again. It's just that we don't know how to do that.

I mean, our brain knows how to do that. Our brain works with, what is it, like three watts of energy, and it can do all of these calculations at the size of, you know, this. And we are currently trying to match some of that capacity with massive, data-center-sized computers. But we know that it's possible to compress all of that, to use a small fraction of the energy, and to actually get a lot more done. We know because we have the machinery in our heads to do it; we just don't understand it yet. I think we will. So I'm really excited about breakthroughs applying what we know to some new domains: AI creating science, AI solving medical problems, drug discovery. I'm really excited about agents replacing labor, because I think it would mean that things become massively cheaper and the average person can afford to live like a king, because all the things that we can't afford today, because they're only done by very specialized people, would suddenly be available, plentiful, cheap, fast. So yeah, I'm excited about those things for starters.

SPEAKER_00 12:37

That's amazing. One further question, but we're quickly running out of time. Talk us through your solution and the success stories that you've enjoyed.

SPEAKER_01 12:50

Yeah, absolutely. Let me tell you one, since we're short on time. One of our earliest clients is a company that's building AI for accounting. It's a company called Tellen, T-E-L-L-E-N, Tellen AI. When we met Tellen, they had a couple of problems. They'd been working on their app for about six months or so. And the first problem was that their first design partner didn't want to go live, because they didn't trust that the app was actually good at accounting, at a particular, specific accounting task that they were working on. The conversation went along the lines of, Tellen told them, "Well, here's proof that it's good at accounting. Here's an example of an accounting question that it answers very well." And the customer said, "Well, how do I know if that's the only one? Maybe you cherry-picked it." And so they kind of went back and forth like that. We created for them a benchmark that measures domain expertise in the specific accounting task that they were building, and we measured them versus all three alternatives, and we showed with some quantitative rigor just how much better they were, and they were significantly better. Within a couple of weeks, their design partner went live, and within a couple of months, they were able to raise a three-million-dollar seed round. So numbers matter. Investors care about it, customers care about it, and if you put in the extra effort, people will notice.
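A benchmark comparison of the kind described here, where every system answers the same fixed question set instead of one hand-picked example, might look something like the sketch below. Everything in it is an assumption for illustration: the system names, the pass/fail grading, and the numbers are made up and are not results from Tellen or RagMetrics.

```python
def accuracy(graded):
    """Share of benchmark questions answered correctly (graded is a list of bools)."""
    if not graded:
        raise ValueError("benchmark has no graded answers")
    return sum(graded) / len(graded)


def rank_systems(results):
    """Return (name, accuracy) pairs, best first, for systems graded on the same questions."""
    scored = [(name, accuracy(graded)) for name, graded in results.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical graded answers on a shared five-question benchmark.
results = {
    "candidate_app": [True, True, True, False, True],
    "alternative_a": [True, False, True, False, False],
    "alternative_b": [False, True, True, False, True],
}

for name, score in rank_systems(results):
    print(f"{name}: {score:.0%}")
```

Because every system is graded on the same questions, the comparison addresses the cherry-picking objection directly. A real benchmark would need far more than five questions before the differences could be called significant.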

SPEAKER_00 14:26

That's amazing. Well, Alon, that's been a blast, but of course, that's all the time we seem to have for our show today. Very keen to track your progress through your development and growth. Thank you for giving AI companies such an easy-to-use tool to test their applications, a much-needed solution. Once again to our audience, this is Anne, the host of the AIP Podcast, and my guest is Alon Bochman of RagMetrics. Thank you for tuning in. And to all our listeners, remember to like, share, and follow this episode with your friends and your peers.

Anne Cheng

Host

Dorian Marquez

Producer