How custom evaluations deliver consistent results from LLM applications

Advances in large language models (LLMs) have lowered the barriers to creating machine learning applications. With simple instructions and a few prompting techniques, you can get an LLM to perform tasks that would otherwise require training custom machine learning models. This is especially useful for companies that don’t have their own machine learning talent and infrastructure, or for product managers and software engineers who want to create their own AI-powered products.

However, the ease of use of these models comes with tradeoffs. Without a systematic approach to tracking the performance of LLMs in their applications, enterprises can end up with mixed and unstable results.

Public benchmarks versus custom evaluations

The current popular way to evaluate LLMs is to measure their performance on common benchmarks such as MMLU, MATH and GPQA. AI labs often market their models’ performance on these benchmarks, and online leaderboards rank models according to their evaluation scores. But while these benchmarks measure models’ general capabilities on tasks such as question answering and reasoning, most enterprise applications need to measure performance on very specific tasks.

“Public evaluations are primarily a method for commodity model makers to market the relative merits of their models,” Ankur Goyal, co-founder and CEO of Braintrust, told VentureBeat. “But when an enterprise builds software with AI, all they care about is whether this AI system actually works or not. And there is actually nothing that you can transfer from a public benchmark to that use case.”

Instead of relying on public benchmarks, companies should create custom evaluations based on their own use cases. Evaluations typically involve presenting the model with a series of carefully crafted inputs or tasks, then measuring its outputs against predefined criteria or human-generated references. These evaluations can cover different aspects of the application, such as performance on its specific tasks.

The most common way to create an evaluation is to capture real user data and format it into test cases. Organizations can then use these evaluations to backtest their application and the changes they make to it.
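To make this concrete, here is a minimal sketch of turning captured production data into evaluation cases. The JSONL log format, the field names and the EvalCase structure are illustrative assumptions rather than a prescribed schema.

```python
# A minimal sketch of turning captured production logs into evaluation cases.
# The JSONL format and field names ("user_message", "final_response") are
# hypothetical; adapt them to whatever your application actually records.
import json
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    input: str              # the user request as it arrived in production
    expected: str           # a reference answer, e.g. a reply a human approved
    metadata: dict = field(default_factory=dict)   # useful for slicing results later

def load_cases_from_logs(path: str) -> list[EvalCase]:
    cases = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)   # one JSON object per logged interaction
            cases.append(EvalCase(
                input=record["user_message"],
                expected=record["final_response"],
                metadata={"ticket_id": record.get("ticket_id")},
            ))
    return cases

# The resulting cases can be replayed against any change to prompts, models or code.
```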

“With custom evaluations you do not test the model itself. You test your own code that perhaps uses the output of a model and further processes it,” says Goyal. “You’re testing your prompts, which is probably the most common thing that people tweak and try to refine and improve. And you test the settings and the way you use the models together.”

How to create custom evaluations

Image: The evaluation framework (source: Braintrust)

To make a good evaluation, every organization must invest in three key components. First, there is the data used to create the test examples for the application. This data can be handwritten examples created by company staff, synthetic data generated with models or automation tools, or data collected from end users, such as chat logs and tickets.

“Handwritten samples and end-user data are dramatically better than synthetic data,” Goyal said. “But if you can come up with tricks to generate synthetic data, it can be effective.”

The second component is the task itself. Unlike the generic tasks found in public benchmarks, the custom evaluations of business applications are part of a broader pipeline of software components. A task can consist of several steps, each with its own prompt engineering and model selection choices. Non-LLM components may also be involved. For example, an application might first classify an incoming request into one of several categories, then generate a response based on the category and content of the request, and finally make an API call to an external service to complete the request. It is important that the evaluation covers the entire pipeline.

“The most important thing is to structure your code so that you can invoke your task in your evaluations the same way you would in production,” says Goyal.
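As a rough illustration, the sketch below structures such a multi-step task as a single callable. The support-triage scenario is only an example, and `call_llm` and `create_ticket` are placeholder stubs for a model client and an external API, not real library calls.

```python
# A sketch of a multi-step task behind a single callable. `call_llm` and
# `create_ticket` are placeholder stubs for your model client and ticketing API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError   # wire up to your model provider

def create_ticket(category: str, body: str) -> str:
    raise NotImplementedError   # wire up to your external service

def handle_request(user_message: str) -> str:
    # Step 1: classify the incoming request (its own prompt and model choice).
    category = call_llm(
        f"Classify this request as billing, technical or other:\n{user_message}"
    )
    # Step 2: generate a response conditioned on the category and content.
    draft = call_llm(f"Write a reply to this {category} request:\n{user_message}")
    # Step 3: a non-LLM component completes the request.
    ticket_id = create_ticket(category, user_message)
    return f"{draft}\n(Ticket {ticket_id} created.)"

# Both production code and the evaluation harness call handle_request(), so the
# evaluation exercises the entire pipeline rather than a single prompt.
```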

The final component is the scoring function used to assess the results of your pipeline. There are two main types of scoring functions. Heuristics are rule-based functions that check well-defined criteria, such as comparing a numerical result against the ground truth. For more complex tasks such as text generation and summarization, you can use LLM-as-judge methods, which prompt a strong language model to evaluate the result. LLM-as-judge requires careful prompt engineering.
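A rough sketch of both scoring styles is shown below. `call_llm` is the same placeholder client as in the earlier sketch, and the judge prompt wording is illustrative rather than prescriptive.

```python
# Two styles of scoring function: a rule-based heuristic and an LLM-as-judge.
def call_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder model client

def heuristic_score(output: str, expected: str) -> float:
    # Rule-based check against ground truth, e.g. an exact match on a
    # well-defined answer such as a number or a category label.
    return 1.0 if output.strip() == expected.strip() else 0.0

def llm_judge_score(output: str, expected: str) -> float:
    # Ask a strong model to grade the output against a reference answer.
    verdict = call_llm(
        "You are grading a generated answer against a reference.\n"
        f"Reference answer:\n{expected}\n\n"
        f"Candidate answer:\n{output}\n\n"
        "Reply YES if the candidate is consistent with the reference and "
        "resolves the request, otherwise reply NO."
    )
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```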

“LLM-as-judge is difficult to get right and there are many misconceptions about it,” Goyal said. “But the key insight is that, as with mathematical problems, it is easier to validate whether the solution is correct than to solve the problem itself.”

The same rule applies to LLMs. It is much easier for an LLM to evaluate a result that has already been produced than to perform the original task itself. It just requires the right prompt.

“Usually the technical challenge is iterating on the wording of the prompt itself to make it work properly,” Goyal said.

Iterating with strong evaluations

The LLM landscape is rapidly evolving and providers are constantly releasing new models. Companies will want to upgrade or change their models as old ones become outdated and new ones become available. One of the key challenges is ensuring that your application remains consistent when the underlying model changes.

Once proper evaluations are in place, changing the underlying model becomes as simple as running the new model through the existing tests.

“When you have good evaluations, switching models feels so easy that it’s actually fun. And if you don’t have any evaluations, then it’s terrible. The only solution is to have evaluations,” Goyal said.
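With an evaluation set in place, backtesting a model swap can look like the loop sketched below. `run_task` stands in for the production task (assumed here to accept a model name), the model identifiers are placeholders, and the judge scorer is carried over from the earlier illustrative sketch.

```python
# A sketch of backtesting a model swap: run the same evaluation cases through
# the task once per candidate model. `run_task` is a stand-in for the production
# pipeline, assumed to be parameterized by model name.
def run_task(user_message: str, model: str) -> str:
    raise NotImplementedError   # the production pipeline, parameterized by model

def run_eval(cases, model: str) -> float:
    scores = [
        llm_judge_score(run_task(case.input, model=model), case.expected)
        for case in cases
    ]
    return sum(scores) / len(scores)

# Compare the incumbent and the candidate on identical cases before switching:
#   for model in ["current-model", "candidate-model"]:
#       print(model, run_eval(cases, model))
```

Running the same cases through each candidate gives a direct, like-for-like comparison before any switch is made.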

Another problem is the changing data that the model faces in the real world. As customer behavior changes, companies will need to update their evaluations. Goyal recommends implementing a system of “online scoring” that conducts continuous evaluations based on real customer data. This approach allows companies to automatically evaluate their model performance based on the most up-to-date data and incorporate new, relevant examples into their evaluation sets, ensuring the continued relevance and effectiveness of their LLM applications.
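One possible shape for such an online scoring loop is sketched below, assuming a sampled stream of logged interactions and a reference-free judge (since live traffic has no ground-truth answer). The field names, sampling rate and scoring threshold are all assumptions.

```python
# A sketch of continuous "online scoring" over sampled production traffic.
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder model client, as in earlier sketches

def judge_live_output(user_message: str, output: str) -> float:
    # Live traffic has no reference answer, so the judge grades the output
    # against the request alone; the prompt wording is illustrative.
    verdict = call_llm(
        f"User request:\n{user_message}\n\nAssistant reply:\n{output}\n\n"
        "Reply YES if the reply correctly and completely addresses the "
        "request, otherwise reply NO."
    )
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0

def online_scoring_pass(recent_logs, review_queue, sample_rate=0.05):
    for record in recent_logs:          # e.g. interactions logged in the last hour
        if random.random() > sample_rate:
            continue                    # score a sample of traffic, not every request
        score = judge_live_output(record["user_message"], record["model_output"])
        print(f"online_eval_score={score}")   # replace with your metrics/observability hook
        if score < 1.0:
            # Flag low-scoring, real-world cases for human review; once a correct
            # reference answer is written, they join the evaluation set.
            review_queue.append(record)
```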

As language models continue to reshape the software development landscape, adopting new habits and methodologies becomes critical. Implementing custom evaluations represents more than just a technical practice; it is a mindset shift towards rigorous, data-driven development in the age of AI. The ability to systematically evaluate and refine AI-based solutions will be a key differentiator for successful businesses.