Stop Eyeballing Whether Prompts Work: How To Test LLMs

Jacky
5 min read · Sep 13, 2023


The Problem

LLM applications are not the easiest applications to test. As my co-founder Jeffrey put it, changing prompts and improving RAG applications is a lot like “playing whack-a-mole”: before DeepEval, it was difficult to write tests that ensured good performance while iterating on an LLM application.

Let’s first explore the typical developer workflow for LLM applications:

  • Developer writes a bunch of spaghetti code to get LangChain working.
  • They test a few prompts and find a good one that satisfies their initial set of use cases.
  • They then play around with different LangChain settings and tutorials and realize they can do a lot more.
  • They then make a few changes: they add a Cohere reranker, bring summarisation into their prompts, switch LLM providers, try out OpenAI function calls, and introduce GuardRails.
  • They then realize their initial set of use cases no longer works. Now they have to go back to square one and figure out some testing framework. They see that they could apply BERTScore comparisons, and proceed to do just that.
  • They experience a significant slowdown in development.

To summarise, their MVP ends up looking a bit like this (with optional benchmarks to prove performance to their CTO/leads):

MVP for 90% of LangChain projects

And that’s great! But as they explore more, they find their RAG applications can balloon up quite quickly:

An example of a project with a “little” bit of scope creep.

The problem now is that each time you add a new tool, it invites more issues and errors, for a number of reasons:

  • Natural language outputs can vary quite a bit and are stochastic.
  • Agents can reason differently and reach different conclusions; consider Tree of Thought, Chain of Thought, and other agent techniques.
  • LLMs can hallucinate at various points and for various reasons.

If an agent/RAG application can’t reach the right conclusion, is it because it didn’t reason its way to the right scenario? How can you test for that?

With traditional software engineering practices, these kinds of tests are impossible to write. This is where DeepEval comes in as a testing/evaluation framework for exactly these problems.

Enter DeepEval — Unit Testing For LLMs

Instead of the usual build, deploy, eyeball loop, DeepEval gives developers a framework to unit test their LLM applications.

DeepEval helps developers create an evaluation dataset first, then build and test against it immediately. The best part? This is all doable with tooling they already know: the CLI, Pytest, and OpenAI completion protocols. No obscure SDK or API, just a pure open-source testing framework designed for improving prompts.

A quick overview diagram of our framework can be seen below (read more about it here).

DeepEval Framework — read more about it here: https://docs.confident-ai.com/docs/framework

DeepEval lets engineers and data scientists focus on developing their LLM-based applications while knowing with confidence when a newly added tool/component has a downstream effect. For example, they can test for RAG hallucinations in just one line of code, as shown below:

from deepeval.metrics.factual_consistency import assert_factual_consistency

# Fails if the output (first argument) is not factually consistent
# with the retrieved context (second argument)
assert_factual_consistency(
    "He left at 3PM",
    "At 3PM, James stood up, looked at the gate and exited.",
)
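
Because the assertion is just a Python function, it drops straight into a Pytest test and runs with the tooling mentioned above. Here is a minimal sketch reusing the same assertion (the file and test names are illustrative):

# test_hallucination.py: run with `pytest test_hallucination.py`
from deepeval.metrics.factual_consistency import assert_factual_consistency

def test_output_is_factually_consistent():
    # Fails when the output makes a claim the retrieved context does not support
    assert_factual_consistency(
        "He left at 3PM",
        "At 3PM, James stood up, looked at the gate and exited.",
    )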

This enables developer workflows to look more like this:

Developers can now build their applications for production more robustly

An example workflow, with the tests developers can now write:

  • We start off with a simple Question-Answering application and get it working.
  • We then decide, “What if we want to add summarisation as a use case?”, so we implement a guardrail for those queries. To make sure this doesn’t affect other queries, we add a DeepEval test case for conceptual similarity against the previous responses (see the sketch below).
  • We may then decide to switch this out for an LLM agent instead of a QA application, so we add an additional tool on top. We add a DeepEval overall test score to ensure this isn’t affected.
  • We then discover that Cohere embeddings perform better on multilingual datasets and swap them in for the OpenAI embeddings. So we add a multilingual test case to ensure results stay similar to the others.
  • We decide next week to fine-tune LLMs as well, so we add a bias test to ensure we don’t accidentally introduce bias into our fine-tuned LLMs.

DeepEval allows these new LLM applications to be built with more robustness than before.
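
As a sketch of the conceptual-similarity check from the workflow above, something along these lines could work. Note that the import path, function name, and example strings here are assumptions for illustration, so check the DeepEval docs for the exact API.

# Rough sketch of a conceptual-similarity test; the module path and signature
# are assumptions and may differ between DeepEval versions.
from deepeval.metrics.conceptual_similarity import assert_conceptual_similarity

def test_answers_stay_conceptually_similar():
    previous_answer = "The meeting moved to Friday because the client is travelling."
    new_answer = "Since the client is away, the meeting now takes place on Friday."
    # Fails if the new pipeline's answer drifts conceptually from the previous one
    assert_conceptual_similarity(new_answer, previous_answer)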

But what if you don’t like writing tests?

DeepEval gives users a way to easily bootstrap an evaluation dataset for benchmarking.

from deepeval.dataset import create_evaluation_query_answer_pairs

# Generate synthetic query-answer pairs from a piece of source text
dataset = create_evaluation_query_answer_pairs(
    """Python is a great language for mathematical expression and
    machine learning."""
)

Under the hood, we use ChatGPT to create the synthetic queries and answers (support for more models coming soon!).

And what if ChatGPT gives terrible synthetic queries and answers?

Users can review these queries and answers using our no-code dashboard, which you can launch in just one line of code. You can read more about that here.

dataset.review()

Users can review these queries and answers through a simple dashboard

We have an ambitious roadmap to quickly build out the evaluation space, with integrations for tools machine learning engineers love, like Unstructured, Guardrails, LangChain, LlamaIndex, and Streamlit.

We welcome you to check out our repository and leave any feedback!

Thanks for reading!

Of course, much of DeepEval wouldn’t have been possible without the research and open-source efforts of various experts, packages, and communities.
