RAG Eval Sets: The Secret to a Reliable AI Product

Imagine an AI system advising a doctor on treatment protocols or summarizing a patient’s medical history. A single hallucinated fact could be disastrous. This is why Retrieval-Augmented Generation (RAG) is essential: it anchors the LLM to a verified knowledge source (like internal clinical guidelines or legal documents), so that answers can be accurate, up-to-date, and auditable. The RAG pipeline can turn an unreliable, general-purpose LLM into a dependable, specialized product. However, in my experience, that only holds if we engineer an effective Evaluation (Eval) Set to verify that our RAG-based AI system is reliable.

The Foundation: Three Factors that Determine Difficulty

1. Context Relevance: Does the system retrieve documents that actually relate to the user’s question?

2. Context Sufficiency: Does the system retrieve all the facts needed to fully answer the question, even when the evidence is scattered across multiple documents? This tests the system’s ability to handle fragmented knowledge.

3. Format Constraints: Can the system stick to the format we asked for? Does it deliver the answer as a perfect JSON object, a numbered list, or with the correct citations included?
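The three factors above can each be turned into a simple check. The following is a minimal sketch, assuming every document carries an ID and every test case lists the IDs it needs; the function names, the document IDs, and the required JSON keys are all hypothetical. Relevance is approximated as precision over retrieved documents and sufficiency as recall over required documents, which is one common simplification rather than the only way to score them.

```python
import json

def context_relevance(retrieved_ids, relevant_ids):
    """Share of retrieved documents that are actually relevant (precision)."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_sufficiency(retrieved_ids, relevant_ids):
    """Share of the required documents that were retrieved (recall)."""
    if not relevant_ids:
        return 1.0
    found = sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids)
    return found / len(relevant_ids)

def meets_json_format(raw_output, required_keys):
    """True if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)

# Hypothetical example: the answer needs two guidelines, the retriever found one of them.
retrieved = ["guideline_12", "guideline_40", "memo_07"]
required = {"guideline_12", "guideline_33"}

print(context_relevance(retrieved, required))    # one of three retrieved docs is relevant
print(context_sufficiency(retrieved, required))  # one of two required docs was found
print(meets_json_format('{"answer": "x", "citations": []}', ["answer", "citations"]))
```

A low sufficiency score with a high relevance score is the typical signature of fragmented knowledge: what was retrieved is on topic, but part of the evidence is missing.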


Engineering the Eval Set: How to Build a Test Lab

Creating an enterprise-grade Eval Set is like designing a controlled science experiment for your product. It requires three core steps:

1. The Golden Triple (Q, A, Ground Truth Context)
Every test case must start with the Golden Triple: a Question (Q), the perfect Answer (A), and the Ground Truth Context, meaning the source passages that contain the facts needed to produce A. Having the Ground Truth Context lets you separate retrieval errors from generation errors, which is critical for product development. First, did the retriever surface the ground-truth passages? Second, given the retrieved documents, was the generated answer supported only by those facts? If not, the LLM is likely hallucinating.
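To make the Golden Triple concrete, here is a minimal sketch of how a test case could be stored and how a failure could be attributed to retrieval or to generation. The class name, field names, and the diagnose function are hypothetical, and the exact-match comparison is a deliberate placeholder: a real pipeline would more likely use a graded or model-based judgment of whether the answer is supported.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenTriple:
    question: str
    answer: str             # the expected "perfect" answer
    context_ids: frozenset  # IDs of the ground-truth documents

def diagnose(case, retrieved_ids, generated_answer):
    """Attribute a failure to retrieval or generation (deliberately simplistic)."""
    missing = case.context_ids - set(retrieved_ids)
    if missing:
        # The generator never saw the evidence, so retrieval is the first suspect.
        return "retrieval_error"
    if generated_answer.strip().lower() != case.answer.strip().lower():
        # The evidence was available but the answer still diverged.
        return "generation_error"
    return "pass"

# Hypothetical test case and outcomes.
case = GoldenTriple(
    question="What is the maximum visiting time in the ICU?",
    answer="30 minutes",
    context_ids=frozenset({"icu_policy_v3"}),
)

print(diagnose(case, ["cafeteria_menu"], "60 minutes"))  # retrieval_error
print(diagnose(case, ["icu_policy_v3"], "60 minutes"))   # generation_error
print(diagnose(case, ["icu_policy_v3"], "30 minutes"))   # pass
```

Checking retrieval before generation matters: a wrong answer produced without the right evidence tells you little about the generator itself.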

2. Stress Testing and Robustness
While the Golden Triple ensures accuracy on expected queries, building truly resilient systems requires stress testing – a topic so critical it deserves its own deep dive. We will cover it in another post to avoid distraction.

3. The Continuous Loop (Versioning)
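The validity of your Eval Set degrades as your company’s knowledge base changes. If you add a new policy or retire an old one, your tests might become meaningless. Version the Eval Set alongside the knowledge base, and re-validate the affected test cases whenever a source document changes.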
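One way to detect this drift is to record a fingerprint of each source document when a test case is written and compare it with the current knowledge base later. The sketch below assumes the knowledge base is a plain dictionary of document ID to text; the function names, field names, and sample policy text are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Stable content hash of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_cases(eval_cases, knowledge_base):
    """Return IDs of eval cases whose source documents changed or were retired."""
    stale = []
    for case in eval_cases:
        for doc_id, recorded_hash in case["context_hashes"].items():
            current = knowledge_base.get(doc_id)
            if current is None or fingerprint(current) != recorded_hash:
                stale.append(case["id"])
                break  # one changed document is enough to flag the case
    return stale

# Hypothetical knowledge base and eval case, captured at authoring time.
knowledge_base = {"policy_a": "Visitors are allowed from 9 to 5."}
eval_cases = [
    {"id": "case_1", "context_hashes": {"policy_a": fingerprint(knowledge_base["policy_a"])}},
]

print(stale_cases(eval_cases, knowledge_base))  # nothing has changed yet

knowledge_base["policy_a"] = "Visitors are allowed from 10 to 4."
print(stale_cases(eval_cases, knowledge_base))  # case_1 now needs review
```

A flagged case is not necessarily wrong; it only means a human should confirm that the expected answer still matches the updated source.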

In conclusion, an effective RAG Eval Set is a strategic tool. It moves beyond simple pass/fail tests and provides the quantifiable data you need to ensure your AI product is not just smart, but reliably safe, accurate, and ready for the enterprise.

Note: HumanLens.ai brings proven experience in quality and risk management to build the necessary technical and procedural guardrails around your RAG pipeline, ensuring your AI product is not just accurate, but auditable and safe.

Non-Determinism: Why it is a Feature, not a Bug, in LLMs. Plus, what Thinking Machines Lab’s quest for consistent output means for Responsible AI.

If you have ever asked an LLM the same question twice and received different answers, you have experienced non-determinism. While this might seem like a bug, it is actually a fundamental characteristic of these powerful models. To understand why, let’s contrast it with a more traditional, deterministic system.

Determinism vs. Non-Determinism: A Personal Story

As a grad student, I spent a lot of time with finite state machines (FSMs). An FSM is a computational model that can be in exactly one of a finite number of states at any given time, and it transitions from one state to another based on specific inputs. 💡 The light switch is a perfect example of a deterministic finite state machine. It’s “deterministic” because what it does is always 100% predictable. It’s a “finite state machine” because it has only a few, specific “states” it can be in.

Here’s how it works:

States: The light switch has only two states: On and Off.
Inputs: The only input is you flipping the switch.
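The light switch can be written down as a transition table. This is a minimal sketch; the names TRANSITIONS, step, and run, and the input symbol "flip", are illustrative choices rather than anything standard.

```python
# Each (state, input) pair maps to exactly one next state: that is the determinism.
TRANSITIONS = {
    ("Off", "flip"): "On",
    ("On", "flip"): "Off",
}

def step(state, symbol):
    """Apply a single input to the machine and return the next state."""
    return TRANSITIONS[(state, symbol)]

def run(start, inputs):
    """Feed a sequence of inputs to the machine and return the final state."""
    state = start
    for symbol in inputs:
        state = step(state, symbol)
    return state

print(run("Off", ["flip"]))          # On
print(run("Off", ["flip", "flip"]))  # Off
```

Because the table has a single entry per state and input, running the same inputs from the same starting state can only ever end in one place.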

The key is that an FSM is deterministic if, given the same state and input, it always makes the same transition and produces the same output. There is no room for variation! Some real-world problems, especially in human language, follow similar patterns, although they are a bit more complicated than the light switch example. My graduate school project was to build an FSM to model phonotactics, the rules governing which sounds can appear together in a language. As an illustrative example of a phonotactic rule, in Nepali we tend to avoid word-initial consonant clusters (https://www.ling.upenn.edu/Events/PLC/PLC35/abstracts/5b_Koirala.pdf). I was able to use Probabilistic Deterministic FSMs to model this aspect of human language.

However, other aspects of human language are more complex. Consider this sentence as an example: “Put the book on the table with the pencil.”

This sentence is a little bit of a puzzle. It has two possible meanings:

Meaning 1: Use the pencil to help you put the book on the table. (This sounds silly, right?)
Meaning 2: Put the book on the table where the pencil is already sitting. (This makes a lot more sense!)

A deterministic system might get confused by the word “with” because it might not know which meaning is the right one.

A non-deterministic system knows both interpretations exist, and if designed correctly, it can guess that in this situation when people say “with the pencil,” they likely mean a location, not a tool.

Okay, back to LLMs. LLMs generate text based on probabilities. When an LLM predicts the next word in a sentence, it considers a wide range of possibilities, each with a different likelihood. The model doesn’t just pick the most probable word every time; it samples from the distribution of likely words. This is where non-determinism comes in. By introducing a degree of randomness, an LLM can produce diverse, creative, and sometimes surprising outputs for the same prompt.
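The difference between always picking the most probable word and sampling from the distribution can be shown with a toy example. This sketch uses made-up scores for four candidate words and a temperature parameter, a common way to control how spread out the sampling is; it is an illustration of the idea, not how any particular model is implemented.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(tokens, scores):
    """Deterministic: always return the highest-scoring token."""
    return tokens[scores.index(max(scores))]

def sample_pick(tokens, scores, temperature, rng):
    """Non-deterministic: draw a token in proportion to its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["table", "desk", "shelf", "floor"]
scores = [2.0, 1.5, 1.0, 0.2]  # made-up scores for the next word

rng = random.Random()  # unseeded, so repeated runs differ
print([greedy_pick(tokens, scores) for _ in range(5)])            # the same word every time
print([sample_pick(tokens, scores, 1.0, rng) for _ in range(5)])  # varies from run to run
```

Passing a fixed seed to random.Random makes the sampled sequence repeatable in this toy setting, which is why sampling alone does not account for all of the variation seen in production systems.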

Why Non-Determinism is a Feature of LLMs

This variability is not a flaw; it is a feature that makes LLMs so versatile. Think about it:

Creativity: For tasks like writing stories, poems, or marketing copy, you don’t want a static, predictable output. Non-determinism allows the model to explore different ideas and styles, which is essential for creative work.

Avoiding Repetition: If an LLM were perfectly deterministic, it would likely get stuck in repetitive loops or generate the same uninspired text over and over again. Non-determinism helps it break out of these ruts.

Adaptability: It allows the model to produce different outputs for the same prompt, which can be useful when you are iterating on a task and looking for new angles or perspectives.

The Quest for Determinism

However, for some applications, this variability is a significant problem. I worked in AI Governance for a healthcare organization for the past three years, and non-determinism was undoubtedly one of the biggest bottlenecks for LLM adoption beyond pilots.

A new blog post from Mira Murati’s Thinking Machines Lab (https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/) claims to have defeated non-determinism in LLM inference, at least on a technical level. The authors argue that the source of non-determinism isn’t just the random sampling we usually talk about, but a deeper issue: the way GPU kernels parallelize their work changes with batch size, so the same request can yield slightly different numerical results depending on what else is being processed alongside it. By re-engineering the kernels to be batch-invariant, they can ensure that the same input always produces the same output.
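One ingredient behind this kind of numerical variation is that floating-point addition is not associative, so the order in which partial results are combined can change the total. The toy below runs on the CPU and is not the lab’s kernel code; the chunk_size parameter is only a stand-in for the way a parallel reduction might be split differently under different batch sizes, and the values are chosen to make the effect obvious.

```python
def sequential_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, chunk_size):
    """Sum each chunk separately, then add the partial sums together."""
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)

# Same numbers, same mathematical sum (2.0), different grouping.
values = [1e16, 1.0, -1e16, 1.0]

print(chunked_sum(values, 4))  # 1.0
print(chunked_sum(values, 2))  # 0.0
```

Neither grouping is “the” correct one; both lose precision, just in different places. A batch-invariant computation is one that always uses the same grouping for a given input, so the result stops depending on the surrounding workload.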

This is a great feat because it opens up the door for using LLMs in more restricted and critical domains, such as healthcare. For example, a deterministic LLM could be used to:

Automate clinical documentation where consistency and accuracy are paramount.

Generate patient summaries that are always the same for a given set of medical records, which is essential for legal and auditing purposes.

Assist in diagnostic support where consistent outputs are necessary to build trust and ensure safety.

Why Rigorous Testing is Still Critical

The work by Thinking Machines is a crucial step toward giving us the best of both worlds, but a deterministic LLM is not a silver bullet. While the output may be consistent, the model can still be prone to hallucinations, that is, generating information that is factually incorrect. And testing a model for safety and reliability goes far beyond just checking for hallucinations. Here is why rigorous testing is still non-negotiable, even with a deterministic model:

Accuracy: Consistency doesn’t equal correctness. A deterministic model might consistently provide an inaccurate diagnosis based on flawed training data. Testing must verify that the model’s outputs align with established facts and expert knowledge. In healthcare, this means validating the LLM’s outputs against medical guidelines and professional review.

Bias: The black box problem and the risk of bias don’t disappear with determinism. A deterministic LLM can consistently produce outputs that are biased against a particular gender, race, or socioeconomic group. This happens when the training data is not representative or reflects societal prejudices. Testing must include auditing the model for fairness and ensuring it doesn’t perpetuate or amplify existing biases.

Security: Even a deterministic model can be vulnerable to security risks. A malicious actor could exploit the system to inject harmful content or manipulate its outputs. In a healthcare setting, a security breach could expose sensitive patient data or compromise the integrity of the diagnostic tool. Testing must include robust security audits to protect against these threats.

Governance and Oversight: The advent of deterministic LLMs doesn’t eliminate the need for clear AI governance. We still need policies and procedures that define who is responsible for the model, how it will be monitored, and what happens in the event of a failure. These frameworks, as discussed in my previous blog post, are what ensure accountability in the long run.

In the end, it seems we need both. Non-determinism is valuable for creative tasks, but determinism is essential for high-stakes, consistent applications. The work by Thinking Machines is a crucial step, but it only solves one part of the puzzle. The true promise of AI can only be realized when we combine these technical advances with a steadfast commitment to rigorous testing, ethical governance, and a proactive approach to safety and security.

Unpacking AI Accountability

In traditional software development, accountability is relatively straightforward. A bug in a program can often be traced back to a specific line of code or a developer’s oversight. The responsibility is clear.

AI, as we know it now, introduces the “black box” problem. The model’s decisions are based on patterns learned from vast datasets. This makes it incredibly difficult to pinpoint why an AI system made a particular decision. For example, if an AI system denies a mortgage application, it might be due to a complex interplay of hundreds of thousands of data points, not a single piece of faulty logic. The accountability is distributed and obscured.

Who is Accountable?

When an AI system causes harm, assigning accountability isn’t simple. The accountable parties can include data providers, model developers, product managers and leaders, the deploying organization, or, dare I say, even the end users in some cases.

Developer Accountability is especially tough! As developers, we carry a heavy burden. We are tasked with building complex systems that can have real-world consequences, often without full visibility into the data or the business context. The challenge is twofold: First, an AI can absorb and amplify societal biases present in the training data, even if the developer has good intentions. We are expected to “do right” by not just creating working algorithms but also by ensuring they are fair and ethical. Second, the very nature of machine learning makes it difficult to guarantee a perfect outcome. A model trained on 10 million data points might perform flawlessly on 9.9 million but fail on a specific, unforeseen edge case. Holding a single developer accountable for every single error in a system of this scale is not realistic.

The Role of Responsible AI Tools

So how can we, as developers, be more accountable? This is where RAI tools like HumanLens come in. By integrating RAI tools directly into the development lifecycle, we can build accountability into our processes, not just react to problems. These tools can:

Identify Bias: They can surface skewed data or uneven outcomes across groups early, which gives us a head start on “doing right.”
Increase Transparency: They can provide “explainability” by showing which features had the most influence on a model’s decision.
Document Everything: From data lineage to model versions, these tools create a clear, auditable trail. This helps distribute accountability appropriately across the different parties involved.

In the end, accountability isn’t just about placing blame. It’s about having the right systems and tools in place to build better, more trustworthy AI. It’s about shifting the focus from individual failure to systemic responsibility, ensuring we can all “do right” and build a more ethical future with AI.