August 26, 2025

The "Health Check" of AI Products: How I Build Evaluation Systems

By Zhexi Zhu · 6 min read

Recently, while optimizing AI applications, I've come to a realization: fine-tuning models and writing prompts often only account for 20% of the work. The remaining 80% of the effort is actually spent on "setting standards."

Whether building RAG, Agents, or fine-tuning foundation models, we often face a soul-searching question: "Is this version really better than the last one?"

Without an objective ruler, optimization easily devolves into guesswork. Today, I want to outline the evaluation process I've explored in my projects. This methodology is not just for acceptance testing; it is for driving development.

1. Data "Intuition": Starting with a Good Dataset

When I first started doing evaluations, I would grab whatever ready-made cases were at hand. I quickly found those test sets were badly unrepresentative. We later adjusted our strategy and followed these principles when building datasets:

1. Authenticity and Distinctiveness

Nothing is more precious than real user queries. Synthetic data is usable early on, but it rarely covers the vague phrasing and "dirty data" of real-world scenarios. At the same time, we deliberately increase the proportion of "hard problems" when screening data. If the model scores 95 on the test set, that test set is insensitive to optimization; it needs enough discriminative power to expose the model's limits.
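One way to operationalize "distinctiveness" is to drop cases that every recent model version already passes. Below is a minimal sketch; the case dict shape (a `results` list of pass/fail outcomes per version) is an assumption for illustration, not from our actual pipeline.

```python
def keep_discriminative(cases, max_pass_rate=0.9):
    """Filter a test set down to cases that can still expose regressions.

    `cases` is a list of dicts like {"query": ..., "results": [True, False, ...]},
    where `results` holds pass/fail outcomes from recent model versions.
    Cases that nearly every version passes carry no optimization signal.
    """
    kept = []
    for case in cases:
        results = case["results"]
        pass_rate = sum(results) / len(results)
        if pass_rate <= max_pass_rate:
            kept.append(case)
    return kept
```

In practice the threshold is a tuning knob: too strict and you discard the regression guardrails, too loose and the aggregate score saturates.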

2. Allocation of Scenario Weights

We tried mixing all data together for testing, but found that long-tail issues were easily drowned out. Our current approach is to classify cases by scenario (e.g., high-frequency mass scenarios vs. low-frequency niche scenarios) and assign weights based on business importance. The resulting score can truly reflect the "perceived quality" of the product.
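The weighting step above is simple arithmetic, but writing it down keeps everyone honest about what the headline number means. A minimal sketch, assuming each scenario has a list of case scores and a business-importance weight:

```python
def weighted_product_score(scores_by_scenario, weights):
    """Aggregate per-scenario average scores into one product-level score.

    `scores_by_scenario` maps scenario name -> list of case scores (0-100);
    `weights` maps scenario name -> business-importance weight.
    Weights are normalized internally, so they need not sum to 1.
    """
    total_weight = sum(weights[s] for s in scores_by_scenario)
    score = 0.0
    for scenario, scores in scores_by_scenario.items():
        avg = sum(scores) / len(scores)
        score += (weights[scenario] / total_weight) * avg
    return score
```

With this, a niche scenario can tank its own sub-score without drowning in the mass-scenario average, while the weighted total still tracks perceived quality.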

3. "Extrapolating" from Bad Cases

This is a very practical technique: whenever we find a Bad Case in actual operation, we don't just fix that one. We treat it as a signal and immediately collect or construct a batch of similar cases to add to the test set. Gradually, the test set becomes a moat for product capability.
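The "extrapolate" step can be as mechanical as tagging each bad case with the failure pattern it exposed and registering a batch of variants under the same tag. In the sketch below, `make_variants` is a hypothetical hook (in practice a paraphrasing prompt or hand-written cases), not something from our stack:

```python
def extrapolate_bad_case(test_set, bad_query, pattern, make_variants, n=5):
    """Turn one production bad case into a batch of tagged regression cases.

    `pattern` names the failure mode (e.g. "multi-intent miss");
    `make_variants` is a user-supplied callable returning similar queries --
    hypothetical here, typically a paraphrasing prompt or manual authoring.
    """
    for query in [bad_query] + list(make_variants(bad_query, n)):
        test_set.append({"query": query, "pattern": pattern})
    return test_set
```

Tagging by pattern also pays off later: you can report pass rates per failure mode instead of one opaque total.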

2. Manual Review: The "Dumb Work" You Can't Skip

Greg Brockman of OpenAI once shared a view that manually checking data is a high-ROI activity. I have deeply felt this in practice.

In the early stages of a project, I force myself to eyeball the results of over 100 cases to build a real intuition for the product.

When reviewing, I mainly do one thing: Clustering.

AI errors are rarely random. If errors look messy, it's usually because you haven't seen enough of them. When we cluster Bad Cases, we often find they concentrate in a few specific scenarios (like "failure to identify multiple intents" or "specific format hallucinations"). Once errors are categorized, the problem is half solved. Usually, when I stop seeing new error types, I stop the large-scale manual review and switch to the automated phase.
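That stopping rule ("no new error types appearing") can be made explicit. A minimal sketch, assuming the review stream is just an error-category label per bad case in the order reviewed:

```python
from collections import Counter

def review_until_saturated(labels, patience=20):
    """Tally error categories during manual review and decide when to stop.

    `labels` is the review stream: one error-category string per bad case,
    in review order. Returns (category counts, stop index). Review stops
    once `patience` consecutive cases introduce no new category.
    """
    counts = Counter()
    seen = set()
    since_new = 0
    for i, label in enumerate(labels):
        counts[label] += 1
        if label not in seen:
            seen.add(label)
            since_new = 0
        else:
            since_new += 1
        if since_new >= patience:
            return counts, i + 1
    return counts, len(labels)
```

The resulting counts are exactly the clusters described above, and the stop index tells you roughly how many cases the manual phase cost.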

3. System Quantification: The Core is "Consistency"

Once we have the intuition, the next step is to translate that intuition into metrics.

When designing scoring rules, we found that consistency is more important than accuracy. Whether using human annotation or AI scoring, we need to ensure that the same input and output receive stable scores across different evaluation times.
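A quick consistency check is to score the same (input, output) pairs twice and measure agreement. In this sketch, `score_fn` is a stand-in for whatever judge you use (a human rubric or an LLM scorer); the function itself is an illustration, not a specific library API:

```python
def consistency_rate(pairs, score_fn, tolerance=0):
    """Fraction of (input, output) pairs whose two scoring passes agree.

    `score_fn(inp, out)` is a stand-in for the judge (human or LLM);
    two scores within `tolerance` of each other count as consistent.
    """
    agree = 0
    for inp, out in pairs:
        first, second = score_fn(inp, out), score_fn(inp, out)
        if abs(first - second) <= tolerance:
            agree += 1
    return agree / len(pairs)
```

If this rate is low, tightening the rubric (or lowering the judge's sampling temperature) matters more than any model-side change, because the noise floor swallows real improvements.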

We design scenario-specific scoring dimensions. Different user intents call for different answer styles and different content focuses, which in turn yield different scoring standards.
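One way to encode that is a per-intent rubric: each intent maps to the dimensions it is scored on and their weights. The intents and dimensions below are illustrative placeholders, not our actual product taxonomy:

```python
# Illustrative per-intent rubrics: dimension -> weight (each sums to 1).
RUBRICS = {
    "factual_lookup": {"accuracy": 0.6, "conciseness": 0.2, "citation": 0.2},
    "open_advice":    {"relevance": 0.4, "depth": 0.4, "tone": 0.2},
}

def score_answer(intent, dimension_scores):
    """Weighted score for one answer under its intent's rubric (0-100 scale)."""
    rubric = RUBRICS[intent]
    return sum(rubric[d] * dimension_scores[d] for d in rubric)
```

Keeping the rubrics as data rather than prose also makes the consistency requirement testable: the same dimension scores always aggregate to the same number.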

4. Rejecting the Black Box: Component-Level Evaluation

When building complex systems like AI Agents or RAG, the easiest pitfall is doing only end-to-end testing.

Looking only at the final result can mask many problems. For example, if the answer is wrong, was the right document never retrieved, or did the model misunderstand it? To figure this out, we need visibility into every component's logs and perform segmented evaluation:

  • RAG Scenarios: We test retrieval and generation separately. The retrieval module looks at precision and recall, while the generation module looks at hallucination rates.
  • Agent Scenarios: Focus on the intermediate process of tool calls. Were the parameters passed correctly? Were the call rounds redundant?
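For the retrieval half of the RAG case, precision and recall reduce to set arithmetic over document ids. A minimal sketch, assuming you have a gold set of relevant ids per query (hallucination-rate scoring for the generation half would plug a judge in on top):

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Precision and recall for one query's retrieval step.

    `retrieved_ids`: doc ids the retriever returned;
    `relevant_ids`: the gold set of ids that actually answer the query.
    """
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Logging these per query makes the "document not found" vs. "model misunderstood" split immediate: low recall points at the retriever, high recall with a wrong answer points at generation.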

Only by understanding the input and output of each component can we truly achieve "Evaluation-Driven Development."

Final Thoughts

In fact, the process of evaluation is essentially the process of deepening our understanding of the product.

First, establish intuition through manual analysis, then solidify standards through high-quality datasets, and finally locate problems through component-level monitoring. When this data flywheel starts spinning, the direction of product iteration naturally becomes clear.