Background
- As product development progressed, additional prompts, models, and logic updates became essential to meet evolving requirements.
- With each new rollout, increasing amounts of data and feedback were gathered, providing valuable insights for continuous improvement.
Pain Point
- Lack of Benchmarking: Without a solid evaluation framework, there is no clear benchmark to measure progress or identify areas for improvement, leading to a trial-and-error approach that may result in unnecessary detours.
- Interconnected Systems and Flows: The entire workflow and system are highly interdependent, making it crucial to implement robust testing to ensure any changes remain consistent with existing components and do not disrupt other processes.
- Time-Consuming Manual Effort: Evaluations were primarily conducted manually, without the support of an automated testing platform. As demands increase, a scalable and comprehensive evaluation system is needed to identify issues and enhance efficiency.
Goal & Achievement

- Bad Case Reduction: Reduced bad case rate by 10%
- Human Task Reduction: Decreased reliance on human intervention by 17%
- Model Optimization: Established monthly model fine-tuning using collected data
Product Planning
- Scope Alignment
The overall evaluation consisted of four main categories. In this section, I focus only on the first two (model & prompt, and conversation), while the others, such as tech health, can be monitored through general data analyses and dashboards. With a technical focus, I began with single-round outputs and planned to explore advanced modules later, such as RAG and tool use like NL2SQL.
- Framework Alignment
Based on previous development practices and preliminary discussions with the engineering team, I proposed an evaluation framework.
1. What to evaluate: model, prompt, feature change, generated conversation
2. When to evaluate: development, QA testing, production
3. How to test: unit test, module test, flow test (including building an AI agent to simulate customer chats)
4. Primary evaluation dimensions: accuracy, hallucination, consistency, goal completion, relevance, recall, precision
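To make the flow test in item 3 concrete, below is a minimal sketch of how an LLM-driven "customer" could chat with the product agent until a goal check passes. Every name here (run_simulated_chat, the turn callables, the stubbed replies) is a hypothetical placeholder for illustration, not the production implementation.

```python
# Sketch of a flow test: an LLM-driven "customer" chats with the product agent
# for a few turns, then a check decides whether the customer's goal was reached.
from typing import Callable

Message = dict[str, str]  # {"role": "customer" | "agent", "content": "..."}

def run_simulated_chat(
    customer_turn: Callable[[list[Message]], str],  # LLM playing a prospect with a goal
    agent_turn: Callable[[list[Message]], str],     # the leasing-agent flow under test
    goal_reached: Callable[[list[Message]], bool],  # e.g. "a tour was scheduled"
    max_turns: int = 8,
) -> tuple[bool, list[Message]]:
    history: list[Message] = []
    for _ in range(max_turns):
        history.append({"role": "customer", "content": customer_turn(history)})
        history.append({"role": "agent", "content": agent_turn(history)})
        if goal_reached(history):
            return True, history
    return False, history

if __name__ == "__main__":
    # Stub turns so the sketch runs standalone; in practice these wrap LLM calls.
    customer = lambda h: "Hi, can I book a tour for a 1-bedroom this Saturday?"
    agent = lambda h: "Sure! I booked your tour for Saturday at 10 am."
    check = lambda h: any("booked" in m["content"].lower() for m in h if m["role"] == "agent")
    ok, transcript = run_simulated_chat(customer, agent, check)
    print("goal completed:", ok)
```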
- Research and Brainstorming
Collaborated with design, data, engineering, sales & ops teams, and other product managers to brainstorm various testing and evaluation ideas, then prioritized them based on effort, effectiveness, and feasibility.
Solution & Release
The team and I were still exploring and iterating on the evaluation framework proposed above. Below is a description and the status of what I had already implemented.
Goal: Auto-Testing Platform

Regarding model and prompt
- Human Labeled and Crafted Test Set
1. For any prompt or model change, such as new rules or fine-tuning, a sample of live conversations or a newly human-crafted dataset was used for testing, and human evaluators then reviewed the results.
2. This approach lacks scalability and automation. The next step is to leverage existing tools or frameworks, such as DeepEval or Promptfoo, to streamline the process, as sketched below.
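As a hedged sketch of what scripting this step could look like before adopting a framework, the snippet below loops over a human-labeled test set, runs the prompt/model variant under test, and records a cheap automatic signal for reviewers. The file name, record fields, and call_model wrapper are assumptions; DeepEval or Promptfoo would provide richer metrics out of the box.

```python
# Sketch: run a human-labeled test set against a candidate prompt/model and
# collect results for reviewer sign-off. The JSONL fields and call_model()
# are hypothetical placeholders for the real pipeline.
import json
from typing import Callable

def run_test_set(path: str, call_model: Callable[[str], str]) -> list[dict]:
    results = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)  # {"input": "...", "expected": "...", "tags": [...]}
            output = call_model(case["input"])
            results.append({
                "input": case["input"],
                "expected": case["expected"],
                "output": output,
                # Cheap automatic signal; ambiguous cases still go to human review.
                "auto_pass": case["expected"].lower() in output.lower(),
            })
    return results

# Usage: results = run_test_set("leasing_faq_cases.jsonl", call_model=prompt_v2)
#        pass_rate = sum(r["auto_pass"] for r in results) / len(results)
```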
Regarding conversation

- Human-in-the-Loop: Thumbs Up and Down (focus on domain expertise and subtle cases)
1. Displayed on the chatting interface, allowing users to mark their feedback with reasons and details.
2. Widely used by human leasing agents or internal users to share direct feedback.
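For illustration, a feedback record from this widget could be shaped roughly as below; the field names are assumptions rather than the production schema.

```python
# Sketch of the record a thumbs-up/down widget could emit; field names are
# assumptions, not the production schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MessageFeedback:
    conversation_id: str
    message_id: str
    rating: str                           # "up" | "down"
    reason: str | None = None             # e.g. "wrong pricing", "tone too pushy"
    details: str | None = None            # free text from the leasing agent
    reviewer_role: str = "leasing_agent"  # internal user vs. end customer
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Downstream, "down" records with reasons become candidates for the labeled
# test set and for the monthly fine-tuning data.
```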

- AI-as-a-Judge: Predefined Case Evaluation (focus on detection of known issues)
1. Focused on using a specialized LLM to monitor specific, predefined cases such as hallucinations or instances where the AI admitted it was an AI.
2. Human leasing agents received a reminder whenever such a case was detected, enabling them to intervene.
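A minimal sketch of this judge, assuming a generic judge_llm wrapper that returns a JSON list; the case list and prompt wording are illustrative, not the production prompt.

```python
# Sketch of a predefined-case judge: check one AI reply against a fixed list of
# known failure modes and return the ones detected, so a reminder can be sent.
import json
from typing import Callable

PREDEFINED_CASES = [
    "hallucinated amenity, price, or availability",
    "admitted to being an AI or a bot",
    "promised something outside leasing policy",
]

JUDGE_PROMPT = """You are a QA judge for a leasing assistant.
Reply with a JSON list of any of these issues present in the message:
{cases}

Message to check:
{message}
"""

def detect_predefined_cases(message: str, judge_llm: Callable[[str], str]) -> list[str]:
    prompt = JUDGE_PROMPT.format(
        cases="\n".join(f"- {c}" for c in PREDEFINED_CASES), message=message
    )
    try:
        detected = json.loads(judge_llm(prompt))
    except json.JSONDecodeError:
        detected = []  # fall back to "no alert" and log the reply for human review
    if not isinstance(detected, list):
        detected = []
    return [c for c in detected if c in PREDEFINED_CASES]

# If detect_predefined_cases(...) returns anything, notify the human leasing
# agent on that conversation so they can step in.
```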

- AI-as-a-Judge: Conversation Evaluation (focus on response quality)
1. Leveraged an LLM to assess conversation quality on goal completion and consistency, producing a 3-tier score along with an explanation.
2. Still in early stages and requires additional human oversight to review sample results, ensuring accuracy and preventing misjudgments.
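As a sketch of the scoring shape (the rubric wording and the llm wrapper are assumptions), the judge can return the two dimensions plus a rationale as structured output:

```python
# Sketch of the conversation-level judge: score a transcript on a 3-tier rubric
# for goal completion and consistency, with a short rationale.
import json
from dataclasses import dataclass
from typing import Callable

RUBRIC = """Rate this leasing conversation on two dimensions, each 1 (poor),
2 (acceptable), or 3 (good): goal_completion and consistency.
Return JSON: {"goal_completion": n, "consistency": n, "rationale": "..."}

Conversation:
"""

@dataclass
class ConversationScore:
    goal_completion: int
    consistency: int
    rationale: str

def judge_conversation(transcript: str, llm: Callable[[str], str]) -> ConversationScore:
    raw = json.loads(llm(RUBRIC + transcript))
    return ConversationScore(raw["goal_completion"], raw["consistency"], raw["rationale"])

# A sampled share of scores is still spot-checked by humans while the judge's
# agreement with human reviewers is being measured.
```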

- Automated Testing Script with Fixed Dataset (focus on conversation fluency)
1. Used human-crafted datasets to automatically exercise the entire flow via scripts; QA then reviewed the results.
2. Automatic but limited in scope. Primarily used for general regression testing.
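A hedged sketch of such a regression script: replay scripted multi-turn conversations through the full flow and flag replies that miss expected markers. The dataset shape and send_to_flow wrapper are hypothetical.

```python
# Sketch of a fixed-dataset regression run: replay scripted conversations
# through the whole flow and collect a failure report for QA review.
import json
from typing import Callable

def regression_run(dataset_path: str, send_to_flow: Callable[[str, str], str]) -> list[dict]:
    failures = []
    with open(dataset_path) as f:
        scripts = json.load(f)  # [{"id": ..., "turns": [{"user": ..., "must_include": [...]}]}]
    for script in scripts:
        for i, turn in enumerate(script["turns"]):
            reply = send_to_flow(script["id"], turn["user"])
            missing = [m for m in turn.get("must_include", []) if m.lower() not in reply.lower()]
            if missing:
                failures.append({"conversation": script["id"], "turn": i,
                                 "reply": reply, "missing": missing})
    return failures  # QA reviews this report before release

# Usage: failures = regression_run("regression_scripts.json", send_to_flow=chat_api)
```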
Learning
- Prioritizing Evaluation Early On
Looking back, I would have prioritized evaluation from the very beginning. It's a critical aspect for measuring and ensuring consistent performance, and it requires careful planning and dedicated resources.
- Balancing Core and LLMOps Modules in GenAI Dev
For LLM and GenAI applications, successful development requires attention to factors beyond core functionality, testing, and evaluation. These include crucial LLMOps aspects such as prompt management, log tracking, robust API design, and data governance. Striking the right balance among these factors in system design and resource allocation is essential for achieving optimal performance and user experience.