2.3 Evaluate generative AI results and refine prompts
Getting an AI response is good. Getting a good response is better. At MagicFridge, the QA team does not blindly trust GUS (our application's AI). As with any software, the AI output must be verified. But how do we "test the tester"?
This section teaches you how to measure the quality of AI responses and iteratively improve your prompts.
2.3.1 Metrics for evaluating results on test tasks
Evaluating AI output is not guesswork: the syllabus defines specific metrics to judge whether the AI has done a good job.
Here are the key metrics to know:
| Metric | What it measures | MagicFridge Example |
|---|---|---|
| Accuracy | The percentage of correct responses compared to a reference. | The AI generated 10 test cases. 9 are valid, 1 is off-topic. Accuracy = 90%. |
| Precision | The proportion of defects reported by the AI that are real (avoiding false positives). | The AI flags 5 bugs in the code. 4 are real bugs, 1 is an AI error. Precision = 80%. |
| Recall | The proportion of existing defects the AI manages to find (avoiding false negatives). | There were 10 hidden bugs in the code. The AI found 6. Recall = 60% (it missed 4). |
| Relevance | Adequacy to the business context. | Do the generated recipes respect the requested "Gluten-Free" constraints? |
| Diversity | The variety of responses. | If we ask for 50 test scenarios, are they all different or does the AI repeat the same pattern? |
| Execution success rate | The proportion of generated test cases or test scripts that can be executed successfully (without syntax or format errors). | Out of 10 Selenium scripts generated by the AI, 8 run on the first try, 2 crash due to a syntax error. Rate = 80% |
| Time efficiency | The time saved compared to manual test efforts (ROI). | The AI generates 100 lines of SQL data in 2 minutes. A human would have taken 4 hours. The gain is massive. |
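These counting metrics are easy to script once a human reviewer has labeled each AI output. The sketch below is a minimal Python illustration reusing the precision and recall figures from the MagicFridge examples in the table; the manual labeling step is assumed to have already happened.

```python
# Minimal sketch of precision/recall computation, assuming the QA team has
# already labeled each AI finding manually (the counts come from the table above).

def precision(true_positives: int, false_positives: int) -> float:
    """Share of AI-reported defects that are real bugs (avoids false positives)."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: int) -> float:
    """Share of existing defects the AI actually found (avoids false negatives)."""
    return true_positives / (true_positives + false_negatives)

# MagicFridge figures from the table above:
print(precision(true_positives=4, false_positives=1))  # 0.8 -> 80% precision
print(recall(true_positives=6, false_negatives=4))     # 0.6 -> 60% recall
```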
2.3.2 Techniques for evaluating and iteratively refining prompts
Prompt engineering is a cycle: write, test, correct. If the AI result is mediocre, it is often not the model's fault, but the prompt's fault.
Here are techniques to improve your results:
1. Iterative prompt modification
Start simple, look at the result, and progressively add constraints.
Running example: MagicFridge
The tester wants test data.
- Iteration 1: "Give me a list of users." -> Mediocre result: the AI only returns first names.
- Iteration 2: "Give me a list of users with email and password." -> Better, but as unstructured text.
- Iteration 3: "Give me a CSV table with: ID, Email, Password (complex), Registration Date." -> Perfect.
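Because iteration 3 specifies an exact output format, the result can even be verified automatically. The sketch below is a hypothetical illustration: the prompt wording, the column names, and the check itself are assumptions, not part of the syllabus.

```python
import csv
import io

# Hypothetical sketch: verify that the AI response matches the CSV structure
# requested in iteration 3 (column names are illustrative assumptions).
PROMPT_V3 = (
    "Give me a CSV table with the columns: ID, Email, Password, RegistrationDate. "
    "Passwords must be complex (12+ characters, mixed case, digits, symbols)."
)

EXPECTED_COLUMNS = ["ID", "Email", "Password", "RegistrationDate"]

def looks_like_expected_csv(ai_response: str) -> bool:
    """Return True if the AI response parses as CSV with the requested header row."""
    reader = csv.reader(io.StringIO(ai_response.strip()))
    header = next(reader, [])
    return [column.strip() for column in header] == EXPECTED_COLUMNS

# Example: a well-formed response passes the check.
print(looks_like_expected_csv("ID,Email,Password,RegistrationDate\n1,a@b.io,S3cure!Pass,2024-01-15"))
```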
2. A/B testing of prompts
Write several variations of the same prompt and statistically compare the results to keep the best "template".
Running example: MagicFridge
The team wants to summarize bug tickets. The QA engineer tests two approaches:
- Prompt A: "Summarize this bug."
- Prompt B: "Act as a senior developer. Summarize this bug following the structure: Title, Root Cause, Impact, Suggested Solution."
After testing on 20 tickets, Prompt B offers much higher relevance and utility. It becomes the team standard.
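A comparison like this can be scripted so that both prompts are scored on the same sample of tickets. The sketch below assumes two hypothetical helpers, ask_gus() and rate_summary(), which are not real MagicFridge functions but stand in for the AI call and the reviewer's score.

```python
from statistics import mean

# Sketch of an A/B comparison. ask_gus() and rate_summary() are hypothetical:
# they represent the AI call and a reviewer's 1-5 usefulness score.
PROMPT_A = "Summarize this bug."
PROMPT_B = (
    "Act as a senior developer. Summarize this bug following the structure: "
    "Title, Root Cause, Impact, Suggested Solution."
)

def compare_prompts(tickets, ask_gus, rate_summary):
    """Average the reviewer scores per prompt over the same sample of tickets."""
    scores = {}
    for name, prompt in [("A", PROMPT_A), ("B", PROMPT_B)]:
        scores[name] = mean(rate_summary(ask_gus(prompt, ticket)) for ticket in tickets)
    return scores  # e.g. {"A": 2.4, "B": 4.1} -> prompt B becomes the team template
```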
3. Output analysis
This is the critical examination of responses to detect hallucinations or biases. It helps you understand why the AI made a mistake and add an explicit constraint to the prompt so the error does not happen again.
Running example: MagicFridge
Context: The tester asks GUS: "Generate a weekly menu for a family of 4."
1. Output analysis (the observation): the tester reads the generated menu. At first glance, everything looks correct (7 days, lunch and dinner). However, upon closer analysis, she spots a logical anomaly: for Tuesday lunch, GUS suggested a "Beef Bourguignon (Cooking time: 4h00)".
2. Diagnosis (why did the AI fail?): the AI did not hallucinate (the recipe exists), but it lacked implicit context. It did not "understand" that at Tuesday lunchtime, people are at work and cannot cook for 4 hours.
3. Corrective action (prompt refinement): following this analysis, the tester adds an explicit constraint to the system prompt: "For weekday lunches (Monday-Friday), NEVER suggest recipes requiring more than 30 minutes of preparation."
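If GUS can return the menu as structured data, the same constraint can also be checked automatically after every generation. The sketch below illustrates this under that assumption; the field names (day, meal, recipe, prep_minutes) are hypothetical.

```python
# Sketch of an automated output check, assuming GUS can return the menu as
# structured data; the field names below are illustrative assumptions.
WEEKDAYS = {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday"}
MAX_WEEKDAY_LUNCH_MINUTES = 30

def find_lunch_violations(menu: list) -> list:
    """Flag weekday lunches that break the 30-minute preparation constraint."""
    return [
        meal for meal in menu
        if meal["day"] in WEEKDAYS
        and meal["meal"] == "lunch"
        and meal["prep_minutes"] > MAX_WEEKDAY_LUNCH_MINUTES
    ]

# The Beef Bourguignon from the analysis above would be flagged:
print(find_lunch_violations([
    {"day": "Tuesday", "meal": "lunch", "recipe": "Beef Bourguignon", "prep_minutes": 240},
]))
```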
4. Integrate user feedback
The quality engineer is not the only judge. It is crucial to ask the end users of the AI's output (developers, POs, other testers) whether the result was useful to them.
Running example: MagicFridge
Context: the AI automatically generates bug descriptions for developers.
The user feedback: a developer points out to the tester: "GUS's reports are technically correct, but too verbose. I waste time reading 3 paragraphs just to find the error ID."
The adjustment: the tester modifies the prompt to integrate this feedback: "Output format: you must start with the Error ID and the line of code, then provide a one-sentence description."
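Format feedback like this can also be turned into a lightweight automated check, so a regression is caught the next time the prompt changes. The sketch below is illustrative only: the "ERR-1234" identifier pattern and the prompt wording are assumptions, not an established MagicFridge convention.

```python
import re

# Illustrative sketch: the error-ID pattern and prompt wording are assumptions.
SYSTEM_PROMPT = (
    "Output format: start with the Error ID and the line of code, "
    "then provide a one-sentence description."
)

def starts_with_error_id(report: str) -> bool:
    """Check that the first line of a generated bug report leads with an error ID."""
    lines = report.strip().splitlines()
    return bool(lines) and bool(re.match(r"^ERR-\d+", lines[0]))
```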
5. Adjust prompt length and specificity
This is about adjusting the "slider". Sometimes, a prompt that is too short yields generic results. Sometimes, a prompt that is too long "drowns" the AI in too many contradictory details. You have to experiment to find the right balance.
Running example: MagicFridge
Objective: generate innovative recipe ideas.
- Try 1 (too short): "Give me a recipe idea."
- Result: "Butter pasta." (Too banal).
- Try 2 (too specific): "Give me a vegan, gluten-free, blue recipe, using yuzu, cooked in the microwave in 3 minutes for an astronaut."
- Result: the AI hallucinates or says it is impossible.
- Try 3 (adjusted): "Propose an original and colorful recipe, using an Asian citrus fruit, achievable in less than 20 minutes."
- Result: "Scallop Carpaccio with Yuzu and Pomegranate pearls." (The perfect balance).
Syllabus point (key takeaways)
- It is not enough to generate; you must also evaluate.
- Key metrics are: accuracy, precision, recall, relevance, diversity, execution success rate, time efficiency.
- Improvement is iterative: refine the prompt based on observed errors and omissions.
- A/B testing makes it possible to objectively choose the best prompt for a given task.