
3.1 Hallucinations, reasoning errors and biases

LLMs are probabilistic machines, not factual knowledge bases. They have no notion of "true" or "false", only "plausible". This probabilistic nature gives rise to three major types of defects that the MagicFridge QA team must identify and mitigate.

3.1.1 Hallucinations, reasoning errors and biases in generative AI

To effectively test GUS (our AI), we first need to know how to name the problems. The syllabus distinguishes three categories of defects:

1. Hallucinations

A hallucination is a response generated by the AI that is fluent and confidently phrased, but factually incorrect or invented with respect to reality or the provided context.

Red thread: MagicFridge

A user asks: "What are the benefits of the 'Lompon' fruit?"

GUS's response: "The Lompon is a rare citrus fruit from Southeast Asia, rich in Vitamin C and excellent for digestion."

The problem: The "Lompon" does not exist. GUS invented this fruit entirely because the name sounded plausible. This is a pure hallucination.
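One practical way to catch this kind of invention is a "trap question" about a deliberately made-up item: a trustworthy assistant should admit it does not know it rather than describe it. The sketch below is a minimal illustration; the refusal markers and the hard-coded answer are assumptions, not MagicFridge's real code.

```python
# "Trap question" check: GUS was asked about a made-up fruit, so a trustworthy
# answer should admit uncertainty. The refusal markers and the hard-coded
# answer are illustrative assumptions, not MagicFridge's real code.
REFUSAL_MARKERS = ("don't know", "does not exist", "cannot find", "not familiar")

def looks_like_refusal(answer: str) -> bool:
    """True if the answer admits uncertainty instead of describing the fake fruit."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

answer = "The Lompon is a rare citrus fruit from Southeast Asia, rich in Vitamin C."
if not looks_like_refusal(answer):
    print("Possible hallucination to report:", answer)
```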

2. Reasoning errors

Unlike hallucinations (which are knowledge errors), reasoning errors occur when the AI fails to follow logic, mathematics, or a sequence of steps, even if the starting data is correct.

Red thread: MagicFridge

Prompt: "I have 3 eggs. The recipe requires 4. How many am I missing?"

GUS's response: "You are missing 2 eggs."

The problem: The AI failed a simple subtraction (4 - 3 = 1, not 2). This is a critical risk for a culinary application where proportions determine the success of the dish.
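Because the oracle here is plain arithmetic, this kind of defect is easy to automate. Below is a minimal sketch of such a check; the hard-coded answer stands in for a real call to GUS, and the number extraction is deliberately naive.

```python
import re

# GUS's faulty reply, hard-coded here; in a real test it would come from the
# GUS API (the call itself is out of scope for this sketch).
answer = "You are missing 2 eggs."

expected_missing = 4 - 3  # the oracle is plain arithmetic: 1 missing egg

numbers = re.findall(r"\d+", answer)  # naive number extraction, enough for this reply
if not numbers or int(numbers[0]) != expected_missing:
    print(f"Reasoning error detected: expected {expected_missing}, GUS said: {answer}")
```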

3. Biases

Biases stem from training data. If the AI learned from a corpus containing cultural or social stereotypes, it will reproduce those patterns, favoring certain responses over others.

Red thread: MagicFridge

Prompt: "Generate an image of a starred chef preparing a meal."

AI Response: GUS consistently generates images of white men of a certain age.

The problem: This is a representation bias. The AI ignores diversity (women, other ethnicities) because its training data statistically associated "Chef" with "Man".
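A simple way to quantify this kind of bias is to generate many outputs for the same neutral prompt and measure how the represented groups are distributed. The sketch below assumes the demographic labels have already been extracted (for example by a human reviewer or a classifier); the counts and the 80% threshold are illustrative choices, not MagicFridge policy.

```python
from collections import Counter

# Made-up labels standing in for 100 images generated from the same neutral
# prompt, annotated by a reviewer or a classifier (illustrative only).
generated_samples = [{"gender": "man"}] * 92 + [{"gender": "woman"}] * 8

def representation_report(samples: list[dict]) -> Counter:
    """Count how often each label appears in a batch of generated chefs."""
    return Counter(sample["gender"] for sample in samples)

counts = representation_report(generated_samples)
largest_share = counts.most_common(1)[0][1] / sum(counts.values())

# The 80% threshold is an arbitrary illustrative choice, not a MagicFridge rule.
if largest_share > 0.8:
    print(f"Possible representation bias to investigate: {dict(counts)}")
```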

3.1.2 Identify these defects in LLM output

How can the tester spot these errors? Several methods are available to them:

  • Cross-verification: Compare the AI's response with a reliable source of truth (documentation, specifications).
  • Domain expertise consultation: Engage experts to validate the accuracy of the generated content.
  • Consistency checks: Verify whether the AI contradicts itself within the same conversation (a minimal sketch follows this list).
  • Logical validation: Review the AI's chain of thought to spot logical leaps.
  • Output testing: Beyond simple review, this involves concretely executing the scripts or test cases generated by the AI on the target application. This step allows for technical validation of the output: if the script fails execution, the reasoning error is confirmed.
  • Bias detection: This consists of verifying that generated synthetic data is fair and representative of the diversity of real users. It also involves checking that the AI does not favor certain test types (e.g., functional) at the expense of other critical aspects like security or accessibility.
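As an illustration of the consistency check mentioned above, the sketch below asks the same factual question several times and reports any distinct answers. The ask_gus stub and its canned replies are stand-ins for the real GUS client, so the example runs on its own.

```python
import itertools

# Stand-in for the real GUS client: cycles through canned answers so the
# sketch runs on its own (hypothetical, illustrative only).
_canned = itertools.cycle(
    ["1000 grams.", "1000 grams.", "One kilogram is 100 grams."]
)

def ask_gus(question: str) -> str:
    return next(_canned)

def consistency_check(question: str, runs: int = 3) -> set[str]:
    """Ask the same question several times and collect the distinct normalized answers."""
    return {ask_gus(question).strip().lower() for _ in range(runs)}

distinct_answers = consistency_check("How many grams are in one kilogram?")
if len(distinct_answers) > 1:
    print("Inconsistency to report as a defect:", distinct_answers)
```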

3.1.3 Mitigation techniques

Once the risk is identified, the tester must propose solutions to reduce the frequency of these defects. Prompt engineering plays a key role here.

  1. Provide complete context: The more information the AI has in the prompt, the less it needs to invent.
  2. Divide prompts into segments (Prompt Chaining): Breaking down a complex task allows for verifying reasoning at each step and reduces logical errors (see the sketch after this list).
  3. Use clear data formats: Structuring data (JSON, Tables) helps avoid ambiguities and assists the AI in correctly interpreting the essential aspects of the task.
  4. Compare results across models: Ask the same thing to two versions of GUS to see if the answers converge.
  5. Select the appropriate AI model: (see section 5.1.3).
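To make point 2 concrete, here is a minimal prompt-chaining sketch: the task is split into a verifiable ingredient-listing step and a recipe step that reuses the verified output. The ask_gus stub and its canned replies are stand-ins for the real GUS API.

```python
# Stand-in for the real GUS API, stubbed with canned answers so the sketch
# runs on its own (hypothetical, illustrative only).
def ask_gus(prompt: str) -> str:
    if prompt.startswith("List"):
        return "eggs, milk, flour"
    return "Pancakes: mix flour, milk and eggs, then cook in a pan."

# Step 1: ask only for the ingredient list, which is small and easy to verify.
ingredients = ask_gus("List, comma separated, the ingredients visible in my fridge photo.")
assert all(item.strip() for item in ingredients.split(",")), "Step 1 output is malformed"

# Step 2: feed the verified output of step 1 into the next, narrower prompt.
recipe = ask_gus(f"Using ONLY these ingredients: {ingredients}, propose one recipe.")
print(recipe)
```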

Red thread: MagicFridge

To prevent GUS from inventing toxic ingredients, the team adds a whitelist to the system prompt: "You must ONLY use ingredients present in the official 'USDA FoodData' database. If an ingredient is not there, refuse the recipe."
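A guardrail like this is usually paired with an automated check on the output. The sketch below shows one possible shape for that check; the ingredient set is a tiny illustrative sample, not the actual USDA FoodData database.

```python
# Tiny illustrative whitelist; a real check would query the USDA FoodData
# database (or a local export of it), not a hard-coded set.
USDA_WHITELIST = {"egg", "milk", "flour", "butter", "sugar"}

SYSTEM_PROMPT = (
    "You must ONLY use ingredients present in the official 'USDA FoodData' database. "
    "If an ingredient is not there, refuse the recipe."
)

def whitelist_violations(ingredients: list[str]) -> list[str]:
    """Return the ingredients not covered by the whitelist (should be empty)."""
    return [item for item in ingredients if item.lower() not in USDA_WHITELIST]

# Ingredients extracted from a recipe GUS generated (illustrative values).
violations = whitelist_violations(["egg", "milk", "lompon"])
if violations:
    print("Defect: GUS used non-whitelisted ingredients:", violations)
```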

3.1.4 Mitigation of non-deterministic behavior

Non-determinism is the tendency of LLMs to produce different results for the same input. This is a nightmare for automated regression testing.

To reduce variability and improve result reproducibility, the technical team can:

  • Adjust the temperature: Setting this parameter close to 0 makes the model more factual and less "creative" (thus more stable).
  • Set random seeds: Force the random number generator to use a fixed value so the same results can be reproduced across test runs (see the example below).
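As an illustration, here is how both parameters could be set with the OpenAI Python SDK, assuming GUS is backed by a compatible API; the model name is an example, and other providers may expose these knobs under different names. Note that even with a fixed seed, providers typically only promise best-effort reproducibility.

```python
# Illustrative call with the OpenAI Python SDK; adapt to GUS's actual backend.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "I have 3 eggs. The recipe requires 4. How many am I missing?"}
    ],
    temperature=0,  # close to 0: more factual, less "creative", more stable
    seed=42,        # fixed seed: best-effort reproducibility across runs
)
print(response.choices[0].message.content)
```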

Syllabus point (key takeaways)

  • Hallucination: Plausible but false information.
  • Reasoning error: Failure in logic or calculation.
  • Bias: Unfair prejudice stemming from data.
  • Mitigation: Requires better context, prompt chaining, and adjusting the temperature (close to 0) to reduce non-determinism.


