Model Evaluation, Testing, and Validation

⏱ 18 min 📊 Advanced AIGP Certification Prep

Testing isn't just a technical activity — it's a governance activity. The AIGP exam tests whether you can define testing requirements, understand fairness metrics, and know when to apply red teaming.

Fairness Metrics

Multiple mathematical definitions of fairness exist — and they can be mutually exclusive. Governance professionals must understand the tradeoffs:

Demographic parity — The AI system's positive outcome rate should be equal across demographic groups. Example: The hiring AI recommends candidates at the same rate regardless of gender.

Equalized odds — The true positive rate AND false positive rate should be equal across groups. Example: A medical diagnostic AI detects disease equally accurately for all demographic groups.

Individual fairness — Similar individuals should receive similar outcomes. More intuitive but harder to measure — requires defining "similarity."

Predictive parity — The precision (positive predictive value) should be equal across groups. Example: When the AI flags a transaction as fraud, it's equally likely to be actual fraud regardless of the customer's demographic.

Key exam point: These metrics can conflict. A system that achieves demographic parity may not achieve equalized odds. The governance decision is: which fairness metric is most appropriate for this use case?

Knowledge Check

A loan approval AI achieves demographic parity — the approval rate is 40% for all racial groups. However, the false positive rate (approving loans that default) is significantly higher for one group. Which fairness metric reveals this disparity?

Equalized odds considers both true positive and false positive rates across groups. Demographic parity (equal approval rates) is satisfied, but equalized odds reveals that the false positive rates differ — meaning the model's errors are distributed unequally across groups.

Robustness Testing

AI systems must function reliably under adverse conditions:

Adversarial inputs — Deliberately crafted inputs designed to fool the AI. Examples: modified images that cause misclassification, perturbed text that changes sentiment analysis results.

Edge cases — Unusual but legitimate inputs that the AI may not have seen during training. Examples: rare medical conditions, unusual financial transactions.

Distribution shift — The input data in production differs from training data. Examples: seasonal changes in consumer behavior, new product categories.

Stress testing — How does the AI perform under high load, noisy data, or degraded conditions?

Governance requirement: Define robustness testing requirements proportionate to the risk level. High-risk AI systems should undergo comprehensive adversarial testing.

Security Testing for AI

AI introduces unique security vulnerabilities beyond traditional cybersecurity:

Prompt injection — Manipulating AI inputs to bypass safety controls or extract sensitive information. Critical for generative AI systems.

Data poisoning — Contaminating training data to cause the model to learn incorrect patterns. A supply chain attack on AI development.

Model extraction — Using the AI's outputs to reconstruct a copy of the model, potentially stealing trade secrets.

Membership inference — Determining whether a specific data point was in the training dataset, potentially revealing private information.

Model inversion — Using the model to reconstruct training data, potentially exposing personal information.

Knowledge Check

An attacker sends carefully crafted queries to a public AI API and uses the responses to build their own copy of the model. This attack is known as:

Model extraction uses the AI's inputs and outputs to reconstruct a functional copy of the model. This differs from prompt injection (manipulating input behavior), data poisoning (contaminating training data), and membership inference (determining if specific data was used for training).

Red Teaming for Generative AI

Red teaming involves adversarial testing by a dedicated team trying to make the AI system fail or produce harmful outputs.

For generative AI, red teaming focuses on:

- Harmful content generation — Can the system be prompted to produce dangerous, illegal, or toxic content?

- Bias and stereotypes — Does the system generate biased or stereotypical responses?

- Factual accuracy — Does the system confabulate (hallucinate) false information?

- Privacy leakage — Does the system reveal training data or personal information?

- Safety bypass — Can safety filters be circumvented through creative prompting?

Governance requirements for red teaming:

- Red team must be independent from the development team

- Testing must include diverse perspectives (demographics, expertise, languages)

- Results must be documented and addressed before deployment

- Red teaming is ongoing, not just pre-launch

Real-World Scenario

In February 2024, Google launched its Gemini image generation feature, which quickly drew widespread criticism for producing historically inaccurate and racially inappropriate images. When users asked for images of America's Founding Fathers or German soldiers from World War II, the system generated racially diverse depictions that were factually wrong. Google was forced to pause the image generation feature entirely within days of launch. An internal review revealed that the model's safety tuning — designed to promote diversity — had not been adequately tested against historically specific prompts, and the red teaming process had failed to include prompts grounded in historical context.

This incident illustrates several key AIGP concepts. First, benchmark performance alone is insufficient — Gemini scored well on standard image generation benchmarks but failed on adversarial, real-world prompts. Second, red teaming must include diverse perspectives and domain-specific scenarios, not just generic safety tests. Third, fairness interventions (in this case, diversity tuning) can themselves introduce new failure modes if not rigorously tested. Google's post-incident response included expanding its red teaming program to include historians, cultural experts, and a broader range of adversarial prompt categories before re-launching the feature.

For the AIGP exam, this scenario demonstrates why robustness testing, red teaming independence, and multi-stakeholder review are governance requirements — not optional enhancements. A governance professional reviewing this system should have required historically grounded adversarial testing as part of the pre-deployment evaluation plan.

Final Check

A generative AI system passes all standard performance benchmarks with high scores. Is red teaming still necessary?

Benchmarks measure expected performance under normal conditions. Red teaming specifically tests unexpected, adversarial, and edge-case scenarios that benchmarks don't cover. A model can score perfectly on benchmarks while still being vulnerable to prompt injection, bias, or harmful content generation.

🎯

Day 21 Complete

"Fairness metrics can conflict — the governance decision is which metric fits the use case. Red teaming tests adversarial scenarios that benchmarks miss. AI security testing must cover prompt injection, data poisoning, and model extraction."

Go Deeper

Want to see these concepts applied to full case studies? Check out AIGP Scenarios — 10 real-world governance simulations mapped to the AIGP exam domains.

Next Lesson

Documentation — Model Cards, Data Sheets, and AI Impact Assessments

→