Domain IV Capstone — Building a Monitoring and Incident Runbook — AIGP Certification Prep

Today's capstone integrates everything from Domain IV. You'll work through building a monitoring and incident runbook for a deployed AI system and answer 10 scenario-based practice questions.

Case Study — SafeScreen AI

Background: MedTech Solutions has deployed SafeScreen AI, a high-risk AI system that screens mammography images for potential breast cancer. The system provides a risk score (0–100) and a recommendation (further review, routine follow-up, or clear). It operates in hospitals across the EU and US.

Deployment model: Human-on-the-loop — radiologists review all cases flagged as "further review" (score > 70) before patient notification. Cases scored "clear" (score < 30) proceed through the standard workflow. Cases in the middle range (30–70) are queued for radiologist review within 48 hours.
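The routing rule above can be sketched in a few lines. This is a minimal illustration, not MedTech's implementation: the `Route` enum and `route_case` function are hypothetical names, and only the cutoffs (above 70, below 30, and 30 to 70 inclusive) come from the case study.

```python
from enum import Enum


class Route(Enum):
    FURTHER_REVIEW = "radiologist review before patient notification"
    QUEUED_REVIEW = "radiologist review within 48 hours"
    CLEAR = "standard workflow"


def route_case(score: float) -> Route:
    """Map a 0-100 risk score to a workflow route using the case-study cutoffs."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 70:
        return Route.FURTHER_REVIEW
    if score < 30:
        return Route.CLEAR
    return Route.QUEUED_REVIEW  # middle range, 30 to 70 inclusive


if __name__ == "__main__":
    for score in (12, 30, 70, 85):
        print(score, "->", route_case(score).value)
```

Writing the rule down this way makes the boundary cases explicit: a score of exactly 30 or exactly 70 lands in the 48-hour review queue, which matters when someone later proposes changing a cutoff.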

Current state: The system has been deployed for 3 months. No formal monitoring framework or incident response plan exists.

Guided Exercise: Monitoring Framework

Define the monitoring framework for SafeScreen AI:

Performance KPIs:

- Sensitivity (true positive rate): Baseline 94.5%, threshold: never below 92%

- Specificity (true negative rate): Baseline 88.2%, threshold: never below 85%

- False negative rate: Baseline 5.5%, threshold: alert above 6%, halt above 8%

- Processing time: Baseline 12 seconds, threshold: alert above 30 seconds
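As a rough sketch of how the performance KPIs above could be computed and checked, the example below derives each rate from confusion-matrix counts. The class and function names are hypothetical, and the counts are invented so that they reproduce the stated baselines; only the thresholds come from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # cancers correctly flagged
    fn: int  # cancers missed
    tn: int  # healthy cases correctly cleared
    fp: int  # healthy cases flagged


def sensitivity(c: ConfusionCounts) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    return c.tn / (c.tn + c.fp)


def false_negative_rate(c: ConfusionCounts) -> float:
    return c.fn / (c.tp + c.fn)  # always equals 1 - sensitivity


def kpi_breaches(c: ConfusionCounts, processing_seconds: float) -> list:
    """Return a description of every framework threshold that is breached."""
    breaches = []
    if sensitivity(c) < 0.92:
        breaches.append("sensitivity below 92%")
    if specificity(c) < 0.85:
        breaches.append("specificity below 85%")
    fnr = false_negative_rate(c)
    if fnr > 0.08:
        breaches.append("false negative rate above 8% (halt)")
    elif fnr > 0.06:
        breaches.append("false negative rate above 6% (alert)")
    if processing_seconds > 30:
        breaches.append("processing time above 30 seconds")
    return breaches


# Hypothetical counts chosen to match the baselines (94.5%, 88.2%, 5.5%).
baseline = ConfusionCounts(tp=189, fn=11, tn=882, fp=118)

if __name__ == "__main__":
    print(kpi_breaches(baseline, processing_seconds=12.0))
```

Because the false negative rate is the complement of sensitivity, the 92% sensitivity floor and the 8% halt threshold describe the same boundary. The 6% alert threshold is the earlier warning.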

Fairness metrics:

- Sensitivity by age group (under 40, 40–60, over 60): gap threshold ≤ 3 percentage points

- Sensitivity by ethnicity: gap threshold ≤ 3 percentage points

- False negative rate by demographics: gap threshold ≤ 2 percentage points
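One way to check the fairness metrics above is to compare the best and worst group on each metric. The sketch below assumes the gap thresholds are measured in percentage points between the highest and lowest group, which the framework does not spell out. The function names and the sample rates are hypothetical.

```python
def max_gap(rates_by_group: dict) -> float:
    """Largest difference between any two groups, in percentage points."""
    if not rates_by_group:
        raise ValueError("no group rates supplied")
    values = list(rates_by_group.values())
    return max(values) - min(values)


def gap_exceeds(rates_by_group: dict, limit_points: float) -> bool:
    """True when the spread across groups is larger than the allowed gap."""
    return max_gap(rates_by_group) > limit_points


# Hypothetical sensitivity figures (percent) for the three age bands.
sensitivity_by_age = {"under 40": 91.0, "40-60": 94.8, "over 60": 93.9}

if __name__ == "__main__":
    print(round(max_gap(sensitivity_by_age), 1))
    print(gap_exceeds(sensitivity_by_age, limit_points=3.0))
```

Comparing the extremes is a deliberately simple choice. A production framework would also need minimum sample sizes per group, since a small group can show a large gap by chance.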

Drift detection:

- Input data distribution comparison: weekly statistical tests

- Score distribution monitoring: flag if mean risk score shifts more than 10%

- Confidence score monitoring: flag if average confidence drops below 80%
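The two numeric drift checks above can be sketched with the standard library alone. This example assumes that "shifts more than 10%" means a relative change in the mean versus a baseline window, and that confidence is stored as a fraction between 0 and 1. Both are assumptions, and the function names are hypothetical. The weekly statistical test on input data is not shown; a two-sample test such as Kolmogorov-Smirnov is one common option.

```python
from statistics import mean


def mean_shift_fraction(baseline_scores, current_scores) -> float:
    """Relative change in the mean risk score versus the baseline window."""
    baseline_mean = mean(baseline_scores)
    if baseline_mean == 0:
        raise ValueError("baseline mean is zero; relative shift is undefined")
    return abs(mean(current_scores) - baseline_mean) / baseline_mean


def drift_flags(baseline_scores, current_scores, confidences) -> list:
    """Return the drift conditions from the framework that are triggered."""
    flags = []
    if mean_shift_fraction(baseline_scores, current_scores) > 0.10:
        flags.append("mean risk score shifted more than 10%")
    if mean(confidences) < 0.80:
        flags.append("average confidence below 80%")
    return flags


if __name__ == "__main__":
    # Invented sample windows, for illustration only.
    print(drift_flags([20, 30, 40, 50], [25, 35, 45, 55], [0.90, 0.70, 0.75]))
```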

Alert levels:

- Green: all metrics within normal parameters

- Yellow: one or more metrics approaching thresholds — investigate within 24 hours

- Red: thresholds exceeded — escalate immediately, consider system halt
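The three alert levels can be expressed as a small classification function. The framework does not define how close "approaching" is, so the `warn_margin` parameter below is an assumption, as are the names `Alert`, `alert_for_floor`, and `overall_alert`. The sketch covers metrics with a floor, such as sensitivity and specificity.

```python
from enum import Enum


class Alert(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


_SEVERITY = [Alert.GREEN, Alert.YELLOW, Alert.RED]


def alert_for_floor(value: float, floor: float, warn_margin: float) -> Alert:
    """Classify a metric that must stay at or above `floor`."""
    if value < floor:
        return Alert.RED  # threshold exceeded
    if value < floor + warn_margin:
        return Alert.YELLOW  # approaching the threshold
    return Alert.GREEN


def overall_alert(levels) -> Alert:
    """The system-wide level is the worst level across all metrics."""
    return max(levels, key=_SEVERITY.index, default=Alert.GREEN)


if __name__ == "__main__":
    levels = [
        alert_for_floor(94.5, floor=92.0, warn_margin=1.0),  # sensitivity
        alert_for_floor(85.6, floor=85.0, warn_margin=1.0),  # specificity
    ]
    print(overall_alert(levels).value)
```

Taking the worst level across metrics means a single red metric cannot be masked by several green ones.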

Scenario Question 1

SafeScreen AI's false negative rate for patients over 60 increases from 5.5% to 7.2% while remaining stable for other age groups. What is the MOST appropriate initial response?

The false negative rate for patients over 60 (7.2%) exceeds the 6% alert threshold but remains below the 8% halt threshold, so a yellow alert with investigation within 24 hours is appropriate. The investigation should determine the cause (data drift or a change in the patient population, for example) and whether targeted action, such as pausing the system for that group or retraining, is needed. The overall rate may still be acceptable, but the demographic disparity requires investigation.
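The arithmetic behind this answer can be checked directly against the framework thresholds. The sketch below assumes the other age groups sit at the 5.5% baseline (the scenario says only that they are stable) and reads the demographic gap limit as 2 percentage points. The constant and function names are hypothetical.

```python
ALERT_FNR = 6.0   # percent, alert threshold from the framework
HALT_FNR = 8.0    # percent, halt threshold from the framework
GAP_LIMIT = 2.0   # percentage points, assumed reading of the fairness threshold


def fnr_tier(fnr_percent: float) -> str:
    """Place a false negative rate into the framework's tiers."""
    if fnr_percent > HALT_FNR:
        return "halt"
    if fnr_percent > ALERT_FNR:
        return "alert"
    return "normal"


over_60 = 7.2
other_groups = 5.5  # assumption: stable at the overall baseline
gap = over_60 - other_groups

if __name__ == "__main__":
    print(fnr_tier(over_60))
    print(round(gap, 1), gap <= GAP_LIMIT)
```

Under these assumptions the group rate is in the alert tier, and the gap between groups is still inside the fairness limit but close to it, which supports investigating promptly rather than halting outright.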

Scenario Question 2

A radiologist reports that SafeScreen AI gave a "clear" score (28/100) to a mammogram that the radiologist identified as clearly suspicious. Investigation reveals this was an accurate radiologist assessment — the AI missed a visible mass. What incident classification is appropriate?

A false negative in cancer screening is critical severity regardless of whether it was caught. The human caught it this time, but the system's safety depends on the AI's reliability — if the "clear" queue bypasses radiologist review, similar cases could be missed. This must be treated as a critical incident.

Scenario Question 3

The incident investigation reveals that the false negative involved a specific type of lesion that was underrepresented in the training data. What governance action addresses the ROOT CAUSE?

The root cause is data underrepresentation. While adjusting thresholds or adding review may be appropriate interim measures, they don't address the fundamental issue. Assessing training data gaps and planning targeted retraining addresses the root cause directly.

Scenario Question 4

MedTech Solutions operates SafeScreen AI in EU hospitals. After the critical incident, what EU AI Act obligation is triggered?

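A false negative in cancer screening that could have resulted in delayed diagnosis is treated here as a serious incident under the EU AI Act. Under Article 73, the provider (MedTech) must report it to the market surveillance authority of the Member State where the incident occurred immediately after establishing a causal link between the AI system and the incident, and no later than 15 days after becoming aware of it. Shorter deadlines apply in the gravest cases: 10 days where a person has died, and 2 days for a widespread infringement or a serious disruption of critical infrastructure. The deployer has obligations too: a hospital that identifies a serious incident must immediately inform the provider, and then the importer or distributor and the relevant market surveillance authorities (Article 26(5)).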

Scenario Question 5

After retraining SafeScreen AI with additional lesion data, MedTech wants to deploy the updated model immediately to fix the identified safety issue. What governance step must NOT be skipped?

A retrained model is a new model. Even under time pressure from a safety issue, full testing cannot be skipped. The retrained model may perform differently across demographic groups, may have different error patterns, or may introduce new issues. If urgency requires it, the old model can be paused while the new one is properly tested.
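A pre-deployment gate for the retrained model might look like the sketch below. It reuses the framework's floors for sensitivity and specificity and its 3-point sensitivity gap, read here as percentage points. The function name, the dictionary layout, and the sample figures are all hypothetical, and a real validation plan would cover far more than these three checks.

```python
FLOORS = {"sensitivity": 92.0, "specificity": 85.0}  # percent
MAX_SENSITIVITY_GAP = 3.0  # percentage points across demographic groups


def release_blockers(overall: dict, sensitivity_by_group: dict) -> list:
    """List every reason the candidate model may not be deployed yet."""
    blockers = []
    for name, floor in FLOORS.items():
        value = overall.get(name)
        if value is None:
            blockers.append(f"{name}: not measured")
        elif value < floor:
            blockers.append(f"{name}: {value} is below the floor of {floor}")
    if not sensitivity_by_group:
        blockers.append("sensitivity by group: not measured")
    else:
        values = list(sensitivity_by_group.values())
        gap = max(values) - min(values)
        if gap > MAX_SENSITIVITY_GAP:
            blockers.append(f"sensitivity gap across groups: {gap:.1f} points")
    return blockers


if __name__ == "__main__":
    # Invented validation results for a retrained candidate.
    candidate = {"sensitivity": 95.1, "specificity": 88.0}
    by_age = {"under 40": 94.0, "40-60": 95.5, "over 60": 95.0}
    print(release_blockers(candidate, by_age))
```

Treating a missing measurement as a blocker is the key design choice: under time pressure, the likeliest failure is a test that was never run, not a test that failed.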

Scenario Question 6

Six months after deployment, monitoring shows that SafeScreen AI's specificity has gradually declined from 88.2% to 84.1%. The yellow alert threshold is 85%. What has likely occurred?

A gradual decline that ends below the threshold strongly suggests drift. Changes in imaging equipment or patient population shift the input data distribution (data drift), while changes in clinical protocols can alter the relationship between an image and its correct label (concept drift). Specificity has now crossed the 85% threshold, requiring investigation and likely retraining.
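A monitoring log makes the difference between gradual drift and a sudden failure visible. In the sketch below, only the endpoints (88.2% and 84.1%) and the 85% threshold come from the scenario. The intermediate monthly readings are invented for illustration, and the function names are hypothetical.

```python
def first_breach(readings, floor):
    """Index of the first reading below `floor`, or None if there is none."""
    for index, value in enumerate(readings):
        if value < floor:
            return index
    return None


def largest_single_drop(readings) -> float:
    """Biggest fall between consecutive readings, in percentage points."""
    return max((a - b for a, b in zip(readings, readings[1:])), default=0.0)


# Hypothetical monthly specificity readings (percent).
monthly_specificity = [88.2, 87.6, 86.9, 86.1, 85.3, 84.1]

if __name__ == "__main__":
    print(first_breach(monthly_specificity, floor=85.0))
    print(round(largest_single_drop(monthly_specificity), 1))
```

A series of small drops with no single large one is the pattern associated with drift. One large drop would point instead to a discrete event such as a software update or a change of scanner.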

Scenario Question 7

A hospital (deployer) modifies SafeScreen AI's scoring thresholds without consulting MedTech (provider), changing the "clear" cutoff from 30 to 15. Under the EU AI Act, the hospital:

Changing scoring thresholds can constitute a substantial modification under the EU AI Act, especially for a high-risk AI system. This could shift the hospital from deployer to provider status, triggering conformity assessment and documentation obligations.

Scenario Question 8

The governance team decides to retire SafeScreen AI v1 after deploying v2. What documentation must be retained?

Regulatory inquiries, legal claims, and audit requests related to v1's decisions may arise long after retirement. Complete lifecycle documentation must be archived according to regulatory retention requirements. Medical AI documentation may need to be retained for years.

Scenario Question 9

After the retraining and redeployment of SafeScreen AI v2, automation bias is a concern. Radiologists who relied on v1's recommendations may over-trust v2. What governance measure is MOST appropriate?

Addressing automation bias requires training (so radiologists understand the new model's behavior) and ongoing testing (to confirm that radiologists maintain independent judgment). Switching to human-in-the-loop (HITL) review may not be scalable. Artificially reducing confidence scores undermines trust calibration. Removing recommendations may reduce the AI's clinical value.

Real-World Scenario

In 2020, the UK exam grading controversy demonstrated the catastrophic consequences of deploying a high-risk AI system without adequate monitoring or incident response. When COVID-19 cancelled A-level exams, the Office of Qualifications and Examinations Regulation (Ofqual) deployed an algorithm to predict students' grades based on their schools' historical performance and teachers' predicted grades. The algorithm systematically downgraded nearly 40% of teacher-predicted grades, disproportionately affecting students at state schools and in disadvantaged areas while inflating grades at elite private schools. The system effectively perpetuated socioeconomic inequality at scale, affecting university admissions for hundreds of thousands of students.

The governance failures in this case map directly to the SafeScreen AI capstone scenario. First, there was no adequate monitoring framework: Ofqual did not track outcomes by school type or socioeconomic indicators in real time. Second, the incident response was disastrously slow: despite immediate public outcry and clear evidence of disparate impact, the algorithm was defended for several days after results were published before the UK government reversed the grades and reverted to teacher predictions. Third, the system lacked a rollback plan: the reversion to teacher predictions was an emergency measure, not a pre-planned fallback. The eventual U-turn affected university admissions that had already been processed, creating cascading administrative chaos.

For the AIGP exam, the UK grading algorithm case is a powerful example of why monitoring frameworks must include disaggregated fairness metrics, why incident response plans must define severity levels and response timelines before deployment, and why rollback procedures must be tested and ready. An AIGP auditor reviewing this system would have flagged the same critical gap identified in Scenario Question 10: the absence of an incident response plan for a high-risk AI system affecting fundamental rights.

Scenario Question 10

MedTech is preparing for an AIGP audit of SafeScreen AI. Which of the following would an auditor consider the MOST critical governance gap?

The absence of an incident response plan for a high-risk AI system is a critical governance gap. The other answer options (the marketing materials, Python version, and sustainability statement) are secondary concerns. For a safety-critical medical AI system deployed in the EU, an incident response plan is a fundamental governance requirement.

Day 29 Complete

"Post-deployment governance is where principles meet production reality. Monitoring frameworks, incident response plans, and lifecycle governance aren't optional — they're the difference between responsible AI and a liability."
