Welcome to Domain IV — the domain most candidates under-prepare for. Post-deployment governance is where governance meets reality. An AI system that passes all pre-deployment tests can still fail in production. Monitoring catches what testing misses.
Key performance metrics for deployed AI:
- Accuracy metrics — Is the model performing as expected? Track overall accuracy, precision, recall, and F1 score against baseline values established during testing.
- Latency — How fast does the model respond? Rising response times can indicate infrastructure issues or model complexity problems.
- Throughput — How many decisions is the model processing? Unexpected spikes or drops may signal upstream issues.
- Error rates — Track error types (false positives, false negatives) and their distribution. A sudden increase in a specific error type may indicate model degradation.
- Business outcome metrics — Connect AI performance to business outcomes. If the AI is approving loans, track default rates; if it's screening resumes, track hiring success rates.
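The baseline comparison above can be sketched in a few lines. This is a minimal illustration, not a production monitoring system: the metric names and the 10% relative-degradation tolerance are assumptions you would replace with your own risk tolerance.

```python
# Minimal sketch: compare production classification metrics to a
# baseline established during pre-deployment testing.
# The 10% relative tolerance is illustrative, not a standard.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def degraded(baseline: dict, production: dict, tolerance: float = 0.10) -> list:
    """Return names of metrics that fell more than `tolerance`
    (relative) below their baseline value."""
    return [
        name for name, base in baseline.items()
        if production.get(name, 0.0) < base * (1 - tolerance)
    ]

baseline = {"accuracy": 0.92, "precision": 0.88, "recall": 0.85,
            "f1": f1(0.88, 0.85)}
production = {"accuracy": 0.90, "precision": 0.75, "recall": 0.84,
              "f1": f1(0.75, 0.84)}
print(degraded(baseline, production))  # → ['precision']
```

Note that only precision breaches the tolerance here; F1 dips but stays within 10% of its baseline, which is exactly why tracking each metric separately matters.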
Drift is the silent killer of AI systems. The model doesn't change — the world does.
Data drift (covariate shift) — The input data distribution changes from what the model was trained on. Example: A fraud detection model trained on in-store transaction patterns encounters a surge in mobile payments.
Concept drift — The relationship between inputs and outputs changes. Example: Customer behavior patterns that predicted churn in 2023 no longer predict churn in 2026 due to market changes.
Detection methods:
- Statistical tests comparing production data distributions to training data distributions
- Monitoring prediction confidence scores — declining confidence may indicate drift
- Tracking performance metrics over time — gradual degradation suggests concept drift
- Periodic re-evaluation on labeled production data
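One common statistical test for the first detection method is the Population Stability Index (PSI), which compares bucketed production and training distributions. The sketch below is self-contained and illustrative; the bucket count and the conventional "PSI above 0.2 indicates significant shift" rule of thumb should be tuned to your own context.

```python
# Sketch of Population Stability Index (PSI), a common statistical
# check for data drift between training and production distributions.
import math

def psi(expected: list, actual: list, buckets: int = 10) -> float:
    """PSI between a reference sample and a production sample."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / buckets
    edges = [lo + i * step for i in range(1, buckets)]

    def proportions(values):
        counts = [0] * buckets
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Small floor avoids log(0) for empty buckets.
        return [max(c / len(values), 1e-4) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]         # roughly uniform on [0, 1)
prod  = [(i / 100) ** 2 for i in range(100)]  # skewed toward 0
print(f"PSI = {psi(train, prod):.2f}")  # well above the 0.2 rule of thumb
```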
Governance response to drift:
- Define drift thresholds that trigger alerts
- Establish escalation procedures for significant drift
- Define retraining criteria and approval processes
- Document all drift events and responses
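The governance steps above can be encoded as a simple policy object: thresholds trigger alerts or escalation, and every event is documented regardless of outcome. Class names, threshold values, and wording are all assumptions for illustration, not a standard API.

```python
# Illustrative drift-response policy: thresholds trigger alerts or
# escalation, and every evaluation is logged (documented), even when
# no action is needed. All names and values are assumptions.
from dataclasses import dataclass, field

@dataclass
class DriftPolicy:
    alert_threshold: float = 0.10    # drift score that triggers investigation
    retrain_threshold: float = 0.25  # drift score that triggers escalation
    log: list = field(default_factory=list)

    def respond(self, metric: str, score: float) -> str:
        if score >= self.retrain_threshold:
            action = "escalate: submit retraining request for approval"
        elif score >= self.alert_threshold:
            action = "alert: notify model owner and investigate"
        else:
            action = "none: within tolerance"
        self.log.append({"metric": metric, "score": score, "action": action})
        return action

policy = DriftPolicy()
print(policy.respond("psi_income", 0.31))  # escalate
print(policy.respond("psi_age", 0.05))     # none, but still logged
print(len(policy.log))                     # → 2
```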
Pre-deployment bias testing is necessary but not sufficient. Fairness must be monitored continuously:
Why production fairness differs from test fairness:
- Production data may have different demographic distributions
- Data drift may affect demographic groups unequally
- Real-world feedback loops can amplify initial biases over time
- User behavior may interact with the AI in unexpected ways
Fairness monitoring checklist:
- Track the same fairness metrics used in pre-deployment testing
- Monitor outcomes by demographic group on an ongoing basis
- Set alert thresholds for disparities exceeding defined tolerance levels
- Conduct periodic fairness audits on production data (not just training data)
- Document all fairness monitoring results and any corrective actions
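The checklist above can be sketched as ongoing group-outcome monitoring. Here the metric is a selection-rate ratio with a four-fifths-style tolerance; the group labels, data, and the 0.8 threshold are illustrative. In practice, use the same fairness metrics chosen during pre-deployment testing.

```python
# Sketch of ongoing fairness monitoring: selection rates by
# demographic group, flagged against a disparity tolerance.
# Groups, data, and the 0.8 ratio threshold are illustrative.

def selection_rates(decisions):
    """decisions: iterable of (group, approved) pairs."""
    totals, approvals = {}, {}
    for group, approved in decisions:
        totals[group] = totals.get(group, 0) + 1
        approvals[group] = approvals.get(group, 0) + int(approved)
    return {g: approvals[g] / totals[g] for g in totals}

def disparity_alerts(rates, tolerance=0.8):
    """Flag groups whose rate falls below `tolerance` x the highest
    group's rate (a four-fifths-style ratio check)."""
    top = max(rates.values())
    return [g for g, r in rates.items() if r < tolerance * top]

decisions = [("A", True)] * 60 + [("A", False)] * 40 \
          + [("B", True)] * 40 + [("B", False)] * 60
rates = selection_rates(decisions)
print(rates)                    # → {'A': 0.6, 'B': 0.4}
print(disparity_alerts(rates))  # → ['B']  (0.4 < 0.8 * 0.6)
```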
Effective monitoring requires structured alerting:
Green/Yellow/Red alert system:
- Green — All metrics within normal operating parameters
- Yellow — One or more metrics approaching threshold values. Trigger investigation.
- Red — Metrics exceed defined thresholds. Trigger escalation and potential system intervention.
Alert configuration:
- Define thresholds for each metric based on risk tolerance
- Differentiate between gradual degradation and sudden changes
- Route alerts to appropriate stakeholders based on severity
- Avoid alert fatigue by tuning thresholds carefully
- Document all alert events, investigations, and outcomes
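The green/yellow/red system with per-metric thresholds and severity-based routing can be sketched as follows. The metric names, threshold values, and recipient names are illustrative assumptions; the point is that each metric carries its own yellow and red limits and that routing is driven by severity.

```python
# Sketch of a green/yellow/red alert tier with per-metric thresholds
# and severity-based routing. All names and values are illustrative.
THRESHOLDS = {
    # metric: (yellow_limit, red_limit) — higher value = worse
    "error_rate": (0.05, 0.10),
    "psi_drift":  (0.10, 0.25),
}
ROUTES = {"yellow": "model-owner", "red": "governance-committee"}

def classify(metric: str, value: float) -> str:
    """Map a metric value to a green/yellow/red status."""
    yellow, red = THRESHOLDS[metric]
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"

def route(metric: str, value: float):
    """Return (status, recipient); green raises no alert."""
    status = classify(metric, value)
    if status == "green":
        return status, None
    return status, ROUTES[status]

print(route("error_rate", 0.03))  # → ('green', None)
print(route("psi_drift", 0.12))   # → ('yellow', 'model-owner')
print(route("error_rate", 0.11))  # → ('red', 'governance-committee')
```

Keeping the thresholds in one table makes tuning them against alert fatigue a configuration change rather than a code change.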