Now that you understand the threat landscape from Day 5, it is time to implement defenses. This lesson covers CY0-001 Objective 2.2 — the specific security controls you apply to AI models and the gateways that sit in front of them. Think of this lesson as building the defensive perimeter around your AI systems.
The exam tests two categories of controls: model controls (applied directly to the AI model) and gateway controls (applied at the infrastructure layer between users and the model). You need to understand what each control does, when to use it, and where it falls short.
Guardrails are constraints built into or around an AI model to prevent undesirable behavior. They operate at the model layer — meaning the model itself enforces them, not external infrastructure.
Content guardrails restrict the types of content the model will generate. A production chatbot might have guardrails preventing it from generating explicit content, medical advice, legal opinions, or instructions for harmful activities. These guardrails are typically implemented through system prompts, fine-tuning, or reinforcement learning from human feedback (RLHF).
Behavioral guardrails constrain what the model can do, not just what it says. For example, a guardrail might prevent the model from executing code, accessing external URLs, or modifying files — even if the user requests it.
Topic guardrails keep the model focused on its intended domain. A customer service bot for a bank should only discuss banking topics. If a user asks it to write poetry or solve math problems, topic guardrails redirect the conversation.
The critical exam point about guardrails: they are not foolproof. Guardrails can be bypassed through jailbreaking, prompt injection, and creative prompt engineering. They are a layer of defense, not a complete solution. The exam will present scenarios where guardrails fail and ask you to identify compensating controls.
Prompt templates are pre-defined structures that wrap user input before it reaches the model. Instead of passing raw user text directly to the model, the application inserts user input into a controlled template.
A basic template might look like: "You are a banking assistant. Answer only questions about accounts, transfers, and bank products. Respond in English only. The customer says: [USER INPUT]."
Templates serve several security functions. They establish the model's role and boundaries. They constrain the model's output format. They can include explicit instructions to refuse certain requests. And they separate the system context from user input, making prompt injection more difficult (though not impossible).
Template hardening involves making templates resistant to injection. Techniques include placing security instructions at both the beginning and end of the template (so injected instructions cannot override them), using delimiters to clearly separate system and user content, and including explicit refusal instructions for common attack patterns.
The exam tests template design: given a scenario, which template configuration best prevents the described attack?
A prompt firewall sits between the user and the model, inspecting and filtering inputs before they reach the AI system. Think of it as a web application firewall (WAF) for AI.
Prompt firewalls perform several functions. Pattern matching detects known prompt injection patterns — phrases like "ignore previous instructions," "you are now," or "system prompt override." Semantic analysis goes beyond pattern matching to understand the intent of a prompt. Even if the user avoids known injection phrases, semantic analysis can detect that the prompt is attempting to manipulate the model. Content filtering blocks prompts containing prohibited content — profanity, PII, classified information, or content that violates organizational policy.
Prompt firewalls have limitations. New injection techniques may bypass pattern matching until signatures are updated. Semantic analysis can produce false positives, blocking legitimate queries. And sophisticated attackers can encode their injections in ways that bypass both pattern and semantic analysis — using character encoding tricks, language switching, or multi-step attacks spread across multiple prompts.
These three gateway controls address abuse and resource exhaustion.
Rate limits restrict how many requests a user or API key can make within a time window — for example, 60 requests per minute. Rate limits prevent brute-force attacks, model extraction through rapid querying, and resource exhaustion from automated scripts. They are the simplest and most widely deployed gateway control.
Token limits restrict the size of individual requests and responses. An input token limit of 4,096 tokens prevents users from submitting massive prompts designed to overwhelm the model or extract training data. An output token limit prevents the model from generating unbounded responses that consume excessive compute.
Input quotas set cumulative limits — total tokens, total requests, or total cost per user per billing period. Quotas prevent sustained abuse that stays under rate limits but accumulates over time. A user sending 59 requests per minute (just under a 60/minute rate limit) for 24 hours straight would consume enormous resources without triggering rate limits — but would exceed reasonable quotas.
The exam expects you to understand when each control is appropriate. Rate limits protect against burst attacks. Token limits protect against oversized individual requests. Quotas protect against sustained, low-rate abuse.
Modality limits restrict the types of input a model will accept. A text-only model should not accept images, audio, or file uploads — even if the underlying architecture supports multimodal inputs.
Modality limits are a security control because each input type introduces its own attack surface. Image inputs can contain steganographic data or adversarial perturbations. Audio inputs can include hidden commands. File uploads can contain embedded instructions for indirect prompt injection.
If your AI system only needs to process text, disabling all other modalities eliminates entire categories of attacks. This is an application of the principle of least functionality — disable everything that is not required.
AI models are typically accessed through APIs, and those APIs need the same security controls as any other API — plus AI-specific additions.
Authentication verifies the identity of API callers. API keys, OAuth tokens, and mutual TLS are all applicable. The exam expects you to know that API keys alone are often insufficient because they can be leaked or shared.
Authorization determines what an authenticated caller can do. Different roles might have different permissions: read-only users can query the model but not fine-tune it; administrators can modify guardrails and deploy new versions; developers can access evaluation metrics but not production traffic.
Input validation at the API layer catches malformed requests before they reach the model. This includes verifying content types, checking request sizes, validating required fields, and rejecting unexpected parameters.
Red-teaming AI systems means deliberately attempting to bypass security controls — prompt injection, jailbreaking, guardrail evasion, data extraction, and abuse. This should be done before production deployment and regularly afterward.
Effective AI red-teaming covers multiple attack categories: direct and indirect prompt injection, jailbreaking through role-play and persona manipulation, data extraction through targeted questioning, guardrail bypass through creative phrasing, and abuse scenarios that stretch the model's intended use.
The exam emphasizes that red-teaming is not optional and should not be conducted only by the team that built the model. External red teams bring fresh perspectives and are less likely to share the development team's blind spots.