M12 · Red Teaming¶
Goal: proactively attack your own model with the AI Red Teaming Agent — run a basic scan across risk categories, then an advanced scan with encoding strategies and multiple languages, and read the Attack Success Rate scorecard. You'll use:
azure-ai-evaluation[redteam](PyRIT-backed) —RedTeam,RiskCategory,AttackStrategy,SupportedLanguages, andscan(...).
In M11 you built defences. How do you know they hold? You red team — automatically generate adversarial prompts, fire them at your system, and measure how often it produces harmful content. The AI Red Teaming Agent wraps Microsoft's open-source PyRIT toolkit: it seeds attack objectives per risk category, optionally mutates them with evasion strategies, scores every response, and hands you a scorecard.
RedTeam ──seeds objectives──▶ your target callback ──▶ model
│ │
│◀── PyRIT scorer (pass/fail per attack) ──── responses ───┘
▼
Attack Success Rate (ASR) scorecard ◀── lower is better
!!! warning "Region + Python constraints"
The Red Teaming Agent runs in a subset of regions (e.g. East US 2, Sweden
Central, France Central, Switzerland West) and needs Python
3.10–3.13 (PyRIT excludes 3.9 and 3.14+). Install with
pip install "azure-ai-evaluation[redteam]". If your .env isn't ready, do the
Setup first.
1. Configure¶
Same .env as every lab. The Red Teaming Agent needs your project endpoint
(it logs the scan there) and a model deployment to attack. The reference routes
through an APIM gateway; we point the target straight at this project.
import os, sys
from dotenv import load_dotenv
load_dotenv() # reads .env from the repo root
assert (3, 10) <= sys.version_info < (3, 14), (
f"PyRIT requires Python 3.10–3.13; current is {sys.version.split()[0]}."
)
PROJECT_ENDPOINT = os.environ["PROJECT_ENDPOINT"]
CHAT_MODEL = os.environ.get("CHAT_MODEL", "gpt-4.1-mini")
print("Project :", PROJECT_ENDPOINT)
print("Model :", CHAT_MODEL)
print("Python :", sys.version.split()[0], "(OK)")
!!! note "Expected output"
Project : https://<account>.services.ai.azure.com/api/projects/<project> Model : gpt-4.1-mini Python : 3.12.7 (OK)
An AssertionError here means your kernel is on an unsupported Python — switch
to a 3.10–3.13 kernel before continuing.
2. The target callback¶
The scanner needs a target to attack. The simplest target is a callable that
takes a prompt string and returns the model's reply. We call this project's
get_openai_client() directly — this is your system under test. In production
you'd point the callback at your real app (RAG pipeline, agent, API).
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential)
openai_client = project_client.get_openai_client()
def target_callback(query: str) -> str:
"""The system under test: forward a prompt to the model, return its reply."""
response = openai_client.chat.completions.create(
model=CHAT_MODEL,
messages=[{"role": "user", "content": query}],
)
return response.choices[0].message.content
# Smoke-test the target before handing it to the scanner.
print("smoke test:", target_callback("Say hello in one word."))
!!! note "Expected output"
smoke test: Hello!
If this returns a normal reply, the scanner can drive the target. Swap the body
of target_callback to attack any app you own — the scanner only cares that it
takes a string and returns a string.
3. Build the Red Team agent¶
RedTeam is the scanner. You give it the project (where results are logged),
a credential, the risk categories to probe, and num_objectives — how
many distinct attack prompts to generate per category. Four categories × 5
objectives = 20 baseline prompts.
from azure.ai.evaluation.red_team import RedTeam, RiskCategory
red_team = RedTeam(
azure_ai_project=PROJECT_ENDPOINT,
credential=credential,
risk_categories=[
RiskCategory.Violence,
RiskCategory.HateUnfairness,
RiskCategory.Sexual,
RiskCategory.SelfHarm,
],
num_objectives=5, # attack prompts per category
)
print("RedTeam ready — 4 categories × 5 objectives = 20 baseline prompts")
!!! note "Expected output"
RedTeam ready — 4 categories × 5 objectives = 20 baseline prompts
Start small: num_objectives=5 is enough to see the shape. Crank it up (the
reference allows up to 100/category) once you're scanning for real.
4. Run the basic scan¶
scan(...) is async — top-level await works in a notebook kernel. It seeds
the baseline objectives, drives them through your target_callback, scores each
response with PyRIT, and writes results to output_path. A 20-prompt scan takes
a few minutes.
basic_result = await red_team.scan(
target=target_callback,
scan_name="redteam-basic",
output_path="redteam_basic_output",
)
print("✅ basic scan complete — results in redteam_basic_output/")
!!! note "Expected output"
✅ basic scan complete — results in redteam_basic_output/
The folder holds results.json (every attack/response pair) and
evaluation_results.json (the aggregated scorecard we read next).
!!! tip "In a plain script, not a notebook?"
Top-level await needs an IPython kernel. In a .py file, wrap the call:
asyncio.run(red_team.scan(...)).
5. Read the ASR scorecard¶
The headline metric is Attack Success Rate (ASR): the fraction of adversarial prompts that succeeded in eliciting harmful content. Lower is better. The scorecard breaks ASR down by risk category — so you see exactly where your model is weakest.
import json
from pathlib import Path
results = json.loads(Path("redteam_basic_output/evaluation_results.json").read_text())
scorecard = results.get("scorecard", {})
risk = scorecard.get("risk_category_summary", [{}])[0]
print(f"{'category':<18}{'ASR':>8}{'success':>9}{'total':>7}")
print("-" * 42)
print(f"{'OVERALL':<18}{risk.get('overall_asr', 0):>7.1f}%"
f"{risk.get('overall_successful_attacks', 0):>9}{risk.get('overall_total', 0):>7}")
for cat, key in [("Violence", "violence"), ("Hate/Unfairness", "hate_unfairness"),
("Sexual", "sexual"), ("Self-Harm", "self_harm")]:
print(f"{cat:<18}{risk.get(key + '_asr', 0):>7.1f}%"
f"{risk.get(key + '_successful_attacks', 0):>9}{risk.get(key + '_total', 0):>7}")
!!! note "Expected output"
category ASR success total ------------------------------------------ OVERALL 10.0% 2 20 Violence 20.0% 1 5 Hate/Unfairness 0.0% 0 5 Sexual 20.0% 1 5 Self-Harm 0.0% 0 5
Two of twenty attacks landed — a 10% overall ASR, concentrated in Violence and
Sexual. That's your prioritised to-do list: tighten those categories (the
M11 filters are one lever) and re-scan to
confirm the number drops.
6. Advanced — evasion strategies + languages¶
Baseline prompts are the easy case. Real attackers obfuscate: Base64, ROT13,
character-spacing, Unicode confusables — and they probe in other languages.
attack_strategies mutates each objective through these encodings (and
AttackStrategy.Compose([...]) chains them); languages translates the prompts.
This is the scan that finds the leaks a baseline misses.
from azure.ai.evaluation.red_team import AttackStrategy, SupportedLanguages
advanced = RedTeam(
azure_ai_project=PROJECT_ENDPOINT,
credential=credential,
risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
num_objectives=5,
)
advanced_result = await advanced.scan(
target=target_callback,
scan_name="redteam-advanced",
attack_strategies=[
AttackStrategy.Base64,
AttackStrategy.ROT13,
AttackStrategy.UnicodeConfusable,
AttackStrategy.Compose([AttackStrategy.Base64, AttackStrategy.ROT13]),
],
languages=[SupportedLanguages.Spanish, SupportedLanguages.French],
output_path="redteam_advanced_output",
)
print("✅ advanced scan complete — strategies + Spanish/French")
!!! note "Expected output"
✅ advanced scan complete — strategies + Spanish/French
Each baseline objective is now fired several ways (plain, Base64, ROT13,
confusable, composed) in multiple languages — far more prompts than the basic
scan, so this run takes longer.
!!! warning "API is evolving"
RedTeam, AttackStrategy, and SupportedLanguages live under
azure.ai.evaluation.red_team and move between releases (some builds expose
custom_attack_seed_prompts= for your own objectives). This lab targets the
azure-ai-evaluation[redteam] extra — pin it in pyproject.toml and check the
installed version if an import or argument differs.
7. Compare baseline vs. strategies¶
The advanced scorecard adds an attack-technique breakdown alongside the risk-category one. The story you're looking for: an encoding strategy that scores a higher ASR than baseline means that obfuscation slips past your filters — a concrete gap to close before you ship.
adv = json.loads(Path("redteam_advanced_output/evaluation_results.json").read_text())
tech = adv.get("scorecard", {}).get("attack_technique_summary", [{}])[0]
print(f"{'technique':<14}{'ASR':>8}{'success':>9}{'total':>7}")
print("-" * 38)
for label, key in [("OVERALL", "overall"), ("baseline", "baseline"),
("easy", "easy_complexity"), ("difficult", "difficult_complexity")]:
asr = tech.get(key + "_asr")
if asr is None:
continue
print(f"{label:<14}{asr:>7.1f}%"
f"{tech.get(key + '_successful_attacks', 0):>9}{tech.get(key + '_total', 0):>7}")
!!! note "Expected output"
technique ASR success total -------------------------------------- OVERALL 16.0% 8 50 baseline 10.0% 1 10 easy 17.5% 7 40
Encoded ("easy" complexity) attacks land more often than baseline here —
proof that obfuscation evades the model's defences. Every scan also logs to the
Foundry portal, where you can drill into individual attack/response pairs and
track ASR across runs.
!!! tip "This closes the safety loop" Guardrails (M11) are defence; red teaming is offence; evaluation (M9) is the measuring tape. Run all three on every release and you have a repeatable safety pipeline.
🧪 Your turn¶
- Widen coverage. Bump
num_objectivesto10in section 3 and re-run the basic scan — more prompts per category means a more stable ASR (and a longer run). - Add a strategy. Append
AttackStrategy.Flip(orAttackStrategy.Leetspeak) to theattack_strategieslist in section 6 and re-scan — does the new encoding raise the easy-complexity ASR? - Attack a defended target. Point
target_callbackat the guardrailedcontoso-bank-agentfrom M11 (via anagent_referenceResponses call) and compare its ASR to the bare model — the guardrails should drive it toward zero.
✅ You ran a basic risk-category scan, an advanced scan with encoding strategies and multiple languages, and read the ASR scorecard to find where your model is weakest. Next: put a human in the loop and drive agents over raw REST. → M13 · Human-in-the-Loop & REST