Build an AI Ad-Copy A/B Testing Pipeline in 30 Minutes
Claude generates variants. Promptfoo scores them. You ship the winner. End-to-end, no spreadsheets.
Most ad copy testing is theater: three barely different variants from one writer's brain, judged by clicks for a week, with no idea which structural variable actually moved the needle. This pipeline fixes both problems: the narrow variant pool and the missing attribution.
The end state: you write a brief, the pipeline produces structurally varied versions (60 with the example axes in Step 1), scores them automatically against your criteria, and you ship the top two for the live test.
What you need
- Claude API key (Anthropic console, $5 covers months of this)
- promptfoo installed (npm i -g promptfoo)
- A JSON file with your brand voice rules
Step 1: Define your variant axes (5 min)
Edit axes.json:
{
  "hook_type": ["question", "stat", "pattern_interrupt", "story", "POV"],
  "length": ["short", "medium"],
  "cta_style": ["soft", "direct"],
  "emotion": ["fear_of_missing_out", "curiosity", "aspiration"]
}
Pick axes that you actually believe affect performance. Don't include axes you're not willing to act on.
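To see how quickly the matrix grows before you run anything, here is a minimal Python sketch that expands the axes into every combination. It inlines the same JSON as the example so it runs on its own; the helper name expand_axes is made up for illustration and is not part of promptfoo.

```python
import itertools
import json

# Same axes as the axes.json example, inlined so the sketch is self-contained.
AXES_JSON = """
{
  "hook_type": ["question", "stat", "pattern_interrupt", "story", "POV"],
  "length": ["short", "medium"],
  "cta_style": ["soft", "direct"],
  "emotion": ["fear_of_missing_out", "curiosity", "aspiration"]
}
"""

def expand_axes(axes):
    """Return one dict per combination of axis values (the full test matrix)."""
    names = list(axes)
    return [dict(zip(names, values)) for values in itertools.product(*axes.values())]

axes = json.loads(AXES_JSON)
combos = expand_axes(axes)

print(len(combos))  # 5 * 2 * 2 * 3 = 60
print(combos[0])    # the first combination, taking the first value of every axis
```

Every value you add multiplies the total: a fourth emotion alone takes the example from 60 variants to 80, which is one more reason to drop axes you would not act on.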
Step 2: Write the brief prompt (10 min)
brief.txt:
Generate ONE ad copy variant for {{product}} given these constraints:
Hook type: {{hook_type}}
Length: {{length}}
CTA style: {{cta_style}}
Emotion to target: {{emotion}}
Brand voice rules:
[paste from voice-rules.json]
Banned words: [list]
Output ONLY the ad copy. No commentary.
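To sanity-check the brief before spending API calls, you can fill in the placeholders yourself and read the result. The sketch below uses plain regex substitution as a stand-in, not promptfoo's own template engine, and the render helper is hypothetical. It fails loudly when a placeholder has no value, which is the mistake you most want to catch early.

```python
import re

# A trimmed copy of the brief, limited to the four axis placeholders.
TEMPLATE = (
    "Hook type: {{hook_type}}\n"
    "Length: {{length}}\n"
    "CTA style: {{cta_style}}\n"
    "Emotion to target: {{emotion}}"
)

def render(template, variables):
    """Replace each {{name}} with its value; raise if a placeholder has no value."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for placeholder {name!r}")
        return str(variables[name])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

rendered = render(TEMPLATE, {
    "hook_type": "stat",
    "length": "short",
    "cta_style": "direct",
    "emotion": "curiosity",
})
print(rendered)
```

If the rendered brief reads ambiguously to you, it will read ambiguously to the model too.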
Step 3: Configure promptfoo eval (10 min)
promptfooconfig.yaml:
prompts:
  - file://brief.txt
providers:
  - anthropic:claude-sonnet-4-5
tests:
  - vars:
      product: "[your product description]"
    assert:
      - type: llm-rubric
        value: "Does NOT use banned words from voice-rules.json"
      - type: llm-rubric
        value: "Reads like a person wrote it, not a template"
      - type: llm-rubric
        value: "Makes a specific claim, not vague benefits"
      - type: llm-rubric
        value: "CTA is under 5 words and active"
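As far as the config above shows, the grader is given only the rubric text, so it may not know which words are actually banned. The banned-word rule is also the one criterion you can verify without a model. Here is a minimal sketch of a deterministic check, assuming a hypothetical two-word banned list and a made-up helper name, find_banned. How you wire it into promptfoo is left to its documentation on custom assertions.

```python
import re

# Hypothetical banned list; in practice, load yours from your voice rules file.
BANNED_WORDS = ["leverage", "unlock"]

def find_banned(copy, banned=BANNED_WORDS):
    """Return the banned words that appear in the copy as whole words, ignoring case."""
    hits = []
    for word in banned:
        if re.search(rf"\b{re.escape(word)}\b", copy, flags=re.IGNORECASE):
            hits.append(word)
    return hits

print(find_banned("Unlock growth and leverage your data."))  # ['leverage', 'unlock']
print(find_banned("Ship your next campaign faster."))        # []
```

This matches whole words only, so "leveraging" would pass. Add inflected forms to the list if you want them caught too.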
Step 4: Run the matrix (1 min)
promptfoo eval -t axes.json
This generates one ad variant for every combination of axis values (5 × 2 × 2 × 3 = 60 in our example), grades each against your rubric, and outputs a sortable HTML report.
Step 5: Pick + ship (5 min)
Top 2 by score go into your ad platform as the live A/B test. The rest stay in a "graveyard" file for next quarter — sometimes losers become winners when the audience or offer shifts.
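The split between winners and graveyard is a sort and a slice. The sketch below assumes you have already pulled each variant's score into a list of dicts; the id and score fields and the sample values are invented for illustration and do not reflect promptfoo's actual output format.

```python
def split_winners(scored, keep=2):
    """Sort variants by score, highest first; return (winners, graveyard)."""
    ranked = sorted(scored, key=lambda variant: variant["score"], reverse=True)
    return ranked[:keep], ranked[keep:]

# Invented sample data: one entry per generated variant.
scored = [
    {"id": "question-short-soft-curiosity", "score": 0.75},
    {"id": "stat-short-direct-aspiration", "score": 1.0},
    {"id": "story-medium-soft-curiosity", "score": 0.5},
    {"id": "POV-short-direct-curiosity", "score": 0.75},
]

winners, graveyard = split_winners(scored)
print([variant["id"] for variant in winners])
print([variant["id"] for variant in graveyard])
```

Ties keep their original order, so with four rubric checks per variant you will often see several variants share the top score. When that happens, prefer two winners that differ on an axis you care about, so the live test still tells you something.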
What this catches that human review doesn't
Subtle voice violations. You'd let "leverage" or "unlock" slip past at the end of a 4-hour copy session. The rubric won't. Over a year, that catches dozens of brand drift moments you'd never notice individually.