AI Lead Scoring: Prompts, Pipelines, and the Pitfalls Nobody Warns You About
LLMs are weirdly good at lead scoring. Until they're not. Here's how to set it up, and the three failure modes that catch every team off guard.
Rule-based lead scoring is brittle. Get one weight wrong (say, "title contains Director = +10") and you ship junk to sales for a month before catching it. LLMs handle the fuzzy reasoning humans use intuitively: "this lead's job title is weird, but the company profile and email signature pattern scream enterprise budget."
The catch: LLMs also confidently score garbage when you set them up wrong. Here's the path that works.
The minimal pipeline
New lead → enrich → score with Claude → assign tier → push to CRM
The score prompt is the part that matters. Everything else is plumbing.
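The flow above can be sketched as a chain of plain functions. This is a minimal sketch, not a definitive implementation: `enrich`, `score_with_llm`, and `push_to_crm` are hypothetical stand-ins for your enrichment provider, the Claude call, and your CRM client, and the stubs below return placeholder values so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Lead:
    email: str
    profile: dict = field(default_factory=dict)  # filled in by enrichment
    score: Optional[int] = None
    tier: Optional[str] = None


def run_pipeline(
    lead: Lead,
    enrich: Callable[[Lead], dict],
    score_with_llm: Callable[[dict], dict],
    push_to_crm: Callable[[Lead], None],
) -> Lead:
    """New lead -> enrich -> score -> assign tier -> push to CRM."""
    lead.profile = enrich(lead)
    result = score_with_llm(lead.profile)  # expected to return the JSON the score prompt asks for
    lead.score = result["score"]
    lead.tier = result["tier"]
    push_to_crm(lead)
    return lead


# Hypothetical stubs (assumptions, not real integrations).
def fake_enrich(lead: Lead) -> dict:
    return {"email": lead.email, "title": "Director of Ops", "company_size": 400}


def fake_score(profile: dict) -> dict:
    return {"score": 72, "tier": "warm", "reasoning": "stub", "top_risk": "stub"}


crm_rows: list = []  # stands in for the CRM
scored = run_pipeline(Lead("jane@example.com"), fake_enrich, fake_score, crm_rows.append)
print(scored.tier, scored.score)
```

Passing the three steps in as functions keeps the plumbing swappable, so you can iterate on the score prompt without touching enrichment or the CRM push.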
The score prompt that doesn't suck
You are a B2B sales-development analyst scoring inbound leads for [company]
that sells [product] to [ICP].
Given the lead data below, return JSON with:
- score: 0-100 (the total across the four rubric dimensions below)
- tier: "hot" | "warm" | "cool" | "junk"
- reasoning: 2-3 sentences
- top_risk: the one thing that would make this lead not convert
Lead data:
[enriched profile]
Scoring rubric:
- Title fit (0-25): does this person have authority + budget for our price point?
- Company fit (0-25): is the company the right size, stage, industry?
- Intent signal (0-25): what evidence do we have they're actively shopping?
- Engagement quality (0-25): what is the quality of how they reached us?
Be honest in `top_risk`. A great-looking lead with a tell that they're a tire-kicker should score below an okay lead with strong intent.
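Because the prompt asks for JSON, it is worth validating the reply before it reaches the CRM. A minimal sketch, assuming the model's reply arrives as a raw string; `parse_score_reply` is a hypothetical helper, and the field names mirror the prompt above.

```python
import json

ALLOWED_TIERS = {"hot", "warm", "cool", "junk"}
REQUIRED_KEYS = {"score", "tier", "reasoning", "top_risk"}


def parse_score_reply(raw: str) -> dict:
    """Parse the model's reply and reject anything that breaks the contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    score = data["score"]
    # bool is a subclass of int, so exclude it explicitly.
    # 0 is accepted because each of the four rubric dimensions can score 0.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score!r}")
    if data["tier"] not in ALLOWED_TIERS:
        raise ValueError(f"unknown tier: {data['tier']!r}")
    return data


reply = '{"score": 80, "tier": "hot", "reasoning": "Strong fit.", "top_risk": "No budget owner named."}'
print(parse_score_reply(reply)["tier"])
```

A reply that fails validation is usually better retried or routed to a human than silently coerced into a tier.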
Pitfall #1: The model loves your leads too much
Left to its defaults, an LLM finds reasons every lead is good. After two weeks, you'll notice 70% of your leads are scoring "warm" or higher. That's a calibration problem, not a pipeline full of great leads.
Fix: ground the prompt with examples. Append:
Reference cases (scored examples from your past data):
- [paste 3 past hot leads that closed]
- [paste 3 past warm leads that needed nurture]
- [paste 3 past junk leads that wasted SDR time]
The few-shot examples pull the model's calibration toward your actual reality.
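One way to keep the reference cases maintainable is to store them as data and render the block when the prompt is built. A sketch under assumptions: the `ReferenceCase` fields and the rendered line format are illustrative choices, and the two sample cases are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ReferenceCase:
    summary: str  # short description of the past lead
    tier: str     # what it turned out to be: "hot", "warm", or "junk"
    outcome: str  # what actually happened


def render_reference_cases(cases: list[ReferenceCase]) -> str:
    """Render past leads as the reference block appended to the score prompt."""
    lines = ["Reference cases (scored examples from your past data):"]
    for case in cases:
        lines.append(f"- [{case.tier}] {case.summary} -> {case.outcome}")
    return "\n".join(lines)


def build_prompt(base_prompt: str, cases: list[ReferenceCase]) -> str:
    return base_prompt.rstrip() + "\n\n" + render_reference_cases(cases)


# Hypothetical examples; replace with real leads from your CRM history.
cases = [
    ReferenceCase("VP Ops at a 400-person logistics firm, requested a demo", "hot", "closed"),
    ReferenceCase("Personal email address, downloaded one ebook", "junk", "never replied"),
]
prompt = build_prompt("...score prompt goes here...", cases)
print(prompt)
```

Keeping the cases in a list also makes refreshing them a data change instead of a prompt rewrite.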
Pitfall #2: Hallucinated signals
The model will sometimes invent things it "saw" in the data. "Their funding round suggests they're scaling" — but no funding data was in the input. This is hallucination dressed as inference.
Fix: require the model to quote the source data for each scoring dimension. If it can't quote, it can't score that dimension.
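The quoting rule only helps if something checks the quotes. A minimal sketch, assuming the prompt has been extended so the model returns an `evidence` object mapping each rubric dimension to a verbatim quote from the lead data. That field name and shape are assumptions, not part of the prompt above, and the lead data is invented for the example.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences still match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_dimensions(evidence: dict[str, str], lead_data: str) -> list[str]:
    """Return the rubric dimensions whose quoted evidence does not appear in the input."""
    haystack = _normalize(lead_data)
    bad = []
    for dimension, quote in evidence.items():
        needle = _normalize(quote)
        if not needle or needle not in haystack:
            bad.append(dimension)
    return bad


lead_data = "Title: Director of IT\nCompany: Acme Freight, 400 employees\nSource: pricing page form"
evidence = {
    "title_fit": "Director of IT",
    "company_fit": "400 employees",
    "intent_signal": "recent Series B funding round",  # not in the input: a hallucinated signal
}
print(unsupported_dimensions(evidence, lead_data))  # ['intent_signal']
```

Any dimension this flags can be zeroed out or sent back to the model for a second pass, which enforces "if it can't quote, it can't score" in code and not only in the prompt.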
Pitfall #3: The model gets worse over time (silently)
Three months in, score quality drifts. Could be the model version updated. Could be your audience changed. Could be that your ideal-customer definition changed and nobody updated the prompt.
Fix: run a weekly drift check. Sample 20 random leads from the past week, have a human re-score them, and compare the results to the LLM's scores. If the average gap exceeds 10 points, recalibrate: update the few-shot examples in the prompt.
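The weekly check is simple enough to script. A sketch, assuming "average gap" means the mean absolute difference between the human and LLM scores for the same lead. The function names are hypothetical and the scores are made up for the example.

```python
import random
from statistics import mean

DRIFT_THRESHOLD = 10  # points, from the rule of thumb above
SAMPLE_SIZE = 20      # leads to re-score each week


def sample_for_review(lead_ids: list[str], k: int = SAMPLE_SIZE, seed=None) -> list[str]:
    """Pick up to k random leads from the past week for human re-scoring."""
    rng = random.Random(seed)
    return rng.sample(lead_ids, min(k, len(lead_ids)))


def average_gap(llm_scores: dict[str, int], human_scores: dict[str, int]) -> float:
    """Mean absolute difference over the leads scored by both the model and a human."""
    shared = llm_scores.keys() & human_scores.keys()
    if not shared:
        raise ValueError("no leads were scored by both the model and a human")
    return mean(abs(llm_scores[i] - human_scores[i]) for i in shared)


def needs_new_examples(llm_scores, human_scores, threshold=DRIFT_THRESHOLD) -> bool:
    return average_gap(llm_scores, human_scores) > threshold


# Made-up scores for three leads: gaps of 25, 5, and 5 points.
llm = {"a": 80, "b": 65, "c": 40}
human = {"a": 55, "b": 60, "c": 35}
print(needs_new_examples(llm, human))  # True
```

The absolute gap catches disagreement in either direction. Tracking the signed gap alongside it would also show whether the model is drifting high or low.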
What ships better than this
Honestly, nothing if your lead volume is under 500/month. The pipeline costs about 8 cents per lead in API calls and 30 minutes to set up. If it cuts your SDR's time on bad leads by 30%, the math works for any team doing more than a hundred leads a week.