Chain of Thought


How Intercom Cut $250K/Month by Ditching GPT for Qwen [Chain of Thought]

The counterintuitive findings from scaling Fin to 1M+ resolutions per week

Conor Bronsdon
Feb 26, 2026
Cross-posted by Chain of Thought
"We believe in sharing as we go. When Fergal sat down with Conor Bronsdon for the “Chain of Thought” podcast, he did exactly that. It’s an in-depth look at the creation of Fin, breaking down model selection, the infrastructure around it, and the surprising value of latency. You’ll also hear more about Fergal’s theory that per-customer prompt engineering turns AI products into “services in a trenchcoat”. Most importantly, Fergal weighs in on the bigger market question. Will frontier labs eat the application layer? Or can dedicated specialists stand out? If you build with AI, it's worth an hour of your time. Read Conor’s full summary below or tune in on YouTube, Spotify, or Apple Podcasts. "
- Anne Marie Kingsland

Also available on Spotify, Apple Podcasts, or wherever else you listen


Why This Episode

Fin, Intercom’s AI customer service agent, now resolves over a million conversations per week (probably more since we recorded). It’s one of the most successful enterprise examples of agents in action.

Fergal Reid is the Chief AI Officer at Intercom, where he has scaled the AI team from 10 people to 55, with plans to more than double it again this year. He’s responsible for Fin, and he took us through their thought process, their approach, and the pivots they’ve made along the way.

What makes this conversation particularly valuable is the strategic reversal at its center. Two years ago, Intercom’s position was clear: frontier LLMs are improving so fast, don’t bother fine-tuning. Just get good at prompt engineering and ride the wave. Today, they’re heavily invested in post-training open-weight models, running fine-tuned Qwen models at scale in production, and saving hundreds of thousands of dollars a month in the process.

This episode walks through the full decision process. Not just what they switched to, but why, how they evaluated it, what their training infrastructure looks like, and how they think about defensibility as an application-layer company in an era where the labs keep moving upstack. If you’re building AI products and trying to figure out when to invest in post-training versus riding frontier APIs, this is the episode for you.



Key Takeaways

1. Intercom saved ~$250K/month by replacing GPT with a fine-tuned 14B Qwen model for a single task

Before any query hits Fin’s RAG pipeline, an LLM summarizes and canonicalizes the user’s question, because end users ask things colloquially, inaccurately, or ambiguously. Intercom was running GPT for this step and spending roughly a quarter million dollars a month on it. They fine-tuned a 14 billion parameter Qwen model to do the same task, saved almost all of that cost, and got more fine-grained control over model behavior in the process. You don’t need frontier intelligence for every step in your pipeline.
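Intercom hasn’t published their prompt or serving setup, so here’s a minimal, hypothetical sketch of what a canonicalization step in front of a RAG pipeline looks like. The fine-tuned model call is stubbed with trivial string cleanup so the flow is runnable; the function names and instruction text are illustrative, not Intercom’s.

```python
# Sketch of query canonicalization in front of a RAG pipeline.
# The fine-tuned 14B model call is stubbed with simple string cleanup;
# in production this would be a request to a hosted open-weight model.

CANONICALIZE_INSTRUCTION = (
    "Rewrite the user's message as one self-contained support question. "
    "Drop greetings and filler; keep product names and identifiers."
)

def call_small_model(instruction: str, user_message: str) -> str:
    """Placeholder for the fine-tuned open-weight model."""
    text = user_message.strip().rstrip("?!.").lower()
    for filler in ("hey, ", "hi, ", "umm ", "basically "):
        text = text.replace(filler, "")
    return text.capitalize() + "?"

def canonicalize(user_message: str) -> str:
    return call_small_model(CANONICALIZE_INSTRUCTION, user_message)

def retrieve(query: str) -> list[str]:
    """Placeholder for the custom retrieval model + re-ranker."""
    return [f"help-center article matching: {query}"]

def handle(user_message: str) -> list[str]:
    # Cheap fine-tuned model runs first on every query;
    # the expensive frontier model is only invoked downstream.
    return retrieve(canonicalize(user_message))
```

The design point is the split: the small model absorbs the high-volume, narrow task, so the per-query frontier spend only starts after the query has been cleaned up.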

2. Most of Fin’s improvement came from surrounding systems, not the core LLM

Fin’s resolution rate has gone from ~30% at launch to nearly 70% today. But the vast majority of those gains didn’t come from upgrading the frontier model at the center (Claude Sonnet as of December). They came from the cloud of 10-15 surrounding systems: a custom retrieval model fine-tuned on production resolution signals, a custom re-ranker built on ModernBERT that outperformed Cohere’s best offering, and the query canonicalization pipeline. Effective context engineering, getting the right information to the model, matters more than the model itself once you’re past a certain intelligence threshold.
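The retrieve-then-rerank shape of that pipeline is worth making concrete. Intercom’s actual models aren’t public, so this sketch swaps the ModernBERT cross-encoder for a stand-in lexical-overlap scorer; only the control flow is the point.

```python
# Retrieve-then-rerank sketch. In a system like Fin, the scorer would be
# a fine-tuned ModernBERT cross-encoder run on (query, document) pairs;
# here it is a stand-in word-overlap function so the flow is runnable.

def crossencoder_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder forward pass: fraction of query
    words that also appear in the candidate document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Score every retrieved candidate against the query, keep the best.
    scored = sorted(candidates,
                    key=lambda doc: crossencoder_score(query, doc),
                    reverse=True)
    return scored[:top_k]
```

A cross-encoder is slower than the first-stage retriever because it reads the query and each candidate jointly, which is why it only runs over a short candidate list rather than the whole corpus.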

3. Higher latency counterintuitively increases resolution rates, including hard resolutions

This was one of the most surprising findings Fergal shared. You’d expect faster responses to create better user experiences. But in production A/B tests at scale (multi-million-interaction tests), increased latency consistently led to more resolutions, and not just soft resolutions, where the user simply disappears. Hard resolutions, where users explicitly confirmed their question was answered, went up too. Fergal’s hypothesis is that more latency increases the end user’s perception that the system did real work.

The takeaway for builders: you cannot rely on backtests alone. Production signals surface confounders you’d never anticipate.
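The statistics behind detecting such an effect are simple; the scale is what matters. Here’s a plain two-proportion z-test, with illustrative numbers (not Intercom’s), showing why a multi-million-interaction test can resolve even a half-point shift in hard-resolution rate.

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int,
                         conv_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference between two resolution rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative: a 0.5-point lift (65.0% -> 65.5%) at 1M users per arm.
z, p = two_proportion_ztest(650_000, 1_000_000, 655_000, 1_000_000)
```

With these made-up numbers the lift is unambiguous (z is well above 6), whereas the same 0.5-point difference at a few thousand users per arm would be statistical noise. That asymmetry is why backtests and small pilots miss effects like the latency paradox.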

4. Per-customer prompt engineering is technical debt disguised as customization

Fergal drew a sharp line between two types of AI products. The first: hand-engineered prompts customized per customer by a forward-deployed engineer. It feels great on week one. The second: a standardized, scientifically optimizable system that gets better over time through rigorous A/B testing. Fergal argued that prompt branching per customer is the AI equivalent of branching your database per customer. It seems responsive, but it’s unmaintainable and prevents systematic improvement. Intercom has bet on the second approach, and it’s the foundation that makes their post-training and evaluation infrastructure possible.

5. The application layer has a real hand to play

When asked whether companies like Intercom can survive if the labs keep moving upstack, Fergal was candid: he doesn’t know for sure. But his thesis is that for specific domains like customer service, model intelligence saturates and a 140 IQ human isn’t meaningfully better than a 110 IQ human at answering support questions. What matters at that point is expertise, vertical tuning, and the product harness around the model. Intercom’s bet is on vertical integration: custom post-trained models, proprietary retrieval systems, and a unified “customer agent” that handles service, sales, and success in one system. Whether the bitter lesson catches up remains to be seen.

Want more great insights like these in your inbox? Make sure you’re subscribed to Test Lab.


Full Transcript

Available on Spotify.

Timestamps

  • 00:00 — Intro

  • 00:46 — Why Intercom Completely Reversed Their Fine-Tuning Position

  • 08:00 — The $250K/Month Summarization Task (Query Canonicalization)

  • 11:25 — Training Infrastructure: H200s, LoRA to Full SFT, and GRPO

  • 14:09 — Why Qwen Models Specifically Work for Production

  • 18:03 — Goodhart’s Law: When Benchmarks Lie

  • 19:47 — A/B Testing AI in Production: Soft vs. Hard Resolutions

  • 25:09 — The Latency Paradox: Why Slower Responses Get More Resolutions

  • 26:33 — Why Per-Customer Prompt Branching Is Technical Debt

  • 28:51 — Sponsor: Galileo

  • 29:36 — Hiring Scientists, Not Just Engineers

  • 32:15 — Context Engineering: Intercom’s Full RAG Pipeline

  • 35:35 — Customer Agent, Voice, and What’s Next for Fin

  • 39:30 — Vertical Integration: Can App Companies Outrun the Labs?

  • 47:45 — When Engineers Laughed at Claude Code

  • 52:23 — Closing Thoughts


If You Liked This Episode

Related Episode:

If you liked Fergal’s breakdown of Fin, check out my conversation with Angie Jones (VP of Engineering, Block) on how her team deployed AI agents to all 12,000 Block employees in 8 weeks at the start of 2024. Includes a live demo of Goose, their open-source agent that lets employees choose their own models.


🤫 Sneak peek

For those of you who read to the end, here’s an early look at some upcoming episodes:

  • Alex Ratner, Co-founder & CEO of Snorkel AI

  • Sterling Chin, showcasing his MARVIN agent chief of staff (think focused OpenClaw)

  • many more exciting guests!

If you know someone we should have on the show, reply and let me know.

Chat soon,
Conor

Thanks for reading Test Lab! If you know someone who would enjoy this episode, do us a favor and send it to them.


Links & Resources

Connect with Fergal:

  • X/Twitter: @fergal_reid

  • LinkedIn: linkedin.com/in/fergalreid

Connect with Conor:

  • YouTube: @ConorBronsdon

  • X/Twitter: @ConorBronsdon

  • LinkedIn: linkedin.com/in/conorbronsdon

Products & Tools Mentioned:

  • Fin — Intercom’s AI customer service agent

  • Intercom — Customer messaging platform

  • Claude Code — Anthropic’s agentic coding tool

  • ModernBERT — The architecture behind Intercom’s custom re-ranker

  • Qwen models — Alibaba’s open-weight model family

  • Galileo — Episode sponsor; multi-agent systems guide available free

Subscribe to Test Lab for episode breakdowns, occasional essays, and strategic thinking on AI, developer tools, and the business of technology.



© 2026 Conor Bronsdon