The Economics of AI Advisory: Making Inference Sustainable
INFERENCE · January 2026 · 4 min read


The question nobody asks in the AI demo is: what does this cost per conversation? It is the number that determines whether AI advisory is a viable business or an impressive prototype that quietly dies when the funding runs out.

The Token Math That Breaks Most Business Cases

A typical advisory conversation — property search, healthcare triage, financial product guidance — runs 8 to 12 exchanges. Each exchange involves input context (conversation history, retrieved knowledge, system instructions) and generated output. On a GPT-4 class model, a single advisory conversation can consume 20,000–40,000 tokens.
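The arithmetic is easy to sketch. The rates below are assumptions in the ballpark of published GPT-4-class pricing, not a specific provider's current price list; the point is the shape of the calculation, not the exact figures.

```python
# Back-of-envelope cost of one multi-exchange advisory conversation.
# Pricing is an assumption (GPT-4-class rates, USD per 1K tokens).
INPUT_RATE = 0.03    # $/1K input tokens (assumed)
OUTPUT_RATE = 0.06   # $/1K output tokens (assumed)

def conversation_cost(exchanges=10, input_tokens_per_exchange=3500,
                      output_tokens_per_exchange=400):
    """Approximate cost of one conversation.

    Input context grows each exchange because history, retrieved
    knowledge, and system instructions are resent; we approximate
    that growth with a flat per-exchange average.
    """
    input_cost = exchanges * input_tokens_per_exchange / 1000 * INPUT_RATE
    output_cost = exchanges * output_tokens_per_exchange / 1000 * OUTPUT_RATE
    return input_cost + output_cost

print(f"~${conversation_cost():.2f} per conversation")
```

With these defaults the conversation consumes 39,000 tokens and costs roughly $1.29, which is how a single advisory session lands in the dollar-plus range.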

~$1.50 per unoptimised advisory conversation
$15K+ per month at 10K conversations (unoptimised)
~$0.10 per conversation with optimisation
~$1K per month at 10K conversations (optimised)

At scale, unoptimised inference economics make AI advisory unviable at any price point users will actually pay. The business case collapses on unit economics before pricing strategy even enters the discussion.

Lever One: Model Routing

Not every advisory query requires the most capable and most expensive model. The key insight is that queries within a domain split naturally into two categories: factual retrieval ('What is the price per square foot of this development?') and complex reasoning ('Given this buyer's NRI status, budget, and investment horizon, compare the regulatory and financial implications of these three projects').

Factual retrieval queries, which represent 60–70% of advisory volume in our deployments, route to smaller, faster, cheaper models; complex reasoning queries route to more capable ones. This single optimisation cuts average inference cost by 50–60% without degrading response quality, because a capable small model answers a simple question just as well as a large one does.
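A minimal routing sketch follows. The model names, per-token rates, and keyword heuristic are all illustrative assumptions; production routers typically use a small classifier model rather than keyword matching, but the cost structure is the same.

```python
# Sketch of cost-aware model routing. Model names and rates are
# illustrative assumptions, not any provider's actual pricing.
SMALL_MODEL = ("small-fast-model", 0.0005)   # $/1K tokens (assumed)
LARGE_MODEL = ("large-capable-model", 0.03)  # $/1K tokens (assumed)

# Crude markers of multi-factor reasoning (hypothetical heuristic).
REASONING_MARKERS = ("compare", "given", "implication", "trade-off",
                     "should i", "recommend", "versus")

def route(query: str):
    """Split queries into factual retrieval vs. complex reasoning."""
    q = query.lower()
    needs_reasoning = (any(m in q for m in REASONING_MARKERS)
                       or len(q.split()) > 40)
    return LARGE_MODEL if needs_reasoning else SMALL_MODEL

model, rate = route("What is the price per square foot of this development?")
print(model, rate)
```

The factual query above lands on the small model; the NRI comparison query from the text would trip the "compare" marker and route to the capable one.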

Lever Two: Semantic Caching

Advisory domains are predictable. In NRI property advisory, the same 200 question types account for roughly 80% of all query volume. 'What documents does an NRI need to buy property in India?' will be asked thousands of times, worded slightly differently each time.

Semantic caching matches incoming queries against prior responses using embedding similarity, not exact string matching. A query about 'papers required for NRI property purchase' hits the same cache as 'documentation needed for overseas Indian buying flat in Mumbai.' Cache hit rates in production advisory deployments run at 35–45% on common query types, and those queries cost near zero after the first response is generated.
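The mechanism can be sketched in a few lines. To keep the sketch self-contained it uses a toy bag-of-words vector and cosine similarity in place of a real embedding model, so it only catches close paraphrases; the threshold value is an assumption to tune against real traffic.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' so the sketch runs standalone.
    Production systems use a real embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.65):  # threshold is an assumption
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def get(self, query):
        qe = embed(query)
        best = max(self.entries, key=lambda e: cosine(qe, e[0]), default=None)
        if best and cosine(qe, best[0]) >= self.threshold:
            return best[1]  # cache hit: no model call, near-zero cost
        return None         # cache miss: fall through to inference

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

In production the linear scan over entries would be replaced by an approximate nearest-neighbour index, but the hit/miss economics are identical.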

Lever Three: Retrieval Precision

The largest hidden cost in RAG-based advisory systems is inefficient retrieval — pulling too much context into the model's window because the retrieval step prioritises recall over precision.

In advisory AI, precision beats recall. It is better to retrieve three highly relevant knowledge chunks than fifteen loosely relevant ones. A 30% reduction in average context window size through precision retrieval yields a 30% direct reduction in input token costs — with the additional benefit of cleaner, more focused responses.
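One way to enforce that preference is to filter retrieved chunks by a relevance threshold and cap the count, instead of stuffing the full top-k into the context window. The scores, threshold, and chunk sizes below are illustrative assumptions.

```python
# Sketch: precision-first context selection on top of top-k retrieval.
# Threshold, cap, and chunk data below are illustrative assumptions.

def select_context(scored_chunks, k=15, min_score=0.75, max_chunks=3):
    """Keep only high-relevance chunks from top-k, capped at max_chunks,
    rather than passing all fifteen loosely relevant ones to the model."""
    top = sorted(scored_chunks, key=lambda c: c["score"], reverse=True)[:k]
    return [c for c in top if c["score"] >= min_score][:max_chunks]

chunks = [{"id": i, "score": s, "tokens": 400}
          for i, s in enumerate([0.92, 0.88, 0.81, 0.62, 0.55, 0.41])]
kept = select_context(chunks)
saved = 1 - sum(c["tokens"] for c in kept) / sum(c["tokens"] for c in chunks)
print(f"{len(kept)} chunks kept, {saved:.0%} input-context reduction")
```

Here three of six candidate chunks survive, halving the retrieved context; the input-token saving flows straight through to cost, since input tokens dominate advisory conversations.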

The Compounding Effect

These three levers compound. Model routing reduces cost on 60–70% of queries. Semantic caching eliminates cost on 35–45% of all queries. Retrieval optimisation reduces token cost on every non-cached query. Applied together, they move advisory conversations from $1.50+ each to $0.08–0.15 each — a 10–15x reduction.
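Because the levers multiply rather than add, the blended cost can be estimated in a few lines. Every input below is an assumption chosen to match the figures in this article; slightly more aggressive caching or a cheaper small model pushes the result from the high teens of cents down into the $0.08–0.15 range.

```python
# Compounding the three levers under illustrative assumptions
# (all percentages and ratios below are assumptions, not measurements).
BASE_COST = 1.50          # $ per unoptimised conversation (from the article)
CACHE_HIT_RATE = 0.45     # share of queries served from the semantic cache
SMALL_MODEL_SHARE = 0.70  # non-cached queries routed to the small model
SMALL_COST_RATIO = 0.02   # small-model cost relative to the large model
CONTEXT_FACTOR = 0.70     # 30% context cut from precision retrieval

routing_factor = (SMALL_MODEL_SHARE * SMALL_COST_RATIO
                  + (1 - SMALL_MODEL_SHARE))
blended = BASE_COST * (1 - CACHE_HIT_RATE) * routing_factor * CONTEXT_FACTOR
print(f"blended cost ~ ${blended:.2f} per conversation "
      f"({BASE_COST / blended:.0f}x reduction)")
```

The structure of the formula is the point: each lever scales a different slice of traffic, so none of them alone gets close to the final number.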

Sustainable economics are not an operational detail. They are what determines whether AI advisory can be offered at a price point that reaches the users who need it most — not just the enterprise budgets that can absorb premium pricing.

The teams that crack advisory inference economics will not just have lower costs — they will have the margin to invest in domain depth, data quality, and trust architecture that competitors running unoptimised inference simply cannot afford.