When the Vendor Becomes the Ceiling: How to Spot AI Platform Lock-In Before It Costs You

Most teams don't get locked in all at once. It happens in layers. You start with a vendor API because it's the fastest path to a demo. Then you build a few workflows on top of their tooling. Then their data format bleeds into your schema. Then your product roadmap starts waiting on their feature releases. By the time someone notices the ceiling, you're six months deep and a rewrite looks existential.

This is the lock-in pattern I see most often in AI projects right now — not the obvious kind where you sign a three-year enterprise deal, but the quiet architectural kind where every reasonable decision compounds into a structural trap.

I've watched it play out across domains. When we were building FleetOS, the temptation to lean on a third-party telematics platform's native agent tooling was real — it would have shipped faster in week one and owned us completely by month three. The same pressure surfaced building multi-agent travel workflows at Etera AI. The vendors with the slickest demos are usually the ones whose abstractions leak the deepest.

The Three Layers Where Lock-In Actually Happens

Vendor dependency isn't monolithic. It lives in specific layers, and each one has a different cost to escape.

Layer 1: API shape and data contracts. If your application code is calling a vendor's SDK directly and passing their proprietary object types around, you've coupled your business logic to their release cadence. This is fixable with a thin abstraction layer — an internal client that owns the translation — but most teams skip it when moving fast.

Layer 2: Evaluation and fine-tuning data. This one is underestimated. If your prompt iterations, human-feedback labels, or retrieval benchmarks live inside a vendor's proprietary evaluation tooling, that data often can't migrate cleanly. You've effectively outsourced your model improvement flywheel to someone else's infrastructure.

Layer 3: Orchestration and agent architecture. This is the most dangerous layer right now. Several platforms offer drag-and-drop agent builders that feel productive early on. But if your multi-step reasoning chains, tool calls, or memory patterns are expressed in a proprietary graph format, you can't lift them to a different runtime without a full redesign. You haven't built an agent — you've configured one inside someone else's product.

A Simple Heuristic: Can You Swap the Foundation?

Here's a test I apply early in any AI system design: if I needed to replace the primary model provider in two weeks, what would break?

If the answer is "only the API call and the response parser" — you're in reasonable shape. If the answer is "the entire orchestration layer, the evaluation pipeline, and half the prompt templates" — you're already locked in.

The goal isn't necessarily to swap providers constantly. It's to preserve optionality, because the AI infrastructure market is still moving fast enough that this year's best-in-class solution is next year's technical debt.

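Here's what this boundary actually looks like in a multi-agent system with real tool routing: not just a thin wrapper, but an abstraction that maps each provider's structured tool calls, stop reasons, and token usage (the raw input for cost tracking) into one result type. Streaming deltas would be normalised behind the same boundary; they are left out here to keep the example short.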

python
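import json
from dataclasses import dataclass, field
from typing import Any

@dataclass
class NormalisedToolCall:
    call_id: str
    name: str
    arguments: dict[str, Any]

@dataclass
class AgentResult:
    text: str | None
    tool_calls: list[NormalisedToolCall] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    stop_reason: str = "end_turn"

# OpenAI finish reasons mapped onto the Anthropic-style names used internally.
_OPENAI_STOP = {"stop": "end_turn", "tool_calls": "tool_use", "length": "max_tokens"}

class LLMGateway:
    """Single translation boundary. Vendor types never escape this class.

    Tools arrive in one internal shape: {"name", "description", "parameters"},
    where "parameters" is a JSON Schema. Messages are plain user/assistant
    turns; system prompts need per-vendor handling that is omitted here.
    """

    def __init__(self, provider: str = "openai"):
        self.provider = provider

    def complete(self, messages: list[dict], tools: list[dict]) -> AgentResult:
        if self.provider == "openai":
            return self._call_openai(messages, tools)
        elif self.provider == "anthropic":
            return self._call_anthropic(messages, tools)
        raise ValueError(f"Unknown provider: {self.provider}")

    def _call_openai(self, messages, tools) -> AgentResult:
        import openai  # imported lazily so only the active provider's SDK is needed
        client = openai.OpenAI()
        # OpenAI wraps each tool definition in a "function" envelope; the
        # argument is left out entirely when there are no tools to offer.
        kwargs = (
            {"tools": [{"type": "function", "function": t} for t in tools]}
            if tools else {}
        )
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, **kwargs
        )
        choice = resp.choices[0]
        calls = [
            NormalisedToolCall(
                call_id=tc.id,
                name=tc.function.name,
                # OpenAI returns arguments as a JSON string, not a dict.
                arguments=json.loads(tc.function.arguments),
            )
            for tc in (choice.message.tool_calls or [])
        ]
        return AgentResult(
            text=choice.message.content,
            tool_calls=calls,
            input_tokens=resp.usage.prompt_tokens,
            output_tokens=resp.usage.completion_tokens,
            stop_reason=_OPENAI_STOP.get(choice.finish_reason, choice.finish_reason),
        )

    def _call_anthropic(self, messages, tools) -> AgentResult:
        import anthropic
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model="claude-opus-4-5", max_tokens=4096,
            messages=messages,
            # Anthropic expects the JSON Schema under "input_schema".
            tools=[
                {"name": t["name"], "description": t.get("description", ""),
                 "input_schema": t["parameters"]}
                for t in tools
            ],
        )
        calls = [
            NormalisedToolCall(
                call_id=b.id, name=b.name, arguments=b.input
            )
            for b in resp.content if b.type == "tool_use"
        ]
        text_blocks = [b.text for b in resp.content if b.type == "text"]
        return AgentResult(
            # Text blocks are contiguous segments, so join without a separator.
            text="".join(text_blocks) or None,
            tool_calls=calls,
            input_tokens=resp.usage.input_tokens,
            output_tokens=resp.usage.output_tokens,
            stop_reason=resp.stop_reason,
        )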

This takes a few hours to write properly. It saves weeks when you migrate — and more importantly, it lets you run cost comparisons across providers in production without touching a single line of business logic.
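
The cost comparison itself can be as small as a price table keyed by provider, applied to the token counts the gateway already returns. A sketch, assuming placeholder prices (the figures below are illustrative, not real vendor rates) and hypothetical `estimate_cost` and `compare` helpers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    """USD per one million tokens. The values used below are placeholders."""
    input_per_m: float
    output_per_m: float

# Hypothetical price table; replace with the rates on your own contract.
PRICES = {
    "provider_a": Price(input_per_m=2.00, output_per_m=8.00),
    "provider_b": Price(input_per_m=3.00, output_per_m=15.00),
}

def estimate_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from normalised token counts."""
    price = PRICES[provider]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000

def compare(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Cost of the same workload on every provider in the table."""
    return {name: estimate_cost(name, input_tokens, output_tokens) for name in PRICES}
```

Feeding this from the `input_tokens` and `output_tokens` fields of the normalised result means the comparison lives entirely outside the business logic.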

When Buying Is Still the Right Call

I'm not arguing for building everything. The economics of AI infrastructure don't support that. Embedding models, vector databases, reranking APIs, safety classifiers — for most teams, running these yourself is a distraction from the actual product.

The distinction I care about is: buy the commodity, own the differentiator.

If a capability is generic enough that a dozen vendors offer it and switching costs are low, buy it and move on. If it's the thing your product's quality or moat depends on — the retrieval logic, the agent reasoning pattern, the domain-specific evaluation — build it, own it, and keep it portable.

The failure mode I see most often is teams applying this logic backwards. They build custom infrastructure for generic problems ("we need our own vector store") while outsourcing the differentiated logic to a vendor's proprietary agent framework. You end up doing the hard work and still getting locked in.

Imagine a fintech team that hand-rolls a vector store to avoid a $200/month Pinecone bill, then wires their core credit-decisioning logic directly into a no-code agent builder. They've optimised for the wrong variable entirely.

The Timeline Trap

Here's where I push back on a common justification: "we'll refactor it later when we have more time."

In AI projects, later almost never comes for architectural debt — not because engineering is slow, but because the business pressure doesn't pause. With modern tooling, the actual engineering lift to add an abstraction layer or own your orchestration is days, not weeks. What makes it feel slow is the decision overhead and the reluctance to slow down a demo that's gaining traction.

The refactor that feels expensive at month six was a two-day job at month one. That's the real cost of lock-in — not the migration itself, but the compounding opportunity cost of decisions you can't make because the architecture won't let you.

What to Audit Before You Commit

Before signing an enterprise tier or going deep on any AI platform, I run through four questions:

  • Data portability: Can I export my fine-tuning data, evaluation sets, and usage logs in a format I control?
  • Abstraction boundaries: Where does the vendor's type system touch my business logic?
  • Orchestration ownership: Is my agent behavior expressed in a vendor-specific format, or in something I can run anywhere?
  • Replacement timeline: If this vendor raises prices 3x or gets acquired, how long is my migration path?

None of these are reasons to avoid buying. They're the questions that tell you whether you're buying a tool or accidentally outsourcing your product's ceiling.
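
The abstraction-boundary question can be partly automated. A rough sketch, assuming vendor SDKs are imported by package name (`openai`, `anthropic`) and that a single gateway module is the only file allowed to import them; the file name `llm_gateway.py` and the helper names are hypothetical:

```python
import ast
from pathlib import Path

VENDOR_PACKAGES = {"openai", "anthropic"}   # SDKs to keep behind the boundary
ALLOWED_FILES = {"llm_gateway.py"}          # hypothetical gateway module name

def vendor_imports(source: str) -> set[str]:
    """Return the vendor packages imported anywhere in a piece of source code."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]   # absolute "from x import y" only
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in VENDOR_PACKAGES:
                found.add(root)
    return found

def find_leaks(root: Path) -> dict[str, set[str]]:
    """Map each file outside the gateway to the vendor SDKs it imports."""
    leaks: dict[str, set[str]] = {}
    for path in root.rglob("*.py"):
        if path.name in ALLOWED_FILES:
            continue
        hits = vendor_imports(path.read_text(encoding="utf-8"))
        if hits:
            leaks[str(path)] = hits
    return leaks
```

This only catches static imports, so it is a floor rather than a guarantee, but an empty result from `find_leaks` is a quick signal that vendor types are staying on their side of the line.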

The teams that navigate this well aren't the ones who build everything — they're the ones who are deliberate about exactly where the vendor's world ends and theirs begins.
