The Modern AI Product Stack For SaaS Teams
I spent two weeks last month helping a Series B SaaS company untangle their AI architecture. They had four different LLM providers, three vector databases, and nobody could explain why. The monthly bill had crossed $47,000 and the AI features still felt brittle. Sound familiar? You're not alone. Most teams I talk to are stitching together AI capabilities without a clear mental model of how the pieces fit.
The AI infrastructure space has exploded. According to The AI Comet's analysis, total funding raised by LLM infrastructure companies reached $10 billion by March 2024, excluding the big players who raised over a billion each. That's a lot of options. And a lot of ways to make expensive mistakes.
This post breaks down what I consider the modern AI stack for SaaS teams in late 2024. Not the bleeding edge research stack. The practical, production-ready stack that ships features and doesn't wake you up at 3am.
The Five Layers You Need to Understand
Before diving into specific tools, you need a mental model. The a16z team published a reference architecture that's become the canonical starting point. I've adapted it slightly based on what I see working in production SaaS environments.
Here's how I think about the layers, from bottom to top:
Infrastructure
The compute and observability foundation. Cloud providers, GPU access for self-hosted models, logging, monitoring, and cost tracking. Most SaaS teams use existing cloud (AWS, GCP, Azure) plus specialized observability like Helicone or Weights & Biases for LLM-specific monitoring.
Model Layer
The actual LLMs you call. Could be proprietary APIs (OpenAI, Anthropic, Google), open-source models you self-host (Llama, Mistral), or a mix. This layer is increasingly commoditized. The model you pick matters less than how you use it.
Context Layer
Where your data lives and gets retrieved. Vector databases (Pinecone, Weaviate, ChromaDB, pgvector), embedding models, and the RAG pipelines that pull relevant context. This is where most SaaS teams differentiate. Your context is your moat.
Orchestration Layer
The glue that connects everything. Frameworks like LangChain, LlamaIndex, or custom code that handles prompt construction, retrieval, chaining, and agent behaviors. Most of the "AI engineering" happens here.
Application Layer
Your product. The UI, UX, and business logic that wraps the AI capabilities. Streaming responses, loading states, error handling, guardrails. This is where users experience value or frustration.
Model Selection: The Decision That Matters Less Than You Think
Teams spend weeks debating GPT-4 vs Claude vs Llama. Here's the thing: for most SaaS use cases, the differences are smaller than you'd expect. What matters more is matching model capability to task complexity and managing costs.
| Model | Best For | Approx Cost (per 1M tokens) | Latency |
|---|---|---|---|
| GPT-4 Turbo | Complex reasoning, coding | $10 input, $30 output | Medium |
| Claude 3 Sonnet | Long context, analysis | $3 input, $15 output | Fast |
| GPT-3.5 Turbo | Simple tasks, high volume | $0.50 input, $1.50 output | Very fast |
| Llama 3 70B | Self-host, data privacy | Compute cost varies | Depends on infra |
| Mistral Large | European data residency | $2-8 depending on tier | Fast |
Anthropic released the Claude 3 family in March 2024. OpenAI has been iterating on GPT-4 Turbo throughout 2024. Meta released Llama 3 with 8B and 70B parameter versions in April 2024. The point isn't to chase the latest model. It's to pick a capable-enough model and invest your energy in the context and orchestration layers where your product actually differentiates.
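One practical consequence: hide the model behind a thin interface from day one, so swapping providers later is a one-file change instead of a refactor. Here's a minimal sketch using the official OpenAI and Anthropic Python SDKs; the default model IDs are examples that will date quickly, and the function name is my own, not a standard.

```python
# Minimal provider-agnostic model interface (illustrative sketch).
# Assumes the official `openai` and `anthropic` SDKs are installed and
# API keys are available via environment variables.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

def complete(prompt: str, provider: str = "anthropic", model: str | None = None) -> str:
    """Single entry point so the rest of the stack never imports a vendor SDK."""
    if provider == "openai":
        resp = openai_client.chat.completions.create(
            model=model or "gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        resp = anthropic_client.messages.create(
            model=model or "claude-3-sonnet-20240229",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError(f"Unknown provider: {provider}")
```

Everything downstream calls `complete()` and never cares which vendor is behind it. That's the whole point: the model becomes a dependency you can swap, not an architecture decision you're stuck with.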
The Context Layer: Where Your Moat Lives
In-context learning is the dominant pattern for production LLM applications. Instead of fine-tuning models on your data (expensive, slow, requires ML expertise), you retrieve relevant context at query time and include it in the prompt. This is what people mean when they say RAG, or Retrieval-Augmented Generation.
The a16z team describes it well: "This looks like a lot of work, but it's usually easier than the alternative: training or fine-tuning the LLM itself. You don't need a specialized team of ML engineers to do in-context learning."
Here's how the context layer typically works:
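At ingestion time, documents are split into chunks, each chunk is embedded, and the vectors are stored next to the original text. At query time, the user's question is embedded the same way, the nearest chunks are retrieved, and they're included in the prompt. Here's a sketch of the ingestion half, assuming OpenAI's embedding endpoint and a hypothetical `store_chunk` helper for whichever vector store you choose:

```python
# Ingestion side of a RAG pipeline (illustrative sketch).
# `store_chunk` is a hypothetical helper for your vector store of choice.
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking; production systems usually split on document structure."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

def ingest_document(doc_id: str, text: str, store_chunk) -> None:
    chunks = chunk_text(text)
    vectors = embed(chunks)
    for i, (chunk, vector) in enumerate(zip(chunks, vectors)):
        store_chunk(doc_id=doc_id, chunk_index=i, text=chunk, embedding=vector)
```

The retrieval half is shown concretely in the pgvector example below.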
Choosing a Vector Database
The vector database market has matured significantly. Here's my current thinking on when to use what:
| Option | When to Use | Trade-offs |
|---|---|---|
| pgvector | Already using Postgres, want simplicity | Slower at scale, but good enough for most SaaS |
| Pinecone | Managed service, fast to start | Vendor lock-in, can get expensive at scale |
| Weaviate | Need hybrid search (vector + keyword) | More complex to operate self-hosted |
| ChromaDB | Early stage, local development | Not as battle-tested for production |
| Qdrant | Performance-critical, self-hosted | Requires infrastructure expertise |
My default recommendation for most SaaS teams: start with pgvector if you're already on Postgres. It's good enough for millions of vectors and you don't add operational complexity. Move to a dedicated vector database when you hit performance limits or need specialized features like hybrid search.
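To make that concrete, here's roughly what retrieval against pgvector looks like with plain psycopg, assuming a `chunks` table with an `embedding vector(1536)` column (the table and column names are illustrative, not a convention):

```python
# Nearest-neighbor retrieval with pgvector via psycopg (illustrative sketch).
# Assumes: CREATE EXTENSION vector; and a table like
#   chunks(doc_id text, chunk_index int, text text, embedding vector(1536))
import psycopg

def retrieve_context(conn: psycopg.Connection, query_embedding: list[float], k: int = 5) -> list[str]:
    # pgvector's <=> operator is cosine distance; smaller means more similar.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT text
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vector_literal, k),
        )
        return [row[0] for row in cur.fetchall()]
```

That's the entire retrieval layer for a lot of production SaaS features: one table, one index, one query.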
The Orchestration Layer: LangChain vs LlamaIndex vs Custom
IBM's comparison summarizes it well: "While LlamaIndex shines when querying databases to retrieve relevant information, LangChain's broader flexibility allows for a wider variety of use cases, especially when chaining models and tools into complex workflows."
| Framework | Strength | Best For |
|---|---|---|
| LangChain | Flexibility, agent workflows, tool use | Complex multi-step AI features, agents |
| LlamaIndex | Data indexing, RAG optimization | Search-heavy applications, document QA |
| Custom code | Full control, no dependencies | Simple use cases, latency-critical paths |
Here's my honest take: frameworks add complexity. If your AI feature is "take user input, retrieve context, call LLM, return response," you probably don't need LangChain. A hundred lines of Python will do it. Frameworks become valuable when you're building agent behaviors, complex chains, or need to swap components frequently.
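For reference, here's the core of that "hundred lines of Python" stripped to its essentials. The retrieval and model-call functions are injected so this glue stays vendor-agnostic; the prompt template is an illustrative assumption, not a recommendation.

```python
# The whole "orchestration layer" for a simple RAG feature, no framework needed.
# `retrieve` and `llm` are whatever retrieval and model-call functions you
# already have (e.g. the sketches earlier in this post).
from typing import Callable

PROMPT_TEMPLATE = """Answer the user's question using only the context below.
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}
"""

def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    llm: Callable[[str], str],
) -> str:
    chunks = retrieve(question)
    prompt = PROMPT_TEMPLATE.format(context="\n\n---\n\n".join(chunks), question=question)
    return llm(prompt)
```

When this function grows branches, retries, tool calls, and multi-step chains, that's your signal a framework might start paying for itself.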
What About Agents?
Every AI product conversation eventually turns to agents. As the a16z team notes: "Agents have the potential to become a central piece of the LLM app architecture. There's only one problem: agents don't really work yet."
That assessment from mid-2023 still largely holds. Agent frameworks can produce impressive demos but struggle with reliability in production. They're great for internal tools where occasional failures are acceptable. They're risky for customer-facing features where you need consistent results.
My advice: build agent-like behaviors with explicit control flows rather than autonomous agents. Break down complex tasks into discrete steps with human checkpoints. This gives you agent-like capability with production-grade reliability.
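Here's what that looks like in code, as a sketch only: a support-ticket workflow is a purely illustrative domain, and the `llm` and `request_approval` callables are placeholders for your own model wrapper and approval UI.

```python
# Agent-like behavior as an explicit, debuggable control flow (illustrative sketch).
# Each step is a plain function; the human checkpoint is an ordinary approval gate
# rather than letting a model decide what to do next.
from dataclasses import dataclass

@dataclass
class DraftResult:
    summary: str
    proposed_action: str
    approved: bool = False

def run_task(ticket_text: str, llm, request_approval) -> DraftResult:
    # Step 1: a fixed summarization prompt, not an open-ended plan.
    summary = llm(f"Summarize the following support ticket in three bullets:\n{ticket_text}")

    # Step 2: a constrained decision -- the model proposes, it does not act.
    proposed_action = llm(
        "Given this summary, propose ONE next action from: refund, escalate, reply.\n" + summary
    )

    result = DraftResult(summary=summary, proposed_action=proposed_action)

    # Step 3: human checkpoint before anything irreversible happens.
    result.approved = request_approval(result)
    return result
```

Every step is observable, every failure points to a specific function, and nothing irreversible happens without a human in the loop. That's the trade you want while agents mature.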
Infrastructure and Observability
The boring stuff that keeps you from getting paged at 2am. At minimum, you need:
- LLM call logging - Every prompt, response, latency, token count, cost. You'll need this for debugging, optimization, and cost control.
- Error tracking - LLM calls fail. Rate limits, timeouts, content filters. Your existing error tracking (Sentry, etc.) should capture these.
- Cost monitoring - Set alerts before you get a surprise bill. OpenAI and Anthropic costs can spike fast with certain usage patterns.
- Latency tracking - P50, P95, P99 latencies for LLM calls. User experience degrades fast when AI features feel slow.
Tools like Helicone, PromptLayer, and Weights & Biases have emerged specifically for LLM observability. If you're doing more than a few thousand LLM calls per day, the investment pays for itself in debugging time saved.
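If you're not ready to adopt a dedicated tool, even a hand-rolled wrapper covering the checklist above goes a long way. A minimal sketch, where the per-token prices and the `log_event` sink are assumptions you'd replace with your own numbers and logging pipeline:

```python
# Bare-minimum LLM call logging: prompt, latency, token counts, rough cost.
# Prices and the log destination are placeholders -- swap in your own.
import time

PRICE_PER_1M = {"input": 3.00, "output": 15.00}  # example: Claude 3 Sonnet-class pricing

def logged_call(llm_call, prompt: str, log_event) -> str:
    # `llm_call` is assumed to return (response_text, input_tokens, output_tokens).
    start = time.monotonic()
    try:
        response_text, input_tokens, output_tokens = llm_call(prompt)
        cost = (input_tokens * PRICE_PER_1M["input"] + output_tokens * PRICE_PER_1M["output"]) / 1_000_000
        log_event({
            "prompt": prompt,
            "response": response_text,
            "latency_s": round(time.monotonic() - start, 3),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "estimated_cost_usd": round(cost, 6),
            "status": "ok",
        })
        return response_text
    except Exception as exc:
        log_event({
            "prompt": prompt,
            "latency_s": round(time.monotonic() - start, 3),
            "status": "error",
            "error": repr(exc),
        })
        raise
```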
Putting It Together: A Reference Architecture
Here's what a typical production setup looks like for a mid-stage SaaS company adding AI features:
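Sketched as a plain component map, it's the stack already discussed above, wired together. The specific tools named here are examples, not endorsements:

```python
# A typical mid-stage SaaS AI stack, expressed as a component map.
# Every concrete tool here is an example drawn from the options discussed above.
REFERENCE_STACK = {
    "application": ["existing SaaS frontend", "streaming responses", "guardrails and error states"],
    "orchestration": ["a few hundred lines of custom Python", "prompt templates", "fallback and retry logic"],
    "context": ["Postgres + pgvector", "embedding model", "chunking / ingestion pipeline"],
    "model": ["one capable hosted API (GPT-4- or Claude-3-class)", "one cheap model for simple tasks"],
    "infrastructure": ["existing cloud (AWS/GCP/Azure)", "LLM observability (e.g. Helicone)", "error tracking (e.g. Sentry)", "cost alerts"],
}
```

Requests flow top to bottom: the application calls the orchestration code, which pulls context from Postgres, calls the model, and logs latency, tokens, and cost on the way through.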
Cost Optimization Strategies
LLM costs catch teams off guard. Here are the levers you can pull:
- Model routing - Use smaller/cheaper models for simple tasks. Route complex queries to powerful models (a minimal router is sketched after this list).
- Caching - Many queries have similar or identical prompts. Cache responses aggressively. Tools like GPTCache help here.
- Prompt optimization - Shorter prompts cost less. Review your system prompts quarterly. Remove fluff.
- Streaming - Stream responses to improve perceived latency without changing actual cost.
- Batch processing - For non-interactive use cases, send work through provider batch endpoints, which typically trade slower turnaround for discounted pricing.
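Here's the model-routing lever in its simplest form: a cheap heuristic decides which tier handles the request. The thresholds, keyword hints, and model names are assumptions you'd tune against your own traffic; many teams eventually replace the heuristic with a small classifier model.

```python
# Simplest possible model routing: cheap model by default, strong model only
# when the request looks complex. Thresholds and model names are examples.
CHEAP_MODEL = "gpt-3.5-turbo"
STRONG_MODEL = "gpt-4-turbo"

COMPLEX_HINTS = ("analyze", "compare", "write code", "multi-step", "explain why")

def pick_model(user_input: str) -> str:
    looks_long = len(user_input) > 2000
    looks_complex = any(hint in user_input.lower() for hint in COMPLEX_HINTS)
    return STRONG_MODEL if (looks_long or looks_complex) else CHEAP_MODEL

def routed_complete(user_input: str, complete) -> str:
    # `complete(prompt, model=...)` is whatever model wrapper you already have.
    return complete(user_input, model=pick_model(user_input))
```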
What to Build First
If you're starting from scratch, here's a suggested sequence:
- Week 1-2: Basic RAG pipeline. Documents → chunks → embeddings → pgvector. Simple retrieval and prompt construction. Ship something to internal users.
- Week 3-4: Add observability. Logging, basic cost tracking, error monitoring. You need visibility before you can optimize.
- Week 5-6: Iterate on retrieval quality. Better chunking strategies, hybrid search if needed, prompt tuning. This is where quality improves.
- Week 7-8: Production hardening. Rate limiting, fallbacks, caching, cost controls. Make it reliable enough for customers.
- Week 9+: Optimization. Model routing, advanced caching, maybe a framework if complexity warrants it.
The Bottom Line
The AI stack is simpler than the vendor landscape suggests. You need models (pick capable-enough ones and don't overthink it), context (your vector database and RAG pipeline), orchestration (start simple and add frameworks when complexity demands it), and observability (non-negotiable).
Most of the value you create will come from the context layer. Your data, your retrieval quality, your domain-specific prompt engineering. The model layer is increasingly commoditized. The infrastructure layer is table stakes. Your differentiation lives in the context, orchestration, and application layers.
Don't let the $10 billion in LLM infrastructure funding intimidate you into over-engineering. The teams shipping the best AI features are often using boring stacks. Postgres with pgvector. A few hundred lines of Python. One or two model providers. Good observability. That's it. The magic is in what they build on top.
The best AI architecture is the one you can explain in two minutes, debug in twenty, and ship features on in a week. Complexity is not a feature.
— Nasr Khan