Guozhen AI: Global AI field notes and model intelligence

AI Topic Hub

RAG, Local LLM, and Model Infrastructure

Plan RAG systems, local LLM deployment, model APIs, cloud AI platforms, vector databases, evaluation, observability, rate limits, and cost optimization.

34 decision guides · Updated 2026-06-11 · English search hub

Buyer questions

  • Should we use RAG, fine-tuning, GraphRAG, or hybrid search?
  • Which model API, cloud AI platform, or local runtime should we choose?
  • How do we control LLM cost, latency, fallbacks, and evaluation quality?

Evaluation angles

  • Retrieval quality, chunking, reranking, and evaluation
  • Model fit, context windows, latency, and token cost
  • Observability, gateway routing, rate limits, and fallback behavior
  • Privacy, deployment control, and cloud platform tradeoffs

Covered categories

  • RAG (5)
  • AI operations (4)
  • AI economics (3)
  • Local LLMs (3)
  • AI evaluation (2)
  • AI security (2)
  • Cloud AI platforms (2)
  • Model APIs (2)
  • AI model benchmarks (1)
  • AI models (1)

Decision Pages

Guides in this topic hub

Local LLMs

Local LLM GPU calculator

Estimate whether a local LLM will fit your GPU by thinking through parameter count, quantization, context length, KV cache, CPU offload, and concurrent requests.

8 min read · Intermediate
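
As a rough illustration of the fit check this guide describes, the sketch below adds quantized weight memory to KV cache memory and compares the total with available VRAM. The function names, the 10% overhead allowance, and the example model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration, not figures from the guide, and CPU offload is not modeled.

```python
def estimate_vram_gb(params_billion, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_tokens, concurrent_requests=1,
                     kv_bytes_per_value=2, overhead_fraction=0.10):
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes). All inputs are assumptions."""
    # Weights: parameter count times bytes per weight after quantization.
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    # KV cache: one K and one V tensor per layer, per token, per KV head.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_value
    kv_cache_bytes = kv_bytes_per_token * context_tokens * concurrent_requests
    # Hypothetical flat allowance for activations and runtime buffers.
    total_bytes = (weight_bytes + kv_cache_bytes) * (1 + overhead_fraction)
    return total_bytes / 1e9


def fits_on_gpu(required_gb, gpu_vram_gb):
    """True when the estimate is within the GPU's memory."""
    return required_gb <= gpu_vram_gb


# Illustrative 8B model at 4-bit with an 8,192-token context and one request.
needed = estimate_vram_gb(8, 4, n_layers=32, n_kv_heads=8, head_dim=128,
                          context_tokens=8192)
print(f"Estimated VRAM: {needed:.2f} GB; fits on 8 GB: {fits_on_gpu(needed, 8)}")
```

Because the KV cache term scales with both context length and concurrent requests, raising either can push a model that fits at one request past the limit.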

Local LLMs

Ollama vs LM Studio

Compare Ollama and LM Studio for local LLM setup, privacy, model management, local API servers, developer workflows, and beginner-friendly desktop usage.

8 min read · Beginner to intermediate

RAG

RAG chunk size guide

A practical guide to choosing RAG chunk size, overlap, retrieval top-k, and evaluation loops for technical docs, policies, support articles, PDFs, and knowledge bases.

9 min read · Intermediate

AI model benchmarks

AI model benchmark 2026

A 2026 guide to reading AI model benchmarks, comparing leaderboards, separating preference from capability, and choosing models for coding, RAG, writing, agents, and local workflows.

9 min read · Intermediate

AI economics

AI API cost calculator

Estimate AI API costs by modeling input tokens, output tokens, retries, caching, traffic, routing, evaluation runs, and monthly usage before shipping an LLM product.

8 min read · Beginner to intermediate
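
A minimal sketch of the cost model this guide outlines, covering input tokens, output tokens, retries, caching, and monthly traffic. The function name, parameters, and all prices and rates in the example are placeholders chosen for illustration, not real provider pricing; routing and evaluation runs are left out.

```python
def monthly_cost_usd(requests_per_day, input_tokens, output_tokens,
                     input_price_per_mtok, output_price_per_mtok,
                     retry_rate=0.0, cache_hit_rate=0.0,
                     cached_input_discount=0.0, days=30):
    """Estimate monthly API spend. Prices are USD per million tokens (assumed)."""
    # Cached input tokens are billed at a discount; blend by the hit rate.
    effective_input_price = input_price_per_mtok * (
        1 - cache_hit_rate * cached_input_discount
    )
    cost_per_request = (
        input_tokens * effective_input_price
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    # Retries are modeled as extra full-price requests.
    billable_requests = requests_per_day * days * (1 + retry_rate)
    return billable_requests * cost_per_request


# Placeholder scenario: 10,000 requests/day, 2,000 input and 500 output tokens,
# $1 and $4 per million tokens, and 5% of requests retried.
estimate = monthly_cost_usd(10_000, 2_000, 500, 1.0, 4.0, retry_rate=0.05)
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

Running the same scenario with a nonzero `cache_hit_rate` and `cached_input_discount` shows how much of the bill depends on input tokens being cacheable.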

RAG

Vector database comparison

Compare Pinecone, Chroma, Qdrant, and Weaviate for RAG workflows by deployment model, filtering, hybrid search, local development, production operations, and cost control.

9 min read · Intermediate

AI models

Context window guide

Understand LLM context windows, token limits, document size, long-context tradeoffs, RAG alternatives, and when a larger context window is actually worth the cost.

8 min read · Beginner to intermediate

RAG

RAG evaluation guide

Learn how to evaluate RAG systems with realistic questions, retrieval recall, context precision, faithfulness, answer quality, latency, and human review loops.

9 min read · Intermediate

AI operations

LLM observability tools

Compare LangSmith, Langfuse, and Helicone for LLM tracing, cost monitoring, prompt management, evaluations, gateway workflows, and production debugging.

8 min read · Intermediate

RAG

Embedding model comparison

Compare OpenAI, Cohere, and Voyage embeddings for semantic search, multilingual retrieval, document search, RAG quality, cost, latency, and evaluation workflow.

9 min read · Intermediate

RAG

RAG reranker guide

Learn when to add a reranker to RAG, how two-stage retrieval works, and how to compare Cohere, Voyage, Jina, and other reranking options by quality, latency, and cost.

9 min read · Intermediate

AI economics

Prompt caching guide

Learn when prompt caching helps, how OpenAI, Anthropic, and Gemini caching differ, and how to design prompts, RAG context, and agent workflows for cache hits.

8 min read · Intermediate

AI economics

AI Batch API guide

Compare OpenAI Batch API, Anthropic Message Batches, and Gemini Batch API for large-scale async jobs, evaluations, data labeling, cost reduction, and throughput planning.

8 min read · Intermediate

AI operations

LLM gateway comparison

Compare LLM gateways for unified model access, routing, fallbacks, budgets, observability, provider keys, self-hosting, and production AI operations.

9 min read · Advanced

Local LLMs

vLLM vs TGI vs Ollama

Compare vLLM, Hugging Face Text Generation Inference, and Ollama for local development, OpenAI-compatible serving, production inference, GPUs, throughput, and operations.

9 min read · Advanced

Model APIs

OpenAI vs Anthropic API

Compare OpenAI and Anthropic APIs for product teams choosing models, structured outputs, long context, cost controls, safety reviews, SDK compatibility, and production fallbacks.

9 min read · Intermediate

LLM reliability

Structured outputs guide

A practical guide to OpenAI structured outputs, Claude schema-based tool use, Gemini response schemas, JSON validation, retries, and production contracts for LLM apps.

8 min read · Intermediate

AI evaluation

LLM evaluation tools

Compare LLM evaluation tools for prompt regression tests, RAG quality, agent behavior, model upgrades, CI checks, human review, and production monitoring.

9 min read · Intermediate

AI safety

LLM guardrails guide

A practical guide to LLM guardrails for prompt injection, tool approvals, output validation, human review, policy checks, and production AI risk management.

9 min read · Intermediate

RAG strategy

RAG vs fine-tuning

Decide when to use RAG, fine-tuning, prompt engineering, or a hybrid approach for private knowledge, style control, domain behavior, cost, freshness, and accuracy.

8 min read · Beginner to intermediate

RAG security

Enterprise RAG security checklist

A practical security checklist for enterprise RAG: data ingestion, permissions, prompt injection, retrieval filtering, citations, logging, privacy controls, and human review.

10 min read · Intermediate to advanced

Model APIs

Responses API vs Chat Completions

Compare OpenAI Responses API and Chat Completions for new apps, agent workflows, tool use, conversation state, structured outputs, file search, web search, and migration planning.

9 min read · Intermediate

RAG architecture

GraphRAG vs vector RAG

Compare GraphRAG and vector RAG for enterprise knowledge bases, narrative documents, entity-heavy questions, global summaries, local search, cost, reindexing, and production complexity.

9 min read · Intermediate

RAG retrieval

Hybrid search RAG guide

A production guide to hybrid search for RAG: when to combine keyword BM25 and vector embeddings, how to fuse rankings, when to add rerankers, and how to evaluate retrieval.

8 min read · Intermediate
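
One common way to fuse a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), sketched below. This is an illustration of the general technique rather than the method the guide necessarily recommends; the function name, the document IDs, and the constant `k=60` (a conventional default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores the sum of 1 / (k + rank) over every list in which
    it appears, with rank starting at 1. Higher scores rank first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 index and a vector index.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# prints ['b', 'c', 'a', 'd']
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores against embedding similarities, which are on different scales.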

AI security

LLM red teaming guide

A practical LLM red teaming guide for prompt injection, jailbreaks, data leakage, tool misuse, RAG attacks, agent safety, adversarial testing, evals, and remediation.

10 min read · Intermediate

Cloud AI platforms

Azure OpenAI vs OpenAI API

Compare Azure OpenAI and the OpenAI API for enterprise apps, privacy review, regional deployment, quota, pricing, networking, identity, model access, and migration planning.

9 min read · Intermediate

Cloud AI platforms

Bedrock vs Azure OpenAI vs Vertex AI

Compare Amazon Bedrock, Azure OpenAI, and Google Vertex AI/Gemini Enterprise Agent Platform for model access, enterprise controls, RAG, agents, guardrails, pricing, and operations.

10 min read · Intermediate

RAG platforms

Cloud RAG platform comparison

Compare managed cloud RAG options: Amazon Bedrock Knowledge Bases, Azure OpenAI with Azure AI Search, and Google Agent Search for enterprise search, permissions, citations, cost, and operations.

9 min read · Intermediate

Private AI

Private LLM deployment guide

A practical guide to private LLM deployment for enterprises: vLLM, NVIDIA NIM, Ray Serve, GPU sizing, OpenAI-compatible APIs, security, cost, monitoring, and fallback design.

10 min read · Intermediate to advanced

AI operations

LLM rate limits guide

A practical guide to LLM API rate limits across OpenAI, Anthropic, Azure OpenAI, Bedrock, and Gemini: TPM, RPM, retry-after, backoff, queues, batching, fallbacks, and throughput planning.

9 min read · Intermediate

AI reliability

LLM fallback routing guide

Design LLM fallback routing for production: model tiers, provider outages, rate limits, quality regressions, schema compatibility, retries, observability, and graceful degradation.

9 min read · Intermediate to advanced

AI evaluation

AI agent evaluation guide

Learn how to evaluate AI agents before production: trace review, task datasets, tool-call correctness, route quality, safety checks, online evals, human feedback, and regression gates.

10 min read · Intermediate to advanced

AI security

LLM security tools comparison

Compare LLM security tools for prompt injection, jailbreaks, data leakage, insecure tool use, guardrails, red teaming, and vulnerability scanning: Lakera Guard, Promptfoo, NVIDIA NeMo Guardrails, and Garak.

10 min read · Advanced

AI operations

AIOps tools comparison

Compare AIOps and AI observability tools for incident triage, root cause analysis, log and metric correlation, SRE workflows, alert noise reduction, and production reliability.

10 min read · Advanced
