LangGraph Best Practices: A Comprehensive Developer Guide
I’ve spent a good amount of time building and reviewing LangGraph-powered systems, both single-agent workflows and fairly complex multi-actor setups. Below is a practical, opinionated playbook of LangGraph best practices that you can lift into your own projects. It focuses on what matters in production: clear state, controllable flow, durable memory, predictable streaming, strong error boundaries, and operational visibility.
1) Core Architecture & State Design
1.1 Keep state boring—and typed
Your state object is the backbone of the graph. Keep it minimal, explicit, and typed. LangGraph supports TypedDict, Pydantic, and dataclasses; choose one and stick to it across the codebase for consistency. Use reducer helpers (e.g., add_messages) only where you truly need accumulation. Don’t dump transient values into state; pass them through function scope.
Example: Minimal typed state with a reducer for accumulating messages
from typing import Annotated, TypedDict, Any, Optional
from langgraph.graph.message import add_messages
class AppState(TypedDict, total=False):
    messages: Annotated[list, add_messages]
    current_step: str
    result: dict[str, Any]
    error_count: int
    max_steps: int
    last_error: Optional[dict[str, Any]]
1.2 Immutability mindset in node functions
Treat each node like a pure function: return a partial state update rather than mutating inputs. It makes testing easier and keeps edge routing predictable.
Example: Pure node returning partial updates
def classify_intent_node(state: AppState) -> dict:
    # add_messages coerces dict inputs into message objects, so read .content
    user_msg = state["messages"][-1].content
    intent = "search" if "find" in user_msg.lower() else "chat"
    return {"current_step": "classified", "result": {"intent": intent}}
1.3 Validation at the boundaries
Validate inbound and outbound state at each node boundary; simple schema checks and guards avoid downstream “mystery errors.”
Example: Lightweight state validation
from pydantic import BaseModel, ValidationError

class ResultModel(BaseModel):
    intent: str

def validate_result_node(state: AppState) -> dict:
    try:
        ResultModel(**state.get("result", {}))
        return {"current_step": "validated"}
    except ValidationError as e:
        return {"current_step": "error", "last_error": {"type": "validation", "detail": str(e)}}
2) Graph Flow & Edges
2.1 Prefer simple edges; add conditionals only where behavior branches
Sequence simple edges for linear steps; use conditional edges when the state genuinely branches.
Example: Conditional routing function
from langgraph.graph import StateGraph, START

builder = StateGraph(AppState)
builder.add_node("classify", classify_intent_node)
builder.add_node("validate", validate_result_node)

# Entry point and simple edge: START -> classify -> validate
builder.add_edge(START, "classify")
builder.add_edge("classify", "validate")

# Conditional: validate -> (error_handler | proceed)
def route_after_validate(state: AppState) -> str:
    if state.get("current_step") == "error":
        return "error_handler"
    return "proceed"

builder.add_node("error_handler", lambda s: {"error_count": s.get("error_count", 0) + 1})
builder.add_node("proceed", lambda s: {"current_step": "done"})
builder.add_conditional_edges("validate", route_after_validate, {
    "error_handler": "error_handler",
    "proceed": "proceed",
})
2.2 Tame cycles with guardrails
Cycles are normal in agentic systems (retry, clarify, tool-call loops). Add hard stops: a max_steps
counter; exponential backoff on repeated failures; explicit exit conditions for “no progress.”
Example: Bounding a cycle with a step limit
def should_continue(state: AppState) -> str:
    steps = state.get("error_count", 0)
    if steps >= state.get("max_steps", 3):
        return "halt"
    return "retry"

builder.add_node("halt", lambda s: {"current_step": "halted"})
builder.add_conditional_edges("error_handler", should_continue, {
    "halt": "halt",
    "retry": "classify",  # loop back to the top of the cycle
})
2.3 Multi-agent patterns: supervisor → specialists
Use a small “supervisor” node to classify the request and delegate to specialist agents; aggregate results centrally.
Example: Supervisor routing to specialist agents
SPECIALISTS = {
    "search": "search_agent",
    "chat": "chat_agent",
}

builder.add_node("supervisor", classify_intent_node)
for name in SPECIALISTS.values():
    builder.add_node(name, lambda s, name=name: {"result": {name: f"handled {s['result']['intent']}"}})

def route_to_specialist(state: AppState) -> str:
    intent = state["result"]["intent"]
    return SPECIALISTS.get(intent, "chat_agent")

builder.add_conditional_edges("supervisor", route_to_specialist, {
    "search_agent": "search_agent",
    "chat_agent": "chat_agent",
})
3) Memory, Persistence & Threads
3.1 Use a production checkpointer (Postgres)
For anything beyond demos, plug in a Postgres-backed checkpointer so you can pause/resume, inspect state, and survive process restarts.
Example: Postgres checkpointer
from langgraph.checkpoint.postgres import PostgresSaver
from psycopg_pool import ConnectionPool

DB_URI = "postgresql://user:pass@host:5432/langgraph?sslmode=require"

# Hand the pool itself to the saver so connections aren't returned to the pool
# while the compiled graph still needs them; autocommit lets setup() create tables.
pool = ConnectionPool(conninfo=DB_URI, max_size=10, kwargs={"autocommit": True})
saver = PostgresSaver(pool)
saver.setup()  # creates the checkpoint tables if they don't exist
compiled = builder.compile(checkpointer=saver)
3.2 Treat thread_id as a first-class key
Every invocation should carry a meaningful thread_id so checkpoints and HITL pauses attach to the right conversation or workflow instance.
Example: Invoke with thread and namespace
config = {
    "configurable": {
        "thread_id": f"tenant-{tenant_id}:user-{user_id}:session-{session_id}",
        "checkpoint_ns": f"tenant-{tenant_id}",
    }
}

final_state = compiled.invoke({
    "messages": [{"role": "user", "content": "Find hotels in Berlin"}],
    "max_steps": 3,
}, config=config)
3.3 Namespacing for long-term memories
Implement cross-thread or durable user preferences with a clean namespace strategy.
Example: Storing a user preference
from langgraph.store.memory import InMemoryStore

# InMemoryStore is fine for demos; use a persistent store for truly durable prefs
store = InMemoryStore()
user_ns = (f"tenant-{tenant_id}", f"user-{user_id}", "prefs")
store.put(user_ns, "language", {"preferred": "concise", "level": "advanced"})
prefs = store.get(user_ns, "language")
4) Streaming & Performance
4.1 Choose the right stream mode for your UI
LangGraph offers multiple streaming modes: messages (token-level UX), updates (state deltas), values (full state snapshots each step), and custom when you want to push tailored payloads to the client.
Example: Efficient streaming of state updates
for delta in compiled.stream({"messages": messages}, stream_mode="updates", config=config):
    # Send only changes to the frontend (bandwidth-friendly)
    handle_delta(delta)
4.2 Parallelize independent work with the Send API
When steps are independent (e.g., run N tool calls), use the Send API to dispatch work in parallel and rejoin.
Example: Fan-out with Send and aggregate
from langgraph.types import Send

# Wire this via add_conditional_edges: returning a list of Send objects
# dispatches N parallel "worker" invocations (the map step).
def fanout_node(state: AppState):
    tasks = state.get("pending", [])
    return [Send("worker", {"task": t}) for t in tasks]

# Each worker receives the payload passed to Send, not the full graph state.
def worker_node(payload: dict):
    t = payload["task"]
    return {"result": {t: f"done:{t}"}}

def aggregate_node(state: AppState):
    # Parallel writes to "result" need a merging reducer on that key; by the
    # time this node runs, the merged dict is ready to rank or filter.
    return {"current_step": "aggregated", "result": state.get("result", {})}
4.3 Control your context
Monitor prompt + tool I/O size. Compress history (summaries, selective retention) and switch to cheaper/faster models for non-critical steps. Keep prompts as data for A/B testing without code changes.
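Example: Compacting history before a model call. A minimal sketch assuming langchain_core’s trim_messages helper; the message budget and len-based counter are illustrative placeholders to tune per model.
from langchain_core.messages import trim_messages

def compact_context_node(state: AppState) -> dict:
    # Build a bounded view of the conversation for the next model call.
    # token_counter=len counts messages, not tokens; swap in a model-aware
    # counter (or an LLM-based summarizer) for production use.
    trimmed = trim_messages(
        state["messages"],
        strategy="last",        # keep the most recent turns
        max_tokens=20,          # with token_counter=len this is a message count
        token_counter=len,
        include_system=True,
        start_on="human",
    )
    # Keep the compacted view transient: hand it to the model here rather
    # than writing the trimmed list back into durable state.
    return {"result": {"context_messages": len(trimmed)}}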
5) Error Handling & Resilience
5.1 Handle errors at multiple levels
Good graphs fail gracefully: node level (typed error objects written into state), graph level (conditional edges to error_handler with retry/fallback), and app level (circuit breakers, rate limiting, alerting).
Example: Central error handler and retry policy
MAX_RETRIES = 2

def risky_node(state: AppState):
    try:
        # risky operation ...
        raise RuntimeError("transient failure")
    except Exception as e:
        return {
            "current_step": "error",
            "last_error": {"type": "exception", "detail": str(e)},
            "error_count": state.get("error_count", 0) + 1,
        }

def retry_or_fallback(state: AppState) -> str:
    if state.get("error_count", 0) > MAX_RETRIES:
        return "fallback"
    return "retry"

builder.add_node("risky_node", risky_node)
builder.add_node("fallback", lambda s: {"result": {"note": "using cached answer"}})
builder.add_conditional_edges("error_handler", retry_or_fallback, {
    "retry": "risky_node",
    "fallback": "fallback",
})
5.2 Bound retries and degrade gracefully
For LLM/tool instability, limit retries and switch to simpler fallbacks (lighter model, cached response, human escalation). Keep the user informed with helpful messages.
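Example: A degradation ladder after repeated failures. The tier names and the human_escalation route below are hypothetical; the point is that each recorded failure drops to a cheaper strategy instead of retrying forever.
# Illustrative tiers: primary model -> lighter model -> cached answer
FALLBACK_TIERS = ["primary_model", "light_model", "cached_answer"]

def pick_tier(state: AppState) -> str:
    # One tier per recorded failure; beyond that, hand off to a human.
    failures = state.get("error_count", 0)
    if failures < len(FALLBACK_TIERS):
        return FALLBACK_TIERS[failures]
    return "human_escalation"

def degrade_node(state: AppState) -> dict:
    tier = pick_tier(state)
    # Keep the user informed about the degraded path instead of failing silently.
    note = f"Answering via {tier} after {state.get('error_count', 0)} failure(s)."
    return {"result": {"strategy": tier, "user_note": note}}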
6) Human-in-the-Loop (HITL)
6.1 Interrupt where human judgment adds value
Use dynamic interrupt inside nodes to pause on sensitive actions (purchases, PII use, high-risk tool calls). Resume with the operator’s decision attached to state.
Example: Approval pause and resume
from langgraph.types import interrupt

def approval_node(state: AppState):
    action = state.get("proposed_action", {})
    if action.get("risk_level") == "high":
        decision = interrupt({
            "action": action,
            "request": "approval",
        })  # resume payload, e.g. {"approved": bool, "note": str}
        return {"approved": decision.get("approved", False), "review_note": decision.get("note")}
    return {"approved": True}
6.2 Design the resume path
After an interrupt, restore minimal context and continue deterministically. Avoid brittle re-execution by capturing the exact decision payload and relevant inputs in state before pausing.
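Example: Resuming a paused thread with the operator’s decision. A sketch that assumes the approval_node from 6.1 and the same thread-scoped config; the value passed to Command(resume=...) becomes the return value of interrupt() inside the node.
from langgraph.types import Command

# First invocation pauses inside approval_node's interrupt() for high-risk actions.
paused = compiled.invoke(
    {"messages": [{"role": "user", "content": "Proceed with the flagged purchase"}]},
    config=config,
)

# Later, resume the same thread with the captured decision payload.
resumed = compiled.invoke(
    Command(resume={"approved": True, "note": "Reviewed by ops on call"}),
    config=config,
)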
7) Testing & Quality
7.1 Test graphs, not just functions
Write tests that exercise the graph: construct a tiny state, invoke/ainvoke, and assert on the resulting state and chosen edge.
Example: Unit-style test of a small graph
import pytest

@pytest.mark.asyncio
async def test_happy_path(async_compiled_graph):
    init = {"messages": [{"role": "user", "content": "find cafes"}], "max_steps": 2}
    out = await async_compiled_graph.ainvoke(init, config={"configurable": {"thread_id": "t1"}})
    assert out["current_step"] in {"done", "validated", "proceed"}
7.2 Mock external tools
Isolate LLM and network/tool calls in tests. Provide deterministic mock outputs for edge selection and error branches. Treat “tool schemas” like contracts.
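Example: Injecting a deterministic stub for a tool-backed node. The search_node_factory pattern below is just one way to keep the real backend out of tests; the names are illustrative.
def fake_search_tool(query: str) -> dict:
    # Deterministic stand-in for the real search backend.
    return {"hits": [f"stub result for {query}"]}

def search_node_factory(search_fn):
    # The node takes its tool as a dependency, so tests can inject the stub
    # and production code can pass the real client-backed function.
    def search_node(state: AppState) -> dict:
        query = state["messages"][-1].content
        return {"result": {"search": search_fn(query)}}
    return search_node

# In tests:      builder.add_node("search_agent", search_node_factory(fake_search_tool))
# In production: builder.add_node("search_agent", search_node_factory(real_search_client))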
7.3 Property-style checks for state invariants
Assert invariants like “only one of {approved, rejected} is set,” or “retry_count never exceeds max.”
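Example: An invariant checker run after every test invocation. The specific checks mirror the AppState fields used throughout this guide; adapt them to your own schema.
def assert_state_invariants(state: AppState) -> None:
    # The error counter must never exceed the configured ceiling.
    assert state.get("error_count", 0) <= state.get("max_steps", 3)
    # An error step must always carry a structured error payload.
    if state.get("current_step") == "error":
        assert state.get("last_error") is not None
    # Approval flags, if present, must be booleans (no "maybe" states).
    if "approved" in state:
        assert isinstance(state["approved"], bool)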
8) Deployment, Observability & Scale
8.1 Configuration per environment
Move provider choices, model names, and feature flags to config, and inject them as configurable inputs at runtime.
Example: Environment-aware config
import os
def runtime_config(thread: str, ns: str):
return {
"configurable": {
"thread_id": thread,
"checkpoint_ns": ns,
"model_provider": os.getenv("LLM_PROVIDER", "openai"),
"route_traces": True,
}
}
8.2 Trace everything
Use LangSmith (or your APM/OTEL stack) for traces, token accounting, and step timing. Log selected edges, tool payloads (sanitized), retries, and outcomes.
Example: Basic LangSmith setup (conceptual)
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-langgraph-app"
os.environ["LANGCHAIN_API_KEY"] = "***"
# Traces will automatically include node spans when using LangChain integrations
8.3 Right-sized durability & pooling
- Postgres checkpointer with connection pooling.
- Prune old checkpoints by policy.
- Scale workers horizontally; keep graphs stateless beyond the checkpointer.
8.4 Streaming at scale
Prefer updates for dashboards and reserve messages for chat-like UX. Combine modes only when distinct subscribers truly need them.
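Example: Combining stream modes for distinct subscribers. When a list of modes is passed, the stream yields (mode, chunk) tuples; push_to_dashboard and push_to_chat_ui are placeholder consumers.
for mode, chunk in compiled.stream(
    {"messages": messages},
    stream_mode=["updates", "messages"],  # only combine when both are truly needed
    config=config,
):
    if mode == "updates":
        push_to_dashboard(chunk)   # state deltas for monitoring
    else:
        push_to_chat_ui(chunk)     # token stream for the chat front end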
9) Security & Compliance
9.1 Treat state as sensitive
State often carries prompts, user inputs, and tool outputs. Sanitize PII, encrypt at rest (DB, object store), and scrub logs. If you mirror parts of state into analytics, do it through a privacy-aware export path.
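Example: A sanitization pass before state leaves the trust boundary. A minimal sketch; the key list and regex are illustrative and no substitute for a proper PII pipeline.
import re

REDACTED_KEYS = {"api_key", "email", "phone", "ssn"}  # illustrative, extend per policy
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize_for_export(value):
    # Recursively mask sensitive keys and obvious PII before logging,
    # tracing, or mirroring state into analytics.
    if isinstance(value, dict):
        return {
            k: "***" if k in REDACTED_KEYS else sanitize_for_export(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [sanitize_for_export(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[redacted-email]", value)
    return value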
9.2 Harden your edges and tools
Validate external inputs (schema + range checks), authenticate tool backends, apply rate limits, and prefer allowlists over wildcards for tool execution. For multi-tenant deployments, enforce row-level security or scoped queries keyed by tenant_id + thread_id.
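Example: An allowlist guard in front of tool execution. A sketch of the idea; the tool names and the shape of the tool_call payload are assumptions, not a fixed schema.
ALLOWED_TOOLS = {"search_hotels", "get_weather"}  # explicit allowlist, no wildcards

def guarded_tool_node(state: AppState) -> dict:
    call = state.get("result", {}).get("tool_call", {})
    name = call.get("name", "")
    if name not in ALLOWED_TOOLS:
        return {
            "current_step": "error",
            "last_error": {"type": "tool_denied", "detail": f"tool '{name}' is not allowlisted"},
        }
    # Authenticate against the tool backend and apply per-tenant rate limits here,
    # scoping any queries by tenant_id + thread_id.
    return {"current_step": "tool_approved"}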
10) Practical Reference Patterns
10.1 Minimal, typed state with reducers
Use TypedDict/Pydantic and add reducers only when you need accumulation of messages or results.
10.2 Supervisor → agent fan-out with Send API
Have the supervisor classify the task, fan out to N workers in parallel via Send, and aggregate into a single node for ranking/decision.
10.3 Durable HITL checkpoints
Before an interrupt, snapshot the proposed action and minimal inputs into state so the operator sees exactly what they’re approving. Resume with the approval payload merged into state.
11) Troubleshooting Checklist
- A branch isn’t firing? Re-check your conditional function return strings and mapping dict—small typos cause silent misroutes.
- Streams look heavy? Switch to updates or add a custom channel for just the fields your UI needs.
- Graphs “forget” progress? Verify you’re consistently sending thread_id and that your checkpointer is set up in the same namespace.
- Parallel fan-out stalls? Ensure each Send target consumes compatible partial state and returns the shape expected by the aggregator.
12) Putting It All Together
A resilient LangGraph application looks like this:
- State: small, typed, and validated; reducers used sparingly.
- Flow: simple edges where possible; conditional edges only at real decision points; bounded cycles.
- Memory: Postgres checkpointer with thread-scoped checkpoints; namespaced long-term preferences.
- Streaming: deliberate choice of messages/updates/values/custom per UX and bandwidth needs.
- Errors: node-, graph-, and app-level handling with graceful degradation and escalation.
- HITL: precise interrupt points and deterministic resume paths.
- Ops: environment-based config, full tracing, connection pooling, and cost monitoring.
If you follow these practices, your team will ship graphs that are easier to reason about, test, and scale—without turning your orchestration layer into a ball of yarn.