Langfuse
Langfuse is an open-source LLM engineering platform that provides observability, tracing, prompt management, evaluation, and dataset tooling for teams building applications on large language models. It addresses the operational problems of running LLM-based systems in production, where debugging, cost management, quality measurement, and iterative improvement need specialised tooling that standard application monitoring does not provide.
Langfuse was founded in 2023 by Maximilian Deichmann, Marc Klingen, and Clemens Rawert, and was part of Y Combinator's Winter 2023 cohort. It has since become one of the most widely adopted open-source LLM observability platforms. In 2025, ClickHouse, the company behind the open-source analytics database of the same name, acquired Langfuse, signalling a long-term strategic investment in LLM data infrastructure.
Why LLM Observability Matters
Traditional software observability tools (metrics, logs, distributed traces) were designed for deterministic systems, where known inputs produce predictable outputs. LLM applications are non-deterministic: the same prompt can produce different outputs from one call to the next, and behaviour shifts further with model version, temperature settings, and context window contents. Debugging why an LLM application produced an incorrect or harmful output, or why its quality degraded after a prompt change, requires capturing the full context of each model call: the exact prompt sent, the model response, any tool calls made, latency at each step, token counts, and associated costs.
Langfuse captures this information in structured traces that span entire LLM workflows, including chains, agents, retrieval-augmented generation pipelines, and any non-LLM steps such as database lookups or API calls. This observability layer makes LLM applications debuggable and auditable in the same way that distributed tracing (via OpenTelemetry or Jaeger) made microservice architectures observable.
Core Features
Tracing and Observability
Langfuse's tracing system captures hierarchical execution trees for LLM applications. A single user request may initiate a trace that contains spans for an embedding call, a vector database retrieval, an LLM generation, and a post-processing step — each with its own latency, input, output, token count, and cost. Traces are queryable by user, session, tag, or time range, enabling engineers to investigate specific failures or analyse performance patterns across large volumes of requests.
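The hierarchical structure described above can be illustrated with a small, standard-library-only sketch. It is not the Langfuse data model or SDK: the `Span` class, its field names, and the example figures are assumptions chosen for illustration. Summing latency over the tree also assumes the steps run one after another.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One step in a trace (hypothetical structure, not the Langfuse schema)."""
    name: str
    latency_ms: float = 0.0   # time spent in this step itself
    tokens: int = 0           # tokens consumed by this step, if any
    cost_usd: float = 0.0     # cost attributed to this step, if any
    children: List["Span"] = field(default_factory=list)

    def total(self, attr: str) -> float:
        """Sum an attribute over this span and all of its descendants."""
        return getattr(self, attr) + sum(c.total(attr) for c in self.children)


def render(span: Span, depth: int = 0) -> List[str]:
    """Return an indented outline of the execution tree, one line per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.latency_ms:.0f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# A single user request with the four steps named in the text (made-up numbers).
trace = Span("handle-request", children=[
    Span("embed-query", latency_ms=40, tokens=12, cost_usd=0.000001),
    Span("vector-search", latency_ms=25),
    Span("llm-generation", latency_ms=900, tokens=650, cost_usd=0.0013),
    Span("post-process", latency_ms=5),
])

print("\n".join(render(trace)))
print("total latency (ms):", trace.total("latency_ms"))
print("total tokens:", trace.total("tokens"))
```

Keeping per-step values on each node and computing totals by recursion means the same structure serves both the detailed view of one request and roll-ups across many.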
Integration is available for major LLM frameworks and providers including OpenAI, Anthropic Claude, LangChain, LlamaIndex, LiteLLM, and any OpenTelemetry-compatible system. Most integrations require fewer than ten lines of code.
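One pattern that keeps instrumentation this small is wrapping existing functions so that nested calls are recorded as nested spans. The sketch below shows that pattern using only the Python standard library. The `traced` decorator, the `finished_traces` list, and the span dictionary layout are hypothetical names for illustration and are not the Langfuse SDK.

```python
import contextvars
import functools
import time

# Tuple of currently open span records for the current execution context.
_stack = contextvars.ContextVar("span_stack", default=())
# Finished root spans are collected here instead of being sent to a backend.
finished_traces = []


def traced(func):
    """Record a span for each call; nested calls become child spans."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = {"name": func.__name__, "input": (args, kwargs), "children": []}
        parents = _stack.get()
        token = _stack.set(parents + (span,))
        start = time.perf_counter()
        try:
            span["output"] = func(*args, **kwargs)
            return span["output"]
        finally:
            span["latency_s"] = time.perf_counter() - start
            _stack.reset(token)
            if parents:
                parents[-1]["children"].append(span)   # attach to the caller's span
            else:
                finished_traces.append(span)           # outermost call: trace is done
    return wrapper


@traced
def retrieve(question):
    return ["doc-1", "doc-2"]   # stand-in for a vector search


@traced
def answer(question):
    docs = retrieve(question)
    return f"answer based on {len(docs)} documents"   # stand-in for an LLM call


answer("What is tracing?")
root = finished_traces[0]
print(root["name"], "->", [c["name"] for c in root["children"]])
```

Application code only gains the decorator lines, while the parent-child bookkeeping lives in one place.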
Prompt Management
Langfuse provides a centralised prompt registry where prompt templates are stored, versioned, and labelled (development, staging, production). Applications retrieve prompts at runtime via the Langfuse SDK rather than hardcoding them, enabling prompt iteration without code deployments. Server-side and client-side caching ensure that dynamic prompt retrieval does not add meaningful latency to production applications.
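To show why client-side caching keeps runtime retrieval cheap, here is a minimal sketch of a time-to-live cache placed in front of a prompt fetch. It uses only the standard library. `CachedPromptClient`, `fake_fetch`, the 60-second TTL, and the `{{text}}` placeholder syntax are assumptions for illustration, not the Langfuse SDK's actual interface or defaults.

```python
import time
from typing import Callable, Dict, Tuple


class CachedPromptClient:
    """Client-side TTL cache in front of a (hypothetical) remote prompt fetch."""

    def __init__(self, fetch: Callable[[str, str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def get_prompt(self, name: str, label: str = "production") -> str:
        key = (name, label)
        hit = self._cache.get(key)
        now = self._clock()
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                      # fresh entry: no network round trip
        template = self._fetch(name, label)    # stale or missing: refetch
        self._cache[key] = (now, template)
        return template


calls = []


def fake_fetch(name: str, label: str) -> str:
    """Stand-in for the registry request; records how often it is hit."""
    calls.append((name, label))
    return "Summarise the following text: {{text}}"


client = CachedPromptClient(fake_fetch, ttl_seconds=60.0)
template = client.get_prompt("summarise")
client.get_prompt("summarise")   # served from the cache
print(template.replace("{{text}}", "LLM observability"), "| fetches:", len(calls))
```

The TTL bounds how long an application may keep serving an old prompt after a new version is labelled, which is the trade-off for avoiding a fetch on every request.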
Prompt versioning enables controlled rollouts and easy rollback: teams can promote an updated prompt to production, monitor its effect on quality metrics, and revert to the previous version if quality drops, because earlier versions remain stored in the registry.
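Rollout and rollback can be pictured as moving a label between immutable versions. The following in-memory sketch illustrates that idea. The `PromptRegistry` class, its method names, and the example prompt are hypothetical and do not reproduce Langfuse's implementation or API.

```python
from typing import Dict, List


class PromptRegistry:
    """In-memory model of versioned prompts with movable labels (hypothetical)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}      # name -> templates, version 1 first
        self._labels: Dict[str, Dict[str, int]] = {}   # name -> label -> version number

    def create_version(self, name: str, template: str) -> int:
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions)                           # version numbers start at 1

    def set_label(self, name: str, label: str, version: int) -> None:
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self._labels.setdefault(name, {})[label] = version

    def get(self, name: str, label: str = "production") -> str:
        return self._versions[name][self._labels[name][label] - 1]


registry = PromptRegistry()
v1 = registry.create_version("support-reply", "Answer politely: {{question}}")
registry.set_label("support-reply", "production", v1)

v2 = registry.create_version(
    "support-reply", "Answer politely and cite sources: {{question}}")
registry.set_label("support-reply", "staging", v2)      # try the new version first
registry.set_label("support-reply", "production", v2)   # promote it

# Roll back: version 2 stays stored, production simply points at version 1 again.
registry.set_label("support-reply", "production", v1)
print(registry.get("support-reply"))
```

Because versions are never overwritten, a rollback is a single label update and needs no code deployment.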
Evaluations
Langfuse supports multiple evaluation methods to assess LLM output quality at scale. LLM-as-a-judge evaluation uses a secondary LLM call to score outputs against criteria such as correctness, faithfulness, and relevance. Code-based evaluators apply deterministic logic — for example, checking that an output is valid JSON or that a returned SQL query parses correctly. Human annotation workflows allow team members to manually label outputs through Langfuse's review interface. User feedback signals (thumbs up/down, star ratings) can be collected from production applications and correlated with trace data.
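As an example of the code-based category, the sketch below scores an output as 1.0 when it parses as a JSON object containing the required keys, and 0.0 otherwise. The function name, the score dictionary layout (`name`, `value`, `comment`), and the `required_keys` parameter are assumptions for illustration rather than a format defined by Langfuse.

```python
import json
from typing import Any, Dict, Iterable


def json_validity_score(output: str, required_keys: Iterable[str] = ()) -> Dict[str, Any]:
    """Deterministic evaluator: 1.0 if output is a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": "valid_json", "value": 0.0, "comment": f"parse error: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"name": "valid_json", "value": 0.0, "comment": "not a JSON object"}
    missing = [k for k in required_keys if k not in parsed]
    if missing:
        return {"name": "valid_json", "value": 0.0, "comment": f"missing keys: {missing}"}
    return {"name": "valid_json", "value": 1.0, "comment": "ok"}


# A well-formed object, a truncated object, and a valid JSON array.
for candidate in ['{"answer": "42", "sources": []}', '{"answer": "42"', '["42"]']:
    print(json_validity_score(candidate, required_keys=["answer"]))
```

Unlike an LLM-as-a-judge score, this check returns the same result on every run, which makes it suitable as a hard gate in automated pipelines.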
Evaluation scores are attached to individual traces and aggregated into dashboards showing quality trends over time, enabling teams to detect regressions introduced by model updates, prompt changes, or retrieval strategy modifications.
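The aggregation step can be sketched as grouping per-trace scores by day and flagging days where the mean falls sharply. The data, the `max_drop` threshold of 0.15, and the function names below are made up for illustration. They do not describe how Langfuse's dashboards compute trends.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (day, score) pairs as they might be exported from scored traces; values are made up.
scores: List[Tuple[str, float]] = [
    ("2025-01-01", 0.9), ("2025-01-01", 0.8),
    ("2025-01-02", 0.9), ("2025-01-02", 0.7),
    ("2025-01-03", 0.5), ("2025-01-03", 0.6),
]


def daily_means(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average the scores of each day, returned in chronological order."""
    by_day = defaultdict(list)
    for day, value in rows:
        by_day[day].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


def regressions(means: Dict[str, float], max_drop: float = 0.15) -> List[str]:
    """Return the days whose mean fell by more than max_drop versus the previous day."""
    days = list(means)
    return [cur for prev, cur in zip(days, days[1:])
            if means[prev] - means[cur] > max_drop]


means = daily_means(scores)
for day, value in means.items():
    print(day, round(value, 2))
print("regression days:", regressions(means))
```

In practice the flagged day would be cross-referenced with what changed at that point, such as a model update, a prompt change, or a retrieval strategy modification.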
Datasets
The datasets feature stores curated collections of input-output pairs — both historical production examples and hand-crafted test cases — that can be replayed against different prompt versions or models in an offline evaluation environment. This enables systematic benchmarking before deploying prompt or model changes, effectively providing a regression testing workflow for LLM applications.
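The replay workflow can be sketched as running every dataset item through two candidates and comparing an aggregate score. In the sketch below, the two candidate functions are trivial stand-ins for "prompt version A" and "prompt version B". The dataset contents, function names, and exact-match metric are all assumptions for illustration, not Langfuse's experiment API.

```python
import operator
from typing import Callable, Dict, List

# Hypothetical dataset items: an input plus the expected output.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "10 / 4", "expected": "2.5"},
    {"input": "3 * 3", "expected": "9"},
]


def candidate_a(expr: str) -> str:
    """Stand-in for prompt version A: handles only addition."""
    left, op, right = expr.split()
    return str(int(left) + int(right)) if op == "+" else "unknown"


def candidate_b(expr: str) -> str:
    """Stand-in for prompt version B: handles +, * and /."""
    ops = {"+": operator.add, "*": operator.mul, "/": operator.truediv}
    left, op, right = expr.split()
    result = ops[op](int(left), int(right))
    return str(int(result)) if result == int(result) else str(result)


def run_experiment(task: Callable[[str], str], items: List[Dict[str, str]]) -> List[Dict]:
    """Replay every dataset item through a candidate and score it by exact match."""
    rows = []
    for item in items:
        output = task(item["input"])
        rows.append({"input": item["input"], "output": output,
                     "exact_match": float(output == item["expected"])})
    return rows


def accuracy(rows: List[Dict]) -> float:
    return sum(r["exact_match"] for r in rows) / len(rows)


results_a = run_experiment(candidate_a, dataset)
results_b = run_experiment(candidate_b, dataset)
print("A:", round(accuracy(results_a), 2), "B:", round(accuracy(results_b), 2))
```

Because both candidates see identical inputs, any difference in the aggregate score can be attributed to the change under test rather than to variation in production traffic.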
Deployment Options
Langfuse can be deployed as a managed cloud service (Langfuse Cloud) with a free tier offering 50,000 observations per month, or self-hosted using Docker Compose for development environments and Kubernetes via Helm for production deployments. Self-hosting is common among enterprises with strict data residency requirements, as it keeps all trace data — which may contain sensitive prompt contents and user inputs — within the organisation's own infrastructure.