Weekly LLM Observability Market Research Report
Date: 2026-02-12 | Model: google/gemini-3-pro-preview | Data Collected: 2026-02-12
1. Executive Summary
- LangSmith released Self-Hosted v0.13, expanding deployment options for on-premise and VPC environments to support data sovereignty requirements.
- Langfuse introduced an Org Audit Log Viewer, adding security visibility and compliance tracking for enterprise-tier users.
- MLflow launched Organization Support to facilitate multi-workspace management, targeting larger engineering teams within the Databricks ecosystem.
- W&B Weave emphasizes its SOC 2 and HIPAA compliance posture while delivering multimodal tracing and a GUI-based judge wizard for evaluation.
- Arize Phoenix solidifies its position in RAG evaluation with DSPy support and specialized visualizations for retrieval chunks and function calls.
- Braintrust continues to leverage a hybrid deployment model that separates the control plane from data residing in customer VPCs to appeal to security-conscious organizations.
Market Insight by AI:
LangSmith’s market-leading visualization for agent execution graphs via LangGraph presents a direct challenge to Weave’s reliance on hierarchical tree structures for debugging complex agentic workflows. The release of LangSmith Self-Hosted v0.13 and Langfuse’s Org Audit Log Viewer increases competitive pressure on Weave to aggressively market its own dedicated cloud and compliance features. Arize Phoenix’s integration of DSPy for offline evaluation and MLflow’s no-code judge wizard threaten to erode Weave’s differentiation in evaluation workflows unless Weave leverages its multimodal tracing capabilities further.
2. Product Feature Comparison
| Product | Tracing | Eval | Agent Observability | Cost Tracking | Enterprise | Overall |
|---|---|---|---|---|---|---|
| W&B Weave | O | O | △ | O | O | O |
| LangSmith | O | O | O | △ | O | O |
| Langfuse | O | O | △ | O | O | O |
| Braintrust | O | O | △ | △ | O | O |
| MLflow | O | O | △ | X | O | O |
| Arize Phoenix | O | O | O | X | O | O |
3. New Features (Last 30 Days)
W&B Weave
- Audio Monitors: Support for creating monitors that observe and judge audio outputs alongside text using audio-capable LLMs. (2026-02-01, Evaluation & Quality)
- Dynamic Leaderboards: Auto-generated leaderboards inside Evaluations with rich customization, filtering, and CSV export capabilities. (2026-01-29, Evaluation & Quality)
- Custom LoRAs in Playground: Support for testing and comparing custom fine-tuned LoRA weights directly in the Weave Playground. (2026-01-16, Development Lifecycle)
LangSmith
- Customize trace previews: Ability to customize trace previews in the LangSmith UI. (2026-02-06, Integration & DX)
- Google Gen AI Wrapper: New wrapper support for Google Gen AI (Gemini) in Python and JS SDKs. (2026-02-02, Integration & DX)
- LangSmith Self-Hosted v0.13: Update to the self-hosted version of the LangSmith platform. (2026-01-16, Enterprise & Infrastructure)
Langfuse
- Single Observation Evals: Support for running evaluations on individual observations rather than just full traces. (2026-02-12, Evaluation & Quality)
- Reasoning Trace Rendering: New UI capability to render thinking/reasoning parts in trace details (e.g., for reasoning models). (2026-02-12, Core Tracing & Logging)
- Org Audit Log Viewer: Dedicated viewer for organization-level audit logs within the dashboard. (2026-02-12, Enterprise & Infrastructure)
- Inline Trace Comments: Ability to add inline comments on selected portions of input/output data within traces for collaboration. (2026-02-12, Integration & DX)
- Trace Corrections: Workflow to add corrections to trace and observation previews, enhancing dataset curation. (2026-02-12, Evaluation & Quality)
Braintrust
- Sub-agent nesting for Claude Agent: Added support for sub-agent nesting within the Claude Agent SDK wrapper. (2026-02-05, Agent & RAG Specifics)
- Classifications field: Added a new classifications field to traces/spans for better categorization. (2026-01-31, Core Tracing & Logging)
- Evaluation Cache Control: Added option to turn off caching during evaluation runs. (2026-01-29, Evaluation & Quality)
- Trace Scoring Candidate: Introduced candidate functionality for scoring traces in Python SDK. (2026-01-21, Evaluation & Quality)
- Playground Trace Scorer: Fixed and enabled JS trace scorer functionality within the playground. (2026-01-21, Evaluation & Quality)
- Facet Typespecs: Introduced new Facet typespecs for improved data handling. (2026-01-15, Core Tracing & Logging)
MLflow
- Organization Support: Support for multi-workspace environments, allowing experiments and resources to be organized across workspaces. (2026-02-12, Enterprise & Infrastructure)
- MLflow Assistant: In-product chatbot backed by Claude Code to help identify, diagnose, and fix issues directly within the UI. (2026-01-29, Integration & DX)
Arize Phoenix
- Claude Opus 4.6 Support: Added Claude Opus 4.6 model support to the playground. (2026-02-09, Development Lifecycle)
- Tool Selection Evaluator: Added missing tool_selection evaluator to both libraries. (2026-02-06, Evaluation & Quality)
- Faithfulness Evaluator: Added FaithfulnessEvaluator and deprecated HallucinationEvaluator. (2026-02-02, Evaluation & Quality)
- Tool Invocation Accuracy Metric: Added metric to track tool invocation accuracy. (2026-02-02, Analytics & Dashboard)
- OAuth2 Email Configuration: Added EMAIL_ATTRIBUTE_PATH for configurable email extraction in OAuth2. (2026-01-28, Enterprise & Infrastructure)
- LLM Classification Evaluators: Added cursor rule for creating new built-in metrics (LLM classification evaluators). (2026-01-21, Evaluation & Quality)
4. Positioning Shift
| Product | Current | Moving Toward | Signal |
|---|---|---|---|
| W&B Weave | A code-first observability and evaluation toolkit deeply integrated into the W&B MLOps platform, targeting developers building production LLM applications. | Expanding from offline experimentation into real-time production monitoring and guardrails with enhanced UI-based workflows. | Release of Audio Monitors, Dynamic Leaderboards, and GUI-based Judge Wizard. |
| LangSmith | The premier observability and evaluation platform for the LangChain ecosystem, focused on deep code-level debugging of agentic workflows. | Expanding beyond LangChain to become a framework-agnostic enterprise observability standard with enhanced self-hosted and OTel capabilities. | Recent releases of non-OTel Google wrappers and consistent updates to the self-hosted enterprise version. |
| Langfuse | Open-source, developer-first LLM engineering platform focused on observability and evaluation. | Deepening collaboration features and support for advanced model behaviors like reasoning traces. | Recent releases of inline comments, corrections workflows, and reasoning trace rendering. |
| Braintrust | Enterprise-focused evaluation and observability platform bridging the gap between offline development and production monitoring. | Deepening support for complex agentic workflows and refining evaluation mechanics with granular scoring and classification. | Recent updates focus on sub-agent nesting, trace scoring candidates, and classification fields. |
| MLflow | The industry standard for open-source MLOps, now aggressively expanding into a comprehensive GenAI observability and evaluation platform. | Evolving from a pure experiment tracker into a holistic ‘GenAI Ops’ suite that unifies prompt engineering, evaluation, and production monitoring. | Recent releases of native Prompt Management UI, LLM-as-a-Judge wizards, and AI-powered debugging assistants. |
| Arize Phoenix | A developer-first, open-source observability standard for tracing and evaluating LLM applications, particularly RAG and Agents. | Deepening evaluation capabilities for agentic workflows (tool selection, faithfulness) and expanding model support in the playground. | Recent releases focus heavily on new evaluators (Faithfulness, Tool Selection) and metrics for tool invocation accuracy. |
5. Enterprise Signals
- LangSmith released Self-Hosted v0.13, reinforcing the market demand for on-premise and VPC deployment options.
- MLflow introduced Organization Support to enable multi-workspace management for larger enterprise teams.
- Langfuse added an Org Audit Log Viewer, directly addressing enterprise compliance and security visibility requirements.
- Braintrust continues to leverage its hybrid deployment model (control plane in cloud, data in VPC) to appeal to security-conscious organizations.
- W&B Weave emphasizes SOC 2, HIPAA, and GDPR compliance, positioning itself as a secure choice for regulated industries.
Methodology
Data was collected on 2026-02-12 via Serper.dev web search, official documentation scraping, and GitHub/PyPI feeds. Analysis was performed using the google/gemini-3-pro-preview model via OpenRouter.