Weekly LLM Observability Market Research Report
Date: 2026-02-13 | Model: google/gemini-3-pro-preview | Data Collected: 2026-02-13
1. Executive Summary
- Langfuse introduced rendering support for ‘thinking’ traces to visualize reasoning steps, while Arize Phoenix released specific metrics for evaluating tool selection and faithfulness in agentic workflows.
- MLflow launched version 3.10, introducing Organization Support to enable multi-workspace management and access control within large enterprise environments.
- LangSmith released Self-Hosted v0.13 to improve infrastructure stability for on-premise deployments while maintaining its streaming support and agent execution graph visualizations.
- Braintrust added a dedicated ‘Review’ span type to structure human-in-the-loop quality control workflows alongside its existing ‘Loop’ AI evaluator assistant.
- W&B Weave confirmed SOC 2 Type II and HIPAA compliance for its VPC options and distinguished its agent observability capabilities with native Model Context Protocol (MCP) integration.
Market Insight: Arize Phoenix’s specialized agentic metrics and Langfuse’s ‘thinking’ trace support directly challenge the depth of Weave’s agent observability. To maintain technical differentiation, Weave needs to keep leveraging its exclusive Model Context Protocol (MCP) integration.
2. New Features (Last 30 Days)
W&B Weave
- Audio Monitors: Create monitors that observe and judge audio outputs alongside text, enabling evaluation of voice agents. (2026-02-01, Evaluation & Quality)
- Dynamic Leaderboards: Auto-generated leaderboards from evaluations with filtering and customization, replacing manual setup. (2026-01-29, Evaluation & Quality)
- Custom LoRAs in Playground: Support for testing and evaluating custom fine-tuned LoRA weights directly in the Weave Playground. (2026-01-16, Development Lifecycle)
LangSmith
- Customize trace previews: Ability to customize how traces are previewed in the LangSmith UI. (2026-02-06, Core Tracing & Logging)
- Non-otel Google ADK wrapper: New wrapper for Google ADK integration that does not depend on OpenTelemetry. (2026-02-02, Integration & DX)
- Google Gen AI wrapper: Exported wrapper for Google Generative AI integration. (2026-01-31, Integration & DX)
- Gemini TS wrapper: Beta TypeScript wrapper for Gemini models. (2026-01-26, Integration & DX)
- LangSmith Self-Hosted v0.13: Update to the self-hosted version of the platform. (2026-01-16, Enterprise & Infrastructure)
Langfuse
- LLM-as-a-Judge on Observations: Added support for running LLM-as-a-judge evaluations directly on specific observations for more granular quality control. (2026-02-13, Evaluation & Quality)
- Single Observation Evals: Enables running evaluations on single observations rather than only on full traces. (2026-02-08, Evaluation & Quality)
- Thinking/Reasoning Trace Rendering: New trace detail rendering for ‘thinking’ and ‘reasoning’ parts, supporting Chain-of-Thought models like DeepSeek. (2026-02-05, Core Tracing & Logging)
- Inline Trace Comments: Allows users to comment inline on selected portions of input/output data within a trace, improving collaboration. (2026-01-25, Integration & DX)
Braintrust
- Thread Retrieval API: Added capability to retrieve threads programmatically in the Python SDK. (2026-02-12, Integration & DX)
- Sub-agent Nesting: Added support for sub-agent nesting in the Claude Agent SDK wrapper. (2026-02-12, Agent & RAG Specifics)
- Review Span Type: Introduced a specific ‘Review’ span type to support human review workflows. (2026-02-05, Evaluation & Quality)
- Classifications Field: Added a classifications field to SDKs for enhanced metadata tagging. (2026-01-31, Core Tracing & Logging)
- Trace Scoring Candidate: New functionality for scoring traces directly within the Python SDK. (2026-01-21, Evaluation & Quality)
MLflow
- Organization Support: Support for multi-workspace environments allowing organization of experiments and resources across different workspaces. (2026-02-12, Enterprise & Infrastructure)
- MLflow Assistant: In-product chatbot backed by Claude Code to help identify, diagnose, and fix issues directly within the UI. (2026-01-29, Integration & DX)
Arize Phoenix
- Claude Opus 4.6 Support: Added support for the Claude Opus 4.6 model in the playground. (2026-02-09, Development Lifecycle)
- Tool Selection Evaluator: New evaluator added to assess the quality of tool selection in agentic workflows. (2026-02-06, Evaluation & Quality)
- Faithfulness Evaluator: Introduced FaithfulnessEvaluator (deprecating HallucinationEvaluator) for checking groundedness. (2026-02-02, Evaluation & Quality)
- Tool Invocation Accuracy Metric: New metric to track the accuracy of tool invocations. (2026-02-02, Evaluation & Quality)
- Configurable Email Extraction: Added EMAIL_ATTRIBUTE_PATH for configurable email extraction in OAuth2. (2026-01-28, Enterprise & Infrastructure)
- Cursor Rule for Metrics: Added a Cursor rule for creating new built-in metrics (LLM classification evaluators). (2026-01-21, Evaluation & Quality)
3. Positioning Shift
| Product | Current | Moving Toward | Signal |
|---|---|---|---|
| W&B Weave | A code-first, rigorous evaluation and observability platform for developers building complex agentic systems. | Expanding multimodal support and bridging the gap between offline experimentation and online production monitoring. | Recent release of Audio Monitors and Dynamic Leaderboards reinforces the focus on comprehensive, automated evaluation across modalities. |
| LangSmith | The definitive observability and evaluation platform for the LangChain ecosystem and complex agentic applications. | Expanding beyond LangChain to become a universal LLM DevOps platform with broader model support (Google/Gemini) and enhanced enterprise self-hosting. | Recent release of framework-agnostic Google Gen AI wrappers and continued updates to the self-hosted enterprise version. |
| Langfuse | A developer-centric, open-source observability and evaluation platform favored for its strong framework integrations and self-hosting capabilities. | Deepening evaluation granularity and supporting complex reasoning models (CoT) to cater to advanced agentic workflows. | Recent updates focusing on ‘thinking’ trace rendering and granular ‘observation-level’ evaluations. |
| Braintrust | The premier ‘eval-centric’ development platform for enterprise engineering teams. | Deepening support for complex agentic workflows and human-in-the-loop review processes. | Recent updates adding sub-agent nesting, thread retrieval APIs, and dedicated ‘Review’ span types. |
| MLflow | The open-source standard for MLOps now offering a competitive, integrated suite for GenAI tracing and evaluation. | Deepening enterprise readiness with multi-workspace support and enhancing developer experience via AI-assisted debugging. | Release of Organization Support (v3.10) and MLflow Assistant (v3.9) in early 2026. |
| Arize Phoenix | The premier open-source, code-first observability platform for AI engineers building complex RAG and agentic applications. | Deeper, more specialized evaluation capabilities for agents and tools, reinforcing its role as a technical workbench rather than a non-technical CMS. | Recent releases of specialized evaluators for tool selection, faithfulness, and tool invocation accuracy show a clear focus on agent reliability. |
4. Enterprise Signals
- MLflow introduced Organization Support (v3.10) to enable multi-workspace management for large teams.
- LangSmith released Self-Hosted v0.13, reinforcing its commitment to on-premise enterprise stability.
- Braintrust added a specific ‘Review’ span type to formalize human-in-the-loop workflows for enterprise quality control.
- W&B Weave confirmed SOC 2 Type II, HIPAA, and GDPR compliance alongside VPC deployment options.
- Langfuse continues to offer a fully self-hostable MIT-licensed version, appealing to enterprises requiring total data sovereignty.
Methodology
Data was collected on 2026-02-13 via GitHub/PyPI feeds and documentation scraping. Category analysis was performed using Perplexity Sonar (web search + analysis). Synthesis was performed using the google/gemini-3-pro-preview model via OpenRouter.
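The "Last 30 Days" window used in section 2 can be illustrated with a minimal sketch. This is not the pipeline that produced the report: the function name `in_window`, the inclusive boundaries, and the third sample entry are assumptions made for illustration only; the collection date and the two real entries are taken from the report.

```python
from datetime import date, timedelta

COLLECTED_ON = date(2026, 2, 13)  # collection date stated in the report
WINDOW_DAYS = 30                  # "Last 30 Days" window from section 2


def in_window(released: date,
              collected_on: date = COLLECTED_ON,
              window_days: int = WINDOW_DAYS) -> bool:
    """Return True if a release date falls inside the reporting window.

    Assumption: both ends of the window are inclusive.
    """
    earliest = collected_on - timedelta(days=window_days)
    return earliest <= released <= collected_on


# (product, feature, release date); the last row is a hypothetical entry
# added only to show an item that falls outside the window.
entries = [
    ("Langfuse", "LLM-as-a-Judge on Observations", date(2026, 2, 13)),
    ("W&B Weave", "Custom LoRAs in Playground", date(2026, 1, 16)),
    ("Example Vendor", "Hypothetical older feature", date(2026, 1, 10)),
]

# Keep only the entries released within the window.
recent = [entry for entry in entries if in_window(entry[2])]

for product, feature, released in recent:
    print(f"{released.isoformat()}  {product}: {feature}")
```

Under these assumptions the window runs from 2026-01-14 through 2026-02-13, which is consistent with the earliest dated entries in section 2 (2026-01-16).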