
Semantic observability: How we understand and measure AI intelligence

As AI powers more products and workflows, understanding why systems make decisions becomes just as critical as knowing that they work.

Harish Pratapani

Vice President, Software and Engineering at Zendesk

Last updated January 22, 2026

In traditional software, observability was about keeping systems running. If the dashboard graphs were green, the system was healthy.

That’s not always the case with AI.

AI systems don’t fail the way traditional code does. A model can be fully operational and still produce wrong, biased, or inconsistent output. Two identical inputs can lead to very different outcomes depending on retrieved context, models, or even subtle prompt variations.

Here’s where semantic observability comes in. Traditional observability explains what happened. Semantic observability explains why it happened.

Why this shift matters

AI operates in the space of meaning, not just metrics. Simply measuring performance is not enough to define reliability. Instead, we must ask deeper questions:

  • Why did the model make this decision?

  • What context or data influenced it?

  • Was its reasoning aligned with intent, facts, and user expectations?

How to think about observability in AI

In this new era, observability must go beyond infrastructure monitoring. It needs to capture reasoning, judgment, alignment, and how the system interprets and acts on context. The goal is no longer just measuring how a system performs, but understanding the intelligence behind it.

AI observability can be understood through four layers that work together:

1. Data observability

Data drift is often the earliest signal of changing system behavior. This layer focuses on the quality, freshness, and representativeness of data; a minimal drift check is sketched after these four layers.

2. Model observability

This layer focuses on how data turns into decisions, using reasoning signals such as confidence, sensitivity to inputs, and attention patterns.

3. Behavioral observability

This layer connects internal reasoning to real-world impact. It helps detect issues such as bias, hallucinations, or degradation, and ties those patterns directly to customer experience and fairness.

4. Semantic observability

Semantic observability adds intent and meaning to the loop. It explains why a model behaved the way it did by tracing reasoning, evaluating coherence and accuracy, and assessing alignment with goals and values.

Together, these layers form a continuous feedback loop: data shapes models, models shape behavior, and observed behavior guides improvement. Semantic observability closes the loop by adding interpretation and context.
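To make the first layer concrete, here is a minimal sketch of a data drift check using the population stability index (PSI). The window names, score distributions, and alert threshold are illustrative assumptions, not a prescribed setup.

```python
# A minimal sketch of a layer-1 data observability check: population
# stability index (PSI) between a reference window and live traffic.
# All names (reference_scores, live_scores, thresholds) are illustrative.
import numpy as np

def population_stability_index(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Compare two score distributions; larger values mean more drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # Clip empty bins to avoid division by zero and log(0).
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

# Example: retrieval relevance scores from last month vs. today (synthetic data).
reference_scores = np.random.default_rng(0).beta(5, 2, size=5_000)
live_scores = np.random.default_rng(1).beta(4, 3, size=1_000)

psi = population_stability_index(reference_scores, live_scores)
if psi > 0.2:  # a commonly used, but tunable, alert threshold
    print(f"Data drift detected (PSI={psi:.2f}) - investigate upstream data")
```

A check like this only flags that behavior is changing; the later layers are what explain whether the change matters.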

Turning AI reasoning into something we can see

Semantic observability isn’t a single product or metric, but a set of connected capabilities that make AI reasoning visible and measurable.

Reasoning traces

Reasoning traces are the intermediate steps a model takes to arrive at an answer. They capture how prompts are interpreted, which data is retrieved, and what decisions are made along the way. This allows us to distinguish between a correct answer and correct reasoning.

For example, a customer asks, “Why was my refund request denied?”

The reasoning trace might show:

  • Refunds are allowed within 30 days

  • The order was delivered 47 days ago

  • The request was submitted after the policy window

  • Decision: Refund denied because the refund window has expired

For the customer, this turns a denial into a clear, explainable outcome instead of a frustrating black box. Each step can be evaluated for retrieval precision, reasoning coherence, and factual accuracy. We don’t just know what the model decided, but how it arrived at that decision.
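As an illustration, a trace like the refund example could be captured as structured data so each step can be scored on its own. The schema and field names below are a hypothetical sketch, not a specific product format.

```python
# A minimal sketch of a reasoning trace as structured data.
# Field names, step kinds, and source identifiers are illustrative.
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    kind: str        # e.g. "retrieval", "reasoning", "decision"
    detail: str      # human-readable description of the step
    sources: list[str] = field(default_factory=list)  # evidence used

@dataclass
class ReasoningTrace:
    question: str
    steps: list[TraceStep]

trace = ReasoningTrace(
    question="Why was my refund request denied?",
    steps=[
        TraceStep("retrieval", "Refunds are allowed within 30 days",
                  sources=["refund-policy#window"]),
        TraceStep("retrieval", "The order was delivered 47 days ago",
                  sources=["order-record/delivery-date"]),
        TraceStep("reasoning", "The request was submitted after the policy window"),
        TraceStep("decision", "Refund denied because the refund window has expired"),
    ],
)

# Each step can now be evaluated independently: retrieval precision for the
# first two steps, coherence for the reasoning step, accuracy for the decision.
for step in trace.steps:
    print(f"[{step.kind}] {step.detail}")
```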

Evals

Evals are the feedback engine. They continuously measure the quality of outputs across dimensions such as factual accuracy, coherence, tone, and safety. When embedded into development and rollout workflows, evals ensure that improvements enhance reasoning quality, not just efficiency or latency.

A well-instrumented evaluation pipeline becomes the heartbeat of responsible AI. It detects semantic drift before it affects customers and provides an objective way to compare models, prompts, or retrieval strategies.

In traditional software, tests verify deterministic behavior. In AI, behavior is probabilistic and constantly evolving. It shifts with data, prompts, and context, which is why continuous evaluation is essential.
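Below is a minimal sketch of what such a pipeline could look like: each dimension gets its own scorer, and a rollout is gated on per-dimension minimums rather than a single average. The scorers here are toy placeholders for whatever graders (heuristics, models, or humans) a team actually uses.

```python
# A minimal sketch of a multi-dimension eval with a rollout gate.
# Scorer names, thresholds, and the sample response are illustrative.
from typing import Callable

Scorer = Callable[[str, str], float]  # (response, reference) -> score in [0, 1]

def keyword_accuracy(response: str, reference: str) -> float:
    """Toy factuality proxy: fraction of reference keywords present in the response."""
    keywords = reference.lower().split()
    hits = sum(1 for kw in keywords if kw in response.lower())
    return hits / max(len(keywords), 1)

def conciseness(response: str, reference: str) -> float:
    """Toy tone/coherence proxy: penalize answers far longer than the reference."""
    return 1.0 if len(response) <= 2 * max(len(reference), 1) else 0.5

EVALS: dict[str, Scorer] = {
    "factual_accuracy": keyword_accuracy,
    "conciseness": conciseness,
}

def run_evals(response: str, reference: str) -> dict[str, float]:
    return {name: scorer(response, reference) for name, scorer in EVALS.items()}

scores = run_evals(
    response="Your refund was denied because the 30 day window expired.",
    reference="refund denied 30 day window expired",
)
print(scores)

# Gate the rollout on every dimension, so a tone or safety regression cannot
# hide behind an improved accuracy score.
assert all(score >= 0.5 for score in scores.values()), "Eval regression - block rollout"
```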

Human feedback loops

Human judgment remains essential. Qualitative signals like empathy, clarity, and usefulness cannot be fully captured by automated metrics. Integrated human feedback closes the gap between machine reasoning and human perception.

But human review alone isn’t scalable. This is where LLM-as-judge comes in: using another model, usually a large LLM, to evaluate outputs against structured criteria such as reasoning quality, factuality, or tone.
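A hypothetical sketch of the pattern: a judge prompt with a structured rubric, a placeholder model call, and a rule that routes low-scoring answers to human review. The prompt wording, criteria, and the `call_llm` function are assumptions, not a fixed standard or a specific vendor API.

```python
# A minimal LLM-as-judge sketch: a second model grades an answer against
# structured criteria and returns machine-readable scores.
import json

JUDGE_PROMPT = """You are grading an AI support answer.
Question: {question}
Answer: {answer}
Score each criterion from 1 (poor) to 5 (excellent) and reply with JSON:
{{"reasoning_quality": int, "factuality": int, "tone": int, "rationale": str}}"""

def call_llm(prompt: str) -> str:
    # Placeholder: substitute a real model client here.
    return ('{"reasoning_quality": 4, "factuality": 5, "tone": 4, '
            '"rationale": "Cites the policy window correctly."}')

def judge(question: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    scores = json.loads(raw)
    # Route low-scoring items to human review instead of trusting the judge blindly.
    scores["needs_human_review"] = min(
        scores["reasoning_quality"], scores["factuality"], scores["tone"]
    ) <= 2
    return scores

print(judge("Why was my refund request denied?",
            "The refund window is 30 days and the order was delivered 47 days ago."))
```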

Automation provides scale and consistency. Humans provide context, judgment, and accountability. Together, they form a balanced evaluation loop.

Alignment and ethical metrics

Beyond accuracy, observability must also assess fairness, transparency, and trust. These metrics help ensure performance improvements don’t come at the cost of safety or values.

For engineers, observability is about control. For leaders, it’s about trust.

For customers, it’s about feeling understood, treated fairly, and confident in the outcome.

In AI systems, trust comes from understanding reasoning. When organizations can explain why their models behave the way they do, they move from reactive monitoring to proactive intelligence. That transparency is the difference between AI that works and AI that is trusted.

Putting it into practice

As AI systems grow more capable, we need to build systems that are not only powerful, but understandable and accountable. While observability has always been the foundation of reliable systems, it must now play a more critical role.

Semantic observability shows us how intelligence operates: how systems interpret context, reason through decisions, and act. It creates a feedback system that ultimately builds accountability and trust.

Here at Zendesk, we’ve developed a platform that supports both online and offline evaluations, including A/B testing during rollouts. This enables us to deliver on our promise of a best-in-class AI platform for CX, one that’s reliable, accurate, and serves the needs of your customers and your business.

Harish Pratapani

Vice President, Software and Engineering at Zendesk

Harish Pratapani is Zendesk’s Vice President of AI Platform and Infrastructure, where he leads the development of next-generation AI systems that power Zendesk’s AI-driven customer experiences. Before joining Zendesk, he spent many years at Google, where he drove technical strategy for the Google Ads platform and built the product and infrastructure systems behind several Google products. He brings deep expertise in distributed systems and applied AI engineering.
