
From Inference to Governance: Why Agent Metadata Matters When LLMs Already Understand Your Data

15.5.2026 | 6 minutes reading time


Modern LLMs demonstrate strong capability in inferring meaning from column names. A tool such as Genie can typically resolve pct_cust_attrit_q to "churn" or map rev_mrr_usd to "MRR" through pattern recognition alone. On a small, well-structured table, inference produces correct results in the majority of cases.

However, "the majority of cases" does not constitute a governance standard. As schemas scale to hundreds of columns, as multiple columns become plausible matches for a single query, and as different departments adopt divergent terminology for the same metric, inference alone is no longer sufficient. Databricks Agent Metadata addresses this gap, not by enabling AI tools to interpret data for the first time, but by ensuring they do so consistently, correctly, and at scale. Agent Metadata requires Databricks Runtime 17.3 and version 1.1 of the metric view YAML specification.

Where Inference Falls Short

On a small table with a handful of descriptive columns, an LLM can often infer the correct mapping through pattern recognition. Consider the following metrics table:

Column              Example Value
rev_mrr_usd         48250.00
pct_cust_attrit_q   0.034
n_active_subs       1205
dt_cohort_start     2025-01-15

Genie can reasonably deduce that rev_mrr_usd relates to MRR and that pct_cust_attrit_q involves customer attrition. With a limited number of columns and recognizable abbreviations, inference produces adequate results.

Production schemas, however, rarely present this level of clarity. When a table contains rev_mrr_usd, rev_nrr_usd, rev_grr_usd, rev_arr_usd, and rev_exp_usd, a query referencing "revenue" could plausibly match any of them. When Finance refers to a metric as "net retention" while Product uses "expansion revenue", both expecting resolution to distinct columns, the model has no basis for disambiguation. When a schema spans hundreds of columns across multiple business domains, the probability of incorrect resolution grows with every additional candidate column.

In these scenarios, inference becomes unreliable, not due to a limitation in model capability, but because ambiguity, scale, and terminological inconsistency exceed what pattern recognition alone can resolve.

Agent Metadata: Embedding Business Context into Data Definitions

Agent Metadata in Unity Catalog enables organizations to attach business context directly to their data definitions through three mechanisms: display names, synonyms, and format specifications. This metadata is governed within Unity Catalog and automatically consumed by downstream tools including dashboards and AI assistants.

The following example demonstrates a metric view definition for a SaaS metrics model:

version: 1.1

source: analytics.saas.subscription_metrics

dimensions:
  - name: dt_cohort_start
    expr: dt_cohort_start
    display_name: 'Cohort Start Date'
    synonyms:
      - signup date
      - cohort date
      - when they signed up

  - name: plan_tier
    expr: plan_tier
    display_name: 'Plan Tier'
    synonyms:
      - pricing plan
      - subscription level
      - plan type

  - name: region_code
    expr: region_code
    display_name: 'Region'
    synonyms:
      - geography
      - market
      - territory

measures:
  - name: rev_mrr_usd
    expr: SUM(rev_mrr_usd)
    display_name: 'Monthly Recurring Revenue'
    synonyms:
      - MRR
      - monthly revenue
      - recurring revenue
    format:
      type: currency
      currency_code: USD
      decimal_places:
        type: exact
        places: 2

  - name: pct_cust_attrit_q
    expr: AVG(pct_cust_attrit_q)
    display_name: 'Quarterly Churn Rate'
    synonyms:
      - churn
      - attrition rate
      - customer churn
      - churn rate
    format:
      type: percentage
      decimal_places:
        type: exact
        places: 1

  - name: n_active_subs
    expr: SUM(n_active_subs)
    display_name: 'Active Subscriptions'
    synonyms:
      - active customers
      - subscriber count
      - active users
    format:
      type: number
      decimal_places:
        type: exact
        places: 0

This definition leverages three complementary metadata capabilities:

Display names replace technical column names with human-readable labels across dashboards and reports. For example, rev_mrr_usd is presented as "Monthly Recurring Revenue" in any downstream visualization.

Synonyms enable AI discoverability. When a user asks Genie "What's our churn?", the synonym mapping resolves that term to the correct measure. Each dimension or measure supports up to 10 synonyms of up to 255 characters, providing sufficient coverage for the terminology variations that exist across teams and departments.
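Because these limits are enforced at the catalog level, it can help to check them before deploying a metric view definition. The following is a minimal illustrative sketch of such a pre-deployment check, not a Databricks API; the function name and structure are assumptions for illustration only.

```python
# Hypothetical pre-deployment check mirroring the documented limits:
# at most 10 synonyms per dimension or measure, each at most 255 characters.
MAX_SYNONYMS = 10
MAX_SYNONYM_LENGTH = 255

def validate_synonyms(field_name, synonyms):
    """Return a list of limit violations for one dimension or measure."""
    errors = []
    if len(synonyms) > MAX_SYNONYMS:
        errors.append(
            f"{field_name}: {len(synonyms)} synonyms exceeds limit of {MAX_SYNONYMS}"
        )
    for s in synonyms:
        if len(s) > MAX_SYNONYM_LENGTH:
            errors.append(
                f"{field_name}: synonym '{s[:30]}...' exceeds {MAX_SYNONYM_LENGTH} characters"
            )
    return errors

# A compliant field produces no violations.
print(validate_synonyms("rev_mrr_usd", ["MRR", "monthly revenue", "recurring revenue"]))
```

Running such a check in CI keeps synonym lists within the enforced limits before the definition ever reaches Unity Catalog.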

Format specifications define how values are rendered in visualization tools. Churn is displayed as 3.4% rather than 0.034, and MRR as $48,250.00 rather than a raw numeric value. These formatting rules propagate automatically to all dashboards built on the metric view.
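To make the effect of the format block concrete, the following sketch mimics how a rendering tool could apply the specifications above. This is an illustration of the described behavior, not Databricks rendering code; the `render` function and its dictionary argument are assumptions for this example.

```python
# Illustrative renderer applying format specifications like those in the
# metric view above. Real rendering is done by Databricks tools; this only
# mirrors the behavior described in the text (percentage, currency, number).
def render(value, fmt):
    places = fmt.get("places", 2)
    if fmt["type"] == "percentage":
        return f"{value * 100:.{places}f}%"
    if fmt["type"] == "currency":
        return f"${value:,.{places}f}"  # assumes USD for brevity
    return f"{value:,.{places}f}"      # plain number with thousands separator

print(render(0.034, {"type": "percentage", "places": 1}))   # 3.4%
print(render(48250.00, {"type": "currency", "places": 2}))  # $48,250.00
print(render(1205, {"type": "number", "places": 0}))        # 1,205
```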

The Difference in Practice: Where Metadata Changes the Outcome

On a simple schema, Genie's inference is often sufficient to resolve common abbreviations and return correct results without any metadata. The value of Agent Metadata becomes apparent in three areas: ambiguity resolution, presentation quality, and consistency across teams.

Ambiguity Resolution

Consider the query: "Show me MRR by pricing plan"

On the small table above, Genie resolves this correctly without metadata. The abbreviation mrr is present in the column name, and plan_tier is the only plausible grouping column. There is no ambiguity to resolve.

Now consider a production schema containing the following revenue columns:

Column        Description
rev_mrr_usd   Monthly Recurring Revenue
rev_nrr_usd   Net Revenue Retention
rev_grr_usd   Gross Revenue Retention
rev_arr_usd   Annual Recurring Revenue
rev_exp_usd   Expansion Revenue

A broader query, "Show me revenue by plan", now presents five plausible matches. Without metadata, the model must infer which revenue column the user intends, with no mechanism to guarantee a correct selection. With synonyms explicitly mapping "MRR" to rev_mrr_usd and "expansion revenue" to rev_exp_usd, the resolution becomes deterministic.
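The reason this resolution is deterministic can be sketched in a few lines: a governed synonym mapping is an explicit lookup table, not a model inference. The dictionary below is a hypothetical, simplified stand-in for the mapping Unity Catalog maintains; the column names mirror the production schema above.

```python
# Simplified stand-in for a governed synonym mapping: resolution is a
# dictionary lookup, so the same term always yields the same column.
SYNONYMS = {
    "mrr": "rev_mrr_usd",
    "monthly revenue": "rev_mrr_usd",
    "net retention": "rev_nrr_usd",
    "gross retention": "rev_grr_usd",
    "arr": "rev_arr_usd",
    "expansion revenue": "rev_exp_usd",
}

def resolve(term):
    """Resolve a business term to a column, or None if no governed mapping exists."""
    return SYNONYMS.get(term.strip().lower())

print(resolve("MRR"))                # rev_mrr_usd
print(resolve("Expansion Revenue"))  # rev_exp_usd
print(resolve("revenue"))            # None: still ambiguous without a synonym
```

Note the last case: a term with no synonym entry stays unresolved rather than being silently guessed, which is exactly the failure mode inference cannot rule out.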

Presentation Quality

Even when Genie resolves the correct column without metadata, the output quality differs significantly.

Without Agent Metadata:

pct_cust_attrit_q   region_code
0.034               NA
0.051               EMEA
0.028               APAC

With Agent Metadata:

Region   Quarterly Churn Rate
NA       3.4%
EMEA     5.1%
APAC     2.8%

The underlying data is identical. However, display names replace technical column headers, and format specifications render 0.034 as 3.4%. This represents the minimum improvement that Agent Metadata provides, independent of whether inference would have succeeded. For stakeholders consuming results in dashboards or Genie responses, this distinction is not cosmetic: a table of raw decimals requires the reader to infer the unit and context, while a properly formatted table communicates both immediately.

Consistency Across Teams

The most significant benefit of Agent Metadata is one that does not surface in a single-user test. When Finance queries "net retention" and Product queries "expansion revenue", synonyms ensure both terms resolve to their respective correct columns. Without metadata, both queries rely on the LLM's interpretation, which may vary depending on phrasing, context window content, or model version.

Agent Metadata eliminates this variability. The mapping is explicit, governed, and version-controlled within Unity Catalog. Every user, across every tool, resolves the same term to the same column, not because the model inferred correctly, but because the definition is authoritative.

The Bigger Picture

Agent Metadata does not address a problem that is immediately visible on a small, well-structured dataset. An LLM will often produce correct results without it. This can lead organizations to underestimate its value, until schema complexity increases, teams scale, or a quarterly report surfaces a discrepancy traceable to an ambiguous column resolution.

The value of Agent Metadata is structural. It elevates the semantic layer from a presentation concern, historically managed at the BI tool level, to a governed component of the data catalog. Business meaning is defined once within Unity Catalog, version-controlled, subject to access policies, and consumed automatically by every downstream tool: dashboards, Genie, notebooks, and any future integration.

For organizations operating at scale, this represents the difference between an AI tool that produces correct results in most cases and one that does so reliably. Synonyms eliminate ambiguity. Display names ensure readability. Format specifications enforce consistent presentation. None of these mechanisms depend on model inference, and none degrade as schemas grow in complexity.

The question organizations should consider is not whether an LLM can interpret their data without metadata; in many cases, it can. The question is whether inference alone provides a sufficient foundation for organizational reporting and decision-making.
