Modern LLMs demonstrate strong capability in inferring meaning from column names. A tool such as Genie can typically resolve pct_cust_attrit_q to "churn" or map rev_mrr_usd to "MRR" through pattern recognition alone. On a small, well-structured table, inference produces correct results in the majority of cases.
However, "the majority of cases" does not constitute a governance standard. As schemas scale to hundreds of columns, as multiple columns become plausible matches for a single query, and as different departments adopt divergent terminology for the same metric, inference alone is no longer sufficient. Databricks Agent Metadata addresses this gap, not by enabling AI tools to interpret data for the first time, but by ensuring they do so consistently, correctly, and at scale. Agent Metadata requires Databricks Runtime 17.3 and YAML version 1.1.
Where Inference Falls Short
On a small table with six descriptive columns, an LLM can often infer the correct mapping through pattern recognition. Consider the following metrics table:
| Column | Example Value |
|---|---|
| rev_mrr_usd | 48250.00 |
| pct_cust_attrit_q | 0.034 |
| n_active_subs | 1205 |
| dt_cohort_start | 2025-01-15 |
Genie can reasonably deduce that rev_mrr_usd relates to MRR and that pct_cust_attrit_q involves customer attrition. With a limited number of columns and recognizable abbreviations, inference produces adequate results.
Production schemas, however, rarely present this level of clarity. When a table contains rev_mrr_usd, rev_nrr_usd, rev_grr_usd, rev_arr_usd, and rev_exp_usd, a query referencing "revenue" could plausibly match any of them. When Finance refers to a metric as "net retention" while Product uses "expansion revenue", both expecting resolution to distinct columns, the model has no basis for disambiguation. When a schema spans hundreds of columns across multiple business domains, the probability of incorrect resolution increases proportionally.
In these scenarios, inference becomes unreliable, not due to a limitation in model capability, but because ambiguity, scale, and terminological inconsistency exceed what pattern recognition alone can resolve.
Agent Metadata: Embedding Business Context into Data Definitions
Agent Metadata in Unity Catalog enables organizations to attach business context directly to their data definitions through three mechanisms: display names, synonyms, and format specifications. This metadata is governed within Unity Catalog and automatically consumed by downstream tools including dashboards and AI assistants.
The following example demonstrates a metric view definition for a SaaS metrics model:
```yaml
version: 1.1

source: analytics.saas.subscription_metrics

dimensions:
  - name: dt_cohort_start
    expr: dt_cohort_start
    display_name: 'Cohort Start Date'
    synonyms:
      - signup date
      - cohort date
      - when they signed up

  - name: plan_tier
    expr: plan_tier
    display_name: 'Plan Tier'
    synonyms:
      - pricing plan
      - subscription level
      - plan type

  - name: region_code
    expr: region_code
    display_name: 'Region'
    synonyms:
      - geography
      - market
      - territory

measures:
  - name: rev_mrr_usd
    expr: SUM(rev_mrr_usd)
    display_name: 'Monthly Recurring Revenue'
    synonyms:
      - MRR
      - monthly revenue
      - recurring revenue
    format:
      type: currency
      currency_code: USD
      decimal_places:
        type: exact
        places: 2

  - name: pct_cust_attrit_q
    expr: AVG(pct_cust_attrit_q)
    display_name: 'Quarterly Churn Rate'
    synonyms:
      - churn
      - attrition rate
      - customer churn
      - churn rate
    format:
      type: percentage
      decimal_places:
        type: exact
        places: 1

  - name: n_active_subs
    expr: SUM(n_active_subs)
    display_name: 'Active Subscriptions'
    synonyms:
      - active customers
      - subscriber count
      - active users
    format:
      type: number
      decimal_places:
        type: exact
        places: 0
```
This definition leverages three complementary metadata capabilities:
Display names replace technical column names with human-readable labels across dashboards and reports. For example, rev_mrr_usd is presented as "Monthly Recurring Revenue" in any downstream visualization.
Synonyms enable AI discoverability. When a user asks Genie "What's our churn?", the synonym mapping resolves that term to the correct measure. Each dimension or measure supports up to 10 synonyms of up to 255 characters, providing sufficient coverage for the terminology variations that exist across teams and departments.
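These limits are easy to enforce before a definition is deployed. The following sketch validates a parsed metric-view definition against the constraints stated above (up to 10 synonyms, each up to 255 characters); the dict literals mirror the YAML example, while the validation logic itself is illustrative, not part of any Databricks API.

```python
# Sketch: check dimension/measure entries against the documented synonym
# limits (at most 10 synonyms, each at most 255 characters).
MAX_SYNONYMS = 10
MAX_SYNONYM_LENGTH = 255

def validate_synonyms(fields):
    """Return a list of violation messages for dimension/measure entries."""
    problems = []
    for field in fields:
        synonyms = field.get("synonyms", [])
        if len(synonyms) > MAX_SYNONYMS:
            problems.append(f"{field['name']}: too many synonyms ({len(synonyms)})")
        for s in synonyms:
            if len(s) > MAX_SYNONYM_LENGTH:
                problems.append(f"{field['name']}: synonym too long: {s[:30]}...")
    return problems

# Entries taken from the metric view example above.
measures = [
    {"name": "rev_mrr_usd", "synonyms": ["MRR", "monthly revenue", "recurring revenue"]},
    {"name": "pct_cust_attrit_q", "synonyms": ["churn", "attrition rate", "customer churn", "churn rate"]},
]

print(validate_synonyms(measures))  # → []
```

A check like this fits naturally into a CI step that lints metric-view YAML before it reaches Unity Catalog.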
Format specifications define how values are rendered in visualization tools. Churn is displayed as 3.4% rather than 0.034, and MRR as $48,250.00 rather than a raw numeric value. These formatting rules propagate automatically to all dashboards built on the metric view.
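To make the effect of the format block concrete, here is a minimal sketch of how a downstream tool might apply those specifications. The spec dicts mirror the YAML above; the rendering logic is an illustration of the behavior, not Databricks' actual implementation.

```python
# Sketch: apply a metric-view format spec to a raw value.
def render(value, spec):
    places = spec["decimal_places"]["places"]
    if spec["type"] == "percentage":
        return f"{value * 100:.{places}f}%"
    if spec["type"] == "currency":
        return f"${value:,.{places}f}"  # simplified: assumes USD symbol
    return f"{value:,.{places}f}"

# Specs copied from the metric view example above.
churn_spec = {"type": "percentage", "decimal_places": {"type": "exact", "places": 1}}
mrr_spec = {"type": "currency", "currency_code": "USD",
            "decimal_places": {"type": "exact", "places": 2}}

print(render(0.034, churn_spec))   # → 3.4%
print(render(48250.00, mrr_spec))  # → $48,250.00
```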
The Difference in Practice: Where Metadata Changes the Outcome
On a simple schema, Genie's inference is often sufficient to resolve common abbreviations and return correct results without any metadata. The value of Agent Metadata becomes apparent in three areas: ambiguity resolution, presentation quality, and consistency across teams.
Ambiguity Resolution
Consider the query: "Show me MRR by pricing plan"
On the six-column table above, Genie resolves this correctly without metadata. The abbreviation mrr is present in the column name, and plan_tier is the only plausible grouping column. There is no ambiguity to resolve.
Now consider a production schema containing the following revenue columns:
| Column | Description |
|---|---|
| rev_mrr_usd | Monthly Recurring Revenue |
| rev_nrr_usd | Net Revenue Retention |
| rev_grr_usd | Gross Revenue Retention |
| rev_arr_usd | Annual Recurring Revenue |
| rev_exp_usd | Expansion Revenue |
The same query, "Show me revenue by plan", now presents five plausible matches. Without metadata, the model must infer which revenue column the user intends, with no mechanism to guarantee a correct selection. With synonyms explicitly mapping "MRR" to rev_mrr_usd and "expansion revenue" to rev_exp_usd, the resolution becomes deterministic.
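The difference between the two resolution modes can be sketched in a few lines. An explicit synonym map returns exactly one governed answer, while a crude pattern match over the same columns returns five candidates. Column names come from the table above; the resolver itself is illustrative, not how Genie is implemented.

```python
# Sketch: deterministic synonym lookup vs. ambiguous pattern matching.
COLUMNS = ["rev_mrr_usd", "rev_nrr_usd", "rev_grr_usd", "rev_arr_usd", "rev_exp_usd"]

# Explicit, governed mappings (a subset, mirroring the metric view synonyms).
SYNONYMS = {
    "mrr": "rev_mrr_usd",
    "monthly recurring revenue": "rev_mrr_usd",
    "net retention": "rev_nrr_usd",
    "expansion revenue": "rev_exp_usd",
}

def resolve(term):
    """Try the deterministic synonym map first; otherwise fall back to inference."""
    key = term.lower()
    if key in SYNONYMS:
        return [SYNONYMS[key]]  # exactly one governed answer
    # Crude stand-in for pattern-based inference: substring match on the first word.
    return [c for c in COLUMNS if key.split()[0][:3] in c]

print(resolve("expansion revenue"))  # → ['rev_exp_usd']
print(resolve("revenue"))            # → all five columns: ambiguous
```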
Presentation Quality
Even when Genie resolves the correct column without metadata, the output quality differs significantly.
Without Agent Metadata:
| pct_cust_attrit_q | region_code |
|---|---|
| 0.034 | NA |
| 0.051 | EMEA |
| 0.028 | APAC |
With Agent Metadata:
| Region | Quarterly Churn Rate |
|---|---|
| NA | 3.4% |
| EMEA | 5.1% |
| APAC | 2.8% |
The underlying data is identical. However, display names replace technical column headers, and format specifications render 0.034 as 3.4%. This represents the minimum improvement that Agent Metadata provides, independent of whether inference would have succeeded. For stakeholders consuming results in dashboards or Genie responses, this distinction is not cosmetic: a table of raw decimals requires the reader to infer the unit and context, whereas a properly formatted table communicates both immediately.
Consistency Across Teams
The most significant benefit of Agent Metadata is one that does not surface in a single-user test. When Finance queries "net retention" and Product queries "expansion revenue", synonyms ensure both terms resolve to their respective correct columns. Without metadata, both queries rely on the LLM's interpretation, which may vary depending on phrasing, context window content, or model version.
Agent Metadata eliminates this variability. The mapping is explicit, governed, and version-controlled within Unity Catalog. Every user, across every tool, resolves the same term to the same column, not because the model inferred correctly, but because the definition is authoritative.
The Bigger Picture
Agent Metadata does not address a problem that is immediately visible on a small, well-structured dataset. An LLM will often produce correct results without it. This can lead organizations to underestimate its value, until schema complexity increases, teams scale, or a quarterly report surfaces a discrepancy traceable to an ambiguous column resolution.
The value of Agent Metadata is structural. It elevates the semantic layer from a presentation concern, historically managed at the BI tool level, to a governed component of the data catalog. Business meaning is defined once within Unity Catalog, version-controlled, subject to access policies, and consumed automatically by every downstream tool: dashboards, Genie, notebooks, and any future integration.
For organizations operating at scale, this represents the difference between an AI tool that produces correct results in most cases and one that does so reliably. Synonyms eliminate ambiguity. Display names ensure readability. Format specifications enforce consistent presentation. None of these mechanisms depend on model inference, and none degrade as schemas grow in complexity.
The question organizations should consider is not whether an LLM can interpret their data without metadata; in many cases, it can. The question is whether inference alone provides a sufficient foundation for organizational reporting and decision-making.
Blog author
Niklas Niggemann
Working Student Data & AI
Do you still have questions? Just send me a message.