Talk to your Data Part 2: Limits and Performance Enhancements

19.2.2026 | 8 minutes reading time

In part one of this series, we introduced the MotherDuck MCP server in combination with opencode and showcased initial context engineering. We also showed deeper knowledge retrieval using natural language instead of SQL. In this article, we dive further into these retrieval capabilities and test the limits of prompting and precise formulation when analyzing data from a chat interface.

For context: we will be using the same weather dataset for Munich covering the last ten years (2015-2025), along with sensor measurements that monitor cycling activity in Munich over the same period.

Finding Edge Cases and Forcing Hallucinations

We have probably all dealt with ChatGPT, Claude or Gemini hallucinating or presenting incorrect facts with complete confidence. This makes it hard to separate fact from fiction when the underlying data or sources are not available to us. We wanted to test the robustness of our setup and find information in a dataset we do not know well. This would show us how quickly we can get familiar with new data and whether we can verify the claims made about our data afterwards.

We started where we left off in the first part, having created a baseline context and analyzed heat development during the summer months. Next, we decided to explore the connection between weather and cycling activity. To avoid overloading the context, we limited our prompt to the year 2025.

The prompt: "Use the MotherDuck query tool to find a connection between weather and cycling activity for the year 2025. Identify the columns for dates, bike counts, and locations in the sensor data along with the temperature, rain, and wind columns in the weather data. Look for relationships between daily bike traffic and weather conditions throughout the year. Identify specific tipping points like a certain temperature or amount of rain where cycling volume changes significantly. Present the findings in a table with a plain text summary explaining how weather patterns drove cycling activity in 2025."

The agent executed seven queries and took about one minute to complete. It correctly identified the daily weather aggregations table and used it instead of the hourly weather data to avoid unnecessary queries. The queries still contained redundancy, however: CTEs were frequently recreated within individual queries because results from previous queries were not available to the agent.

The first major issue appeared in the results. The database contained datasets for multiple locations, and the agent chose the wrong table as its source of weather data, linking weather observations from other parts of Germany to the cycling activity in Munich. After explicitly stating that the Munich weather dataset should be used and that the cycling dataset measures activity in Munich, we received different results.

The response contained several tables showing correlations between temperature, observed rainfall, the average number of cyclists under these conditions, minimum and maximum values, and an indication of how much the observed cyclist counts varied under those conditions.

We noticed that the agent did not correctly account for the fact that several stations measure activity for the same day. For example, it reported 137 days below zero degrees in Munich. This happened even after we had built up a context window and asked the agent to inspect the cycling activity tables beforehand.
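
A plausible explanation (our assumption, the agent did not state this) is that it counted station-day rows rather than distinct days: with five stations reporting per day, one sub-zero day can turn into five. A minimal sketch with hypothetical table and column names illustrates the difference:

-- Hypothetical schema: bike_counts_2025(measurement_date, station, bike_count)
--                      weather_munich_daily(measurement_date, temp_min)

-- Inflated count: after the join, every station contributes one row per cold day.
SELECT COUNT(*) AS cold_days_inflated
FROM bike_counts_2025 b
JOIN weather_munich_daily w USING (measurement_date)
WHERE w.temp_min < 0;

-- Correct count: deduplicate by date before counting.
SELECT COUNT(DISTINCT measurement_date) AS cold_days
FROM weather_munich_daily
WHERE temp_min < 0;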

This episode clearly shows that sufficient annotation is necessary to prevent hallucinations, assumptions, and guesswork during ad hoc analysis using natural language. Even with prior context priming, the agent ran many queries and generated context and content that it reported to us with high confidence and nice formatting. However, it fell apart when we asked follow-up questions like "Are there really so many days below zero degrees Celsius in Munich in 2025?".

Even after several corrections, the report remained inaccurate. This led us to conclude that very specific questions on aggregated data, or general exploration tasks at the table level, are better use cases than sophisticated analyses that require extensive meta knowledge about the data or detailed annotations for the agent to inspect.

Testing with Annotated Data

DuckDB (and therefore MotherDuck) offers the ability to add comments to tables, columns and views. You can add comments by running SQL in the MotherDuck web app or via a custom script that uses the DuckDB client and establishes a connection to MotherDuck with a token. We added comments to each table and column to potentially increase retrieval performance. The comments described the general purpose of the tables and the contents of each column.
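
As an illustration, adding such comments might look like the following sketch; the table and column names are placeholders for our dataset, and the same statements can be run in the MotherDuck web app or through a DuckDB client connected to MotherDuck with a token:

-- Placeholder names; adapt to your own schema.
COMMENT ON TABLE bike_counts_2025 IS
    'Daily bike counts for Munich in 2025, one row per counting station and day.';
COMMENT ON COLUMN bike_counts_2025.station IS
    'Name of the counting station; five stations report values for the same day.';
COMMENT ON COLUMN bike_counts_2025.bike_count IS
    'Number of bikes registered at the station on that day.';
COMMENT ON TABLE weather_munich_daily IS
    'Daily weather aggregations for Munich (2015-2025), one row per day.';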

To make the analysis comparable despite the ambiguity of natural language and the varying behavior of generative AI models, we recreated the dataset with annotations and started a new session with the agent. We rebuilt the same context window with the same prompts up to the point where retrieval performance had degraded earlier. Our goal was to drastically improve performance and prevent the assumptions that had appeared before. The problematic query improved somewhat without additional intervention, but the agent focused on only one of the measuring stations even though five exist in the dataset. This decision changed the result drastically and omitted information about the other measurement stations, resulting in an incomplete analysis.

We followed up with this prompt to clarify our intent: "Can you check the comments on the 2025 bike activity table?" Only after explicitly mentioning the comments and forcing the agent to inspect them did the results become accurate and insightful. We learned that cycling activity increased when the temperature rose above 15 degrees and that this increase was identical across all stations. Furthermore, rain has a significant impact on cycling activity but is tolerated up to a certain point. We validated these findings against the actual data.
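
For reference, the comments the agent was asked to inspect can also be read directly through DuckDB's metadata functions, assuming a DuckDB version recent enough to expose comments there (0.10 or later); the table name below is again a placeholder:

-- Table-level comment.
SELECT table_name, comment
FROM duckdb_tables()
WHERE table_name = 'bike_counts_2025';

-- Column-level comments.
SELECT column_name, comment
FROM duckdb_columns()
WHERE table_name = 'bike_counts_2025';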

Exhaustive Prompting vs. Iterative Conversation

When we think of "talking to our data," we often imagine asking simple questions and receiving direct, reliable answers. Our initial query, however, was relatively verbose, instructing the agent to identify columns, analyze relationships, find tipping points, and format the results in a specific way. This led us to explore whether a more minimal approach could achieve similar results.

We started a fresh session with the annotated database, provided the same baseline context as before, and reduced our query to a single sentence: "Look for relationships between daily bike traffic and weather conditions throughout the year 2025 and identify tipping points." This approach was largely exploratory and we did not expect particularly strong results. Surprisingly, the agent produced highly accurate insights that matched the quality of our verbose prompt.

Encouraged by this outcome, we experimented with a third approach. The MotherDuck documentation recommends breaking complex queries into smaller, conversational steps rather than asking everything at once. We tested this iterative method against both our original detailed query and the concise single-question prompt. While this recommendation has merit for general use, it did not significantly improve results in our case. The core correlations between weather conditions and cycling activity remained consistent across all three approaches, and the accuracy of identifying weather-cycling tipping points varied by only approximately 2% between annotated and non-annotated datasets. Please refer to our detailed summary of findings for more information.

Although the factual outcomes were largely stable, we observed notable differences in what the agent chose to emphasize. The iterative conversation approach led the agent to focus on rain duration as a key variable. The detailed comprehensive query produced more station-level breakdowns. The simplified prompt encouraged broader behavioral pattern analysis across all stations.

These variations are both intriguing and impressive, but they highlight an inherent challenge: without carefully structured prompting, the analyst has limited control over the agent's interpretive focus. In our case, the simplified query happened to align well with our analytical intent. However, repeating the same prompt under identical conditions may yield different emphases, some potentially more insightful, others less so. This non-determinism is characteristic of LLM-based systems and requires verification of results regardless of prompt complexity.

We evaluated all three prompting strategies on both annotated and non-annotated versions of the database. Annotations did not substantially increase accuracy. They expanded the agent's exploratory scope instead. With comments present, the agent incorporated variables such as rain duration, sunshine hours, and weekday–weekend distinctions into its analysis. Without annotations, these columns were often ignored, likely because their meaning was less transparent from column names alone.

Conclusion

We have seen that annotations and exact prompting are necessary to retrieve specific information or to create ad hoc analyses that ask for more than facts that are already aggregated and self-explanatory. Issues with these queries can be remediated, but catching them requires deeper inspection of both the results and the generated queries. This is where the natural language capabilities end, or at least demand very specific formulations about the data that would have to be acquired before prompting. Even when results were interpreted incorrectly or the generated SQL narrowed the scope of our analysis prematurely, we could leverage it to easily run and adapt our own queries and generate insights about our data.

While our testing yielded insightful results, any work with agents is of course a variable matter: the same setup can lead to different results. Nevertheless, MCP holds incredible general potential, which we will discuss in our next and final part, together with an outlook on its place in the context of data architecture and data technology stacks.
