Big Data – What to do with it? (Part 1 of 2)

21.7.2013 | 6 minutes reading time

Analyze it, of course!

Uh, right….. but how?

That’s what I’ll be going to talk about in this first part of a two-part blog series about Big Data analytics. So, if you are interested in getting some ideas and answers, please bear with me 🙂

Data’s the new currency

We all know the hype: how companies like Google or Facebook that offer quality services “free of charge” make billions of dollars in the process, by leveraging the data provided by their customers. This scared the established players and, in doing so, created a whole new breed of data applications around one problem: Big Data.

But, what exactly is Big Data?

Unfortunately, that’s one of those questions where you ask two people and get three different opinions. One definition that has come to be more widely accepted is the 3 V’s:

– High volume (amount of data),
– High velocity (speed of data in and out),
– High variety (range of data types, sources and structures).

As a rule of thumb: if it becomes a problem to get your data into your RDBMS of choice, then most likely you’ve encountered Big Data.

In my opinion, the first V (volume) is overrated (in germany, at least) – it may be called Big Data, but it’s the third V (variety) that truly distinguishes Big Data-storages from traditional RDBMS.

Now, the good things of Big Data and, more to the point, most systems that sprang into being are

– that you can store the data as it happens to come along, no matter its structure
– that the systems scale virtually without limit, and
– that you therefore have the possibility to discover new insights, trends and opportunities in your business to act upon.

Of course, at the end of the day, “act upon” means “earn money”.

Data is useless unless combined with other data to create information

That is a simple fact and that is also the point where it gets bad: not only do we face (possibly huge amounts of) semi-structured data, but querying and interconnecting Big Data is much harder when compared to traditional RDBMS supporting SQL. Most of the Big Data systems out there use their own mechanism and/or language to query the data (as of today – maybe in the future there will be something like SQL for Big Data or traditional RDBMS get up to speed). This means, you’ll need at least one expert with the particular system at hand to efficiently query the data.

Analyzing the data to extract the required information to decide on the right action is the real challenge

And this is where it gets ugly: as of today, there exists far from optimal support for data analysts to do their work and analyze the data stored in Big Data storages efficiently. Usually, it looks like this:

The data analyst and the expert work together with the former asking the questions and the latter doing his best to provide the answers by querying the system. This is simpler said than done, because depending on the complexity of the question, it may take quite a while before an answer can be provided with multiple map-reduce-jobs required in the process. And that’s not all: usually the data analyst needs to provide his findings in a way that can be understood by the business people (i.e. reports), which is a lot easier with the right tools.

As we all know, time is money and the time required to analyze Big Data and extract useful information out of it can become such a big factor that Big Data can overwhelm an organization.

I’m not saying that with traditional RDBMS you won’t face the same problems. I’m saying that with those systems

– data analysts can work mostly without the need for additional experts, because most of them know SQL and
– there exist a lot of tools that assist them in their work

making it a lot easier and therefore way faster for them to extract and provide useful information out of the data.

So, now what? Stick with traditional RDBMS whenever possible and only try Big Data if you really, really know what you are doing and getting yourself into?

Well… it’s always best if you know what you are doing, isn’t it?

Enter: JasperReports

Huh? Where did that come from?

First of all: the heading might also have been “Enter: Pentaho” or maybe even “Enter: BIRT”. JasperReports is not the only tool suite providing features for Big Data analysis — it’s the one that I know the best.

The point is: there really are tools out there that assist data analysts when working with Big Data, but that fact doesn’t seem to be widely known — even among people trying to sell Big Data.

For all of you who don’t know: JasperReports has been around for quite a while (since 2001) and is at its heart an open source java reporting library. Over the years, there has been created a whole ecosystem of tools around this core, so that today JasperReports offers a whole BI Suite.

Now, how does JasperReports aid data analysts in working with Big Data?

In many ways; some are part of the Community Edition (Open Source), others are only available in the professional editions and have to be paid for.

At the core is still the open source reporting library which has available connectors to many Big Data storages like MongoDB, Hadoop (Hive and HBase) or Google BigQuery. The main advantage is that it is much more likely to find someone being able to create a useful JasperReports-report than someone to extract useful information out of said storages. The second advantage is the open source, graphical report developing tool iReport that aids in creating the reports and provides additional features for Big Data like auto-discovery of available fields.

There is one caveat: you still need the knowledge of the query-functionality of the storage system at hand to create the connection to your data and to keep that connection effective (so, you probably should keep the expert). However, when that is done, the report designer (aka data analyst) can work with the data like he is used to. In this blog I provide a JasperReports tutorial connecting to a MongoDB-instance which demonstrates how that works.

However, it is still a lot of work to create the report (if you took a peek at the tutorial, you’d agree), which let’s one wonder, if there isn’t an easier option to work with the data.
Well, there is!

The JasperReports-ecosystem features a server component which, at the community edition level, functions as a central repository for report storage and provides a GUI and APIs for report execution and -scheduling. At the professional edition level, this component provides (among other things) an ad-hoc reporting GUI that allows data analysts to explore the data on-the-fly, literally “playing around” with it.

I find this feature to be extremely powerful, especially when considering the nature of Big Data: data analysts don’t have to know how the data in storage is structured up-front; they can explore the data as it is made available by the initial connection query (remember the caveat above!). In the second part of this blog I will explore this in more detail.

In closing, I would like to repeat, that in my opinion the discussion around Big Data focused too much on the “huge amount/high velocity of data”-part where it instead should focus more on the “high variety of data”-part. This part offers greater possibilities, in my opinion much greater than huge amount/velocity of data in itself, but also bears greater risks, because of the difficulty of extracting useful information.

Ok, on to part two .

Was this post helpful?

Blog author

Jan Malcomess

Do you still have questions? Just send me a message.

Ask Your Data(bricks) with Natural Language

The hottest topic in data and AI today is arguably talking to your own data. Writing SQL queries is far from intuitive when exploring data, so the ability to simply ask questions in natural language and receive AI-powered answers backed by your business...

Data
Big Data

16.4.2026 | 9 minutes reading time

Niklas Niggemann

MotherDuck Dives: From Natural Language to Live Dashboards

Dives are interactive visualizations created through natural language, built directly on top of data in MotherDuck. Users describe what they want to see, and an AI agent generates a persistent, interactive component that lives in their workspace alongside...

MotherDuck
Data
Data Science
Big Data

9.3.2026 | 8 minutes reading time

Niklas Niggemann

Ibis: Selecting the Right Execution Engine Without Rewriting Your Logic

In our previous benchmarks, DuckDB consistently outperformed Polars and Pandas on large analytical workloads, but performance comparisons miss a critical question: what happens when you need to move from local DuckDB development to a BigQuery production...

MotherDuck
Data
Big Data
Data Science

10.2.2026 | 6 minutes reading time

Niklas Niggemann

Your First Data Analysis with MotherDuck and DuckDB: From CSV to Insights...

In this post, we'll explore how MotherDuck, powered by DuckDB, revolutionizes the way you interact with your data, particularly when dealing with CSV files. You'll learn how to quickly parse and filter even large datasets directly from your local machine...

Data
Database
MotherDuck
Big Data

30.9.2025 | 8 minutes reading time

5 Reasons Why We’re Excited About MotherDuck Launch in AWS Frankfurt

5 Reasons We’re Excited About MotherDuck’s Launch in AWS Frankfurt For some time, a key challenge for European data teams has been balancing innovation with strict regulation. We’ve often seen powerful tools launch first in the US, while our need for...

Data
Big Data
Database
News
MotherDuck

24.9.2025 | 6 minutes reading time

Marcel Mikl

Exploring Dapr: A Deep Dive into Distributed Application Runtime

In a recent blog post, we introduced Dapr (Distributed Application Runtime) and highlighted its potential as a valuable tool for cloud-native applications, in combination with Aspire. This post dives deeper into the inner workings of Dapr, explaining...

Software development
Cloud native
Software architecture
Open Source

10.7.2024 | 10 minutes reading time

Manuel Zapf

Modern Microservices: Unleashing the Power of .NET Core, Aspire, and Dapr

I recall the days when writing a web application in C# with .NET meant deploying it on an IIS web server for accessibility. Today, this approach seems outdated, especially with the shift towards microservice-based architectures. Fortunately, Microsoft...

Software architecture
Open Source
Cloud
Microservices
Infrastructure as Code
.NET
Cloud native

27.6.2024 | 8 minutes reading time

Manuel Zapf

Becoming a Data-Driven Company with Applied Data Products

In recent years, the hype surrounding the value of data has grown continuously, and a multitude of concepts and methods have emerged on how companies can become 'data-driven'. From strategic top management to detail-oriented data analysts attempts are...

Agile
Big Data
Data
Product management
Digitalization
Data Science
Business Intelligence

18.5.2024 | 9 minutes reading time

Dr. Florian Rademacher

Demystifying the Kubernetes Gateway API: What the heck is it and why should...

When Gateway API debuted in October last year, this concluded a nearly four-year-long process that started in summer 2019. Gateway API is the successor of core Ingress definition, aiming towards various goals. This blog post will give a brief overview...

API
Open Source
Cloud
Networking
Kubernetes
Cloud native

15.3.2024 | 6 minutes reading time

Manuel Zapf

How to gain visibility as a software developer?

No matter if junior, medior or senior, introverted or extroverted: Every software developer can increase their visibility with different tools and should treat the topic as important. The only question is: how and with what effort? In this blog post,...

Training
Software development
Community
Open Source

21.2.2024 | 6 minutes reading time

Building desktop apps with web technologies

Building desktop apps with web technologies In this article I share insights into Electron and what to consider when shipping an desktop app with Electron. After that I introduce you to a new alternative called Tauri. It the end I provide an estimation...

Frontend
JavaScript
Node.js
Open Source
Webdevelopment

20.9.2023 | 13 minutes reading time

An introduction to federated learning in an industrial context: Advanced

In the Machine Learning space, it was long believed that sharing learnings or weights was safe in the sense that the input data couldn't be extracted. However, this belief has been challenged by researchers coming out over the years. Nowadays, numerous...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 9 minutes reading time

An introduction to federated learning in an industrial context: Fundamentals

With the help of data, companies are able to make more informed decisions, optimize their workflows and gain an edge in the competitive world of business using the power of Machine Learning (ML). However, handling data has become increasingly difficult...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 8 minutes reading time

Introduction to GitOps with ArgoCD

In this post you will learn what GitOps is about and see the steps to create a setup on your laptop to gain some experience with ArgoCD. Using an industry standard container orchestrator such as Kubernetes, this enables developers to continuously deploy...

CI/CD
Kubernetes
GitHub
Open Source
DevOps
Container
Infrastructure as Code
Infrastructure
Spring

31.10.2022 | 10 minutes reading time

The universal recommender in Action(ML)

IntroductionRecommender systems have become crucial for many different businesses. E-commerce uses recommenders to guide their customers in finding the right products and to assure they stay on the site. Newspapers or entertainment websites want to keep...

AI
NoSQL
Data
Machine Learning
Python

18.4.2021 | 11 minutes reading time

Francesca Diana

GitHub Actions CI pipeline: GitHub Packages, Codecov, release to Maven...

Stuck with TravisCI? Looking for a worthy alternative to GitLab CI? Here’s a guide on how to create a full CI pipeline publishing GitHub Packages, Codecov reports, releasing to Maven Central and GitHub, including dynamic commitlogs.GitHub Actions – blog...

Open Source
CI/CD
DevOps
GitHub

22.2.2021 | 22 minutes reading time

Creating integration flows with the Reedelk Data Integration Platform

The integration of data from systems of record or legacy systems is one of the elements of a software development project that does not start on a greenfield. In other words, it can help modernize software. Usually the question arises how to transfer...

Agile transformation
Container
Software architecture
Java
Microservices
Open Source
API

3.9.2020 | 8 minutes reading time

Daniel Kocot

Thinking AI means re-thinking data

While doing AI is sexy and cool, data infrastructure is typically not considered any of this. However, production-grade machine learning applications heavily rely on proper data infrastructure. Hence, in order to generate actual business value, solid...

AI
Big Data
Data
Machine Learning

27.5.2020 | 7 minutes reading time

Marcel Mikl

Kick-start your microservice project with JHipster

I recently looked for a solution on how to prototype a customer project in a short time and came across JHipster. The target architecture used Spring Boot in the backend and an Angular frontend. JHipster can scaffold this in its simplest variant as...

Node.js
Angular
Software development
Container
NoSQL
Cloud
JavaScript
Java
Keycloak
Kubernetes
Microservices
IT-Security
Open Source
React
Spring

12.5.2020 | 13 minutes reading time

Jörg Riegel

Golang, Gin & MongoDB – Building microservices easily

Golang, a.k.a. Go, has been around in the industry for quite some time now, but people are still reluctant to just go ahead and use it. To help you get started, follow me on this journey and create your first microservice using Golang, Gin and Docker...

Cloud
Container
Go
Microservices
NoSQL

21.4.2020 | 10 minutes reading time

Big Data – What to do with it? (Part 1 of 2)

Data’s the new currency

Data is useless unless combined with other data to create information

Analyzing the data to extract the required information to decide on the right action is the real challenge

Enter: JasperReports

Was this post helpful?

Blog author

More articles in this subject area

Ask Your Data(bricks) with Natural Language

MotherDuck Dives: From Natural Language to Live Dashboards

Ibis: Selecting the Right Execution Engine Without Rewriting Your Logic

Your First Data Analysis with MotherDuck and DuckDB: From CSV to Insights...

5 Reasons Why We’re Excited About MotherDuck Launch in AWS Frankfurt

Exploring Dapr: A Deep Dive into Distributed Application Runtime

Modern Microservices: Unleashing the Power of .NET Core, Aspire, and Dapr

Becoming a Data-Driven Company with Applied Data Products

Demystifying the Kubernetes Gateway API: What the heck is it and why should...

How to gain visibility as a software developer?

Building desktop apps with web technologies

An introduction to federated learning in an industrial context: Advanced

An introduction to federated learning in an industrial context: Fundamentals

Introduction to GitOps with ArgoCD

The universal recommender in Action(ML)

GitHub Actions CI pipeline: GitHub Packages, Codecov, release to Maven...

Creating integration flows with the Reedelk Data Integration Platform

Thinking AI means re-thinking data

Kick-start your microservice project with JHipster

Golang, Gin & MongoDB – Building microservices easily