Thoughts after completing the Coursera “Data Engineering, Big Data, and Machine Learning on GCP Specialization”

8.9.2019 | 6 minutes reading time

Having worked with Google Cloud Platform’s Big Data services for almost a year, I wanted to get a broader view of GCP’s capabilities. In this post, I will give you an overview of the services touched by the Coursera specialization. I was a GCP fan already, and now I am even more convinced.
In the hands-on course (version: September 2019), which is quite up to date regarding service maturity, but not regarding service names, I gained a deeper understanding of:

PubSub: Fully managed message queue for buffering messages
Dataflow: Serverless (i.e. autoscaling) Apache Beam service, can be used to process both batch and streaming data
Dataproc: Fully managed Hadoop cluster with Spark on YARN. Special focus was put on the separation of storage and compute, which also allows you to use cheap preemptible instances
BigQuery: Serverless data storage and analytics service, feels like an SQL database, with latency in seconds
Bigtable: A data sink for terabytes of data with millisecond latency, where you need to put some thought into schema design (especially row keys) and cluster configuration (it’s not serverless). I guess I will use it for IoT applications in the future!
ML Engine (new name: AI Platform): Serverless training of TensorFlow models with seamless deployment to an API
Vision API (and the other pre-trained APIs): Query pre-trained models for already solved tasks, for example image classification or sentiment analysis
AutoML: Bring your own data to pre-trained models in the cloud. Then trigger transfer learning and finetuning with automated hyperparameter optimization. The GUI (which comes with AutoML Vision, I don’t know about other domains like language) makes the model easily accessible to your customer’s end users.
Datalab: Their notebook service which is built on the Jupyter ecosystem, but integrates nicely with Google’s infrastructure (CPU, GPU, TPU)
Data Studio: External dashboarding service which can be connected to BigQuery very easily
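
To make the Bigtable remark about key design a bit more concrete: Bigtable keeps rows sorted lexicographically by row key, so the key layout decides which reads are cheap. The following is a minimal sketch in plain Python, assuming a hypothetical IoT table where the latest reading of a sensor is read most often. The function name, the key layout and the MAX_TS_MS constant are illustrative assumptions, not something taken from the course.

```python
# Hypothetical row key layout: <sensor_id>#<reversed timestamp>.
# With a reversed, zero-padded timestamp, the newest reading of a sensor
# sorts first within that sensor's key range.

MAX_TS_MS = 10**13 - 1  # assumed upper bound for millisecond timestamps (13 digits)


def make_row_key(sensor_id: str, ts_ms: int) -> str:
    """Build a row key that groups rows by sensor and orders newest first."""
    if not 0 <= ts_ms <= MAX_TS_MS:
        raise ValueError("timestamp out of supported range")
    reversed_ts = MAX_TS_MS - ts_ms
    # Zero-padding keeps lexicographic order identical to numeric order.
    return f"{sensor_id}#{reversed_ts:013d}"


readings = [
    ("sensor-a", 1_567_900_000_000),
    ("sensor-b", 1_567_900_000_500),
    ("sensor-a", 1_567_900_060_000),  # one minute after the first reading
]

# Sorting the keys mimics the order in which a scan would return the rows.
sorted_keys = sorted(make_row_key(sensor, ts) for sensor, ts in readings)
for key in sorted_keys:
    print(key)
```

Whether a reversed timestamp is a good idea depends on the read pattern. If you mostly scan time ranges in ascending order, a plain zero-padded timestamp would be the more natural choice.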

I especially enjoyed the hands-on labs where

we used AutoML to finetune a pretrained model to detect types of clouds in the sky (pun intended, I guess)
we set up a pipeline to monitor (simulated) traffic in San Diego to get the average lane speed and congestion information. We used PubSub to buffer messages, processed them with Dataflow (to get the average speed), wrote the results to BigQuery and built a dashboard with Data Studio, which got updated regularly.
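
The aggregation step of the traffic pipeline can be illustrated without any cloud service. The sketch below is plain Python, not the Beam code from the lab: it assigns each event to a fixed time window and averages the speed per lane. The field names (ts, lane, speed) and the 60-second window are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 60  # assumed window size, chosen for illustration


def average_speed_per_window(events, window_seconds=WINDOW_SECONDS):
    """Group events into fixed windows per lane and average the speed.

    Each event is a dict with the hypothetical fields ts (seconds),
    lane and speed. Returns {(window_start, lane): average_speed}.
    """
    buckets = defaultdict(list)
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        buckets[(window_start, event["lane"])].append(event["speed"])
    return {key: mean(speeds) for key, speeds in buckets.items()}


events = [
    {"ts": 0, "lane": "lane-1", "speed": 60.0},
    {"ts": 30, "lane": "lane-1", "speed": 50.0},
    {"ts": 45, "lane": "lane-2", "speed": 20.0},
    {"ts": 70, "lane": "lane-1", "speed": 40.0},
]

averages = average_speed_per_window(events)
for (window_start, lane), avg in sorted(averages.items()):
    print(window_start, lane, avg)
```

A real streaming job additionally has to deal with late and out-of-order events, which this sketch ignores because it sees all events at once.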

Some more words about …

BigQuery

I really enjoy using BigQuery, because

it’s serverless, you simply type SQL queries. I don’t care about node numbers and their uptime.
pricing is easy and transparent, it’s divided into storage and query costs (no artificial units involved, no dependency on cluster specs).
it handles nested data easily, and offers all the SQL functionalities you already know from your RDBMS.
you can create ML models with an SQL query. Of course, they are either very basic (linear and logistic regression) or a custom TensorFlow model, but I guess the service will evolve and it might be possible in the future to make inference in the SQL query.
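
To give an impression of what “an ML model with an SQL query” looks like, here is a small sketch that only assembles a BigQuery ML CREATE MODEL statement as a string. The dataset, table and column names are hypothetical, and actually running the statement (in the BigQuery console or through a client library) is left out on purpose.

```python
def build_create_model_sql(model: str, source_table: str, label_column: str) -> str:
    """Return a BigQuery ML statement that trains a linear regression model.

    The identifiers are inserted as-is, so this helper is only suitable
    for trusted, hard-coded names like the ones below.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'linear_reg', input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


sql = build_create_model_sql(
    model="my_dataset.speed_model",         # hypothetical model name
    source_table="my_dataset.lane_speeds",  # hypothetical training table
    label_column="speed",                   # hypothetical label column
)
print(sql)
```

The point is that the training data is defined by an ordinary SELECT, so feature selection and filtering happen in the same SQL you would write anyway.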

AutoML

In my opinion, Machine Learning (ML) is currently in the process of being commoditized. AutoML is an outcome of this process, and I also enjoy using it. You can call it “No-Ops” machine learning: you simply “recognize” the model parameters after hyperparameter tuning and assess the model quality.

If my next problem at a customer site has been solved already, like sentiment analysis or image classification, but needs a tweak towards the customer’s own data, then this is the way to go for me. Models also need to be retrained continuously. With the offered GUI, this can be done by (almost) non-technical personnel.
Besides that, AutoML models can also serve as a quick first step to assess whether the business problem is suited for ML at all. AutoML then delivers the benchmark that upcoming pipelines have to beat.

AutoML automates a lot of work for Data {Scientists, Engineers} (i.e. myself). IMHO, Data Scientists will still be in demand, though. But they will spend more time on generating insights, productionizing them, and implementing (semi-)automated decision-making systems. Solutions for common use cases will be implemented more quickly, which will also increase the footprint and visibility of ML in the company more rapidly. With the increased awareness of ML, new use cases will arrive in the pipeline, and possibly some of them cannot be solved with a template or service (yet), so that I can do hardcore modeling again.

A note on GCP AI services naming scheme

I personally consider Google to be the leader in non-robotic application of Machine Learning (Google AI, Colab with DeepMind, unlimited training data in Google Photos Storage, their reinforcement learning researchers are pleasantly anticipating the take-off of the Stadia gaming platform, I guess). But when browsing the website, I get really confused about the AI services they offer. Some snippets:

Cloud AI
ML Engine (which is now being renamed to “AI Platform”)
AI Hub
Vision AI
Vision API
AutoML Vision
…

It is a little hard for me to differentiate between

Cloud services
Google Cloud consulting services
APIs that I can query out of the box (and which problem exactly each of them solves)

Apache Beam (Dataflow)

They are advertising Apache Beam heavily, understandably, as it originated at Google.
I haven’t seen any other managed Beam service from other cloud providers. GCP’s Dataflow is a runner for Apache Beam.

It’s a unified framework for batch and stream processing with nice monitoring in Google Cloud Dataflow. The next time I need to send (streaming) data from A to B (for example, PubSub to BigQuery) and don’t need any JOIN or complex operations, I will definitely consider using it. Alternatives are, for example, Spark and Spark Streaming.
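
The “unified” idea can be shown in miniature with plain Python generators. This is explicitly not the Beam API, just a sketch of the principle that one element-wise transformation serves both a bounded input (batch) and an unbounded one (stream). The message format and field names are made up for the example.

```python
import itertools
import json


def to_row(message: str) -> dict:
    """Turn one JSON message into a flat row (hypothetical fields)."""
    payload = json.loads(message)
    return {"lane": payload["lane"], "speed": float(payload["speed"])}


def transform(messages):
    """Element-wise step that works on any iterable, bounded or not."""
    for message in messages:
        yield to_row(message)


# Bounded input ("batch"): a finite list of messages.
batch = [
    '{"lane": "lane-1", "speed": 60}',
    '{"lane": "lane-2", "speed": 30}',
]
batch_rows = list(transform(batch))


# Unbounded input ("stream"): a generator that never ends on its own.
def endless_messages():
    for i in itertools.count():
        yield json.dumps({"lane": f"lane-{i % 2 + 1}", "speed": 60})


# Only take the first three rows, otherwise this would run forever.
stream_rows = list(itertools.islice(transform(endless_messages()), 3))

print(batch_rows)
print(stream_rows)
```

What Beam and Dataflow add on top of this principle is everything that makes it production-ready: windowing, handling of late data, scaling across workers and the monitoring mentioned above.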

Datalab

Just a feeling, but I personally prefer to use “plain” Jupyter notebooks, as I am already used to them and know my shortcuts. Of course, I definitely want the ML machine out of the box, but I don’t see the need for a special Jupyter notebook look.

Summary

Most of our customers’ business models are not built around ML. But we use ML to add new features to their business model (recommendations), increase productivity (for example by doing anomaly detection with early warnings) or make their established processes more cost-efficient through automation.

Therefore, my main objective in my career at this moment is to productionize ML, i.e. to implement a production pipeline of insights that can be used for automated decision making. I am a fan of GCP: their products seem well thought out, and the high-quality Coursera specialization gave a nice overview after one year of working with GCP. I especially welcome GCP’s growing AutoML capabilities. Using AutoML lets me focus on model quality and business problem solving, ultimately enhancing my customers’ business models at a reasonable cost.

The next step on my path is now the Google Cloud Professional Data Engineer certification.

Blog author

Niklas Haas

Service Lead GenAI
