AWS SageMaker Machine Learning Data handling

17.1.2020 | 10 minutes reading time

Seven ways of handling image and machine learning data with AWS SageMaker and S3

If you start using AWS machine learning services, you will have to dive into data handling with AWS SageMaker and S3. We want to show you seven ways of handling image and machine learning data with AWS SageMaker and S3 in order to speed up your coding and make porting your code to AWS easier.

If you are working on computer vision and machine learning tasks, you are probably using common libraries such as OpenCV, matplotlib, and pandas. As soon as you work with or migrate to AWS SageMaker, you face the challenge of loading, reading, and writing files from the recommended AWS storage solution, the Simple Storage Service (S3). If you want to migrate existing code that was not written for SageMaker, you need to know some techniques to get the job done fast. This article gives a short overview of how to handle computer vision and machine learning data with SageMaker and of the options for porting notebooks that were not written for SageMaker.

If you are new around here, please take a look at our AI portfolio, YouTube channel and our Deep Learning Bootcamp.

Storage Architecture of SageMaker and S3

In order to get a better understanding of the setup, we will take a short look at the storage architecture of SageMaker.

AWS SageMaker storage architecture

With your local machine learning setup you are used to managing your data locally on your disk and your code probably in a Git repository on GitHub. For coding you probably use a Jupyter notebook, at least for experimenting. In this setup you are able to access your data directly from your code.

In contrast to that, machine learning in AWS relies on somewhat temporary SageMaker machine learning instances that can be started and stopped. As soon as you terminate (delete) your instance and load your notebook into a new instance, all the data on the instance is gone unless it was stored elsewhere. All data that should be permanent or needs to be shared between different instances, e.g. to be available to a training instance, should be held outside of the instance storage. The place to put the ML input data is the Amazon Simple Storage Service (S3). You’ll find additional hints on proper access rights and cost considerations at the very end of the article.

The data on your instance resides either on the instance's file system, on Elastic Block Store (EBS), or in memory. Additional sources of data and code could be public or private Git repositories, hosted either on GitHub or in AWS CodeCommit. The code can reside in your instance's storage or, as part of a training or inference job, in the AWS training and inference images that are held in the Elastic Container Registry.

During the creation of the SageMaker ML instance you define how much instance storage you want to assign. It needs to be enough to handle all the ML data that you want to work with. In our case we assign the standard 5 GB. At this step you should be aware that this instance is different from the training instance, which you will spawn from your notebook. The notebook instance might need much less storage and compute power than the training instance. Often it is enough to use a small instance for the data preparation steps and not an accelerated one.

Specifying the instance volume size
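The volume size can also be set from code instead of the console. The following is a minimal sketch, assuming the boto3 SageMaker client and its create_notebook_instance call. The helper names, the instance name, the role ARN and the instance type ml.t2.medium are placeholders chosen for illustration and are not taken from our setup.

```python
def notebook_instance_request(name, role_arn, volume_size_gb=5,
                              instance_type="ml.t2.medium"):
    """Build the keyword arguments for create_notebook_instance."""
    if volume_size_gb < 5:
        # 5 GB is the standard size mentioned above; we assume it is the minimum
        raise ValueError("volume_size_gb must be at least 5")
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        "VolumeSizeInGB": volume_size_gb,
    }


def create_notebook_instance(request):
    """Send the request to SageMaker (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the helper above works without AWS access
    return boto3.client("sagemaker").create_notebook_instance(**request)


request = notebook_instance_request(
    "people-counter-notebook",                     # hypothetical instance name
    "arn:aws:iam::123456789012:role/ExampleRole",  # hypothetical role ARN
    volume_size_gb=20,
)
print(request["VolumeSizeInGB"])
```

Building the request separately from sending it makes it easy to review the storage size before anything is created or billed.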

Seven ways to access your machine learning data and to reuse your existing code

Depending on what you want to do with your data and how often you need it during work, you have the following options:

Using a Code Repository for data delivery
Code based data replication
Copying data to the instance with the AWS client
Streaming data from S3 to the instance-memory
Using temporary files on the instance
Make use of S3 compatible framework method
Replace ML framework functions with AWS custom methods

1 – Using a code repository for data delivery

One way to bring the original code and small ML datasets onto the SageMaker instance is to use your Git repository. The repository is cloned into your SageMaker instance initially. All the data will be available in the root directory of your Jupyter notebook. This method is not necessarily recommended for all cases.

Adding a repository

The issue is that source control management systems such as Git do not cope well with bigger chunks of data. In particular, they try to generate diffs for files, which does not work well with large binary files. A good article about the pros and cons of holding training data in Git repositories can be found here. An alternative for the data is DVC, which stands for Data Version Control. We have already published a walkthrough of DVC and an article about DVC dependency management. The idea of DVC is that the information about the ML binary data is placed in small text files in your Git repository, while the actual binary data is managed with DVC. After checking out your code base version from Git you would use a command like ‘!dvc checkout’ to get the data from your binary storage, which could also be AWS S3. Shell commands are executed from your notebook by placing a ‘!’ in front of the command.

2 – Code based data replication

Another easy way to work with your existing scripts without too much modification is to make a full copy of your training data on the SageMaker instance. You can do it as part of your code or by using command line tools (see below). Basically, you traverse the whole ML data tree, create all the directories you need locally, and download all the files that are needed.

# Download all S3 data to your instance
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('sagemaker-cc-people-counter-trainingsset')
for my_bucket_object in bucket.objects.all():
    key = my_bucket_object.key
    print(key)
    if key.endswith('/'):
        # Skip "folder" placeholder objects; they carry no data
        continue
    # Recreate the directory structure of the key locally
    directory = os.path.dirname(key)
    if directory:
        os.makedirs(directory, exist_ok=True)
    try:
        bucket.download_file(key, key)
    except ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("No object with this key.")
        else:
            raise

copying the bucket to your instance
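If you only need part of the bucket, the same idea can be restricted to a key prefix and a separate target directory. The following is a minimal sketch under assumptions: the helper names local_target and download_prefix, the target directory data and the example keys are made up for illustration, while bucket.objects.filter(Prefix=...) and download_file are regular boto3 calls.

```python
import os


def local_target(key, root="data"):
    """Map an S3 key to a local file path below root.

    Returns None for "folder" placeholder keys that end with a slash.
    """
    if key.endswith("/"):
        return None
    return os.path.join(root, *key.split("/"))


def download_prefix(bucket_name, prefix, root="data"):
    """Copy all objects below prefix to the instance (needs boto3 and AWS access)."""
    import boto3  # imported here so local_target can be used without AWS access
    bucket = boto3.resource("s3").Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=prefix):
        target = local_target(obj.key, root)
        if target is None:
            continue
        os.makedirs(os.path.dirname(target), exist_ok=True)
        bucket.download_file(obj.key, target)


print(local_target("train/images/0001.jpg"))
```

A call such as download_prefix('sagemaker-cc-people-counter-trainingsset', 'train/images/') would then copy only the training images.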

3 – Copying data to the instance with the AWS client

A very simple way to copy data from your S3 bucket to your instance is to use the AWS command line tools. You can copy your data back and forth between s3:// and your instance storage, as well as between one s3:// bucket and another. It is important that you set your IAM policies correctly (see hints at the end of the article).

!aws s3 cp s3://$bucket/train/images train/images/ --recursive

The documentation can be found here .

4 – Streaming data from S3 to the SageMaker instance-memory

Streaming means reading the object directly into memory instead of writing it to a file. Also interesting, but not necessary for our current challenge, is lazy reading with S3 resources, i.e. reading only the part of the file that is actually needed. You can find a more detailed description here: https://alexwlchan.net/2017/09/lazy-reading-in-python/

import matplotlib.image as mpimage
…
image = mpimage.imread(img_fd)

original call

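import io
import boto3
import matplotlib.image as mpimage
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
# Wrap the downloaded bytes in a file-like object; nothing is written to disk
image = mpimage.imread(io.BytesIO(obj['Body'].read()), 'jp2')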

call using streaming data
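The lazy reading mentioned above can be approximated with ranged requests. The following is a minimal sketch, assuming that get_object is called with its Range parameter, which takes an HTTP byte range. The helper names byte_range and read_s3_range are made up for illustration, and the linked article describes a more complete file-like wrapper.

```python
def byte_range(start, length):
    """Format an HTTP Range value that covers length bytes beginning at start."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends
    return "bytes={}-{}".format(start, start + length - 1)


def read_s3_range(bucket, key, start, length):
    """Fetch only a slice of an S3 object into memory (needs boto3 and AWS access)."""
    import boto3  # imported here so byte_range can be used without AWS access
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key, Range=byte_range(start, length))
    return obj["Body"].read()


print(byte_range(0, 1024))
```

This is mainly useful for formats where a header or an index can be read without the rest of the file; for typical image files, reading the whole object is usually simpler.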

5 – Using temporary files on the SageMaker instance

Another way to keep using your usual methods is to create temporary files on your SageMaker instance and pass their file paths to the standard methods. The tempfile module provides automatic cleanup. For more information you can refer to the documentation.

from matplotlib import pyplot as plt
...
img = plt.imread(img_path)

original call

import boto3
import tempfile
from matplotlib import pyplot as plt
...
s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('sagemaker-cc-people-counter-trainingsset')
obj = bucket.Object(img_path)  # "obj" avoids shadowing the built-in name "object"
# The temporary file is deleted automatically when the with block ends
with tempfile.NamedTemporaryFile() as tmp:
    obj.download_fileobj(tmp)
    tmp.flush()  # make sure all bytes are on disk before reading by path
    img = plt.imread(tmp.name)
    print(img.shape)

new approach by using temporary files
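If many calls need this treatment, the pattern can be wrapped once. The following is a minimal sketch, assuming a Linux instance (where a named temporary file can be reopened by path while it is still open). The name temporary_copy is made up for illustration; it accepts any function that writes into a file object, such as the download_fileobj method of a boto3 S3 object.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_copy(download_fileobj, suffix=""):
    """Write data into a temporary file, yield its path, and delete it afterwards.

    download_fileobj is any callable that writes bytes into the file object
    it receives, e.g. bucket.Object(key).download_fileobj from boto3.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        download_fileobj(tmp)
        tmp.flush()  # make sure all bytes are on disk before the path is used
        yield tmp.name


def fake_download(fileobj):
    """Stand-in for an S3 download so the sketch runs without AWS access."""
    fileobj.write(b"example bytes")


with temporary_copy(fake_download, suffix=".jpg") as path:
    print(os.path.getsize(path))
```

In a notebook the body of the with block would contain the unchanged framework call, for example img = plt.imread(path). The suffix keeps the file extension for libraries that pick the file format from the path.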

6 – Make use of S3 compatible framework method

Some of the popular frameworks implement more options to access data than file path strings or file descriptors. As an example, the pandas library uses URI schemes to identify the method of accessing the data. While file:// looks on the local file system, s3:// accesses the data in S3 (pandas relies on the s3fs package for this, which builds on the AWS botocore library). You will find additional information here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. For pandas, any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.

import pandas as pd
data = pd.read_csv('file://oilprices_data.csv')

original call accessing local files

import pandas as pd
data = pd.read_csv('s3://bucket....csv')

new call with S3 URI
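When you mix this approach with boto3 calls, you often have to convert between an s3:// URI and the bucket name plus key that boto3 expects. The following is a minimal sketch using only the Python standard library. The helper names and the example bucket and key are made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into the tuple (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def join_s3_uri(bucket, key):
    """Build an S3 URI that can be passed to pandas, e.g. to pd.read_csv."""
    return "s3://{}/{}".format(bucket, key.lstrip("/"))


# Hypothetical bucket and key, for illustration only
uri = join_s3_uri("my-training-data", "train/labels.csv")
print(uri)
print(split_s3_uri(uri))
```

With such helpers a notebook can keep one list of URIs and use them both for pandas and for the boto3 based approaches above.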

7 – Replace ML framework functions with AWS custom methods

Here are some further examples of using S3-aware methods, in this case the s3fs library, instead of handing local paths to machine learning library calls.

plt.imshow(Image.open(img_paths[0]))

original call

can be replaced by

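from PIL import Image
from s3fs.core import S3FileSystem
s3fs = S3FileSystem()
# Open the S3 object as a file-like object and hand it to PIL
with s3fs.open('{}/{}'.format('sagemaker-cc-people-counter-trainingsset', img_paths[0])) as f:
    display(Image.open(f))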

call using s3fs

Another example with scipy

import scipy.io as io
mat = io.loadmat(img_path.replace('.jpg','.mat'))

original call

can be replaced by

import scipy.io as io
from s3fs.core import S3FileSystem
s3fs = S3FileSystem()
mat = io.loadmat(s3fs.open('{}/{}'.format('sagemaker-cc-people-counter-trainingsset', img_path.replace('.jpg','.mat'))))

call using s3fs
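If the same notebook should keep running on your local machine as well, the two cases can be hidden behind one function. The following is a minimal sketch, assuming that s3fs is installed on the instance. The name open_data is made up for illustration: it returns an s3fs file object for s3:// paths and a regular file object otherwise.

```python
def open_data(path, mode="rb"):
    """Open a local path or an s3:// path and return a file-like object."""
    if path.startswith("s3://"):
        # Imported only when needed, so local runs do not require s3fs
        from s3fs.core import S3FileSystem
        return S3FileSystem().open(path, mode)
    return open(path, mode)
```

A call such as io.loadmat(open_data(mat_path)) then works for both kinds of paths, as long as the framework function accepts file-like objects.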

Conclusion

The task of porting Jupyter notebooks to AWS SageMaker can be a little tedious at first, but it gets a lot easier once you know which tricks to use. A key part of porting the notebooks is to get the data handling right and to decide which approach you want to take in order to enable or replace your usual ML framework calls. We have shown some options for approaching this task. If you have some additional tricks or hints, please let me know @kherings. I recommend that you have a look at our AI portfolio, YouTube channel and our Deep Learning Bootcamp.

Additional Hints

S3 access rights

In order to access your data from your SageMaker instance you need proper access rights. More precisely, the running SageMaker instance needs the rights to use the S3 service and to access the bucket where the data is held. For that, your SageMaker instance needs an AWS service role that contains an IAM policy with the rights to access the S3 bucket. There are two options: either let SageMaker generate an execution role with the AmazonSageMakerFullAccess policy for you, or create a custom one.

The generated AmazonSageMaker-ExecutionRole lets the notebook access all S3 buckets that contain the string ‘sagemaker’ in their name. The other quick option is to attach an S3 full access policy to your custom role (not recommended).
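A narrower alternative to full S3 access is a policy that is limited to the bucket holding your data. The following is a minimal sketch of such a policy document, built as a Python dictionary. The function name and the bucket name are made up for illustration, and the chosen actions are an assumption about what a typical notebook needs, so review them against your own requirements.

```python
import json


def bucket_access_policy(bucket_name, writable=False):
    """Build an IAM policy document that is limited to one S3 bucket."""
    object_actions = ["s3:GetObject"]
    if writable:
        object_actions.append("s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": "arn:aws:s3:::{}".format(bucket_name),
            },
            {
                # Reading (and optionally writing) is granted on the objects
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": "arn:aws:s3:::{}/*".format(bucket_name),
            },
        ],
    }


# Hypothetical bucket name, for illustration only
policy = bucket_access_policy("my-training-data", writable=True)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy of your custom role.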

Considering S3 storage cost for your image data

If you are working with data on the AWS cloud, you should keep an eye on the cost of your actions in order to avoid an unpleasant surprise. Typically you save money with AWS in comparison to a local setup, but you should be aware of the cost drivers and use the AWS Cost Explorer and the AWS SageMaker pricing tables. Depending on the size of your ML data and how frequently it is accessed, you might want to change the storage class of the objects in your S3 bucket or activate S3 Intelligent-Tiering. Price comparisons can be found here.
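The storage class can be chosen per object at upload time. The following is a minimal sketch, assuming the boto3 upload_file call with its ExtraArgs parameter. The helper names are made up for illustration, and the list of storage classes is deliberately incomplete.

```python
def storage_args(storage_class="INTELLIGENT_TIERING"):
    """Build the ExtraArgs dictionary that selects the storage class of an upload."""
    allowed = {"STANDARD", "STANDARD_IA", "ONEZONE_IA", "INTELLIGENT_TIERING"}
    if storage_class not in allowed:
        raise ValueError("unsupported storage class: {}".format(storage_class))
    return {"StorageClass": storage_class}


def upload_with_storage_class(path, bucket, key,
                              storage_class="INTELLIGENT_TIERING"):
    """Upload a local file with the given storage class (needs boto3 and AWS access)."""
    import boto3  # imported here so storage_args can be used without AWS access
    boto3.client("s3").upload_file(
        path, bucket, key, ExtraArgs=storage_args(storage_class)
    )


print(storage_args())
```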

Blog author

Kai Herings
