Going serverless: How to move files from on-prem SFTP to AWS S3

26.2.2019 | 7 minutes reading time

Motivation

As developers, we often land in projects where the customer uses SFTP (SSH File Transfer Protocol) to exchange data with partners. Actually, I can hardly remember a project where SFTP wasn’t in the picture. In my last project, for example, the customer was using field data loggers that logged measured values and transferred them to an on-premises SFTP server. What was different this time was that we were building a serverless solution in AWS. As you probably know, when an application operates in the AWS serverless world, it is essential to have your data in S3 so that it can easily be used with other AWS services for all kinds of purposes: processing, archiving, analytics, and so on.

What we needed was a mechanism to poll the SFTP server for new files and move them into an S3 bucket. So we built a custom serverless solution from a combination of AWS managed services. It is reasonable to ask why we didn’t use AWS Transfer for SFTP. The answer is simple (it didn’t exist at that time), but I think a custom solution still has value for small businesses where traffic is not heavy and the SFTP server is already part of the existing platform. If this sounds interesting, keep on reading to find out more.

From SFTP to AWS S3: What you will read about in this post

Custom solution for moving files from SFTP to S3
In-depth description of the architecture
Solution constraints and limitations
Full source code
Infrastructure as Code
Detailed guide on how to run it in AWS
Video instructions

The architecture

Let’s start by briefly explaining what our solution does. It scans an SFTP folder and moves (meaning both copies and deletes) all files from it into an S3 bucket. It doesn’t have to be only one folder/bucket pair; you can configure as many source and destination pairs as you want. Another important question is when it gets executed. It runs on a schedule: you use a Cron expression to schedule the execution, so it is pretty flexible.

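The solution uses the following AWS services and technologies: AWS Lambda with Node.js, CloudWatch Events, S3, Parameter Store, and Terraform for provisioning the infrastructure.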

How it works

CloudWatch Event is scheduled to trigger Lambda, and Lambda is responsible for connecting to SFTP and moving files to their S3 destination.
This approach requires only one Lambda to be deployed, because it is source- (SFTP folder) and destination- (S3 bucket) agnostic. When CloudWatch Event triggers Lambda, it passes the source and destination as parameters. You can deploy a single Lambda, and many CloudWatch Events that will all trigger the same Lambda, but with different source/destination parameters.
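
To make the "one Lambda, many rules" idea more concrete, here is a minimal sketch of how the schedule/payload pairs could be described as data. The ScheduledTransfer type, the rule names and the schedule values are assumptions made for illustration and are not taken from the project; only the ftp_path and s3_bucket fields match the event the Lambda receives.

```typescript
// Payload the Lambda receives; the field names match the event shown in this post.
interface ImportFilesEvent {
    ftp_path: string;
    s3_bucket: string;
}

// Hypothetical description of one CloudWatch Event Rule: when to fire and what to pass.
interface ScheduledTransfer {
    ruleName: string;
    scheduleExpression: string; // CloudWatch schedule syntax: cron(...) or rate(...)
    input: ImportFilesEvent;    // becomes the rule's Input Constant
}

// Two rules, one Lambda: only the schedule and the payload differ.
const transfers: ScheduledTransfer[] = [
    {
        ruleName: "move-source-one",
        scheduleExpression: "cron(0/15 * * * ? *)", // every 15 minutes
        input: { ftp_path: "source-one", s3_bucket: "destination-one" },
    },
    {
        ruleName: "move-source-two",
        scheduleExpression: "rate(1 hour)",
        input: { ftp_path: "source-two", s3_bucket: "destination-two" },
    },
];

// The Input Constant is configured on the rule target as JSON text.
function toInputConstant(transfer: ScheduledTransfer): string {
    return JSON.stringify(transfer.input);
}
```

Adding a third source/destination pair then means adding one more entry, with no change to the Lambda code.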

Node.js and Lambda: Connect to FTP and download files to AWS S3

The centerpiece is a Node.js Lambda function. It uses the ftp client module to communicate with the FTP server. Every time a CloudWatch Event triggers the Lambda, it executes this method:

async execute(event: ImportFilesEvent): Promise<void> {
    // Load host, port and credentials from Parameter Store.
    const ftpConfig = await this.readFtpConfiguration();
    this.ftp.configure(ftpConfig);
    await this.ftp.connect();
    // List the source folder passed in by the CloudWatch Event Rule.
    const files = await this.ftp.list(event.ftp_path);
    for (const ftpFile of files) {
        const fileStream = await this.ftp.get(`${event.ftp_path}/${ftpFile.name}`);
        // Upload first; the file is deleted from the server only after the upload has succeeded.
        await this.s3.put(fileStream, ftpFile.name, event.s3_bucket);
        await this.ftp.delete(`${event.ftp_path}/${ftpFile.name}`);
    }
    this.ftp.disconnect();
}

It iterates through the content of the given folder and moves each file to the S3 bucket. As soon as the file is successfully moved, it removes the file from its original location.
Notice event.ftp_path and event.s3_bucket in the code above. They are coming from the CloudWatch Event Rule definition, which will be described in a following section.
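
The copy-then-delete ordering matters: a file should only be removed from the server after its upload has succeeded. The following self-contained sketch isolates that logic behind two small interfaces so it can be tried without AWS or an FTP server. FileSource, FileSink, moveAll and the in-memory classes are hypothetical names introduced for illustration; they stand in for the FTP and S3 wrappers used by the Lambda and are not the project's actual classes.

```typescript
// Hypothetical abstraction over the FTP side.
interface FileSource {
    list(path: string): Promise<string[]>;
    read(path: string): Promise<string>;
    remove(path: string): Promise<void>;
}

// Hypothetical abstraction over the S3 side.
interface FileSink {
    put(content: string, key: string, bucket: string): Promise<void>;
}

// Moves every file found under `path` into `bucket` and returns the moved names.
async function moveAll(source: FileSource, sink: FileSink, path: string, bucket: string): Promise<string[]> {
    const moved: string[] = [];
    for (const name of await source.list(path)) {
        const fullPath = `${path}/${name}`;
        const content = await source.read(fullPath);
        await sink.put(content, name, bucket); // throws on failure, so the delete below is skipped
        await source.remove(fullPath);         // only reached after a successful upload
        moved.push(name);
    }
    return moved;
}

// In-memory stand-in for the FTP server, keyed by full path.
class InMemorySource implements FileSource {
    private files: Record<string, string>;
    constructor(files: Record<string, string>) {
        this.files = { ...files };
    }
    async list(path: string): Promise<string[]> {
        const prefix = `${path}/`;
        return Object.keys(this.files)
            .filter(key => key.startsWith(prefix))
            .map(key => key.slice(prefix.length));
    }
    async read(path: string): Promise<string> {
        const content = this.files[path];
        if (content === undefined) {
            throw new Error(`no such file: ${path}`);
        }
        return content;
    }
    async remove(path: string): Promise<void> {
        delete this.files[path];
    }
    has(path: string): boolean {
        return path in this.files;
    }
}

// In-memory stand-in for S3, keyed by "bucket/key".
class InMemorySink implements FileSink {
    readonly objects: Record<string, string> = {};
    async put(content: string, key: string, bucket: string): Promise<void> {
        this.objects[`${bucket}/${key}`] = content;
    }
}
```

If put rejects for one file, moveAll stops and that file stays on the server, so the next scheduled run can pick it up again.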

CloudWatch Event Rule

CloudWatch Events is scheduled to trigger Lambda by creating CloudWatch Event Rules. Every Rule consists of a Cron expression and an Input Constant. The Input Constant is exactly the mechanism we can use to pass the source and destination.

Now, when you take a look at the signature of the handler, you’ll see an ImportFilesEvent:

const handler: Handler<ImportFilesEvent, void> = async (event: ImportFilesEvent) => {
    console.log(`start execution for event ${JSON.stringify(event)}`);
    ...
};

This is exactly the value of the Input Constant and it is shown in the logged output as:

2019-01-16T07:30:11.430Z ... start execution for event
{
    "ftp_path": "source-one",
    "s3_bucket": "destination-one"
}
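
Because the Input Constant is free-form JSON entered on the rule, a typo there only surfaces at run time. A small guard at the top of the handler could fail fast with a clear message. This is a sketch under assumptions: parseImportFilesEvent is a hypothetical helper and is not part of the repository.

```typescript
interface ImportFilesEvent {
    ftp_path: string;
    s3_bucket: string;
}

// Narrows an unknown payload to ImportFilesEvent or throws a descriptive error.
function parseImportFilesEvent(payload: unknown): ImportFilesEvent {
    if (typeof payload !== "object" || payload === null) {
        throw new Error("event must be a JSON object");
    }
    const candidate = payload as Record<string, unknown>;
    for (const field of ["ftp_path", "s3_bucket"]) {
        const value = candidate[field];
        if (typeof value !== "string" || value.length === 0) {
            throw new Error(`event field "${field}" must be a non-empty string`);
        }
    }
    // Both fields were verified to be non-empty strings in the loop above.
    return {
        ftp_path: candidate["ftp_path"] as string,
        s3_bucket: candidate["s3_bucket"] as string,
    };
}
```

A rule whose Input Constant misses s3_bucket would then fail with a readable error in the logs instead of an obscure failure during upload.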

FTP connection parameters

FTP connection parameters are stored in another AWS service called Parameter Store. Parameter Store is a convenient way to store configuration and secret data. The value is stored as JSON:


{
  "host": "18x.xxx.xxx.xx",
  "port": 21,
  "user": "*********",
  "password": "********"
}

When Lambda executes readFtpConfiguration(), it reads the FTP Configuration from Parameter Store.
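
How the stored JSON string might be turned into a typed configuration object can be sketched as follows. The FtpConfig shape mirrors the JSON above; parseFtpConfig is a hypothetical helper, and fetching the raw value from Parameter Store is deliberately left out, so this is not the repository's readFtpConfiguration() implementation.

```typescript
// Shape of the JSON value stored in Parameter Store (see above).
interface FtpConfig {
    host: string;
    port: number;
    user: string;
    password: string;
}

// Parses and validates the raw parameter value; throws if a field is missing or malformed.
function parseFtpConfig(raw: string): FtpConfig {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) {
        throw new Error("FTP configuration must be a JSON object");
    }
    const fields = parsed as Record<string, unknown>;
    const host = fields["host"];
    const port = fields["port"];
    const user = fields["user"];
    const password = fields["password"];
    if (typeof host !== "string" || typeof user !== "string" || typeof password !== "string") {
        throw new Error("host, user and password must be strings");
    }
    if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
        throw new Error("port must be an integer between 1 and 65535");
    }
    return { host, port, user, password };
}
```

Since the value contains a password, storing it as a SecureString parameter is worth considering.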

Limitations and constraints

Be aware that this is not a solution for synchronizing SFTP and S3, nor does it work in real time. Don’t expect a file to appear in S3 as soon as it is uploaded to SFTP. The transfer runs on a schedule.
Another thing is how much data it can handle. AWS Lambda has its limitations. Since this solution scans an entire folder and transfers all files from it, Lambda can hit one of its limits if there are too many files or the files are very large. It worked well for us with about a dozen files, each never larger than a few KB.
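
One way to stay inside Lambda's execution time limit is to check the remaining time before starting each file and to leave the rest for the next scheduled run. The Lambda context object exposes getRemainingTimeInMillis() for this purpose; everything else below (processWithinBudget, BatchResult, the safety margin value) is an assumption made for illustration and is not part of the original solution.

```typescript
// The one method of the Lambda context object that this sketch relies on.
interface TimeBudget {
    getRemainingTimeInMillis(): number;
}

interface BatchResult {
    processed: string[]; // files moved in this invocation
    deferred: string[];  // files left on the server for the next run
}

// Moves files one by one and stops early when too little execution time is left.
async function processWithinBudget(
    files: string[],
    budget: TimeBudget,
    moveFile: (name: string) => Promise<void>,
    safetyMarginMs: number = 10000,
): Promise<BatchResult> {
    const processed: string[] = [];
    for (let i = 0; i < files.length; i++) {
        if (budget.getRemainingTimeInMillis() < safetyMarginMs) {
            // Not enough time for another file: stop cleanly instead of timing out mid-transfer.
            return { processed, deferred: files.slice(i) };
        }
        await moveFile(files[i]);
        processed.push(files[i]);
    }
    return { processed, deferred: [] };
}
```

Deferred files simply stay in the source folder, so the next scheduled invocation lists and moves them.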

If there are network issues during the transfer, the Lambda invocation will fail. Since Amazon CloudWatch Events invokes Lambda functions asynchronously, Lambda will retry the execution automatically. Still, I encourage you to explore its limits on your own, and let me know in the comments section if you see a way to make it more resilient to failures.
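
As one possible starting point for more resilience, a single flaky operation (a download or an upload, for example) could be retried a few times with exponential backoff before the whole invocation is allowed to fail. This is only a sketch: withRetry and its parameters are hypothetical and not part of the original solution.

```typescript
// Retries an async operation up to `attempts` times, doubling the delay after each failure.
// `sleep` is injectable so the helper can be exercised without real waiting.
async function withRetry<T>(
    operation: () => Promise<T>,
    attempts: number,
    baseDelayMs: number,
    sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<T> {
    if (attempts < 1) {
        throw new RangeError("attempts must be at least 1");
    }
    let lastError: unknown;
    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            if (attempt < attempts) {
                // Delays grow as baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...
                await sleep(baseDelayMs * Math.pow(2, attempt - 1));
            }
        }
    }
    // Every attempt failed: surface the last error so the invocation still fails visibly.
    throw lastError;
}
```

Retries inside the function add to the execution time, so the number of attempts and the delays need to fit within the Lambda timeout.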

Tests are missing. Testing Lambda is another big topic and I wanted to focus on the architecture instead. However, you can refer to another blogpost to find out more about this topic.

Run the code with Terraform

To use Lambda and other AWS services, you need an AWS account. If you don’t have an account, see Create and Activate an AWS Account.
You will also need to install Terraform and Node.js.

When everything is set up, run git clone to get a copy of the repository, where the full source code is shared.

$ git clone git@gitlab.codecentric.de:milica.zivkov/ftp-to-s3-transfer.git

You will run this code in a second, but first you need to make two changes. First, open provision/credentials/ftp-configuration.json and enter real SFTP connection parameters. Yes, this means you will need an SFTP server, too. The code will try to download from folders named source-one and source-two, so make sure they exist on the server.
Second, open provision/variables.tf and change the value of the default attribute. AWS requires S3 bucket names to be globally unique, and this parameter is used to achieve that uniqueness.

Next, build the Node.js Lambda package. This produces the Lambda-Deployment.zip file required by Terraform.

$ cd move-ftp-files-to-s3
$ npm run build:for:deployment
$ cd dist
$ zip -r Lambda-Deployment.zip . ../node_modules/

When Lambda-Deployment.zip is ready, start creating the infrastructure.

$ cd ../../provision
$ terraform init
$ terraform apply

If you prefer video instructions, have a look here:

Now, you should see the success message "Apply complete! Resources: 17 added, 0 changed, 0 destroyed." At this point, all AWS resources should be created, and you can check them out by logging in to the AWS Console. Navigate to the CloudWatch Event Rule section and look at the schedule to find out when Lambda will be triggered. In the end, you should see files moved from

1. source-one FTP folder –> destination-one-id S3 bucket and
2. source-two FTP folder –> destination-two-id S3 bucket

Summary: Going serverless by moving files from SFTP to AWS S3

This was a presentation of a simple, lightweight solution for moving files from more traditional services to the serverless world. It has its limitations for larger-scale data, but it has proven stable for smaller businesses. I hope it helps you or serves as an idea when you encounter a similar task. Thank you for reading.

Blog author

Milica Živkov
