Serverless application for scraping and filtering

26.4.2018 | 6 minutes reading time

In this article, I'm going to describe an application I wrote for scraping and filtering advertisements from a few different websites. The application uses the Serverless Framework, runs on AWS, and is written in Python.

Background story

I am looking for a flat to buy. Many adverts recur across different ad websites, and the same flats are re-advertised repeatedly so that they always appear on the first page of results. I therefore thought it would be interesting to write an application which scrapes the advertisements from different sites and compares them. The main goal is to produce a browsable list of unique advertisements.

For the past couple of months, I have been working with serverless technologies, so I came up with the idea of implementing this as a serverless application, documenting it, and sharing it in a blog post as an introduction to the serverless world with AWS and the Serverless Framework.

The application is simple enough to understand, yet it is not a typical hello world application.

You can find the project on my GitLab account (see Links).

Concept of serverless

In a nutshell, serverless means that you do not have to think about servers. You just write the code which executes the business logic, and the provider takes care of the rest (spinning up a container, initializing the execution environment, executing the code, scaling, etc.).

This enables fast project setup and efficient development.

Application architecture

The application is designed to be hosted entirely on AWS. Lambdas implement the business logic (fetching, processing and filtering the advertisements) and serve the processed data through an HTTP endpoint using API Gateway. DynamoDB is used to store the data. A public S3 bucket is used to host the frontend.

The architecture of the application is in the picture below.

The architecture consists of the following functions:

scraper – scrapes the advertisements from three different advertising sites. It extracts the data from the ads, formats it and puts it into the ScrapedAdverts DynamoDB table. This is a scheduled lambda which is executed 3 times per day.
aggregator – reads the data from the ScrapedAdverts table and processes it. It checks whether a given advertisement already exists in the FilteredAdverts table by performing a similarity check. If the advertisement exists, it is updated. If it does not exist, it is inserted. This lambda is also scheduled and runs several times per day. Each run processes only a chunk of the data from ScrapedAdverts (the amount of data which is returned in one scan by DynamoDB).
adverts_controller – acts as the request handler for the API Gateway. It is mapped to the GET /adverts/get?page= call.
db_cleaner – is executed once per day and cleans the ScrapedAdverts table. It deletes the entries which are older than 15 days.
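The similarity check performed by the aggregator is not spelled out in this article. As a rough illustration only, such a check could be sketched with the standard library's difflib. The rule used here (equal price and area plus a fuzzy title match), the threshold of 0.8 and the function names are assumptions made for this example, not the project's actual algorithm:

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case the text and collapse whitespace so formatting differences do not matter."""
    return ' '.join(text.lower().split())


def is_similar(advert_a, advert_b, threshold=0.8):
    """Assumed rule: same price and area, and titles that are nearly identical."""
    if advert_a['price'] != advert_b['price'] or advert_a['area'] != advert_b['area']:
        return False
    ratio = SequenceMatcher(
        None, normalize(advert_a['title']), normalize(advert_b['title'])
    ).ratio()
    return ratio >= threshold


def find_match(advert, filtered_adverts):
    """Return the first already-filtered advert that looks like a duplicate, or None."""
    return next((other for other in filtered_adverts if is_similar(advert, other)), None)
```

With a helper like find_match, the aggregator would update the matching entry when one is found and insert a new entry otherwise.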

The frontend is a static website (HTML and JS), which fetches the data from the GET /adverts/get?page= endpoint and visualizes it. The next screenshot shows the frontend.

The frontend is hosted in a public S3 bucket and is publicly reachable. The data it receives from adverts_controller is a list of scraped and filtered adverts in JSON format:

{
   "items":[
      {
         "metadata":[
            "Key1=Value1",
            "Key2=Value2"
         ],
         "location":[
            "Location name 1",
            "Location name 2"
         ],
         "area":55,
         "processed":true,
         "timestamp":1524463412,
         "images":[
            "https://url.to/image.jpg"
         ],
         "text":"Longer description of the property",
         "link":"https://link.to/propery",
         "advertiser":{
            "name":"Advertiser name",
            "phones":[
               "066 1234567",
               "021 1234567"
            ]
         },
         "price":66000,
         "title_hash":"11344e17595d494506e87fa61925018b34443016",
         "title":"Title of advert",
         "similar_adverts":[
            {
               "link":"http://link.to/similar-advert/1",
               "title":"Similar advert 1"
            }
         ]
      }
   ],
   "page":0,
   "number_of_pages":5,
   "count":124,
   "page_count":25
}
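The paging fields in the response above can be derived from the full list of adverts. The following is a minimal sketch, assuming that page is zero-based and that page_count is the page size; the function name paginate is hypothetical and not taken from the project:

```python
import math


def paginate(adverts, page, page_size=25):
    """Build the paged response envelope for a zero-based page number."""
    count = len(adverts)
    # Round up so that a partially filled last page still counts as a page
    number_of_pages = math.ceil(count / page_size)
    start = page * page_size
    return {
        'items': adverts[start:start + page_size],
        'page': page,
        'number_of_pages': number_of_pages,
        'count': count,
        'page_count': page_size,
    }
```

Under these assumptions, 124 adverts with a page size of 25 give 5 pages, which is consistent with the sample response.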

Project structure

The project is structured in a way that every function is in its own folder, and has its own dependencies (requirements.txt).
The exceptions are the following directories:

test – contains the unit tests.
utils – contains common helper functions. The content of this directory is included in every packaged function.
client/dist – contains the frontend code (HTML and JS).

In the root of the project is the file serverless.yml, which is the main entry point. This file configures the serverless framework for this application.

The project has the following structure:

The serverless.yml

The main entry point of the project is the serverless.yml file. This file tells the Serverless Framework what to deploy and how.

It consists of the following parts:

provider: configures the cloud provider, which is AWS in our case. It defines the runtime, region and other common values which are applied to every function.
package: configures the way of packing the functions.
functions: defines the lambda functions. Each function has its own configuration block, in which the global configuration values can be overridden. handler specifies the method which is called when the function is invoked, and events defines what invokes the function.
resources: defines the resources which should be created when the application is deployed. The resources part must use CloudFormation syntax.
plugins: defines the plugins which are used by the serverless framework.
custom: defines the custom variables set by the user and the configuration values for the plugins.

You can find the project's serverless.yml in the project repository.

Below is an example which defines a function, configures an HTTP event for the given function and creates a DynamoDB table:

service: my-sls-service

provider:
  name: aws
  runtime: python3.6
  # Permissions granted to every function of this service
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "dynamoDB:*"
      Resource: "*"

package:
  # Package every function separately
  individually: true

functions:
  my_controller:
    handler: lambda_handler.handle
    # Directory of the function (option of serverless-python-requirements)
    module: my_controller
    environment:
      USERS_TABLE: Users
    events:
      - http:
          path: users/get/all
          method: get

resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: Users
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S
        KeySchema:
          - AttributeName: email
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 5
          WriteCapacityUnits: 5

plugins:
  - serverless-python-requirements

The lambda handler

On AWS Lambda, when using Python, the function which handles the invocation should have the following signature:

def handler(event, context):
    return

event holds the data which is passed to the function. For example, if the handler handles HTTP events, the request body, the query parameters, the path parameters, etc. are passed in the event object.
context is injected by the AWS Lambda runtime, and it can be used to gather information about the runtime and to interact with it. You can find more about it in the AWS Lambda documentation.
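As an illustration of reading data from the event, the sketch below extracts the page query parameter of a GET /adverts/get?page= request. It assumes the API Gateway Lambda proxy event format, in which queryStringParameters is either a dictionary of strings or None; the helper name parse_page and the fallback to page 0 are choices made for this example:

```python
def parse_page(event):
    """Return the requested zero-based page, falling back to 0 on missing or invalid input."""
    # queryStringParameters is None when the request has no query string
    params = event.get('queryStringParameters') or {}
    try:
        page = int(params.get('page', 0))
    except (TypeError, ValueError):
        return 0
    # Negative page numbers are treated as the first page
    return max(page, 0)
```

Keeping the parsing in a small pure function like this also makes it easy to unit test without deploying anything.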

The Boto3 library is the de facto standard in Python for interacting with AWS services. It is available in the AWS Lambda Python runtime.

A handler for the function defined in the serverless.yml example above could look like this:

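import json
from os import environ as env

import boto3


def handle(event, context):
    # The table name comes from the USERS_TABLE variable set in serverless.yml
    users_table = boto3.resource('dynamodb').Table(env['USERS_TABLE'])
    items = users_table.scan()['Items']
    # The default (Lambda proxy) http integration expects a statusCode and a string body
    return {'statusCode': 200, 'body': json.dumps(items, default=str)}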
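One caveat about the scan call: a single DynamoDB scan returns at most 1 MB of data, and the response contains a LastEvaluatedKey when more items remain. A minimal sketch of reading the whole table by following that key is shown below. The helper name scan_all is hypothetical; table is expected to be a Boto3 DynamoDB Table resource or any object with a compatible scan method:

```python
def scan_all(table):
    """Collect the items of every scan page by following LastEvaluatedKey."""
    items = []
    kwargs = {}
    while True:
        response = table.scan(**kwargs)
        items.extend(response['Items'])
        last_key = response.get('LastEvaluatedKey')
        if last_key is None:
            # No key in the response means the last page has been reached
            return items
        # Continue the scan right after the last item of the previous page
        kwargs['ExclusiveStartKey'] = last_key
```

For large tables, reading everything in one invocation can be slow and costly, which is one reason to process a single scan page per run, as the aggregator does.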

Testing

So far, so good: writing functions is simple and easy. But what about testing? Testing a Lambda function by deploying it, invoking it, and then watching the logs is a bad idea.

Writing unit tests is a crucial step in writing better code. Fortunately, Lambda functions are easily testable. Moto is a powerful library for testing Lambda functions. It mocks AWS services like DynamoDB, and the mocked service behaves like the real one.
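The project itself relies on Moto. As a dependency-free illustration of the same idea, the sketch below swaps Moto for a stub from the standard library's unittest.mock, which only imitates the single call used by the code under test. The function list_users and its table parameter are hypothetical and not part of the project:

```python
import json
from unittest.mock import MagicMock


def list_users(table):
    """Return all items of the given table as an API Gateway style response."""
    items = table.scan()['Items']
    return {'statusCode': 200, 'body': json.dumps(items)}


def test_list_users():
    # The stub stands in for a DynamoDB table and returns a canned scan result
    fake_table = MagicMock()
    fake_table.scan.return_value = {'Items': [{'email': 'a@example.com'}]}

    response = list_users(fake_table)

    assert response['statusCode'] == 200
    assert json.loads(response['body']) == [{'email': 'a@example.com'}]
    fake_table.scan.assert_called_once_with()


test_list_users()
```

Passing the table in as an argument keeps the function independent of AWS. A stub like this does not reproduce DynamoDB's behaviour, which is where a full mock of the service such as Moto becomes valuable.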

You can find more about testing in the project's readme file.

Conclusion

We've seen that implementing a simple application is easy with the help of the Serverless Framework and the AWS stack.

There are several things which can be added/improved:

security: deny all permissions by default and allow only the needed permissions at the function level
ability to manage ads: add an authenticated user which can manage the scraped adverts
make the scraper configurable
improve the adverts matching algorithm

Links

Project on my GitLab
Serverless framework
Boto3
Moto
Awesome Serverless

Blog author

Jozef Jung
