Data crawlers in business platforms

21.5.2019 | 9 minutes reading time

Intro

Business is changing rapidly, and the market demands constant change and fast adoption. To support new requirements, business models are forced to evolve, and the growing popularity of online businesses accelerates these changes. Old models have to come up with new, more efficient approaches, and business platforms might just be one of them. The main goal of a business platform is to connect users and thereby create value and transactions among them. Depending on the participants’ roles, there are different types of platforms. For example, a platform where participants are businesses that provide services to other businesses is called a Business to Business (B2B) platform. When participants are businesses that provide services or sell goods to end users, it is called a Business to Customer (B2C) platform. To be successful, a platform must have enough participants to create value. Attracting as many users as possible is therefore one of the highest priorities, especially in the starting phase of a platform. Leveraging data from other platforms is one possible way to bootstrap the process. For example, we can find potential users on business directory sites, or on any other platform or service with openly available data that is meant for public use. Doing this manually is not practical, so the process needs to be automated.

Acquiring data through data crawlers, scrapers, and RPA

Acquiring data from other sources can be done by scraping web pages or through various web API services. For this purpose, we can create data crawlers, scrapers, or even use Robotic Process Automation (RPA) tools to obtain and process data. We will focus mainly on data crawlers and scrapers.
A data crawler is an automated program that connects to other sites and downloads pages. Data crawlers are also called spiders or web robots, and they are often used for website indexing by search engines. While crawling a website, they can generate a large number of requests and disrupt its normal operation. Therefore, they must follow the rules set by the website, which are usually defined in the robots.txt file in the root of the website. If a data crawler follows the rules from this file, or has its own rules that are not intrusive and not harmful to the site in any way, we consider it a “polite” crawler.
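To make the robots.txt check more concrete, here is a minimal sketch in plain Java. It is an assumption-laden simplification, not a complete parser: it only reads Disallow rules from the "User-agent: *" group and treats them as path prefixes. The class name RobotsRules and its methods are made up for this illustration; a production crawler would more likely rely on an existing robots.txt library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified robots.txt check (illustrative only): collects Disallow rules
// from the "User-agent: *" group. Allow rules, wildcards and agent-specific
// groups are deliberately ignored in this sketch.
final class RobotsRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inWildcardGroup = false;
        boolean previousWasAgent = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop comments and surrounding whitespace.
            String line = rawLine.replaceFirst("#.*$", "").trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                inWildcardGroup = (previousWasAgent && inWildcardGroup) || value.equals("*");
                previousWasAgent = true;
            } else {
                previousWasAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    rules.disallowedPrefixes.add(value);
                }
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the collected Disallow prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsRules rules = RobotsRules.parse("User-agent: *\nDisallow: /private/\n");
        System.out.println(rules.isAllowed("/search"));       // prints true
        System.out.println(rules.isAllowed("/private/data")); // prints false
    }
}
```

A crawler would download the robots.txt file once per site, parse it like this, and call isAllowed before requesting each path.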
Scrapers are software tools that extract targeted content from web pages and parse it into a specific format.

User attraction

When we create platforms, we need to attract users, who are both producers and consumers. This is a classic chicken-and-egg problem: without producers there are no consumers, and vice versa. We can search existing platforms such as social networks, forums, or business directories for data about potential users. Because of the large number of entries, processing them manually is not feasible; by hand we can only identify the data sources for potential producers and customers. For example, if we wanted to get all dental services in one city or region, we could search business directory sites in that field or use other platforms that provide that type of data. To automate these processes, we can use data crawlers to search and scrapers to extract relevant data from the search results.

Data scraping

There are multiple ways to scrape data from websites. The most common way is to make an HTTP request to the site’s server, which returns the entire requested page as a response; we can then select and scrape the data that we need for further analysis. The other way is to use API endpoints. This is the easiest and fastest way to obtain data, because it arrives already structured and often needs no additional processing or formatting. The response is usually in JSON (JavaScript Object Notation) or XML (eXtensible Markup Language) format, which makes it easy to process. On the other hand, the disadvantage of these services is the limited number of free requests.

Here are a few examples of data crawling and scraping. We will use Java as the programming language, together with these third-party libraries:

JSoup library for parsing HTML documents
HtmlUnit for executing async JS calls
Apache HTTP client for API requests.

Let’s assume, for example, that we need to crawl and scrape data about dental services, and that the site has contact information that we can use for sending promo materials to potential customers. Our goal, in this case, would be to attract them to our platform. Let’s also assume that this site lets us search medical branches by category and by city or region. We can use the JSoup library to make the request and extract the data. The request with JSoup for all dentists from Berlin would look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document document = Jsoup
        .connect("https://www.targetsite.info/search?city=berlin&category=dentists")
        .userAgent("cc_webbot") // identifying as a bot
        .timeout(3000)          // timeout in milliseconds
        .get();                 // executing the GET request

After executing this request, JSoup returns the page as a parsed HTML document. The results contain basic information about dentists from Berlin. Normally, we need to open each result in a new page to get detailed information about it. We can select elements or collect data using CSS or jQuery-like selector syntax. For example, let’s select the “div” elements with the class “results”:

Elements dentists = document.select("div.results");

Now we have a list of results to iterate through. To select the name, the address, and the link to the detail page of each result, we can do the following:

for (Element element : dentists) {
    String name = element.select("p.name").text();          // selecting name of dentist
    String address = element.select("p.address").text();    // selecting address
    String link = element.select("a.details").attr("href"); // and URL link to detail page
}

After selecting the elements, we can use the link to request the page that contains the detailed information and scrape all the other data that we need for our platform.
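The href value of a result link is often relative to the search page, so it has to be turned into an absolute URL before the detail page can be requested. The following is a minimal sketch using only the JDK; the class name DetailLinks and the path /dentists/123 are made up for illustration.

```java
import java.net.URI;

final class DetailLinks {
    // Resolves an href taken from a result page against the URL of that page.
    // An href that is already absolute is returned unchanged.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String searchPage = "https://www.targetsite.info/search?city=berlin&category=dentists";
        // prints https://www.targetsite.info/dentists/123
        System.out.println(toAbsolute(searchPage, "/dentists/123"));
    }
}
```

JSoup also provides absUrl("href") on an element, which should give the same result when the document was loaded with a known base URL.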
A search can return dozens or even hundreds of results. To save resources and speed up the search, sites that provide such services limit the number of results per response and paginate them, so we should crawl all pages to get all possible results. Usually, pagination is done by adding a parameter to the requested URL, e.g. &pageNumber=23, or by using a selector to select the link to the next page from the parsed HTML document.
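The URL-parameter variant of pagination can be sketched as a simple loop. The sketch assumes that page numbers start at 1, that the base URL already contains a query string, and that an empty page means there are no further results; none of this is guaranteed for a real site. The fetchPage function is a stand-in for the actual request-and-scrape step, and the class name PaginatedCrawl is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class PaginatedCrawl {
    // Requests page 1, 2, 3, ... until a page yields no results or maxPages is reached.
    // fetchPage stands in for the real request-and-scrape step.
    static List<String> collectAll(String baseUrl, int maxPages,
                                   Function<String, List<String>> fetchPage) {
        List<String> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            String url = baseUrl + "&pageNumber=" + page;
            List<String> results = fetchPage.apply(url);
            if (results.isEmpty()) {
                break; // ran past the last page
            }
            all.addAll(results);
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake fetcher for demonstration: only pages 1 and 2 contain results.
        Function<String, List<String>> fake = url ->
                url.endsWith("=1") ? List.of("Dentist A", "Dentist B")
              : url.endsWith("=2") ? List.of("Dentist C")
              : List.of();
        String baseUrl = "https://www.targetsite.info/search?city=berlin&category=dentists";
        // prints [Dentist A, Dentist B, Dentist C]
        System.out.println(collectAll(baseUrl, 50, fake));
    }
}
```

The maxPages limit keeps the crawler from looping forever if a site keeps returning results for any page number.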

The previous example will work in most cases, but some sites use JavaScript to create and render elements and data asynchronously. JSoup does not execute JavaScript, so it cannot handle such pages. For scraping these sites we can use HtmlUnit, a headless simulated browser that can do almost everything a real browser can. If we assume that our site from the first example is dynamically creating elements and data, we can use HtmlUnit like this:

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(true);
HtmlPage page = webClient.getPage("https://www.targetsite.info/search?city=berlin&category=dentists");
webClient.waitForBackgroundJavaScript(3000); // give asynchronous scripts up to 3 seconds to finish

Document document = Jsoup.parse(page.asXml()); // parsing the rendered DOM with JSoup

After the request is executed, we can parse the resulting page with JSoup and use it as we did in the previous example.

The disadvantage of both approaches is that scraping relies on parsing HTML documents and selecting data from elements using selectors. Frequent design changes on a site can alter class names or the order of elements, so we might need to re-implement the selectors to get the required data. Scraping can also be a slow process and is prone to some inaccuracy.
We must also take a “polite” approach to the sites that we are crawling. For example, we don’t want to create too many requests in a short period of time or to crawl and scrape resources that are not allowed to be scraped. We must follow the rules that are defined in the robots.txt file.
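One simple way to avoid sending too many requests in a short period is to enforce a minimum interval between them. The sketch below is one possible approach, not a prescribed one; the class name RequestThrottle and the one-second interval are assumptions chosen for illustration, and a suitable interval depends on the target site.

```java
// Enforces a minimum interval between requests (single-threaded sketch).
final class RequestThrottle {
    private final long minIntervalMillis;
    private boolean hasPrevious = false;
    private long lastRequestMillis;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // How long to wait at time nowMillis before the next request may be sent.
    long millisToWait(long nowMillis) {
        if (!hasPrevious) {
            return 0;
        }
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    // Records that a request was sent at nowMillis.
    void markRequest(long nowMillis) {
        hasPrevious = true;
        lastRequestMillis = nowMillis;
    }

    // Blocks until the minimum interval has passed, then records the request time.
    void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        RequestThrottle throttle = new RequestThrottle(1000); // at most one request per second
        for (int i = 1; i <= 3; i++) {
            throttle.acquire();
            System.out.println("request " + i + " may be sent now");
        }
    }
}
```

Calling acquire() before every page request spaces the requests out without changing the scraping code itself.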

The third approach to obtaining data for our platform is to use other platforms or services that give us access to their data through API endpoints. The responses from these endpoints are usually in either XML or JSON format. Converting this type of data is faster and easier than parsing an entire HTML response with JSoup, and it is also less prone to errors.

Let’s see how we can obtain the dental services in Berlin from an API endpoint. Usually, requests to such services are authenticated, so you must have an API key issued by the service owner and provide it with every request. We will use the Apache HTTP client to make a request against the API endpoint, and the request will look like this:

String apiEndpointUrl = "https://api.service.com/v1/json?search=dentists+near+berlin&api-key=";
HttpGet getRequest = new HttpGet(apiEndpointUrl);
HttpClient httpClient = HttpClients.createDefault();
HttpResponse response = httpClient.execute(getRequest);

In this request, we first provide the URL of the API endpoint together with the search parameters and the key. We are also requesting the response in JSON format. If there are no problems, executing these commands returns a response with results from the server. Before further processing, we must extract those results and convert them into Java objects. We can use the Jackson ObjectMapper for this:

ObjectMapper mapper = new ObjectMapper();
ApiSearchResults searchResults = mapper.readValue(response.getEntity().getContent(), ApiSearchResults.class);
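The ApiSearchResults class is not shown in this article; it has to mirror the JSON structure that the service returns. As a purely hypothetical sketch, assuming a response with a status field and a list of results that each have a name, an address, and a phone number, it could look like this. All field names here are assumptions and depend on the actual service.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the API response; real field names depend on the service.
// With its default settings, Jackson binds JSON properties to public fields
// (or getter/setter pairs) of the same name and needs a no-argument constructor.
final class ApiSearchResults {
    public String status;
    public List<Result> results = new ArrayList<>();

    static final class Result {
        public String name;
        public String address;
        public String phone;
    }

    public static void main(String[] args) {
        // Building an instance by hand, just to show the structure.
        ApiSearchResults searchResults = new ApiSearchResults();
        Result dentist = new Result();
        dentist.name = "Example Dental Practice";
        dentist.address = "Example Street 1, Berlin";
        searchResults.results.add(dentist);
        System.out.println(searchResults.results.size() + " result(s)"); // prints 1 result(s)
    }
}
```

By default, Jackson throws an exception when the JSON contains a property that the class does not declare, so for partial mappings the class is typically annotated with @JsonIgnoreProperties(ignoreUnknown = true).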

After converting the response to Java objects, we can process the data and use it for our platform. These services usually limit the number of free requests that we can make against their endpoints, but if we need more requests, some kind of payment plan is typically available.
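The API request URL used earlier contains the search text and the key directly in the query string. When these values come from variables, they should be URL-encoded before they are appended. Here is a minimal sketch with the JDK’s URLEncoder; the class name ApiUrlBuilder and the key value "my-key" are made up for illustration, and the parameter names are taken from the example URL.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ApiUrlBuilder {
    // Builds the request URL and encodes the values so that spaces and
    // reserved characters do not break the query string.
    static String searchUrl(String baseUrl, String query, String apiKey) {
        return baseUrl
                + "?search=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&api-key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "my-key" is a placeholder; a real key would come from configuration.
        // prints https://api.service.com/v1/json?search=dentists+near+berlin&api-key=my-key
        System.out.println(searchUrl("https://api.service.com/v1/json", "dentists near berlin", "my-key"));
    }
}
```

The resulting string can be passed to HttpGet in the same way as the hard-coded URL.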

Summary

As mentioned before, there are many ways to attract users to business platforms. In this article, we showed how to use data crawlers and scrapers to preload your platform or other services with data. There are many other ways and techniques to collect data; in this article, we decided to cover the most common ones.

If we follow the first two examples and create crawlers and scrapers, we ought to create “polite” ones that respect the rules given by those sites and services. Data availability and frequent site design changes are also things to keep in mind. The best way to collect data would certainly be through API services. The only drawback is that access is limited by the number of requests, and a higher volume sometimes means higher costs.

If you would like to get in touch with us about building B2B and B2C platforms, contact me by email at novislav.sekulic@codecentric.de.

Blog author

Novislav Sekulic