Collect publications from a website by creating a "scraping bot" type source

User Role: Administrator, Explorer
Duration: 30 min
Objective: Learn how to properly create a bot builder

Use case

The "scraping bot" is a targeted collection robot dedicated to pages and websites.

It tracks new information published on a fixed URL and, optionally, on the URLs linked from that fixed URL (one level deeper).

Many websites built with a CMS show previews of articles on a fixed URL, but publish the full text of these articles on dynamic URLs.

Dynamic URLs are generally reached by clicking on a "Read more" type link.

A dynamic URL is a different URL for each article, e.g. with the date or the title of the publication, which cannot be guessed in advance.
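
To make the distinction concrete, here is a minimal sketch in Python of how "Read more" links found on a fixed URL resolve to dynamic URLs. The domain, the paths and the variable names are all hypothetical; the sketch only illustrates the idea and says nothing about how Cikisi resolves links internally.

```python
from urllib.parse import urljoin

# Hypothetical fixed (first-level) URL: the listing page that never changes.
FIXED_URL = "https://example.com/news/"

# Hypothetical "Read more" hrefs as they might appear on the listing page.
# They contain a date and a title, so they cannot be guessed in advance.
read_more_links = [
    "2024/05/12/new-product-launch",
    "/news/2024/05/13/quarterly-results",
]

# Each link is resolved against the fixed URL to obtain the dynamic URL.
dynamic_urls = [urljoin(FIXED_URL, href) for href in read_more_links]

for url in dynamic_urls:
    print(url)
```

The fixed URL is the only address known when the robot is created; the dynamic URLs are discovered at each collection.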

 

General functioning of the Cikisi scraping bot

  • On the fixed URL, corresponding to the first level, the robot will collect information related to different articles
    • At least the title of the article and the link to the full article, often the description and the date of publication, sometimes the image of the article
    • The fixed URL is the one to be added first in the robot creation form
  • On the so-called dynamic URLs, corresponding to the second level, the robot will collect information to complete the information already associated with the different articles
    • Often the full content of the article

Type 1 is by far the most frequent case.
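
The two-level collection described above can be sketched as follows. This is only an illustration under stated assumptions: the pages are hypothetical and kept in memory so the example runs offline, the markup is well-formed so that Python's standard XML parser can read it (real pages would need an HTML parser), and the `collect` function is not Cikisi's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages, keyed by URL, standing in for a real website.
PAGES = {
    "https://example.com/news/": """
        <html><body>
          <div class="article">
            <h2>First title</h2>
            <a href="https://example.com/news/first">Read more</a>
          </div>
          <div class="article">
            <h2>Second title</h2>
            <a href="https://example.com/news/second">Read more</a>
          </div>
        </body></html>""",
    "https://example.com/news/first":
        "<html><body><div class='content'>Full text of the first article.</div></body></html>",
    "https://example.com/news/second":
        "<html><body><div class='content'>Full text of the second article.</div></body></html>",
}


def collect(fixed_url, pages):
    """Read title and link on the first level, then add the content from the second."""
    articles = []
    listing = ET.fromstring(pages[fixed_url].strip())
    for block in listing.iterfind(".//div[@class='article']"):
        # First level: the fixed URL gives at least the title and the link.
        article = {
            "title": block.findtext("h2"),
            "link": block.find("a").get("href"),
        }
        # Second level: the dynamic URL completes the article with its full content.
        detail = ET.fromstring(pages[article["link"]].strip())
        article["content"] = detail.findtext(".//div[@class='content']")
        articles.append(article)
    return articles


articles = collect("https://example.com/news/", PAGES)
```

Each resulting article combines fields from both levels: the title and link from the fixed URL, the content from the dynamic URL.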

 

Robot configuration

It is necessary to specify to Cikisi which parts of the web pages to associate with which fields (title, image, description, etc.).

It is therefore necessary to define selectors, which simply designate the HTML tags to collect.
Note that pages whose content is loaded by JavaScript (which is becoming rare) cannot be collected with this type of robot.

Step 1 - Choosing a method to define selectors
The configuration of the robot can be done in two ways:

  • Via the interface visually representing the website by clicking on "Edit Selectors".
    • This configuration mode is ideal for people who do not have any knowledge of HTML language
    • The website to be monitored is displayed within Cikisi (in a kind of iframe)
    • It lets you quickly select the useful areas of the web page with the mouse
      • These zones appear in yellow during the selection and in blue once the selection is done
      • When selecting, the "u" key allows you to select the upper zone (nesting)
    • Once defined, a selector can also be improved/corrected using the keyboard within an input field appearing in blue at the bottom of the screen
  • Via the input form by clicking on "Edit Selectors manually"
    • This configuration mode is ideal for people with a good knowledge of HTML
    • It is the only possible option if the website refuses to be represented in Cikisi (in this case you are facing a blank page)

 

Step 2 - Definition of the "Wrapper" selector

The "wrapper" is the most important selector: without it, the configuration cannot be done.

As "wrapper" you have to define the area of the web page that contains ONE AND ONLY ONE article.
In other words, within the "wrapper" you can never find two different articles.
In the same way, all the 1st level fields of the same article (title, description, etc.) must be found within this "wrapper".
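
The wrapper rule can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the sample markup, the `CONFIG` layout, the two helper functions, and the use of the standard library's ElementTree path syntax in place of Cikisi's own selector syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page (well-formed so the standard XML parser accepts it).
LISTING = """
<div>
  <ul class="news">
    <li class="news-item">
      <h3>Title A</h3>
      <p class="summary">Summary A</p>
    </li>
    <li class="news-item">
      <h3>Title B</h3>
      <p class="summary">Summary B</p>
    </li>
  </ul>
</div>
"""

# Hypothetical configuration: one wrapper, plus field selectors relative to it.
CONFIG = {
    "wrapper": ".//li[@class='news-item']",
    "fields": {"title": ".//h3", "description": ".//p[@class='summary']"},
}


def extract(markup, config):
    """Return one dictionary of fields per wrapper found in the page."""
    root = ET.fromstring(markup.strip())
    return [
        # Each field is searched inside the wrapper only, never in the whole page.
        {name: wrapper.findtext(path) for name, path in config["fields"].items()}
        for wrapper in root.iterfind(config["wrapper"])
    ]


def wrapper_is_valid(markup, config):
    """A wrapper is valid here if every match contains exactly one title."""
    root = ET.fromstring(markup.strip())
    wrappers = root.findall(config["wrapper"])
    title_path = config["fields"]["title"]
    return bool(wrappers) and all(len(w.findall(title_path)) == 1 for w in wrappers)


articles = extract(LISTING, CONFIG)
too_wide = {**CONFIG, "wrapper": ".//ul[@class='news']"}
```

With the `li` wrapper, each match holds a single article. The `too_wide` variant selects the whole list, so one wrapper would contain two titles, which is exactly the situation to avoid.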

 

Step 3 - Choosing selectors at the 1st or 2nd level

Whether you are working in the interface representing the website or in the form, you can move selectors from the first level to the second (and vice versa) using drag and drop. For example, if the date is not present on the first level, drag its selector to the second level.

We recommend that you always take the information from the first level if it is available there.
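
The recommendation amounts to a simple decision rule, sketched below in Python. The function name and the sets of available fields are hypothetical; in Cikisi the choice is made by dragging each selector to the desired level.

```python
def choose_level(field, first_level_fields, second_level_fields):
    """Return 1 if the field exists on the first level, else 2, else None."""
    if field in first_level_fields:
        return 1  # always prefer the first level when the field is available there
    if field in second_level_fields:
        return 2  # otherwise fall back to the second level
    return None   # the field cannot be collected on either level


# Hypothetical site: the date and the full content only appear on the article page.
first_level = {"title", "link", "description"}
second_level = {"title", "date", "content"}

plan = {
    field: choose_level(field, first_level, second_level)
    for field in ["title", "link", "description", "date", "content"]
}
```

Here the title is available on both levels, so it stays on the first; only the date and the content are assigned to the second.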

 

Step 4 - Advanced Options

 

Close pop-up windows

Use the "close pop-up modal" option if your robot cannot access a level because a modal window appears above the content (for example, a "Subscribe" or "Accept cookies" window).

 

Anti "bot detection"

Use the "Use US proxy" option in the creation form if your collection robot is blocked by the site. In 80% of cases, using Cikisi's residential proxy lets you bypass the bot detection system set up by the site.