Use case
The "scraping bot" is a targeted collection robot that monitors specific web pages and websites.
It tracks new information available on a fixed URL and, optionally, on URLs located one depth level below that fixed URL.
Many websites created with the help of a CMS show previews of articles on a fixed URL, but offer the entirety of these articles on dynamic URLs.
Dynamic URLs are generally accessible by clicking on a link of the type "Read more".
A dynamic URL is a different URL for each article, e.g. with the date or the title of the publication, which cannot be guessed in advance.
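To make the distinction concrete, here is a minimal sketch of how the dynamic URLs behind "Read more" links could be discovered from the fixed URL. It is not Cikisi's implementation: the fixed URL, the HTML snippet and the names `ReadMoreLinkParser` and `find_dynamic_urls` are assumptions made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical fixed URL and preview markup, for illustration only.
FIXED_URL = "https://example.com/news/"
LISTING_HTML = """
<div class="article">
  <h2>First headline</h2>
  <a href="/news/2024/05/first-headline">Read more</a>
</div>
<div class="article">
  <h2>Second headline</h2>
  <a href="second-headline.html">Read more</a>
</div>
"""


class ReadMoreLinkParser(HTMLParser):
    """Collects the href of every <a> element whose text is 'Read more'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside a link that has an href.
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            if "".join(self._text).strip().lower() == "read more":
                self.links.append(self._current_href)
            self._current_href = None


def find_dynamic_urls(fixed_url, html):
    """Return the absolute URL behind each 'Read more' link of the page."""
    parser = ReadMoreLinkParser()
    parser.feed(html)
    parser.close()
    # Links may be relative, so resolve them against the fixed URL.
    return [urljoin(fixed_url, href) for href in parser.links]


dynamic_urls = find_dynamic_urls(FIXED_URL, LISTING_HTML)
```

The two resulting URLs differ for each article and could not have been guessed from the fixed URL alone, which is why they have to be read from the page itself.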
General functioning of the Cikisi scraping bot
Type 1 is by far the most frequent case.
Robot configuration
You must tell Cikisi which parts of the web page correspond to which fields (title, image, description, etc.).
To do so, you define selectors, which simply identify the HTML tags that contain each field.
Note that web pages that rely on JavaScript to display their content (which is becoming rare) cannot be collected with this type of robot.
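The idea of associating page parts with fields can be pictured with a short sketch. Everything in it is an assumption for illustration: the mapping `FIELD_SELECTORS`, the sample markup and the function `extract_fields` are hypothetical, and the limited XPath syntax of Python's `xml.etree.ElementTree` stands in for whatever selector syntax Cikisi actually uses. The snippet must be well-formed for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Hypothetical field-to-selector mapping (XPath used as a stand-in syntax).
FIELD_SELECTORS = {
    "title": ".//h2",
    "description": ".//p[@class='summary']",
    "image": ".//img",
}

# Hypothetical, well-formed markup for a single article.
ARTICLE_HTML = """
<div class="article">
  <h2>Quarterly results published</h2>
  <img src="/img/results.png" alt="chart" />
  <p class="summary">Revenue grew compared with last year.</p>
</div>
"""


def extract_fields(article, selectors):
    """Apply each selector to the article element and return field values."""
    fields = {}
    for name, path in selectors.items():
        element = article.find(path)
        if element is None:
            fields[name] = None  # the selector matched nothing
        elif element.tag == "img":
            fields[name] = element.get("src")  # images are identified by their source
        else:
            fields[name] = (element.text or "").strip()
    return fields


article = ET.fromstring(ARTICLE_HTML.strip())
fields = extract_fields(article, FIELD_SELECTORS)
```

A selector that matches nothing yields `None` here, which is the situation where a field would have to be looked for elsewhere on the page or at another level.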
Step 1 - Choosing a method to define selectors
The robot can be configured in two ways: in the interface representing the website, or in the form.
Step 2 - Definition of the "Wrapper" selector
The "wrapper" is the most important selector; without it, the configuration cannot be done.
The "wrapper" must be the area of the web page that contains ONE AND ONLY ONE article.
In other words, within the "wrapper" you can never find two different articles.
In the same way, all the 1st level fields of the same article (title, description, etc.) must be found within this "wrapper".
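The "one and only one article" rule can be checked mechanically. The following is a minimal sketch under stated assumptions: the listing markup, the XPath expressions and the function `split_into_articles` are hypothetical, and `xml.etree.ElementTree` is used only because it ships with Python. It shows a wrapper that is correct and one that is too broad.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing page with two article previews.
LISTING_HTML = """
<div id="news">
  <div class="article">
    <h2>First headline</h2>
    <p>First summary.</p>
  </div>
  <div class="article">
    <h2>Second headline</h2>
    <p>Second summary.</p>
  </div>
</div>
"""


def split_into_articles(root, wrapper_path, title_path):
    """Return one title per wrapper, refusing wrappers that hold several articles."""
    titles_found = []
    for wrapper in root.findall(wrapper_path):
        titles = wrapper.findall(title_path)
        if len(titles) != 1:
            raise ValueError(
                f"wrapper matches {len(titles)} titles, expected exactly 1"
            )
        titles_found.append((titles[0].text or "").strip())
    return titles_found


root = ET.fromstring(LISTING_HTML.strip())

# Correct wrapper: each match contains a single article.
good_titles = split_into_articles(root, ".//div[@class='article']", ".//h2")

# Wrapper that is too broad: "." selects the whole listing, which holds two articles.
try:
    split_into_articles(root, ".", ".//h2")
    too_broad_error = None
except ValueError as err:
    too_broad_error = str(err)
```

If the check fails, the wrapper should be narrowed until each match covers a single article together with all of its first-level fields.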
Step 3 - Choosing selectors at the 1st or 2nd level
Whether you work in the interface representing the website or in the form, you can move selectors from the first level to the second (and vice versa) by drag and drop. For example, if the date is not present on the first level, drag the date selector to the second level.
We recommend that you always take the information from the first level if it is available there.
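The recommendation amounts to a simple preference rule, sketched below. The function `merge_levels` and the sample values are hypothetical and only illustrate the logic: a field is taken from the first level when it has a value there, and from the second level otherwise.

```python
def merge_levels(first_level, second_level):
    """Prefer first-level values; fall back to the second level when missing."""
    merged = {}
    for name in sorted(first_level.keys() | second_level.keys()):
        first_value = first_level.get(name)
        # An empty or missing first-level value triggers the fallback.
        merged[name] = first_value if first_value else second_level.get(name)
    return merged


# Hypothetical values: the preview (first level) has no date.
first_level = {"title": "First headline", "image": "/img/first.png", "date": None}
second_level = {"title": "First headline | Example News", "date": "2024-05-02"}

article = merge_levels(first_level, second_level)
```

In this example the title and image come from the first level, and only the date is taken from the second level.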
Step 4 - Advanced Options
Close pop-up windows
Use "close pop-up modal" if your robot cannot access a level because a modal window appears above the content (example: "Subscribe" or "Accept cookies" window).
Anti "bot detection"
Use the "Use US proxy" option in the creation form if your collection robot is blocked by the site. In 80% of cases, using Cikisi's residential proxy bypasses the bot detection system set up by the site.
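For readers curious about what routing a collection through a proxy means in practice, here is a generic sketch using Python's standard `urllib`. It does not describe how Cikisi's option works internally: the status codes treated as "blocked", the proxy address in the usage note and the function names are all assumptions.

```python
import urllib.error
import urllib.request

# Status codes treated as "blocked by bot detection" (an assumption for this sketch).
BLOCKED_STATUSES = {403, 429}


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_with_proxy_fallback(url, fetch_direct, fetch_via_proxy):
    """Try a direct fetch first; retry through the proxy only if the site blocks it."""
    try:
        return fetch_direct(url)
    except urllib.error.HTTPError as err:
        if err.code in BLOCKED_STATUSES:
            return fetch_via_proxy(url)
        raise  # other errors are not a sign of bot detection
```

With a hypothetical proxy at `http://proxy.example.com:8080`, `fetch_via_proxy` could be `lambda u: build_proxy_opener("http://proxy.example.com:8080").open(u, timeout=10).read()`, and `fetch_direct` could be `lambda u: urllib.request.urlopen(u, timeout=10).read()`. Passing the two fetchers as arguments keeps the fallback logic easy to try out without any network access.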