Collecting publications from a website by creating a website crawler

User Role: Administrator, Explorer
Duration: 15 min
Objective: To learn when and how to properly create a website crawler

Agenda of the training session


  • Use cases
  • Advantages and disadvantages
  • Technical principles
  • Definition of the 1st level
  • Targeting of certain pages
  • Frequency of collection
  • Main content
  • PDF associated with a page


User Manual(s)

Sources


FAQs

  • What is the crawl depth of the website crawler?

    Each individual crawl session has a depth of 4, counted from the starting URL chosen in the source configuration. However, every second crawl session starts from a page already created during a previous crawl session, chosen at random. Because each restart can begin deeper in the site than the original starting URL, the Cikisi crawler reaches deeper pages with every iteration, so in practice the overall crawl depth is unlimited.
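    The mechanism described above can be illustrated with a minimal, self-contained sketch (no network access; the site graph, page names, and function names are hypothetical, not the Cikisi implementation): each session crawls at most 4 levels deep, and every second session restarts from a randomly chosen page discovered earlier, so coverage deepens over time.

    ```python
    import random

    # Hypothetical site: a chain of pages, each linking one level deeper.
    SITE = {f"/page{i}": [f"/page{i+1}"] for i in range(20)}
    SITE["/page20"] = []

    MAX_DEPTH = 4  # per-session depth limit, as in the FAQ


    def crawl_session(start, seen):
        """Breadth-first crawl, limited to MAX_DEPTH levels below `start`."""
        frontier, depth = [start], 0
        while frontier and depth <= MAX_DEPTH:
            next_frontier = []
            for url in frontier:
                seen.add(url)
                next_frontier.extend(u for u in SITE.get(url, []) if u not in seen)
            frontier, depth = next_frontier, depth + 1


    def run(sessions, seed=0):
        """Run several sessions; every second one restarts from a random known page."""
        random.seed(seed)
        seen = set()
        for i in range(sessions):
            if i % 2 == 1 and seen:
                start = random.choice(sorted(seen))  # restart deeper in the site
            else:
                start = "/page0"  # the configured starting URL
            crawl_session(start, seen)
        return seen


    pages = run(6)
    ```

    A single session from `/page0` only reaches 5 pages (depths 0 through 4); repeated sessions with random restarts progressively extend coverage beyond that limit.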

  • Can I choose the depth of the crawl?

    No. We have instead opted for a more precise limitation, based on the structure that the URL of the article to be created must have. For example, you can ask the crawler to collect only articles whose URL contains /en/news.
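    A minimal sketch of this kind of URL-pattern limitation (the function name, fragment constant, and example URLs are illustrative assumptions, not Cikisi's actual configuration format):

    ```python
    from urllib.parse import urlparse

    # Hypothetical filter mirroring the "/en/news" example from the FAQ:
    # only URLs whose path contains the configured fragment are collected.
    REQUIRED_FRAGMENT = "/en/news"


    def should_collect(url: str) -> bool:
        """Return True when the URL path contains the required fragment."""
        return REQUIRED_FRAGMENT in urlparse(url).path


    urls = [
        "https://example.com/en/news/2024/budget-update",
        "https://example.com/fr/actualites/2024/budget",
        "https://example.com/en/about-us",
    ]
    collected = [u for u in urls if should_collect(u)]
    # Only the first URL matches the /en/news pattern.
    ```

    Filtering on the URL path rather than on a numeric depth gives finer control: only pages that belong to the targeted section of the site become articles, regardless of how many clicks away they are.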