This article is for Account Owners and Project Managers.
The Context Crawler feature lets you collect context from a website. All you need is the URL of the website to be crawled; the crawler then finds the pages to visit on its own. This removes the need for third-party tools or a tech team to crawl the site for you. It's useful for non-GDN web projects that have no previous context and no need to target specific pages. (However, Context Crawler can still be used for GDN projects.)
Currently, each crawl is limited to 1000 pages. Once the crawler has started, you can navigate away from the page if you need to.
If your site has more than 1000 pages, you will need to run the crawler a second time, using a different URL as the starting point. After the initial crawl, you'll see a list of all pages that were processed. Check your new starting point against that list to make sure it has not already been crawled. For example, your initial crawl may have started from https://xyz.com, while your second crawl might start from https://xyz.com/blogpages.
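If you copy the list of processed pages out of the product, the check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample URLs under xyz.com, and the normalization rules (ignoring case in the host, trailing slashes, query strings, and fragments) are assumptions, not part of Context Crawler.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host; drop trailing slash, query, and fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def already_crawled(candidate: str, processed_pages: list[str]) -> bool:
    """Return True if the candidate starting point appears in the processed list."""
    seen = {normalize(url) for url in processed_pages}
    return normalize(candidate) in seen


# Hypothetical list copied from the results of a first crawl.
processed = [
    "https://xyz.com",
    "https://xyz.com/about/",
    "https://xyz.com/pricing",
]

print(already_crawled("https://xyz.com/blogpages", processed))  # False
print(already_crawled("https://xyz.com/about", processed))      # True
```

A starting point that returns False here has not been processed yet, so it is a reasonable candidate for the second crawl.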
To access Context Crawler:
- Go to Projects (select a project) > Content > Context Crawler. In the Domain field, enter the URL of the site you wish to crawl, including the https.
- If you have previously crawled that domain, you can turn the Overwrite option ON or OFF.
- If your site is password protected, enter your username and password in the Protected Website section so that the crawler can crawl all pages.
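Before pasting a URL into the Domain field, you may want to confirm that it includes the protocol. The sketch below shows one way to do that with Python's standard library. The function name is hypothetical, and accepting http as well as https is an assumption; the article itself only mentions https.

```python
from urllib.parse import urlsplit


def has_protocol_and_host(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parts = urlsplit(url.strip())
    # Assumption: both http and https are treated as acceptable protocols.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(has_protocol_and_host("https://xyz.com"))   # True
print(has_protocol_and_host("xyz.com"))           # False
print(has_protocol_and_host("www.xyz.com/blog"))  # False
```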
The crawler works best for websites where different pages live on different URLs, so that all context can be captured. For dynamic web applications, the JS library is a better option, because the Context Crawler can't simulate user activity such as opening menus or submitting forms.
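To see why separate URLs matter, consider how a link-following crawler typically discovers pages. The sketch below is a generic illustration using Python's standard library, not Context Crawler's actual implementation; the class name, function name, and sample HTML are all hypothetical. It collects only the links present in the page's static HTML, so content that appears after a click or a form submission is never found.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags in static HTML."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def discover_links(html: str, base_url: str) -> list[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: one ordinary link, one script-driven menu, one form.
page = """
<a href="/pricing">Pricing</a>
<button onclick="openMenu()">Menu</button>
<form action="/search"><input name="q"></form>
"""

print(discover_links(page, "https://xyz.com"))  # ['https://xyz.com/pricing']
```

Only the pricing page is discovered. The menu and the search form require user interaction, which is the kind of activity the article says the crawler cannot simulate.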