Regularly capturing and translating new content is essential to maintaining a fully translated website. Content capture is triggered when a page served through the Global Delivery Network (GDN) is loaded. This can happen through organic traffic from end users or by implementing the GDN Crawler - Smartling's automated solution for discovering new content on your website.
Use Cases
Here are some ways to use and benefit from the GDN Crawler.
Automate Content Capture
Rather than relying on organic traffic to capture new content, you can configure the GDN Crawler to browse each page automatically on a schedule. Automation is beneficial if your site has frequent content updates or an unpredictable publishing schedule. Automatic detection of new content reduces the possibility of untranslated text appearing on your site for long periods. You can also set up the GDN Crawler in staging or testing environments to capture and translate new content before it's pushed to production.
Tip: Once new content is captured, it sits in the Awaiting Authorization queue in your GDN project. You can automate Job creation by setting up a Job Automation rule that works with your crawling schedule.
Word Count Indication
In addition to capturing content for translation, the GDN Crawler can indicate how much content (word count) you have on your website. Knowing your word count total can help you align expectations and budgeting with your translation provider and internal procurement before starting a new project. Note that the crawl results will only count newly discovered text. Content already in your Awaiting Authorization queue or already in a Job is not included in the word count total for a crawl.
Tip: You can view the crawl results in the Crawl Schedule tab to check the word count captured.
How It Works
The GDN Crawler browses your site on a schedule. You can configure the site path the crawler browses, from the whole site down to a subset of pages. It views pages the way users do, simulating a browser that loads all content on the page. Any text not previously ingested into your project is detected during the crawl, so subsequent crawls of your site capture only new or updated content for translation. The crawl frequency should match how often your website content is updated.
New content appears in your GDN Project awaiting authorization and needs to be bundled into a Job to initiate the translation process. You can create a Job manually for all or some strings to control what content is translated and when.
Alternatively, you can automate the Job creation process with Job Automation rules. Depending on how the Job Automation rule is configured, it can also authorize the Job for translation. Creating and authorizing Jobs automatically may be helpful if you are familiar with your website's content and want new content sent for translation without requiring further approval.
Note: Whether you create and authorize Jobs manually or automate any part of the process, ensure that Job creation happens after each scheduled crawl has finished. Otherwise, you might not include all new content, and untranslated text will remain visible on your website.
Tip: The GDN Crawler does not generate billable GDN requests.
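For a rough mental model of the detection step, the sketch below shows how a crawl could capture only the strings that have not already been ingested into the project. The names and logic are illustrative only, not Smartling's actual implementation:

```python
# Conceptual sketch: a crawl keeps only the strings that are new to the project.
def capture_new_strings(page_strings: set[str], already_ingested: set[str]) -> set[str]:
    """Return only the strings that have not been ingested before."""
    return page_strings - already_ingested

# A second crawl of the same page captures only the newly added sentence.
ingested = {"Welcome to our store", "Contact us"}
page = {"Welcome to our store", "Contact us", "New spring collection is here"}
print(capture_new_strings(page, ingested))  # {'New spring collection is here'}
```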
Domains
The GDN Crawler can be set up to crawl your entire website or specified paths within the site. Only one target-language domain needs to be crawled, because the detected content will be available for translation into all languages. In other words, you don't need to configure the crawler to browse every translated version of your website. However, if you have configured your site with localized or bespoke content for a language, and you would like to keep its translations up to date with the content produced for that language, then you should include that target domain as well.
If your website has multiple domains, for example, https://www.smartling.com/ and https://resources.smartling.com, you can set up the GDN to capture each domain's content in your GDN project settings (Settings > Domains). Assuming all relevant domains are already configured in your GDN project, you can further control how the crawler inspects each by defining a crawl depth and Includes and Excludes path rules.
Crawl Depth
Crawl Depth defines how many links away from the starting point of a crawl should be followed before stopping. The starting point of a crawl is the domain being scanned, plus any Includes paths you specify. For example, with an Includes path of /products, that page is level 0, any links from that page are level 1, any links from those pages are level 2, and so forth. The default setting of 5 should capture most content on a typical site, but you may want to adjust it up or down based on the structure of your website.
Note: You can set one crawl depth per Crawl Configuration for the domains/paths defined in the configuration. If you require a different crawl depth for a domain, you must create another Crawl Configuration.
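As a rough illustration of how these levels work, the following sketch follows links breadth-first and stops a configurable number of levels from the starting page. The extract_links helper and the toy link graph are assumptions for the example; this is not the GDN Crawler's actual code:

```python
from collections import deque

def crawl(start_url, max_depth, extract_links):
    """Follow links breadth-first, stopping max_depth levels from the start URL."""
    visited, queue, pages = {start_url}, deque([(start_url, 0)]), []
    while queue:
        url, depth = queue.popleft()
        pages.append((url, depth))               # the starting page is level 0
        if depth == max_depth:
            continue                             # do not follow links past the limit
        for link in extract_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))  # links from this page are one level deeper
    return pages

# Toy link graph standing in for real pages.
links = {"/products": ["/products/shoes"], "/products/shoes": ["/products/shoes/red"]}
print(crawl("/products", 1, lambda url: links.get(url, [])))
# [('/products', 0), ('/products/shoes', 1)] -- the level-2 page is not visited
```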
Rule Syntax
Includes and Excludes rules are entered as URL paths that support simple wildcards. For example, /products or /blog/*/releasenotes. Multiple paths can be entered by putting each path on a separate line.
If you enter a path that consists only of a wildcard pattern, such as *blog*, the crawl automatically starts at the root folder / and looks for links with blog anywhere in the path. This is less efficient than starting from a specific path such as /aboutus/news/blog.
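As an approximation of how a simple wildcard rule could be matched against a URL path, the sketch below translates the rule into a regular expression. The GDN Crawler's exact matching behavior may differ; treat this as an illustration:

```python
import re
from fnmatch import translate

def path_matches(rule: str, url_path: str) -> bool:
    # Treat "*" as matching any run of characters; allow deeper pages after the rule.
    pattern = translate(rule + "*")
    return re.match(pattern, url_path) is not None

print(path_matches("/blog/*/releasenotes", "/blog/2024/releasenotes"))  # True
print(path_matches("/products", "/products/shoes"))                     # True
print(path_matches("*blog*", "/aboutus/news/blog/post-1"))              # True
```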
Includes
The Includes field lets you define the scope of the crawl within the specified domain(s). Here, you can enter the website paths for the areas you want the crawler to scan.
Tip: Includes paths are filters and direct the Crawler to visit only pages within those paths. You can specify multiple Includes paths to target multiple areas of your website. The crawler will start at the site's homepage if no Includes value is set.
Example
Domain: https://fr.smartling.com
Crawl depth: 3
If you want the crawler to capture content:
- In every path within the domain, leave the Includes rule blank.
- In a specific marketing page within the domain, insert the path in the Includes field, e.g. /resources/
In the first scenario, the crawler will capture content in https://fr.smartling.com and browse every link on that page to 3 levels deep.
In the second scenario, the crawler will start to capture content at https://fr.smartling.com/resources/. The levels defined in the crawl depth guide the crawler on how far to continue. Instead of crawling 3 levels deep across the entire website, the crawler dives 3 levels deep within the specified path and only captures content that includes /resources/ in the URL.
To illustrate this, the following are some pages the crawler would browse based on the example configuration in scenario 2:
Level 0: https://fr.smartling.com/resources/
Level 1: https://fr.smartling.com/resources/101
Level 2: https://fr.smartling.com/resources/101/languageai-webinar/
Level 3: https://fr.smartling.com/resources/101/languageai-webinar/thank-you/
Excludes
The Excludes field lets you specify any website path in which the crawler should never capture content. If the paths specified in Includes and Excludes overlap, the Excludes rule takes higher priority and overrides the Includes rule.
For example, if you set an Includes rule to scan https://fr.smartling.com/resources/101 and an Excludes rule to https://fr.smartling.com/resources/, the Excludes rule would override the Includes rule, and no pages would be scanned.
Each domain can only be part of one Includes and Excludes rule, but multiple paths can be defined within a single rule.
Simple wildcard expressions (*) are supported in the Includes and Excludes fields.
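The precedence between the two rule types can be summarized in a small sketch. The path_matches helper is assumed to do the simple wildcard matching described under Rule Syntax; this is an illustration, not the crawler's actual logic:

```python
def should_crawl(url_path, includes, excludes, path_matches):
    """Decide whether a page is in scope, with Excludes taking priority."""
    if any(path_matches(rule, url_path) for rule in excludes):
        return False                 # an Excludes match always wins
    if not includes:
        return True                  # no Includes rule: the whole domain is in scope
    return any(path_matches(rule, url_path) for rule in includes)

# /resources/101 matches the Includes rule, but the broader Excludes rule wins.
starts_with = lambda rule, path: path.startswith(rule)
print(should_crawl("/resources/101", ["/resources/101"], ["/resources/"], starts_with))  # False
```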
User Agents
The GDN Crawler behaves like a user browsing your website rapidly. When it connects to your site, it follows HTTP standards and sends an identifier known as a User Agent. This helps websites know what kind of browser and device is visiting the site, and in response, the site can alter what it displays, for example, by showing a mobile-optimized page to a browser running on a phone.
As the GDN Crawler will visit pages much more rapidly than a user can, this may trigger alerts with your web operations team if they have not been made aware of the crawl. To help ensure no one within your organization blocks this unknown traffic, you can notify your internal teams before deploying the GDN Crawler and work with them to specify a User-Agent value that identifies the GDN Crawler and exempts it from being blocked.
Tip: The default is SmartlingGDNCrawler/1.0. You can enter a custom value by selecting Custom User Agent and inserting a user agent identifier.
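As a hypothetical example of the kind of exemption your web operations team might add, the sketch below allowlists the default User-Agent value so crawler traffic is not rate limited or blocked. Your site's actual blocking or rate-limiting logic will look different:

```python
# Hypothetical server-side check: exempt the GDN Crawler's User-Agent from bot blocking.
ALLOWED_BOT_AGENTS = {"SmartlingGDNCrawler/1.0"}

def is_exempt_from_rate_limit(request_headers: dict[str, str]) -> bool:
    user_agent = request_headers.get("User-Agent", "")
    return user_agent in ALLOWED_BOT_AGENTS

print(is_exempt_from_rate_limit({"User-Agent": "SmartlingGDNCrawler/1.0"}))  # True
```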
HTTP Headers
You can define and use specific HTTP headers that the GDN Crawler will apply when connecting to your domain(s). For example, if your website has a shop locator popup, you can bypass this by inserting a default location cookie, so the crawler can access the content behind the popup.
Tip: For more information on HTTP headers, read MDN's Web Documentation.
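As an illustration of the idea, the sketch below builds a request with a pre-set location cookie so a locator popup never appears. The cookie name preferred_store is hypothetical; use whatever cookie your own site reads:

```python
import urllib.request

# Headers the crawler would send with every page request (illustrative values).
headers = {
    "Cookie": "preferred_store=paris-01",  # pre-set location so the popup never appears
    "Accept-Language": "fr-FR",            # any other header your site expects
}

request = urllib.request.Request("https://fr.smartling.com/resources/", headers=headers)
# urllib.request.urlopen(request) would fetch the page with those headers applied.
```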
Authentication / Session Cookies
Authentication or session cookies can be included here. This is a simplified way of applying an HTTP Cookie header. The size limit for most cookie headers is 4 KB, and this field will allow you to paste up to 4096 text characters.
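If you want to sanity-check that a cookie value fits within that limit before pasting it into the field, a quick check like the following works (the cookie value shown is a placeholder):

```python
session_cookie = "SESSIONID=abc123; csrftoken=def456"  # placeholder value
if len(session_cookie) > 4096:
    print(f"Cookie is {len(session_cookie)} characters; trim it below 4096.")
else:
    print("Cookie fits within the 4 KB limit.")
```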
Best Practices
- Periodically run full crawls of your sites to ensure translation coverage.
- Schedule smaller, more frequent crawls on areas of your site that contain high-value content.
- The intervals between scheduled crawls should be at least as long as it takes to complete the crawl. This time will vary depending on the size of your site and the areas included or excluded in your configuration.
- You can use the Includes and Excludes options in your configuration to limit the total scope of the crawl.
- If you use Job Automation rules, ensure they run after scheduled crawls are complete.
- Websites sitting behind complex authentication mechanisms (SSO, 2FA, etc.) will likely not work with the GDN Crawler unless your web operations team can provide some bypass mechanism.
Prerequisites
- A configured GDN project, including any content integration rules needed to capture dynamic content. This is covered in the onboarding phase of your deployment with Smartling.
- A website that does not require interactive authentication to access it (e.g., SSO, 2FA, etc.)
How To Access The Crawler
To access the GDN Crawler:
- In Smartling, go to Account Settings > GDN Hub
- You can access Domain Management and GDN Crawler settings for any project. There is also a link to take you directly to GDN documentation.
- Click Manage Crawler
- The Configurations tab lists all existing domain settings for crawls. Here you can create, edit, disable, and delete crawl configurations. Crawl Configurations allow you to define which domains to visit and provide tools for narrowing the scope and connection settings of the crawl.
- The Schedule tab lists all scheduled crawls and the status of each. Here you can start/pause/stop, edit, disable, and delete crawler schedules.
How To Configure The Crawler
- In Smartling, go to Account Settings > GDN Hub > Manage Crawler > Add Configuration
- Name the configuration (required)
- This is visible in the Configurations tab
- Choose the GDN project containing the domains you’d like to scan (required)
- Crawl depth (required): define how far down a website’s page hierarchy to crawl.
- The default value of 5 will work for most sites.
- Domain (required): choose from a list of domains configured in the selected GDN project.
- Includes: narrow the scope of paths to be crawled in your website by inserting the website path(s).
- Excludes: define website paths that content should never be captured in.
- If the value of both Includes and Excludes overlap, the Excludes rule takes higher priority and overrides the Includes rule.
- User-Agents: define and/or customize the user agent value.
- The default is SmartlingGDNCrawler/1.0, but this can be customized by selecting Custom User Agent and inserting a user agent identifier, or you can use some common User-Agent values provided.
- HTTP Header: define specific HTTP headers to be used during the scan.
- Authentication / Session Cookie: insert an authentication or session cookie value needed to bypass interactive authentication or other mechanisms that require user input.
- The crawler will automatically bypass GDN authentication on sites set to Protected within the Domains page of your project.
- Click Save Configuration.
Tip: Next, create the schedule the configuration should run on.
How To Schedule The Crawler
- In Smartling, go to Account Settings > GDN Hub > Manage Crawler > Schedule
- To start a crawl outside its normal schedule, use the Actions button in the list of schedules and select Start. You can also Pause, Resume, or Stop a crawl that is already running.
- To schedule a crawl, click Add Schedule
- Name the schedule (required)
- This is visible in the Crawl Schedules tab
- Select the crawl configuration you are scheduling (required)
- Repeats every: choose a frequency of daily, weekly (selecting the week interval and defining the days), or monthly (defining the day of the month). If you want the crawler to run on the last day of the month, we recommend using "Last day" instead of 28, 30, or 31.
- Time: choose the approximate time or cadence the crawl should run on. Make sure to choose the timezone appropriate to your web content creators.
- Click Save Schedule.
Note: You can use the same Crawl Configuration in multiple schedules. You must add a Crawl Configuration to a schedule to be able to run it.
Viewing Crawl History
You can view the history of each crawl, including all the pages captured, from the Schedule page.
- Go to Account Settings > GDN Hub > Manage Crawler > Schedule
- On the schedule you want to view the history of, click the ellipsis (...) > View History
- Here you have a detailed view of the pages captured during the crawl, including the word and string count.
- Click View Results to view details of each individual crawl, including additional information on the status of the crawl.
Viewing Crawl Results
To view crawl results:
- Go to Account Settings > GDN Hub > Manage Crawler > Schedule
- Under the Most Recent Crawl column, click the word and string count
- Here is a detailed view of the crawl path, the crawl status, the time it took to complete the crawl, and the captured string and word count.
- Click the path to view details of each level of the path crawl