
Web Crawler Connector

The Web Crawler connector recursively discovers and extracts content from websites. Starting from one or more seed URLs, it follows links, extracts page content using JSoup HTML parsing, and delivers each page as a searchable document.


How It Works

  1. Start with one or more seed URLs (starting points for the crawl)
  2. Fetch each page via HTTP GET with a randomized user agent
  3. Parse the HTML response with JSoup
  4. Extract content fields based on configured attribute mappings
  5. Create a Job Item and submit it to the pipeline
  6. Discover all links on the page
  7. Filter links against allow/deny patterns and file extension rules
  8. Add new, unvisited URLs to the crawl queue
  9. Repeat until no more URLs remain
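
The loop in steps 1–9 is essentially a breadth-first traversal with a visited set. In the sketch below the class and method names are illustrative (not the connector's actual classes), and `fetchLinks` stands in for the HTTP GET, JSoup parse, and link-discovery steps:

```java
import java.util.*;
import java.util.function.Function;

// Illustrative sketch of the crawl loop. fetchLinks stands in for the
// HTTP fetch + JSoup parse + link-extraction steps of the real connector.
public class CrawlLoopSketch {

    public static List<String> crawl(String seed, Function<String, List<String>> fetchLinks) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        List<String> order = new ArrayList<>();
        queue.add(seed);
        visited.add(seed);
        while (!queue.isEmpty()) {                      // repeat until no URLs remain
            String url = queue.poll();
            order.add(url);                             // here: extract fields, submit Job Item
            for (String link : fetchLinks.apply(url)) { // discover all links on the page
                if (visited.add(link)) {                // only enqueue new, unvisited URLs
                    queue.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Tiny in-memory "site" standing in for real fetched pages.
        Map<String, List<String>> site = Map.of(
            "/", List.of("/docs/a", "/docs/b"),
            "/docs/a", List.of("/docs/b", "/"),
            "/docs/b", List.of());
        System.out.println(crawl("/", u -> site.getOrDefault(u, List.of())));
        // prints [/, /docs/a, /docs/b]
    }
}
```

URL filtering (step 7) would plug in just before the `visited.add(link)` check, so denied URLs are never enqueued.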

Key Features

| Feature | Description |
| --- | --- |
| Recursive crawling | Follows links from seed URLs to discover all reachable pages |
| URL filtering | Allow and deny patterns (regex) control which URLs are crawled |
| File extension filtering | Configurable list of file extensions to include or exclude |
| Authentication | Basic HTTP authentication for protected sites |
| Random user agents | Generates random browser user-agent strings to avoid blocking |
| Visited URL tracking | CRC32 checksums prevent re-visiting the same URL in a single crawl |
| Locale detection | Extensible interface (`DumWCExtLocaleInterface`) for detecting content language |
| URL normalization | Normalizes URLs to avoid duplicate crawling of the same page |
| Incremental indexing | Checksum-based change detection — only re-indexes pages that changed |
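
The visited-URL tracking feature can be illustrated with `java.util.zip.CRC32` from the standard library (a sketch with hypothetical class names, not the connector's actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

// Sketch of CRC32-based visited-URL tracking: store the checksum of each
// URL instead of the full string, and skip URLs already seen in this crawl.
public class VisitedUrls {
    private final Set<Long> seen = new HashSet<>();

    static long checksum(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Returns true the first time a URL is seen, false on revisits. */
    public boolean markVisited(String url) {
        return seen.add(checksum(url));
    }

    public static void main(String[] args) {
        VisitedUrls v = new VisitedUrls();
        System.out.println(v.markVisited("https://example.com/a")); // true
        System.out.println(v.markVisited("https://example.com/a")); // false
    }
}
```

Storing 8-byte checksums rather than full URL strings keeps memory bounded on large crawls, at the cost of a small chance that a CRC32 collision causes a page to be skipped.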

Configuration

Source Settings

| Field | Description |
| --- | --- |
| Name | Identifier for this crawl source |
| Starting URLs | One or more seed URLs where the crawl begins |
| Endpoint | Base URL of the website |
| Username / Password | Optional HTTP Basic authentication credentials |

URL Filtering

| Field | Description |
| --- | --- |
| Allow URL Patterns | Regex patterns — only URLs matching these patterns are crawled |
| Deny URL Patterns | Regex patterns — URLs matching these patterns are skipped |
| Allowed File Extensions | List of file extensions to include (e.g., `.html`, `.php`, `.aspx`) |
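
Conceptually, the three filter settings combine like this. The sketch below is a simplification with illustrative names; the connector's actual matching rules (for example, how extension-less URLs are treated) may differ:

```java
import java.util.List;
import java.util.regex.Pattern;

// Simplified sketch of allow/deny/extension filtering. Deny patterns win,
// then allow patterns (if any) must match, then the extension is checked.
public class UrlFilter {
    private final List<Pattern> allow;
    private final List<Pattern> deny;
    private final List<String> extensions;

    public UrlFilter(List<String> allow, List<String> deny, List<String> extensions) {
        this.allow = allow.stream().map(Pattern::compile).toList();
        this.deny = deny.stream().map(Pattern::compile).toList();
        this.extensions = extensions;
    }

    public boolean accepts(String url) {
        if (deny.stream().anyMatch(p -> p.matcher(url).matches())) return false;
        if (!allow.isEmpty() && allow.stream().noneMatch(p -> p.matcher(url).matches())) return false;
        // Simplification: URLs without an extension (e.g. directory paths)
        // pass; otherwise the extension must be on the allowed list.
        int dot = url.lastIndexOf('.');
        int slash = url.lastIndexOf('/');
        if (dot > slash) {
            return extensions.isEmpty() || extensions.contains(url.substring(dot));
        }
        return true;
    }

    public static void main(String[] args) {
        UrlFilter f = new UrlFilter(
            List.of("https://docs\\.example\\.com/docs/.*"),
            List.of(".*\\.(css|js|png|jpg|gif|svg)$"),
            List.of());
        System.out.println(f.accepts("https://docs.example.com/docs/intro"));  // true
        System.out.println(f.accepts("https://docs.example.com/docs/app.js")); // false (deny)
        System.out.println(f.accepts("https://blog.example.com/post"));        // false (no allow match)
    }
}
```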

Attribute Mapping

Define which parts of the HTML page map to which fields in the search index:

| Field | Description |
| --- | --- |
| Title | CSS selector or extraction rule for the page title |
| Text | CSS selector for the main content body |
| URL | The page URL (extracted automatically) |
| Date | CSS selector or meta tag for the publication date |
| Custom fields | Additional attribute mappings for any HTML element or meta tag |
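
Since the connector parses pages with JSoup, a mapping like the one above amounts to running CSS selectors against the parsed document. The selectors in this sketch (`title`, `article`, `meta[name=date]`) are placeholders for whatever you configure:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch of applying attribute mappings with JSoup CSS selectors.
// Selectors shown are examples, not the connector's defaults.
public class AttributeMappingSketch {

    public static String[] extract(String html, String url) {
        Document doc = Jsoup.parse(html);
        String title = doc.select("title").text();           // Title mapping
        String text  = doc.select("article").text();         // Text (main body) mapping
        var dateMeta = doc.selectFirst("meta[name=date]");   // Date mapping
        String date  = dateMeta != null ? dateMeta.attr("content") : "";
        return new String[] { title, text, url, date };      // URL comes from the fetch itself
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Guide</title>"
            + "<meta name=\"date\" content=\"2024-01-01\"></head>"
            + "<body><article>Hello crawl</article></body></html>";
        String[] fields = extract(html, "https://docs.example.com/guide");
        System.out.println(String.join(" | ", fields));
        // prints Guide | Hello crawl | https://docs.example.com/guide | 2024-01-01
    }
}
```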

Example: Crawling a Documentation Site

Crawl a documentation site starting from the homepage, only following links within the /docs/ path:

| Setting | Value |
| --- | --- |
| Starting URL | `https://docs.example.com/` |
| Allow Pattern | `https://docs\.example\.com/docs/.*` |
| Deny Pattern | `.*\.(css\|js\|png\|jpg\|gif\|svg)$` |
| Target SN Site | Documentation |
| Locale | en_US |

The crawler will:

  1. Start at the homepage
  2. Follow all links matching /docs/ paths
  3. Skip links to stylesheets, scripts, and images
  4. Extract each page's title, body text, and URL
  5. Send each page to the Documentation SN Site in Turing ES
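
The allow and deny patterns in this example are ordinary Java regular expressions, so their effect on a given URL can be checked directly with `java.util.regex`:

```java
import java.util.regex.Pattern;

// Checking the example's allow/deny patterns against sample URLs.
public class PatternCheck {
    public static void main(String[] args) {
        Pattern allow = Pattern.compile("https://docs\\.example\\.com/docs/.*");
        Pattern deny  = Pattern.compile(".*\\.(css|js|png|jpg|gif|svg)$");

        String page  = "https://docs.example.com/docs/intro";
        String image = "https://docs.example.com/docs/logo.png";
        String other = "https://blog.example.com/post";

        System.out.println(allow.matcher(page).matches()
            && !deny.matcher(page).matches());               // true  -> crawled
        System.out.println(!deny.matcher(image).matches());  // false -> skipped by deny
        System.out.println(allow.matcher(other).matches());  // false -> no allow match
    }
}
```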

Locale Detection

The Web Crawler supports automatic locale detection via the DumWCExtLocaleInterface extension point. Implement this interface to determine the locale of each page based on:

  • URL path patterns (e.g., /en/, /pt-br/)
  • HTML lang attribute
  • Content-Language HTTP header
  • Custom logic specific to your site structure

If no locale extension is configured, the default locale from the source configuration is used for all pages.
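
A minimal URL-path-based detector might look like the sketch below. The real `DumWCExtLocaleInterface` method signature is defined by the connector; `detectFromPath` only illustrates the kind of logic such an extension could contain:

```java
import java.util.Locale;
import java.util.Map;

// Sketch of URL-path-based locale detection (illustrative only; adapt the
// logic to the actual DumWCExtLocaleInterface contract).
public class PathLocaleSketch {
    private static final Map<String, Locale> SEGMENTS = Map.of(
        "/en/", Locale.forLanguageTag("en"),
        "/pt-br/", Locale.forLanguageTag("pt-BR"));

    public static Locale detectFromPath(String url, Locale fallback) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (var e : SEGMENTS.entrySet()) {
            if (lower.contains(e.getKey())) return e.getValue();
        }
        return fallback; // default locale from the source configuration
    }

    public static void main(String[] args) {
        System.out.println(detectFromPath("https://example.com/pt-br/guia", Locale.US)); // pt_BR
        System.out.println(detectFromPath("https://example.com/about", Locale.US));      // en_US
    }
}
```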


Limitations

  • The crawler follows HTML links only — it does not execute JavaScript. Single-page applications (SPAs) that render content via JavaScript are not supported without a pre-rendering solution.
  • robots.txt — The crawler does not currently enforce robots.txt directives. Ensure you have permission to crawl the target site.
  • Rate limiting — There is no built-in rate limiter. For large sites, monitor server load and consider adding delays between requests.
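
Since there is no built-in rate limiter, one simple workaround is to enforce a fixed minimum delay before each request. This is a hypothetical wrapper you would add around your fetch calls, not part of the connector:

```java
// Hypothetical fixed-delay throttle: call await() before each HTTP request
// to guarantee a minimum gap between requests.
public class FixedDelay {
    private final long delayMillis;
    private long lastRequest = 0;

    public FixedDelay(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Blocks until at least delayMillis has passed since the last call. */
    public synchronized void await() throws InterruptedException {
        long wait = lastRequest + delayMillis - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastRequest = System.currentTimeMillis();
    }
}
```

For example, `new FixedDelay(500)` with `await()` before each fetch caps the crawl at roughly two requests per second.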

| Page | Description |
| --- | --- |
| Connectors Overview | All available connectors |
| Core Concepts | Pipeline, strategies, and change detection |
| Turing ES — Integration | How Turing ES receives content from connectors |