
Web Crawler Connector

The Web Crawler connector recursively discovers and extracts content from websites. Starting from one or more seed URLs, it follows links, extracts page content using JSoup HTML parsing, and delivers each page as a searchable document.


How It Works

  1. Start with one or more seed URLs (starting points for the crawl)
  2. Fetch each page via HTTP GET with a randomized user agent
  3. Parse the HTML response with JSoup
  4. Extract content fields based on configured attribute mappings
  5. Create a Job Item and submit it to the pipeline
  6. Discover all links on the page
  7. Filter links against allow/deny patterns and file extension rules
  8. Add new, unvisited URLs to the crawl queue
  9. Repeat until no more URLs remain
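
The loop in steps 1–9 is essentially a breadth-first traversal with a visited set. In the sketch below the class and method names are illustrative (not the connector's actual classes), and `fetchLinks` stands in for the HTTP GET, JSoup parse, and link-discovery steps:

```java
import java.util.*;
import java.util.function.Function;

// Illustrative sketch of the crawl loop. fetchLinks stands in for the
// HTTP fetch + JSoup parse + link-extraction steps of the real connector.
public class CrawlLoopSketch {

    public static List<String> crawl(String seed, Function<String, List<String>> fetchLinks) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        List<String> order = new ArrayList<>();
        queue.add(seed);
        visited.add(seed);
        while (!queue.isEmpty()) {                      // repeat until no URLs remain
            String url = queue.poll();
            order.add(url);                             // here: extract fields, submit Job Item
            for (String link : fetchLinks.apply(url)) { // discover all links on the page
                if (visited.add(link)) {                // only enqueue new, unvisited URLs
                    queue.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Tiny in-memory "site" standing in for real fetched pages.
        Map<String, List<String>> site = Map.of(
            "/", List.of("/docs/a", "/docs/b"),
            "/docs/a", List.of("/docs/b", "/"),
            "/docs/b", List.of());
        System.out.println(crawl("/", u -> site.getOrDefault(u, List.of())));
        // prints [/, /docs/a, /docs/b]
    }
}
```

URL filtering (step 7) would plug in just before the `visited.add(link)` check, so denied URLs are never enqueued.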

Key Features

| Feature | Description |
| --- | --- |
| Recursive crawling | Follows links from seed URLs to discover all reachable pages |
| URL filtering | Allow and deny patterns (regex) control which URLs are crawled |
| File extension filtering | Configurable list of file extensions to include or exclude |
| Authentication | Basic HTTP authentication for protected sites |
| Random user agents | Generates random browser user-agent strings to avoid blocking |
| Visited URL tracking | CRC32 checksums prevent re-visiting the same URL in a single crawl |
| Locale detection | Extensible interface (`DumWCExtLocaleInterface`) for detecting content language |
| URL normalization | Normalizes URLs to avoid duplicate crawling of the same page |
| Incremental indexing | Checksum-based change detection — only re-indexes pages that changed |
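
The visited-URL tracking feature can be illustrated with `java.util.zip.CRC32` from the standard library (a sketch with hypothetical class names, not the connector's actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

// Sketch of CRC32-based visited-URL tracking: store the checksum of each
// URL instead of the full string, and skip URLs already seen in this crawl.
public class VisitedUrls {
    private final Set<Long> seen = new HashSet<>();

    static long checksum(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Returns true the first time a URL is seen, false on revisits. */
    public boolean markVisited(String url) {
        return seen.add(checksum(url));
    }

    public static void main(String[] args) {
        VisitedUrls v = new VisitedUrls();
        System.out.println(v.markVisited("https://example.com/a")); // true
        System.out.println(v.markVisited("https://example.com/a")); // false
    }
}
```

Storing 8-byte checksums rather than full URL strings keeps memory bounded on large crawls, at the cost of a small chance that a CRC32 collision causes a page to be skipped.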

Configuration

Source Settings

| Field | Description |
| --- | --- |
| Name | Identifier for this crawl source |
| Starting URLs | One or more seed URLs where the crawl begins |
| Endpoint | Base URL of the website |
| Username / Password | Optional HTTP Basic authentication credentials |

URL Filtering

| Field | Description |
| --- | --- |
| Allow URL Patterns | Regex patterns — only URLs matching these patterns are crawled |
| Deny URL Patterns | Regex patterns — URLs matching these patterns are skipped |
| Allowed File Extensions | List of file extensions to include (e.g., `.html`, `.php`, `.aspx`) |
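
Conceptually, the three filter settings combine like this. The sketch below is a simplification with illustrative names; the connector's actual matching rules (for example, how extension-less URLs are treated) may differ:

```java
import java.util.List;
import java.util.regex.Pattern;

// Simplified sketch of allow/deny/extension filtering. Deny patterns win,
// then allow patterns (if any) must match, then the extension is checked.
public class UrlFilter {
    private final List<Pattern> allow;
    private final List<Pattern> deny;
    private final List<String> extensions;

    public UrlFilter(List<String> allow, List<String> deny, List<String> extensions) {
        this.allow = allow.stream().map(Pattern::compile).toList();
        this.deny = deny.stream().map(Pattern::compile).toList();
        this.extensions = extensions;
    }

    public boolean accepts(String url) {
        if (deny.stream().anyMatch(p -> p.matcher(url).matches())) return false;
        if (!allow.isEmpty() && allow.stream().noneMatch(p -> p.matcher(url).matches())) return false;
        // Simplification: URLs without an extension (e.g. directory paths)
        // pass; otherwise the extension must be on the allowed list.
        int dot = url.lastIndexOf('.');
        int slash = url.lastIndexOf('/');
        if (dot > slash) {
            return extensions.isEmpty() || extensions.contains(url.substring(dot));
        }
        return true;
    }

    public static void main(String[] args) {
        UrlFilter f = new UrlFilter(
            List.of("https://docs\\.example\\.com/docs/.*"),
            List.of(".*\\.(css|js|png|jpg|gif|svg)$"),
            List.of());
        System.out.println(f.accepts("https://docs.example.com/docs/intro"));  // true
        System.out.println(f.accepts("https://docs.example.com/docs/app.js")); // false (deny)
        System.out.println(f.accepts("https://blog.example.com/post"));        // false (no allow match)
    }
}
```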

Attribute Mapping

Define which parts of the HTML page map to which fields in the search index:

| Field | Description |
| --- | --- |
| Title | CSS selector or extraction rule for the page title |
| Text | CSS selector for the main content body |
| URL | The page URL (extracted automatically) |
| Date | CSS selector or meta tag for the publication date |
| Custom fields | Additional attribute mappings for any HTML element or meta tag |
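
Since the connector parses pages with JSoup, a mapping like the one above amounts to running CSS selectors against the parsed document. The selectors in this sketch (`title`, `article`, `meta[name=date]`) are placeholders for whatever you configure:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch of applying attribute mappings with JSoup CSS selectors.
// Selectors shown are examples, not the connector's defaults.
public class AttributeMappingSketch {

    public static String[] extract(String html, String url) {
        Document doc = Jsoup.parse(html);
        String title = doc.select("title").text();           // Title mapping
        String text  = doc.select("article").text();         // Text (main body) mapping
        var dateMeta = doc.selectFirst("meta[name=date]");   // Date mapping
        String date  = dateMeta != null ? dateMeta.attr("content") : "";
        return new String[] { title, text, url, date };      // URL comes from the fetch itself
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Guide</title>"
            + "<meta name=\"date\" content=\"2024-01-01\"></head>"
            + "<body><article>Hello crawl</article></body></html>";
        String[] fields = extract(html, "https://docs.example.com/guide");
        System.out.println(String.join(" | ", fields));
        // prints Guide | Hello crawl | https://docs.example.com/guide | 2024-01-01
    }
}
```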

Example: Crawling a Documentation Site

Crawl a documentation site starting from the homepage, only following links within the /docs/ path:

| Setting | Value |
| --- | --- |
| Starting URL | `https://docs.example.com/` |
| Allow Pattern | `https://docs\.example\.com/docs/.*` |
| Deny Pattern | `.*\.(css\|js\|png\|jpg\|gif\|svg)$` |
| Target SN Site | Documentation |
| Locale | en_US |

The crawler will:

  1. Start at the homepage
  2. Follow all links matching /docs/ paths
  3. Skip links to stylesheets, scripts, and images
  4. Extract each page's title, body text, and URL
  5. Send each page to the Documentation SN Site in Turing ES
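
The allow and deny patterns in this example are ordinary Java regular expressions, so their effect on a given URL can be checked directly with `java.util.regex`:

```java
import java.util.regex.Pattern;

// Checking the example's allow/deny patterns against sample URLs.
public class PatternCheck {
    public static void main(String[] args) {
        Pattern allow = Pattern.compile("https://docs\\.example\\.com/docs/.*");
        Pattern deny  = Pattern.compile(".*\\.(css|js|png|jpg|gif|svg)$");

        String page  = "https://docs.example.com/docs/intro";
        String image = "https://docs.example.com/docs/logo.png";
        String other = "https://blog.example.com/post";

        System.out.println(allow.matcher(page).matches()
            && !deny.matcher(page).matches());               // true  -> crawled
        System.out.println(!deny.matcher(image).matches());  // false -> skipped by deny
        System.out.println(allow.matcher(other).matches());  // false -> no allow match
    }
}
```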

Locale Detection

The Web Crawler supports automatic locale detection via the DumWCExtLocaleInterface extension point. Implement this interface to determine the locale of each page based on:

  • URL path patterns (e.g., /en/, /pt-br/)
  • HTML lang attribute
  • Content-Language HTTP header
  • Custom logic specific to your site structure

If no locale extension is configured, the default locale from the source configuration is used for all pages.
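
A minimal URL-path-based detector might look like the sketch below. The real `DumWCExtLocaleInterface` method signature is defined by the connector; `detectFromPath` only illustrates the kind of logic such an extension could contain:

```java
import java.util.Locale;
import java.util.Map;

// Sketch of URL-path-based locale detection (illustrative only; adapt the
// logic to the actual DumWCExtLocaleInterface contract).
public class PathLocaleSketch {
    private static final Map<String, Locale> SEGMENTS = Map.of(
        "/en/", Locale.forLanguageTag("en"),
        "/pt-br/", Locale.forLanguageTag("pt-BR"));

    public static Locale detectFromPath(String url, Locale fallback) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (var e : SEGMENTS.entrySet()) {
            if (lower.contains(e.getKey())) return e.getValue();
        }
        return fallback; // default locale from the source configuration
    }

    public static void main(String[] args) {
        System.out.println(detectFromPath("https://example.com/pt-br/guia", Locale.US)); // pt_BR
        System.out.println(detectFromPath("https://example.com/about", Locale.US));      // en_US
    }
}
```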


Limitations

  • The crawler follows HTML links only — it does not execute JavaScript. Single-page applications (SPAs) that render content via JavaScript are not supported without a pre-rendering solution.
  • robots.txt — The crawler does not currently enforce robots.txt directives. Ensure you have permission to crawl the target site.
  • Rate limiting — There is no built-in rate limiter. For large sites, monitor server load and consider adding delays between requests.
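
Since there is no built-in rate limiter, one simple workaround is to enforce a fixed minimum delay before each request. This is a hypothetical wrapper you would add around your fetch calls, not part of the connector:

```java
// Hypothetical fixed-delay throttle: call await() before each HTTP request
// to guarantee a minimum gap between requests.
public class FixedDelay {
    private final long delayMillis;
    private long lastRequest = 0;

    public FixedDelay(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Blocks until at least delayMillis has passed since the last call. */
    public synchronized void await() throws InterruptedException {
        long wait = lastRequest + delayMillis - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastRequest = System.currentTimeMillis();
    }
}
```

For example, `new FixedDelay(500)` with `await()` before each fetch caps the crawl at roughly two requests per second.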

| Page | Description |
| --- | --- |
| Connectors Overview | All available connectors |
| Core Concepts | Pipeline, strategies, and change detection |
| Turing ES — Integration | How Turing ES receives content from connectors |