
Core Concepts

This page explains the fundamental concepts of Dumont DEP in plain terms, giving you the mental model you need before diving into the technical documentation.


Connectors

A Connector is a component that knows how to extract content from a specific type of source. Each connector:

  1. Connects to a content source (a website, a database, a file folder, an AEM instance)
  2. Extracts documents according to its configuration
  3. Emits each document as a Job Item into the processing pipeline

Dumont DEP ships with five connectors:

| Connector | Source | How it works |
|---|---|---|
| Web Crawler | Websites | Recursively follows links from a starting URL, extracts page content via HTML parsing |
| Database | JDBC databases | Executes SQL queries and maps each result row to a document |
| FileSystem | Local/network directories | Walks directory trees, extracts text from files via Apache Tika |
| AEM | Adobe Experience Manager | Reads content from AEM author/publish instances via the JCR API |
| WordPress | WordPress sites | Pulls posts, pages, and custom content types from WordPress installations |

Each connector implements the DumConnectorPlugin interface and provides three operations: crawl (full extraction), indexAll (re-index a source), and indexById (index specific documents by ID).
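To make the three operations concrete, here is a minimal sketch in Python. It is not the DumConnectorPlugin interface itself: the class names, method signatures, and the in-memory source are all hypothetical, and only the three operation names come from the text above.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator


class ConnectorSketch(ABC):
    """Hypothetical stand-in for a connector; signatures are assumptions."""

    @abstractmethod
    def crawl(self) -> Iterator[dict]:
        """Full extraction: yield every document from every source."""

    @abstractmethod
    def index_all(self, source: str) -> Iterator[dict]:
        """Re-index one source: yield all of its documents."""

    @abstractmethod
    def index_by_id(self, source: str, ids: Iterable[str]) -> Iterator[dict]:
        """Yield only the documents with the given IDs."""


class InMemoryConnector(ConnectorSketch):
    """Toy connector whose 'source' is a dict: source name -> {id -> fields}."""

    def __init__(self, sources: Dict[str, Dict[str, dict]]):
        self.sources = sources

    def crawl(self) -> Iterator[dict]:
        for name in self.sources:
            yield from self.index_all(name)

    def index_all(self, source: str) -> Iterator[dict]:
        for doc_id, fields in self.sources[source].items():
            yield {"id": doc_id, "source": source, **fields}

    def index_by_id(self, source: str, ids: Iterable[str]) -> Iterator[dict]:
        docs = self.sources[source]
        for doc_id in ids:
            if doc_id in docs:  # unknown IDs are silently skipped in this sketch
                yield {"id": doc_id, "source": source, **docs[doc_id]}


connector = InMemoryConnector({"blog": {"1": {"title": "A"}, "2": {"title": "B"}}})
all_ids = [doc["id"] for doc in connector.crawl()]
```

A real connector would talk to a website, database, or file system instead of a dict, but the shape is the same: each operation yields documents that then become Job Items.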


Job Items

A Job Item is a single document moving through the pipeline. It contains:

  • Fields — key-value pairs representing the document's content (title, text, URL, date, author, and any custom fields)
  • Action — what to do with this document: INDEX (add or update) or DELETE (remove from the index)
  • Provider — which connector produced this item
  • Locale — the language/country of the content (e.g., en_US, pt_BR)

Job Items are the universal data format inside Dumont DEP. Every connector produces them, every strategy evaluates them, and every indexing plugin consumes them.
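As a rough picture of that data format, the four parts listed above could be modelled like this. The Python class, its attribute names, and the example values (including the provider name "web-crawler") are illustrative assumptions, not Dumont DEP's actual types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class Action(Enum):
    INDEX = "INDEX"    # add or update the document
    DELETE = "DELETE"  # remove the document from the index


@dataclass
class JobItem:
    """Hypothetical model of a Job Item; not the real Dumont DEP class."""
    fields: Dict[str, Any]  # title, text, URL, date, author, custom fields
    action: Action          # what to do with the document
    provider: str           # which connector produced the item
    locale: str             # language/country, e.g. en_US or pt_BR


item = JobItem(
    fields={"title": "Hello", "text": "First post", "url": "https://example.com/hello"},
    action=Action.INDEX,
    provider="web-crawler",
    locale="pt_BR",
)
```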


The Processing Pipeline

When a connector extracts a document, it does not send it directly to the search engine. Instead, the document passes through a multi-stage pipeline designed for reliability, efficiency, and flexibility.

[Figure: Dumont DEP processing pipeline]

Stage 1 — Extraction

The connector reads content from the source and produces Job Items. Each item includes all the fields needed for indexing.

Stage 2 — Strategy Evaluation

Every Job Item passes through a chain of Processing Strategies, evaluated in priority order. Each strategy decides whether the item should be indexed, re-indexed, de-indexed, ignored, or skipped:

| Priority | Strategy | What it does |
|---|---|---|
| 10 | De-index | Removes documents marked for deletion |
| 20 | Ignore (Indexing Rules) | Skips documents matching regex-based ignore rules |
| 30 | Index | Indexes new documents (not seen before) |
| 40 | Re-index | Updates documents whose content has changed (checksum comparison) |
| 50 | Unchanged | Skips documents that haven't changed since the last run |

The strategy chain uses checksum-based change detection — each document's content is hashed, and the hash is compared against the last indexed version. Only documents that have actually changed are re-indexed.
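The idea of a priority-ordered chain can be sketched as follows. This is a simplified illustration, assuming that the first strategy to return a decision wins and that the checksum is a CRC32 of a single text field; the function names and item layout are hypothetical, and the priority-20 ignore strategy is left out for brevity.

```python
import zlib
from typing import Callable, Dict, List, Optional, Tuple

# A strategy returns a decision, or None to pass the item down the chain.
Strategy = Callable[[dict, Dict[str, int]], Optional[str]]


def checksum(item: dict) -> int:
    """CRC32 of the item's text (assumed to be the hashed content)."""
    return zlib.crc32(item["text"].encode("utf-8"))


def de_index(item: dict, stored: Dict[str, int]) -> Optional[str]:
    return "DEINDEX" if item["action"] == "DELETE" else None


def index_new(item: dict, stored: Dict[str, int]) -> Optional[str]:
    return "INDEX" if item["id"] not in stored else None


def re_index(item: dict, stored: Dict[str, int]) -> Optional[str]:
    return "REINDEX" if stored.get(item["id"]) != checksum(item) else None


def unchanged(item: dict, stored: Dict[str, int]) -> Optional[str]:
    return "UNCHANGED"  # last resort: nothing above claimed the item


# Registered out of order on purpose; evaluation sorts by priority.
CHAIN: List[Tuple[int, Strategy]] = [
    (50, unchanged), (30, index_new), (10, de_index), (40, re_index),
]


def evaluate(item: dict, stored: Dict[str, int]) -> str:
    """Run strategies in ascending priority; the first decision wins."""
    for _, strategy in sorted(CHAIN, key=lambda pair: pair[0]):
        decision = strategy(item, stored)
        if decision is not None:
            return decision
    return "NOT_PROCESSED"


stored = {"a": checksum({"text": "old"})}  # checksums from the previous run
```

With that stored state, an item with ID "a" and unchanged text falls through to the last strategy, while the same ID with different text is caught at priority 40.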

Stage 3 — Batching

Accepted Job Items are collected by the Batch Processor into groups (default: 50 items per batch). This reduces the number of messages sent to the queue and improves indexing throughput.

When a batch reaches its configured size, or when the connector signals it has finished, the batch is flushed to the message queue.
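The two flush triggers (batch full, connector finished) can be sketched in a few lines. The class and method names here are hypothetical; only the default size of 50 comes from the text above.

```python
from typing import Callable, List


class BatchSketch:
    """Collects items and hands them to `send` in groups of `size`."""

    def __init__(self, send: Callable[[List[dict]], None], size: int = 50):
        self.send = send
        self.size = size
        self.pending: List[dict] = []

    def add(self, item: dict) -> None:
        self.pending.append(item)
        if len(self.pending) >= self.size:  # trigger 1: batch is full
            self.flush()

    def finish(self) -> None:
        self.flush()                        # trigger 2: connector is done

    def flush(self) -> None:
        if self.pending:                    # never send an empty batch
            self.send(self.pending)
            self.pending = []


sent: List[List[dict]] = []          # stands in for the message queue
batcher = BatchSketch(sent.append)   # default size of 50
for i in range(120):
    batcher.add({"id": i})
batcher.finish()
batch_sizes = [len(batch) for batch in sent]
```

Here 120 items produce two full batches and one partial batch that is flushed only when the connector signals completion.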

Stage 4 — Queue

The batch is sent to Apache Artemis, an embedded JMS message broker. The queue decouples extraction from delivery: if the search engine is temporarily unavailable, messages wait in the queue and are delivered when the connection is restored.

Queue messages are persisted to disk (store/queue/) and survive application restarts.
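The decoupling can be illustrated with a toy in-memory queue. This sketch does not use Artemis or JMS and does not model disk persistence; the class, the `deliver` method, and the use of `ConnectionError` to signal an unavailable search engine are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, List


class QueueSketch:
    """Toy queue: batches wait until the target accepts them."""

    def __init__(self) -> None:
        self.messages: Deque[List[str]] = deque()

    def send(self, batch: List[str]) -> None:
        self.messages.append(batch)

    def deliver(self, target: Callable[[List[str]], None]) -> int:
        """Deliver batches in order; stop at the first failure."""
        delivered = 0
        while self.messages:
            try:
                target(self.messages[0])
            except ConnectionError:
                break  # batch stays queued for a later attempt
            self.messages.popleft()
            delivered += 1
        return delivered


state = {"available": False}  # whether the fake search engine is reachable
indexed: List[List[str]] = []


def search_engine(batch: List[str]) -> None:
    if not state["available"]:
        raise ConnectionError("search engine unavailable")
    indexed.append(batch)


queue = QueueSketch()
queue.send(["doc-1"])
queue.send(["doc-2"])
first_attempt = queue.deliver(search_engine)   # target is down
state["available"] = True
second_attempt = queue.deliver(search_engine)  # target is back
```

The connector side only ever calls `send`, so extraction can finish even while delivery is failing.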

Stage 5 — Delivery

The Indexing Plugin consumes messages from the queue and delivers them to the configured search engine. Dumont DEP supports three output targets:

| Plugin | Target | Description |
|---|---|---|
| Turing (default) | Viglet Turing ES | Uses the Turing Java SDK to deliver documents via REST API |
| Solr | Apache Solr | Uses SolrJ to add documents directly to a Solr collection |
| Elasticsearch | Elasticsearch | Uses the Elasticsearch Java Client for bulk indexing |

Indexing Rules

Indexing Rules let you filter out content early in the pipeline, during strategy evaluation (Stage 2), before it is batched or queued. A rule defines:

  • An attribute (a field name in the document)
  • A rule type (currently IGNORE)
  • One or more values (regex patterns)

When a document's attribute matches any of the values, the document is skipped entirely. For example, a rule with attribute = template and values = [error-page, redirect] will prevent any document with those templates from being indexed.

Indexing Rules are configured per source in the admin console and are evaluated by the IgnoreIndexingRuleStrategy at priority 20 — before the index/re-index/unchanged strategies run.
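The template example above can be sketched as a small matcher. The class is hypothetical, and one detail is an assumption: the text says values are regex patterns but not whether a pattern must match the whole attribute value, so this sketch uses a full match.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class IgnoreRule:
    """Hypothetical IGNORE rule: one attribute, one or more regex values."""
    attribute: str
    values: List[str]

    def matches(self, fields: dict) -> bool:
        value = fields.get(self.attribute)
        if value is None:
            return False  # a document without the attribute is never ignored
        # Assumption: a pattern must match the entire attribute value.
        return any(re.fullmatch(pattern, str(value)) for pattern in self.values)


rule = IgnoreRule(attribute="template", values=["error-page", "redirect"])
skip_redirect = rule.matches({"template": "redirect", "title": "Moved"})
skip_article = rule.matches({"template": "article", "title": "News"})
```

Because the values are regular expressions, a single value such as `error-.*` would cover a whole family of templates.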


Change Detection

Dumont DEP tracks every document it has processed using a persistent indexing database. For each document, it stores:

  • Object ID — the unique identifier from the source
  • Checksum — a CRC32 hash of the document's content
  • Timestamp — when the document was last indexed
  • Status — the current state (preparing, indexed, de-indexed, etc.)

On subsequent runs, the new checksum of each document is compared against the stored one. If they match, the document is unchanged and skipped. If they differ, the document is re-indexed. If a previously indexed document is no longer present in the source, it is de-indexed.

This mechanism enables efficient incremental indexing — only changed content is sent to the search engine, regardless of how large the source is.
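An incremental run can be sketched by comparing the stored checksums with the current state of the source. This is a simplified model: the indexing database is a plain dict from Object ID to CRC32, documents are plain strings, and the function names are hypothetical.

```python
import zlib
from typing import Dict


def crc32_of(text: str) -> int:
    return zlib.crc32(text.encode("utf-8"))


def plan_run(stored: Dict[str, int], source: Dict[str, str]) -> Dict[str, str]:
    """Map each Object ID to the decision for this run."""
    plan: Dict[str, str] = {}
    for object_id, text in source.items():
        if object_id not in stored:
            plan[object_id] = "INDEX"      # never seen before
        elif stored[object_id] != crc32_of(text):
            plan[object_id] = "REINDEX"    # content changed
        else:
            plan[object_id] = "UNCHANGED"  # skip
    for object_id in stored.keys() - source.keys():
        plan[object_id] = "DEINDEX"        # gone from the source
    return plan


# State left behind by a previous run.
stored = {"a": crc32_of("one"), "b": crc32_of("two"), "c": crc32_of("three")}
# Current source: "b" was edited, "c" was removed, "d" is new.
source = {"a": "one", "b": "TWO", "d": "four"}
plan = plan_run(stored, source)
```

Only "b" and "d" would be sent for indexing and "c" for removal; "a" costs one checksum comparison and nothing more.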


Indexing Status Values

Every document's journey through the pipeline is tracked with a status code:

| Status | Meaning |
|---|---|
| PREPARE_INDEX | Preparing to index the document |
| PREPARE_UNCHANGED | No changes detected since last indexing |
| PREPARE_REINDEX | Preparing a re-indexation (content changed) |
| PREPARE_FORCED_REINDEX | Forced re-indexation triggered |
| RECEIVED_AND_SENT_TO_TURING | Document received and forwarded to the search engine |
| SENT_TO_QUEUE | Document placed in the Artemis processing queue |
| RECEIVED_FROM_QUEUE | Document consumed from the queue by the indexing plugin |
| INDEXED | Document successfully indexed |
| FINISHED | Operation finished |
| DEINDEXED | Document removed from the index |
| NOT_PROCESSED / IGNORED | Document skipped due to an Indexing Rule or strategy decision |

Ready to go deeper?

| I want to... | Go to |
|---|---|
| Understand the full system architecture | Architecture |
| Install Dumont DEP | Installation Guide |
| Configure a Web Crawler | Web Crawler Connector |
| Index a database | Database Connector |
| Understand the indexing plugins | Indexing Plugins |
| See the full configuration reference | Configuration Reference |