Core Concepts

This page explains the fundamental concepts of Dumont DEP in plain terms. No configuration files, no code — just the mental model you need before diving into the technical documentation.

Connectors

A Connector is a component that knows how to extract content from a specific type of source. Each connector:

Connects to a content source (a website, a database, a file folder, an AEM instance)
Extracts documents according to its configuration
Emits each document as a Job Item into the processing pipeline

Dumont DEP ships with five connectors:

Connector	Source	How it works
Web Crawler	Websites	Recursively follows links from a starting URL, extracts page content via HTML parsing
Database	JDBC databases	Executes SQL queries and maps each result row to a document
FileSystem	Local/network directories	Walks directory trees, extracts text from files via Apache Tika
AEM	Adobe Experience Manager	Reads content from AEM author/publish instances via the JCR API
WordPress	WordPress sites	Pulls posts, pages, and custom content types from WordPress installations

Each connector implements the DumConnectorPlugin interface and provides three operations: crawl (full extraction), indexAll (re-index a source), and indexById (index specific documents by ID).

Job Items

A Job Item is a single document moving through the pipeline. It contains:

Fields — key-value pairs representing the document's content (title, text, URL, date, author, and any custom fields)
Action — what to do with this document: INDEX (add or update) or DELETE (remove from the index)
Provider — which connector produced this item
Locale — the language/country of the content (e.g., en_US, pt_BR)

Job Items are the universal data format inside Dumont DEP. Every connector produces them, every strategy evaluates them, and every indexing plugin consumes them.

The Processing Pipeline

When a connector extracts a document, it does not send it directly to the search engine. Instead, the document passes through a multi-stage pipeline designed for reliability, efficiency, and flexibility.

Dumont DEP — Processing Pipeline

Stage 1 — Extraction

The connector reads content from the source and produces Job Items. Each item includes all the fields needed for indexing.

Stage 2 — Strategy Evaluation

Every Job Item passes through a chain of Processing Strategies, evaluated in priority order. Each strategy decides whether the item should be indexed, re-indexed, de-indexed, ignored, or skipped:

Priority	Strategy	What it does
10	De-index	Removes documents marked for deletion
20	Ignore (Indexing Rules)	Skips documents matching regex-based ignore rules
30	Index	Indexes new documents (not seen before)
40	Re-index	Updates documents whose content has changed (checksum comparison)
50	Unchanged	Skips documents that haven't changed since the last run

The strategy chain uses checksum-based change detection — each document's content is hashed, and the hash is compared against the last indexed version. Only documents that have actually changed are re-indexed.

Stage 3 — Batching

Accepted Job Items are collected by the Batch Processor into groups (default: 50 items per batch). This reduces the number of messages sent to the queue and improves indexing throughput.

When a batch reaches its configured size, or when the connector signals it has finished, the batch is flushed to the message queue.

Stage 4 — Queue

The batch is sent to Apache Artemis (embedded JMS message queue). The queue decouples extraction from delivery — if the search engine is temporarily unavailable, messages wait in the queue and are delivered when the connection is restored.

Queue messages are persisted to disk (store/queue/) and survive application restarts.

Stage 5 — Delivery

The Indexing Plugin consumes messages from the queue and delivers them to the configured search engine. Dumont DEP supports three output targets:

Plugin	Target	Description
Turing (default)	Viglet Turing ES	Uses the Turing Java SDK to deliver documents via REST API
Solr	Apache Solr	Uses SolrJ to add documents directly to a Solr collection
Elasticsearch	Elasticsearch	Uses the Elasticsearch Java Client for bulk indexing

Indexing Rules

Indexing Rules allow you to filter content during extraction — before it enters the pipeline. A rule defines:

An attribute (a field name in the document)
A rule type (currently IGNORE)
One or more values (regex patterns)

When a document's attribute matches any of the values, the document is skipped entirely. For example, a rule with attribute = template and values = [error-page, redirect] will prevent any document with those templates from being indexed.

Indexing Rules are configured per source in the admin console and are evaluated by the IgnoreIndexingRuleStrategy at priority 20 — before the index/re-index/unchanged strategies run.

Change Detection

Dumont DEP tracks every document it has processed using a persistent indexing database. For each document, it stores:

Object ID — the unique identifier from the source
Checksum — a CRC32 hash of the document's content
Timestamp — when the document was last indexed
Status — the current state (preparing, indexed, de-indexed, etc.)

On subsequent runs, the connector compares the new checksum against the stored one. If they match, the document is unchanged and skipped. If they differ, the document is re-indexed. If a previously indexed document is no longer present in the source, it is de-indexed.

This mechanism enables efficient incremental indexing — only changed content is sent to the search engine, regardless of how large the source is.

Indexing Status Values

Every document's journey through the pipeline is tracked with a status code:

Status	Meaning
`PREPARE_INDEX`	Preparing to index the document
`PREPARE_UNCHANGED`	No changes detected since last indexing
`PREPARE_REINDEX`	Preparing a re-indexation (content changed)
`PREPARE_FORCED_REINDEX`	Forced re-indexation triggered
`RECEIVED_AND_SENT_TO_TURING`	Document received and forwarded to the search engine
`SENT_TO_QUEUE`	Document placed in the Artemis processing queue
`RECEIVED_FROM_QUEUE`	Document consumed from the queue by the indexing plugin
`INDEXED`	Document successfully indexed
`FINISHED`	Operation finished
`DEINDEXED`	Document removed from the index
`NOT_PROCESSED` / `IGNORED`	Document skipped due to an Indexing Rule or strategy decision

Ready to go deeper?

I want to...	Go to
Understand the full system architecture	Architecture
Install Dumont DEP	Installation Guide
Configure a Web Crawler	Web Crawler Connector
Index a database	Database Connector
Understand the indexing plugins	Indexing Plugins
See the full configuration reference	Configuration Reference

Connectors​

Job Items​

The Processing Pipeline​

Stage 1 — Extraction​

Stage 2 — Strategy Evaluation​

Stage 3 — Batching​

Stage 4 — Queue​

Stage 5 — Delivery​

Indexing Rules​

Change Detection​

Indexing Status Values​

Ready to go deeper?​