FileSystem Connector

The FileSystem Connector walks a directory tree, extracts text and metadata from files using Apache Tika, and indexes everything as searchable documents. It supports PDFs, Word documents, spreadsheets, presentations, plain text, HTML, and images (with OCR).

How It Works

1. Start at the configured source directory.
2. Recursively traverse all subdirectories.
3. For each file, extract the text content using Apache Tika.
4. Collect file metadata (size, modification date, extension, MIME type).
5. Create a Job Item with the extracted content and metadata.
6. Submit the Job Items to the pipeline in configurable batches.
7. Repeat until all files are processed.
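
The traversal, metadata, and batching steps can be sketched with the JDK's `Files.walkFileTree` and `SimpleFileVisitor`. This is only an illustration under stated assumptions, not the connector's actual code: `TreeWalkSketch`, `FileEntry`, `extensionOf`, and `walkInBatches` are hypothetical names, the Tika extraction step is reduced to a comment, and each batch is handed to a caller-supplied callback instead of the real pipeline.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class TreeWalkSketch {

    /** Hypothetical stand-in for a Job Item: the path plus basic file metadata. */
    record FileEntry(Path path, long size, long modifiedMillis, String extension) {}

    /** Returns the text after the last dot in the file name, or "" if there is none. */
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    /** Walks sourceDir recursively and passes entries to sink in batches of chunkSize. */
    static void walkInBatches(Path sourceDir, int chunkSize, Consumer<List<FileEntry>> sink)
            throws IOException {
        List<FileEntry> batch = new ArrayList<>();
        Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // A real connector would also extract text here (for example with Apache Tika).
                batch.add(new FileEntry(file, attrs.size(),
                        attrs.lastModifiedTime().toMillis(), extensionOf(file)));
                if (batch.size() >= chunkSize) {
                    sink.accept(List.copyOf(batch)); // hand over a full batch
                    batch.clear();
                }
                return FileVisitResult.CONTINUE;
            }
        });
        if (!batch.isEmpty()) {
            sink.accept(List.copyOf(batch)); // flush the final partial batch
        }
    }
}
```

Holding at most `chunkSize` entries in memory at a time is what keeps memory usage bounded on large trees; the callback is where a real implementation would submit the batch for indexing.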

Key Features

Feature	Description
Recursive traversal	Walks the full directory tree using Java's file visitor pattern
Apache Tika integration	Text extraction from more than 1,000 file formats
OCR support	Extract text from images and scanned PDFs (requires Tesseract)
File metadata	Captures file size, extension, modification date, and MIME type
Prefix replacement	Replace file path prefixes with custom URLs (e.g., map a local path to a web URL)
Standalone CLI	Run imports from the command line independently
Configurable batch size	Control memory usage with chunk-based processing

Supported File Formats

Apache Tika supports text extraction from a broad range of formats:

Category	Formats
Documents	PDF, DOCX, DOC, ODT, RTF, EPUB
Spreadsheets	XLSX, XLS, ODS, CSV
Presentations	PPTX, PPT, ODP
Web / Markup	HTML, XHTML, XML
Plain text	TXT, LOG, Markdown
Email	EML, MSG, MBOX
Images (with OCR)	PNG, JPEG, TIFF, BMP, GIF

Files that Tika cannot extract text from are silently skipped.

CLI Parameters

Parameter	Required	Default	Description
`--source-dir` / `-d`	Yes	—	Root directory to scan
`--server` / `-s`	Yes	—	Dumont DEP server URL
`--api-key` / `-a`	Yes	—	API key for authentication
`--site`	Yes	—	Target Semantic Navigation Site name
`--type` / `-t`	No	`Static File`	Content type label for all documents
`--locale`	No	`en_US`	Default locale
`--chunk` / `-z`	No	`100`	Batch size
`--file-size-field`	No	—	Field name to store file size
`--file-extension-field`	No	—	Field name to store file extension
`--prefix-from-replace`	No	—	Path prefix to replace (e.g., `/mnt/docs`)
`--prefix-to-replace`	No	—	Replacement prefix (e.g., `https://docs.example.com`)

Example: Indexing a Document Repository

java -cp dumont-fs.jar com.viglet.dumont.filesystem.DumFSImportTool \
  --source-dir /mnt/shared/documents \
  --server http://localhost:30130 \
  --api-key your-api-key \
  --site InternalDocs \
  --locale en_US \
  --chunk 50 \
  --file-size-field fileSize \
  --file-extension-field fileExtension \
  --prefix-from-replace /mnt/shared/documents \
  --prefix-to-replace https://intranet.example.com/docs

This will:

1. Scan all files under `/mnt/shared/documents` recursively.
2. Extract text from each file using Apache Tika.
3. Replace the local path prefix with a web URL for each document.
4. Index everything into the `InternalDocs` Semantic Navigation Site.
5. Store the file size and extension in the `fileSize` and `fileExtension` fields so they can be searched and used as facets.

Path Prefix Replacement

The `--prefix-from-replace` and `--prefix-to-replace` parameters rewrite local file paths as web-accessible URLs. This is essential when the same files are served by a web server:

Local Path	After Replacement
`/mnt/shared/documents/reports/q1-2026.pdf`	`https://intranet.example.com/docs/reports/q1-2026.pdf`
`/mnt/shared/documents/policies/security.docx`	`https://intranet.example.com/docs/policies/security.docx`

The transformed path becomes the document's URL field in the search index, allowing users to click through from search results to the actual file.
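
The mapping in the table amounts to a plain string-prefix swap, which can be sketched as follows. `PrefixReplaceSketch` and `replacePrefix` are hypothetical names for illustration. Returning non-matching paths unchanged is an assumption, as is the absence of URL encoding or path-separator normalization; this page does not specify how the connector handles those cases.

```java
final class PrefixReplaceSketch {

    /**
     * Swaps a leading fromPrefix for toPrefix.
     * Assumption: paths that do not start with fromPrefix are returned unchanged.
     */
    static String replacePrefix(String path, String fromPrefix, String toPrefix) {
        if (fromPrefix == null || toPrefix == null || !path.startsWith(fromPrefix)) {
            return path;
        }
        // Keep everything after the matched prefix and prepend the replacement.
        return toPrefix + path.substring(fromPrefix.length());
    }
}
```

With the values from the example command, `replacePrefix("/mnt/shared/documents/reports/q1-2026.pdf", "/mnt/shared/documents", "https://intranet.example.com/docs")` yields `https://intranet.example.com/docs/reports/q1-2026.pdf`, matching the first row of the table.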
