Version: 0.3.9

Viglet Turing ES: Connectors

Viglet Turing ES provides several connectors for indexing content from different sources.

Apache Nutch

Apache Nutch is a highly extensible and scalable open-source web crawler.

Installation

Nutch 1.18 and 1.20

  1. Go to https://viglet.org/turing/download/ and click the "Integration > Apache Nutch 1.18 Plugin" or "Integration > Apache Nutch 1.20 Plugin" link, matching your Nutch version, to download the plugin.

  2. Extract the plugin to <APACHE_NUTCH>/plugins/indexer-viglet-turing.

Configuration

nutch-site.xml

Add the following properties to <APACHE_NUTCH>/conf/nutch-site.xml:

| Parameter | Description |
| --- | --- |
| turing.url | URL of Turing ES Server (e.g., http://localhost:2700) |
| turing.apiKey | API Key for authentication |
| turing.snSite | Semantic Navigation Site name |
| turing.locale | Locale for indexing (e.g., en_US) |
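
As a rough sketch, these properties could be declared using the standard Nutch <property> syntax. The values reuse the examples from this page; the API key is a placeholder and the site name "Sample" is only illustrative.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative values: replace them with those of your Turing ES instance. -->
  <property>
    <name>turing.url</name>
    <value>http://localhost:2700</value>
  </property>
  <property>
    <name>turing.apiKey</name>
    <!-- Placeholder, not a real key. -->
    <value>your-api-key</value>
  </property>
  <property>
    <name>turing.snSite</name>
    <value>Sample</value>
  </property>
  <property>
    <name>turing.locale</name>
    <value>en_US</value>
  </property>
</configuration>
```

If nutch-site.xml already contains a <configuration> element, add only the <property> entries inside it.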

turing-mapping.xml

Create or edit <APACHE_NUTCH>/conf/turing-mapping.xml to configure field mappings:

<mapping>
<fields>
<field source="title" dest="title"/>
<field source="content" dest="text"/>
<field source="url" dest="url"/>
<field source="tstamp" dest="modification_date"/>
</fields>
<siteUrl>
<value url="https://example.com" snSite="Sample" locale="en_US"/>
</siteUrl>
<uniqueKey field="url"/>
</mapping>

Indexing a Website

cd <APACHE_NUTCH>
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/turing -s urls crawl 5
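
In the command above, "-s urls" names the seed directory, "crawl" is the crawl directory, and 5 is the number of crawl rounds, following the usual bin/crawl usage. A minimal sketch of preparing that seed directory, assuming a seed file named seed.txt (the file name is an arbitrary choice here) and the example.com URL used in the mapping:

```shell
# Run from <APACHE_NUTCH>. The directory name "urls" matches the -s argument;
# the file name seed.txt is an assumption made for this example.
mkdir -p urls

# One URL per line; add further start URLs as additional lines.
printf '%s\n' "https://example.com" > urls/seed.txt

# Show the seed list that the crawl will start from.
cat urls/seed.txt
```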

Database

The Database connector is a JDBC connector that follows the same concept as Apache Sqoop: you write a SQL query, however complex, and the columns of its result are mapped to the attributes that get indexed.

Installation

Go to https://viglet.org/turing/download/ and click the "Integration > Database Connector" link to download turing-jdbc.jar.

Usage

java -jar /appl/viglet/turing/jdbc/turing-jdbc.jar <PARAMETERS>

Parameters

| Parameter | Description |
| --- | --- |
| --connect | JDBC connection string |
| --driver | JDBC driver class name |
| --query | SQL query to execute |
| --site | Semantic Navigation Site name |
| --locale | Locale for indexing |
| --chunk | Number of rows per chunk |
| --server | Turing ES server URL |
| --api-key | API Key for authentication |
| --file-path-field | Field containing file paths |
| --file-content-field | Field for file content |
| --file-extension-field | Field containing file extensions |
| --file-size-field | Field containing file sizes |
| --multi-valued-separator | Separator for multi-valued fields |
| --remove-html-tags-field | Fields from which to remove HTML tags |

Example

java -jar /appl/viglet/turing/jdbc/turing-jdbc.jar \
--connect "jdbc:mariadb://localhost:3306/mydb" \
--driver "org.mariadb.jdbc.Driver" \
--query "SELECT id, title, content, url FROM articles" \
--site "Sample" \
--locale "en_US" \
--chunk 100 \
--server "http://localhost:2700" \
--api-key "your-api-key"
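
For scheduled runs, the example above could be wrapped in a small script so that the API key and connection string come from the environment instead of being typed on the command line. This is only a sketch: the variable names TURING_JDBC_JAR, TURING_DB_URL, TURING_API_KEY and DRY_RUN are assumptions made for illustration, not options defined by the connector. By default it only prints the command; set DRY_RUN=0 to actually run it.

```shell
# Sketch of a wrapper script. The environment variable names are hypothetical.
set -eu

JAR="${TURING_JDBC_JAR:-/appl/viglet/turing/jdbc/turing-jdbc.jar}"
DB_URL="${TURING_DB_URL:-jdbc:mariadb://localhost:3306/mydb}"
API_KEY="${TURING_API_KEY:-your-api-key}"

# Collect the connector parameters as positional arguments so that values
# containing spaces (such as the SQL query) stay intact when passed on.
set -- \
  --connect "$DB_URL" \
  --driver "org.mariadb.jdbc.Driver" \
  --query "SELECT id, title, content, url FROM articles" \
  --site "Sample" \
  --locale "en_US" \
  --chunk 100 \
  --server "http://localhost:2700" \
  --api-key "$API_KEY"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Print the command for review; quoting is flattened in this display.
  echo "java -jar $JAR $*"
else
  java -jar "$JAR" "$@"
fi
```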

File System

The FileSystem connector indexes files, extracting text from Word, Excel, and PDF documents and applying OCR to images.

Installation

Go to https://viglet.org/turing/download/ and click the "Integration > FileSystem Connector" link to download turing-filesystem.jar.

Usage

java -jar /appl/viglet/turing/fs/turing-filesystem.jar <PARAMETERS>

Example

java -jar /appl/viglet/turing/fs/turing-filesystem.jar \
--server http://localhost:2700 \
--nlp <NLP_UUID> \
--source-dir /path/to/files \
--output-dir /path/to/output
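
Before a first run, it can help to see what the source directory actually contains, since text extraction depends on the file type. The helper below is not part of Turing ES; it is a small sketch that counts files by extension under a directory, and the function name count_by_extension is hypothetical.

```shell
# Hypothetical helper, not part of the connector: count files per extension.
count_by_extension() {
  # $1: directory to scan, e.g. the value passed to --source-dir.
  # Files without a dot in their name are skipped; extensions are lowercased
  # so that "PDF" and "pdf" are counted together.
  find "$1" -type f -name '*.*' \
    | sed 's/.*\.//' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn
}
```

For example, running count_by_extension /path/to/files prints one line per extension with its file count, most frequent first.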

WordPress

A WordPress plugin that indexes posts in Turing ES.

Installation

  1. Upload the plugin folder to the /wp-content/plugins/ directory.
  2. Activate the plugin through the 'Plugins' menu in WordPress.
  3. Configure the hostname, port and URI of Turing ES.
  4. Click the settings button to load posts.