Version: 0.3.9

Viglet Turing ES: Connectors

There are several connectors to allow you to index content in Viglet Turing ES.

Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Installation

Nutch 1.18 and 1.20

Go to https://viglet.org/turing/download/ and click on "Integration > Apache Nutch 1.18 Plugin" or "Integration > Apache Nutch 1.20 Plugin" link to download it.
Extract the plugin to <APACHE_NUTCH>/plugins/indexer-viglet-turing

Configuration

nutch-site.xml

Add the following properties to <APACHE_NUTCH>/conf/nutch-site.xml:

Parameter	Description
`turing.url`	URL of Turing ES Server (e.g., `http://localhost:2700`)
`turing.apiKey`	API Key for authentication
`turing.snSite`	Semantic Navigation Site name
`turing.locale`	Locale for indexing (e.g., `en_US`)

turing-mapping.xml

Create or edit <APACHE_NUTCH>/conf/turing-mapping.xml to configure field mappings:

<mapping>
  <fields>
    <field source="title" dest="title"/>
    <field source="content" dest="text"/>
    <field source="url" dest="url"/>
    <field source="tstamp" dest="modification_date"/>
  </fields>
  <siteUrl>
    <value url="https://example.com" snSite="Sample" locale="en_US"/>
  </siteUrl>
  <uniqueKey field="url"/>
</mapping>

Indexing a Website

cd <APACHE_NUTCH>
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/turing -s urls crawl 5

Database

JDBC Connector that uses the same concept as sqoop, to create complex queries and map attributes to index based on the result.

Installation

Go to https://viglet.org/turing/download/ and click on "Integration > Database Connector" link to download the turing-jdbc.jar.

Usage

java -jar /appl/viglet/turing/jdbc/turing-jdbc.jar <PARAMETERS>

Parameters

Parameter	Description
`--connect`	JDBC connection string
`--driver`	JDBC driver class name
`--query`	SQL query to execute
`--site`	Semantic Navigation Site name
`--locale`	Locale for indexing
`--chunk`	Number of rows per chunk
`--server`	Turing ES server URL
`--api-key`	API Key for authentication
`--file-path-field`	Field containing file paths
`--file-content-field`	Field for file content
`--file-extension-field`	Field containing file extensions
`--file-size-field`	Field containing file sizes
`--multi-valued-separator`	Separator for multi-valued fields
`--remove-html-tags-field`	Fields from which to remove HTML tags

Example

java -jar /appl/viglet/turing/jdbc/turing-jdbc.jar \
  --connect "jdbc:mysql://localhost:3306/mydb" \
  --driver "org.mariadb.jdbc.Driver" \
  --query "SELECT id, title, content, url FROM articles" \
  --site "Sample" \
  --locale "en_US" \
  --chunk 100 \
  --server "http://localhost:2700" \
  --api-key "your-api-key"

File System

FileSystem connector for indexing files with text extraction from Word, Excel, PDF and OCR for images.

Installation

Go to https://viglet.org/turing/download/ and click on "Integration > FileSystem Connector" link to download the turing-filesystem.jar.

Usage

java -jar /appl/viglet/turing/fs/turing-filesystem.jar <PARAMETERS>

Example

java -jar /appl/viglet/turing/fs/turing-filesystem.jar \
  --server http://localhost:2700 \
  --nlp <NLP_UUID> \
  --source-dir /path/to/files \
  --output-dir /path/to/output

Wordpress

Wordpress plugin that allows you to index posts.

Installation

Upload the plugin folder to the /wp-content/plugins/ directory.
Activate the plugin through the 'Plugins' menu in WordPress.
Configure the hostname, port and URI of Turing ES.
Click the settings button to load posts.

Apache Nutch​

Installation​

Nutch 1.18 and 1.20​

Configuration​

nutch-site.xml​

turing-mapping.xml​

Indexing a Website​

Database​

Installation​

Usage​

Parameters​

Example​

File System​

Installation​

Usage​

Example​

Wordpress​

Installation​

Apache Nutch

Installation

Nutch 1.18 and 1.20

Configuration

nutch-site.xml

turing-mapping.xml

Indexing a Website

Database

Installation

Usage

Parameters

Example

File System

Installation

Usage

Example

Wordpress

Installation