Elastica and CirrusSearch extensions

From Organic Design wiki

Latest revision as of 17:20, 5 November 2019

ElasticSearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected. PHP projects can use the Elastica PHP library to integrate with ElasticSearch in an efficient, scalable and well-structured manner.

The Elastica MediaWiki extension is used to integrate MediaWiki with a local ElasticSearch back-end via the Elastica library, tying in tightly with MediaWiki's object structure. The CirrusSearch extension uses the interface provided by the Elastica extension to give MediaWiki an alternative to the default SQL-based search engine. All Wikimedia projects, including Wikipedia, use CirrusSearch and Elastica.

The Elastic document structure used by MediaWiki, and the syntax for integrating with it, are not well documented, so this page provides some examples of commonly needed processes. This article assumes that all the components are already installed and that searching the wiki using the CirrusSearch engine is already working.

CirrusSearch maintenance scripts

There are some maintenance scripts in the maintenance directory of the CirrusSearch extension which are used to initialise or rebuild the Elastic documents that compose the text search index of your wiki's content.

To clear the entire index, deleting all Elastic documents for the wiki:

php updateSearchIndexConfig.php --startOver

Rebuilding the indexes takes two passes: one to build from the rendered content and one to index links from the source code. If you have a large wiki with thousands of pages, it's best to include the queue and maxJobs parameters; otherwise the process may not index all the articles.

php forceSearchIndex.php --skipLinks --indexOnSkip --queue --maxJobs=100
php forceSearchIndex.php --skipParse --queue --maxJobs=100
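Putting the above together, a complete rebuild of a wiki's search index might look like the following sequence. This is a sketch: it assumes you run it from the CirrusSearch maintenance directory, and the maxJobs figure is just the example value used above.

```shell
# Run from extensions/CirrusSearch/maintenance/

# Wipe and recreate the index, deleting all existing Elastic documents
php updateSearchIndexConfig.php --startOver

# Pass 1: index the rendered content, skipping link indexing
php forceSearchIndex.php --skipLinks --indexOnSkip --queue --maxJobs=100

# Pass 2: index links from the source code, skipping the parse step
php forceSearchIndex.php --skipParse --queue --maxJobs=100
```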

Accessing Elastic documents using Curl

Communicating with ElasticSearch is done over a standard HTTP connection, which is on port 9200 by default. Here are some simple queries that can be run by connecting directly to the service on the command line using Curl.


Performing a basic search query:

curl 'localhost:9200/_search?q=searchterm&pretty'


Get info on a specific document including all its fields:

curl 'localhost:9200/DBname-DBprefix_general/page/PAGEID?pretty'


Delete a field from a document:

curl -H "Content-Type: application/json" -XPOST 'localhost:9200/DBname-DBprefix_general/page/PAGEID/_update' -d '{"script" : "ctx._source.remove(\"FIELDNAME\")"}'


Create or update a field:

curl -H "Content-Type: application/json" -XPOST 'localhost:9200/DBname-DBprefix_general/page/PAGEID/_update' -d '{"doc":{"FIELDNAME":"VALUE"}}'


To create or update bulk fields, first create a file called data containing information to update in the following format:

{"update":{"_id":"PAGEID1"}}
{"doc":{"FIELDNAME1":"VALUE1"}}
{"update":{"_id":"PAGEID2"}}
{"doc":{"FIELDNAME2":"VALUE2"}}
{"update":{"_id":"PAGEID3"}}
{"doc":{"FIELDNAME3":"VALUE3"}}

Then perform the bulk request as follows:

curl -H "Content-Type: application/json" -XPOST 'localhost:9200/DBname-DBprefix_general/page/_bulk' --data-binary "@data"
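If you're scripting bulk updates, the NDJSON payload above can also be generated programmatically. Here's a minimal PHP sketch; buildBulkBody is a hypothetical helper, and the page IDs and field names are placeholders as in the examples above:

```php
<?php
// Build an Elastic bulk-update payload (NDJSON) like the "data" file above.
// $updates maps page IDs to arrays of fields to set on each document.
function buildBulkBody( array $updates ): string {
	$lines = [];
	foreach ( $updates as $id => $fields ) {
		$lines[] = json_encode( [ 'update' => [ '_id' => (string)$id ] ] );
		$lines[] = json_encode( [ 'doc' => $fields ] );
	}
	// Elastic's bulk API requires the body to end with a newline
	return implode( "\n", $lines ) . "\n";
}

// Placeholder IDs and fields, matching the format of the "data" file above
echo buildBulkBody( [
	1 => [ 'FIELDNAME1' => 'VALUE1' ],
	2 => [ 'FIELDNAME2' => 'VALUE2' ],
] );
```

The resulting string can be POSTed to the _bulk endpoint exactly as the curl example above does with the data file.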

Accessing Elastic documents from PHP

These snippets assume you already have established an Elastic connection in $conn.

Doing a basic request for a document:

$index = 'DBname-DBprefix_general';
$res = $conn->getClient()->request( "$index/page/$id" );
if ( $res->isOk() ) {
	$doc = $res->getData()['_source'];
	$val = $doc['FIELD'];
} else die( 'Elastic document not found!' );


Creating or updating a field:

$index = 'DBname-DBprefix_general';
$client = $conn->getClient();
if( $client->request( "$index/page/$id" )->isOk() ) {
	$client->request( "$index/page/$id/_update", Elastica\Request::POST, json_encode( ['doc' => ['FIELDNAME' => 'VALUE'] ] ) );
} else die( 'Elastic document not found!' );


Here's how we can do a bulk update using the Elastica Bulk class:

$index = 'DBname-DBprefix_general';
$bulk = new \Elastica\Bulk( $conn->getClient() );
$bulk->setType( "$index/page" );
$bulk->addData( new \Elastica\Document( PAGEID1, ['FIELD1' => 'VALUE1'], 'page', $index ), 'update' );
$bulk->addData( new \Elastica\Document( PAGEID2, ['FIELD2' => 'VALUE2'], 'page', $index ), 'update' );
    . . .
$res = $bulk->send();


A better way, however, is to use CirrusSearch's DataSender class, because it takes care of all the graceful exception handling:

$index = 'DBname-DBprefix_general';
$conf = MediaWiki\MediaWikiServices::getInstance()->getConfigFactory()->makeConfig( 'CirrusSearch' );
$sender = new \CirrusSearch\DataSender( $conn, $conf );
$sender->sendData( 'page', [
    new \Elastica\Document( PAGEID1, ['FIELD1' => 'VALUE1'], 'page', $index ),
    new \Elastica\Document( PAGEID2, ['FIELD2' => 'VALUE2'], 'page', $index ),
    . . .
] );

Modifying the default Elastic document structure

Whenever articles in the wiki are created, moved or edited, CirrusSearch modifies the associated Elastic document. By default a number of useful metadata fields about the wiki article are stored in the associated Elastic document. You may have some other metadata you'd like included in the Elastic documents, or maybe you'd like some of the current fields to be adjusted. CirrusSearch provides a hook called CirrusSearchBuildDocumentParse in the Updater class (includes/Updater.php) for this purpose. In the following snippet we add a new field that will be present in all Elastic documents.

public static function onCirrusSearchBuildDocumentParse( $doc, $title, $content, $output, $conn ) {
	$doc->set( 'FIELDNAME', 'VALUE' );
	return true;
}
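For the handler above to be called, the hook also has to be registered. One way is from LocalSettings.php or your extension's setup file; MyExtension here is a placeholder for whatever class contains the handler:

```php
// Register the handler so CirrusSearch calls it when building each
// Elastic document; "MyExtension" is a placeholder class name.
$wgHooks['CirrusSearchBuildDocumentParse'][] = 'MyExtension::onCirrusSearchBuildDocumentParse';
```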

See also