How to use SolR for custom implementation
This is the first post on the subject, but since it deals with the WSE Status extension for eZPublish, feedback on this feature would be interesting.
So, this post is about the Search component from the eZC / AZC stack. In the following parts, we will explain what the need was, how to install and set up SolR, and how the component can be used with our configuration.
As our project mainly relies on eZPublish and eZFind, we won't detail how they work but only how they have been modified to make our extension work.
The aim
Our goal was to provide powerful search engine indexing for extra data we store in specific tables in eZPublish. For example, let's say we have the following table :
| Field | Type | Description |
| id | Integer | Primary key |
| field one | Text | Text |
| field two | Integer | Integer |
That sounds perfect, because eZFind is the best search engine for eZPublish. However, after some time diving into the code, it turned out not to be usable directly : its implementation was designed to index only eZ Content Objects, so it would not do the trick.
In the end, the best option was to use the SolR instance bundled with eZFind and configure it to index our own data.
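To make the goal concrete, here is a minimal sketch of how one row of such a table could be serialized as a SolR XML update message (the `<add><doc>` format). The field names `field_one` and `field_two` and the helper `row_to_solr_add` are hypothetical; the fields would have to be declared in the core's schema.xml for SolR to accept them.

```python
import xml.etree.ElementTree as ET


def row_to_solr_add(row):
    """Serialize one table row (a dict) as a SolR <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in row.items():
        # One <field name="..."> element per column of the row.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical row from the custom table described above.
row = {"id": 1, "field_one": "some text", "field_two": 42}
print(row_to_solr_add(row))
```

The resulting string is what would be posted to the update handler of the core, followed by a commit.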
The solution : a custom SolR schema
This was a bit tricky because of the state of the SolR documentation. There is a wiki (see the references below), but important points are not always documented and you have to dig into the XML files to see how things work.
Here are some references that can be helpful before going further :
- Official documentation and wiki : http://wiki.apache.org/solr/
- eZFind : http://doc.ez.no/Extensions/eZ-Find
- IBM Tutorial : http://www.ibm.com/developerworks/java/library/j-solr1/
- Le blog d'un DSI : http://leblogdundsi.lesprost.fr/article34/moteur-de-recherche-solr-sous-windows-tomcat
- Gandbox : http://www.gandbox.fr/Blogs/Technologies-Web/Developpement-avance-avec-eZ-Find-partie-1-La-gestion-des-datatypes-entre-eZ-Find-Solr
eZFind configuration
eZFind comes with two configuration sets :
- the normal one with only one core / index : your website
- the shared one with multiple cores / indexes : your website in fre-FR, eng-GB, esp-ES and so on
The normal set is located in extension/ezfind/java/solr, while the multicore one is located in extension/ezfind/java/solr.multicore.
A set is made of two directories :
- conf : holds the configuration
- data : holds the binary index data
You may also have other directories that are helpful for specific filters or external features, but SolR needs at least those two.
By default eZFind only uses the single-core configuration set, so we must enable the multicore one.
In extension/ezfind/settings/ezfind.ini :
MultiCore=enabled
DefaultCore=eng-GB
LanguagesCoresMap[eng-GB]=eng-GB
LanguagesCoresMap[fre-FR]=fre-FR
LanguagesCoresMap[nor-NO]=nor-NO
LanguagesCoresMap[example]=example
These settings map each language to the SolR core that indexes it.
In extension/ezfind/settings/solr.ini :
Shards[]
Shards[eng-GB]=http://localhost:8983/solr/eng-GB
Shards[fre-FR]=http://localhost:8983/solr/fre-FR
Shards[nor-NO]=http://localhost:8983/solr/nor-NO
Shards[example]=http://localhost:8983/solr/example
SolR Configuration
SolR has another cool feature called sharding, which lets you run one query against several cores / indexes. It's useful when you have several heterogeneous indexes : you can search for one term in one dictionary and get results from all dictionaries. In eZFind, it's used to return translated results : you search for banana and you get results for banana in English and banane in French.
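As an illustration of sharding, the sketch below builds the URL of a distributed query using SolR's `shards` request parameter, which lists the cores to search as host:port/path entries without the scheme. The host, the choice of cores and the helper name `sharded_query_url` are assumptions for the example, not what eZFind does internally.

```python
from urllib.parse import urlencode

SOLR_HOST = "localhost:8983"   # assumed: same host and port as the shard URLs
CORES = ["eng-GB", "fre-FR"]   # assumed: the cores we want to search together


def sharded_query_url(term, entry_core=CORES[0], cores=CORES):
    """Build a select URL that asks one core to fan the query out to several."""
    # Each shard is written as host:port/path, without "http://".
    shards = ",".join(f"{SOLR_HOST}/solr/{core}" for core in cores)
    params = urlencode({"q": term, "shards": shards, "wt": "json"})
    return f"http://{SOLR_HOST}/solr/{entry_core}/select?{params}"


print(sharded_query_url("banana"))
```

The query is sent to a single core, which forwards it to every core listed in `shards` and merges the results.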
In SolR, there are three XML files to set up to have a full configuration :
- solr.xml
- solrconfig.xml
- schema.xml
solr.xml
This file is simple : it declares cores for the SolR system (sorry about this one :) ). A core is an index. By analogy, a core is a reference work, like a dictionary. You can have several dictionaries : English, French, Spanish, Portuguese and so on. But you can also have one per application domain, like a dictionary of medicine, of computer science or whatever.
Our only modification to this file was the addition of a specific core :
<core name="example" instanceDir="example" />
The attributes are defined like this :
- name : name of your index / core, which will be available at http://localhost:8983/solr/<name>
- instanceDir : directory that contains everything for this index / core (conf and data)
Then, copy the directory extension/ezfind/java/solr to extension/ezfind/java/solr.multicore/example.
solrconfig.xml
We left this file at its defaults, but you may want to check the language-specific settings it contains (search for "English", for example).
schema.xml
This is the main file, the one that maps your data to fields inside SolR. But first, some explanation of the SolR concepts.
SolR can hold several indexes, as we have just seen with the cores. This is useful because you can keep the indexes separate and still query several of them with one query (sharding).
For one index, SolR can handle several types of data :
- Structured data : identified fields that are required for each piece of data you want to index. For example, if you want to index homogeneous documents stored in eZPublish, you will need to provide data like the node_id, the section and so on.
- Unstructured data : fields that are not identified in advance, are indexed when present and are not required. For example, if you want to index heterogeneous documents, you can index data about a video and data about a picture even if picture and video content do not share any fields. Those fields are named dynamic fields.
- Mixed data : you can have both identified, required fields and unidentified, optional fields.
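The difference between identified fields and dynamic fields can be sketched with a simplified model of how a field name is resolved against a schema. This is only an illustration of the idea, not SolR's actual implementation; the field names and the `*_i`, `*_s`, `*_t` suffix patterns are assumptions chosen for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicitly declared fields and dynamic-field patterns.
STATIC_FIELDS = {"id": "int", "field_one": "text"}
DYNAMIC_FIELDS = {"*_i": "int", "*_s": "string", "*_t": "text"}


def resolve_field_type(name):
    """Return the type a field name maps to, or None if no declaration matches."""
    # Explicitly declared fields win over dynamic patterns.
    if name in STATIC_FIELDS:
        return STATIC_FIELDS[name]
    # Otherwise try the wildcard patterns, most specific (longest) first.
    for pattern in sorted(DYNAMIC_FIELDS, key=len, reverse=True):
        if fnmatchcase(name, pattern):
            return DYNAMIC_FIELDS[pattern]
    return None


for name in ("id", "duration_i", "caption_t", "unknown"):
    print(name, "->", resolve_field_type(name))
```

With a mixed schema, a video document can carry `duration_i` and a picture document `caption_t` without either field being declared by name, while `id` remains an identified field.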
In eZFind, the configuration is set to Mixed data and all content fields are required. This is why the eZFind implementation is not very extensible. So I used another client, as suggested by Paul Borgermans.
ezcSearch
In ezcSearch, the eZComponents / Apache Zeta Components search component, the schema.xml that is provided is a base for what you want to do and is also Mixed data. ezcSearch also requires you to implement some interfaces that are compliant with the Persistent Object definition.
I did some testing with this client and found the following a bit strange (or I did not understand it properly) :
- There's a hard coded field called ezcsearch_type. If you know about this one, just share on the forum.
- The unique id field in the schema must be id.
More info about this in the Fisheye repository for Apache Zeta Component Search.
In the end, I subclassed the SolR manager class to change the index function so that requests work with my fields. I also removed all the static fields I didn't want from schema.xml and set my own fields. It's much better and you're free to do whatever you want.
Conclusion
SolR is very powerful but not so accessible, because much of its documentation lives in the code. Maybe the best move would have been to buy a book on it before starting.
eZFind is an out-of-the-box solution that works only for eZPublish content, which is a bit restrictive in our case. According to Paul Borgermans, the next version of eZFind will be able to take care of extra, non-related content fields !
ezcSearch is useful to generate all the queries sent to SolR, but it's too restrictive for out-of-the-box use. However, it exists, so thank you guys for having already done the job !
