Package org.eclipse.rdf4j.sail.lucene
Class LuceneSail
java.lang.Object
org.eclipse.rdf4j.sail.helpers.SailWrapper
org.eclipse.rdf4j.sail.helpers.NotifyingSailWrapper
org.eclipse.rdf4j.sail.lucene.LuceneSail
All Implemented Interfaces: FederatedServiceResolverClient, NotifyingSail, Sail, StackableSail
A LuceneSail wraps an arbitrary existing Sail and extends it with support for full-text search on all Literals.
Setting up a LuceneSail
LuceneSail works in two modes: it stores its index either in a directory on the hard disk or in a RAMDirectory in RAM (which is discarded when the program ends). Example with storage in a folder:
    import org.eclipse.rdf4j.repository.sail.SailRepository;
    import org.eclipse.rdf4j.sail.lucene.LuceneSail;
    import org.eclipse.rdf4j.sail.memory.MemoryStore;

    // create an in-memory base sail
    MemoryStore memoryStore = new MemoryStore();
    // create a LuceneSail to wrap the MemoryStore
    LuceneSail lucenesail = new LuceneSail();
    // set this parameter to store the Lucene index on disk
    lucenesail.setParameter(LuceneSail.LUCENE_DIR_KEY, "./data/mydirectory");
    // wrap the MemoryStore in the LuceneSail
    lucenesail.setBaseSail(memoryStore);
    // create a Repository to access the sails
    SailRepository repository = new SailRepository(lucenesail);
    repository.init();
Example with storage in a RAM directory:
    // uses the same imports as the previous example
    // create an in-memory base sail
    MemoryStore memoryStore = new MemoryStore();
    // create a LuceneSail to wrap the MemoryStore
    LuceneSail lucenesail = new LuceneSail();
    // set this parameter to let the Lucene index store its data in RAM
    lucenesail.setParameter(LuceneSail.LUCENE_RAMDIR_KEY, "true");
    // wrap the MemoryStore in the LuceneSail
    lucenesail.setBaseSail(memoryStore);
    // create a Repository to access the sails
    SailRepository repository = new SailRepository(lucenesail);
    repository.init();
Asking full-text queries
Text queries are expressed using the virtual properties of the LuceneSail. An example query looks like this (SeRQL):
    SELECT Subject, Score, Snippet
    FROM {Subject} <http://www.openrdf.org/contrib/lucenesail#matches> {}
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> {<http://www.openrdf.org/contrib/lucenesail#LuceneQuery>};
        <http://www.openrdf.org/contrib/lucenesail#query> {"my Lucene query"};
        <http://www.openrdf.org/contrib/lucenesail#score> {Score};
        <http://www.openrdf.org/contrib/lucenesail#snippet> {Snippet}
In SPARQL:
    SELECT ?subject ?score ?snippet ?resource WHERE {
      ?subject <http://www.openrdf.org/contrib/lucenesail#matches> [
        a <http://www.openrdf.org/contrib/lucenesail#LuceneQuery> ;
        <http://www.openrdf.org/contrib/lucenesail#query> "my Lucene query" ;
        <http://www.openrdf.org/contrib/lucenesail#score> ?score ;
        <http://www.openrdf.org/contrib/lucenesail#snippet> ?snippet ;
        <http://www.openrdf.org/contrib/lucenesail#resource> ?resource
      ]
    }
When defining queries, the type and query properties are mandatory, as is the matches relation. If any of them is missing, the query will not be executed as expected. The failure behavior is configurable: setting the Sail parameter "incompletequeryfail" to true throws a SailException when such incomplete patterns are found. This is the default, and it helps to detect inaccurate queries. Set it to false to have warnings logged instead. Multiple queries can be issued to the sail; their results are integrated. Note that you cannot use the same variable for multiple text queries. To combine text searches, use Lucene's query syntax.
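As an illustration of the mandatory parts (the matches relation, the LuceneQuery type and the query property), the following sketch assembles a SPARQL full-text query string in Java. The class LuceneSailQueries, its methods and the search: prefix label are hypothetical helpers chosen for this example and are not part of the LuceneSail API; only the namespace and the property names are taken from the queries in this section. The escaping covers backslashes and double quotes only.

```java
final class LuceneSailQueries {

    // Namespace used by the virtual LuceneSail properties in this section.
    static final String NS = "http://www.openrdf.org/contrib/lucenesail#";

    /** Escapes backslashes and double quotes for use inside a SPARQL string literal. */
    static String escapeLiteral(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a query containing the mandatory matches/type/query parts plus score and snippet. */
    static String fullTextQuery(String luceneQuery) {
        return "PREFIX search: <" + NS + ">\n"
                + "SELECT ?subject ?score ?snippet WHERE {\n"
                + "  ?subject search:matches [\n"
                + "    a search:LuceneQuery ;\n"
                + "    search:query \"" + escapeLiteral(luceneQuery) + "\" ;\n"
                + "    search:score ?score ;\n"
                + "    search:snippet ?snippet\n"
                + "  ]\n"
                + "}";
    }
}
```

The resulting string can be passed to whatever query API the application already uses; each text search gets its own blank node and its own variables.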
Fields are stored/indexed
All fields are stored and indexed. The "text" fields (gathering all literals) have to be stored because, when a new literal is added to a document, the previous texts need to be copied from the existing document to the new document; this does not work when they are only indexed. Fields that are not stored cannot be retrieved using full-text querying.
Deleting a Lucene index
At the moment, the Lucene index can be deleted in two ways:
- Delete the folder where the data is stored while the application is not running.
- Call the repository's RepositoryConnection.clear(org.eclipse.rdf4j.model.Resource[]) method with no arguments: clear(). This will delete the index.
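For the first option, a minimal sketch of removing an index folder with the standard java.nio.file API is shown here. The class and method names are hypothetical, and the code assumes that no LuceneSail is currently using the directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class LuceneIndexCleaner {

    /** Recursively deletes the given index directory; does nothing if it does not exist. */
    static void deleteIndexDirectory(Path indexDir) {
        if (!Files.exists(indexDir)) {
            return; // nothing to delete
        }
        try (Stream<Path> paths = Files.walk(indexDir)) {
            // Reverse order visits files and subdirectories before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the setup example from this page, the argument would be the path that was passed as LUCENE_DIR_KEY, for instance Paths.get("./data/mydirectory").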
Handling of Contexts
Each Lucene document contains a field for every context ID that contributed to the document. NULL contexts are marked using the String SearchFields.CONTEXT_NULL ("null") and stored in the Lucene field SearchFields.CONTEXT_FIELD_NAME ("context"). This means that when adding or appending to a document, all additional context URIs are added to the document. When deleting individual triples, the context is ignored. In clear(Resource ...) we query for all Lucene documents that were possibly created by the given context(s). Given a document D that contexts C(1-n) contributed to, let D' be the new document after clear():
- If there is only one C, then D can be safely removed and there is no D'. (Hopefully this is the standard case, as in ontologies, where all triples about a resource are in one document.)
- If there are multiple C, remember the URI of D, delete D, and query (s, p, o, ?) from the underlying store after committing the operation. This returns the literals of D'; add D' as a new document.
This will probably be both fast in the common case and capable enough in the multiple-C case.
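The two cases can be sketched with plain Java collections. This is an in-memory model of the strategy described in this section, not the LuceneSail implementation: the classes Quad and Doc, the list standing in for the base sail and the map standing in for the Lucene index are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ContextClearSketch {

    /** One indexed statement: subject, literal text and context (null = no context). */
    static final class Quad {
        final String subject;
        final String text;
        final String context;

        Quad(String subject, String text, String context) {
            this.subject = subject;
            this.text = text;
            this.context = context;
        }
    }

    /** Simplified stand-in for a Lucene document. */
    static final class Doc {
        final List<String> texts = new ArrayList<>();
        final Set<String> contexts = new HashSet<>();
    }

    final List<Quad> store = new ArrayList<>();     // stand-in for the base sail
    final Map<String, Doc> index = new HashMap<>(); // subject -> document

    /** NULL contexts are recorded with the marker "null". */
    static String marker(String context) {
        return context == null ? "null" : context;
    }

    void add(String subject, String text, String context) {
        store.add(new Quad(subject, text, context));
        addToIndex(subject, text, context);
    }

    private void addToIndex(String subject, String text, String context) {
        Doc doc = index.computeIfAbsent(subject, key -> new Doc());
        doc.texts.add(text);
        doc.contexts.add(marker(context));
    }

    void clear(String context) {
        String cleared = marker(context);
        // The base store drops its statements first (the "after committing" step).
        store.removeIf(quad -> marker(quad.context).equals(cleared));

        // Collect every document the cleared context contributed to.
        List<String> affected = new ArrayList<>();
        for (Map.Entry<String, Doc> entry : index.entrySet()) {
            if (entry.getValue().contexts.contains(cleared)) {
                affected.add(entry.getKey());
            }
        }
        for (String subject : affected) {
            Doc old = index.remove(subject); // delete D
            if (old.contexts.size() == 1) {
                continue; // only one context: there is no D'
            }
            for (Quad quad : store) { // several contexts: rebuild D' from the store
                if (quad.subject.equals(subject)) {
                    addToIndex(subject, quad.text, quad.context);
                }
            }
        }
    }
}
```

In the single-context case the document is dropped without touching the store, which is why that path is cheap; only documents fed by several contexts pay for a rebuild.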
Defining the indexed Fields
The property INDEXEDFIELDS configures which fields to index and allows projecting one property to another. Syntax:
    # only index label and comment
    index.1=http://www.w3.org/2000/01/rdf-schema#label
    index.2=http://www.w3.org/2000/01/rdf-schema#comment
    # project http://xmlns.com/foaf/0.1/name to rdfs:label
    http\://xmlns.com/foaf/0.1/name=http\://www.w3.org/2000/01/rdf-schema#label
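The syntax follows the java.util.Properties file format, which is why the colon in a key has to be escaped as "\:". The sketch below loads such a configuration with the standard Properties class and separates the index.N entries from the projections. The class IndexedFieldsSketch is hypothetical, and how LuceneSail itself interprets the entries is not shown here; the code only illustrates how the format is read.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

final class IndexedFieldsSketch {

    final Set<String> indexed = new LinkedHashSet<>();             // values of the index.N keys
    final Map<String, String> projections = new LinkedHashMap<>(); // property -> projected property

    static IndexedFieldsSketch parse(String config) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        IndexedFieldsSketch result = new IndexedFieldsSketch();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("index.")) {
                result.indexed.add(props.getProperty(key));
            } else {
                result.projections.put(key, props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In Java source the backslash itself must be escaped, hence "http\\:".
        String config = "# only index label and comment\n"
                + "index.1=http://www.w3.org/2000/01/rdf-schema#label\n"
                + "index.2=http://www.w3.org/2000/01/rdf-schema#comment\n"
                + "# project http://xmlns.com/foaf/0.1/name to rdfs:label\n"
                + "http\\://xmlns.com/foaf/0.1/name=http\\://www.w3.org/2000/01/rdf-schema#label\n";
        IndexedFieldsSketch parsed = parse(config);
        System.out.println("indexed: " + parsed.indexed);
        System.out.println("projections: " + parsed.projections);
    }
}
```

Without the backslash, Properties would end the key at the first colon and read "http" as the key.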
Set and select Lucene sail by id
The property INDEX_ID configures the id of the index and filters out every request that does not carry the search:indexid predicate. A request then looks like this:
    ?subj search:matches [
        search:indexid my:lucene_index_id ;
        search:query "search terms..." ;
        search:property my:property ;
        search:score ?score ;
        search:snippet ?snippet ] .
If a LuceneSail is using another LuceneSail as a base sail, the evaluation mode should be set to TupleFunctionEvaluationMode.NATIVE.
Defining the indexed Types/Languages
The properties INDEXEDTYPES and INDEXEDLANG configure which fields to index by their type or language.
INDEXEDTYPES syntax:
    # only index object of rdf:type ex:mytype1, rdf:type ex:mytype2 or ex:mytypedef ex:mytype3
    http\://www.w3.org/1999/02/22-rdf-syntax-ns#type=http://example.org/mytype1 http://example.org/mytype2
    http\://example.org/mytypedef=http://example.org/mytype3
INDEXEDLANG syntax:
    # syntax to index only French(fr) and English(en) literals
    fr en
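Both values are whitespace-separated lists. As a rough illustration of reading them, the sketch below splits an INDEXEDLANG value into language tags and loads an INDEXEDTYPES value into a map from type predicate to allowed types. The class and method names are hypothetical, and the parsing is an assumption about the format as shown in this section, not a copy of LuceneSail's own parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;

final class IndexedTypesAndLangSketch {

    /** Splits a whitespace-separated value such as "fr en" into its tokens. */
    static Set<String> tokens(String value) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Reads "predicate=type1 type2" lines into a map from predicate to its set of types. */
    static Map<String, Set<String>> parseTypes(String config) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Set<String>> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            result.put(key, tokens(props.getProperty(key)));
        }
        return result;
    }
}
```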
Datatypes
Datatypes are ignored in the LuceneSail.
Field Summary
Modifier and Type    Field    Description
static final String    ANALYZER_CLASS_KEY    Set this key as sail parameter to configure the Lucene analyzer class implementation to use for text analysis.
static final String    DEFAULT_INDEX_CLASS
static final String    DEFAULT_LUCENE_DIR    Set the default directory of the Lucene index files.
static final String    EVALUATION_MODE_KEY
static final String    FUZZY_PREFIX_LENGTH_KEY    Set this key as sail parameter to influence the fuzzy prefix length.
static final String    INCOMPLETE_QUERY_FAIL_KEY    Set this key as sail parameter to influence whether incomplete queries are treated as failure (malformed queries) or whether they are ignored.
static final String    INDEX_CLASS_KEY    Set this key to configure the SearchIndex class implementation.
static final String    INDEX_ID    Set this key to configure the filtering of queries. If this parameter is set, the match object should contain the search:indexid parameter; see the syntax above.
static final String    INDEX_TYPE_BACKTRACE_MODE
static final String    INDEXEDFIELDS    Set the parameter "indexedfields=..." to configure a selection of fields to index, and projections of properties.
static final String    INDEXEDLANG    Set the parameter "indexedlang=..." to configure a selection of field languages to index.
static final String    INDEXEDTYPES    Set the parameter "indexedtypes=..." to configure a selection of field types to index.
static final String    LUCENE_DIR_KEY    Set the key "lucenedir=<path>" as sail parameter to configure the Lucene Directory on the filesystem where the Lucene index is stored.
static final String    LUCENE_RAMDIR_KEY    Set the key "useramdir=true" as sail parameter to let the LuceneSail store its Lucene index in RAM.
static final String    MAX_DOCUMENTS_KEY    Set the key "maxDocuments=<n>" as sail parameter to limit the maximum number of documents to return from a search query.
protected final Properties    parameters
static final String    REINDEX_QUERY_KEY    Set the parameter "reindexQuery=" to configure the statements to index over.
static final String    SIMILARITY_CLASS_KEY    Set this key as sail parameter to configure the Similarity class implementation to use for text analysis.
static final String    WKT_FIELDS    Set this key to configure which fields contain WKT and should be spatially indexed.
Constructor Summary
LuceneSail()
Method Summary
Modifier and Type    Method    Description
protected boolean    acceptStatementToIndex
protected static SearchIndex    createSearchIndex(Properties parameters)    Deprecated.
getConnection    Opens a connection on the Sail which can be used to query and update data.
getEvaluationMode    See EVALUATION_MODE_KEY parameter.
getFederatedServiceResolver
getIndexBacktraceMode    See INDEX_TYPE_BACKTRACE_MODE parameter.
getLuceneIndex
getParameter(String key)
getParameterNames
getReindexQuery    See REINDEX_QUERY_KEY parameter.
protected Collection<SearchQueryInterpreter>    getSearchQueryInterpreters
getTupleFunctionRegistry
void    init()    Initializes the Sail.
protected void    initializeLuceneIndex
boolean    isIncompleteQueryFails()    When this is true, incomplete queries will trigger a SailException.
mapStatement(Statement statement)
void    registerStatementFilter    Sets a filter which determines whether a statement should be considered for indexing when performing complete reindexing.
void    reindex()    Starts a reindexing process of the whole sail.
void    setDataDir(File dataDir)    Sets the data directory for the Sail.
void    setEvaluationMode    See EVALUATION_MODE_KEY parameter.
void    setFederatedServiceResolver    Sets the FederatedServiceResolver to use for this client.
void    setFuzzyPrefixLength(int fuzzyPrefixLength)
void    setIncompleteQueryFails(boolean incompleteQueryFails)    Set this to true, so that incomplete queries will trigger a SailException.
void    setIndexBacktraceMode    See INDEX_TYPE_BACKTRACE_MODE parameter.
void    setLuceneIndex(SearchIndex luceneIndex)
void    setParameter(String key, String value)
void    setReindexQuery(String query)    See REINDEX_QUERY_KEY parameter.
void    setTupleFunctionRegistry
void    shutDown()    Shuts down the Sail, giving it the opportunity to synchronize any stale data.
Methods inherited from class org.eclipse.rdf4j.sail.helpers.NotifyingSailWrapper
addSailChangedListener, getBaseSail, removeSailChangedListener, setBaseSail
Methods inherited from class org.eclipse.rdf4j.sail.helpers.SailWrapper
getDataDir, getDefaultIsolationLevel, getSupportedIsolationLevels, getValueFactory, isWritable, verifyBaseSailSet
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.eclipse.rdf4j.sail.Sail
getDataDir, getDefaultIsolationLevel, getSupportedIsolationLevels, getValueFactory, isWritable
Field Details
- REINDEX_QUERY_KEY
Set the parameter "reindexQuery=" to configure the statements to index over. Default value is "SELECT ?s ?p ?o ?c WHERE {{?s ?p ?o} UNION {GRAPH ?c {?s ?p ?o.}}} ORDER BY ?s". NB: the query must contain the bindings ?s, ?p, ?o and ?c and must be ordered by ?s.
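One plausible reason for the ordering requirement is that rows can then be consumed subject by subject in a single pass. The sketch below shows that grouping technique on plain string arrays. It is an assumption about why ORDER BY ?s matters, not LuceneSail's actual reindexing code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class OrderedReindexSketch {

    /**
     * Splits rows of the form {s, p, o, c} into one batch per subject.
     * The input must already be ordered by s; otherwise a subject ends up in several batches.
     */
    static List<List<String[]>> groupBySubject(List<String[]> rowsOrderedBySubject) {
        List<List<String[]>> batches = new ArrayList<>();
        List<String[]> current = null;
        String currentSubject = null;
        for (String[] row : rowsOrderedBySubject) {
            if (current == null || !row[0].equals(currentSubject)) {
                current = new ArrayList<>(); // a new subject starts a new batch
                batches.add(current);
                currentSubject = row[0];
            }
            current.add(row);
        }
        return batches;
    }
}
```

Each batch could then be turned into one document without keeping earlier subjects in memory.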
- INDEXEDFIELDS
Set the parameter "indexedfields=..." to configure a selection of fields to index, and projections of properties. Only the configured fields will be indexed. A property P projected to Q will cause the index to contain Q instead of P when triples with P are indexed. For the syntax of indexedfields, see above.
- INDEXEDTYPES
Set the parameter "indexedtypes=..." to configure a selection of field types to index. Only the fields with the specified type will be indexed. For the syntax of indexedtypes, see above.
- INDEXEDLANG
Set the parameter "indexedlang=..." to configure a selection of field languages to index. Only the fields with the specified language will be indexed. For the syntax of indexedlang, see above.
- INDEX_TYPE_BACKTRACE_MODE
- LUCENE_DIR_KEY
Set the key "lucenedir=<path>" as sail parameter to configure the Lucene Directory on the filesystem where the Lucene index is stored.
- DEFAULT_LUCENE_DIR
Set the default directory of the Lucene index files. The value is always relative to the dataDir location, which acts as the parent directory.
- LUCENE_RAMDIR_KEY
Set the key "useramdir=true" as sail parameter to let the LuceneSail store its Lucene index in RAM. This is not intended for production environments.
- MAX_DOCUMENTS_KEY
Set the key "maxDocuments=<n>" as sail parameter to limit the maximum number of documents to return from a search query. The default is to return all documents. NB: this may involve extra cost for some SearchIndex implementations as they may have to determine this number.
- WKT_FIELDS
Set this key to configure which fields contain WKT and should be spatially indexed. The value should be a space-separated list of URIs. Default is http://www.opengis.net/ont/geosparql#asWKT.
- INDEX_CLASS_KEY
Set this key to configure the SearchIndex class implementation. Default is org.eclipse.rdf4j.sail.lucene.LuceneIndex.
- INDEX_ID
Set this key to configure the filtering of queries. If this parameter is set, the match object should contain the search:indexid parameter; see the syntax above.
- DEFAULT_INDEX_CLASS
- ANALYZER_CLASS_KEY
Set this key as sail parameter to configure the Lucene analyzer class implementation to use for text analysis.
- SIMILARITY_CLASS_KEY
Set this key as sail parameter to configure the Similarity class implementation to use for text analysis.
- INCOMPLETE_QUERY_FAIL_KEY
Set this key as sail parameter to influence whether incomplete queries are treated as failure (malformed queries) or whether they are ignored. Set to either "true" or "false". When omitted in the properties, true is the default (failure on incomplete queries). See isIncompleteQueryFails().
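The default-to-true behavior can be pictured with java.util.Properties as follows. This is a minimal sketch: the class and method names are hypothetical, the key string "incompletequeryfail" is the one quoted on this page, and the use of Boolean.parseBoolean is an assumption rather than a statement about how LuceneSail reads the value.

```java
import java.util.Properties;

final class IncompleteQueryFailDefault {

    // Parameter name as quoted in this documentation.
    static final String KEY = "incompletequeryfail";

    /** Returns true unless the parameter is present and set to something other than "true". */
    static boolean incompleteQueryFails(Properties parameters) {
        return Boolean.parseBoolean(parameters.getProperty(KEY, "true"));
    }
}
```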
- EVALUATION_MODE_KEY
- FUZZY_PREFIX_LENGTH_KEY
Set this key as sail parameter to influence the fuzzy prefix length.
- parameters
Constructor Details
- LuceneSail
public LuceneSail()
Method Details
- setLuceneIndex
- getLuceneIndex
- getConnection
Description copied from interface: Sail
Opens a connection on the Sail which can be used to query and update data. Depending on how the implementation handles concurrent access, a call to this method might block when there is another open connection on this Sail.
Specified by: getConnection in interface NotifyingSail
Specified by: getConnection in interface Sail
Overrides: getConnection in class NotifyingSailWrapper
Throws: SailException - If no transaction could be started, for example because the Sail is not writable.
- shutDown
Description copied from interface: Sail
Shuts down the Sail, giving it the opportunity to synchronize any stale data. Care should be taken that all initialized Sails are being shut down before an application exits to avoid potential loss of data. Once shut down, a Sail can no longer be used until it is re-initialized.
Specified by: shutDown in interface Sail
Overrides: shutDown in class SailWrapper
Throws: SailException - If the Sail object encountered an error or unexpected situation internally.
- setDataDir
Description copied from interface: Sail
Sets the data directory for the Sail. The Sail can use this directory for storage of data, parameters, etc. This directory must be set before the Sail is initialized.
Specified by: setDataDir in interface Sail
Overrides: setDataDir in class SailWrapper
- init
Description copied from interface: Sail
Initializes the Sail. Care should be taken that required initialization parameters have been set before this method is called. Please consult the specific Sail implementation for information about the relevant parameters.
Specified by: init in interface Sail
Overrides: init in class SailWrapper
Throws: SailException - If the Sail could not be initialized.
- createSearchIndex
Deprecated. The method is relocated to SearchIndexUtils.createSearchIndex(java.util.Properties).
Parameters: parameters
Returns: search index
Throws: Exception
- initializeLuceneIndex
Throws: Exception
- setParameter
- getParameter
- getParameterNames
- getReindexQuery
See REINDEX_QUERY_KEY parameter.
- setReindexQuery
See REINDEX_QUERY_KEY parameter.
- isIncompleteQueryFails
public boolean isIncompleteQueryFails()
When this is true, incomplete queries will trigger a SailException. You can set this value either using setIncompleteQueryFails(boolean) or using the parameter "incompletequeryfail".
Returns: the incompleteQueryFails.
- setIncompleteQueryFails
public void setIncompleteQueryFails(boolean incompleteQueryFails)
Set this to true, so that incomplete queries will trigger a SailException. Otherwise, incomplete queries will be logged with level WARN. Default is true. You can set this value also using the parameter "incompletequeryfail".
Parameters: incompleteQueryFails - true or false
- getEvaluationMode
See EVALUATION_MODE_KEY parameter.
- setEvaluationMode
See EVALUATION_MODE_KEY parameter.
- getIndexBacktraceMode
See INDEX_TYPE_BACKTRACE_MODE parameter.
- setIndexBacktraceMode
See INDEX_TYPE_BACKTRACE_MODE parameter.
- setFuzzyPrefixLength
public void setFuzzyPrefixLength(int fuzzyPrefixLength)
- getTupleFunctionRegistry
- setTupleFunctionRegistry
- getFederatedServiceResolver
- setFederatedServiceResolver
Description copied from interface: FederatedServiceResolverClient
Sets the FederatedServiceResolver to use for this client.
Specified by: setFederatedServiceResolver in interface FederatedServiceResolverClient
Overrides: setFederatedServiceResolver in class SailWrapper
Parameters: resolver - The resolver to use.
- reindex
Starts reindexing the whole sail. This deletes all data from the index and adds it again, which can take a long time.
Throws: SailException - If the Sail could not be reindexed.
- registerStatementFilter
Sets a filter which determines whether a statement should be considered for indexing when performing complete reindexing.
- acceptStatementToIndex
- mapStatement
- getSearchQueryInterpreters