Class DirectoryTaxonomyWriter
- All Implemented Interfaces:
Closeable,AutoCloseable,TaxonomyWriter,TwoPhaseCommit
- Direct Known Subclasses:
ReindexingEnrichedDirectoryTaxonomyWriter
TaxonomyWriter which uses a Directory to store the taxonomy information on disk,
and keeps an additional in-memory cache of some or all categories.
In addition to the permanently-stored information in the Directory, efficiency
dictates that we also keep an in-memory cache of recently seen or all categories,
so that we do not need to go back to disk for every category addition to see which ordinal this
category already has, if any. A TaxonomyWriterCache object determines the specific
caching algorithm used.
This class offers some hooks for extending classes to control the IndexWriter instance
that is used. See openIndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.index.IndexWriterConfig).
-
Nested Class Summary
Nested Classes
Modifier and Type, Description:
- static final class: DirectoryTaxonomyWriter.OrdinalMap maintained on file system
- static final class: DirectoryTaxonomyWriter.OrdinalMap maintained in memory
- static interface: Mapping from old ordinal to new ordinals, used when merging indexes with separate taxonomies.
-
Field Summary
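Fields
Modifier and Type, Field, Description:
- private final TaxonomyWriterCache cache
- private boolean cacheIsComplete: We call the cache "complete" if we know that every category in our taxonomy is in the cache.
- private final AtomicInteger cacheMisses
- private int cacheMissesUntilFill
- private static final int DEFAULT_CACHE_SIZE
- private final Directory dir
- private final Field fullPathField
- static final String INDEX_EPOCH: Property name of user commit data that contains the index epoch.
- private long indexEpoch
- private final IndexWriter indexWriter
- private boolean initializedReaderManager
- private boolean isClosed
- private final AtomicInteger nextID
- private ReaderManager readerManager
- private boolean shouldFillCache
- private boolean shouldRefreshReaderManager
- private TaxonomyIndexArrays taxoArrays
-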
Constructor Summary
Constructors
Constructor, Description:
- DirectoryTaxonomyWriter: Create this with OpenMode.CREATE_OR_APPEND.
- DirectoryTaxonomyWriter(Directory directory, IndexWriterConfig.OpenMode openMode): Creates a new instance with a default cache as defined by defaultTaxonomyWriterCache().
- DirectoryTaxonomyWriter(Directory directory, IndexWriterConfig.OpenMode openMode, TaxonomyWriterCache cache): Construct a Taxonomy writer.
-
Method Summary
Modifier and Type, Method, Description:
- int addCategory(FacetLabel categoryPath): addCategory() adds a category with a given path name to the taxonomy, and returns its ordinal.
- private int addCategoryDocument(FacetLabel categoryPath, int parent): Note that the methods calling addCategoryDocument() are synchronized, so this method is effectively synchronized as well.
- void addTaxonomy(Directory taxoDir, DirectoryTaxonomyWriter.OrdinalMap map): Takes the categories from the given taxonomy directory, and adds the missing ones to this taxonomy.
- private void addToCache(FacetLabel categoryPath, int id)
- void close(): Frees used resources as well as closes the underlying IndexWriter, which commits whatever changes were made to it to the underlying Directory.
- protected void closeResources(): A hook for extending classes to close additional resources that were used.
- combinedCommitData(Iterable<Map.Entry<String, String>> commitData): Combine original user data with the taxonomy epoch.
- long commit(): The second phase of a 2-phase commit.
- protected IndexWriterConfig createIndexWriterConfig(IndexWriterConfig.OpenMode openMode): Create the IndexWriterConfig that would be used for opening the internal index writer.
- static TaxonomyWriterCache defaultTaxonomyWriterCache(): Defines the default TaxonomyWriterCache to use in constructors which do not specify one.
- (package private) void deleteAll: Delete the taxonomy and reset all state for this writer.
- private void doClose()
- protected void enrichOrdinalDocument(Document d, FacetLabel categoryPath): Child classes can implement this method to modify the document corresponding to a category path before indexing it.
- protected final void ensureOpen(): Verifies that this instance wasn't closed, or throws AlreadyClosedException if it is.
- protected int findCategory(FacetLabel categoryPath): Look up the given category in the cache and/or the on-disk storage, returning the category's ordinal, or a negative number in case the category does not yet exist in the taxonomy.
- getCache(): Returns the TaxonomyWriterCache in use by this writer.
- getDirectory: Returns the Directory of this taxonomy writer.
- (package private) final IndexWriter getInternalIndexWriter: Used by DirectoryTaxonomyReader to support NRT.
- getLiveCommitData: Returns the commit user data iterable that was set on TaxonomyWriter.setLiveCommitData(Iterable).
- int getParent(int ordinal): getParent() returns the ordinal of the parent category of the category with the given ordinal.
- int getSize(): getSize() returns the number of categories in the taxonomy.
- private TaxonomyIndexArrays getTaxoArrays
- final long getTaxonomyEpoch(): Expert: returns current index epoch, if this is a near-real-time reader.
- private void initReaderManager: Opens a ReaderManager from the internal IndexWriter.
- private int internalAddCategory: Add a new category into the index (and the cache), and return its new ordinal.
- protected IndexWriter openIndexWriter(Directory directory, IndexWriterConfig config): Open internal index writer, which contains the taxonomy data.
- private void perhapsFillCache
- long prepareCommit(): Prepare most of the work needed for a two-phase commit.
- private void refreshReaderManager
- void replaceTaxonomy(Directory taxoDir): Replaces the current taxonomy with the given one.
- void rollback(): Rollback changes to the taxonomy writer and closes the instance.
- void setCacheMissesUntilFill(int i): Set the number of cache misses before an attempt is made to read the entire taxonomy into the in-memory cache.
- void setLiveCommitData(Iterable<Map.Entry<String, String>> commitUserData): Sets the commit user data iterable.
-
Field Details
-
INDEX_EPOCH
Property name of user commit data that contains the index epoch. The epoch changes whenever the taxonomy is recreated (i.e. opened with IndexWriterConfig.OpenMode.CREATE). Applications should not use this property in their commit data because it will be overridden by this taxonomy writer.
-
DEFAULT_CACHE_SIZE
private static final int DEFAULT_CACHE_SIZE
-
dir
-
indexWriter
-
cache
-
cacheMisses
-
nextID
-
fullPathField
-
indexEpoch
private long indexEpoch -
cacheMissesUntilFill
private int cacheMissesUntilFill -
shouldFillCache
private boolean shouldFillCache -
readerManager
-
initializedReaderManager
private volatile boolean initializedReaderManager -
shouldRefreshReaderManager
private volatile boolean shouldRefreshReaderManager -
cacheIsComplete
private volatile boolean cacheIsComplete
We call the cache "complete" if we know that every category in our taxonomy is in the cache. When the cache is not complete, and we can't find a category in the cache, we still need to look for it in the on-disk index; therefore, when the cache is not complete, we need to open a "reader" to the taxonomy index. The cache becomes incomplete if it was never filled with the existing categories, or if a put() to the cache ever returned true (meaning that some cached data was cleared).
-
isClosed
private volatile boolean isClosed -
taxoArrays
-
-
Constructor Details
-
DirectoryTaxonomyWriter
public DirectoryTaxonomyWriter(Directory directory, IndexWriterConfig.OpenMode openMode, TaxonomyWriterCache cache) throws IOException
Construct a Taxonomy writer.
- Parameters:
directory - The Directory in which to store the taxonomy. Note that the taxonomy is written directly to that directory (not to a subdirectory of it).
openMode - Specifies how to open a taxonomy for writing: APPEND means open an existing index for append (failing if the index does not yet exist). CREATE means create a new index (first deleting the old one if it already existed). CREATE_OR_APPEND appends to an existing index if there is one, otherwise it creates a new index.
cache - A TaxonomyWriterCache implementation which determines the in-memory caching policy. See for example LruTaxonomyWriterCache. If null or missing, defaultTaxonomyWriterCache() is used.
- Throws:
CorruptIndexException - if the taxonomy is corrupted.
LockObtainFailedException - if the taxonomy is locked by another writer.
IOException - if another error occurred.
-
DirectoryTaxonomyWriter
public DirectoryTaxonomyWriter(Directory directory, IndexWriterConfig.OpenMode openMode) throws IOException
Creates a new instance with a default cache as defined by defaultTaxonomyWriterCache().
- Throws:
IOException
-
DirectoryTaxonomyWriter
Create this with OpenMode.CREATE_OR_APPEND.
- Throws:
IOException
-
-
Method Details
-
getCache
Returns the TaxonomyWriterCache in use by this writer.
-
openIndexWriter
protected IndexWriter openIndexWriter(Directory directory, IndexWriterConfig config) throws IOException
Open internal index writer, which contains the taxonomy data.
Extensions may provide their own IndexWriter implementation or instance.
NOTE: the instance this method returns will be closed upon calling close().
NOTE: the merge policy in effect must not merge non-adjacent segments. See the comment in createIndexWriterConfig(IndexWriterConfig.OpenMode) for the logic behind this.
- Parameters:
directory - the Directory on top of which an IndexWriter should be opened.
config - configuration for the internal index writer.
- Throws:
IOException
-
createIndexWriterConfig
Create the IndexWriterConfig that would be used for opening the internal index writer.
Extensions can configure the IndexWriter as they see fit, including setting a merge-scheduler, a deletion-policy, a different RAM size, etc.
NOTE: internal docids of the configured index must not be altered. For that, categories are never deleted from the taxonomy index. In addition, the merge policy in effect must not merge non-adjacent segments.
- Parameters:
openMode - see IndexWriterConfig.OpenMode
-
initReaderManager
Opens a ReaderManager from the internal IndexWriter.
- Throws:
IOException
-
defaultTaxonomyWriterCache
Defines the default TaxonomyWriterCache to use in constructors which do not specify one.
The current default is LruTaxonomyWriterCache
-
close
Frees used resources as well as closes the underlying IndexWriter, which commits whatever changes were made to it to the underlying Directory.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Throws:
IOException
-
doClose
- Throws:
IOException
-
closeResources
A hook for extending classes to close additional resources that were used. The default implementation closes the IndexReader as well as the TaxonomyWriterCache instances that were used.
NOTE: if you override this method, you should include a super.closeResources() call in your implementation.
- Throws:
IOException
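The override rule in the NOTE above can be illustrated with a small stand-alone model. This is only a sketch: ToyBaseWriter, ToyExtendedWriter and the resource names are hypothetical stand-ins and use no Lucene classes, and the order in which the toy close() releases things is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy base class with a protected cleanup hook (hypothetical; not the Lucene class). */
class ToyBaseWriter implements AutoCloseable {
    /** Records which resources were released, in order. */
    final List<String> closed = new ArrayList<>();

    /** Default hook: releases the resources the base class itself owns. */
    protected void closeResources() {
        closed.add("reader");
        closed.add("cache");
    }

    @Override
    public final void close() {
        closed.add("indexWriter"); // assumption: the toy closes its writer first
        closeResources();          // then lets subclasses clean up through the hook
    }

    public static void main(String[] args) {
        ToyExtendedWriter writer = new ToyExtendedWriter();
        writer.close();
        System.out.println(writer.closed);
    }
}

/** Subclass that owns one extra resource and chains to the default hook. */
final class ToyExtendedWriter extends ToyBaseWriter {
    @Override
    protected void closeResources() {
        super.closeResources();      // keep the base class cleanup
        closed.add("extraResource"); // then release what this subclass added
    }
}
```

Leaving out the super.closeResources() call in the subclass would skip the "reader" and "cache" entries, which is the leak the NOTE warns about.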
-
findCategory
Look up the given category in the cache and/or the on-disk storage, returning the category's ordinal, or a negative number in case the category does not yet exist in the taxonomy.
- Throws:
IOException
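The interplay between the cache, the "complete" flag and the on-disk lookup can be sketched with a toy model. Everything below is an assumption made for illustration: ToyCategoryLookup, its HashMap "store" and its LRU cache are hypothetical and do not reflect how the real class is implemented.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a cache-then-store lookup (hypothetical names; not Lucene's implementation). */
final class ToyCategoryLookup {
    /** Stands in for the on-disk taxonomy index. */
    private final Map<String, Integer> store = new HashMap<>();
    /** Bounded LRU cache; evicting an entry marks the cache as incomplete. */
    private final Map<String, Integer> cache;
    private boolean cacheIsComplete = true; // an empty cache over an empty store is complete
    int storeReads = 0;                     // counts simulated disk lookups

    ToyCategoryLookup(final int cacheCapacity) {
        this.cache = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                boolean evict = size() > cacheCapacity;
                if (evict) {
                    cacheIsComplete = false; // some category now lives only in the store
                }
                return evict;
            }
        };
    }

    /** Records a category in both the store and the cache. */
    void put(String path, int ordinal) {
        store.put(path, ordinal);
        cache.put(path, ordinal);
    }

    /** Returns the ordinal of the path, or -1 if it is not in the taxonomy. */
    int find(String path) {
        Integer cached = cache.get(path);
        if (cached != null) {
            return cached;
        }
        if (cacheIsComplete) {
            return -1; // every category is cached, so a miss means "does not exist"
        }
        storeReads++; // the cache cannot answer, so consult the store
        Integer stored = store.get(path);
        if (stored == null) {
            return -1;
        }
        cache.put(path, stored);
        return stored;
    }

    public static void main(String[] args) {
        ToyCategoryLookup lookup = new ToyCategoryLookup(2);
        lookup.put("a", 1);
        lookup.put("b", 2);
        System.out.println(lookup.find("missing") + " after " + lookup.storeReads + " store reads");
        lookup.put("c", 3); // exceeds the capacity, so one entry is evicted
        System.out.println(lookup.find("a") + " after " + lookup.storeReads + " store reads");
    }
}
```

While the cache is complete, a miss is answered without touching the store; once an eviction has happened, every miss costs a store read.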
-
addCategory
Description copied from interface: TaxonomyWriter
addCategory() adds a category with a given path name to the taxonomy, and returns its ordinal. If the category was already present in the taxonomy, its existing ordinal is returned.
Before adding a category, addCategory() makes sure that all its ancestor categories exist in the taxonomy as well. As a result, the ordinal of a category is guaranteed to be smaller than the ordinal of any of its descendants.
- Specified by:
addCategory in interface TaxonomyWriter
- Throws:
IOException
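The ancestors-first rule can be illustrated with a small in-memory model. This is a sketch under stated assumptions: ToyTaxonomy is hypothetical, paths are plain "/"-separated strings rather than FacetLabel objects, the root is modeled as the empty path, and -1 is used as the "invalid" parent of the root.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory taxonomy that assigns ordinals ancestors-first (hypothetical; not Lucene code). */
final class ToyTaxonomy {
    static final int ROOT_ORDINAL = 0;
    static final int INVALID_ORDINAL = -1; // assumption: marker for "no parent"

    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Integer> parents = new ArrayList<>(); // parents.get(ordinal) = parent ordinal

    ToyTaxonomy() {
        ordinals.put("", ROOT_ORDINAL); // the root is the empty path and always gets ordinal 0
        parents.add(INVALID_ORDINAL);
    }

    /** Adds the path and any missing ancestors; returns the (possibly existing) ordinal. */
    int addCategory(String path) {
        Integer existing = ordinals.get(path);
        if (existing != null) {
            return existing;
        }
        int slash = path.lastIndexOf('/');
        String parentPath = slash < 0 ? "" : path.substring(0, slash);
        int parent = addCategory(parentPath); // recursion: the parent gets its ordinal first
        int ordinal = parents.size();         // next free ordinal, larger than the parent's
        ordinals.put(path, ordinal);
        parents.add(parent);
        return ordinal;
    }

    int getSize() {
        return parents.size();
    }

    int getParent(int ordinal) {
        return parents.get(ordinal);
    }

    public static void main(String[] args) {
        ToyTaxonomy taxonomy = new ToyTaxonomy();
        int ordinal = taxonomy.addCategory("Author/Mark Twain");
        System.out.println("ordinal=" + ordinal + ", size=" + taxonomy.getSize());
    }
}
```

Because a parent is always numbered before its child, one addCategory() call on a deep path can grow the taxonomy by several entries, which is also why the size can exceed the number of categories the caller inserted explicitly.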
-
internalAddCategory
Add a new category into the index (and the cache), and return its new ordinal.
Actually, we might also need to add some of the category's ancestors before we can add the category itself (while keeping the invariant that a parent is always added to the taxonomy before its child). We do this by recursion.
- Throws:
IOException
-
ensureOpen
protected final void ensureOpen()
Verifies that this instance wasn't closed, or throws AlreadyClosedException if it is.
-
enrichOrdinalDocument
Child classes can implement this method to modify the document corresponding to a category path before indexing it. -
addCategoryDocument
Note that the methods calling addCategoryDocument() are synchronized, so this method is effectively synchronized as well.
- Throws:
IOException
-
addToCache
- Throws:
IOException
-
refreshReaderManager
- Throws:
IOException
-
commit
Description copied from interface: TwoPhaseCommit
The second phase of a 2-phase commit. Implementations should ideally do very little work in this method (following TwoPhaseCommit.prepareCommit()), and after it returns, the caller can assume that the changes were successfully committed to the underlying storage.
- Specified by:
commit in interface TwoPhaseCommit
- Throws:
IOException
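The two-phase protocol is easiest to see when more than one participant has to commit together, for example a taxonomy and a search index. The sketch below is a toy model: ToyCommitCoordinator, ToyTwoPhaseCommit and ToyParticipant are hypothetical names, and the coordination strategy shown (prepare all, then commit all, roll everything back on failure) is an assumption, not something this class does for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy coordinator: prepare every participant first, then commit every participant. */
final class ToyCommitCoordinator {
    /** Returns true if all participants committed; rolls all of them back otherwise. */
    static boolean commitAll(List<ToyTwoPhaseCommit> participants) {
        try {
            for (ToyTwoPhaseCommit p : participants) {
                p.prepareCommit(); // phase 1: do the heavy, failure-prone work
            }
            for (ToyTwoPhaseCommit p : participants) {
                p.commit();        // phase 2: should be cheap and unlikely to fail
            }
            return true;
        } catch (Exception e) {
            for (ToyTwoPhaseCommit p : participants) {
                try {
                    p.rollback();  // best effort: discard whatever was prepared
                } catch (Exception ignored) {
                    // keep rolling back the remaining participants
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<ToyTwoPhaseCommit> participants = Arrays.<ToyTwoPhaseCommit>asList(
            new ToyParticipant("taxonomy", false, log),
            new ToyParticipant("index", false, log));
        System.out.println(commitAll(participants) + " " + log);
    }
}

/** Hypothetical stand-in for a two-phase commit contract. */
interface ToyTwoPhaseCommit {
    void prepareCommit() throws Exception;
    void commit() throws Exception;
    void rollback() throws Exception;
}

/** Participant that logs each call and can be told to fail while preparing. */
final class ToyParticipant implements ToyTwoPhaseCommit {
    private final String name;
    private final boolean failOnPrepare;
    private final List<String> log;

    ToyParticipant(String name, boolean failOnPrepare, List<String> log) {
        this.name = name;
        this.failOnPrepare = failOnPrepare;
        this.log = log;
    }

    @Override
    public void prepareCommit() {
        if (failOnPrepare) {
            throw new IllegalStateException(name + " cannot prepare");
        }
        log.add(name + ":prepare");
    }

    @Override
    public void commit() {
        log.add(name + ":commit");
    }

    @Override
    public void rollback() {
        log.add(name + ":rollback");
    }
}
```

Keeping the second phase small is what makes this safe: once every participant has prepared successfully, the remaining commit calls have little left that can go wrong.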
-
combinedCommitData
private Iterable<Map.Entry<String,String>> combinedCommitData(Iterable<Map.Entry<String, String>> commitData) Combine original user data with the taxonomy epoch. -
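As a rough illustration of what combining user commit data with the epoch means for combinedCommitData, the sketch below merges user entries with an epoch entry so that the epoch always wins. It is a toy model only: ToyCommitData, the key "toy.index.epoch" and the decimal encoding of the epoch are all assumptions made for the example, not the values the real class uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy merge of user commit data with an epoch entry (hypothetical; not Lucene code). */
final class ToyCommitData {
    /** Hypothetical reserved key; the real property name is not assumed here. */
    static final String EPOCH_KEY = "toy.index.epoch";

    /** Copies the user entries, then writes the epoch last so it overrides any user value. */
    static Map<String, String> combine(Iterable<Map.Entry<String, String>> userData, long epoch) {
        Map<String, String> combined = new LinkedHashMap<>();
        if (userData != null) {
            for (Map.Entry<String, String> entry : userData) {
                combined.put(entry.getKey(), entry.getValue());
            }
        }
        combined.put(EPOCH_KEY, Long.toString(epoch)); // assumption: decimal encoding
        return combined;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("build", "nightly");
        user.put(EPOCH_KEY, "user-supplied value that will be replaced");
        System.out.println(combine(user.entrySet(), 3L));
    }
}
```

This is why applications are told not to store their own value under the reserved property: whatever they put there does not survive the commit.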
setLiveCommitData
Description copied from interface: TaxonomyWriter
Sets the commit user data iterable. See IndexWriter.setLiveCommitData(java.lang.Iterable<java.util.Map.Entry<java.lang.String, java.lang.String>>).
- Specified by:
setLiveCommitData in interface TaxonomyWriter
-
getLiveCommitData
Description copied from interface: TaxonomyWriter
Returns the commit user data iterable that was set on TaxonomyWriter.setLiveCommitData(Iterable).
- Specified by:
getLiveCommitData in interface TaxonomyWriter
-
prepareCommit
Prepare most of the work needed for a two-phase commit. See IndexWriter.prepareCommit().
- Specified by:
prepareCommit in interface TwoPhaseCommit
- Throws:
IOException
-
getSize
public int getSize()
Description copied from interface: TaxonomyWriter
getSize() returns the number of categories in the taxonomy.
Because categories are numbered consecutively starting with 0, it means the taxonomy contains ordinals 0 through getSize()-1.
Note that the number returned by getSize() is often slightly higher than the number of categories inserted into the taxonomy; this is because when a category is added to the taxonomy, its ancestors are also added automatically (including the root, which always gets ordinal 0).
- Specified by:
getSize in interface TaxonomyWriter
-
setCacheMissesUntilFill
public void setCacheMissesUntilFill(int i)
Set the number of cache misses before an attempt is made to read the entire taxonomy into the in-memory cache.
This taxonomy writer holds an in-memory cache of recently seen categories to speed up operation. On each cache-miss, the on-disk index needs to be consulted. When an existing taxonomy is opened, a lot of slow disk reads like that are needed until the cache is filled, so it is more efficient to read the entire taxonomy into memory at once. We do this complete read after a certain number (defined by this method) of cache misses.
If the number is set to 0, the entire taxonomy is read into the cache on first use, without fetching individual categories first.
NOTE: it is assumed that this method is called immediately after the taxonomy writer has been created.
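The threshold behaviour described for setCacheMissesUntilFill can be modeled in a few lines. This is a sketch, assuming a hypothetical ToyFillingCache backed by a plain map instead of an on-disk index; the exact comparison used by the real writer (strictly greater than the threshold or not) is an assumption here.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy cache that switches from single lookups to one bulk read after N misses (hypothetical). */
final class ToyFillingCache {
    private final Map<String, Integer> store;                   // stands in for the on-disk index
    private final Map<String, Integer> cache = new HashMap<>();
    private final int cacheMissesUntilFill;
    private int cacheMisses = 0;
    private boolean filled = false;
    int singleReads = 0; // simulated per-category disk lookups
    int bulkReads = 0;   // simulated full scans of the taxonomy

    ToyFillingCache(Map<String, Integer> store, int cacheMissesUntilFill) {
        this.store = store;
        this.cacheMissesUntilFill = cacheMissesUntilFill;
    }

    /** Returns the ordinal of the path, or -1 if it is unknown. */
    int find(String path) {
        Integer hit = cache.get(path);
        if (hit != null) {
            return hit;
        }
        if (filled) {
            return -1; // the whole taxonomy is cached, so the path does not exist
        }
        cacheMisses++;
        if (cacheMisses > cacheMissesUntilFill) { // assumption: strictly more than the threshold
            cache.putAll(store); // one bulk read replaces all future single reads
            bulkReads++;
            filled = true;
            Integer afterFill = cache.get(path);
            return afterFill == null ? -1 : afterFill;
        }
        singleReads++; // below the threshold: fetch just this one category
        Integer stored = store.get(path);
        if (stored == null) {
            return -1;
        }
        cache.put(path, stored);
        return stored;
    }

    public static void main(String[] args) {
        Map<String, Integer> store = new HashMap<>();
        store.put("a", 1);
        store.put("b", 2);
        ToyFillingCache eager = new ToyFillingCache(store, 0);
        System.out.println(eager.find("b") + " singleReads=" + eager.singleReads
            + " bulkReads=" + eager.bulkReads);
    }
}
```

With a threshold of 0 the very first miss triggers the bulk read, so no category is ever fetched individually.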
-
perhapsFillCache
- Throws:
IOException
-
getTaxoArrays
- Throws:
IOException
-
getParent
Description copied from interface: TaxonomyWriter
getParent() returns the ordinal of the parent category of the category with the given ordinal.
When a category is specified as a path name, finding the path of its parent is as trivial as dropping the last component of the path. getParent() is functionally equivalent to calling getPath() on the given ordinal, dropping the last component of the path, and then calling getOrdinal() to get an ordinal back.
If the given ordinal is the ROOT_ORDINAL, an INVALID_ORDINAL is returned. If the given ordinal is a top-level category, the ROOT_ORDINAL is returned. If an invalid ordinal is given (negative or beyond the last available ordinal), an IndexOutOfBoundsException is thrown. However, it is expected that getParent will only be called for ordinals which are already known to be in the taxonomy.
TODO (Facet): instead of a getParent(ordinal) method, consider having a getCategory(categorypath, prefixlen) which is similar to addCategory except it doesn't add new categories; this method can be used to get the ordinals of all prefixes of the given category, and it can use exactly the same code and cache used by addCategory(), so it means less code.
- Specified by:
getParent in interface TaxonomyWriter
- Throws:
IOException
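The three cases described for getParent (root, top-level category, invalid ordinal) map naturally onto a parent array indexed by ordinal. The sketch below is a toy model: ToyParentArray is hypothetical, and the constant values 0 and -1 for the root and invalid ordinals are assumptions of the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy parent array indexed by ordinal (hypothetical; not the Lucene data structure). */
final class ToyParentArray {
    static final int ROOT_ORDINAL = 0;
    static final int INVALID_ORDINAL = -1; // assumption: value used for "no parent"

    private final int[] parents;

    ToyParentArray(int[] parents) {
        this.parents = parents.clone();
    }

    /** Returns the parent ordinal, or throws for ordinals outside the taxonomy. */
    int getParent(int ordinal) {
        if (ordinal < 0 || ordinal >= parents.length) {
            throw new IndexOutOfBoundsException("no such ordinal: " + ordinal);
        }
        return parents[ordinal];
    }

    /** Walks from the given ordinal up to the root, collecting every ancestor on the way. */
    List<Integer> ancestors(int ordinal) {
        List<Integer> result = new ArrayList<>();
        for (int p = getParent(ordinal); p != INVALID_ORDINAL; p = getParent(p)) {
            result.add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // ordinal 0 = root, 1 = "a", 2 = "a/b", 3 = "c"
        ToyParentArray taxonomy = new ToyParentArray(new int[] {INVALID_ORDINAL, 0, 1, 0});
        System.out.println(taxonomy.ancestors(2));
    }
}
```

Because a parent always has a smaller ordinal than its children, the upward walk in ancestors() is guaranteed to terminate at the root.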
-
addTaxonomy
public void addTaxonomy(Directory taxoDir, DirectoryTaxonomyWriter.OrdinalMap map) throws IOException
Takes the categories from the given taxonomy directory, and adds the missing ones to this taxonomy. Additionally, it fills the given DirectoryTaxonomyWriter.OrdinalMap with a mapping from the original ordinal to the new ordinal.
- Throws:
IOException
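What "a mapping from the original ordinal to the new ordinal" looks like can be shown with two small in-memory taxonomies. This is a sketch only: ToyTaxonomyMerge is hypothetical, each taxonomy is modeled as a list of path strings whose index is the ordinal, and the ordinal map is modeled as a plain int array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy merge of one taxonomy into another, producing an old-to-new ordinal map (hypothetical). */
final class ToyTaxonomyMerge {
    /**
     * Appends every source category that the destination lacks, and returns an array
     * where map[oldOrdinal] is the ordinal of the same category in the destination.
     */
    static int[] addTaxonomy(List<String> destination, List<String> source) {
        Map<String, Integer> ordinals = new HashMap<>();
        for (int i = 0; i < destination.size(); i++) {
            ordinals.put(destination.get(i), i);
        }
        int[] map = new int[source.size()];
        for (int oldOrdinal = 0; oldOrdinal < source.size(); oldOrdinal++) {
            String path = source.get(oldOrdinal);
            Integer newOrdinal = ordinals.get(path);
            if (newOrdinal == null) {
                newOrdinal = destination.size(); // category is missing: append it
                destination.add(path);
                ordinals.put(path, newOrdinal);
            }
            map[oldOrdinal] = newOrdinal;
        }
        return map;
    }

    public static void main(String[] args) {
        // Index in each list is the ordinal; "" is the root at ordinal 0.
        List<String> destination = new ArrayList<>(Arrays.asList("", "a", "a/b"));
        List<String> source = Arrays.asList("", "c", "a", "c/d");
        int[] map = addTaxonomy(destination, source);
        System.out.println(Arrays.toString(map) + " " + destination);
    }
}
```

The resulting array is what a caller would use to rewrite ordinals stored in the search index that belonged to the source taxonomy, so that they point at the right categories after the merge.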
-
rollback
Rollback changes to the taxonomy writer and closes the instance. Following this method the instance becomes unusable (calling any of its API methods will yield an AlreadyClosedException).
- Specified by:
rollback in interface TwoPhaseCommit
- Throws:
IOException
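The "unusable after rollback" contract can be sketched with a guard method that every public operation calls first. The model below is hypothetical: ToyClosableWriter is not a Lucene class, and IllegalStateException stands in for AlreadyClosedException so that the example needs only the standard library.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy writer that rejects every call once it has been rolled back (hypothetical). */
final class ToyClosableWriter {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();
    private volatile boolean isClosed = false;

    /** Guard used by every operation; IllegalStateException is a stand-in exception type. */
    private void ensureOpen() {
        if (isClosed) {
            throw new IllegalStateException("this writer is closed");
        }
    }

    void add(String category) {
        ensureOpen();
        pending.add(category);
    }

    void commit() {
        ensureOpen();
        committed.addAll(pending);
        pending.clear();
    }

    /** Discards uncommitted changes and closes the instance for good. */
    void rollback() {
        ensureOpen();
        pending.clear();
        isClosed = true;
    }

    /** Returns a copy of what has been committed so far. */
    List<String> committed() {
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        ToyClosableWriter writer = new ToyClosableWriter();
        writer.add("a");
        writer.commit();
        writer.add("b");
        writer.rollback(); // "b" is discarded and the writer is closed
        System.out.println(writer.committed());
    }
}
```

After rollback() the only safe thing to do with such an instance is to drop it and open a new writer.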
-
replaceTaxonomy
Replaces the current taxonomy with the given one. This method should generally be called in conjunction with IndexWriter.addIndexes(Directory...) to replace both the taxonomy and the search index content.
- Throws:
IOException
-
deleteAll
Delete the taxonomy and reset all state for this writer.
To keep using the same main index, you would have to regenerate the taxonomy, taking care that ordinals are indexed in the same order as before. An example of this can be found in ReindexingEnrichedDirectoryTaxonomyWriter.reindexWithNewOrdinalData(BiConsumer).
- Throws:
IOException
-
getDirectory
Returns the Directory of this taxonomy writer.
-
getInternalIndexWriter
Used by DirectoryTaxonomyReader to support NRT.
NOTE: you should not use the obtained IndexWriter in any way, other than opening an IndexReader on it, or otherwise, the taxonomy index may become corrupt!
-
getTaxonomyEpoch
public final long getTaxonomyEpoch()
Expert: returns current index epoch, if this is a near-real-time reader. Used by DirectoryTaxonomyReader to support NRT.
-