java.lang.Object
org.apache.lucene.sandbox.codecs.quantization.KMeans
KMeans clustering algorithm for vectors
-
Nested Class Summary
Nested Classes
static enum KMeans.KmeansInitializationMethod
  Kmeans initialization methods
static final record KMeans.Results
  Results of KMeans clustering
-
Field Summary
Fields
static final int DEFAULT_ITRS
static final int DEFAULT_RESTARTS
static final int DEFAULT_SAMPLE_SIZE
private final KMeans.KmeansInitializationMethod initializationMethod
private final int iters
static final int MAX_NUM_CENTROIDS
private final int numCentroids
private final int numVectors
private final Random random
private final int restarts
private final FloatVectorValues vectors
-
Constructor Summary
Constructors
private KMeans(FloatVectorValues vectors, int numCentroids, Random random, KMeans.KmeansInitializationMethod initializationMethod, int restarts, int iters)
-
Method Summary
(package private) static void assignCentroids(FloatVectorValues vectors, float[][] centroids, List<Integer> unassignedCentroidsIdxs)
  For centroids that did not get any points, assign outlying points to them, choosing points by descending distance to the current centroid set
static KMeans.Results cluster(FloatVectorValues vectors, int numClusters, boolean assignCentroidsToVectors, long seed, KMeans.KmeansInitializationMethod initializationMethod, boolean normalizeCenters, int restarts, int iters, int sampleSize)
  Expert: Cluster vectors into a given number of clusters
static KMeans.Results cluster(FloatVectorValues vectors, VectorSimilarityFunction similarityFunction, int numClusters)
  Cluster vectors into a given number of clusters
private float[][] computeCentroids(boolean normalizeCenters)
private float[][] initializeForgy
  Initialize centroids using Forgy method: randomly select numCentroids vectors for initial centroids
private float[][] initializePlusPlus
  Initialize centroids using Kmeans++ method
private float[][] initializeReservoirSampling
  Initialize centroids using a reservoir sampling method
private static double runKMeansStep(FloatVectorValues vectors, float[][] centroids, short[] docCentroids, boolean useKahanSummation, boolean normalizeCentroids)
  Run kmeans step
-
Field Details
-
MAX_NUM_CENTROIDS
public static final int MAX_NUM_CENTROIDS
-
DEFAULT_RESTARTS
public static final int DEFAULT_RESTARTS
-
DEFAULT_ITRS
public static final int DEFAULT_ITRS
-
DEFAULT_SAMPLE_SIZE
public static final int DEFAULT_SAMPLE_SIZE
-
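vectors
private final FloatVectorValues vectors
-
numVectors
private final int numVectors
-
numCentroids
private final int numCentroids
-
random
private final Random random
-
initializationMethod
private final KMeans.KmeansInitializationMethod initializationMethod
-
restarts
private final int restarts
-
iters
private final int iters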
-
-
Constructor Details
-
KMeans
private KMeans(FloatVectorValues vectors, int numCentroids, Random random, KMeans.KmeansInitializationMethod initializationMethod, int restarts, int iters)
-
-
Method Details
-
cluster
public static KMeans.Results cluster(FloatVectorValues vectors, VectorSimilarityFunction similarityFunction, int numClusters) throws IOException
Cluster vectors into a given number of clusters
Parameters:
vectors - float vectors
similarityFunction - vector similarity function. For COSINE similarity, vectors must be normalized.
numClusters - number of clusters to partition the vectors into
Returns:
results of clustering: the produced centroids and, for each vector, its centroid
Throws:
IOException - if there is an error accessing vectors
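The requirement that vectors be normalized for COSINE similarity can be met with a small preprocessing step. The following is a minimal sketch on plain float[] arrays; the class and method names (UnitNormSketch, l2Normalize) are hypothetical and are not part of the KMeans API.

```java
final class UnitNormSketch {
  /**
   * Returns a unit-length (L2-normalized) copy of v.
   * A zero vector has no direction, so it is returned unchanged (as a copy).
   */
  static float[] l2Normalize(float[] v) {
    double sumSq = 0;
    for (float x : v) {
      sumSq += (double) x * x; // accumulate in double to limit rounding error
    }
    float[] out = v.clone();
    if (sumSq == 0) {
      return out;
    }
    float inv = (float) (1.0 / Math.sqrt(sumSq));
    for (int i = 0; i < out.length; i++) {
      out[i] *= inv;
    }
    return out;
  }
}
```

Each vector would be normalized this way before being wrapped in whatever FloatVectorValues instance is passed to cluster.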
-
cluster
public static KMeans.Results cluster(FloatVectorValues vectors, int numClusters, boolean assignCentroidsToVectors, long seed, KMeans.KmeansInitializationMethod initializationMethod, boolean normalizeCenters, int restarts, int iters, int sampleSize) throws IOException
Expert: Cluster vectors into a given number of clusters
Parameters:
vectors - float vectors
numClusters - number of clusters to partition the vectors into
assignCentroidsToVectors - if true, assign centroids for all vectors. Centroids are computed on a sample of vectors. If this parameter is true, the results also report, for every vector, the centroid it belongs to.
seed - random seed
initializationMethod - Kmeans initialization method
normalizeCenters - for cosine distance, set to true to use spherical k-means, where centers are normalized
restarts - how many times to run the Kmeans algorithm
iters - how many iterations to do within a single run
sampleSize - size of the sample, selected from all vectors, on which to run the Kmeans algorithm
Returns:
results of clustering: the produced centroids and, if assignCentroidsToVectors == true, also the centroid of each vector
Throws:
IOException - if there is an error accessing vectors
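Because centroids are computed on a sample, assigning every vector to a centroid is a separate pass over the full data set. The sketch below illustrates that pass on plain arrays, assuming squared Euclidean distance; NearestCentroidSketch and assignAll are hypothetical names, and the short[] result only mirrors the docCentroids type seen in runKMeansStep.

```java
final class NearestCentroidSketch {
  /** Squared Euclidean distance between two vectors of equal length. */
  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** For each vector, returns the index of its nearest centroid. */
  static short[] assignAll(float[][] vectors, float[][] centroids) {
    short[] assignments = new short[vectors.length];
    for (int i = 0; i < vectors.length; i++) {
      int best = 0;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        double d = squaredDistance(vectors[i], centroids[c]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      // Assumes the number of centroids fits in a short.
      assignments[i] = (short) best;
    }
    return assignments;
  }
}
```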
-
computeCentroids
private float[][] computeCentroids(boolean normalizeCenters) throws IOException
Throws:
IOException
-
initializeForgy
Initialize centroids using Forgy method: randomly select numCentroids vectors for initial centroids
Throws:
IOException
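As an illustration of the Forgy method described above, the following sketch picks k distinct input vectors uniformly at random with a partial Fisher-Yates shuffle. It works on an in-memory float[][] rather than FloatVectorValues, and ForgyInitSketch and init are hypothetical names, not the private method documented here.

```java
import java.util.Random;

final class ForgyInitSketch {
  /** Selects k distinct vectors at random and returns copies of them as initial centroids. */
  static float[][] init(float[][] vectors, int k, Random random) {
    if (k > vectors.length) {
      throw new IllegalArgumentException("k must not exceed the number of vectors");
    }
    int[] idx = new int[vectors.length];
    for (int i = 0; i < idx.length; i++) {
      idx[i] = i;
    }
    float[][] centroids = new float[k][];
    for (int i = 0; i < k; i++) {
      // Swap a random not-yet-chosen index into position i (partial shuffle).
      int j = i + random.nextInt(idx.length - i);
      int tmp = idx[i];
      idx[i] = idx[j];
      idx[j] = tmp;
      centroids[i] = vectors[idx[i]].clone();
    }
    return centroids;
  }
}
```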
-
initializeReservoirSampling
Initialize centroids using a reservoir sampling method
Throws:
IOException
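Reservoir sampling selects k items in a single pass without knowing the total count in advance. The sketch below uses the classic "Algorithm R" variant, which is an assumption: the documentation does not say which variant the class uses. ReservoirInitSketch and init are hypothetical names.

```java
import java.util.Random;

final class ReservoirInitSketch {
  /** Single-pass uniform sample of k vectors (Algorithm R), returned as initial centroids. */
  static float[][] init(Iterable<float[]> vectors, int k, Random random) {
    float[][] reservoir = new float[k][];
    int seen = 0;
    for (float[] v : vectors) {
      if (seen < k) {
        // Fill the reservoir with the first k vectors.
        reservoir[seen] = v.clone();
      } else {
        // Replace a random slot with probability k / (seen + 1).
        int j = random.nextInt(seen + 1);
        if (j < k) {
          reservoir[j] = v.clone();
        }
      }
      seen++;
    }
    if (seen < k) {
      throw new IllegalArgumentException("fewer than k vectors were supplied");
    }
    return reservoir;
  }
}
```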
-
initializePlusPlus
Initialize centroids using Kmeans++ method
Throws:
IOException
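For readers unfamiliar with k-means++ seeding, here is a minimal sketch of the standard technique: the first centroid is chosen uniformly, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. It operates on plain arrays; PlusPlusInitSketch and init are hypothetical names and may differ in detail from the private method documented here.

```java
import java.util.Arrays;
import java.util.Random;

final class PlusPlusInitSketch {
  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Standard k-means++ seeding with D^2 weighting. */
  static float[][] init(float[][] vectors, int k, Random random) {
    int n = vectors.length;
    float[][] centroids = new float[k][];
    centroids[0] = vectors[random.nextInt(n)].clone();
    // minDist[i] = squared distance from vector i to its nearest chosen centroid.
    double[] minDist = new double[n];
    Arrays.fill(minDist, Double.POSITIVE_INFINITY);
    for (int c = 1; c < k; c++) {
      double total = 0;
      for (int i = 0; i < n; i++) {
        minDist[i] = Math.min(minDist[i], squaredDistance(vectors[i], centroids[c - 1]));
        total += minDist[i];
      }
      // Draw an index with probability minDist[i] / total.
      double r = random.nextDouble() * total;
      double acc = 0;
      int chosen = n - 1; // fallback when all remaining distances are zero
      for (int i = 0; i < n; i++) {
        acc += minDist[i];
        if (acc > r) {
          chosen = i;
          break;
        }
      }
      centroids[c] = vectors[chosen].clone();
    }
    return centroids;
  }
}
```

Compared with Forgy initialization, this weighting tends to spread the initial centroids apart, at the cost of one extra pass over the vectors per centroid.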
-
runKMeansStep
private static double runKMeansStep(FloatVectorValues vectors, float[][] centroids, short[] docCentroids, boolean useKahanSummation, boolean normalizeCentroids) throws IOException
Run kmeans step
Parameters:
vectors - float vectors
centroids - centroids; newly calculated centroids are written here
docCentroids - for each document, which centroid it belongs to; results are written here
useKahanSummation - for large datasets, use Kahan summation to calculate centroids, since we can easily reach the limits of float precision
normalizeCentroids - whether centroids should be normalized; used for cosine similarity only
Throws:
IOException - if there is an error accessing vector values
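The sketch below shows what one k-means (Lloyd) step with Kahan-compensated float sums might look like on plain arrays. Several things are assumptions: the names KMeansStepSketch and step are hypothetical, distance is squared Euclidean, the return value is taken to be the sum of squared distances (the documentation only says the method returns a double), and empty centroids are simply left unchanged.

```java
final class KMeansStepSketch {
  /**
   * Assigns each vector to its nearest centroid, then overwrites centroids with
   * the mean of their assigned vectors. Returns the sum of squared distances
   * measured against the centroids as they were before the update.
   */
  static double step(float[][] vectors, float[][] centroids, short[] assignments) {
    int k = centroids.length;
    int dim = centroids[0].length;
    float[][] sums = new float[k][dim];
    float[][] comp = new float[k][dim]; // Kahan compensation terms
    int[] counts = new int[k];
    double sumSquaredDist = 0;
    for (int i = 0; i < vectors.length; i++) {
      float[] v = vectors[i];
      int best = 0;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < k; c++) {
        double d = 0;
        for (int j = 0; j < dim; j++) {
          double diff = v[j] - centroids[c][j];
          d += diff * diff;
        }
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      assignments[i] = (short) best;
      counts[best]++;
      sumSquaredDist += bestDist;
      for (int j = 0; j < dim; j++) {
        // Kahan summation: carry the low-order bits lost by the float addition.
        float y = v[j] - comp[best][j];
        float t = sums[best][j] + y;
        comp[best][j] = (t - sums[best][j]) - y;
        sums[best][j] = t;
      }
    }
    for (int c = 0; c < k; c++) {
      if (counts[c] == 0) {
        continue; // empty centroid: left as-is in this sketch
      }
      for (int j = 0; j < dim; j++) {
        centroids[c][j] = sums[c][j] / counts[c];
      }
    }
    return sumSquaredDist;
  }
}
```

The compensation array doubles the accumulator memory, which is presumably why the documented method makes Kahan summation optional and recommends it only for large datasets.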
-
assignCentroids
static void assignCentroids(FloatVectorValues vectors, float[][] centroids, List<Integer> unassignedCentroidsIdxs) throws IOException
For centroids that did not get any points, assign outlying points to them, choosing points by descending distance to the current centroid set
Throws:
IOException
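One way to read the description above is sketched here: rank the vectors by distance to their nearest centroid, farthest first, and move each empty centroid onto the next outlier in that ranking. This is an interpretation, not the documented implementation. In particular, measuring distance only against centroids that did receive points is an assumption, and EmptyCentroidSketch and reassign are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EmptyCentroidSketch {
  /** Moves each empty centroid onto one of the vectors farthest from the non-empty centroids. */
  static void reassign(float[][] vectors, float[][] centroids, List<Integer> emptyIdxs) {
    Set<Integer> empty = new HashSet<>(emptyIdxs);
    int n = vectors.length;
    // nearest[i] = squared distance from vector i to its closest non-empty centroid.
    double[] nearest = new double[n];
    for (int i = 0; i < n; i++) {
      double best = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        if (empty.contains(c)) {
          continue;
        }
        double d = 0;
        for (int j = 0; j < vectors[i].length; j++) {
          double diff = vectors[i][j] - centroids[c][j];
          d += diff * diff;
        }
        best = Math.min(best, d);
      }
      nearest[i] = best;
    }
    // Order vector indices by descending distance (outliers first).
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (x, y) -> Double.compare(nearest[y], nearest[x]));
    int next = 0;
    for (int c : emptyIdxs) {
      if (next >= n) {
        break; // more empty centroids than vectors
      }
      centroids[c] = vectors[order[next++]].clone();
    }
  }
}
```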
-