embedding2d
Stages of the embedding2d:* type lay out entities, such as documents or labels, in 2D space, putting similar entities close to each other. You can use the output of the 2D embedding stages to create 2D thematic "maps" of large numbers of documents or labels.
Lingo4G 2d embedding stages take matrix:* on input rather than lists of documents or labels. Therefore, you can use the same 2d embedding algorithm, such as embedding2d:lv, to process either documents or labels. It is the input similarity matrix that defines the entities to embed and the similarity function.
The Similarity matrices tutorial contains step-by-step guides to different similarity functions available in Lingo4G and their applications to 2d mapping of labels and documents.
You can use the following 2D embedding stages in your analysis requests:
- embedding2d:lv
  Computes a 2D embedding based on the similarity matrix you provide. Uses an optimized version of the LargeVis algorithm.
- embedding2d:lvOverlay
  Overlays new points on top of an existing 2D embedding you provide. You can use this stage to, for example, put labels on top of a 2D map of documents. Uses an optimized version of the LargeVis algorithm.
- embedding2d:transferred
  Copies 2D coordinates from one list of entities, such as documents or labels, to another. You can use this stage to generate a time series of 2D embeddings.
- embedding2d:reference
  References the results of another embedding2d:* stage defined in the request.
The JSON output of the embedding2d:* stages has the following structure:
The result object contains a single property: the points array containing the coordinates of the 2d embedding. The points array is aligned with the input square matrix: it has the same size, and the coordinates at index i correspond to the i-th row of the input matrix.
Note that for certain points the coordinates may be empty. Your code should ignore such points when visualizing the 2d embedding.
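As a minimal sketch of how client code might skip empty coordinates, the Python snippet below filters a points array before plotting. The exact JSON shape of a point is not shown in this section, so the snippet assumes each entry is either null or an object with numeric x and y properties. The helper name plottable_points is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def plottable_points(points: Sequence[Optional[dict]]) -> List[Tuple[int, float, float]]:
    """Return (row_index, x, y) for every point that has coordinates."""
    kept = []
    for index, point in enumerate(points):
        # Assumption: an empty point is null (None) or lacks x / y values.
        if not point or point.get("x") is None or point.get("y") is None:
            continue
        kept.append((index, point["x"], point["y"]))
    return kept


# Hypothetical stage result with one empty point at index 1.
result = {"points": [{"x": 0.5, "y": -1.25}, None, {"x": 2.0, "y": 3.0}]}
kept = plottable_points(result["points"])
print(kept)
```

Keeping the row index next to each coordinate preserves the alignment with the input matrix, so you can still look up the document or label behind every plotted point.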
embedding2d:​lv
Computes a 2D embedding based on the similarity matrix you provide. This stage uses an optimized version of the LargeVis algorithm.
{
  "type": "embedding2d:lv",
  "initial": null,
  "initializedLearningRate": 0.02,
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  },
  "maxIterations": 300,
  "negativeEdgeCount": 5,
  "negativeEdgeDenominator": 1,
  "negativeEdgeWeight": 2,
  "threads": "auto"
}
Characteristics
The embedding2d:lv stage embeds the input entities, such as documents or labels, in a 2-dimensional space. As a result, you get the input items laid out on a 2d map in such a way that similar items are close to each other and dissimilar items are spatially separated.
Examples
The following request computes a 2d embedding of labels based on how they co-occur in a set of documents.
The documents stage selects documents containing the word clustering and passes them to the labels stage, which collects the top 500 labels occurring in those documents. Finally, the embedding2d stage computes a 2d map of the labels based on how the labels co-occur in the documents.
Note that in the above request you can replace the matrix:cooccurrenceLabelSimilarity stage with matrix:knnVectorsSimilarity, which uses multidimensional embeddings to compute more "semantic" similarities between labels.
The following request computes a 2d embedding for a set of documents containing the clustering query:
In the above request, the documents stage selects documents that match the clustering query and the embedding2d stage uses embedding2d:lv to compute the 2d embedding. As with 2d embeddings for labels, you can replace matrix:keywordDocumentSimilarity with matrix:knnVectorsSimilarity to use multidimensional embedding vectors for similarity computation. The tuning section explains how the characteristics of the 2d embeddings depend on the underlying similarity matrix computation method.
You can extend both requests by adding an overlay 2d embedding of labels, so that the labels describe the densely populated areas of the document maps.
Tuning
The output of the embedding2d:lv stage depends not only on the parameters you choose, but also on the type and density of the similarity matrix you provide on input. The following subsections offer some advice for a number of typical 2d embedding tuning scenarios.
Layout characteristics
The primary layout property you can control is the number and size of spatial clusters of the 2d points.
- To get a layout with a small number of large clusters:
  - Use the matrix:knnVectorsSimilarity stage as the source of the input similarity matrix.
  - Increase negativeEdgeDenominator, possibly up to a value of 10.
  - Decrease negativeEdgeCount to 3 or 2.
- To get a layout with a large number of small clusters:
  - Use the matrix:keywordDocumentSimilarity stage as the source of the input similarity matrix.
  - Increase negativeEdgeCount, possibly up to 10 or more.
  - Decrease negativeEdgeDenominator down to the 0.2-0.5 range.
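The two recipes above can be captured as parameter overrides on top of the default embedding2d:lv stage. The following Python sketch is only an illustration: the preset names and the helper lv_stage are hypothetical, and the specific values (10, 3, 10, 0.3) are picked from the ranges suggested above rather than prescribed by Lingo4G. The choice of the matrix stage is left as a comment because its properties are not covered in this section.

```python
import copy
import json

# Defaults copied from the embedding2d:lv stage definition in this section.
DEFAULT_LV_STAGE = {
    "type": "embedding2d:lv",
    "initial": None,
    "initializedLearningRate": 0.02,
    "matrix": {"type": "matrix:reference", "auto": True},
    "maxIterations": 300,
    "negativeEdgeCount": 5,
    "negativeEdgeDenominator": 1,
    "negativeEdgeWeight": 2,
    "threads": "auto",
}

# Hypothetical preset names; values chosen from the ranges given above.
LAYOUT_PRESETS = {
    # Pair with a matrix:knnVectorsSimilarity input matrix.
    "few_large_clusters": {"negativeEdgeDenominator": 10, "negativeEdgeCount": 3},
    # Pair with a matrix:keywordDocumentSimilarity input matrix.
    "many_small_clusters": {"negativeEdgeDenominator": 0.3, "negativeEdgeCount": 10},
}


def lv_stage(preset: str) -> dict:
    """Return a copy of the default stage with the preset's overrides applied."""
    stage = copy.deepcopy(DEFAULT_LV_STAGE)
    stage.update(LAYOUT_PRESETS[preset])
    return stage


# json.dumps converts None to null and True to true, matching the JSON above.
print(json.dumps(lv_stage("few_large_clusters"), indent=2))
```

Treat both presets as starting points and adjust them after inspecting the resulting map.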
Embedding time
The time required to compute a 2d embedding for a specific input depends on the following factors:
- Number of elements in the input similarity matrix. The larger or denser the matrix, the longer it takes to compute the 2d embedding. Reducing the density of the input matrix speeds up processing but also creates a larger number of smaller spatial clusters on the 2d map.
  Alternatively, when you intend to embed a large number of documents only to get an overview of the topics they cover, you can speed up processing by taking a random sample of the input collection.
- Range of repulsion between dissimilar points. You can decrease the embedding time by lowering the negativeEdgeCount parameter at the cost of lower quality of the 2d embedding.
- Number of 2d embedding iterations. Embedding time depends on the number of processing iterations Lingo4G performs when processing the input matrix. Lowering maxIterations decreases the embedding time at the cost of lower embedding quality.
initial
The initial coordinates to use when computing the 2d embedding.
If initial is not null, Lingo4G uses the 2d coordinates you provide to initialize the computation of the 2d embedding in this stage. The number of points in the initial embedding must be the same as the size of the input matrix.
Some points in the initial embedding may be null. For each null point in the initial embedding, Lingo4G initializes the corresponding 2d embedding point with a random value.
If initial is null, Lingo4G initializes all 2d embedding points with random values.
You can use the initial property to build a time series of embeddings where each step replaces a portion of old data points with new ones and uses the initial property to keep the coordinates of the remaining points unchanged.
initialized​Learning​Rate
Learning rate multiplier for the explicitly initialized points.
If you use the initial property to explicitly initialize the position of certain embedding points, for those points Lingo4G uses the learning rate you provide in the initializedLearningRate property.
Typically, the learning rate for the explicitly initialized points is smaller than 1.0 to ensure that the coordinates of those points are more stable during the computation.
matrix
The similarities to use to compute the 2d embedding.
max​Iterations
The maximum number of embedding learning iterations to perform.
For inputs larger than 20k data points, you may need to increase the maximum number of iterations to improve the quality of the 2d embedding.
Embedding computation time depends linearly on the maxIterations property.
negative​Edge​Count
Range of repulsion between dissimilar data points.
To accurately embed a set of data points, Lingo4G must take into account not only each point's similar points, but also a sample of the dissimilar ones. This property determines how many dissimilar pairs of points to consider for each pair of similar points defined by the matrix.
Values lower than 5 speed up processing, but may produce poorly clustered maps. Values larger than 15 may lead to poorly shaped maps with many ill-positioned documents. The larger the repulsion range, the longer the embedding computation time.
negative​Edge​Denominator
Determines the strength of clustering of points on the 2d map.
The larger the denominator, the more tightly packed the groups of 2d points on the map and the larger the empty spaces between the groups.
negative​Edge​Weight
Strength of repulsion between dissimilar documents.
When you change negativeEdgeCount, for example to speed up processing, adjust negativeEdgeWeight so that the product of the two properties remains similar.
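For example, the defaults shown for this stage are negativeEdgeCount = 5 and negativeEdgeWeight = 2, so their product is 10. The Python sketch below, with a hypothetical helper name, computes the weight that keeps that product exactly constant. The text above only asks for the product to remain similar, so treat the result as a starting point rather than a required value.

```python
def rebalanced_weight(old_count: int, old_weight: float, new_count: int) -> float:
    """Return the negativeEdgeWeight that keeps count * weight unchanged."""
    if new_count <= 0:
        raise ValueError("new_count must be positive")
    return old_count * old_weight / new_count


# Defaults from this section: negativeEdgeCount = 5, negativeEdgeWeight = 2.
# Lowering the count to 4 calls for a weight of 10 / 4.
new_weight = rebalanced_weight(old_count=5, old_weight=2, new_count=4)
print(new_weight)  # 2.5
```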
threads
The number of concurrent threads to use to compute the 2d embedding.
embedding2d:​lv​Overlay
Overlays new points on top of an existing 2D embedding you provide. You can use this stage to, for example, put labels on top of a 2D map of documents. Uses an optimized version of the LargeVis algorithm.
{
  "type": "embedding2d:lvOverlay",
  "embedding2d": {
    "type": "embedding2d:reference",
    "auto": true
  },
  "initial": null,
  "initializedLearningRate": 0.02,
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  },
  "maxIterations": 300,
  "negativeEdgeCount": 5,
  "negativeEdgeDenominator": 1,
  "negativeEdgeWeight": 2,
  "threads": "auto"
}
To create a 2d embedding overlay, you need the following pieces of data:
- The reference 2d embedding, which you provide in the embedding2d property. The reference embedding can be, for example, a 2d embedding of documents or labels produced by the embedding2d:lv stage. The embedding2d:lvOverlay stage does not modify the reference embedding; it uses the embedding to guide the location of new 2d points.
- Similarity matrix, which you provide in the matrix property. Columns of the similarity matrix must correspond to the reference 2d embedding. Rows of the matrix correspond to new data points you overlay on top of the reference embedding.
The following request overlays labels on top of a 2d document embedding to describe the densely populated areas of the document embedding.
The documents stage selects the top 10k documents matching the clustering query and the labels stage collects the 250 most frequent labels in those documents. The labelEmbedding2dOverlay uses the embedding2d:lvOverlay stage to compute the 2d overlay of labels on top of the 2d document embedding. It uses the matrix:keywordLabelDocumentSimilarity matrix that computes label-to-document similarities.
embedding2d
The reference 2d embedding on top of which to overlay new points.
The number of points in the 2d embedding you supply in this property must be equal to the number of columns in the similarity matrix.
initial
The initial coordinates to use when computing the 2d embedding.
This property is the same as the analogous property in the embedding2d:lv stage. See embedding2d:lv.initial for more details.
initialized​Learning​Rate
Learning rate multiplier for the explicitly initialized points.
This property is the same as the analogous property in the embedding2d:lv stage. See embedding2d:lv.initializedLearningRate for more details.
matrix
The matrix of similarities between new points to overlay on top of the reference embedding and the existing points in the reference embedding you provide.
Rows of the matrix you provide correspond to the new data points to overlay on top of the reference embedding. Columns of the matrix correspond to the points of the reference embedding. Therefore, the number of columns in the matrix must be equal to the number of points in the reference 2d embedding you provide.
If you overlay labels on top of a 2d document embedding, you can use the matrix:keywordLabelDocumentSimilarity matrix that computes label-to-document similarities.
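The dimension requirement described above can be checked before you send a request. The following is a minimal Python sketch, assuming you know the matrix shape and the size of the reference embedding on the client side. The function name is hypothetical, and the numbers mirror the label overlay example in this section (250 labels, 10k documents).

```python
def check_overlay_shape(matrix_rows: int, matrix_columns: int, reference_points: int) -> None:
    """Raise ValueError if the matrix columns do not match the reference embedding."""
    if matrix_rows < 0 or matrix_columns < 0 or reference_points < 0:
        raise ValueError("sizes must be non-negative")
    if matrix_columns != reference_points:
        raise ValueError(
            f"matrix has {matrix_columns} columns but the reference embedding "
            f"has {reference_points} points"
        )


# 250 labels (rows) overlaid on a 2d embedding of 10,000 documents (columns).
check_overlay_shape(matrix_rows=250, matrix_columns=10_000, reference_points=10_000)
print("overlay inputs are consistent")
```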
max​Iterations
The maximum number of embedding learning iterations to perform.
This property is the same as the analogous property in the embedding2d:lv stage. See embedding2d:lv.maxIterations for more details.
negative​Edge​Count
Range of repulsion between dissimilar data points.
This property is the same as the analogous property in the embedding2d:lv stage. See embedding2d:lv.negativeEdgeCount for more details.
negative​Edge​Denominator
Determines the strength of clustering of points on the 2d map.
This property is the same as the analogous property in the embedding2d:lv stage. See embedding2d:lv.negativeEdgeDenominator for more details.
negative​Edge​Weight
Strength of repulsion between dissimilar documents.
This property is the same as the analogous property in the embedding2d:lv stage. See embedding2d:lv.negativeEdgeWeight for more details.
threads
The number of concurrent threads to use to compute the 2d embedding overlay.
embedding2d:​transferred
Copies 2D coordinates from one list of entities, such as documents or labels, to another.
{
  "type": "embedding2d:transferred",
  "embedding2d": null,
  "source": null,
  "target": null
}
Lingo4G computes the result of this stage in the following way:
- Create a 2d embedding of size equal to the number of target documents or labels and set all points to null.
- For each entity in the target list: if the source list contains the same entity, copy the corresponding coordinate from the input embedding2d to the output embedding. If the source list does not contain the entity, keep the output embedding point equal to null.
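The two steps above can be sketched in Python as follows. This is an illustration of the described procedure, not Lingo4G's implementation. It assumes entities are represented by unique hashable identifiers and points by (x, y) tuples, and the function name is hypothetical.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def transfer_embedding(
    embedding: Sequence[Optional[Point]],
    source: Sequence[Hashable],
    target: Sequence[Hashable],
) -> List[Optional[Point]]:
    """Copy coordinates of entities shared by source and target; others stay None."""
    if len(embedding) != len(source):
        raise ValueError("embedding and source must have the same length")

    # Position of each source entity in the input embedding.
    source_index = {entity: i for i, entity in enumerate(source)}

    # Step 1: output embedding of target size, all points null.
    output: List[Optional[Point]] = [None] * len(target)

    # Step 2: copy coordinates for target entities present in source.
    for i, entity in enumerate(target):
        if entity in source_index:
            output[i] = embedding[source_index[entity]]
    return output


# Hypothetical document identifiers.
source = ["doc-1", "doc-2", "doc-3"]
embedding = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]
target = ["doc-3", "doc-9", "doc-1"]
transferred = transfer_embedding(embedding, source, target)
print(transferred)  # [(4.0, 5.0), None, (0.0, 1.0)]
```

The resulting list has one entry per target entity, with None for entities that do not appear in the source list.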
embedding2d
The 2d embedding from which to copy coordinates.
source
The list of labels or documents that gave rise to the input embedding.
The number of documents or labels on the list must be equal to the number of points in the input embedding2d.
target
The list of documents or labels against which to transfer the input embedding.
The size of the output embedding is equal to the size of the target list of entities.
Consumers of embedding2d:*
The following stages and components take embedding2d:* as input:
Stage or component | Property |
---|---|
embedding2d:​lv​Overlay | embedding2d |
embedding2d:​transferred | embedding2d |
matrix:​knn2d​Distance​Similarity | embedding2d |