embedding2d

Stages of the embedding2d:​* type lay out entities, such as documents or labels, in 2D space, putting similar entities close to each other. You can use the output of the 2D embedding stages to create 2D thematic "maps" of large numbers of documents or labels.

Lingo4G 2d embedding stages take matrix:* as input rather than lists of documents or labels. Therefore, you can use the same 2d embedding algorithm, such as embedding2d:lv, to process either documents or labels. It is the input similarity matrix that defines the entities to embed and the similarity function.


You can use the following 2D embedding stages in your analysis requests:

embedding2d:​lv

Computes a 2D embedding based on the similarity matrix you provide. Uses an optimized version of the LargeVis algorithm.

embedding2d:​lv​Overlay

Overlays new points on top of an existing 2D embedding you provide. You can use this stage to, for example, put labels on top of a 2D map of documents. Uses an optimized version of the LargeVis algorithm.

embedding2d:​transferred

Copies 2D coordinates from one list of entities, such as documents or labels, to another. You can use this stage to generate a time series of 2D embeddings.


embedding2d:​reference

References the results of another embedding2d:​* stage defined in the request.


The JSON output of the embedding2d:​* stages has the following structure:

{
  "result" : {
    "embedding2d" : {
      "points" : [
        {
          "x" : -4.864052,
          "y" : -3.4744496
        },
        {
          "x" : 1.795944,
          "y" : -0.109130695
        },
        {
          "x" : 1.3687968,
          "y" : -0.1770406
        },
        {
          "x" : 1.5354644,
          "y" : -0.15393952
        },
        {
          "x" : 2.1472104,
          "y" : 0.0037807915
        },
        {
          "x" : 1.5754758,
          "y" : -0.14713359
        },
        {
          "x" : 4.2649994E-5,
          "y" : -6.257664E-5
        }
      ]
    }
  }
}

The result object contains a single property: the points array, which holds the coordinates of the 2d embedding. The points array is aligned with the input square matrix: it has the same size, and the coordinates at index i correspond to the i-th row of the input matrix.

Note that the coordinates of some points may be empty. Your code should skip such points when visualizing the 2d embedding.
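
As an illustration of skipping empty points while keeping the row alignment, here is a minimal Python sketch. It assumes that an empty point appears in the JSON as null or as an object without x and y values, and that the points sit under result.embedding2d as in the sample output above. The function name valid_points is hypothetical and not part of Lingo4G.

```python
import json


def valid_points(response_json, key="embedding2d"):
    """Return (index, x, y) tuples for the points that carry coordinates.

    Assumptions (not confirmed by the documentation): an empty point is
    either JSON null or an object without "x" and "y" values, and the
    points array sits under result[key], as in the sample output.
    """
    response = json.loads(response_json)
    points = response["result"][key]["points"]
    kept = []
    for index, point in enumerate(points):
        if not point:
            continue  # null or {} entry: nothing to draw
        x, y = point.get("x"), point.get("y")
        if x is None or y is None:
            continue  # incomplete coordinates: skip as well
        # The index is kept because it identifies the matrix row.
        kept.append((index, x, y))
    return kept


sample = """
{"result": {"embedding2d": {"points": [
  {"x": -4.864052, "y": -3.4744496},
  null,
  {"x": 4.2649994E-5, "y": -6.257664E-5}
]}}}
"""

for index, x, y in valid_points(sample):
    print(index, x, y)
```

Returning the original index along with the coordinates lets the caller look up the matching document or label even after the empty points are dropped.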

embedding2d:​lv

Computes a 2D embedding based on the similarity matrix you provide. This stage uses an optimized version of the LargeVis algorithm.

{
  "type": "embedding2d:lv",
  "initial": null,
  "initializedLearningRate": 0.02,
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  },
  "maxIterations": 300,
  "negativeEdgeCount": 5,
  "negativeEdgeDenominator": 1,
  "negativeEdgeWeight": 2,
  "threads": "auto"
}

Characteristics

The embedding2d:lv stage embeds the input entities, such as documents or labels, in a 2-dimensional space. As a result, you get the input items laid out on a 2d map in such a way that similar items are close to each other and dissimilar items are spatially separated.

Examples

The following request computes a 2d embedding of labels based on how they co-occur in a set of documents.

{
  "name": "Label 2d embedding by label co-occurrence similarity",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "labels": {
      "type": "labels:fromDocuments",
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      },
      "maxLabels":{
        "type": "labelCount:fixed",
        "value": 500
      }
    },
    "embedding2d": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:cooccurrenceLabelSimilarity",
        "labels": {
          "type": "labels:reference",
          "use": "labels"
        },
        "documents": {
          "type": "documents:reference",
          "use": "documents"
        }
      }
    }
  },
  "output": {
    "stages": [
      "embedding2d",
      "labels"
    ]
  }
}

Computing a 2d embedding for a set of labels using the label co-occurrence similarity.

The documents stage selects documents containing the word clustering and passes them to the labels stage, which collects the top 500 labels occurring in those documents. Finally, the embedding2d stage computes a 2d map of the labels based on how the labels co-occur in the documents.

Note that in the above request you can replace the matrix:​cooccurrence​Label​Similarity stage with matrix:​knn​Vectors​Similarity, which uses multidimensional embeddings to compute more "semantic" similarities between labels.
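
If you generate such requests from code, the label embedding request above can be assembled programmatically. The following Python sketch builds the same JSON structure with the query and the label count as parameters. The function name label_embedding_request is hypothetical, and the sketch only prints the request; how you submit it to Lingo4G is outside its scope.

```python
import json


def label_embedding_request(query, max_labels=500):
    """Build the label co-occurrence embedding request shown above."""
    documents_ref = {"type": "documents:reference", "use": "documents"}
    return {
        "name": "Label 2d embedding by label co-occurrence similarity",
        "stages": {
            # Select the documents matching the query.
            "documents": {
                "type": "documents:byQuery",
                "query": {"type": "query:string", "query": query},
            },
            # Collect a fixed number of labels from those documents.
            "labels": {
                "type": "labels:fromDocuments",
                "documents": documents_ref,
                "maxLabels": {"type": "labelCount:fixed", "value": max_labels},
            },
            # Embed the labels based on how they co-occur in the documents.
            "embedding2d": {
                "type": "embedding2d:lv",
                "matrix": {
                    "type": "matrix:cooccurrenceLabelSimilarity",
                    "labels": {"type": "labels:reference", "use": "labels"},
                    "documents": documents_ref,
                },
            },
        },
        "output": {"stages": ["embedding2d", "labels"]},
    }


print(json.dumps(label_embedding_request("clustering"), indent=2))
```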

The following request computes a 2d embedding for a set of documents containing the clustering query:

{
  "name": "Document 2d embedding by keyword similarity",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "embedding2d": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:keywordDocumentSimilarity",
        "documents": {
          "type": "documents:reference",
          "use": "documents"
        }
      }
    }
  },
  "output": {
    "stages": [
      "embedding2d",
      "documents"
    ]
  }
}

Computing a 2d embedding for a set of documents using the keyword document similarity.

In the above request, the documents stage selects documents that match the clustering query and the embedding2d stage uses embedding2d:lv to compute the 2d embedding. As with 2d embeddings for labels, you can replace matrix:keywordDocumentSimilarity with matrix:knnVectorsSimilarity to use multidimensional embedding vectors for similarity computation. The tuning section explains how the characteristics of the 2d embeddings depend on the underlying similarity matrix computation method.

You can extend both requests by adding an overlay 2d embedding of labels, so that the labels describe the densely populated areas of the document maps.

Tuning

The output of the embedding2d:lv stage depends not only on the parameters you choose, but also on the type and density of the similarity matrix you provide as input. The following subsections offer some advice for a number of typical 2d embedding tuning scenarios.

Layout characteristics

The primary layout property you can control is the number and size of spatial clusters of the 2d points.

Embedding time

The time required to compute a 2d embedding for a specific input depends on the following factors:

  • Number of elements in the input similarity matrix. The larger or the more dense the matrix, the longer it takes to compute the 2d embedding. Reducing the density of the input matrix speeds up processing but also creates a larger number of smaller spatial clusters on the 2d map.

    Alternatively, when you intend to embed a large number of documents only to get an overview of the topics covered by the documents, you can speed up processing by taking a random sample of the input collection.

  • Range of repulsion between dissimilar points. You can decrease the embedding time by lowering the negativeEdgeCount parameter at the cost of lower quality of the 2d embedding.

  • Number of 2d embedding iterations. Embedding time grows with the number of iterations Lingo4G performs on the input matrix. Lowering maxIterations decreases the embedding time at the cost of lower embedding quality.

initial

Type
undefined
Default
null
Required
no

The initial coordinates to use when computing the 2d embedding.

If the initial is not null, Lingo4G uses the 2d coordinates you provide to initialize the computation of the 2d embedding in this stage. The number of points in the initial embedding must be the same as the size of the input matrix.

Some points in the initial embedding may be null. For each null point in the initial embedding, Lingo4G initializes the corresponding 2d embedding point with a random value.

If initial is null, Lingo4G initializes all 2d embedding points with random values.

You can use the initial property to build a time series of embeddings, where each step replaces a portion of the old data points with new ones and uses the initial property to keep the coordinates of the remaining points unchanged.
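
The initialization rules above can be sketched in Python as follows. This is only an illustration of the described semantics, not Lingo4G code: the in-memory representation (a list of (x, y) pairs or None), the function name initialize_points and the range of the random values are all assumptions.

```python
import random


def initialize_points(initial, size, rng=None):
    """Return (points, explicit) following the rules described above.

    initial: None, or a list with one entry per matrix row, where each
             entry is either None or an (x, y) pair.
    size:    the size of the input matrix.
    The random range [-1, 1] is an arbitrary choice for this sketch.
    """
    rng = rng or random.Random(0)
    if initial is None:
        initial = [None] * size  # no initial embedding: all points random
    if len(initial) != size:
        raise ValueError("initial must have one entry per matrix row")
    points, explicit = [], []
    for point in initial:
        if point is None:
            # Null point: start from a random position.
            points.append((rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)))
            explicit.append(False)
        else:
            # Explicitly initialized point: keep the provided position.
            points.append(tuple(point))
            explicit.append(True)
    return points, explicit


points, explicit = initialize_points([(1.0, 2.0), None, (0.5, -0.5)], 3)
print(explicit)
```

The explicit flags mark the points to which the initializedLearningRate multiplier would apply.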

initialized​Learning​Rate

Type
number
Default
0.02
Constraints
value >= 0
Required
no

Learning rate multiplier for the explicitly initialized points.

If you use the initial property to explicitly initialize the position of certain embedding points, for those points Lingo4G uses the learning rate you provide in the initialized​Learning​Rate property.

Typically, the learning rate for the explicitly initialized points is smaller than 1.0 to ensure that the coordinates of those points are more stable during the computation.

matrix

Type
matrix
Default
{
  "type": "matrix:reference",
  "auto": true
}
Required
no

The similarities to use to compute the 2d embedding.

max​Iterations

Type
integer
Default
300
Constraints
value >= 0
Required
no

The maximum number of embedding learning iterations to perform.

For inputs larger than 20k data points, you may need to increase the maximum number of iterations to improve the quality of the 2d embedding.

Embedding computation time depends linearly on the max​Iterations property.
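
Because of this linear dependency, you can roughly extrapolate the embedding time from one measured run. The Python sketch below assumes that the time is proportional to the iteration count and ignores any fixed overhead, so treat the result as an estimate only. The function name is hypothetical.

```python
def estimate_embedding_seconds(measured_seconds, measured_iterations,
                               planned_iterations):
    """Extrapolate embedding time, assuming time is proportional to maxIterations."""
    if measured_iterations <= 0:
        raise ValueError("measured_iterations must be positive")
    return measured_seconds * planned_iterations / measured_iterations


# A run with the default 300 iterations took 12 seconds; estimate 600 iterations.
print(estimate_embedding_seconds(12.0, 300, 600))
```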

negative​Edge​Count

Type
integer
Default
5
Constraints
value >= 0
Required
no

Range of repulsion between dissimilar data points.

To accurately embed a set of data points, Lingo4G must take into account not only each point's similar points, but also a sample of the dissimilar ones. This property determines how many dissimilar pairs of points to consider for each pair of similar points defined by the matrix.

Values lower than 5 speed up processing, but may produce poorly-clustered maps. Values larger than 15 may lead to poorly-shaped maps with many ill-positioned documents. The larger the repulsion range, the longer the embedding computation time.

negative​Edge​Denominator

Type
number
Default
1
Constraints
value > 0
Required
no

Determines the strength of clustering of points on the 2d map.

The larger the denominator, the more tightly packed the groups of 2d points on the map and the larger the empty spaces between the groups.

negative​Edge​Weight

Type
number
Default
2
Constraints
value >= 0
Required
no

Strength of repulsion between dissimilar documents.

When you change negativeEdgeCount, for example to speed up processing, adjust negativeEdgeWeight so that the product of the two properties remains similar.
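
A small Python sketch of this adjustment is shown below. It keeps the product of the two properties exactly constant, which is one way to keep it "similar". The function name rebalanced_edge_weight is hypothetical.

```python
def rebalanced_edge_weight(old_count, old_weight, new_count):
    """Return a negativeEdgeWeight that preserves count * weight."""
    if new_count <= 0:
        raise ValueError("new_count must be positive to preserve the product")
    return old_count * old_weight / new_count


# Defaults are negativeEdgeCount = 5 and negativeEdgeWeight = 2 (product 10).
# Lowering the count to 4 suggests a weight of 2.5.
print(rebalanced_edge_weight(5, 2, 4))
```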

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to compute the 2d embedding.

embedding2d:​lv​Overlay

Overlays new points on top of an existing 2D embedding you provide. You can use this stage to, for example, put labels on top of a 2D map of documents. Uses an optimized version of the LargeVis algorithm.

{
  "type": "embedding2d:lvOverlay",
  "embedding2d": {
    "type": "embedding2d:reference",
    "auto": true
  },
  "initial": null,
  "initializedLearningRate": 0.02,
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  },
  "maxIterations": 300,
  "negativeEdgeCount": 5,
  "negativeEdgeDenominator": 1,
  "negativeEdgeWeight": 2,
  "threads": "auto"
}

To create a 2d embedding overlay, you need the following pieces of data:

  • The reference 2d embedding, which you provide in the embedding2d property. The reference embedding can be, for example, a 2d embedding of documents or labels produced by the embedding2d:lv stage. The embedding2d:lvOverlay stage does not modify the reference embedding; it only uses that embedding to guide the placement of the new 2d points.

  • Similarity matrix, which you provide in the matrix property. Columns of the similarity matrix must correspond to the reference 2d embedding. Rows of the matrix correspond to new data points you overlay on top of the reference embedding.

The following request overlays labels on top of a 2d document embedding to describe the densely populated areas of the document embedding.

{
  "name": "2d embedding of documents with a label embedding overlay.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": 10000
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 250
      }
    },
    "documentEmbedding2d": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedDocumentEmbeddings",
          "documents": {
            "type": "documents:reference",
            "use": "documents"
          }
        }
      }
    },
    "labelEmbedding2dOverlay": {
      "type": "embedding2d:lvOverlay",
      "matrix": {
        "type": "matrix:keywordLabelDocumentSimilarity",
        "labels": {
          "type": "labels:reference",
          "use": "labels"
        },
        "documents": {
          "type": "documents:reference",
          "use": "documents"
        }
      },
      "embedding2d": {
        "type": "embedding2d:reference",
        "use": "documentEmbedding2d"
      }
    }
  }
}

Computing a 2d label embedding overlay on top of a 2d document embedding.

The documents stage selects the top 10k documents matching the clustering query and the labels stage collects the 250 most frequent labels in those documents. The labelEmbedding2dOverlay stage uses embedding2d:lvOverlay to compute the 2d overlay of labels on top of the 2d document embedding. It uses the matrix:keywordLabelDocumentSimilarity matrix, which computes label-to-document similarities.

embedding2d

Type
embedding2d
Default
{
  "type": "embedding2d:reference",
  "auto": true
}
Required
no

The reference 2d embedding on top of which to overlay new points.

The number of points in the 2d embedding you supply in this property must be equal to the number of columns in the similarity matrix.

initial

Type
undefined
Default
null
Required
no

The initial coordinates to use when computing the 2d embedding.

This property is the same as the analogous property in the embedding2d:​lv stage. See the embedding2d:​lv.initial for more details.

initialized​Learning​Rate

Type
number
Default
0.02
Constraints
value >= 0
Required
no

Learning rate multiplier for the explicitly initialized points.

This property is the same as the analogous property in the embedding2d:​lv stage. See the embedding2d:​lv.initialized​Learning​Rate for more details.

matrix

Type
matrix
Default
{
  "type": "matrix:reference",
  "auto": true
}
Required
no

The matrix of similarities between new points to overlay on top of the reference embedding and the existing points in the reference embedding you provide.

Rows of the matrix you provide correspond to the new data points to overlay on top of the reference embedding. Columns of the matrix correspond to the points of the reference embedding. Therefore, the number of columns in the matrix must be equal to the number of points in the reference 2d embedding you provide.
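
The shape constraint above can be expressed as a simple check. The Python sketch below is only an illustration: it takes the matrix dimensions as plain integers, because the in-memory form of a Lingo4G matrix is not described here, and the function name is hypothetical.

```python
def check_overlay_input(matrix_rows, matrix_columns, reference_points):
    """Validate overlay inputs and return the number of new points.

    matrix_rows, matrix_columns: dimensions of the similarity matrix.
    reference_points: the points list of the reference 2d embedding.
    """
    if matrix_columns != len(reference_points):
        raise ValueError(
            "matrix has %d columns but the reference embedding has %d points"
            % (matrix_columns, len(reference_points))
        )
    # Each matrix row corresponds to one new data point to overlay.
    return matrix_rows


reference = [{"x": 0.0, "y": 0.0}, {"x": 1.0, "y": 1.0}, {"x": 2.0, "y": 0.5}]
print(check_overlay_input(250, 3, reference))
```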

If you overlay labels on top of a 2d document embedding, you can use the matrix:​keyword​Label​Document​Similarity matrix that computes label-to-document similarities.

max​Iterations

Type
integer
Default
300
Constraints
value >= 0
Required
no

The maximum number of embedding learning iterations to perform.

This property is the same as the analogous property in the embedding2d:​lv stage. See the embedding2d:​lv.max​Iterations for more details.

negative​Edge​Count

Type
integer
Default
5
Constraints
value >= 0
Required
no

Range of repulsion between dissimilar data points.

This property is the same as the analogous property in the embedding2d:​lv stage. See the embedding2d:​lv.negative​Edge​Count for more details.

negative​Edge​Denominator

Type
number
Default
1
Constraints
value > 0
Required
no

Determines the strength of clustering of points on the 2d map.

This property is the same as the analogous property in the embedding2d:​lv stage. See the embedding2d:​lv.negative​Edge​Denominator for more details.

negative​Edge​Weight

Type
number
Default
2
Constraints
value >= 0
Required
no

Strength of repulsion between dissimilar documents.

This property is the same as the analogous property in the embedding2d:​lv stage. See the embedding2d:​lv.negative​Edge​Weight for more details.

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to compute the 2d embedding overlay.

embedding2d:​transferred

Copies 2D coordinates from one list of entities, such as documents or labels, to another.

{
  "type": "embedding2d:transferred",
  "embedding2d": null,
  "source": null,
  "target": null
}

Lingo4G computes the result of this stage in the following way:

  1. Create a 2d embedding of size equal to the number of target documents or labels and set all points to null.

  2. For each entity in the target list: if the source list contains the same entity, copy the corresponding coordinate from the input embedding2d to the output embedding. If the source list does not contain the entity, keep the output embedding point equal to null.
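
The two steps above can be sketched in Python as follows. This illustrates the described procedure and is not Lingo4G code. It assumes that entities are represented by hashable identifiers, and the function name transfer_embedding is hypothetical.

```python
def transfer_embedding(points, source, target):
    """Copy coordinates from a source-aligned embedding to the target list.

    points: list of coordinates aligned with `source`.
    source: list of entity identifiers that gave rise to `points`.
    target: list of entity identifiers to transfer the coordinates to.
    """
    if len(points) != len(source):
        raise ValueError("points and source must have the same length")
    # Position of each source entity in the input embedding.
    position = {entity: index for index, entity in enumerate(source)}
    # Step 1: one output point per target entity, all initially null.
    output = [None] * len(target)
    # Step 2: copy coordinates for entities present in the source list.
    for index, entity in enumerate(target):
        source_index = position.get(entity)
        if source_index is not None:
            output[index] = points[source_index]
    return output


points = [{"x": 0.0, "y": 1.0}, {"x": 2.0, "y": 3.0}]
print(transfer_embedding(points, ["a", "b"], ["b", "c", "a"]))
```

Entities that appear only in the target list, such as "c" in this example, keep a null point in the output.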

embedding2d

Type
embedding2d
Default
null
Required
yes

The 2d embedding from which to copy coordinates.

source

Type
documents or labels
Default
null
Required
yes

The list of labels or documents that gave rise to the input embedding.

The number of documents or labels on the list must be equal to the number of points in the input embedding2d.

target

Type
documents or labels
Default
null
Required
yes

The list of documents or labels against which to transfer the input embedding.

The size of the output embedding is equal to the size of the target list of entities.

Consumers of embedding2d:​*

The following stages and components take embedding2d:​* as input:

Stage or component                Property
embedding2d:lvOverlay             embedding2d
embedding2d:transferred           embedding2d
matrix:knn2dDistanceSimilarity    embedding2d