Learning embeddings

An optional indexing step is learning multidimensional label and document embeddings. Label embeddings help Lingo4G to connect semantically related labels even if the labels don't directly co-occur in your data. Document embeddings help Lingo4G to connect similar documents even if they don't share any common labels.

This article shows how to learn multidimensional embedding vectors for labels and documents. If your project contains more than 5 GB of text, make sure to read the Limitations section before you start.

Learning embeddings

To build label and document embedding vectors, perform the following steps:

  1. Index your documents if you haven't done so.

  2. Tune embedding parameters to match the size of your project.

    If you are working with one of the example data sets, the embedding parameters already match the expected size of the index.

    If you are learning embeddings for your own data, perform the following steps.

    1. Find out the size of your index:

      l4g stats -p <project-descriptor-path>

      You should see output similar to:

      DOCUMENTS INDEX (commit 'data/commits/_2')
      
      Live documents     2.02M
      Deleted documents  39.82k
      Size on disk       2.14GB
      Segments           14
      Directory type     MMapDirectory
    2. Based on the Size on disk value, edit your project descriptor to apply the following embedding parameter changes.

      • Size on disk below 5GB: no parameter changes needed.

      • Size on disk 5GB – 50GB: use the following indexer.embedding.labels section in your project descriptor (see the complete descriptor fragment after these steps for the exact nesting):

        {
          "input": { "minTopDf": 5 },
          "model": { "vectorSize": 128 },
          "index": { "constructionNeighborhoodSize": 384 }
        }

      • Size on disk above 50GB: use the following indexer.embedding.labels section in your project descriptor:

        {
          "input": { "minTopDf": 10 },
          "model": { "vectorSize": 160 },
          "index": { "constructionNeighborhoodSize": 512 }
        }
  3. Run the embedding learning command:

    l4g learn-embeddings -p <project-descriptor-path>
    Heads up: large data set protection.

    If your project has more than 5 million documents, Lingo4G does not automatically compute document embeddings. This protects you from running into excessive memory usage without a warning.

    If you'd like to learn embeddings for more than 5 million documents, pass the --recompute-document-embeddings flag to the learn-embeddings command. Before you do that, make sure you give the JVM enough memory to compute the embeddings.

    Leave the command running until you see a stable completion time estimate of the label embedding learning task:

    1/1 Embeddings > Learning label embeddings   [    :    :6k docs/s]   4% ~18m 57s

    If the estimate is unreasonably high (multiple hours or days), edit the project descriptor and add a hard timeout on the label embedding learning time:

    {
      "indexer": {
        "embedding": {
          "labels": {
            "model": {
              "timeout": "2h"
            }
          }
        }
      }
    }

    As a rule of thumb, a timeout equal to 1x–2x indexing time should yield embeddings of sufficient quality. See the FAQ section for more ways to lower the label embedding learning time.

    Once the learn-embeddings command completes successfully, you can use label and document embeddings during analysis, for example to compute embedding-based similarity matrices.
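
    For reference, the fragments shown in step 2 belong in the indexer.embedding.labels section of the project descriptor. For an index in the 5GB – 50GB range, combined with the timeout setting shown above, the complete fragment might look like this (a sketch; adjust the values to your data set):

    {
      "indexer": {
        "embedding": {
          "labels": {
            "input": { "minTopDf": 5 },
            "model": {
              "vectorSize": 128,
              "timeout": "2h"
            },
            "index": { "constructionNeighborhoodSize": 384 }
          }
        }
      }
    }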

Updating embeddings

When you add documents to your Lingo4G index using incremental indexing, Lingo4G does not automatically build embedding vectors for the new documents. Similarly, when you perform reindexing to discover new labels, Lingo4G does not build embedding vectors for the new labels.

To bring the embedding vectors into synchronization with the index, perform one of the following steps:

  • If you added documents to the index using incremental indexing and performed feature reindexing to discover new labels, rebuild both label and document embeddings:

    l4g learn-embeddings -p <project-descriptor-path> --recompute-label-embeddings --recompute-document-embeddings
  • If you added documents to the index using incremental indexing without the follow-up feature reindexing step, rebuild document embeddings:

    l4g learn-embeddings -p <project-descriptor-path> --recompute-document-embeddings

    There is no point in rebuilding label embeddings in this case because the labels did not change.

Also note that updating embeddings after adding documents or reindexing labels is not mandatory. All analysis operations will still work, omitting documents and labels with empty embedding vectors. This situation is perfectly normal – the least frequent labels never get embedding vectors because there is not enough data to learn them.

Limitations and caveats

The current Lingo4G implementation of label and document embeddings has the following limitations you should be aware of.

In-memory data structures

Currently, Lingo4G stores label and document embedding vectors in RAM. This means that all embedding vectors and the corresponding kNN index must fit into Java heap during both indexing and analysis.

The memory footprint of document and label embeddings depends on the number of documents and labels in your index and on the values of the vectorSize and maxNeighborsPerNode parameters.

The following table summarizes the memory requirements of label and document embeddings in projects of typical sizes.

Project   Docs    Labels: all / embedded¹   Embedding parameters                         Memory: labels indexing² / analysis³ / docs³
arxiv     2.2M    508k / 291k               vectorSize = 96, maxNeighborsPerNode = 24    640 MB / 148 MB / 1.02 GB
pubmed    4.90M   1.70M / 1.29M             vectorSize = 128, maxNeighborsPerNode = 24   1.78 GB / 854 MB / 2.84 GB
uspto     11M     1.29M / 1.10M             vectorSize = 160, maxNeighborsPerNode = 24   1.60 GB / 864 MB / 7.62 GB

¹ Labels with non-empty embedding vectors.

² Estimated label embedding memory footprint at indexing time.

³ Actual embedding memory footprint at analysis time.

Note that learning label embeddings during indexing requires more memory than using label embeddings at analysis time. You can use the following formulae to estimate the amount of memory required for learning and using embeddings in your own project.

For label embeddings:

M_LI = 8 * N_LT * v + 4 * N_LE * n

M_LA = 4 * N_LE * (v + n)

where:

M_LI
memory required to learn label embeddings and create the kNN index during indexing, in bytes
M_LA
memory size of label embedding vectors and the kNN index at analysis time, in bytes
N_LT
total number of labels in your index
N_LE
number of labels with non-empty embedding vectors in your index
v
vectorSize project descriptor parameter
n
maxNeighborsPerNode project descriptor parameter
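
For example, plugging in the pubmed values from the table above (N_LT = 1.70M, N_LE = 1.29M, v = 128, n = 24):

M_LI = 8 * 1,700,000 * 128 + 4 * 1,290,000 * 24 ≈ 1.86 GB

M_LA = 4 * 1,290,000 * (128 + 24) ≈ 0.78 GB

Both figures are in the same ballpark as the measured values in the table (1.78 GB and 854 MB); treat the formulae as rough estimates rather than exact predictions.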

For document embeddings:

M_D = 4 * N_D * (v + n)

where:

M_D
memory required to learn document embedding vectors at indexing time and to use the vectors at analysis time, including the kNN index, in bytes
N_D
number of documents in your index
v
vectorSize project descriptor parameter
n
maxNeighborsPerNode project descriptor parameter
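
For example, for pubmed (N_D = 4.90M, v = 128, n = 24):

M_D = 4 * 4,900,000 * (128 + 24) ≈ 2.98 GB

which is close to the 2.84 GB shown in the table above.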

While the amount of RAM in a typical workstation or server (16–32 GB) should be sufficient to learn label embeddings even for very large projects, learning and using document embeddings for projects with tens of millions of documents or more currently requires substantial amounts of RAM. We may be able to address this issue in future releases of Lingo4G.

No incremental updates

When you use incremental indexing to add or update documents in an existing Lingo4G index, Lingo4G does not perform the corresponding incremental updates to label and document embeddings. For example, if you add new documents, their embedding vectors are empty.

Currently, the only way to bring embedding vectors into synchronization with the updated Lingo4G index is to re-learn the embeddings from scratch. See the Updating embeddings section for more details.

Time-consuming

Learning label embedding vectors is currently time-consuming and resource-intensive. To learn label embeddings, Lingo4G makes several passes over all documents in the index, which may take as much time as indexing the documents in the first place. See the Learning embeddings FAQ section for ways to make the learning process manageable.

FAQ

Lingo4G estimates "Learning label embeddings" to take a very long time. What can I do?

Learning label embeddings is usually very time-consuming and may indeed take multiple hours to complete under the default settings. You can explore and combine the following strategies to make the time manageable.

Use a faster machine, even temporarily

If you can use a faster machine, even if only for the duration of embedding learning, this is the best approach. Giving Lingo4G enough CPU time to perform the learning should result in high-quality embeddings for a large number of labels.

General indexing is a mix of disk and CPU work, so it cannot saturate all CPUs on a machine with a very high core count, mainly due to disk access. Label embedding learning, in contrast, is almost entirely CPU-bound and scales linearly with the number of CPUs, so it can use such a machine effectively. For this reason, it may make sense to perform the two tasks on separate machines.

To perform indexing and label embedding learning on separate machines, follow these steps:

  1. Index your collection without learning embeddings:

    l4g index -p <project-descriptor-path>
  2. Transfer the index data to the machine you use for learning embeddings.

  3. Perform embedding learning:

    l4g learn-embeddings -p <project-descriptor-path>
  4. Transfer the index data to the machine you use for handling analysis requests.
Lower the quality of embeddings

Further reductions of the embedding learning time require lowering the quality and/or coverage of the embeddings. Consider editing the following parameters to lower the quality of label embeddings:

  1. Set model to CBOW for a significant learning speed-up at the cost of lower-quality embeddings of low-frequency labels.

  2. Lower the vectorSize project descriptor property. We recommend values in the 96–192 range, but a value of 64 should also produce reasonable embeddings, especially for small data sets.
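
For example, the following project descriptor fragment lowers the label embedding vector size (a minimal sketch; 64 is an illustrative value, adjust it to your data set):

  {
    "indexer": {
      "embedding": {
        "labels": {
          "model": {
            "vectorSize": 64
          }
        }
      }
    }
  }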

Set a hard limit on the embedding learning time

Try editing the project descriptor to set the timeout parameter to an acceptable value. Lingo4G then shortens the learning time and discards low-quality embeddings. As a rule of thumb, learning time equal to 1x–2x of the indexing time should yield embeddings of sufficient quality.

"Learning document embeddings" fails with java.lang.OutOfMemoryError. What can I do?

Currently, Lingo4G keeps label and document embeddings in main memory. Therefore, the Java heap must be large enough to hold label embedding vectors, document embedding vectors and the kNN index of the document vectors. Use the embedding memory footprint formulae to estimate how much memory your current data set requires.

You can lower the memory footprint by lowering the vectorSize and maxNeighborsPerNode embedding learning parameters.

Note that the vector sizes of label and document embeddings must be the same. If you have already learned label embeddings with a specific vector size, building document embeddings with a different vector size requires re-learning the label embeddings.

If your machine can meet the estimated memory requirements, plus at least 4 GB of extra heap memory for data structures not directly related to learning embeddings, increase Lingo4G's JVM heap memory size and restart embedding learning.
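
For example, for a hypothetical index with 10 million documents, 1 million embedded labels, vectorSize = 128 and maxNeighborsPerNode = 24, the formulae above give:

M_LA = 4 * 1,000,000 * (128 + 24) ≈ 0.6 GB

M_D = 4 * 10,000,000 * (128 + 24) ≈ 6.1 GB

so the JVM heap should be at least 0.6 + 6.1 + 4 ≈ 11 GB.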