Learning embeddings

An optional indexing step is learning multidimensional label and document embeddings. Label embeddings help Lingo4G to connect semantically related labels even if the labels don't directly co-occur in your data. Document embeddings help Lingo4G to connect similar documents even if they don't share any common labels.

This article shows how to learn multidimensional embedding vectors for labels and documents.

Heads up, time-consuming operations!

Learning label and document embeddings is a time-consuming and resource-intensive process. If your project contains more than 10 GB of text, you may need to tune the learning process.

Learning embeddings

To build label and document embedding vectors, perform the following steps:

  1. Index your documents if you haven't done so.

  2. Tune embedding parameters to match the size of your project.

    If you are working with one of the example data sets, the embedding parameters already match the expected size of the index.

    If you are learning embeddings for your own data, perform the following steps.

    1. Find out the size of your index:

      l4g stats -p <project-descriptor-path>

      You should see output similar to:

      DOCUMENTS INDEX (commit 'data/commits/_2')
      
      Live documents     2.02M
      Deleted documents  39.82k
      Size on disk       2.14GB
      Segments           14
      Directory type     MMapDirectory

    2. Based on the Size on disk value, edit your project descriptor to apply the following embedding parameter changes:

      • Size on disk below 5 GB: no parameter changes needed.

      • Size on disk between 5 GB and 50 GB: use the following indexer.embedding.labels section in your project descriptor:

        {
          "model": {
            "vectorSize": 128
          }
        }

      • Size on disk above 50 GB: use the following indexer.embedding.labels section in your project descriptor:

        {
          "model": {
            "vectorSize": 160
          }
        }
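
      For reference, here is the 5–50 GB variant written out with its full indexer.embedding.labels nesting in the project descriptor (the same structure applies to the timeout example in the next step):

      {
        "indexer": {
          "embedding": {
            "labels": {
              "model": {
                "vectorSize": 128
              }
            }
          }
        }
      }
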
  3. Run the embedding learning command:

    l4g learn-embeddings -p <project-descriptor-path>

    Leave the command running until you see a stable completion time estimate of the label embedding learning task:

    1/1 Embeddings > Learning label embeddings   [    :    :6k docs/s]   4% ~18m 57s

    If the estimate is unreasonably high (multiple hours or days), edit the project descriptor and add a hard timeout on the label embedding learning time:

    {
      "indexer": {
        "embedding": {
          "labels": {
            "model": {
              "timeout": "2h"
            }
          }
        }
      }
    }

    As a rule of thumb, a timeout equal to 1x–2x indexing time should yield embeddings of sufficient quality. See the FAQ section for more ways to lower the label embedding learning time.
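
    For example, if indexing took about an hour, a timeout of 1h to 2h is a reasonable starting point. If you also adjusted the vector size in the previous step, both settings go into the same model object. A sketch of the combined indexer.embedding.labels section (the values are illustrative, not a recommendation):

    {
      "model": {
        "vectorSize": 128,
        "timeout": "2h"
      }
    }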

    Once the learn-embeddings command completes successfully, you can use label and document embeddings during analysis, for example to compute embedding-based similarity matrices.

Updating embeddings

When you add documents to your Lingo4G index using incremental indexing, and the index already contains document embeddings, Lingo4G automatically computes embedding vectors for the new documents.

To bring the embedding vectors into synchronization with the index, take the step that matches how you updated the index:

  • If you performed feature reindexing to discover new labels, rebuild both label and document embeddings:

    l4g learn-embeddings -p <project-descriptor-path> --recompute-label-embeddings --recompute-document-embeddings

  • If you added documents to the index using incremental indexing without the follow-up feature reindexing step, document embeddings are updated automatically.

    There is no point in rebuilding label embeddings in this case because the labels did not change.

Also note that updating embeddings after adding documents or reindexing labels is not mandatory. All analysis operations will still work, omitting documents and labels with empty embedding vectors. This situation is perfectly normal: the least frequent labels will never have corresponding embedding vectors due to insufficient data.

Limitations and caveats

The current implementation of label and document embeddings in Lingo4G has the following limitations you should be aware of.

Time-consuming

Learning label embedding vectors is currently time-consuming and resource-intensive. To learn label embeddings, Lingo4G makes several passes over all documents in the index, which may take as much time as indexing the documents in the first place. See the Learning embeddings FAQ section for ways to make the learning process manageable.

FAQ

Lingo4G estimates "Learning label embeddings" to take a very long time. What can I do?

Learning label embeddings is usually very time-consuming and may indeed take multiple hours to complete under the default settings. You can explore and combine the following strategies to make the time manageable.

Use a faster machine, even temporarily

If you can use a faster machine, even just for the duration of embedding learning, this is the best approach. Giving Lingo4G enough CPU time to perform the learning should result in high-quality embeddings for a large number of labels.

While the general indexing workload is a mix of disk and CPU access, embedding learning is almost entirely CPU-bound and scales linearly with the number of CPUs. It may therefore not make sense to perform both tasks on the same very-high-CPU-count machine: general indexing cannot saturate all CPUs, mainly due to disk access, whereas label embedding learning can use such a machine effectively.

To perform indexing and label embedding learning on separate machines, follow these steps:

  1. Index your collection without learning embeddings:

    l4g index -p <project-descriptor-path>

  2. Transfer the index data to the machine you use for learning embeddings.

  3. Perform embedding learning:

    l4g learn-embeddings -p <project-descriptor-path>

  4. Transfer the index data to the machine you use for handling analysis requests.
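
A minimal sketch of this workflow using rsync is shown below. The host names and the project directory layout are assumptions; adjust them to your environment and make sure the entire project directory, including the index data, is transferred:

  # On the indexing machine: build the index without learning embeddings.
  l4g index -p /data/project/project.json

  # Copy the project, including its index data, to the learning machine
  # (the path and host name are placeholders).
  rsync -a /data/project/ learner:/data/project/

  # On the learning machine: learn label and document embeddings.
  l4g learn-embeddings -p /data/project/project.json

  # Finally, copy the project to the machine handling analysis requests.
  rsync -a /data/project/ analysis:/data/project/
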
Lower the quality of embeddings

Further reductions in embedding learning time require lowering the quality and/or coverage of the embeddings. Consider editing the following parameter to lower the quality of label embeddings:

  1. Lower the vectorSize project descriptor property, as shown in the sketch below. We recommend the 96–192 range of values for this property, but a value of 64 should also produce reasonable embeddings, especially for small data sets.
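
A sketch of a project descriptor fragment applying the lower bound mentioned above (the value 64 is illustrative; the nesting follows the timeout example earlier in this article):

  {
    "indexer": {
      "embedding": {
        "labels": {
          "model": {
            "vectorSize": 64
          }
        }
      }
    }
  }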

Set a hard limit on the embedding learning time

Edit the project descriptor to set the timeout parameter to an acceptable value. Lingo4G will then stop learning once the timeout elapses, discarding embedding vectors that did not reach sufficient quality. As a rule of thumb, a learning time equal to 1x–2x of the indexing time should yield embeddings of sufficient quality.
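
After editing the descriptor, run the learning command again. If label embeddings were already built in a previous run, add the --recompute-label-embeddings option described in the Updating embeddings section to force a rebuild (whether the option is needed depends on the state of your index):

  l4g learn-embeddings -p <project-descriptor-path> --recompute-label-embeddings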