Learning embeddings
An optional indexing step is learning multidimensional label and document embeddings. Label embeddings help Lingo4G to connect semantically related labels even if the labels don't directly co-occur in your data. Document embeddings help Lingo4G to connect similar documents even if they don't share any common labels.
This article shows how to learn multidimensional embedding vectors for labels and documents. If your project contains more than 5 GB of text, make sure to read the Limitations section before you start.
Learning embeddings
To build label and document embedding vectors, perform the following steps:
- Index your documents if you haven't done so.

- Tune embedding parameters to match the size of your project.

  If you are working with one of the example data sets, the embedding parameters already match the expected size of the index. If you are learning embeddings for your own data, perform the following steps.

  - Find out the size of your index:

        l4g stats -p <project-descriptor-path>

    You should see output similar to:

        DOCUMENTS INDEX (commit 'data/commits/_2')
        Live documents       2.02M
        Deleted documents    39.82k
        Size on disk         2.14GB
        Segments             14
        Directory type       MMapDirectory
  - Based on the Size on disk value, edit your project descriptor to apply the following embedding parameter changes (see the expanded example after this table).

    | Size on disk | Embedding learning parameters |
    | --- | --- |
    | < 5 GB | No parameter changes needed. |
    | 5 GB – 50 GB | Use the following `indexer.embedding.labels` section in your project descriptor: `{ "input": { "minTopDf": 5 }, "model": { "vectorSize": 128 }, "index": { "constructionNeighborhoodSize": 384 } }` |
    | > 50 GB | Use the following `indexer.embedding.labels` section in your project descriptor: `{ "input": { "minTopDf": 10 }, "model": { "vectorSize": 160 }, "index": { "constructionNeighborhoodSize": 512 } }` |
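For reference, the 5 GB – 50 GB settings expand to the following fragment of a complete project descriptor. This sketch shows only the `indexer.embedding.labels` path; a real descriptor contains other sections as well:

    {
      "indexer": {
        "embedding": {
          "labels": {
            "input": { "minTopDf": 5 },
            "model": { "vectorSize": 128 },
            "index": { "constructionNeighborhoodSize": 384 }
          }
        }
      }
    }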
- Run the embedding learning command:

      l4g learn-embeddings -p <project-descriptor-path>

Heads up, large data set protection.

If your project has more than 5 million documents, Lingo4G does not automatically compute document embeddings, to protect you from running into high memory usage without warning. If you'd like to learn embeddings for more than 5 million documents, pass the `--recompute-document-embeddings` flag to the `learn-embeddings` command. Before you do that, make sure you give the JVM enough memory to compute the embeddings.

Leave the command running until you see a stable completion time estimate for the label embedding learning task:

    1/1 Embeddings > Learning label embeddings   [6k docs/s]   4%   ~18m 57s
If the estimate is unreasonably high (multiple hours or days), edit the project descriptor and add a hard timeout on the label embedding learning time:

    {
      "indexer": {
        "embedding": {
          "labels": {
            "model": {
              "timeout": "2h"
            }
          }
        }
      }
    }
As a rule of thumb, a timeout equal to 1x–2x indexing time should yield embeddings of sufficient quality. See the FAQ section for more ways to lower the label embedding learning time.
Once the `learn-embeddings` command completes successfully, you can use label and document embeddings during analysis, for example to compute embedding-based similarity matrices.
Updating embeddings
When you add documents to your Lingo4G index using incremental indexing, Lingo4G does not automatically build embedding vectors for the new documents. Similarly, when you perform reindexing to discover new labels, Lingo4G does not build embedding vectors for the new labels.
To bring the embedding vectors into synchronization with the index, perform one of the following steps:
- If you added documents to the index using incremental indexing and performed feature reindexing to discover new labels, rebuild both label and document embeddings:

      l4g learn-embeddings -p <project-descriptor-path> --recompute-label-embeddings --recompute-document-embeddings

- If you added documents to the index using incremental indexing without the follow-up feature reindexing step, rebuild only the document embeddings:

      l4g learn-embeddings -p <project-descriptor-path> --recompute-document-embeddings

  There is no point in rebuilding label embeddings in this case because the labels did not change.
Also note that updating embeddings after adding documents or reindexing labels is not mandatory. All analysis operations will still work, omitting documents and labels with empty embedding vectors. This situation is perfectly normal: the least frequent labels never get embedding vectors anyway, because there is not enough data to learn them.
Limitations and caveats
The current Lingo4G implementation of label and document embeddings has the following limitations you should be aware of.
In-memory data structures
Currently, Lingo4G stores label and document embedding vectors in RAM. This means that all embedding vectors and the corresponding kNN index must fit into Java heap during both indexing and analysis.
The memory size of document and label embeddings depends on the following factors:

- the number of labels with non-empty embedding vectors,
- the number of documents with non-empty embedding vectors,
- the `vectorSize` project descriptor parameter,
- the `maxNeighborsPerNode` parameter of the labels kNN index and the corresponding `maxNeighborsPerNode` parameter of the documents kNN index.
The following table summarizes the memory requirements of label and document embeddings in projects of typical sizes.

| Project | Docs | Labels (all) | Labels (embedded¹) | Embedding parameters | Label embeddings (indexing²) | Label embeddings (analysis³) | Document embeddings³ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| arxiv | 2.2M | 508k | 291k | `vectorSize` = 96, `maxNeighborsPerNode` = 24 | 640 MB | 148 MB | 1.02 GB |
| pubmed | 4.90M | 1.70M | 1.29M | `vectorSize` = 128, `maxNeighborsPerNode` = 24 | 1.78 GB | 854 MB | 2.84 GB |
| uspto | 11M | 1.29M | 1.10M | `vectorSize` = 160, `maxNeighborsPerNode` = 24 | 1.60 GB | 864 MB | 7.62 GB |

¹ Labels with non-empty embedding vectors.
² Estimated label embedding memory footprint at indexing time.
³ Actual embedding memory footprint at analysis time.
Note that learning label embeddings during indexing requires more memory than using label embeddings at analysis time. You can use the following formulae to estimate the amount of memory required for learning and using embeddings in your own project.
For label embeddings:

    M_indexing ≈ 2 × 4 × L_all × vectorSize + 4 × L_emb × maxNeighborsPerNode

    M_analysis ≈ 4 × L_emb × (vectorSize + maxNeighborsPerNode)

where:

- M_indexing – memory required to learn label embeddings and create the kNN index during indexing, in bytes; the leading factor of 2 accounts for the two weight matrices maintained during training, and the factor of 4 for the size of a single-precision float,
- M_analysis – memory size of label embedding vectors and the kNN index at analysis time, in bytes,
- L_all – total number of labels in your index,
- L_emb – number of labels with non-empty embedding vectors in your index,
- vectorSize – the `vectorSize` project descriptor parameter,
- maxNeighborsPerNode – the `maxNeighborsPerNode` project descriptor parameter.
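For example, for the arxiv project above (291k embedded labels, `vectorSize` = 96, `maxNeighborsPerNode` = 24), the analysis-time estimate comes out to 4 × 291,000 × (96 + 24) ≈ 140 MB, close to the 148 MB measured in the table; the remainder is per-label bookkeeping overhead.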
For document embeddings:

    M_documents ≈ 4 × D × (vectorSize + maxNeighborsPerNode)

where:

- M_documents – memory required to learn document embedding vectors at indexing time and to use them at analysis time, including the kNN index, in bytes,
- D – number of documents in your index,
- vectorSize – the `vectorSize` project descriptor parameter,
- maxNeighborsPerNode – the `maxNeighborsPerNode` project descriptor parameter.
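As a worked example, for the pubmed project above (4.90M documents, `vectorSize` = 128, `maxNeighborsPerNode` = 24), the estimate is 4 × 4,900,000 × (128 + 24) ≈ 2.98 GB, close to the 2.84 GB measured in the table.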
While the amount of RAM in a typical workstation or server (16 GB – 32 GB) should be sufficient to learn label embeddings even for very large projects, learning and using document embeddings for projects with tens of millions of documents or more currently requires substantial amounts of RAM. We may be able to address this issue in future releases of Lingo4G.
No incremental updates
When you use incremental indexing to add or update documents in an existing Lingo4G index, Lingo4G does not perform the corresponding incremental updates to label and document embeddings. For example, if you add new documents, their embedding vectors are empty.
Currently, the only way to bring embedding vectors into synchronization with the updated Lingo4G index is to re-learn the embeddings from scratch. See the Updating embeddings section for more details.
Time-consuming
Learning label embedding vectors is currently time-consuming and resource-intensive. To learn label embeddings, Lingo4G makes several passes over all documents in the index, which may take as much time as indexing the documents did. See the Learning embeddings FAQ section for ways to make the learning process manageable.
FAQ
Lingo4G estimates "Learning label embeddings" to take a very long time. What can I do?
Learning label embeddings is usually very time-consuming and may indeed take multiple hours to complete under the default settings. You can explore and combine the following strategies to make the time manageable.
Use a faster machine, even temporarily
If you can use a faster machine, even just for the duration of embedding learning, this is the best approach. Giving Lingo4G enough CPU time to perform the learning should result in high-quality embeddings for a large number of labels.

While the general indexing workload is a mix of disk and CPU access, embedding learning is almost entirely CPU-bound and scales linearly with the number of CPUs. It may therefore not make sense to perform both tasks on a machine with a very high CPU count: general indexing cannot saturate all CPUs, mainly due to disk access, whereas label embedding learning can use such a machine effectively.
To perform indexing and label embedding learning on separate machines, follow these steps:

- Index your collection without learning embeddings:

      l4g index -p <project-descriptor-path>

- Transfer the index data to the machine you use for learning embeddings.

- Perform embedding learning:

      l4g learn-embeddings -p <project-descriptor-path>

- Transfer the index data to the machine you use for handling analysis requests.
Lower the quality of embeddings
Further reductions of embedding learning time require lowering the quality and/or coverage of the embeddings. Consider editing the following parameters to lower the quality of label embeddings (see the sketch after this list):

- Set `model` to `CBOW` for a significant learning speed-up at the cost of lower-quality embeddings for low-frequency labels.

- Lower the `vectorSize` project descriptor property. We recommend values in the 96–192 range for this property, but a value of 64 should also produce reasonable embeddings, especially for small data sets.
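For illustration, a descriptor fragment lowering `vectorSize` might look like the sketch below. The exact parameter key for switching the model to `CBOW` is not shown here; check the parameter reference for the `model` section of `indexer.embedding.labels`:

    {
      "indexer": {
        "embedding": {
          "labels": {
            "model": {
              "vectorSize": 64
            }
          }
        }
      }
    }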
Set a hard limit on the embedding learning time
Try editing the project descriptor to change the value of the `timeout` parameter to an acceptable value. In this case, Lingo4G shortens the learning time and discards the low-quality embeddings. As a rule of thumb, a learning time equal to 1x–2x of the indexing time should yield embeddings of sufficient quality.
"Learning document embeddings" fails with java.lang.OutOfMemoryError. What can I do?
Currently, Lingo4G keeps label and document embeddings in main memory. Therefore, the Java heap must be large enough to hold the label embedding vectors, the document embedding vectors, and the kNN index of the document vectors. Use the embedding memory footprint formulae to estimate how much memory your current data set requires.
You can lower the memory footprint by lowering the following embedding learning parameters (see the sketch after this list):

- `vectorSize`,
- `maxNeighborsPerNode` for label vectors,
- `maxNeighborsPerNode` for document vectors.
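As an illustrative sketch, lowered values could be applied in the descriptor as follows. The placement of `maxNeighborsPerNode` under the `index` sections of the `labels` and `documents` blocks is an assumption here; verify the exact paths against the project descriptor reference:

    {
      "indexer": {
        "embedding": {
          "labels": {
            "model": { "vectorSize": 96 },
            "index": { "maxNeighborsPerNode": 16 }
          },
          "documents": {
            "index": { "maxNeighborsPerNode": 16 }
          }
        }
      }
    }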
Note that the vector sizes of label and document embeddings must be the same. If you have already learnt label embeddings with a specific vector size, building document embeddings with a different vector size requires re-learning the label embeddings.

If your machine can meet the estimated memory requirements, plus at least 4 GB of extra heap memory for data structures not directly related to learning embeddings, increase Lingo4G's JVM heap size and restart embedding learning.