Indexing LLM embeddings

In this article, you'll learn how to index fields with embedding vectors computed by a large language model external to Lingo4G.

Lingo4G can produce document embeddings on its own and does a good job when there is a lot of data to learn from. Large language models (LLMs) can produce accurate embedding vectors even when the number of documents is very small: their "understanding" relies on contextual knowledge drawn from the large body of text processed during training. Models can also be fine-tuned to fit a particular task.

This article shows one particular step-by-step process of adding embedding vectors to JSON data, followed by indexing it in Lingo4G. It does not exhaust all the tools and possibilities, of which there are plenty, as the LLM ecosystem is evolving rapidly.

Heads up, time-consuming operations!

Computing sentence embeddings with LLMs typically requires a GPU (graphics card) or modern hardware, and may still be a lengthy process if the input data is large. For this reason, we highly recommend starting small to ensure the process works, then scaling up to larger data samples.

Preparation

We'll work with the dataset-json-records document source example present in the Lingo4G distribution. This project contains a few JSON files under the data folder: StackExchange questions and answers stored as json-record files. Here is an example record:

{
  "id": "94",
  "type": "question",
  "title": "What Windows services can I safely disable?",
  "tag": [
    "windows",
    "services"
  ],
  "created": "2009-07-15",
  "question": "I'm trying to improve the boot time and general performance of a Windows XP machine and figure the massive collection of services that Windows automatically starts have to have an impact. Are there any services that I can safely disable? If so what are they?\n\nObviously the services are there for a reason, so when listing a service, please provide reasoning & examples of when you'd not disable it.\n",
  ...
    

Let's add an embedding vector for the question field. We'll need an LLM sentence embedding model and the machinery to apply this model to input text. In this example, we'll use the Ollama project because it is simple to install and deploy.

  1. Download and install Ollama according to the instructions.
  2. Download a sentence embedding model. In this article, we'll use one of Ollama's library models called mxbai-embed-large. Run the following command to download the model:

    ollama pull mxbai-embed-large
  3. If everything went well, you should have a working LLM model on your machine. Try asking it to embed some text (the quotes below assume bash or a similar shell; Windows users may have to adjust the quoting):

    curl -XPOST http://localhost:11434/api/embeddings --data '{"model":"mxbai-embed-large", "prompt":"Hello world."}'

    If you received a JSON response with a long sequence of numbers, like the one below, you're good to go.

    {
      "embedding": [
        0.5235398411750793,
        0.48802658915519714,
        0.4726874530315399,
        -0.3952648639678955,
        -0.7870591878890991,
        -0.020934749394655228,
        0.5342496037483215,
        0.6363499164581299,
        0.7467385530471802,
        0.3339114189147949,
        -0.029485151171684265,
        0.46853840351104736,
        -0.32256293296813965,
        0.052154384553432465,
        -1.1541720628738403,
        0.14687027037143707,
        -0.3646547198295593,
        -0.12941092252731323,
        -0.8018029928207397,
        0.38724684715270996,
        -0.2168998122215271,
        0.3071088194847107,
        -1.1364370584487915,
        ...

Adding embedding vectors to JSON data

Let's add an embedding vector field question-embedding, encoding the text found in the question field of all documents present under the data folder of the datasets/dataset-json-records example.

Lingo4G comes with a command-line utility to help you with this task. The utility takes JSON record files as input, asks an external service to compute an embedding vector for one or more fields, then writes back a copy of the input JSON along with the new vector fields. Run this tool now against the Ollama service you set up in the previous step:

> mkdir datasets/dataset-json-records/data-embeddings
> l4g llm-vec --parallelism 4 --report-every 5 \
    --backend ollama \
    --ollama-model mxbai-embed-large \
    --fields question \
    --output datasets/dataset-json-records/data-embeddings/stackxchg-with-embeddings.json.zst \
    datasets/dataset-json-records/data/
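
Conceptually, for each record the utility performs something like the following sketch. This is an illustration only, not the actual l4g llm-vec implementation; it assumes bash with curl and jq available, and a hypothetical record.json file holding a single record:

# Illustrative sketch; the real l4g llm-vec tool adds parallelism,
# progress reporting and zstd-compressed output.
question=$(jq -r '.question' record.json)

# Ask the Ollama service for an embedding of the field's text.
embedding=$(curl -s http://localhost:11434/api/embeddings \
  --data "$(jq -n --arg p "$question" '{model: "mxbai-embed-large", prompt: $p}')" \
  | jq -r '.embedding | map(tostring) | join(",")')

# Write back a copy of the record with the new vector field attached
# (a comma-separated string of numbers, as in the output shown below).
jq --arg e "$embedding" '. + {"question-embedding": $e}' record.json \
  > record-with-embedding.json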

You should end up with one or more files under the data-embeddings folder. These files contain copies of the input records, enriched with embedding vectors.
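
To peek inside, decompress a file to standard output (this assumes the zstd command-line tool is installed; any zstd-capable tool will do):

zstd -dc datasets/dataset-json-records/data-embeddings/stackxchg-with-embeddings.json.zst | head

Each record now carries the new question-embedding field: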

{
  "id" : "67",
  "type" : "question",
  "title" : "PDF Viewer on Windows",
  ...
  "question" : "I've tried Foxit and Adobe's reader, but I'm not satisfied with either. \nFoxit has update nagging for non-critical junk.\nAdobe PDF reader is bloatware.\nAre there other options you people like?\n",
  "question-embedding" : "0.02229069927045383,-0.054698395890743434,4.459132531833404E-4,-0.003377242019978439,-0.003233384262313286,-9.794070359603001E-4,0.04046424396001888,-0.021523714067476444,0.041624021461988084,0.043229097278370786,-0.01117457786453014,-0.003601036705659116,0.05046213702095343,0.006889291739273135,-0.010423603647854057,-0.025593951823563842,..."
}

Indexing embedding vectors

Lingo4G supports indexing vector fields, which can be consumed by analysis stages such as documents:vectorFieldNearestNeighbors or vectors:fromVectorField. To store vector data in Lingo4G, declare a field of the float-vector type in the project descriptor; the indexer will then import these fields whenever they are present in the JSON data.

The project descriptor of the dataset-json-records example already contains the field definitions we'll need in this tutorial. The question-embedding field has the required float-vector type, along with its dimension, and the fieldMapping section points at the JSONPath where the data for this field can be found in the input JSON files:

"fields": {
  "title":    { "analyzer": "english" },
  "question": { "analyzer": "english" },
  "question-embedding": { "type": "float-vector", "length": 1024 },
  "acceptedAnswer": { "analyzer": "english" },
  ...
},
...
"source":  {
 "feed":  {
   "type":  "json-records",
   ...
   "fieldMapping": {
     "title": ".title",
     "question": ".question",
     "question-embedding": ".question-embedding",
...

The length parameter of a float-vector field must match the dimension of the vectors produced by the model. You can find this information on the LLM's model card or in Ollama's logs.
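
If in doubt, you can also determine the dimension empirically: embed any text and count the elements of the returned vector. A quick check, assuming the jq utility is available:

curl -s http://localhost:11434/api/embeddings \
  --data '{"model":"mxbai-embed-large", "prompt":"dimension check"}' \
  | jq '.embedding | length'

For mxbai-embed-large this prints 1024, matching the length declared in the descriptor above.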

Now index the "enriched" data with Lingo4G:

l4g index --force -p datasets/dataset-json-records -Dinput.dir=datasets/dataset-json-records/data-embeddings

Using external embeddings

Vector fields behave much like any other vectors in the Lingo4G API: you can use them anywhere a document embedding vector is required. In this section, we'll present a couple of toy examples. First, run the l4g server:

l4g server -p datasets/dataset-json-records

Similar document lookup

Similar embedding vectors should imply similarity of the original texts. The following query retrieves the documents most similar to one selected by an explicit identifier. Note the stages that reference the vector field: documents:vectorFieldNearestNeighbors and vector:fromVectorField.

{
  "stages": {
    "queryDocuments": {
      "type": "documents:byQuery",
      "limit": 1,
      "query": {
        "type": "query:string",
        "query": "id:83677"
      }
    },
    "similarDocuments": {
      "type": "documents:vectorFieldNearestNeighbors",
      "vector": {
        "type": "vector:fromVectorField",
        "documents": {
          "type": "documents:reference",
          "use": "queryDocuments"
        },
        "fieldName": "question-embedding"
      },
      "fieldName": "question-embedding",
      "limit": 5
    },
    "queryDocumentsContents": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "queryDocuments"
      }
    },
    "similarDocumentsContents": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "similarDocuments"
      }
    }
  },
  "output": {
    "stages": [
      "queryDocumentsContents",
      "similarDocumentsContents",
      "queryDocuments",
      "similarDocuments"
    ]
  },
  "tags": [
    "Similar Documents (MLT)"
  ]
}

Running this query in the Explorer displays the "source" document (query vector) and the most similar documents based on that document's LLM embedding field:

[Screenshot: Similar documents retrieved using an LLM embedding vector. The first document is the "query" document (source embedding vector).]

Document clustering

You can also use external embeddings for document clustering. Copy the following, slightly more verbose, request into the Explorer and run it:

{
  "variables": {
    "query": {
      "name": "Query",
      "comment": "Selects documents to arrange into a 2d map.",
      "value": "*:*"
    },
    "maxDocuments": {
      "name": "Max documents",
      "comment": "Maximum number of documents to include in the 2d map.",
      "value": 5000
    },
    "maxLabels": {
      "name": "Max labels",
      "comment": "Maximum number of labels to put on the 2d map.",
      "value": 200
    }
  },
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": {
          "@var": "query"
        }
      },
      "limit": {
        "@var": "maxDocuments"
      }
    },
    "labels": {
      "type": "labels:fromDocuments",
      "labelAggregator":{
        "type": "labelAggregator:topWeight",
        "maxRelativeDf": 0.5,
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields",
          "labelFilter": {
            "type": "labelFilter:tokenCount",
            "minTokens": 1,
            "maxTokens": 2
          }
        }
      },
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": {
          "@var": "maxLabels"
        }
      }
    },
    "documents2dEmbedding": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:fromVectorField",
          "fieldName": "question-embedding"
        }
      }
    },
    "documents2dEmbeddingLabels": {
      "type": "embedding2d:lvOverlay",
      "matrix": {
        "type": "matrix:keywordLabelDocumentSimilarity",
        "labels": {
          "type": "labels:reference",
          "use": "labels"
        }
      },
      "embedding2d": {
        "type": "embedding2d:reference",
        "use": "documents2dEmbedding"
      }
    },
    "docClusters": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:knn2dDistanceSimilarity",
        "embedding2d": {
          "type": "embedding2d:reference",
          "use": "documents2dEmbedding"
        },
        "maxNearestPoints": 32
      },
      "softening": 0.2,
      "inputPreference": -10000
    }
  },
  "output": {
    "stages": [
      "documents",
      "labels",
      "documents2dEmbedding",
      "documents2dEmbeddingLabels",
      "docClusters"
    ]
  },
  "tags": [
    "2D Embeddings"
  ]
}

The output clusters will be labeled poorly (the input data set is too small for Lingo4G's label extraction to yield good labels), but clear groups of related documents emerge from the data:

[Screenshot: Document clusters discovered from LLM embedding vectors.]