Indexing LLM embeddings
In this article, you'll learn how to index fields with embedding vectors computed using an external (to Lingo4G) large language model.
Lingo4G can produce document embeddings and does a good job when there is a lot of data to learn from. Large language models (LLMs) can produce accurate embedding vectors even when the number of documents is very small. Their "understanding" relies on contextual knowledge drawn from a large body of text processed during the training phase. Models can also be fine-tuned to fit a particular task.
This article shows one particular step-by-step process of adding embedding vectors to JSON data, followed by indexing it in Lingo4G. It does not exhaust all the tools and possibilities (of which there are plenty, as LLMs are a rapidly evolving field).
Computing sentence embeddings using LLMs typically requires a GPU (graphics card) or other modern hardware. It may still be a lengthy process if the input data is large. For this reason, we highly recommend starting small to make sure the process works, then scaling to larger data samples.
Preparation
We'll work with the dataset-json-records document source example, included in the Lingo4G distribution. This project contains a few JSON files under the data folder. These files are StackExchange questions and answers, stored as json-record files. Here is an example file:
{
"id": "94",
"type": "question",
"title": "What Windows services can I safely disable?",
"tag": [
"windows",
"services"
],
"created": "2009-07-15",
"question": "I'm trying to improve the boot time and general performance of a Windows XP machine and figure the massive collection of services that Windows automatically starts have to have an impact. Are there any services that I can safely disable? If so what are they?\n\nObviously the services are there for a reason, so when listing a service, please provide reasoning & examples of when you'd not disable it.\n",
...
Let's add an embedding vector for the question field. We'll need an LLM sentence embedding model and the machinery to apply this model to input text. In this example, we'll use the Ollama project because it is simple to install and deploy.
- Download and install Ollama according to instructions.
- Download and install a sentence embedding model. In this article, we'll use one of Ollama's library models, the one called mxbai-embed-large. Run the following command to download the model:
ollama pull mxbai-embed-large
- If everything went well, you should have a working model on your machine. Try asking it to embed some text (the quotes assume bash or a similar shell; Windows users may have to quote the payload differently):
curl -XPOST http://localhost:11434/api/embeddings --data '{"model":"mxbai-embed-large", "prompt":"Hello world."}'
If you received a JSON response with a long sequence of numbers, you're good to go.
{ "embedding": [ 0.5235398411750793, 0.48802658915519714, 0.4726874530315399, -0.3952648639678955, -0.7870591878890991, -0.020934749394655228, 0.5342496037483215, 0.6363499164581299, 0.7467385530471802, 0.3339114189147949, -0.029485151171684265, 0.46853840351104736, -0.32256293296813965, 0.052154384553432465, -1.1541720628738403, 0.14687027037143707, -0.3646547198295593, -0.12941092252731323, -0.8018029928207397, 0.38724684715270996, -0.2168998122215271, 0.3071088194847107, -1.1364370584487915, ...
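The same check can be scripted. The following is a minimal Python sketch, not part of Lingo4G, that sends the request shown in the curl command and extracts the vector from a response shaped like the one above. The function names (build_request, parse_embedding, embed) are hypothetical helpers introduced for illustration; the URL, model name, and the "prompt" and "embedding" keys are taken from the example above.

```python
import json
import urllib.request

# Endpoint and model name copied from the curl example above.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "mxbai-embed-large"


def build_request(text, model=MODEL, url=OLLAMA_URL):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding(response_body):
    """Extract the vector from a response shaped like {"embedding": [...]}."""
    vector = json.loads(response_body)["embedding"]
    if not vector:
        raise ValueError("the service returned an empty embedding")
    return [float(x) for x in vector]


def embed(text):
    """Send the text to a locally running Ollama service, return the vector."""
    with urllib.request.urlopen(build_request(text)) as response:
        return parse_embedding(response.read())


# Usage (requires the Ollama service to be running):
#   vector = embed("Hello world.")
#   print(len(vector))
```

Printing the length of the returned vector is a quick way to learn the model's embedding dimension, which is needed later when declaring the vector field.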
Adding embedding vectors to JSON data
Let's add an embedding vector field, question-embedding, encoding the text found in the question field of all documents present under the data folder of the datasets/dataset-json-records example.
Lingo4G comes with a command-line utility to help you with this task. This utility takes JSON record files as input, asks an external service to compute an embedding vector for one or more fields, then writes out a copy of the input JSON with the new vector fields added. Run this tool now against the Ollama service you set up in the previous step:
> mkdir datasets/dataset-json-records/data-embeddings
> l4g llm-vec --parallelism 4 --report-every 5 --backend ollama --ollama-model mxbai-embed-large --fields question --output datasets/dataset-json-records/data-embeddings/stackxchg-with-embeddings.json.zst datasets/dataset-json-records/data/
You should end up with one or more files under the data-embeddings folder. These files will contain the embedding vectors:
{
"id" : "67",
"type" : "question",
"title" : "PDF Viewer on Windows",
...
"question" : "I've tried Foxit and Adobe's reader, but I'm not satisfied with either. \nFoxit has update nagging for non-critical junk.\nAdobe PDF reader is bloatware.\nAre there other options you people like?\n",
"question-embedding" : "0.02229069927045383,-0.054698395890743434,4.459132531833404E-4,-0.003377242019978439,-0.003233384262313286,-9.794070359603001E-4,0.04046424396001888,-0.021523714067476444,0.041624021461988084,0.043229097278370786,-0.01117457786453014,-0.003601036705659116,...
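In the sample above, the vector is stored as a single comma-separated string in a field named after the source field with an -embedding suffix. The Python sketch below illustrates that layout conceptually; it is not the implementation of the l4g llm-vec tool. The helper names (parse_vector, format_vector, add_embedding) are hypothetical, and the embed argument stands for any function that maps text to a list of numbers.

```python
def parse_vector(text):
    """Turn a comma-separated string such as "0.5,-9.79E-4" into floats."""
    return [float(value) for value in text.split(",")]


def format_vector(values):
    """Serialize numbers into the comma-separated layout seen in the sample."""
    return ",".join(repr(float(value)) for value in values)


def add_embedding(record, field, embed):
    """Return a copy of the record with a "<field>-embedding" entry added.

    The "-embedding" suffix mirrors the sample output above; treat it as an
    assumption rather than a documented naming rule.
    """
    enriched = dict(record)
    enriched[field + "-embedding"] = format_vector(embed(record[field]))
    return enriched
```

Python's float() accepts the exponent notation that appears in the sample (for example 4.459132531833404E-4), so values can be read back without any extra conversion.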
Indexing embedding vectors
Lingo4G supports indexing vector fields, which can be consumed by analysis stages such as documents:vectorFieldNearestNeighbors or vector:fromVectorField. To store vector data in Lingo4G, declare a field with the float-vector type in the project descriptor and the indexer will take care of importing these fields, if they are present in the JSON data.
The project descriptor of the dataset-json-records example already contains the field definitions we'll need in this tutorial. The question-embedding field has the correct float-vector type, along with its dimension, and the fieldMapping section points at the JSONPath where the data for this field can be found in the input JSON files:
"fields": {
"title": { "analyzer": "english" },
"question": { "analyzer": "english" },
"question-embedding": { "type": "float-vector", "length": 1024 },
"acceptedAnswer": { "analyzer": "english" },
...
},
...
"source": {
"feed": {
"type": "json-records",
...
"fieldMapping": {
"title": ".title",
"question": ".question",
"question-embedding": ".question-embedding",
...
The length parameter for float-vector fields must match the length of the vectors produced by the model. You can find this information on the model's info card or in the Ollama logs.
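Before indexing, it can be worth checking that the vectors in the enriched data really have the declared length. The following Python sketch is a hypothetical helper, not a Lingo4G feature. It assumes the records have already been loaded into dictionaries and that vectors are stored either as comma-separated strings (as in the sample output earlier) or as lists.

```python
def vector_length_mismatches(records, field, expected_length):
    """Return (record id, actual length) pairs for vectors of the wrong size."""
    mismatches = []
    for record in records:
        raw = record.get(field)
        if raw is None:
            continue  # records without the vector field are skipped
        # Comma-separated strings are counted by their number of components.
        actual = len(raw.split(",")) if isinstance(raw, str) else len(raw)
        if actual != expected_length:
            mismatches.append((record.get("id"), actual))
    return mismatches


# Usage: with the descriptor shown above, the expected length is 1024.
#   problems = vector_length_mismatches(records, "question-embedding", 1024)
```

An empty result means every vector found matches the length declared for the field.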
Now index the "enriched" data with Lingo4G:
l4g index --force -p datasets/dataset-json-records -Dinput.dir=datasets/dataset-json-records/data-embeddings
Using external embeddings
Vector fields behave much like any other vectors in the Lingo4G API. You can use them anywhere a document embedding vector is required. In this section, we'll present three toy examples. First, run the l4g server:
l4g server -p datasets/dataset-json-records
Similar document lookup
Similar embedding vectors should imply similarity of the original texts. The following query retrieves the documents most similar to the one selected by an explicit identifier. Note the stage types and fieldName properties that reference vector field stages and vector fields.
{
"stages": {
"queryDocuments": {
"type": "documents:byQuery",
"limit": 1,
"query": {
"type": "query:string",
"query": "id:83677"
}
},
"similarDocuments": {
"type": "documents:vectorFieldNearestNeighbors",
"vector": {
"type": "vector:fromVectorField",
"documents": {
"type": "documents:reference",
"use": "queryDocuments"
},
"fieldName": "question-embedding"
},
"fieldName": "question-embedding",
"limit": 5
},
"queryDocumentsContents": {
"type": "documentContent",
"documents": {
"type": "documents:reference",
"use": "queryDocuments"
}
},
"similarDocumentsContents": {
"type": "documentContent",
"documents": {
"type": "documents:reference",
"use": "similarDocuments"
}
}
},
"output": {
"stages": [
"queryDocumentsContents",
"similarDocumentsContents",
"queryDocuments",
"similarDocuments"
]
},
"tags": [
"Similar Documents (MLT)"
]
}
Running this query in the Explorer displays the "source" document (query vector) and the most similar documents based on that document's LLM embedding field.
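To build intuition for what a nearest-neighbor lookup over embedding vectors does, here is a brute-force Python sketch. It assumes cosine similarity as the measure, which is a common choice for sentence embeddings; this article does not state which measure or index structure Lingo4G uses internally, so treat both the metric and the function names as illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vectors, limit=5):
    """Rank documents by similarity to the query vector, best first.

    `vectors` maps a document id to its embedding vector.
    """
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

A scan like this compares the query against every document, so it only suits small collections; it is meant to explain the idea, not to replace the indexed lookup.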
"Natural language" search
The same model used for encoding the question field can be used to encode any natural language query. The right "prompt" should yield semantically relevant documents.
First, acquire the sentence embedding for your question. We'll use curl to ask Ollama to provide it:
curl -XPOST http://localhost:11434/api/embeddings --data '{"model":"mxbai-embed-large", "prompt":"How can I disable UAC in Windows?"}'
Next, copy the resulting vector of numbers, replacing the corresponding line of numbers in the query below. Then run it in the Explorer:
{
"name": "Documents by an explicit embedding vector",
"stages": {
"similarDocuments": {
"type": "documents:vectorFieldNearestNeighbors",
"vector": {
"type": "vector:direct",
"vector": [
-0.6678240299224854,-0.3270546793937683,0.24490061402320862,0.5776948928833008,0.1049603819847107,-0.5096140503883362,-0.33692947030067444,-0.25290247797966003,0.26838594675064087,0.8057006597518921,-0.06449219584465027,0.09683781117200851,0.46315404772758484,-0.3866603374481201,-0.14350521564483643,0.07675948739051819,-0.22871717810630798,0.27078965306282043,-0.452234148979187,0.7868403792381287,0.9385868906974792,-0.2630316913127899,-1.1200165748596191,-0.9737227559089661,-0.516177237033844,-0.15844713151454926,-0.4514494836330414,0.015934884548187256,1.2818877696990967,1.4789485931396484,-0.25919079780578613,-0.4203009009361267,-0.033671434968709946,0.17300833761692047,-0.2230384647846222,-0.4288966655731201,0.16999675333499908,-0.46896377205848694,0.021551236510276794,-0.7981429696083069,0.25897639989852905,0.28446781635284424,0.30176281929016113,-0.6501299738883972,-0.3090338706970215,-0.41341373324394226,-0.12070462107658386,0.3953201174736023,0.7337993383407593,0.46978768706321716,-0.5871947407722473,1.0663228034973145,0.35147762298583984,-0.08280615508556366,-0.7741233706474304,0.11972643435001373,-0.13841646909713745,-0.24295708537101746,0.05853267014026642,0.40691301226615906,-0.32044583559036255,0.9029668569564819,0.37813884019851685,-0.32792654633522034,0.8090909719467163,0.7476796507835388,0.21685004234313965,0.4479389488697052,-0.210826575756073,0.803276538848877,0.03843045234680176,0.8348561525344849,0.37874341011047363,-0.20635125041007996,-0.08051184564828873,0.6512452960014343,-0.24455365538597107,0.5323154926300049,0.17537719011306763,0.6345975399017334,-0.5425848960876465,1.1258317232131958,0.10661931335926056,-0.12910807132720947,-1.3183754682540894,-1.0743837356567383,0.33026760816574097,0.1525283306837082,0.0051500434055924416,0.11569217592477798,0.4577731788158417,0.0981932207942009,-0.08952921628952026,-0.6858391761779785,-0.21584835648536682,0.5809176564216614,-0.38912233710289,1.1463581323623657,0.5128231048583984,0.07286614924669266,0.860
008716583252,0.28054922819137573,-0.053399309515953064,0.2637169063091278,0.2920328974723816,0.9476811289787292,0.9473128318786621,0.524198591709137,-0.8201144933700562,-0.36364561319351196,-0.09184926003217697,0.21342632174491882,-0.5415635704994202,0.3005877137184143,0.2526929974555969,-0.1729966253042221,-0.5121755003929138,0.6260485649108887,-0.16456781327724457,0.46030813455581665,0.9912554025650024,-0.12902361154556274,0.23835931718349457,-0.1523369550704956,-0.7323474884033203,-0.31485772132873535,-0.28789323568344116,0.21615973114967346,-0.7631410956382751,-0.3054836094379425,0.05691303312778473,0.13819491863250732,-0.06780652701854706,0.1110917404294014,-0.05510175973176956,0.6591027975082397,0.3042431175708771,0.3049825429916382,0.43153202533721924,-1.1492836475372314,0.41413581371307373,1.17585027217865,0.3855838477611542,1.481425166130066,0.31220319867134094,1.580590009689331,0.5841251015663147,0.38650768995285034,-0.1939869225025177,-0.2725134491920471,-0.4573327302932739,-0.12020952999591827,0.25065842270851135,-0.026825472712516785,-0.3598436713218689,0.37251436710357666,0.2420288324356079,0.015366345643997192,-0.6161212921142578,1.0976288318634033,0.23320037126541138,1.0277793407440186,-0.4473324120044708,0.08050915598869324,0.021169418469071388,0.1937406212091446,-0.5518890023231506,0.14284062385559082,0.15890470147132874,0.20192798972129822,0.7298301458358765,0.046034883707761765,-1.6545511484146118,-0.7279776334762573,0.0038791820406913757,-0.6128689050674438,-0.3993251919746399,0.5234780311584473,0.35191088914871216,1.1940338611602783,-0.7113875150680542,0.23084524273872375,-0.4188513159751892,0.049564607441425323,0.40593481063842773,-0.20355214178562164,0.5053063631057739,-0.22380244731903076,-0.6742454171180725,-0.15089565515518188,0.7454171180725098,-0.17750725150108337,-0.4725326597690582,0.8420986533164978,-0.07355941832065582,0.05562528595328331,-1.1052883863449097,0.300507515668869,-0.8537042737007141,-0.42494937777519226,-1.04596042633056
64,0.5123926997184753,-1.3745778799057007,0.03739108517765999,-0.07844873517751694,-0.367543488740921,-0.024080593138933182,0.6308107376098633,-0.38552406430244446,-0.04075939953327179,1.025085687637329,0.22532248497009277,-0.8526533842086792,-0.4563581645488739,1.2468194961547852,0.03517918288707733,-0.18753354251384735,0.07379301637411118,-0.04918387532234192,0.22043117880821228,0.2978243827819824,0.6045581102371216,-0.053957484662532806,0.643552303314209,-0.6103844046592712,-0.18636713922023773,0.05303202569484711,0.0062947943806648254,-0.10125985741615295,0.0756499320268631,0.13714322447776794,0.30507805943489075,-0.4993361234664917,0.8600193858146667,-0.04186887666583061,0.3596005141735077,0.025588706135749817,0.5950284600257874,0.5796450972557068,0.4087822735309601,-0.43100351095199585,0.5513607263565063,0.12573717534542084,0.8201496005058289,0.07199370861053467,0.06819179654121399,-0.2338753640651703,0.10614994168281555,0.3055512607097626,-0.7499877214431763,-0.1345459371805191,0.20179221034049988,-1.2351906299591064,-0.008320439606904984,-0.41453656554222107,0.012705337256193161,0.6094315052032471,0.6358441114425659,-0.6375629901885986,-1.0642788410186768,-0.49829766154289246,0.09644682705402374,0.012574207969009876,-0.6421748995780945,0.6822296380996704,0.21006803214550018,0.4614858031272888,0.05584020912647247,-0.6701479554176331,-0.8759964108467102,-0.6722673773765564,0.3213675618171692,-0.6966325640678406,-0.0237315334379673,-1.1541998386383057,-0.18576882779598236,0.7638826966285706,-0.8627251386642456,0.29170477390289307,-0.4407261610031128,0.018274078145623207,-0.44841113686561584,-0.13100962340831757,1.1591181755065918,0.013283297419548035,-0.752669095993042,0.029200421646237373,0.5503090023994446,-0.4460316598415375,0.6294165849685669,-0.25609487295150757,0.14607882499694824,-0.5887885689735413,-0.535032331943512,0.4006384015083313,-0.9763392210006714,0.6294928193092346,0.43944957852363586,-0.8923460245132446,0.035288333892822266,-0.8941684365272522
,-0.11776317656040192,0.11810781061649323,-0.021693279966711998,-0.3859817683696747,-0.08011908084154129,-0.056837745010852814,-0.5240936875343323,0.5749272108078003,0.542073130607605,-0.7075226306915283,0.9432483911514282,-0.3019866645336151,0.5350804328918457,-0.8687669038772583,0.4809129536151886,0.29429444670677185,-0.3465624153614044,-0.8229600191116333,-0.43593835830688477,-0.08865698426961899,-0.39332354068756104,-0.12051291763782501,0.17937356233596802,-0.410043865442276,0.6364004015922546,-0.07567362487316132,-1.392578125,1.1815215349197388,0.09100450575351715,-0.006931355223059654,-1.1363712549209595,-0.8805798292160034,0.45651933550834656,-0.31618645787239075,-0.15556621551513672,-0.16768479347229004,-0.21632947027683258,-0.44242584705352783,0.1932225078344345,-0.009274154901504517,-0.6313037872314453,1.4259318113327026,-0.026215016841888428,-0.325498104095459,-0.006029928103089333,-0.6792687773704529,-0.795100748538971,0.38341024518013,-0.31174436211586,-0.1855074167251587,-0.6158837676048279,-0.4338547885417938,0.13422444462776184,0.8272004723548889,1.1639829874038696,-1.1500273942947388,0.9096818566322327,0.3780547082424164,-0.1967843770980835,-0.21046286821365356,0.505218505859375,-0.34325310587882996,0.5948609709739685,0.8855289220809937,-0.785014808177948,0.15578541159629822,0.9610651731491089,0.9537094235420227,-0.040810808539390564,0.33539170026779175,-0.05632761865854263,-0.5989818572998047,0.20575246214866638,-0.15847563743591309,0.7376065254211426,0.5532804131507874,-0.42690756916999817,-0.01279190182685852,-0.7039685845375061,-0.01356450468301773,-0.18823093175888062,0.0845145583152771,0.40373528003692627,-0.5776837468147278,0.05671258270740509,-0.40807437896728516,-0.13336904346942902,0.3443179428577423,-0.6535309553146362,-0.7402740120887756,-0.3815173804759979,0.4932897388935089,0.4198564887046814,-0.9697323441505432,-0.7380609512329102,1.0805197954177856,0.5650096535682678,0.6438094973564148,0.4661160111427307,1.0220212936401367,-1.1365330
219268799,-0.059295520186424255,0.8429642915725708,0.19227764010429382,-0.09449808299541473,0.4263327419757843,1.2110506296157837,-0.3002256155014038,-0.0587196983397007,-0.9527116417884827,-0.014306768774986267,-1.0068385601043701,0.4580576717853546,-0.09311781078577042,0.7418749332427979,-0.51519775390625,0.46069392561912537,0.21543718874454498,-0.26445916295051575,-0.6002886891365051,-0.41419970989227295,-0.6274615526199341,0.18137747049331665,0.39830252528190613,-0.45259836316108704,-0.5130108594894409,-0.2821756601333618,-0.07207773625850677,-0.06434433162212372,-0.33170410990715027,-0.15071533620357513,-0.0373401939868927,-0.7189939022064209,-0.9957669377326965,0.410360187292099,0.3132494390010834,-0.32817342877388,0.4087875783443451,-0.18108586966991425,-0.4723246097564697,-0.2548844516277313,-0.5607573986053467,-0.2937496304512024,-0.983547568321228,-0.43506282567977905,-0.24076101183891296,0.06973284482955933,-0.2958128750324249,-0.04087000712752342,0.8960068225860596,-0.4322759509086609,-0.27045711874961853,-0.8025009036064148,-0.22119121253490448,-0.3694828450679779,0.13302290439605713,-0.35610368847846985,0.4296471178531647,-0.5618695020675659,-0.5102592706680298,-0.0545954704284668,-0.23437458276748657,-0.41497817635536194,1.0204927921295166,0.6474722027778625,-0.7547150254249573,0.3537299335002899,0.7397130131721497,1.1420817375183105,-0.3511338233947754,0.5918375849723816,0.2826872169971466,-0.2544497549533844,-0.0793425515294075,-1.184470772743225,-0.14069746434688568,-0.17875973880290985,-0.650674045085907,-0.19607704877853394,-0.853507936000824,0.9735073447227478,1.127422571182251,0.2921452820301056,-0.17259764671325684,-0.2559465169906616,0.3809281587600708,0.76553875207901,0.029313087463378906,0.6235186457633972,-0.5967974662780762,0.021092338487505913,1.0815755128860474,-0.3941582441329956,-0.14277006685733795,0.04674488306045532,-0.06274522840976715,-0.4366597533226013,0.17596235871315002,0.5810372829437256,0.5435763597488403,-0.064737752079963
68,-1.154187798500061,0.04764412343502045,-0.012224621139466763,-0.2602982223033905,-1.1483341455459595,0.2104119211435318,-0.2211235910654068,-0.21967662870883942,-1.1632031202316284,0.17876410484313965,-0.8430588841438293,0.4235435724258423,0.10185834020376205,0.741112470626831,0.5490782260894775,0.5362679362297058,-0.3730953335762024,0.30619722604751587,0.6068710684776306,-0.797300398349762,-0.06528200209140778,0.2503140866756439,-0.39163684844970703,0.5481182336807251,0.23116779327392578,-0.42642179131507874,-0.4635744094848633,0.49347084760665894,1.116963267326355,0.273224800825119,-0.2760360836982727,-0.4270435571670532,-0.7471345663070679,-0.8939622640609741,0.524777889251709,-0.14308008551597595,-1.0130486488342285,-0.07194490730762482,-0.14827774465084076,-0.10845403373241425,-0.9695382118225098,0.17673319578170776,-0.1603754460811615,-0.5442399978637695,0.25730621814727783,0.9747385382652283,-0.36133715510368347,-0.08979347348213196,0.3403306007385254,0.12270374596118927,-0.1501343548297882,-0.12470434606075287,-0.9545978307723999,-0.6290841698646545,-0.4501216411590576,0.17598198354244232,-0.9923639893531799,0.7288824319839478,-0.49735307693481445,-0.2766314744949341,0.024575985968112946,-0.004677183926105499,-0.4341130554676056,1.0488574504852295,-0.8127288818359375,0.5741678476333618,1.1320773363113403,0.13246393203735352,0.47829243540763855,-0.25413617491722107,0.6198315024375916,0.008916586637496948,-0.0028505846858024597,-0.1972295045852661,0.37741750478744507,-0.44931647181510925,-0.5501949787139893,0.5252619981765747,0.3029854893684387,-0.270489901304245,0.11402548104524612,0.14067442715168,0.1468515396118164,0.20012278854846954,-0.051360636949539185,1.2031508684158325,0.19153425097465515,-0.7911192774772644,-0.6803591251373291,0.12031000852584839,0.3207525610923767,-1.1776175498962402,0.2419499158859253,0.4644848108291626,-0.40733692049980164,-0.32476741075515747,-1.1111592054367065,0.47451603412628174,-0.1846984475851059,-0.5528518557548523,0.085
45060455799103,-0.6277899146080017,0.45217031240463257,-0.5458784699440002,-0.46291688084602356,0.06424447894096375,-0.1287844479084015,-1.1650208234786987,0.021915510296821594,0.3928301930427551,-0.44256746768951416,0.1851692497730255,0.8012158870697021,-0.6651701927185059,-0.3965778052806854,0.008221130818128586,0.3445877134799957,0.10858052223920822,-0.1383046805858612,1.2911027669906616,0.7610169649124146,-0.4456222951412201,-1.2054113149642944,-0.1843147873878479,-0.00947922095656395,-0.2776593267917633,0.06931628286838531,0.3598436415195465,-0.3878498673439026,0.2599753141403198,-0.4993066191673279,0.08440832048654556,0.1522495299577713,0.6850064396858215,-0.24030625820159912,-0.008302201516926289,-0.9430957436561584,0.061437081545591354,0.16257044672966003,-0.04725198820233345,0.5073382258415222,-0.7684669494628906,-0.7315279841423035,0.21065837144851685,0.8177401423454285,-0.1823573112487793,-0.3425579071044922,0.7887237668037415,-0.43166428804397583,0.6626870036125183,0.839630126953125,-0.46662187576293945,1.0483211278915405,-0.15546119213104248,-0.5698243975639343,-0.7465584874153137,0.19695758819580078,-0.5048912763595581,-0.397411972284317,0.36469677090644836,0.6672120690345764,0.5057865977287292,0.16176994144916534,-0.09365447610616684,-0.45406463742256165,-0.455569326877594,0.6285457611083984,-0.9266245365142822,-0.31631603837013245,0.4135551154613495,-0.5937749147415161,-0.06194646656513214,0.41984468698501587,-0.2858307659626007,0.681341290473938,-1.2108960151672363,0.06684700399637222,-0.4886522889137268,-1.450094223022461,-0.2596518099308014,-0.16863003373146057,-0.4284188151359558,-0.6277339458465576,-0.5726729035377502,-0.025889620184898376,-0.08490926772356033,-0.6287164092063904,0.44492506980895996,0.49248191714286804,-0.2573589086532593,1.0328261852264404,-0.5468521118164062,-0.5662209987640381,-0.20375819504261017,-0.46866995096206665,-0.36948734521865845,-0.2271522581577301,0.015905171632766724,1.1302355527877808,0.6938649415969849,0.1464194
95344162,0.5537509322166443,-1.3059697151184082,0.15104380249977112,-0.5275390148162842,-0.2643243372440338,-0.15418599545955658,0.2894150912761688,-0.2534034848213196,-1.0805354118347168,-0.8429749011993408,-0.15915583074092865,0.5628639459609985,-0.13850559294223785,-0.22868993878364563,-0.02517290785908699,-0.06865020841360092,0.2299344837665558,-0.3564406931400299,-0.056218117475509644,0.5773911476135254,-0.11622463166713715,0.4373515248298645,0.7347260117530823,0.791166365146637,-0.19027400016784668,-0.04129612445831299,0.518986701965332,0.7333835959434509,0.10005193948745728,0.2194066047668457,0.5026146769523621,0.40993523597717285,-1.202702283859253,0.28268420696258545,-0.16039936244487762,0.6014275550842285,0.4232293963432312,0.5420016050338745,0.26943421363830566,0.24806712567806244,0.7238158583641052,0.13499361276626587,-1.6976341009140015,-0.05983565002679825,-0.41134652495384216,-0.24469095468521118,-0.24890848994255066,-0.5912842750549316,-0.5031788945198059,-0.23088189959526062,-0.42771032452583313,-1.1045684814453125,0.23061715066432953,-1.2404212951660156,-0.1529216468334198,1.0852267742156982,0.9825576543807983,1.2423981428146362,-0.4192562401294708,-0.1279120147228241,-0.30855000019073486,-0.8185910582542419,0.6520063877105713,-1.2501802444458008,0.35082167387008667,0.9092715382575989,0.03733613342046738,-0.19926118850708008,-0.7027714848518372,0.16973653435707092,-0.5383713245391846,0.7339189648628235,0.02901506796479225,0.17367002367973328,-0.5925265550613403,-0.41411906480789185,-0.2661055624485016,0.12976643443107605,-0.7878815531730652,-0.040980756282806396,0.1021926999092102,-0.052721310406923294,0.48179715871810913,0.40794628858566284,0.3450275957584381,-1.2908302545547485,-0.005963817238807678,-0.005873620510101318,0.5572190880775452,0.31831344962120056,0.03790868818759918,0.12605717778205872,0.738211989402771,0.36488834023475647,0.22363150119781494,-0.8777632117271423,0.10282336920499802,0.6014198660850525,0.35778117179870605,0.11417387425
899506,0.1500457227230072,-0.22486039996147156,-1.4669500589370728,0.8706555366516113,-0.36371544003486633,-0.9760779142379761,-1.0400822162628174,-0.4934695363044739,-0.24299877882003784,-0.23656480014324188,-0.4155847430229187,-0.9154437184333801,-0.15734508633613586,1.0709238052368164,-0.07248411327600479,0.2994179129600525,0.5357376337051392,-0.13553687930107117,0.2917648255825043,0.6192038655281067,-0.20857985317707062,-1.2494370937347412,0.585063636302948,0.9092404842376709,-0.6641386151313782,-0.20846624672412872,0.39521679282188416,-0.11661521345376968,-0.1432095468044281,-0.17665955424308777,-0.5382022261619568,0.22875896096229553,-0.38816529512405396,-0.011846042238175869,-0.7310686111450195,-0.17746952176094055,-0.38218724727630615,0.27411961555480957,0.2347392439842224,-0.7776448130607605,0.24628937244415283,-0.5403510928153992,-0.2527235448360443,-0.16699713468551636,-0.6724295616149902,0.8884900808334351,0.0426083467900753,0.2616976797580719,0.6045438647270203,-0.20976859331130981,0.7192700505256653,0.5148752927780151,0.21218551695346832,0.6630753874778748,1.5170093774795532,0.7332854866981506,-0.4846644103527069,-0.03360036760568619,0.08343462646007538,-0.2851085662841797,-0.3944458067417145,0.2819611132144928,-0.04296903312206268,-0.08845951408147812,0.4968705177307129,-0.6578850150108337,0.12704038619995117,0.807711660861969,-0.34183448553085327,0.9705681800842285,0.09323857724666595,0.11499091982841492,-0.35486701130867004,-0.23813897371292114,0.7719376683235168,-0.1579892635345459,0.21804538369178772,-0.7374932169914246,-0.3753277361392975,-0.41065114736557007,0.7897483110427856,0.00772525928914547,0.7530194520950317,-0.25686606764793396,-0.7975298762321472,-0.24421127140522003,-0.5765836834907532,-0.3662935793399811,-0.007702499628067017,-0.41201865673065186,-0.8999729156494141,0.45135897397994995,0.9344605803489685,0.22889044880867004,1.0636216402053833,-0.49011504650115967,0.20016980171203613,0.41646432876586914,1.1945068836212158,0.85702741146
08765,0.5612809658050537,0.091730996966362,0.4113489091396332,0.732751190662384,0.6668616533279419,0.704310417175293,-0.44061750173568726,0.033405061811208725,0.5023148059844971,-0.6086605191230774,-0.9641386270523071,-0.5088567137718201,0.0415053591132164,0.8163349628448486,-0.21512943506240845,0.12082536518573761,-0.3983280062675476,0.12678904831409454,-0.8364443778991699,-0.09419563412666321,-0.7634602785110474,-0.7557963132858276,0.5077084302902222,-0.444617360830307,0.26666444540023804,-0.25439900159835815,3.6664342880249023,0.4776168763637543,0.05848127231001854,0.3998822867870331,1.1698061227798462,0.7882514595985413,-0.1581261307001114,-0.23112893104553223,-0.7322351336479187,-0.49524715542793274,0.372330904006958,0.36429038643836975,1.0240098237991333,0.5790278315544128,-0.3525961637496948,0.38738876581192017,-0.13322463631629944,0.787551760673523,0.6428444981575012,-1.2297862768173218,-0.9917603731155396,-0.5562047362327576,0.2146642953157425,0.8369634747505188,-0.5251255035400391,0.4102334976196289,0.6984227895736694,-0.8667664527893066,-0.1408490389585495,-0.8810006380081177,0.9910755157470703,-0.26874077320098877,0.6402947902679443,-0.8854994177818298,-0.5035386681556702,0.378109335899353,0.07884366810321808,-0.02549321949481964,-0.14774946868419647,-0.5794371962547302,-0.3648683428764343,0.29148584604263306,-0.8143506646156311,-0.3226469159126282,-0.21946895122528076,0.7430067658424377,0.4402856230735779,-0.09337639808654785,1.4086858034133911,0.16586096584796906,0.767601490020752,-0.05555306747555733,0.3715052604675293,1.3782682418823242,-0.9532417058944702,0.16616389155387878,-0.8862783908843994,0.39727628231048584,0.4196211099624634,0.15457874536514282,0.1690782904624939,0.06919054687023163,0.724695086479187,-0.7799537181854248,0.24668094515800476,0.36646968126296997,0.27672600746154785,0.5473178625106812,-1.3883249759674072,0.025557413697242737,0.4888012111186981,0.1784696877002716,-0.5218316912651062,-0.29063278436660767,0.43535810708999634,0.5428
794622421265,0.07689512521028519,0.561474084854126,0.7262402772903442,0.05700063705444336,0.7323194742202759,0.0058692581951618195,0.1817697137594223,-0.2129569947719574,0.040970735251903534,0.23067329823970795,-0.5619150400161743,-0.21048621833324432,-0.4133268892765045,0.061900556087493896,0.6525273323059082,0.2664949297904968,-0.20414435863494873,-0.01234511286020279,-0.1664981096982956
]
},
"fieldName": "question-embedding",
"limit": 10
},
"similarDocumentsContents": {
"type": "documentContent",
"documents": {
"type": "documents:reference",
"use": "similarDocuments"
}
}
},
"output": {
"stages": [
"similarDocumentsContents",
"similarDocuments"
]
},
"tags": [
"Similar Documents (MLT)"
]
}
You should see documents that, hopefully, answer (or are closely related to) the question behind the query vector:
If you're expecting to make large numbers of "natural language" search queries, with a little bit of
configuration, you can use the
vector:fromEmbeddingService
stage to get LLM embedding vectors directly in Lingo4G requests.
To make this work, let's edit the jsonrecords.project.json
project descriptor file to add the
following snippet to its
analysis_v2/components
section:
{
"analysis_v2": {
"components": {
"ollamaMxbai": {
"type": "embeddingService:ollama",
"url": "http://localhost:11434/api/embed",
"model": "mxbai-embed-large",
"prompt": ""
}
}
}
}
The above declaration tells Lingo4G how to connect to the Ollama service and which model to use to compute embedding vectors. There is no longer any need to copy and paste vector data from the console; Lingo4G can make the call for you:
{
"name": "Documents by an explicit embedding vector",
"stages": {
"similarDocuments": {
"type": "documents:vectorFieldNearestNeighbors",
"vector": {
"type": "vector:fromEmbeddingService",
"text": "How can I disable UAC in Windows?"
},
"fieldName": "question-embedding",
"limit": 10
},
"similarDocumentsContents": {
"type": "documentContent",
"documents": {
"type": "documents:reference",
"use": "similarDocuments"
}
}
},
"output": {
"stages": [
"similarDocumentsContents",
"similarDocuments"
]
},
"tags": [
"Similar Documents (MLT)"
]
}
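If a request using the embedding service fails, it may help to confirm that the Ollama endpoint answers at the configured URL and that the model is available. The sketch below is one way to do that from Python with only the standard library. It assumes Ollama's /api/embed endpoint accepts a JSON body with model and input fields and returns an embeddings array; the helper names (build_embed_request, parse_embed_response, embed) are hypothetical and not part of Lingo4G or Ollama.

```python
import json
import urllib.request

# Assumed to match the "url" and "model" of the embeddingService:ollama component.
OLLAMA_URL = "http://localhost:11434/api/embed"
MODEL = "mxbai-embed-large"


def build_embed_request(text, model=MODEL):
    """Build the JSON payload for a single-text embedding request."""
    return {"model": model, "input": text}


def parse_embed_response(body):
    """Extract the first embedding vector from a response body (str or bytes)."""
    embeddings = json.loads(body)["embeddings"]
    if not embeddings:
        raise ValueError("Ollama returned no embeddings")
    return [float(value) for value in embeddings[0]]


def embed(text, url=OLLAMA_URL, model=MODEL, timeout=60):
    """POST the text to the embedding endpoint and return its vector."""
    payload = json.dumps(build_embed_request(text, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embed_response(response.read())
```

With Ollama running locally, calling embed("How can I disable UAC in Windows?") should return a list of floats. A connection error or a missing-model error at this point would most likely surface in Lingo4G requests as well.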
Document clustering
You can also use external embeddings for document clustering. Copy the following, slightly more verbose, request into the Explorer and run it:
{
"variables": {
"query": {
"name": "Query",
"comment": "Selects documents to arrange into a 2d map.",
"value": "*:*"
},
"maxDocuments": {
"name": "Max documents",
"comment": "Maximum number of documents to include in the 2d map.",
"value": 5000
},
"maxLabels": {
"name": "Max labels",
"comment": "Maximum number of labels to put on the 2d map.",
"value": 200
}
},
"stages": {
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": {
"@var": "query"
}
},
"limit": {
"@var": "maxDocuments"
}
},
"labels": {
"type": "labels:fromDocuments",
"labelAggregator":{
"type": "labelAggregator:topWeight",
"maxRelativeDf": 0.5,
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"labelFilter": {
"type": "labelFilter:tokenCount",
"minTokens": 1,
"maxTokens": 2
}
}
},
"maxLabels": {
"type": "labelCount:fixed",
"value": {
"@var": "maxLabels"
}
}
},
"documents2dEmbedding": {
"type": "embedding2d:lv",
"matrix": {
"type": "matrix:knnVectorsSimilarity",
"vectors": {
"type": "vectors:fromVectorField",
"fieldName": "question-embedding"
}
}
},
"documents2dEmbeddingLabels": {
"type": "embedding2d:lvOverlay",
"matrix": {
"type": "matrix:keywordLabelDocumentSimilarity",
"labels": {
"type": "labels:reference",
"use": "labels"
}
},
"embedding2d": {
"type": "embedding2d:reference",
"use": "documents2dEmbedding"
}
},
"docClusters": {
"type": "clusters:ap",
"matrix": {
"type": "matrix:knn2dDistanceSimilarity",
"embedding2d": {
"type": "embedding2d:reference",
"use": "documents2dEmbedding"
},
"maxNearestPoints": 32
},
"softening": 0.2,
"inputPreference": -10000
}
},
"output": {
"stages": [
"documents",
"labels",
"documents2dEmbedding",
"documents2dEmbeddingLabels",
"docClusters"
]
},
"tags": [
"2D Embeddings"
]
}
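The request above keeps its tunable values in the variables section and refers to them through @var references. If you want to try several settings without editing the JSON by hand, a small helper can rewrite the variable values before the request is submitted. This is only a sketch: the with_variables function is hypothetical, the request shown is a cut-down stand-in for the full one, and the "windows" query string is an assumed example value.

```python
import copy
import json


def with_variables(request, **values):
    """Return a copy of the request with the given variable values replaced."""
    updated = copy.deepcopy(request)
    declared = updated.get("variables", {})
    for name, value in values.items():
        if name not in declared:
            raise KeyError(f"request declares no variable named {name!r}")
        declared[name]["value"] = value
    return updated


# Cut-down stand-in for the clustering request: only the parts relevant here.
request = {
    "variables": {
        "query": {"name": "Query", "value": "*:*"},
        "maxDocuments": {"name": "Max documents", "value": 5000},
    },
    "stages": {
        "documents": {
            "type": "documents:byQuery",
            "query": {"type": "query:string", "query": {"@var": "query"}},
            "limit": {"@var": "maxDocuments"},
        }
    },
}

# The original request is left untouched; only the copy gets new values.
smaller = with_variables(request, query="windows", maxDocuments=500)
print(json.dumps(smaller["variables"], indent=2))
```

Because the stages only reference the variables, changing the values in one place is enough; the stage definitions themselves do not need to be edited.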
The output clusters will be poorly labeled, because the input data set is too small for Lingo4G's label extraction to work well, but patterns of related documents are clearly visible in the data: