Testing ArrowSpace on a new wave of embeddings
I ran new experiments on the CVE dataset to compare ArrowSpace search with two embedding backends: the original 384-dimensional DenoisingAutoEncoder (test 15) and Perplexity's newest 1024-dimensional embeddings based on pretrained Qwen3 base models (0.6B parameters, "pplx-embed") (test 16). The goal is to see how ArrowSpace behaves as embedding dimensionality grows, and how much extra quality we actually gain relative to the cost in memory, training, and inference.
TL;DR
ArrowSpace is consistently better than plain cosine on highly semantic vector spaces, and its advantages remain coherent as embedding dimensionality grows.
Even when we upgrade to strong 1024-dim embeddings, ArrowSpace's spectral modes (hybrid/taumode) keep their edge on structure-aware retrieval and tail behaviour, while the cost gap between 384 and 1024 dimensions stays significant.
Previous blog posts on testing ArrowSpace on CVE are here.
Code and data:
1. Experimental setups (test 15 vs 16)
- Both tests use the same CVE JSON pipeline, the same query set (18 security queries), the same ArrowSpace graph parameters, the same tau grid (cosine τ=1.0, hybrid τ=0.72, taumode τ=0.42) and the same evaluation metrics (Spearman, Kendall, NDCG@10, MRR-Top0, tail metrics).
- Test 15 encodes CVEs with a domain-adapted 384-dim SentenceTransformer autoencoder, scaled to float64.
- Test 16 repeats the same ArrowSpace build and evaluation but using 1024-dim Perplexity embeddings, with the same graph and tau configuration and the same per-query metrics and tail analysis CSVs.
So the main difference is purely the base embedding model (384 vs 1024 dimensions); the ArrowSpace spectral layer and evaluation protocol are held constant.
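The protocol's rank-agreement metrics can be sketched in plain NumPy. This is a minimal illustration of what Spearman/Kendall measure between two score lists, not the actual evaluation code; the function names are mine:

```python
import numpy as np

def ranks(scores):
    """Rank positions (0 = highest score) of each item."""
    return np.argsort(np.argsort(-np.asarray(scores, dtype=float)))

def spearman(a, b):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a) - ranks(a).mean(), ranks(b) - ranks(b).mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def kendall(a, b):
    """Kendall tau: concordant minus discordant pairs, normalised."""
    n, s = len(a), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return float(s / (n * (n - 1) / 2))

# Two methods that order documents identically agree perfectly:
print(spearman([0.9, 0.7, 0.5], [0.8, 0.6, 0.4]))  # -> 1.0
print(kendall([0.9, 0.7, 0.5], [0.8, 0.6, 0.4]))   # -> 1.0
```

A correlation near 1.0 between two modes means they order the same documents the same way; the interesting queries below are the ones where this breaks.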
2. Global quality metrics
From the per-run summaries:
- NDCG@10
- Test 15 (384-dim):
- Hybrid vs cosine: 0.754
- Taumode vs cosine: 0.611
- Taumode vs hybrid: 0.980
- Test 16 (1024-dim):
- Hybrid vs cosine: 0.968
- Taumode vs cosine: 0.906
- Taumode vs hybrid: 0.976
Interpretation: with 1024-dim embeddings, cosine and hybrid rankings are much closer (NDCG ≈ 0.97 vs ≈ 0.75), meaning the raw embedding space is already better aligned and cosine leverages the added dimensionality. Taumode stays very close to hybrid in both settings (≈ 0.98 in both), so ArrowSpace preserves the neighbourhood structure across dimensions and continues to add structure on top of whatever the base embedding provides.
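The "NDCG of method A vs method B" comparisons above can be sketched by treating one method's scores as relevance gains and evaluating the other method's ordering against them. This is an illustrative reconstruction of the metric, not the test harness itself:

```python
import numpy as np

def dcg(gains):
    """Discounted cumulative gain with log2 position discount."""
    return float(np.sum(gains / np.log2(np.arange(2, len(gains) + 2))))

def ndcg_at_k(ref_gains, order, k=10):
    """NDCG@k of `order` (doc ids, best first) against a reference
    method's scores `ref_gains` (doc id -> gain)."""
    gains = np.array([ref_gains[d] for d in order[:k]], dtype=float)
    ideal = np.sort(np.array(list(ref_gains.values()), dtype=float))[::-1][:k]
    return dcg(gains) / dcg(ideal)

cosine_scores = {0: 3.0, 1: 2.0, 2: 1.0}
print(ndcg_at_k(cosine_scores, [0, 1, 2]))  # perfect agreement -> 1.0
print(ndcg_at_k(cosine_scores, [2, 1, 0]))  # reversed order -> < 1.0
```

Under this reading, "hybrid vs cosine ≈ 0.97" means hybrid's top-10 ordering loses almost none of the gain that cosine's own ordering would collect.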
- MRR-Top0 (topology-aware quality of the whole list, as defined in the MRR-Top0 paper):
- Test 15:
- Cosine: 0.0079
- Hybrid: 0.0201
- Taumode: 0.0219
- Test 16:
- Cosine: 0.0293
- Hybrid: 0.0435
- Taumode: 0.0470
Interpretation: moving to 1024-dim increases topology-weighted reciprocal rank roughly 3-4× for cosine, and about 2× for hybrid/taumode. The richer embedding clearly improves overall graph-consistent ranking, but ArrowSpace remains consistently ahead of vanilla cosine in both regimes, with taumode ≥ hybrid ≥ cosine.
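For readers unfamiliar with the metric family: MRR-Top0's exact topology weighting is defined in the cited paper and is not reproduced here; the backbone it extends is classic mean reciprocal rank, which looks like this:

```python
def mrr(first_relevant_ranks):
    """Classic mean reciprocal rank over queries (1-based ranks).
    MRR-Top0 (see the paper) additionally weights each query by the
    graph-topology consistency of the whole list, omitted here."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# If three queries find their first relevant hit at ranks 1, 2 and 4:
print(round(mrr([1, 2, 4]), 3))  # -> 0.583
```

The small absolute values in the tables (0.008-0.047) reflect that topology weighting, which discounts lists whose ordering disagrees with the graph structure.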
- Tail/Head ratio (semantic stability past rank 3):
- Test 15: Tail/Head ≈ 0.99 but MRR-Top0 and NDCG are lower, so the near-1 ratio mostly reflects lack of contrast, not uniformly great tails.
- Test 16: Tail/Head is lower overall, yet NDCG and MRR-Top0 improve. This means the higher-dim model is better at both putting the most relevant items in the head and keeping many good ones in the tail. Within test 16, ArrowSpace's taumode has the highest Tail/Head ratio and lowest tail variance, which is closer to "tails almost as good as heads" in a meaningful way.
Net effect: the 1024-dim model improves absolute retrieval quality and topology-weighted ranking, but ArrowSpace + 384-dim already yields a very coherent, stable ranking where taumode and hybrid nearly match, and ArrowSpace's taumode remains the best of the three in both regimes. This is a strong confirmation of the strengths of Graph Wiring: ArrowSpace consistently improves on pure cosine, and that advantage is stable as you scale up embedding capacity.
Note on Fig.2: on average Test 16 performs better except for the average Tail/Head score.
3. Per-query ranking behaviour
Looking at the detailed comparison metrics:
- In test 15 (384-dim), most queries have very high Spearman/Kendall between cosine and hybrid/taumode (often 0.99-1.0), but a few structured queries (e.g. command injection, XXE, integer overflow) show low or even negative correlation between cosine and taumode, while hybrid-taumode stays high.
- NDCG_taumode_vs_hybrid in test 15 is ≥ 0.95 for all queries and often ≈ 1.0, confirming that taumode reorders only locally while staying almost identical to hybrid at top-k.
- MRR-Top0 in test 15 shows taumode ≥ hybrid ≥ cosine for nearly all queries, sometimes substantially (e.g. XXE and integer overflow have much larger topology-weighted scores for spectral modes than for plain cosine).
For test 16 (1024-dim):
- Spearman/Kendall are mostly 1.0 across all pairs (cosine-hybrid, cosine-taumode, hybrid-taumode), indicating that with a stronger embedding the three ranking modes become almost identical for many queries.
- The exceptions (e.g. "unsafe deserialization", "sensitive information disclosure", "denial of service") are precisely the complex structural queries where taumode diverges somewhat from cosine/hybrid and tends to achieve higher MRR-Top0 and better tail statistics.
- MRR-Top0 is again consistently highest for taumode, then hybrid, then cosine, though the absolute gap between hybrid and taumode is modest (on the order of 0.01 for many queries).
This supports the view that ArrowSpace's spectral layer adds topology-aware smoothing and robustness that is especially helpful on tricky natural-language patterns, independent of base dimension. This happens thanks to the structural information carried by ArrowSpace's Graph Laplacian, as demonstrated in the paper ["Epiplexity And Graph Wiring: An Empirical Study for the design of a generic algorithm"](https://github.com/tuned-org-uk/graph-wiring-epiplexity/blob/main/paper/Epiplexity_A_measure_on_Graph_Wiring.pdf).
Note on Fig. 3: the additional dimensions provide the information needed for query 14 to perform like the others under cosine (taumode still performs better on this query in terms of head/tail topological quality).
Note on Fig. 4: the additional dimensions make cosine agree more with taumode, hinting that taumode really does embed structural information that was not present in the 384-dim embeddings; by roughly tripling the number of dimensions, the graph information can be partially encoded in plain embedding space. Why add so many dimensions for something that can be encoded in a sparse matrix?
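To make "structural information carried by the Graph Laplacian" concrete: spectral indexing of this kind rests on the Laplacian L = D - A of the item graph and the Rayleigh quotient of a signal on that graph. A toy sketch (illustrative only, not the library's implementation):

```python
import numpy as np

# Tiny 3-node similarity graph: node 0 connected to nodes 1 and 2.
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
L = np.diag(A.sum(axis=1)) - A  # graph Laplacian L = D - A

def rayleigh(L, x):
    """Rayleigh quotient x^T L x / x^T x: a per-vector smoothness
    score of the kind stored as a scalar lambda per row."""
    x = np.asarray(x, dtype=float)
    return float(x @ L @ x) / float(x @ x)

print(rayleigh(L, [1.0, 1.0, 1.0]))    # constant over the graph -> 0.0
print(rayleigh(L, [1.0, -1.0, -1.0]))  # sign-flipping signal -> large
```

A scalar per row plus a sparse Laplacian is a far smaller artifact than hundreds of extra embedding dimensions, which is the point of the closing question above.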
4. Tail behaviour and stability
Tail metrics (head_mean, tail_mean, tail_to_head_ratio, tail_cv, tail_decay_rate) show:
- In test 15, all three tau methods have tail_to_head_ratio ≈ 0.99 across queries, with tiny coefficients of variation (on the order of 0.001-0.006), meaning scores decay very slowly from head to tail. ArrowSpace's taumode has slightly higher tail_to_head_ratio and lower tail_cv, indicating the most stable and smooth tail.
- In test 16, tails drop more sharply (tail_to_head_ratio ≈ 0.90-0.98) and tail_std / tail_cv are larger, especially for cosine. Hybrid reduces this variance, and taumode again yields the most regular tail with the highest tail_to_head_ratio and smallest relative variance.
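A plausible reconstruction of these tail statistics, assuming head = first 3 ranks and tail = the rest of the top-k list (the exact definitions live in the analysis CSVs):

```python
import numpy as np

def tail_metrics(scores, head_k=3):
    """Head/tail statistics over a top-k score list, best-first."""
    s = np.asarray(scores, dtype=float)
    head, tail = s[:head_k], s[head_k:]
    return {
        "head_mean": float(head.mean()),
        "tail_mean": float(tail.mean()),
        "tail_to_head_ratio": float(tail.mean() / head.mean()),
        "tail_cv": float(tail.std() / tail.mean()),  # relative spread
    }

# A slowly decaying list has a ratio near 1 and a tiny tail_cv:
m = tail_metrics([1.00, 0.99, 0.99, 0.98, 0.98, 0.97])
print(round(m["tail_to_head_ratio"], 3), round(m["tail_cv"], 4))
```

A high ratio with low tail_cv is only meaningful when NDCG/MRR are also high, which is exactly the test 15 vs test 16 distinction drawn above.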
So in both dimensions, ArrowSpace spectral indexing improves tail coherence and score regularity (better calibrated similarity across the top-k list), which is exactly what you want for downstream re-ranking, multi-hop retrieval, and multi-agent workflows.
5. Cost and efficiency implications
Given identical infrastructure and dataloader, increasing the embedding size from 384 to 1024 has these implications:
- Memory: vector storage grows linearly in dimension, so 1024-dim embeddings need ≈ 2.7× more memory than 384-dim for the same number of CVEs. The ArrowSpace graph (k-NN edges, Laplacian, lambdas) sits on top of those vectors and thus also grows in computation cost with dimension via distance computations.
- Training time: the 384-dim domain-adapted autoencoder is cheaper to pre-train and fine-tune than a 1024-dim model (fewer parameters and FLOPs per batch), so domain adaptation cycles and hyper-parameter sweeps are significantly faster at 384-dim; this translates into less GPU time and energy to explore variants.
- Inference time: encoding a new CVE with a 1024-dim model requires ≈ 2.7× more floating-point operations than 384-dim (all else equal), and every similarity evaluation in ArrowSpace (distance computations, Rayleigh quotients, spectral refinement) scales linearly with dimension, so query throughput is correspondingly lower at 1024-dim.
Because ArrowSpace's spectral graph effectively concentrates structural information (epiplexity) into its λτ indices and topology factors, you recover much of the benefit of higher-dimensional representations while working in a lower-dimensional ambient space. In other words: you can let ArrowSpace encode graph structure instead of pushing ever more complexity into the embedding model.
6. Is the 1024-dim gain worth it?
Putting the measurements and costs together:
- Absolute gains
- MRR-Top0 roughly doubles from ≈ 0.022 (hybrid/taumode at 384-dim) to ≈ 0.044-0.047 at 1024-dim, and cosine improves ~4×; this is a real but moderate improvement, not a qualitative phase change.
- NDCG_taumode_vs_cosine jumps from 0.61 to 0.91, meaning that raw cosine rankings are much cleaner and closer to the spectral one in the 1024-dim model.
- Relative to cost
- These gains come at ≈ 2.7× higher vector dimensionality and corresponding increases in training, storage and inference cost.
- Meanwhile, ArrowSpace + 384-dim embeddings already yields very high NDCG_taumode_vs_hybrid (≈ 0.98), smooth tails, and the same qualitative ordering of tau methods.
So for a production system where infrastructure and GPU budget matter, you can reasonably say:
- ArrowSpace spectral indexing on 384-dim embeddings achieves comparable structural and tail behaviour to a much larger 1024-dim embedding space, and it does so at significantly lower memory and compute cost.
- ArrowSpace using the 1024-dim model is better in absolute terms and may be justified for high-stakes CVE triage where every marginal improvement counts, but the marginal gain over "384-dim + ArrowSpace" is modest relative to the extra training and inference burden.
- Across both settings, ArrowSpace is always better than pure cosine on these highly semantic CVE vectors, and that advantage is stable as we scale dimension.
In more straightforward terms:
- Is the increase in quality worth the extra training/inference work of 1024 dims?
  - For research or premium security workflows: probably yes, because MRR-Top0 and NDCG vs cosine clearly improve, and the model is more robust on hard queries; ArrowSpace plus a strong embedding is the best of both worlds.
  - For cost-sensitive or large-scale deployments: often no; 384-dim + ArrowSpace provides strong quality at about one-third the dimensionality and lower energy use, and still consistently beats cosine.
- Does ArrowSpace + Laplacian save time and energy by achieving similar results with 1/3 of the dimensions?
- Yes: by leveraging graph Laplacians and λτ spectral indexing, ArrowSpace makes the 384-dim model behave structurally much closer to the 1024-dim space, while reducing memory, FLOPs, and training/inference time proportionally to the lower dimension.
- Can we claim that ArrowSpace spectral indexing enables comparable results with fewer dimensions, less memory and less compute?
- Within this CVE experiment, yes: the metrics show that ArrowSpace's spectral modes (hybrid/taumode) on 384-dim embeddings deliver high NDCG alignment, strong tail stability, and good topology-weighted MRR, narrowing much of the gap to the 1024-dim model while using substantially fewer resources, and always outperforming pure cosine in structural terms.
7. Computing costs
Using ArrowSpace with 384-dim embeddings instead of 1024-dim gives roughly a 2.5-3× advantage across compute, storage, and throughput for a typical medium-scale workload, while preserving most of the retrieval quality thanks to the spectral index. Here are some back-of-the-envelope computations about scaling costs.
Storage
- Embedding storage scales linearly with dimension: moving from 1024 to 384 reduces perâitem vector size by a factor of $1024/384 \approx 2.67$.
- For a medium company with, say, 5-20 million documents in a vector store, that directly cuts embedding storage and any Parquet/Arrow snapshots by about 60%.
Compute (training and indexing)
- Training and fine-tuning FLOPs for the encoder scale roughly linearly with the embedding size in the projection layers; a 384-dim autoencoder therefore needs about 2.5-3× less compute and energy than a 1024-dim model per training step and per epoch.
- ArrowSpace's build pipeline (clustering, Laplacian, λτ computation) does all distance and Rayleigh-energy work in the ambient feature dimension, so reducing from 1024 to 384 again saves about 2.7× on those stages for the same dataset.
- λτ is computed once and stored as a scalar per row; this cost is amortised and does not grow at query time, making the lower-dimensional build particularly attractive for periodic re-indexing on medium-sized corpora.
Query throughput (online search)
- ArrowSpace search blends cosine with λτ using precomputed λ values, so the per-query hot path is dominated by vector arithmetic between the query and index rows; this cost scales linearly with dimension.
- Moving from 1024 → 384 dims therefore yields ≈ 2.7× more queries per second on the same CPU/GPU budget, which for a medium company (tens to hundreds of QPS peak) often means you can:
- Run on smaller instances, or
- Support higher traffic and more tenants on the same hardware.
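The hot path described above can be sketched as a cosine pass blended with a lambda-proximity term. The blend form and the `alpha` weight here are my assumptions for illustration; ArrowSpace's actual taumode formula may differ:

```python
import numpy as np

def blended_scores(query, index_vecs, index_lambdas, query_lambda, alpha=0.72):
    """Per-query scoring: an O(n*dim) cosine pass plus an O(n),
    dimension-free lambda-proximity term using precomputed per-row
    lambdas (hypothetical blend, not the library's exact formula)."""
    q = query / np.linalg.norm(query)
    V = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    cos = V @ q                                            # scales with dim
    lam = 1.0 / (1.0 + np.abs(index_lambdas - query_lambda))  # dim-free
    return alpha * cos + (1.0 - alpha) * lam

vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
lams = np.array([0.1, 2.0])
scores = blended_scores(np.array([1.0, 0.0]), vecs, lams, query_lambda=0.1)
print(scores)  # the row matching on both cosine and lambda wins
```

Note that only the cosine term scales with dimension; the lambda term costs the same at 384 and 1024 dims, which is why shrinking the vectors shrinks the whole hot path almost proportionally.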
Net advantage on a "medium company" workload
For a realistic mid-size deployment (millions of documents, tens of QPS, periodic re-indexing), remembering that adding a document to ArrowSpace does not require recomputing the whole index but just computing a lambda score on the new vector:
- Storage: ~60% reduction for embeddings and ArrowSpace snapshots when using 384-dim vs 1024-dim.
- Offline compute (training + index build): ≈ 2.5-3× less FLOPs and energy to train/fine-tune the encoder and to run ArrowSpace clustering/Laplacian/λτ.
- Online throughput: ≈ 2.5-3× more queries per second or equivalent latency reduction at the same hardware level.
Because ArrowSpace's λτ index recovers much of the structural signal that a larger embedding would otherwise encode directly, these savings come with only a modest quality loss relative to using 1024-dim embeddings without such spectral supervision, and ArrowSpace still maintains its advantage over plain cosine in both settings.
Storage example (100M docs)
Assume float32 embeddings (4 bytes per dimension), no extra overheads:
- 384-dim: storage ≈ 143 GB of raw vectors.
- 1024-dim: storage ≈ 381 GB of raw vectors.
So moving from 1024 → 384 dims saves ≈ 238 GB of embedding storage on 100M documents, before indexing overhead and replicas.
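These numbers follow from a one-line model (float32, GiB units):

```python
def embedding_gib(n_docs, dim, bytes_per_float=4):
    """Raw vector storage in GiB: docs x dims x bytes per float."""
    return n_docs * dim * bytes_per_float / 1024 ** 3

s384 = embedding_gib(100_000_000, 384)
s1024 = embedding_gib(100_000_000, 1024)
print(round(s384), round(s1024), round(s1024 - s384))  # -> 143 381 238
```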
Throughput and latency impact
Because most ArrowSpace hot-path operations (cosine, λ-aware scoring) scale linearly with dimension, 384-dim vs 1024-dim gives:
- ≈ 2.7× fewer FLOPs per query for similarity computations and λ-aware blending.
- In practice, this translates to roughly:
- 2-3× higher QPS at the same latency, or
- 2-3× lower latency at the same QPS, on the same hardware.
For "50 agent instances for 200 users per day":
- Suppose each user triggers 1,000 retrieval calls/day via agents (RAG loops, tools, etc.): about 200,000 queries/day, or ≈ 2.3 QPS on average, with peaks easily 10-20× higher (tens of QPS).
- With 1024-dim, you might need larger/extra search nodes to keep tail latencies acceptable at peak; with 384-dim + ArrowSpace, the same workload can generally fit into ~1/3 of the compute budget (fewer or smaller nodes) for similar latency targets.
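Spelling out that traffic estimate (the 1,000 calls/user/day figure is the assumption stated in the bullet):

```python
users = 200
calls_per_user_per_day = 1_000          # assumed agentic RAG volume
queries_per_day = users * calls_per_user_per_day
avg_qps = queries_per_day / 86_400      # seconds in a day
peak_qps = avg_qps * 15                 # mid-point of the 10-20x peak factor
print(queries_per_day, round(avg_qps, 1), round(peak_qps))  # -> 200000 2.3 35
```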
Cost and scaling intuition
Using 1024-dim did make cosine similarity better (though still not as good as spectral indexing in terms of context-awareness in the retrieved document space across all ranks), but at what cost? Can ArrowSpace's Laplacian be, in the majority of cases, a cheaper compressed proxy instead of tripling the number of dimensions? Spectral indexing is designed to handle higher-dimensional spaces and does not lose retrieval quality as dimensions grow, but your infrastructure may find it harder to handle the more complex training and heavier inference phase at model level.
At 100M documents:
- Storage: ~240 GB less embedding data to store, replicate, back up, and move across the network.
- Compute: ~2.7× reduction in FLOPs for training, index build, and per-query similarity; this directly reduces GPU/CPU hours and energy.
- Throughput: the same ArrowSpace cluster can handle ~2-3× more agent calls or serve the same agents with lower p95 latency.
Throughput is the most problematic expense: storage and raw compute will probably keep getting cheaper, but the cost of serving ever higher dimensions at inference time is a rising concern. That is why the compression of structural information that ArrowSpace provides becomes critical for efficiency and cost savings. If the target scale is "web-scale retrieval", having similar or better structural performance using a fraction of the dimensions is a real comparative advantage, especially when ArrowSpace keeps outperforming cosine regardless of how big your embedding model becomes (this should be confirmed on the 4B-parameter model, but at that point costs move to a different order of magnitude).
Conclusions
Just for reference: for an average developer, fine-tuning the embeddings for Test 15 (384-dim) was done on a local laptop with a few hours of compute; Test 16 embeddings (the 0.6B-parameter version) required 20 GB of RAM for approximately the same amount of time, using a dedicated Colab environment with an A100 machine.
Increasing the embedding dimension did improve the embeddings by providing part of the information missing from the 384-dim space, but at a cost; the team behind these embeddings has clearly added value for the cosine pipeline.
ArrowSpace's taumode still performs better on average at carrying topological information that is structural to the feature space of the vector collection.
These numbers may improve further with the 4B-parameter version, but the question is whether the costs of scaling up that much are worth it, considering that better results are achieved on the 0.6B version by complementing the search with a few-KB artifact (the Graph Laplacian), as ArrowSpace does. This is demonstrated formally in information-theoretical terms in the paper "Epiplexity And Graph Wiring: An Empirical Study for the design of a generic algorithm".
Thanks for reading, please share and consider sponsoring my research.