arrowspace-rs: Faster (not approximate?) Nearest Neighbours

You can find arrowspace in the:

Rust repository ↪️ cargo add arrowspace
and Python repository ↪️ pip install arrowspace

This is an important release as it comes after the confirmation received on the previous post and focus on making those results replicable giving the computation a solid backup in numerical stability, the core of the computation is available in this file.

In particular I am now quite convinced that arrowspace is one of the fastest Approximate Nearest Neigbours algorithm out there and as far as the test goes also not approximate, as it returns the same results as cosine similarity. Obviously backing this from a single-researcher perspective will take much more time and effort.

According to the time-complexity computation and based on public knowledge of other algorithms:

"training" phase (index building) with complexity: O(n * d * k * log(n)). The latest comparative benchmark achieves index building on 310,000 records dataset at ~1600s (single machine, 12 cores). This involves graph building and linear score computation;
"inference" phase (index lookup for similar vectors): O(n * log(n * k)), this is the usual complexity for search on a sorted set.

This was possible because of the improvements described in the previous post and because the hardening of tests that made evident some limitations in the sampling strategy, now improved and better tested, and the clustering algorithm, now made more consistent (see below). In particular I would like to mention: pre-filtered clustering to top-k neighbours with a new smartcore’s CosineSimilarity structure based on my implementation of FastPair, an approach to clustering using inline-sampling, a procedure that allows spotting optimal k-number of clusters using a mix of statistical approximations (K-means-Lloyd, Calinski-Harabasz score for Between-Group Sum of Squares and Within-Group Sum of Squares).

As you see I am starting to think the index in terms of Machine Learning principles as few far-fetched ideas sprung from this latest effort; in particular how to do graph learning on the feature graph currently used and how to use taumode in a potential cost function or something on the same wave.

For now let’s focus on this version:

What’s new in v0.16.0

Deterministic clustering

With a controllable seed enables stable runs and reproducible search behavior across builds, removing prior sources of variability in graph construction.

A configurable sampling

Sampling strategy was added and SimpleRandomSampler is now the default beside the alternative AdaptiveSampler; both simplifies scaling larger corpora without sacrificing recall in the tail, with AdaptiveSampler making clustering a little less deterministic in exchange for some time savings.

Parallel clustering computation

It was a challenge to make the row-by-row computation working with deterministic sequential for-loop and with a parallel for-loop at the same time. Version 0.15.0 was somehow flaky in terms of consistency of clustering results, code available here. Since the beginning I had this idea of doing clustering and sampling in the same loop, I don’t know if it is a common thing, it helps in daving a for-loop on potentially very large datasets.

To ensure replicable builds it is not possible to use a parallel iterator in clustering for the obvious reason that we will get different clustering outcome. It was improved and made seed‑aware for both clustering and spectral search paths, aligning throughput with determinism goals. This is particularly useful in the explorative phase of a graph and in tests, when the parameters are not being fully identified correctly.

Furthermore, tests were strengthened and flakiness removed to lock in numeric stability as part of CI. Still one or two tests fail for a particular architecture but the problems have been mitigated

Run context

The same as the previous article. The CVE experiment reused the same three analyst‑style queries as before and computed cosine (τ=1.0), hybrid (τ=0.8), and taumode (τ=0.62) over a shared candidate set, allowing like‑for‑like NDCG and tail statistics, see previous post. Per‑query rank agreement in v0.16.0 is perfect across Spearman, Kendall, and NDCG@10, indicating the spectral modes re‑weight the tail without disturbing the head ordering for the tested queries.

Headline results

Average NDCG@10 is 1.0000 for both “Hybrid vs Cosine” and “Taumode vs Cosine,” reflecting identical top‑k order for these queries under v0.16.0. Average Tail/Head ratios improve as τ drops below 1.0, with taumode showing the strongest long‑tail quality and the lowest tail variability.

Version comparison

The table below places the previously reported v0.15.0 snapshot next to v0.16.0 on the same evaluation axes used in earlier posts.

Metric	v0.15.0	v0.16.0
Avg NDCG@10 (Hybrid vs Cosine)	0.9988	1.0000
Avg NDCG@10 (taumode vs Cosine)	0.9886	1.0000
Avg Tail/Head ratio (Cosine)	0.9114 ± 0.0463	0.9718 ± 0.0019
Avg Tail/Head ratio (Hybrid)	0.9394 ± 0.0340	0.9787 ± 0.0012
Avg Tail/Head ratio (taumode)	0.9593 ± 0.0259	0.9843 ± 0.0008

By diagrams:

CVE Top-10 comparative analysis for the three test queries
Colors: green = taumode orange = hybrid blue = cosine
Version 0.15	Version 0.16

CVE Head/Tail analysis
Version 0.15	Version 0.16

Results

taumode is ready to provide additional features and to step up tests to a cloud environment-scale. I may decide to make the CVE dataset searching capability publicly available thorugh an API. If you are interested in such feature, please consider sponsoring my Github. I am self-funded independent researcher.

Interpreting the deltas

Moving from v0.15.0 to v0.16.0, NDCG@10 rises to 1.0000 across both spectral modes, confirming head‑alignment while still boosting tail strength. Tail/Head ratios tighten substantially and tail coefficients of variation drop for τ<1.0, quantifying the stability and long‑rank consistency improvements attributable to the improvements in clustering algorithm.

Per‑query tail quality

Across the three queries, the tail mean rises and dispersion falls as τ decreases, with taumode achieving the highest tail‑to‑head balance and the lowest CV in each case. This behavior matches the intended “temperature‑like” control in the spectral family, trading a strictly geometric score for one that encodes graph topology while keeping the head intact.

Practical impact

Determinism plus stronger long‑tail stability makes results easier to audit and compare across runs, an important property for operational retrieval and analyst workflows. Combined with the new default sampler, v0.16.0 shortens iteration time on larger corpora while preserving the spectral signal that lifts tail relevance.

Artifacts

Top‑10 comparison and tail analysis figures for v0.16.0 were regenerated for the three queries to mirror the earlier post’s layout and scale.
CSV artifacts include per‑query rank correlations and NDCG, tail metrics, and a compact summary for roll‑up reporting.

Next steps

Build a storage layer for parquet files using Apache Arrow
explore subgaphs and subvector spaces
Run algebraic analysis on the interval of Reals generated by the characteristic Taumode of each dataset
Add all the harness required by cloud installations

Interested in learning more? Whether you’re evaluating ArrowSpace for your data infrastructure, considering sponsorship, or want to discuss integration strategies, please check the Contact page.

Please consider sponsoring my research and improve your company’s understanding of LLMs and vector databases.

Book a call on Calendly to discuss how ArrowSpace can accelerate discovery in your analysis and storage workflows, or discussing these improvements.