Why arrowspace is game-changing for data operations at scale

TL;DR: Having a vector database, graph-search capabilities, and a key-value store in the same harness can make the difference for data practitioners who face the challenge of dataset operations at AI scale. Undercurrent: designing the data system for a planetary research vessel.

arrowspace v0.22.0 is out with improvements and a new graph motifs API.

You can find arrowspace in the:

Intro

arrowspace is a database for vectors, backed by a graph representation and a key-value store. The main use cases targeted are: AI search via advanced vector similarity, graph characterisation analysis and search, and indexing of high-dimensional vectors.

If you want to know the entire background to this story, please peek at the devlog.

What arrowspace is

arrowspace treats any dataset as a vector space with a graph layered over the vectors, enabling both geometric and topological operations in one index for search, matching, ranking, and dataset characterisation at any lifecycle stage. Two stabilised build paths, eigenmaps and pure energy maps, now provide spectral indices that respect manifold structure rather than just geometric projection, with compact runtime footprints and query-time work limited to light scalar terms plus small projected features.

arrowspace draws on the vector, graph, and key-value store paradigms by centering indexing on graph Laplacians constructed from data similarity graphs, aligning with standard spectral methods that preserve manifold structure for search, matching, ranking, and characterisation. This approach follows well-documented pipelines in which a neighborhood graph over items is built from feature similarity, and a Laplacian operator provides the spectral coordinates and energies that capture both global topology and local geometry.

Paradigm change 1. Any dataset is a graph

Vectors induce edges from item-space geometry while features induce topology; from these, arrowspace constructs a representative GraphLaplacian that supports reading, writing without index updates, matching, ranking, and cross-snapshot comparison of topology (global) and geometry (local) efficiently for evolving datasets.

A dataset can be cast as a similarity graph (Δ-graph, k-NN, fully connected with kernels), enabling construction of a representative graph Laplacian whose eigenvectors and eigenvalues encode cluster structure and geometric relations for downstream tasks [1][2]. Spectral views like Laplacian Eigenmaps and spectral clustering leverage this Laplacian to embed data in coordinates faithful to the manifold while supporting comparisons across snapshots through stable spectral features [3][2]. This Laplacian-centric view powers eigenmaps and energy maps so the index encodes structure directly, reducing cosine similarity to a tiebreaker when desired and enabling robust comparisons across datasets and time.
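
As a minimal sketch of this pipeline (standard numpy/scipy stand-ins, not arrowspace's own API), one can build a k-NN similarity graph, take its normalised Laplacian, and read off the low eigenvectors as manifold-faithful coordinates:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def knn_similarity_graph(X, k=10, sigma=1.0):
    """Symmetric k-NN similarity graph with a Gaussian kernel."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # Keep each node's k strongest edges, then symmetrise.
    drop = np.argsort(-W, axis=1)[:, k:]
    for i, cols in enumerate(drop):
        W[i, cols] = 0.0
    return np.maximum(W, W.T)

X = np.random.default_rng(0).random((200, 32))  # 200 items, 32 features
W = knn_similarity_graph(X, k=10)
L = laplacian(W, normed=True)      # normalised graph Laplacian
vals, vecs = eigh(L)               # eigenpairs, ascending eigenvalues
embedding = vecs[:, 1:4]           # Laplacian Eigenmaps coordinates
```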

Any vector space is automatically also a graph, which streamlines operations required by MLOps and AIOps teams in data-driven companies. Geometric relations are defined from the item space; topological relations are defined from the feature space. This is enough to generate a representative GraphLaplacian for any dataset, at any point in its lifecycle, in the most efficient way the capabilities' constraints allow. The feature space so generated by arrowspace works as a characterisation of the dataset, in the sense that it can be used to read (relatively fast for a prototype that includes topological structure), write (without index updates), match, rank, and compare the dataset with any other snapshot of itself or with other comparable datasets, in terms of topology at the wider scale and geometry at the smaller scale.
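
To make cross-snapshot comparison concrete, here is an illustrative recipe, reusing knn_similarity_graph and X from the sketch above; the low end of the Laplacian spectrum serves as a compact fingerprint that moves little under small perturbations. How arrowspace computes its own characterisation internally may differ:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def spectral_signature(X, k=10, n_eigs=16):
    """Low end of the Laplacian spectrum as a topological fingerprint."""
    W = knn_similarity_graph(X, k=k)   # helper from the sketch above
    L = laplacian(W, normed=True)
    return eigh(L, eigvals_only=True)[:n_eigs]

rng = np.random.default_rng(1)
sig_a = spectral_signature(X)                                # snapshot A
sig_b = spectral_signature(X + 0.01 * rng.standard_normal(X.shape))
drift = float(np.linalg.norm(sig_a - sig_b))  # small if topology is stable
```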

Paradigm change 2. High dimensions are the norm

Manifold methods such as Laplacian Eigenmaps and diffusion maps remain effective in high dimensions by exploiting locality and the connection of the graph Laplacian to the Laplace–Beltrami operator and heat equation, concentrating on the geometry of the graph rather than the raw ambient dimension [4][3]. When needed, dimensionality can be reduced with probabilistic guarantees using the Johnson–Lindenstrauss lemma, preserving pairwise distances within \(1 \pm \epsilon\) to keep search and analysis reliable at scale [5][6].
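
A quick sketch of the Johnson–Lindenstrauss side, using the textbook bound and a plain Gaussian projection (constants vary across statements of the lemma, so treat the exact numbers as indicative):

```python
import numpy as np

def jl_min_dim(n_points: int, eps: float) -> int:
    """Classic JL bound: target dimension preserving all pairwise
    distances within (1 ± eps) with high probability."""
    return int(np.ceil(4 * np.log(n_points) / (eps**2 / 2 - eps**3 / 3)))

rng = np.random.default_rng(0)
n, d = 500, 10_000                   # far more dimensions than items
X_high = rng.standard_normal((n, d))
k = jl_min_dim(n, eps=0.3)           # depends on n and eps, not on d
R = rng.standard_normal((d, k)) / np.sqrt(k)  # Gaussian projection
X_low = X_high @ R                   # distances preserved within 1 ± eps
```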

Leveraging this approach on the features graph, together with the other elements of energy dispersion and taumode, makes arrowspace well suited for datasets in which dimensionality is the main concern. It allows meaningful analysis and search even when dimensions overwhelmingly outnumber the items in the dataset (think bag-of-words with a 100k vocabulary but 10 million dimensions, a genome sequence with 20k genes but millions of potential relations among those genes, or example prompts on a 70-million-weight model). The paradigm scales adaptively to lower or higher dimensions to deliver good results on any vector space used in research and commercial applications. The inspiration comes from full-disk and partial spectrography: analysing Jupiter's entire disk produces aggregate albedo and phase-angle distributions, while partial observations of atmospheric regions yield localised composition signatures. Different spatial scales reveal different spectral characteristics, even within the same planetary system.

Practical consequences

Algorithmic consequences:

Consequences on a practical level for data practitioners, made possible by the arrowspace approach:

Who benefits

Briefly, these give unprecedented capabilities to people working in:

Dispersion and diffusion

Definition

Diffusion on graphs corresponds to applying heat-like dynamics generated by the Laplacian, smoothing signals by iteratively reducing local differences along edges in a manner captured by diffusion maps and related Markov processes [4]. By contrast, dispersion (in the physical sense) denotes apparent spreading due to flow-induced shear combined with molecular diffusion, as in Taylor–Aris dispersion, which enhances axial spreading even when molecular diffusivity is fixed [11][12].
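
A toy illustration of Laplacian-generated diffusion, using a dense matrix exponential for clarity; production systems would use sparse operators or truncated expansions rather than this direct form:

```python
import numpy as np
from scipy.linalg import expm
from scipy.sparse.csgraph import laplacian

# f(t) = exp(-t L) f(0): heat-like smoothing of a signal on a graph.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = laplacian(W)
f0 = np.array([1.0, 0.0, 0.0, 0.0])   # all "heat" starts on node 0
for t in (0.1, 1.0, 10.0):
    print(t, np.round(expm(-t * L) @ f0, 3))  # drifts toward uniform
```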

How arrowspace uses diffusion-energy

In graph learning practice, diffusion is used to propagate local geometric information and reveal multiscale cluster structure, while energy terms such as Dirichlet energy and the Rayleigh quotient quantify smoothness or roughness that can be used to regularize rankings and embeddings [8][4]. These spectral tools provide interpretable knobs that connect proximity, connectivity, and smoothness, supporting audit-friendly retrieval and characterization beyond cosine-only scoring [8][2].
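
These energy terms are compact to state: for a signal \(f\) on a graph with Laplacian \(L\), the Dirichlet energy is \(f^\top L f\) and the Rayleigh quotient normalises it by \(f^\top f\). A self-contained sketch on the toy graph from the diffusion example (how arrowspace folds these terms into ranking is its own design):

```python
import numpy as np

W = np.array([[0, 1, 1, 0],           # same toy adjacency as above
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def dirichlet_energy(W, f):
    """f^T L f: total squared difference across edges."""
    L = np.diag(W.sum(axis=1)) - W
    return float(f @ L @ f)

def rayleigh_quotient(W, f):
    """Smoothness score: 0 for constant signals, larger when rough."""
    return dirichlet_energy(W, f) / float(f @ f)

smooth = np.ones(4)                       # constant signal
rough = np.array([1.0, -1.0, 1.0, -1.0])  # flips sign across edges
assert rayleigh_quotient(W, smooth) == 0.0
assert rayleigh_quotient(W, rough) > 1.0
```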

Future developments

Algorithmic:

Potential future capabilities

Notes on the name

arrowspace honors Kenneth Arrow’s inquisitive, discussion-driven research ethos; “space” signals an added analytical layer that augments vector spaces with manifold-aware search and characterization.

References

1. https://arxiv.org/abs/0711.0189

2. https://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf

3. https://dl.acm.org/doi/10.1162/089976603321780317

4. https://www.math.pku.edu.cn/teachers/yaoy/Fall2011/Lafon06.pdf

5. https://www.math.toronto.edu/undergrad/projects-undergrad/Project03.pdf

6. https://cs.stanford.edu/people/mmahoney/cs369m/Lectures/lecture1.pdf

7. https://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering

8. https://arxiv.org/pdf/2203.03221.pdf

9. https://fanchung.ucsd.edu/research/cb/ch1.pdf

10. https://msp.org/pjm/2004/216-2/pjm-v216-n2-p03-p.pdf

11. https://en.wikipedia.org/wiki/Taylor_dispersion

12. https://www.azom.com/article.aspx?ArticleID=12173

13. https://arxiv.org/abs/2209.14734

14. https://arxiv.org/pdf/2209.14734.pdf

15. https://iclr.cc/media/iclr-2023/Slides/11556.pdf

16. https://openreview.net/pdf?id=IVwWgscehR

17. https://papers.neurips.cc/paper/8157-dags-with-no-tears-continuous-optimization-for-structure-learning.pdf

18. https://causaldm.github.io/Causal-Decision-Making/2_Causal_Structure_Learning/Causal%20Discovery.html

19. http://papers.neurips.cc/paper/7877-graph-convolutional-policy-network-for-goal-directed-molecular-graph-generation.pdf

20. https://dl.acm.org/doi/10.5555/3327345.3327537

21. https://arxiv.org/abs/1806.02473