Why arrowspace is game-changing for data operations at scale
TL;DR: The possibility of having a vector database, graph-search capabilities and a key-value store in the same harness can make the difference for data practitioners who need to face the challenge of dataset operations at AI scale. Undercurrent: designing the data system for a planetary research vessel.
arrowspace v0.22.0 is out with improvements and a new graph motifs API.
You can find arrowspace in the:
- Rust repository: cargo add arrowspace
- Python repository: pip install arrowspace
Intro
arrowspace is a database for vectors supported by a graph representation and a key-value store. The main targeted use-cases are: AI search capabilities such as advanced vector similarity, graph characterisation analysis and search, and indexing of high-dimensional vectors.
If you want to know the entire background to this story, please peek at the devlog.
What arrowspace is
arrowspace treats any dataset as a vector space with an added graph layered over vectors, enabling both geometric and topological operations in one index for search, matching, ranking, and dataset characterization at any lifecycle stage.
Two stabilised build paths, eigenmaps and pure energy maps, now provide spectral indices that respect manifold structure instead of just geometric projection, with compact runtime footprints and query-time work limited to light scalar terms plus small projected features.
arrowspace draws on the vector, graph, and key-value store paradigms by centering indexing on graph Laplacians constructed from data-similarity graphs, aligning with standard spectral methods used to preserve manifold structure for search, matching, ranking, and characterization. This approach follows well-documented pipelines in which a neighborhood graph over items is built from feature similarity, and a Laplacian operator provides the spectral coordinates and energies that capture both global topology and local geometry.
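To make that pipeline concrete, here is a minimal sketch of the standard construction using scikit-learn and SciPy: a k-NN similarity graph over items and its normalized Laplacian. It illustrates the operator this kind of indexing is built on; it is not the arrowspace API, and the Gaussian kernel bandwidth below is an illustrative assumption.

```python
# Illustrative sketch (not the arrowspace API): build a k-NN similarity
# graph over items and derive the normalized graph Laplacian.
import numpy as np
from scipy.sparse import csgraph
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # 200 items, 64 features

# k-NN adjacency weighted by a Gaussian kernel on the distances;
# the bandwidth (mean distance) is an arbitrary illustrative choice.
A = kneighbors_graph(X, n_neighbors=10, mode="distance")
A.data = np.exp(-(A.data ** 2) / (2 * A.data.mean() ** 2))
A = 0.5 * (A + A.T)                     # symmetrise the graph

# Normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}.
L = csgraph.laplacian(A, normed=True)
print(L.shape)                          # (200, 200), stays sparse
```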
Paradigm change 1. Any dataset is a graph
Vectors induce edges from item-space geometry while features induce topology; from these, arrowspace constructs a representative GraphLaplacian that supports reading, writing without index updates, matching, ranking, and cross-snapshot comparison of topology (global) and geometry (local) efficiently for evolving datasets.
A dataset can be cast as a similarity graph (ε-neighborhood, k-NN, or fully connected with kernels), enabling construction of a representative graph Laplacian whose eigenvectors and eigenvalues encode cluster structure and geometric relations for downstream tasks.[1][2] Spectral views such as Laplacian Eigenmaps and spectral clustering leverage this Laplacian to embed data in coordinates faithful to the manifold while supporting comparisons across snapshots through stable spectral features.[3][2] This Laplacian-centric view powers eigenmaps and energy maps so the index encodes structure directly, reducing cosine similarity to a tiebreak when desired and enabling robust comparisons across datasets and time.
Any vector space is automatically also a graph, which streamlines operations required by MLOps and AIOps teams in data-driven companies. The geometric relations are defined from the item space; the topological relations are defined from the feature space. This is enough to generate a representative GraphLaplacian for any dataset, at any point in the lifecycle, in the most efficient way given the capabilities' constraints. The feature space so generated by arrowspace works as a characterisation of the dataset, in the sense that it can be used to read (relatively fast for a prototype that includes topological structure), write (without index updating), match, rank, and compare the dataset with any other snapshot of itself or with other comparable datasets, in terms of topology at wider scale and geometry at smaller scale.
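Continuing the sketch above, the smallest nontrivial eigenvectors of such a Laplacian give manifold-aware coordinates in the style of Laplacian Eigenmaps. This is a hypothetical standalone example with SciPy, not arrowspace's internal routine:

```python
# Sketch: spectral embedding from the smallest Laplacian eigenpairs
# (Laplacian Eigenmaps style), via one sparse eigenproblem.
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
A = 0.5 * (A + A.T)
L = csgraph.laplacian(A, normed=True)

# k smallest eigenpairs of the sparse Laplacian.
vals, vecs = eigsh(L, k=8, which="SM")
embedding = vecs[:, 1:]   # drop the trivial constant eigenvector
print(vals[:4])           # low eigenvalues reflect cluster structure
```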
Paradigm change 2. High dimensions are the norm
Manifold methods such as Laplacian Eigenmaps and diffusion maps remain effective in high dimensions by exploiting locality and the connection of the graph Laplacian to the Laplace-Beltrami operator and heat equation, concentrating on the geometry of the graph rather than the raw ambient dimension.[4][3] When needed, dimensionality can be reduced with probabilistic guarantees using the Johnson-Lindenstrauss lemma, preserving pairwise distances within \(1 \pm \epsilon\) to keep search and analysis reliable at scale.[5][6]
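A quick illustration of the Johnson-Lindenstrauss guarantee: a Gaussian random projection squeezes a high-dimensional dataset into roughly O(log n / ε²) dimensions while keeping pairwise distances close to their originals. The constant factor in the target dimension below is one common choice, not a canonical value:

```python
# Sketch: JL random projection preserves pairwise distances within
# 1 ± eps with high probability, independent of the ambient dimension.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n, d, eps = 300, 5_000, 0.5
k = int(8 * np.log(n) / eps ** 2)         # target dimension (constant varies)

X = rng.normal(size=(n, d))               # high-dimensional data
R = rng.normal(size=(d, k)) / np.sqrt(k)  # Gaussian projection matrix
Y = X @ R                                 # projected data

ratios = pdist(Y) / pdist(X)
print(k, ratios.min(), ratios.max())      # ratios concentrate near 1
```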
Leveraging this approach on the features graph, plus the other elements of energy dispersion and taumode, makes arrowspace well suited for datasets in which dimensionality is the main concern; it allows meaningful analysis and search even for datasets in which dimensions overwhelmingly outnumber the items in the dataset (think of a bag-of-words corpus with a 100k vocabulary but 10 million dimensions, a genome sequence with 20k genes but millions of potential relations among those genes, or example prompts on a 70-million-weight model). This paradigm can be scaled adaptively to lower or higher dimensions to deliver good results on any vector space used in research and commercial applications. It is inspired by full-disk and partial spectrography: for example, analyzing Jupiter's entire disk produces aggregate albedo and phase-angle distributions, while partial observations of atmospheric regions yield localized composition signatures. Different spatial scales reveal different spectral characteristics, even within the same planetary system.
Practical consequences
Algorithmic consequences:
- Sparse similarity graphs and a single sparse eigenproblem yield compact spectral artifacts and efficient pipelines, making it feasible to manage many dataset snapshots without heavy index churn.[7][2]
- Spectral embeddings let teams project items or segments into low-dimensional coordinates for fast ranking, clustering, and cross-dataset comparison grounded in graph geometry.[4][3]
- Graph-induced embeddings and standard vector embeddings unify naturally by building the graph from vector similarities and then working in the same embedded space for retrieval and analysis.[3][2]
- A unified interface emerges because geometric proximity defines edges while topological structure is captured by the Laplacian spectrum, linking similarity, connectivity, and semantics in one operator.[2][3]
Consequences on a practical level for data practitioners, made possible by the arrowspace approach:
- manage data archives with >10k different datasets/snapshots with less infrastructural overhead, via a local machine holding the space of the datasets while the datasets live in their own enclaves
- embed anything, including datasets themselves; reliably turn any dataset into a single representative vector with F dimensions (see the sketch after this list)
- make graph embeddings and non-graph embeddings work together in the same space
- have a unified interface for the geometric and topological/semantic layers
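One way the "dataset as a single vector" idea could be realized is sketched below: summarize each dataset by the F smallest eigenvalues of its similarity-graph Laplacian, a spectral signature that is comparable across snapshots. This is a hypothetical construction for illustration, not the signature arrowspace itself computes:

```python
# Hypothetical sketch of "one representative vector per dataset":
# the F smallest Laplacian eigenvalues as a spectral signature.
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

def dataset_signature(X: np.ndarray, F: int = 16) -> np.ndarray:
    A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
    A = 0.5 * (A + A.T)
    L = csgraph.laplacian(A, normed=True)
    vals = eigsh(L, k=F, which="SM", return_eigenvectors=False)
    return np.sort(vals)                  # comparable across snapshots

rng = np.random.default_rng(2)
sig_a = dataset_signature(rng.normal(size=(300, 32)))
sig_b = dataset_signature(rng.normal(size=(310, 32)) + 0.1)
print(np.linalg.norm(sig_a - sig_b))      # distance between snapshots
```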
Who benefits
- Data science gains robust manifold-aware embeddings and clusters that often outperform purely geometric clustering, with efficient implementations via standard linear algebra on sparse Laplacians.[7][2]
- Explainable and interpretable AI can attribute behavior to spectral coordinates and energies: the Rayleigh quotient of the Laplacian equals the normalized Dirichlet energy, making smoothness/roughness explicit and auditable (see the sketch after this list).[8][9]
- MLOps and AIOps benefit from clear, reproducible choices of graph type, scaling, and normalization that materially affect outcomes; these elements are documented in the spectral clustering literature as key to stable results.[2]
- Research using diffusion or wave-like operators on graphs can leverage heat-equation and wave-equation formalisms on graph domains, enabling experiment designs that probe transport and interface effects.[10][4]
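The auditability claim above rests on a concrete identity: for a signal x on a graph with Laplacian L = D - W, the Rayleigh quotient x^T L x / x^T x equals the normalized Dirichlet energy, i.e. the weighted sum of squared differences along edges. A few lines of NumPy verify it:

```python
# Verify: Rayleigh quotient of L equals the normalized Dirichlet energy.
import numpy as np

# Tiny weighted graph: adjacency W, Laplacian L = D - W.
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W

x = np.array([0.2, -1.0, 0.7])            # a signal over the 3 nodes

rayleigh = (x @ L @ x) / (x @ x)
dirichlet = 0.5 * sum(W[i, j] * (x[i] - x[j]) ** 2
                      for i in range(3) for j in range(3)) / (x @ x)
print(np.isclose(rayleigh, dirichlet))    # True: smoothness is auditable
```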
Briefly, these give unprecedented capabilities to people working in:
- data science: any vector-space discovery is strengthened with spectral graph methods such as Laplacian Eigenmaps and the spectral clustering toolkit, which translate similarity graphs into low-dimensional, structure-preserving coordinates for retrieval, clustering, and curation at scale.
- explainable and interpretable AI: any vector space is characterised by its properties, allowing a connection to outcomes observed at inference time. This also enables data-drift control on an item-by-item addition basis thanks to taumode.
- MLOps and AIOps: reproducible and auditable pipelines are supported by well-documented sensitivities in spectral methods (graph type, scaling, normalization, and neighborhood size), all of which materially affect results and must be versioned alongside artifacts.
- edge research in in-memory processing and other fields that require simulating systems with dispersion/diffusion models; the next section covers the difference between dispersion and diffusion, how arrowspace uses diffusion-energy, and the possibility of simulating metamaterials with an extension of arrowspace to complex values.
Dispersion and diffusion
Definition
Diffusion on graphs corresponds to applying heat-like dynamics generated by the Laplacian, smoothing signals by iteratively reducing local differences along edges, in a manner captured by diffusion maps and related Markov processes.[4] By contrast, dispersion (in the physical sense) denotes apparent spreading due to flow-induced shear combined with molecular diffusion, as in Taylor-Aris dispersion, which enhances axial spreading even when molecular diffusivity is fixed.[11][12]
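A minimal sketch of that heat-like smoothing: explicit Euler steps of dx/dt = -Lx on a small path graph, where each step shrinks differences along edges while conserving total mass:

```python
# Sketch: graph diffusion as heat-like dynamics, x <- x - alpha * L x.
import numpy as np

W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # 4-node path graph
L = np.diag(W.sum(axis=1)) - W

x = np.array([1.0, 0.0, 0.0, 0.0])          # heat concentrated at one node
alpha = 0.2                                 # step size, chosen for stability
for _ in range(50):
    x = x - alpha * (L @ x)
print(x)   # mass spreads toward the uniform state; the sum is conserved
```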
How arrowspace uses diffusion-energy
In graph-learning practice, diffusion is used to propagate local geometric information and reveal multiscale cluster structure, while energy terms such as the Dirichlet energy and the Rayleigh quotient quantify the smoothness or roughness that can be used to regularize rankings and embeddings.[8][4] These spectral tools provide interpretable knobs that connect proximity, connectivity, and smoothness, supporting audit-friendly retrieval and characterization beyond cosine-only scoring.[8][2]
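As a toy illustration of scoring beyond cosine-only retrieval, one can blend a geometric term with a precomputed per-item energy (Rayleigh quotient) so that rougher signals are penalised. The blend weight lam and the scoring form are illustrative assumptions, not arrowspace parameters:

```python
# Toy sketch: rank by cosine similarity regularized by spectral energy.
import numpy as np

def score(query, item, item_energy, lam=0.3):
    # item_energy stands in for a precomputed Rayleigh quotient x^T L x / x^T x.
    cos = query @ item / (np.linalg.norm(query) * np.linalg.norm(item))
    return cos - lam * item_energy        # penalise rough (high-energy) items

rng = np.random.default_rng(3)
q = rng.normal(size=8)
items = rng.normal(size=(5, 8))
energies = rng.uniform(0, 1, size=5)      # hypothetical precomputed energies
ranked = sorted(range(5), key=lambda i: -score(q, items[i], energies[i]))
print(ranked)
```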
Future developments
Algorithmic:
- Coherent, explainable diffusion-based graph generation is a fast-moving area, with discrete denoising diffusion models such as DiGress showing strong performance on molecular and non-molecular graphs.[13][14][15]
- Embedding causal relations into graph generation can build on differentiable causal discovery for DAG structure learning, where acyclicity constraints are enforced in continuous optimization (e.g., NOTEARS and its successors).[16][17][18]
- Reinforcement learning can guide graph construction toward task rewards, as shown by Graph Convolutional Policy Networks that assemble molecular graphs under chemistry constraints while optimizing objectives.[19][20][21]
- Complex-valued or wave-equation operators on graphs enable modeling interface phenomena, since wave dynamics have established graph formulations via edge-based Laplacians and related operators.[10]
Potential future capabilities
- weighted features: provide masks to prioritise some features in the feature space, for example positions 0 and 1 when they hold the longitude and latitude of a spatial feature vector; realizable via learned Mahalanobis metrics and local metric learning, or by coupling masks with anisotropic diffusion kernels to bias smoothing along salient coordinates.
- with enough accumulated data it will probably be possible to run coherent diffusion generation of graphs that is explainable and controllable according to the source graphs; building on discrete denoising diffusion for graphs (e.g., DiGress) with spectral or topology-guided conditioning for controllability.
- consider embedding causal relations in graph generation; inject acyclicity and intervention priors from differentiable causal discovery (e.g., NOTEARS and smooth acyclicity) to bias edge orientation and generative rollouts.
- use reinforcement learning for generating the graph, as it is possible to learn how energy disperses along edges for classes of datasets; leverage goal-directed graph RL (e.g., GCPN) to learn edge additions guided by energy or spectral rewards aligned with downstream tasks.
- have an arrowspace-complex that works on complex numbers to allow Dirac-like dispersion for simulating interfaces and boundaries in novel materials; employ magnetic/complex Laplacians and graph wave dynamics to encode directionality and boundary conditions with phase information (see the sketch after this list).
- anisotropic diffusion operators for direction- and feature-aware smoothing: alternate closed-form diffusion with local directional filters so that propagation respects flow, orientation, or masked features on the manifold.
- magnetic Laplacian modes for asymmetric or temporal graphs: use complex-phase Laplacians to capture edge orientation and cyclic flows, improving embeddings and community detection in directed networks.
- topology-aware feature masking via metric learning: learn global or local Mahalanobis metrics to turn masks into data-driven edge weights that better reflect per-domain importance before Laplacian construction.
- spectral positional encodings for sequences and spatiotemporal data: incorporate magnetic-Laplacian-based positional phases to preserve order and direction in spectral coordinates for retrieval and generation.
- causal-aware diffusion priors: constrain diffusion steps or score fields with DAG structure to maintain identifiability of directions during graph synthesis and editing.
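As a pointer for the complex-valued directions above, here is a hedged sketch of a magnetic (complex-phase) Laplacian for a directed graph: edge direction is encoded as a phase, and the operator stays Hermitian so its spectrum remains real. The charge parameter q is an illustrative choice:

```python
# Sketch: magnetic Laplacian of a directed graph. Direction becomes a
# complex phase; the operator is Hermitian with a real spectrum.
import numpy as np

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)      # directed 3-cycle
q = 0.25                                    # illustrative charge parameter

A_sym = 0.5 * (A + A.T)                     # symmetrised weights
theta = 2 * np.pi * q * (A - A.T)           # antisymmetric phase matrix
H = A_sym * np.exp(1j * theta)              # Hermitian "magnetic" adjacency
L_mag = np.diag(A_sym.sum(axis=1)) - H

print(np.allclose(L_mag, L_mag.conj().T))   # True: Hermitian operator
print(np.linalg.eigvalsh(L_mag))            # real spectrum, direction-aware
```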
Notes on the name
arrowspace honors Kenneth Arrow's inquisitive, discussion-driven research ethos; "space" signals an added analytical layer that augments vector spaces with manifold-aware search and characterization.
References
- [1][2] Ulrike von Luxburg, A Tutorial on Spectral Clustering (graph Laplacians, similarity graphs, algorithms, and practical choices).
- [3][7] Belkin & Niyogi, Laplacian Eigenmaps (graph Laplacian, Laplace-Beltrami connection, manifold embeddings).
- [4] Coifman & Lafon, Diffusion Maps (Markov diffusion processes on graphs and multiscale geometry).
- [8][9] Dirichlet energy and Rayleigh quotient on graphs (energy-smoothness link and spectral characterization).
- [5][6] Johnson-Lindenstrauss lemma (distance-preserving dimension reduction).
- [11][12] Taylor-Aris dispersion vs diffusion (flow-induced dispersion mechanisms).
- [10] Wave equations and edge-based Laplacians on graphs (graph-domain wave dynamics).
- [13][14][15] DiGress: discrete diffusion for graph generation (graph generative diffusion).
- [19][20][21] GCPN: reinforcement learning for graph construction (goal-directed molecular graph generation).
- [16][17][18] Differentiable causal discovery and DAG learning (acyclicity-constrained structure learning).
- https://arxiv.org/abs/0711.0189
- https://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
- https://dl.acm.org/doi/10.1162/089976603321780317
- https://www.math.pku.edu.cn/teachers/yaoy/Fall2011/Lafon06.pdf
- https://www.math.toronto.edu/undergrad/projects-undergrad/Project03.pdf
- https://cs.stanford.edu/people/mmahoney/cs369m/Lectures/lecture1.pdf
- https://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering
- https://msp.org/pjm/2004/216-2/pjm-v216-n2-p03-p.pdf
- https://papers.neurips.cc/paper/8157-dags-with-no-tears-continuous-optimization-for-structure-learning.pdf
- https://causaldm.github.io/Causal-Decision-Making/2_Causal_Structure_Learning/Causal%20Discovery.html
- http://papers.neurips.cc/paper/7877-graph-convolutional-policy-network-for-goal-directed-molecular-graph-generation.pdf