Why arrowspace is game-changing for data operations at scale
TL;DR: The possibility of having a vector database, graph-search capabilities and a key-value store in the same harness can make the difference for data practitioners who need to face the challenge of dataset operations at AI scale. Undercurrent: designing the data system for a planetary research vessel.
arrowspace v0.22.0 is out with improvements and a new graph motifs API.
You can find arrowspace in the:
- Rust repository: cargo add arrowspace
- Python repository: pip install arrowspace
Intro
arrowspace is a database for vectors supported by a graph representation and a key-value store. The main targeted use-cases are: AI search capabilities such as advanced vector similarity, graph characterisation analysis and search, and indexing of high-dimensional vectors.
If you want to know the entire background to this story, please peek at the devlog.
What arrowspace is
arrowspace treats any dataset as a vector space with an added graph layered over vectors, enabling both geometric and topological operations in one index for search, matching, ranking, and dataset characterization at any lifecycle stage.
Two stabilised build paths, eigenmaps and pure energy maps, now provide spectral indices that respect manifold structure instead of just geometric projection, with compact runtime footprints and query-time work limited to light scalar terms plus small projected features.
arrowspace draws on the vector, graph, and key-value store paradigms by centering indexing on graph Laplacians constructed from data-similarity graphs, aligning with standard spectral methods used to preserve manifold structure for search, matching, ranking, and characterization. This approach follows well-documented pipelines in which a neighborhood graph over items is built from feature similarity, and a Laplacian operator provides the spectral coordinates and energies that capture both global topology and local geometry.
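To make that pipeline concrete, here is a minimal sketch of the standard construction using scikit-learn and SciPy: a k-NN similarity graph over items and its normalized Laplacian. It illustrates the operator this kind of indexing is built on; it is not the arrowspace API, and the Gaussian kernel bandwidth below is an illustrative assumption.

```python
# Illustrative sketch (not the arrowspace API): build a k-NN similarity
# graph over items and derive the normalized graph Laplacian.
import numpy as np
from scipy.sparse import csgraph
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # 200 items, 64 features

# k-NN adjacency weighted by a Gaussian kernel on the distances;
# the bandwidth (mean distance) is an arbitrary illustrative choice.
A = kneighbors_graph(X, n_neighbors=10, mode="distance")
A.data = np.exp(-(A.data ** 2) / (2 * A.data.mean() ** 2))
A = 0.5 * (A + A.T)                     # symmetrise the graph

# Normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}.
L = csgraph.laplacian(A, normed=True)
print(L.shape)                          # (200, 200), stays sparse
```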
Paradigm change 1. Any dataset is a graph
Vectors induce edges from item-space geometry while features induce topology; from these, arrowspace constructs a representative GraphLaplacian that supports reading, writing without index updates, matching, ranking, and cross-snapshot comparison of topology (global) and geometry (local) efficiently for evolving datasets.
A dataset can be cast as a similarity graph (ε-neighborhood, k-NN, or fully connected with kernels), enabling construction of a representative graph Laplacian whose eigenvectors and eigenvalues encode cluster structure and geometric relations for downstream tasks.[1][2] Spectral views such as Laplacian Eigenmaps and spectral clustering leverage this Laplacian to embed data in coordinates faithful to the manifold while supporting comparisons across snapshots through stable spectral features.[3][2] This Laplacian-centric view powers eigenmaps and energy maps so the index encodes structure directly, reducing cosine similarity to a tiebreak when desired and enabling robust comparisons across datasets and time.
Any vector space is automatically also a graph, which streamlines operations required by MLOps and AIOps teams in data-driven companies. The geometric relations are defined from the item space; the topological relations are defined from the feature space. This is enough to generate a representative GraphLaplacian for any dataset, at any point in the lifecycle, in the most efficient way given the capabilities' constraints. The feature space so generated by arrowspace works as a characterisation of the dataset, in the sense that it can be used to read (relatively fast for a prototype that includes topological structure), write (without index updating), match, rank, and compare the dataset with any other snapshot of itself or with other comparable datasets, in terms of topology at wider scale and geometry at smaller scale.
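Continuing the sketch above, the smallest nontrivial eigenvectors of such a Laplacian give manifold-aware coordinates in the style of Laplacian Eigenmaps. This is a hypothetical standalone example with SciPy, not arrowspace's internal routine:

```python
# Sketch: spectral embedding from the smallest Laplacian eigenpairs
# (Laplacian Eigenmaps style), via one sparse eigenproblem.
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
A = 0.5 * (A + A.T)
L = csgraph.laplacian(A, normed=True)

# k smallest eigenpairs of the sparse Laplacian.
vals, vecs = eigsh(L, k=8, which="SM")
embedding = vecs[:, 1:]   # drop the trivial constant eigenvector
print(vals[:4])           # low eigenvalues reflect cluster structure
```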
Paradigm change 2. High dimensions are the norm
Manifold methods such as Laplacian Eigenmaps and diffusion maps remain effective in high dimensions by exploiting locality and the connection of the graph Laplacian to the Laplace-Beltrami operator and heat equation, concentrating on the geometry of the graph rather than the raw ambient dimension.[4][3] When needed, dimensionality can be reduced with probabilistic guarantees using the Johnson-Lindenstrauss lemma, preserving pairwise distances within \(1 \pm \epsilon\) to keep search and analysis reliable at scale.[5][6]
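A quick illustration of the Johnson-Lindenstrauss guarantee: a Gaussian random projection squeezes a high-dimensional dataset into roughly O(log n / ε²) dimensions while keeping pairwise distances close to their originals. The constant factor in the target dimension below is one common choice, not a canonical value:

```python
# Sketch: JL random projection preserves pairwise distances within
# 1 ± eps with high probability, independent of the ambient dimension.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n, d, eps = 300, 5_000, 0.5
k = int(8 * np.log(n) / eps ** 2)         # target dimension (constant varies)

X = rng.normal(size=(n, d))               # high-dimensional data
R = rng.normal(size=(d, k)) / np.sqrt(k)  # Gaussian projection matrix
Y = X @ R                                 # projected data

ratios = pdist(Y) / pdist(X)
print(k, ratios.min(), ratios.max())      # ratios concentrate near 1
```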
Leveraging this approach on the features graph, plus the other elements of energy dispersion and taumode, makes arrowspace well suited for datasets in which dimensionality is the main concern; it allows meaningful analysis and search even for datasets in which dimensions overwhelmingly outnumber the items in the dataset (think of a bag-of-words corpus with a 100k vocabulary but 10 million dimensions, a genome sequence with 20k genes but millions of potential relations among those genes, or example prompts on a 70-million-weight model). This paradigm can be scaled adaptively to lower or higher dimensions to deliver good results on any vector space used in research and commercial applications. It is inspired by full-disk and partial spectrography: for example, analyzing Jupiter's entire disk produces aggregate albedo and phase-angle distributions, while partial observations of atmospheric regions yield localized composition signatures. Different spatial scales reveal different spectral characteristics, even within the same planetary system.
Practical consequences
Algorithmic consequences:
- Sparse similarity graphs and a single sparse eigenproblem yield compact spectral artifacts and efficient pipelines, making it feasible to manage many dataset snapshots without heavy index churn.[7][2]
- Spectral embeddings let teams project items or segments into low-dimensional coordinates for fast ranking, clustering, and cross-dataset comparison grounded in graph geometry.[4][3]
- Graph-induced embeddings and standard vector embeddings unify naturally by building the graph from vector similarities and then working in the same embedded space for retrieval and analysis.[3][2]
- A unified interface emerges because geometric proximity defines edges while topological structure is captured by the Laplacian spectrum, linking similarity, connectivity, and semantics in one operator.[2][3]
Consequences on a practical level for data practitioners, made possible by the arrowspace approach:
- manage data archives with >10k different datasets/snapshots with less infrastructural overhead, via a local machine holding the space of the datasets while the datasets live in their own enclaves
- embed anything, including datasets themselves; reliably turn any dataset into a single representative vector with F dimensions (see the sketch after this list)
- make graph embeddings and non-graph embeddings work together in the same space
- have a unified interface for the geometric and topological/semantic layers
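One way the "dataset as a single vector" idea could be realized is sketched below: summarize each dataset by the F smallest eigenvalues of its similarity-graph Laplacian, a spectral signature that is comparable across snapshots. This is a hypothetical construction for illustration, not the signature arrowspace itself computes:

```python
# Hypothetical sketch of "one representative vector per dataset":
# the F smallest Laplacian eigenvalues as a spectral signature.
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

def dataset_signature(X: np.ndarray, F: int = 16) -> np.ndarray:
    A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
    A = 0.5 * (A + A.T)
    L = csgraph.laplacian(A, normed=True)
    vals = eigsh(L, k=F, which="SM", return_eigenvectors=False)
    return np.sort(vals)                  # comparable across snapshots

rng = np.random.default_rng(2)
sig_a = dataset_signature(rng.normal(size=(300, 32)))
sig_b = dataset_signature(rng.normal(size=(310, 32)) + 0.1)
print(np.linalg.norm(sig_a - sig_b))      # distance between snapshots
```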
Who benefits
- Data science gains robust manifold-aware embeddings and clusters that often outperform purely geometric clustering, with efficient implementations via standard linear algebra on sparse Laplacians.[7][2]
- Explainable and interpretable AI can attribute behavior to spectral coordinates and energies: the Rayleigh quotient of the Laplacian equals the normalized Dirichlet energy, making smoothness/roughness explicit and auditable (see the sketch after this list).[8][9]
- MLOps and AIOps benefit from clear, reproducible choices of graph type, scaling, and normalization that materially affect outcomes; these elements are documented in the spectral clustering literature as key to stable results.[2]
- Research using diffusion or wave-like operators on graphs can leverage heat-equation and wave-equation formalisms on graph domains, enabling experiment designs that probe transport and interface effects.[10][4]
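The auditability claim above rests on a concrete identity: for a signal x on a graph with Laplacian L = D - W, the Rayleigh quotient x^T L x / x^T x equals the normalized Dirichlet energy, i.e. the weighted sum of squared differences along edges. A few lines of NumPy verify it:

```python
# Verify: Rayleigh quotient of L equals the normalized Dirichlet energy.
import numpy as np

# Tiny weighted graph: adjacency W, Laplacian L = D - W.
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W

x = np.array([0.2, -1.0, 0.7])            # a signal over the 3 nodes

rayleigh = (x @ L @ x) / (x @ x)
dirichlet = 0.5 * sum(W[i, j] * (x[i] - x[j]) ** 2
                      for i in range(3) for j in range(3)) / (x @ x)
print(np.isclose(rayleigh, dirichlet))    # True: smoothness is auditable
```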
Briefly, these give unprecedented capabilities to people working in:
- data science: any vector-space discovery is strengthened with spectral graph methods such as Laplacian Eigenmaps and the spectral clustering toolkit, which translate similarity graphs into low-dimensional, structure-preserving coordinates for retrieval, clustering, and curation at scale.
- explainable and interpretable AI: any vector space is characterised by its properties, allowing a connection to outcomes observed at inference time. This also enables data-drift control on an item-by-item addition basis thanks to taumode.
- MLOps and AIOps: reproducible and auditable pipelines are supported by well-documented sensitivities in spectral methods (graph type, scaling, normalization, and neighborhood size), all of which materially affect results and must be versioned alongside artifacts.
- edge research in in-memory processing and other fields that require simulating systems with dispersion/diffusion models; the next section covers the difference between dispersion and diffusion, how arrowspace uses diffusion-energy, and the possibility of simulating metamaterials with an extension of arrowspace to complex values.
Dispersion and diffusion
Definition
Diffusion on graphs corresponds to applying heat-like dynamics generated by the Laplacian, smoothing signals by iteratively reducing local differences along edges, in a manner captured by diffusion maps and related Markov processes.[4] By contrast, dispersion (in the physical sense) denotes apparent spreading due to flow-induced shear combined with molecular diffusion, as in Taylor-Aris dispersion, which enhances axial spreading even when molecular diffusivity is fixed.[11][12]
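A minimal sketch of that heat-like smoothing: explicit Euler steps of dx/dt = -Lx on a small path graph, where each step shrinks differences along edges while conserving total mass:

```python
# Sketch: graph diffusion as heat-like dynamics, x <- x - alpha * L x.
import numpy as np

W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # 4-node path graph
L = np.diag(W.sum(axis=1)) - W

x = np.array([1.0, 0.0, 0.0, 0.0])          # heat concentrated at one node
alpha = 0.2                                 # step size, chosen for stability
for _ in range(50):
    x = x - alpha * (L @ x)
print(x)   # mass spreads toward the uniform state; the sum is conserved
```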
How arrowspace uses diffusion-energy
In graph-learning practice, diffusion is used to propagate local geometric information and reveal multiscale cluster structure, while energy terms such as the Dirichlet energy and the Rayleigh quotient quantify the smoothness or roughness that can be used to regularize rankings and embeddings.[8][4] These spectral tools provide interpretable knobs that connect proximity, connectivity, and smoothness, supporting audit-friendly retrieval and characterization beyond cosine-only scoring.[8][2]
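As a toy illustration of scoring beyond cosine-only retrieval, one can blend a geometric term with a precomputed per-item energy (Rayleigh quotient) so that rougher signals are penalised. The blend weight lam and the scoring form are illustrative assumptions, not arrowspace parameters:

```python
# Toy sketch: rank by cosine similarity regularized by spectral energy.
import numpy as np

def score(query, item, item_energy, lam=0.3):
    # item_energy stands in for a precomputed Rayleigh quotient x^T L x / x^T x.
    cos = query @ item / (np.linalg.norm(query) * np.linalg.norm(item))
    return cos - lam * item_energy        # penalise rough (high-energy) items

rng = np.random.default_rng(3)
q = rng.normal(size=8)
items = rng.normal(size=(5, 8))
energies = rng.uniform(0, 1, size=5)      # hypothetical precomputed energies
ranked = sorted(range(5), key=lambda i: -score(q, items[i], energies[i]))
print(ranked)
```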
Future developments
Algorithmic:
- Coherent, explainable diffusion-based graph generation is a fast-moving area, with discrete denoising diffusion models such as DiGress showing strong performance on molecular and non-molecular graphs.[13][14][15]
- Embedding causal relations into graph generation can build on differentiable causal discovery for DAG structure learning, where acyclicity constraints are enforced in continuous optimization (e.g., NOTEARS and its successors).[16][17][18]
- Reinforcement learning can guide graph construction toward task rewards, as shown by Graph Convolutional Policy Networks that assemble molecular graphs under chemistry constraints while optimizing objectives.[19][20][21]
- Complex-valued or wave-equation operators on graphs enable modeling interface phenomena, since wave dynamics have established graph formulations via edge-based Laplacians and related operators.[10]
Potential future capabilities
- weighted features: provide masks to prioritise some features in the feature space, for example positions 0 and 1 when they hold the longitude and latitude of a spatial feature vector; realizable via learned Mahalanobis metrics and local metric learning, or by coupling masks with anisotropic diffusion kernels to bias smoothing along salient coordinates.
- with enough accumulated data it will probably be possible to run coherent diffusion generation of graphs that is explainable and controllable according to the source graphs; building on discrete denoising diffusion for graphs (e.g., DiGress) with spectral or topology-guided conditioning for controllability.
- consider embedding causal relations in graph generation; inject acyclicity and intervention priors from differentiable causal discovery (e.g., NOTEARS and smooth acyclicity) to bias edge orientation and generative rollouts.
- use reinforcement learning for generating the graph, as it is possible to learn how energy disperses along edges for classes of datasets; leverage goal-directed graph RL (e.g., GCPN) to learn edge additions guided by energy or spectral rewards aligned with downstream tasks.
- have an arrowspace-complex that works on complex numbers to allow Dirac-like dispersion for simulating interfaces and boundaries in novel materials; employ magnetic/complex Laplacians and graph wave dynamics to encode directionality and boundary conditions with phase information (see the sketch after this list).
- anisotropic diffusion operators for direction- and feature-aware smoothing: alternate closed-form diffusion with local directional filters so that propagation respects flow, orientation, or masked features on the manifold.
- magnetic Laplacian modes for asymmetric or temporal graphs: use complex-phase Laplacians to capture edge orientation and cyclic flows, improving embeddings and community detection in directed networks.
- topology-aware feature masking via metric learning: learn global or local Mahalanobis metrics to turn masks into data-driven edge weights that better reflect per-domain importance before Laplacian construction.
- spectral positional encodings for sequences and spatiotemporal data: incorporate magnetic-Laplacian-based positional phases to preserve order and direction in spectral coordinates for retrieval and generation.
- causal-aware diffusion priors: constrain diffusion steps or score fields with DAG structure to maintain identifiability of directions during graph synthesis and editing.
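As a pointer for the complex-valued directions above, here is a hedged sketch of a magnetic (complex-phase) Laplacian for a directed graph: edge direction is encoded as a phase, and the operator stays Hermitian so its spectrum remains real. The charge parameter q is an illustrative choice:

```python
# Sketch: magnetic Laplacian of a directed graph. Direction becomes a
# complex phase; the operator is Hermitian with a real spectrum.
import numpy as np

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)      # directed 3-cycle
q = 0.25                                    # illustrative charge parameter

A_sym = 0.5 * (A + A.T)                     # symmetrised weights
theta = 2 * np.pi * q * (A - A.T)           # antisymmetric phase matrix
H = A_sym * np.exp(1j * theta)              # Hermitian "magnetic" adjacency
L_mag = np.diag(A_sym.sum(axis=1)) - H

print(np.allclose(L_mag, L_mag.conj().T))   # True: Hermitian operator
print(np.linalg.eigvalsh(L_mag))            # real spectrum, direction-aware
```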
Notes on the name
arrowspace honors Kenneth Arrow's inquisitive, discussion-driven research ethos; "space" signals an added analytical layer that augments vector spaces with manifold-aware search and characterization.
References
- [1][2] Ulrike von Luxburg, A Tutorial on Spectral Clustering (graph Laplacians, similarity graphs, algorithms, and practical choices).
- [3][7] Belkin & Niyogi, Laplacian Eigenmaps (graph Laplacian, Laplace-Beltrami connection, manifold embeddings).
- [4] Coifman & Lafon, Diffusion Maps (Markov diffusion processes on graphs and multiscale geometry).
- [8][9] Dirichlet energy and Rayleigh quotient on graphs (energy-smoothness link and spectral characterization).
- [5][6] Johnson-Lindenstrauss lemma (distance-preserving dimension reduction).
- [11][12] Taylor-Aris dispersion vs diffusion (flow-induced dispersion mechanisms).
- [10] Wave equations and edge-based Laplacians on graphs (graph-domain wave dynamics).
- [13][14][15] DiGress: discrete diffusion for graph generation (graph generative diffusion).
- [19][20][21] GCPN: reinforcement learning for graph construction (goal-directed molecular graph generation).
- [16][17][18] Differentiable causal discovery and DAG learning (acyclicity-constrained structure learning).
- https://arxiv.org/abs/0711.0189
- https://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
- https://dl.acm.org/doi/10.1162/089976603321780317
- https://www.math.pku.edu.cn/teachers/yaoy/Fall2011/Lafon06.pdf
- https://www.math.toronto.edu/undergrad/projects-undergrad/Project03.pdf
- https://cs.stanford.edu/people/mmahoney/cs369m/Lectures/lecture1.pdf
- https://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering
- https://msp.org/pjm/2004/216-2/pjm-v216-n2-p03-p.pdf
- https://papers.neurips.cc/paper/8157-dags-with-no-tears-continuous-optimization-for-structure-learning.pdf
- https://causaldm.github.io/Causal-Decision-Making/2_Causal_Structure_Learning/Causal%20Discovery.html
- http://papers.neurips.cc/paper/7877-graph-convolutional-policy-network-for-goal-directed-molecular-graph-generation.pdf