ArrowSpace: Spectral Search For Embeddings and Graph Analysis
Field of application: Search-Rank of Vectors for Scientific Applications
Paper
Read the presentation paper PDF.
Abstract
arrowspace is a library that implements a novel spectral indexing approach for vector similarity search, combining traditional semantic similarity with graph-based spectral properties (`ArrowSpace` is its core data structure). The library introduces taumode (λτ, lambda-tau) indexing, which blends Rayleigh quotient smoothness energy from graph Laplacians with edge-wise dispersion statistics to create bounded, comparable spectral scores. This enables similarity search that considers both semantic content and spectral characteristics of high-dimensional vector datasets.
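The blend described in the abstract can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the library's API or its exact formula: the function names, the bounding map `E / (E + tau)`, the Herfindahl-style concentration used as the dispersion statistic, and the default values of `tau` and `alpha` are all assumptions made for the example. The graph is given as a list of weighted undirected edges `(i, j, w)` with non-negative weights, and `x` is a signal over its nodes.

```python
def edge_energies(x, edges):
    """Per-edge smoothness energy w * (x[i] - x[j])**2 for (i, j, w) edges."""
    return [w * (x[i] - x[j]) ** 2 for i, j, w in edges]


def rayleigh_quotient(x, edges):
    """x^T L x / x^T x, where L = D - W is the Laplacian of the edge list."""
    numerator = sum(edge_energies(x, edges))  # equals x^T L x
    denominator = sum(v * v for v in x)       # equals x^T x
    return numerator / denominator if denominator > 0.0 else 0.0


def edge_dispersion(x, edges):
    """Assumed dispersion statistic: sum of squared energy shares, in [0, 1]."""
    energies = edge_energies(x, edges)
    total = sum(energies)
    if total == 0.0:
        return 0.0
    return sum((e / total) ** 2 for e in energies)


def lambda_tau(x, edges, tau=1.0, alpha=0.7):
    """Hypothetical blend: alpha * bounded energy + (1 - alpha) * dispersion."""
    energy = rayleigh_quotient(x, edges)
    bounded = energy / (energy + tau)  # maps [0, inf) into [0, 1) for tau > 0
    return alpha * bounded + (1.0 - alpha) * edge_dispersion(x, edges)


# Path graph 0 - 1 - 2 with unit weights.
PATH = [(0, 1, 1.0), (1, 2, 1.0)]
smooth = lambda_tau([1.0, 1.0, 1.0], PATH)   # constant signal, no edge energy
rough = lambda_tau([1.0, 0.0, -1.0], PATH)   # signal that varies along the path
```

Because both components lie in [0, 1] and the blend is a convex combination, the resulting score stays in [0, 1] regardless of the scale of `x`, which is what makes scores comparable across items.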
Briefly
Existing vector database solutions are not fine-tuned to the domain they are applied to. Current solutions mostly target word embeddings and rely on standard distance metrics (L2/Euclidean distance, cosine and similar) or on hashing functions, which are not very cost-effective relative to the quality of their results. Spectral indexing allows vector search to be fine-tuned to the spectral signature of the domain the dataset belongs to, which enables finding associations between vectors that existing solutions do not spot (see the `compare_cosine` example in the repository). Imagine an LLM that can find contexts that are related but that the same LLM would miss using a traditional search: this opens the possibility of discovering alternative patterns for the same problem-solving activity, or even of finding meaningful connections that were previously ignored. This also simplifies the stack: a single index can synthesise the spectrum of the vectors, which brings advantages in index maintenance and in the interpretability and explainability of the dataset. These characteristics make spectral indexing a strong fit for medium-to-large datasets that need domain-specific precision in search (reference example: protein structure datasets, as in the `proteins_lookup` example in the repository). In more general terms, this approach can also help overcome the theoretical limitations of single-vector search, as highlighted by this paper.
Some characteristics:
- Spectral fusion: Blends Rayleigh‑quotient smoothness from graph Laplacians with edge‑wise dispersion, producing a single λτ score per item or neighborhood. Captures both content similarity and how well items fit the dataset’s structural manifold, improving retrieval of subtle, systematic patterns.
- Domain‑tuned retrieval: Fine‑tunes search to a dataset’s spectral signature, enabling discovery of associations that traditional metrics overlook in medium‑large, domain‑specific corpora. Demonstrates advantages on specialized datasets (e.g., protein structures) where semantics alone underperform.
- Bounded comparability: Produces bounded, comparable spectral scores, simplifying calibration across collections, time windows, or model updates. Supports stable re‑ranking and thresholding strategies for production pipelines.
- Simpler indexing stack: A single spectral index can synthesize structural information, reducing dependence on multiple bespoke indices or heavy hashing schemes. Lowers maintenance overhead while increasing interpretability and explainability.
- Better explainability: Graph Laplacian energy and dispersion components provide interpretable rationales for why items are retrieved together. Aids review workflows that require auditability and post‑hoc analysis.
- Practical uplift: Recovers related contexts that standard search misses, enabling alternative solution patterns and uncovering overlooked connections. Particularly effective when high precision is required under domain shift or specialized embeddings.
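As an illustration of how a bounded spectral score could feed into re-ranking, the sketch below combines cosine similarity with the proximity of two λτ values. It is a hypothetical example, not the library's scoring function: the `rerank` helper, the `weight` parameter, the proximity term `1 - |λq - λi|`, and the sample vectors and λτ values are all assumptions chosen for the demonstration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0.0 and norm_b > 0.0 else 0.0


def rerank(query_vec, query_lambda, items, weight=0.5):
    """Rank (id, vector, lambda) items by a blended score, best first.

    weight = 1.0 is pure cosine ranking; lower values give more influence
    to spectral proximity. Lambdas are assumed to lie in [0, 1].
    """
    scored = []
    for item_id, vec, lam in items:
        spectral = 1.0 - abs(query_lambda - lam)  # 1.0 when lambdas match
        score = weight * cosine(query_vec, vec) + (1.0 - weight) * spectral
        scored.append((item_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up items: "a" is closest by cosine, "b" is closest by lambda.
ITEMS = [("a", [1.0, 0.0], 0.9), ("b", [0.8, 0.6], 0.2)]
by_cosine = [i for i, _ in rerank([1.0, 0.0], 0.2, ITEMS, weight=1.0)]
blended = [i for i, _ in rerank([1.0, 0.0], 0.2, ITEMS, weight=0.5)]
```

With these sample values, pure cosine ranking puts item "a" first, while the blended score promotes item "b", whose spectral score matches the query's. Since both terms are bounded, a fixed threshold on the blended score remains meaningful across collections.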
Read the full paper PDF.
DOI:10.36227/techrxiv.175751921.18542359/v1
Implementation
Explore ArrowSpace for spectral vector search.
Unlock powerful spectral search for your vector space.