Research Papers in Vector Search & AI Systems
Field of application: Vector Databases, Spectral Methods, and Agentic AI
A curated collection of foundational and emerging papers that inform the design and implementation of arrowspace, optical compression, and next-generation retrieval systems.
Graph-based Vector Search
Paper: Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art
Authors: Ilias Azizi, Karima Echihabi, Themis Palpanas, Vassilis Christophides.â Link: https://openreview.net/pdf?id=ALnwJjieOiâ
Comprehensive experimental evaluation of twelve stateâofâtheâart graphâbased ANN vector search methods on up to billionâscale datasets; identification of incremental insertion plus neighborhood diversification as the most effective current paradigm; analysis of how baseâgraph design impacts scalability and search quality; and articulation of open directions, especially more dataâadaptive seed selection and diversification strategies for future vector search systems
Key contributions: the paper highlights limitations: qualityâlatency tradeoffs, sensitivity to choice of base graph, lack of sophisticated dataâadaptive diversification, and open needs for more adaptive, structureâaware strategies to build and traverse graphs.
What arrowspace can bring to the table
- Spectral reâranking layer:
arrowspacecan take any of these graph indices as a base and add a perâvector spectral score (taumode) derived from a Laplacian over the dataset, allowing reâranking of graphâreturned candidates by both cosine distance and spectral smoothness, which targets systemically relevant but geometrically âoutlierâ neighbours the surveyed methods may miss.ââ - Dataâadaptive diversification: Instead of purely local degree or heuristic pruning, diversification can be guided by differences in spectral scores and eigenstructure, encouraging candidate sets that span distinct spectral modes of the dataset manifold, directly addressing the paperâs call for more dataâadaptive diversification strategies.ââ
- Better seed selection: Seed nodes for graph search can be chosen using spectral signatures (e.g., lowâenergy, âcentralâ signals or highâenergy, âboundaryâ signals), making initial entry points more robust to embedding drifts and cluster imbalances than purely metricâbased seeds discussed in the evaluation.â
- Multiâvector and subvector search: ArrowSpaceâs ability to index subvectors and aggregate spectral scores per item gives a principled route to multiâvector queries on top of the existing graph methods, which the evaluated algorithms largely treat as singleâquery problems.
Paper: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces
Authors: Roger Weber, Hans-J. Schek, Stephen Blott
Foundational analysis of similarity search in high-dimensional vector spaces, proving that tree- and partition-based indexes (R*-tree, X-tree, etc.) degrade to linear scan as dimensionality grows and introducing the VA-file as an approximation-based layout that makes this unavoidable scan as efficient as possible. This connects to arrowspace by precisely characterizing the âdimensionality curseâ that spectral indexing aims to escape, suggesting that instead of fighting geometric sparsity with ever more complex trees, one should change the representation (e.g., spectral graph + taumode) so that similarity is no longer driven solely by fragile highâd metric properties.â
Key contributions: Formal cost model and lower bounds for highâdimensional nearest neighbor search, empirical demonstration that classic multidimensional indexes are outperformed by sequential scan beyond ~10â20 dimensions, and proposal of the VAâfile as a practical, approximationâbased alternative for similarity search in HDVSs.
What arrowspace can bring to the table
arrowspace complements this paper by attacking the dimensionality curse at the signal and graph level instead of only in the raw metric space, so it can keep using simple approximate indexes (including VAâlike layouts) while prioritizing candidates that are structurally meaningful in very high dimensions.
Spectral rather than purely geometric structure: arrowspace first builds a graph Laplacian over items (using approximate kâNN plus random projection if needed), then computes perâfeature Rayleigh energies that reflect how smoothly each signal varies over that graph, capturing a lowâdimensional spectral structure even when the original embedding dimension is large.â
How arrowspace sidesteps the curse of dimensionality
- Spectral rather than purely geometric structure:
arrowspacefirst builds a graph Laplacian over items (using approximate kâNN plus random projection if needed), then computes perâfeature Rayleigh energies that reflect how smoothly each signal varies over that graph, capturing a lowâdimensional spectral structure even when the original embedding dimension is large. - Bounded taumode score as a second axis: These Rayleigh energies are transformed into a bounded, comparable scalar \(\lambda_\tau\) per feature (taumode), and perâitem scores can be aggregated, so search is performed in a joint space âcosine distance Ă spectral roughnessâ instead of only Euclidean/cosine distance in \(d\) dimensions, which reduces reliance on fragile highâdimensional metric properties highlighted by Weber et al.
- Compatible with VAâstyle layouts: Because
arrowspacestores dense arrays plus a single extra scalar per row and does not require storing the full graph at query time, it can be overlaid on VAâfileâlike or HNSWâlike approximate indices: the coarse approximate index prunes by geometry, whilearrowspacereâranks and filters by spectral score, effectively using high \(d\) only once at indexâbuild time rather than repeatedly at query time. - Targeting nonâuniform, structured data: The WeberâSchekâBlott analysis assumes uniform, independent dimensions;
arrowspaceexplicitly exploits nonâuniformity (clusters and manifolds) via the Laplacian, so the âeverything degenerates to a scanâ result no longer applies in the same way for realâworld structured datasets where spectral energies concentrate in a few modes.
arrowspace improvements vs. VAâfile and trees
- Where VAâfile speeds up the inevitable scan in high \(d\),
arrowspacereduces how much of the dataset is relevant by focusing on spectrally consistent neighbours, so for a fixed approximate index (tree, graph, or VAâstyle), fewer candidates need full scoring to reach useful recall. - Tree and partition indexes degrade because their bounding regions expand in effective volume at high \(d\);
arrowspaceinstead builds its main discriminative signal (taumode) from a sparse graph on items, whose complexity depends on the number of items and local kâNN degrees, not directly on the original feature dimension after optional random projection. - In practice, this means: keep simple approximate indexing (including VAâlike) for coarse filtering, and let ArrowSpaceâs spectral index handle fineâgrained selection and ranking, turning the âcurseâ into a manageable oneâoff preprocessing cost rather than a perâquery explosion.
Graph Embeddings
Ontology Embedding: A Survey of Methods, Applications and Resources
Authors: Jiaoyan Chen, Olga Mashkova, Ernesto Jiménez-Ruiz, Ian Horrocks, Diego M. López, Przemyslaw Andrzej Nowak, et al. arXiv: 2406.10964 (2024), accepted to IEEE TKDE
Comprehensive survey of ontology embedding, covering formal definitions, method categories, resources, and applications across ontology engineering, machine learning augmentation, and life sciences, consolidating works from AI and bioinformatics venues. This connects to arrowspace by framing how logical structure can be embedded into vector spaces to complement spectral and graph-based similarity in hybrid search.
Key contributions: Taxonomy of ontology embedding approaches, resource catalog, application landscape, and challenges/future directions in integrating symbolic semantics with embeddings.
The RDF2vec Family of Knowledge Graph Embedding Methods
Authors: Petar Ristoski, Simone Paolo Ponzetto, Heiko Paulheim (and collaborators across variants) Journal: Semantic Web â Interoperability, Usability, Applicability (SWJ), âThe RDF2vec Family of Knowledge Graph Embedding Methodsâ
In-depth study of RDF2vec variants that generate embeddings from random walks over RDF graphs, with a comprehensive evaluation revealing representational strengths and weaknesses relative to other KG embedding methods. For arrowspace, this informs walk-based feature extraction that can be blended with Laplacian-based distances for structure- and path-aware retrieval.
Key contributions: Unified overview of RDF2vec techniques, large-scale comparative evaluation, and practical guidance for selecting variants by task characteristics.
Graph Signal Processing & Spectral Methods
The Emerging Field of Signal Processing on Graphs
Authors: David I Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, Pierre Vandergheynst
arXiv: 1211.0053 (2012)
Foundational work extending classical signal processing operations to graph-structured data. This paper establishes the theoretical framework for spectral graph analysis used in arrowspaceâs energy-distance metrics and Laplacian-based search.
Key contributions: Graph Fourier Transform, spectral filtering, and multi-resolution analysis on irregular graph domains.
What Is Positive Geometry?
Authors: Kristian Ranestad, Bernd Sturmfels, Simon Telen
arXiv: 2502.12815 (2025)
Foundational introduction to positive geometryâan interdisciplinary field bridging particle physics, cosmology, and algebraic geometry. Positive geometries are tuples \((X, X_{\geq 0}, \Omega(X_{\geq 0}))\) consisting of a complex algebraic variety, a semi-algebraic positive region, and a canonical differential form satisfying recursive axioms. The framework represents physical observables (scattering amplitudes, cosmological correlators) as geometric structures like amplituhedra and cosmological polytopes.
Relevance to arrowspace: The canonical form constructionârecovering volume integrals from positive regions via \(\Omega(P) = \text{vol}(P-x)^\circ dx\)âdirectly parallels arrowspaceâs energy map pipeline. Just as positive geometry âlinearizesâ high-dimensional semi-algebraic varieties into canonical differential forms, arrowspaceâs energymaps.rs constructs a graph Laplacian over the data manifold and projects it onto a 1-dimensional taumode spectrum (Rayleigh quotients). Both frameworks encode complex geometric structures (amplituhedra / energy graphs) as scalar fields that preserve topological invariants while enabling efficient computation.
Key contributions:
- Formal definition of positive geometries with recursive boundary factorization and canonical forms
- Connection between convex polytopes, Grassmannian amplituhedra, and universal barrier functions in optimization
- Integration of real, complex, and tropical algebraic geometry for computing scattering amplitudes and cosmological correlators
Agentic Systems & Planning
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
Authors: ZeroRepo Team
arXiv: 2509.16198 (2025)
Introduces the Repository Planning Graph (RPG), a graph-driven framework for generating complete software repositories. Relevant to formal agent protocols and structured generation workflows.
Key contributions: Persistent graph representations unifying proposal- and implementation-level planning for autonomous code generation.
Retrieval Fundamentals & Limitations
On the Theoretical Limitations of Embedding-Based Retrieval
Authors: Orion Weller et al. (Google DeepMind)
arXiv: 2508.21038 (2025)
Theoretical analysis proving fundamental limitations of single-vector embeddings for complex retrieval tasks. Introduces the LIMIT benchmark to expose failure modes in cosine-similarity-based retrieval.
Key contributions: Sign-rank bounds on embedding expressiveness, motivating energy-distance and spectral approaches beyond cosine similarity.
Document Reranking
jina-reranker-v3: Last but Not Late Interaction for Document Reranking
Authors: Feng Wang, Yuqing Li, Han Xiao
arXiv: 2509.25085v2 (2025)
State-of-the-art 0.6B parameter multilingual document reranker achieving 61.94 nDCG@10 on BEIR. Demonstrates lightweight alternatives to generative listwise reranking.
Key contributions: Late-interaction architecture for efficient cross-encoder reranking with strong BEIR performance.
Context Compression & Recursive Models
Recursive Language Models
Authors: Alex Zhang, Omar Khattab (MIT CSAIL)
Paper: alexzhang13.github.io/blog/2025/rlm
Proposes Recursive Language Models (RLMs), where models recursively call themselves to decompose and interact with unbounded context. RLM with GPT-4-mini outperforms full GPT-4 by 87% on long-context benchmarks.
Key contributions: Divide-and-conquer strategy for handling 10M+ token contexts without performance degradation, mitigating âcontext rot.â
Quantum Computing & Hybrid Systems
Mind the Gaps: The Fraught Road to Quantum Advantage
Authors: Jens Eisert, John Preskill
arXiv: 2510.19928 (2025)
Perspectives on the transition from noisy intermediate-scale quantum (NISQ) devices to fault-tolerant application-scale quantum computing. Identifies four key hurdles including error mitigation, scalable fault tolerance, and verifiable algorithms.
Relevance: Explores hybrid classical-quantum systems for optimization and simulation tasks relevant to graph algorithms and energy minimization.
Computational Methods in Software Engineering
Vulnerability2Vec: A Graph-Embedding Approach for Enhancing Vulnerability Classification
Source: Tech Science Press - CMES 2025
Vulnerability2Vec converts Common Vulnerabilities and Exposures (CVE) text explanations to semantic graphs.
Key contributions: Security vulnerability; graph representation; graph-embedding; deep learning; node classification.
Implementation Resources
For practical implementations informed by these papers:
-
arrowspace: Spectral vector database with energy-informed search
GitHub | PyPI | crates.io -
BMPP Agents: Formal protocol for AI agent workflows
Implementation page -
Optical Embeddings: DeepSeek-OCR compression in Rust
Blog post
Interested in research collaboration or sponsorship? Check the Contact page to discuss how these methods can accelerate your data infrastructure.