Research Papers in Vector Search & AI Systems

Field of application: Vector Databases, Spectral Methods, and Agentic AI

A curated collection of foundational and emerging papers that inform the design and implementation of arrowspace, optical compression, and next-generation retrieval systems.


Paper: Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art

Authors: Ilias Azizi, Karima Echihabi, Themis Palpanas, Vassilis Christophides. Link: https://openreview.net/pdf?id=ALnwJjieOi

Comprehensive experimental evaluation of twelve state-of-the-art graph-based ANN vector search methods on datasets of up to a billion vectors. The study identifies incremental insertion combined with neighborhood diversification as the most effective current paradigm, analyzes how base-graph design affects scalability and search quality, and articulates open directions, in particular more data-adaptive seed selection and diversification strategies for future vector search systems.

Key contributions: the paper also highlights current limitations, including quality-latency tradeoffs, sensitivity to the choice of base graph, the absence of sophisticated data-adaptive diversification, and the open need for more adaptive, structure-aware strategies for building and traversing graphs.
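
To make the diversification idea concrete, here is a minimal sketch of a relative-neighborhood-style pruning rule in the spirit of the heuristics the survey groups under neighborhood diversification; it is an illustrative simplification, not the exact rule of any surveyed system.

```rust
/// Minimal sketch of neighborhood diversification: a candidate neighbor of `p`
/// is kept only if no already-selected neighbor is closer to it than `p` is.
/// Illustrative simplification, not the exact rule of any one system.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

fn diversify(p: &[f64], mut candidates: Vec<Vec<f64>>, max_degree: usize) -> Vec<Vec<f64>> {
    // Consider candidates in order of increasing distance to `p`.
    candidates.sort_by(|a, b| euclidean(p, a).partial_cmp(&euclidean(p, b)).unwrap());
    let mut selected: Vec<Vec<f64>> = Vec::new();
    for c in candidates {
        if selected.len() >= max_degree {
            break;
        }
        let d_pc = euclidean(p, &c);
        // Prune `c` if some already-selected neighbor "occludes" it.
        let occluded = selected.iter().any(|s| euclidean(s, &c) < d_pc);
        if !occluded {
            selected.push(c);
        }
    }
    selected
}

fn main() {
    let p = vec![0.0, 0.0];
    let candidates = vec![vec![1.0, 0.0], vec![1.1, 0.1], vec![0.0, 1.0], vec![-1.0, 0.0]];
    let neighbors = diversify(&p, candidates, 3);
    println!("kept {} diversified neighbors", neighbors.len());
}
```

The practical effect is that edges are spread across directions around each point instead of clustering inside one tight neighborhood, which is what gives diversified graphs their better navigability.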

What arrowspace can bring to the table

Paper: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

Authors: Roger Weber, Hans-J. Schek, Stephen Blott

Foundational analysis of similarity search in high-dimensional vector spaces, proving that tree- and partition-based indexes (R*-tree, X-tree, etc.) degrade to linear scan as dimensionality grows and introducing the VA-file as an approximation-based layout that makes this unavoidable scan as efficient as possible. This connects to arrowspace by precisely characterizing the “dimensionality curse” that spectral indexing aims to escape, suggesting that instead of fighting geometric sparsity with ever more complex trees, one should change the representation (e.g., spectral graph + taumode) so that similarity is no longer driven solely by fragile high-d metric properties.

Key contributions: Formal cost model and lower bounds for high-dimensional nearest-neighbor search, empirical demonstration that classic multidimensional indexes are outperformed by sequential scan beyond ~10–20 dimensions, and proposal of the VA-file as a practical, approximation-based alternative for similarity search in high-dimensional vector spaces (HDVSs).
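
To illustrate the VA-file idea in miniature (a simplified sketch, not the paper's exact layout or bit encoding): each dimension is quantized into a small number of cells, the compact approximations are scanned sequentially, and cell-based lower bounds on the distance decide which full-precision vectors are worth fetching.

```rust
/// Illustrative sketch of a VA-file-style layout: quantize each dimension into
/// 2^BITS cells, then scan the compact approximations and compute a cell-based
/// lower bound on distance before touching any full-precision vector.
const BITS: u32 = 4; // 16 cells per dimension

fn quantize(v: &[f64], min: f64, max: f64) -> Vec<u8> {
    let cells = (1u32 << BITS) as f64;
    v.iter()
        .map(|&x| {
            let t = ((x - min) / (max - min)).clamp(0.0, 1.0 - f64::EPSILON);
            (t * cells) as u8
        })
        .collect()
}

/// Lower bound on squared distance between the query and any vector whose
/// approximation falls in the given cells (distance to the nearest cell edge).
fn lower_bound_sq(query: &[f64], approx: &[u8], min: f64, max: f64) -> f64 {
    let width = (max - min) / (1u32 << BITS) as f64;
    query
        .iter()
        .zip(approx)
        .map(|(&q, &cell)| {
            let lo = min + cell as f64 * width;
            let hi = lo + width;
            let d = if q < lo { lo - q } else if q > hi { q - hi } else { 0.0 };
            d * d
        })
        .sum()
}

fn main() {
    let (min, max) = (0.0, 1.0);
    let data = vec![vec![0.1, 0.9], vec![0.8, 0.2]];
    let approx: Vec<Vec<u8>> = data.iter().map(|v| quantize(v, min, max)).collect();
    let query = vec![0.15, 0.85];
    for (i, a) in approx.iter().enumerate() {
        println!("vector {} lower bound = {:.4}", i, lower_bound_sq(&query, a, min, max));
    }
}
```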

What arrowspace can bring to the table

arrowspace complements this paper by attacking the dimensionality curse at the signal and graph level instead of only in the raw metric space, so it can keep using simple approximate indexes (including VA‑like layouts) while prioritizing candidates that are structurally meaningful in very high dimensions.

Spectral rather than purely geometric structure: arrowspace first builds a graph Laplacian over items (using approximate k-NN plus random projection if needed), then computes per-feature Rayleigh energies that reflect how smoothly each signal varies over that graph, capturing a low-dimensional spectral structure even when the original embedding dimension is large.
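
The Rayleigh energy mentioned above reduces to a simple ratio that can be computed directly from an edge list. Below is a minimal sketch assuming a precomputed weighted k-NN edge list; it illustrates the quantity, not the actual code in energymaps.rs.

```rust
/// Minimal sketch: Rayleigh energy of one feature (a column of the embedding
/// matrix) over an item graph, computed from an edge list without forming the
/// Laplacian explicitly. Low values mean the feature varies smoothly over the
/// graph; high values mean it oscillates across edges.
fn rayleigh_energy(signal: &[f64], edges: &[(usize, usize, f64)]) -> f64 {
    // x^T L x = sum over undirected edges of w_ij * (x_i - x_j)^2.
    let numerator: f64 = edges
        .iter()
        .map(|&(i, j, w)| w * (signal[i] - signal[j]).powi(2))
        .sum();
    // x^T x, guarded so an all-zero signal does not divide by zero.
    let denominator: f64 = signal.iter().map(|x| x * x).sum::<f64>().max(f64::MIN_POSITIVE);
    numerator / denominator
}

fn main() {
    // Four items on a path graph 0 - 1 - 2 - 3 with unit edge weights.
    let edges = vec![(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)];
    let smooth = vec![1.0, 1.1, 1.2, 1.3]; // varies slowly over the graph
    let rough = vec![1.0, -1.0, 1.0, -1.0]; // alternates across every edge
    println!("smooth feature energy = {:.4}", rayleigh_energy(&smooth, &edges));
    println!("rough  feature energy = {:.4}", rayleigh_energy(&rough, &edges));
}
```

Because the numerator only sums over graph edges, the cost is linear in the number of edges regardless of how large the original embedding dimension is.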

How arrowspace sidesteps the curse of dimensionality

arrowspace improvements vs. VA‑file and trees


Graph Embeddings

Ontology Embedding: A Survey of Methods, Applications and Resources

Authors: Jiaoyan Chen, Olga Mashkova, Ernesto Jiménez-Ruiz, Ian Horrocks, Diego M. López, Przemyslaw Andrzej Nowak, et al.
arXiv: 2406.10964 (2024), accepted to IEEE TKDE

Comprehensive survey of ontology embedding, covering formal definitions, method categories, resources, and applications across ontology engineering, machine learning augmentation, and life sciences, consolidating works from AI and bioinformatics venues. This connects to arrowspace by framing how logical structure can be embedded into vector spaces to complement spectral and graph-based similarity in hybrid search.

Key contributions: Taxonomy of ontology embedding approaches, resource catalog, application landscape, and challenges/future directions in integrating symbolic semantics with embeddings.

The RDF2vec Family of Knowledge Graph Embedding Methods

Authors: Petar Ristoski, Simone Paolo Ponzetto, Heiko Paulheim (and collaborators across variants)
Journal: Semantic Web – Interoperability, Usability, Applicability (SWJ)

In-depth study of RDF2vec variants that generate embeddings from random walks over RDF graphs, with a comprehensive evaluation revealing representational strengths and weaknesses relative to other KG embedding methods. For arrowspace, this informs walk-based feature extraction that can be blended with Laplacian-based distances for structure- and path-aware retrieval.

Key contributions: Unified overview of RDF2vec techniques, large-scale comparative evaluation, and practical guidance for selecting variants by task characteristics.
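
As a concrete illustration of the walk-based feature extraction this family relies on, here is a minimal sketch that turns random walks over an edge-labeled RDF graph into token sequences; a word2vec-style model (not shown) would then be trained on these sequences, and the entity names used are only illustrative.

```rust
use std::collections::HashMap;

/// Illustrative sketch of RDF2vec-style walk extraction: from a start node,
/// follow random outgoing (predicate, object) edges and record the visited
/// tokens as a "sentence". The skip-gram training step is omitted.
fn random_walk(
    graph: &HashMap<&str, Vec<(&str, &str)>>, // subject -> [(predicate, object)]
    start: &str,
    depth: usize,
    seed: &mut u64,
) -> Vec<String> {
    let mut walk = vec![start.to_string()];
    let mut current = start;
    for _ in 0..depth {
        let Some(edges) = graph.get(current) else { break };
        if edges.is_empty() {
            break;
        }
        // Tiny deterministic linear-congruential step instead of an RNG crate.
        *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1);
        let (predicate, object) = edges[(*seed as usize) % edges.len()];
        walk.push(predicate.to_string());
        walk.push(object.to_string());
        current = object;
    }
    walk
}

fn main() {
    let mut graph: HashMap<&str, Vec<(&str, &str)>> = HashMap::new();
    graph.insert("dbr:Mannheim", vec![("dbo:country", "dbr:Germany")]);
    graph.insert("dbr:Germany", vec![("dbo:capital", "dbr:Berlin")]);
    let mut seed = 42u64;
    let sentence = random_walk(&graph, "dbr:Mannheim", 2, &mut seed);
    println!("{}", sentence.join(" "));
}
```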


Graph Signal Processing & Spectral Methods

The Emerging Field of Signal Processing on Graphs

Authors: David I Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, Pierre Vandergheynst
arXiv: 1211.0053 (2012)

Foundational work extending classical signal processing operations to graph-structured data. This paper establishes the theoretical framework for spectral graph analysis used in arrowspace’s energy-distance metrics and Laplacian-based search.

Key contributions: Graph Fourier Transform, spectral filtering, and multi-resolution analysis on irregular graph domains.
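
As a small illustration of the framework (not code from the paper), the sketch below applies the polynomial filter \(h(L) = (I - \alpha L)^k\) to a graph signal; in the graph Fourier domain this multiplies the component at graph frequency \(\lambda\) by \((1 - \alpha\lambda)^k\), so high frequencies are attenuated without ever forming the eigendecomposition.

```rust
/// Illustrative sketch of spectral low-pass filtering on a graph via the
/// polynomial filter h(L) = (I - alpha * L)^k, applied edge by edge.
fn laplacian_apply(signal: &[f64], edges: &[(usize, usize, f64)]) -> Vec<f64> {
    // (L x)_i = sum_j w_ij * (x_i - x_j) for an undirected weighted graph.
    let mut out = vec![0.0; signal.len()];
    for &(i, j, w) in edges {
        out[i] += w * (signal[i] - signal[j]);
        out[j] += w * (signal[j] - signal[i]);
    }
    out
}

fn low_pass(mut signal: Vec<f64>, edges: &[(usize, usize, f64)], alpha: f64, steps: usize) -> Vec<f64> {
    for _ in 0..steps {
        let lx = laplacian_apply(&signal, edges);
        for (x, l) in signal.iter_mut().zip(lx) {
            *x -= alpha * l; // x <- (I - alpha * L) x
        }
    }
    signal
}

fn main() {
    // Path graph 0 - 1 - 2 - 3 carrying a noisy signal.
    let edges = vec![(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)];
    let noisy = vec![1.0, 0.2, 0.9, 0.1];
    let smoothed = low_pass(noisy, &edges, 0.25, 10);
    println!("{:?}", smoothed);
}
```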

What Is Positive Geometry?

Authors: Kristian Ranestad, Bernd Sturmfels, Simon Telen
arXiv: 2502.12815 (2025)

Foundational introduction to positive geometry—an interdisciplinary field bridging particle physics, cosmology, and algebraic geometry. Positive geometries are tuples \((X, X_{\geq 0}, \Omega(X_{\geq 0}))\) consisting of a complex algebraic variety, a semi-algebraic positive region, and a canonical differential form satisfying recursive axioms. The framework represents physical observables (scattering amplitudes, cosmological correlators) as geometric structures like amplituhedra and cosmological polytopes.

Relevance to arrowspace: The canonical form construction—recovering volume integrals from positive regions via \(\Omega(P) = \mathrm{vol}\big((P - x)^\circ\big)\,dx\)—directly parallels arrowspace’s energy map pipeline. Just as positive geometry “linearizes” high-dimensional semi-algebraic varieties into canonical differential forms, arrowspace’s energymaps.rs constructs a graph Laplacian over the data manifold and projects it onto a 1-dimensional taumode spectrum (Rayleigh quotients). Both frameworks encode complex geometric structures (amplituhedra / energy graphs) as scalar fields that preserve topological invariants while enabling efficient computation.
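
As a one-dimensional orientation example (a standard textbook case, included for illustration rather than quoted from this paper): for the segment \(P = [a, b] \subset \mathbb{R}\), the canonical form has simple poles exactly on the boundary with residues \(\pm 1\) and agrees with the dual-volume formula above:

\[
\Omega([a,b]) \;=\; \left(\frac{1}{x - a} - \frac{1}{x - b}\right)dx \;=\; \frac{b - a}{(x - a)(b - x)}\,dx \;=\; \mathrm{vol}\big(([a,b] - x)^\circ\big)\,dx .
\]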

Key contributions:

Full PDF


Agentic Systems & Planning

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Authors: ZeroRepo Team
arXiv: 2509.16198 (2025)

Introduces the Repository Planning Graph (RPG), a graph-driven framework for generating complete software repositories. Relevant to formal agent protocols and structured generation workflows.

Key contributions: Persistent graph representations unifying proposal- and implementation-level planning for autonomous code generation.
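
To give a feel for what a persistent planning graph can look like in code, here is a hypothetical sketch; the types, field names, and readiness rule are illustrative assumptions, not the RPG paper's schema. Each node carries a proposal-level description plus optional implementation-level detail, and dependency edges let an agent pick the next implementable nodes.

```rust
/// Hypothetical sketch of a persistent planning-graph node, loosely inspired
/// by the idea of unifying proposal-level and implementation-level planning.
#[derive(Debug)]
struct PlanNode {
    id: usize,
    proposal: String,               // high-level capability to build
    implementation: Option<String>, // concrete plan once the node is refined
    depends_on: Vec<usize>,         // edges to prerequisite nodes
}

#[derive(Debug, Default)]
struct PlanGraph {
    nodes: Vec<PlanNode>,
}

impl PlanGraph {
    fn add(&mut self, proposal: &str, depends_on: Vec<usize>) -> usize {
        let id = self.nodes.len();
        self.nodes.push(PlanNode {
            id,
            proposal: proposal.to_string(),
            implementation: None,
            depends_on,
        });
        id
    }

    /// Nodes whose prerequisites are all implemented: the next work front.
    fn ready(&self) -> Vec<usize> {
        self.nodes
            .iter()
            .filter(|n| n.implementation.is_none())
            .filter(|n| n.depends_on.iter().all(|&d| self.nodes[d].implementation.is_some()))
            .map(|n| n.id)
            .collect()
    }
}

fn main() {
    let mut graph = PlanGraph::default();
    let storage = graph.add("storage layer", vec![]);
    let _search = graph.add("search API", vec![storage]);
    println!("ready to implement next: {:?}", graph.ready());
}
```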


Retrieval Fundamentals & Limitations

On the Theoretical Limitations of Embedding-Based Retrieval

Authors: Orion Weller et al. (Google DeepMind)
arXiv: 2508.21038 (2025)

Theoretical analysis proving fundamental limitations of single-vector embeddings for complex retrieval tasks. Introduces the LIMIT benchmark to expose failure modes in cosine-similarity-based retrieval.

Key contributions: Sign-rank bounds on embedding expressiveness, motivating energy-distance and spectral approaches beyond cosine similarity.
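
To convey the flavor of such bounds in the standard sign-rank formulation (stated from general definitions, not quoted from the paper): write the relevance pattern as a matrix \(M \in \{-1, +1\}^{|Q| \times |D|}\); then

\[
\operatorname{rank}_{\pm}(M) \;=\; \min\Big\{ r \;:\; \exists\, u_q, v_d \in \mathbb{R}^{r} \text{ with } \operatorname{sign}\big(\langle u_q, v_d \rangle\big) = M_{qd} \ \ \forall q, d \Big\},
\]

so any relevance pattern with \(\operatorname{rank}_{\pm}(M) > d\) cannot be reproduced exactly by sign-thresholded inner products of \(d\)-dimensional single-vector embeddings, no matter how the encoder is trained.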

Paper PDF


Document Reranking

jina-reranker-v3: Last but Not Late Interaction for Document Reranking

Authors: Feng Wang, Yuqing Li, Han Xiao
arXiv: 2509.25085v2 (2025)

State-of-the-art 0.6B parameter multilingual document reranker achieving 61.94 nDCG@10 on BEIR. Demonstrates lightweight alternatives to generative listwise reranking.

Key contributions: the “last but not late” interaction design named in the title, enabling efficient document reranking with strong BEIR performance from a 0.6B-parameter model.


Context Compression & Recursive Models

Recursive Language Models

Authors: Alex Zhang, Omar Khattab (MIT CSAIL)
Paper: alexzhang13.github.io/blog/2025/rlm

Proposes Recursive Language Models (RLMs), where models recursively call themselves to decompose and interact with unbounded context. RLM with GPT-4-mini outperforms full GPT-4 by 87% on long-context benchmarks.

Key contributions: Divide-and-conquer strategy for handling 10M+ token contexts without performance degradation, mitigating “context rot.”
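
A minimal sketch of the divide-and-conquer call pattern follows, under loud assumptions: answer_with_model is a hypothetical stand-in for a language-model call, the chunking is naive, and the real RLM design interacts with its context programmatically rather than by fixed-size splitting.

```rust
/// Illustrative recursive divide-and-conquer over an oversized context.
/// `answer_with_model` is a hypothetical placeholder for an LLM call.
const CHUNK_CHARS: usize = 2_000;

fn answer_with_model(question: &str, context: &str) -> String {
    // Placeholder: a real system would call a language model here.
    format!("[answer to '{}' from {} chars of context]", question, context.len())
}

fn recursive_answer(question: &str, context: &str) -> String {
    if context.len() <= CHUNK_CHARS {
        return answer_with_model(question, context);
    }
    // Split into chunks, answer each, then recurse over the concatenated
    // partial answers until everything fits into a single call.
    let chars: Vec<char> = context.chars().collect();
    let partials: Vec<String> = chars
        .chunks(CHUNK_CHARS)
        .map(|chunk| answer_with_model(question, &chunk.iter().collect::<String>()))
        .collect();
    recursive_answer(question, &partials.join("\n"))
}

fn main() {
    let long_context = "telemetry line\n".repeat(1_000);
    println!("{}", recursive_answer("what does the log show?", &long_context));
}
```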


Quantum Computing & Hybrid Systems

Mind the Gaps: The Fraught Road to Quantum Advantage

Authors: Jens Eisert, John Preskill
arXiv: 2510.19928 (2025)

Perspectives on the transition from noisy intermediate-scale quantum (NISQ) devices to fault-tolerant application-scale quantum computing. Identifies four key hurdles including error mitigation, scalable fault tolerance, and verifiable algorithms.

Relevance: Explores hybrid classical-quantum systems for optimization and simulation tasks relevant to graph algorithms and energy minimization.


Computational Methods in Software Engineering

Vulnerability2Vec: A Graph-Embedding Approach for Enhancing Vulnerability Classification

Source: Tech Science Press - CMES 2025

Vulnerability2Vec converts Common Vulnerabilities and Exposures (CVE) text descriptions into semantic graphs so that graph-embedding and deep-learning techniques can classify vulnerabilities via node classification.

Key contributions: graph representation of CVE descriptions, graph-embedding features for security vulnerability text, and deep-learning node classification for vulnerability categorization.

Full PDF


Implementation Resources

For practical implementations informed by these papers:


Interested in research collaboration or sponsorship? Check the Contact page to discuss how these methods can accelerate your data infrastructure.