Capabilities, speed and accuracy

arrowspace v0.25.14 is available.


Summing up

It has now been six months since the Spectral Indexing paper was published, so it is time to take stock of the current state of development with a showdown of performance along three dimensions: capabilities, speed, and accuracy.

You can read something about performance in the previous post; in this one I am going to extend the analysis to every aspect, thanks to the results of a major test campaign leveraging two datasets at opposite edges of vector-space applications: text embeddings (Sentence Transformer on the CVE dataset, high semantic content) and biochemical data (Dorothea 100K dimensions for biochemical marker classification: sparse, one-hot, non-embedded).

These datasets were selected because they stand at opposite ends of the spectrum of vector-space characteristics, so as to highlight both the strengths and the non-covered areas of arrowspace. Briefly, text embeddings are dense representations with high semantic content used for text documents and AI/Transformer workloads, while Dorothea 100K is a one-hot sparse representation mostly used for numerical analysis and biochemistry classification. In short, the first has been used to test what arrowspace can do, the second what it cannot do.

| dataset | link | results |
|---|---|---|
| CVE | data | v0.25.x |
| Dorothea 100K | data | experiment 7 on v0.25.14 |

For the build-up to this level of testing, please see the previous posts: 004, 005, 008, 015.


CVE Dense Embeddings Search Performance: Method Comparison

Here is the full visual analysis of the CVE spectral search test across 18 vulnerability queries, comparing Cosine (τ=1.0), Hybrid (τ=0.72), and Taumode (τ=0.42) methods. Unlike the Dorothea classification experiment where λ failed, the CVE search results show Taumode delivering consistent improvements — the manifold L = Laplacian(Cᵀ) provides operationally useful structure for retrieval.
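For orientation, here is a minimal sketch of how a τ-blended score can be read: τ = 1.0 reduces to pure cosine, while lower τ gives the spectral (λ) term more weight. The linear blend and the lambda-proximity affinity are assumptions for illustration, not arrowspace's internal formula.

```python
import numpy as np

def blended_score(query_vec, item_vec, query_lambda, item_lambda, tau):
    """Hypothetical tau-blended score: tau weighs cosine similarity,
    (1 - tau) weighs a spectral affinity derived from lambda proximity.
    Assumed form for illustration, not arrowspace's internals."""
    cos = np.dot(query_vec, item_vec) / (
        np.linalg.norm(query_vec) * np.linalg.norm(item_vec))
    # Closer lambda values -> higher spectral affinity (assumed shape).
    spectral = 1.0 / (1.0 + abs(query_lambda - item_lambda))
    return tau * cos + (1.0 - tau) * spectral

# tau=1.0 -> Cosine, tau=0.72 -> Hybrid, tau=0.42 -> Taumode
# (the three configurations compared in this test).
```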

Score Decay by Rank

Taumode maintains the highest scores at every rank position (1–15), with an average score of 0.887 versus Cosine’s 0.833 — a +0.054 absolute lift across all 270 query-rank cells. The shaded bands show Taumode also has tighter variance.

diagram 1

Score Lift Over Cosine

The per-rank lift is consistently positive for both spectral methods, with Taumode gaining +0.04 to +0.07 at every position. The lift is strongest at rank 1 and remains significant even at rank 15 — critical for RAG tail stability.

diagram 2 Score Lift

Per-Query Score Curves

All 18 individual query curves confirm the pattern: green (Taumode) sits above orange (Hybrid) which sits above blue (Cosine) with near-perfect consistency. Only Q14 (command injection) shows Taumode with slightly steeper tail decay.

diagram 7 Per Query Score Curves

Ranking Agreement & NDCG

NDCG@10 per Query

Taumode vs Hybrid achieves NDCG ≥ 0.93 on 15/18 queries, meaning the spectral methods largely agree on ranking. The Taumode-vs-Cosine NDCG is more variable (mean 0.685, std 0.407), reflecting that spectral re-ranking genuinely reshuffles results for some queries (Q1, Q4, Q7, Q14).

diagram 3 NDCG
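For reference, a minimal NDCG@k sketch for comparing two rankings of the same result set. Deriving graded relevance from the reference ordering is my assumption; the test harness may define relevance differently.

```python
import numpy as np

def ndcg_at_k(reference_ranking, candidate_ranking, k=10):
    """NDCG@k of a candidate ranking against a reference ranking.
    Relevance grades are derived from the reference order (top item gets
    the highest grade) -- one common convention, assumed here."""
    rel = {item: len(reference_ranking) - i
           for i, item in enumerate(reference_ranking)}
    gains = [rel.get(item, 0) for item in candidate_ranking[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(rel.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Identical top-k lists give 1.0; heavier reshuffling lowers the value.
```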

Rank Correlation Heatmap

The Spearman/Kendall heatmap reveals three query categories: fully concordant (green rows like Q2, Q3, Q5, Q15, Q18), partially concordant (Q6, Q10, Q12), and divergent (Q1, Q4, Q7, Q14 with ρ ≈ 0). Divergent queries are where the spectral manifold injects the most novel structure.

diagram 4 Rank Correlation
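The correlations behind the heatmap are standard rank statistics; a small sketch with SciPy (the example rankings are illustrative, not actual query results):

```python
from scipy.stats import spearmanr, kendalltau

# Ranks assigned to the same 10 items by two methods (illustrative values).
cosine_ranks  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
taumode_ranks = [2, 1, 3, 5, 4, 6, 8, 7, 9, 10]

rho, _ = spearmanr(cosine_ranks, taumode_ranks)    # monotonic rank agreement
ktau, _ = kendalltau(cosine_ranks, taumode_ranks)  # pairwise concordance
print(f"Spearman rho = {rho:.2f}, Kendall tau = {ktau:.2f}")
```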

Ranking Agreement Categories

7 of 18 queries have perfect agreement across all methods, while 6 show “spectral divergence” where Hybrid and Taumode agree but diverge from Cosine. This pattern is useful: λ provides a computationally cheap proxy for detecting where learned manifold structure differs from raw cosine similarity.

diagram 8 Ranking Agreement

Tail Quality Analysis (RAG Stability)

Per the scoring methodology, tail quality metrics are the primary indicators of multi-query stability for RAG systems.

Tail/Head Ratio

Taumode achieves the highest T/H ratio (0.990) versus Cosine (0.989), meaning scores decay less from head to tail. While the absolute difference is small (~0.001), Taumode wins on 14/18 queries — a statistically meaningful pattern.

diagram 5 RAG Stability

Tail Coefficient of Variation

Lower CV means more stable tail scores. Taumode achieves the lowest CV on 14/18 queries (avg 0.0028 vs Cosine’s 0.0029). The advantage is most pronounced on harder queries like Q3 (deserialization) and Q14 (command injection).

diagram 6 Tail Coefficient Variation
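Both tail metrics can be computed directly from a query's ranked scores; a small sketch, assuming head = ranks 1–5 and tail = ranks 11–15 (the exact split used by the harness may differ):

```python
import numpy as np

def tail_metrics(scores, head_k=5, tail_start=10):
    """Tail/head ratio and tail coefficient of variation for one query's
    ranked scores. Head/tail split points are assumptions."""
    scores = np.asarray(scores, dtype=float)
    head, tail = scores[:head_k], scores[tail_start:]
    th_ratio = tail.mean() / head.mean()   # closer to 1.0 = flatter decay
    tail_cv = tail.std() / tail.mean()     # lower = more stable tail
    return th_ratio, tail_cv

# Example: 15 ranked scores for one query.
print(tail_metrics([0.91, 0.90, 0.90, 0.89, 0.89, 0.89, 0.88, 0.88,
                    0.88, 0.88, 0.87, 0.87, 0.87, 0.87, 0.86]))
```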

Tail Decay Rate by Difficulty

When sorted by query difficulty (hardest left), Taumode consistently shows lower or comparable decay rates. The exception is Q14 (command injection) where Taumode’s aggressive re-ranking creates a steeper tail — a known tradeoff of stronger spectral weighting.

diagram 13 Tail Decay Tail

Explorative Visualizations

Spectral Boost Map

This query × rank heatmap shows the exact score difference (Taumode − Cosine) at every cell. The nearly uniform green confirms that spectral search provides a global score elevation, not just a top-k effect. The few yellow/red cells indicate where re-ranking shifts results rather than just boosting.

diagram 14 Spectral Boost Map

Head Score: Cosine vs Taumode

Every point sits above the diagonal, confirming Taumode improves head scores for all 18 queries. Point size encodes T/H ratio improvement — queries with larger bubbles benefit most from spectral search in the tail.

diagram 9 Head Score

Score Distribution Violins

Pooling all 270 scores per method, the violins show Taumode’s entire distribution is shifted ~0.05 higher than Cosine. The Cosine distribution has a wider lower tail (more low-scoring results), which Taumode compresses upward.

diagram 12 Score Dist Violins Head

Top-10 Result Overlap

Hybrid–Taumode overlap averages ~0.85, confirming the spectral methods retrieve similar items. Cosine–Taumode overlap is lower (~0.65), particularly on divergent queries Q1, Q4, Q7, Q14 where spectral structure surfaces different CVEs.

diagram 11 Results Overlap
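A minimal sketch of the overlap metric as I read it, i.e. the shared fraction of two top-k ID lists; this simple set intersection is an assumption about how the plotted metric is computed.

```python
def topk_overlap(ids_a, ids_b, k=10):
    """Fraction of the top-k result IDs shared by two methods."""
    return len(set(ids_a[:k]) & set(ids_b[:k])) / k

# e.g. topk_overlap(cosine_top10, taumode_top10) -> ~0.65 on divergent queries.
```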

Re-ranking Position Shifts

The histogram of rank shifts (Cosine → Taumode) shows a sharp peak at 0 with symmetric tails, meaning Taumode mostly preserves cosine ordering while making targeted swaps. This is the ideal profile: not random reshuffling, but structured refinement.

diagram 15 Re-ranking Position Shift
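The shift histogram can be reproduced from the two ID lists; a hedged sketch, where the sign convention (positive = the item moved up under Taumode) is my choice:

```python
def rank_shifts(cosine_ids, taumode_ids):
    """Position shift for every item present in both top-k lists.
    Positive = the item moved towards rank 1 under Taumode."""
    cosine_pos = {item: i for i, item in enumerate(cosine_ids)}
    return [cosine_pos[item] - i
            for i, item in enumerate(taumode_ids) if item in cosine_pos]

# A histogram of these shifts peaking at 0, with symmetric tails,
# matches the "structured refinement" profile described above.
```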

Cumulative Score Advantage

The running cumulative advantage shows Taumode accumulating +0.65 total score over Cosine by rank 15, growing linearly. This linearity means the spectral advantage doesn’t diminish deeper in the results — exactly the property needed for stable RAG retrieval.

diagram 22 Cumulative Score Advantage

Cross-Query Stability

Taumode reduces inter-query score variability at the tail ranks (10–15), meaning it produces more predictable scores regardless of query difficulty. This is the multi-query stability property that the test_2_CVE_db scoring prioritises.

diagram 17 Cross Query Stability

Score Landscape Heatmaps

The side-by-side query×rank heatmaps show Taumode’s surface is uniformly brighter (higher scores) with fewer dark patches in the lower-left (hard queries, deep ranks).

diagram 20 Score Heatmap

Method Dominance

Taumode wins top-1 score on all 18 queries, T/H ratio on 14/18, and lowest tail CV on 14/18. Cosine wins lowest decay on only 3 queries (typically ones where Taumode’s re-ranking creates slightly steeper drops).

diagram 19 Method Wins

Spectral Score Lift per Query

Every query shows positive lift, ranging from +0.02 (Q14) to +0.07 (Q3). This confirms that even when the spectral re-ranking diverges from cosine, it produces higher absolute scores.

diagram 16 Spectral Score Lift

CVE Dataset Summary Table

| Metric | Cosine | Hybrid | Taumode |
|---|---|---|---|
| Avg Top-1 Score | 0.8434 | 0.8734 | 0.8970 |
| Avg T/H Ratio | 0.9891 | 0.9896 | 0.9903 |
| Avg Tail CV | 0.0029 | 0.0030 | 0.0028 |
| NDCG vs Cosine | | 0.763 | 0.685 |
| Top-1 Wins | 0/18 | 0/18 | 18/18 |

The CVE results validate the operationally useful interpretation: even if the epiplexity reading of λ is approximate, the manifold L = Laplacian(Cᵀ) provides a computationally cheap spectral proxy that consistently improves search scores and tail stability across diverse vulnerability queries.


Dorothea 100K Classification Performance

Method F1 Comparison

The grouped bar chart shows that pure lambda k-NN achieves F1=0.000 across all five graph configurations, while cosine k-NN reaches F1=0.867 consistently. Hybrid methods only approach cosine performance when alpha nears 1.0 (minimal spectral component).

diagram 1 F1 Comparison
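For orientation, a sketch of an α-weighted hybrid distance for k-NN; the linear blend of cosine distance and λ distance is my assumption about the experiment's hybrid method, not a documented arrowspace interface.

```python
import numpy as np

def hybrid_distance(x, y, lam_x, lam_y, alpha):
    """alpha = 1.0 -> pure cosine distance; alpha -> 0 gives the lambda
    term more weight. Assumed form for illustration only."""
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    cos_dist = 1.0 - cos_sim
    lam_dist = abs(lam_x - lam_y)
    return alpha * cos_dist + (1.0 - alpha) * lam_dist
```

With distributions this overlapped, the λ term adds noise rather than signal, which is why F1 only recovers as α approaches 1.0.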

Balanced Accuracy Heatmap

The heatmap reveals a stark binary pattern: methods involving cosine similarity show green (balanced accuracy ~0.93), while lambda-only and UMAP methods show red (~0.50, random chance). No graph wiring configuration improves the outcome for pure spectral classification.

diagram 2 Balanced Acc Heatmap

Method Ranking

The horizontal ranking confirms cosine k-NN as the clear winner, with hybrid methods degrading monotonically as spectral weight increases (lower alpha). Lambda k-NN and UMAP 2D both fail to classify the minority class.

diagram 25 Summary

Search / Hybrid Analysis

Alpha Sweep: F1 Response

F1 monotonically increases with alpha across all configurations, meaning more cosine = better performance on this dataset. The tight_clusters config shows the strongest hybrid result (F1=0.857 at α=0.9), but still below pure cosine.

diagram 3 Alpha sweep F1

Precision-Recall Tradeoff

The dual panel reveals that lower alpha values sacrifice recall heavily while providing only marginal precision gains in some configs. The precision-recall curves show that spectral weighting creates an unfavorable tradeoff for classification tasks.

diagram 4 Alpha tradeoffs

Time vs F1 Tradeoff

Hybrid search takes ~390s per query versus ~45s for pure cosine—an 8× slowdown for equal or worse F1. The star markers (cosine baseline) consistently sit higher than hybrid points that cost 8× more compute.

diagram 15 Time vs F1

k-Sensitivity Curves

Lambda k-NN produces flat-zero lines for F1, precision, and recall regardless of k, while cosine k-NN stabilizes quickly at k≥3. UMAP shows noisy, near-zero signals. This confirms the failure mode is structural, not parametric.

diagram 13 K-sensitivity

Spectral Diagnostics (Novel)

Lambda Distributions: Class Overlap

The violin plots deliver the core diagnostic: positive and negative class lambda distributions are nearly identical. Cohen’s d ranges from 0.046 to 0.086 (negligible effect size) across all configurations.

diagram 5 Lambda distribution violins
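Cohen’s d here is the standardised mean difference between the two per-class λ distributions; a minimal sketch of the pooled-standard-deviation variant:

```python
import numpy as np

def cohens_d(pos_lambdas, neg_lambdas):
    """Standardised mean difference between two lambda samples
    (pooled-standard-deviation variant)."""
    pos, neg = np.asarray(pos_lambdas), np.asarray(neg_lambdas)
    pooled = np.sqrt(((len(pos) - 1) * pos.var(ddof=1) +
                      (len(neg) - 1) * neg.var(ddof=1)) /
                     (len(pos) + len(neg) - 2))
    return (pos.mean() - neg.mean()) / pooled

# Values around 0.05-0.09, as reported above, are conventionally "negligible".
```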

Lambda Density Landscape

The KDE comparison shows that different graph configs shift the lambda mean (0.21–0.35) but classes remain inseparable within every config. The manifold L = Laplacian(Cᵀ) captures geometric structure but not the biochemical class boundary.

diagram 12 Lambda Density

Lambda CDF with KS Test

The cumulative distribution functions for positive vs negative classes are nearly superimposed. KS statistics are small and p-values large, formally confirming no distributional separation via lambda.

diagram 24 cumulative distribution functions
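The KS comparison is a standard two-sample test; a small sketch with SciPy, where the generated λ values are illustrative stand-ins, not the experiment's data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Illustrative stand-ins for per-class lambda values.
pos_lambdas = rng.normal(loc=0.28, scale=0.05, size=80)
neg_lambdas = rng.normal(loc=0.28, scale=0.05, size=720)

stat, p_value = ks_2samp(pos_lambdas, neg_lambdas)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A small statistic with a large p-value means the two empirical CDFs are
# statistically indistinguishable -- the "nearly superimposed" curves above.
```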

Lambda Quantile Class Composition

Breaking items into lambda deciles shows that no quantile consistently enriches the positive class above the 9.8% base rate. This rules out even threshold-based spectral classification strategies.

diagram 16 Lambda Quantile
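A sketch of the decile breakdown, assuming 0/1 labels and the 9.8% positive base rate reported above; binning via pandas.qcut is my choice, not necessarily the harness's:

```python
import pandas as pd

def decile_enrichment(lambdas, labels, base_rate=0.098):
    """Positive-class rate per lambda decile relative to the base rate.
    Values near 1.0 in every decile mean no enrichment anywhere."""
    df = pd.DataFrame({"lam": lambdas, "y": labels})
    df["decile"] = pd.qcut(df["lam"], 10, labels=False, duplicates="drop")
    rates = df.groupby("decile")["y"].mean()
    return rates / base_rate
```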

Novel Structural Visualizations

Lambda Contour Fields Over UMAP

The contour maps interpolate lambda as a continuous field over UMAP 2D space. The × markers (positive class) scatter across all lambda regions without clustering into distinct spectral zones—lambda topology doesn’t align with class boundaries.

diagram 22 UMAP Lambdas Contours

Lambda-UMAP Fusion

Coloring UMAP scatter plots by lambda value reveals that spectral energy is spatially diffuse: high-lambda and low-lambda items are interleaved with no class-correlated spatial patterns.

diagram 8 Lambda UMAP Fusion

Spectral Quality Radar Profiles

Each graph configuration produces a distinct spectral fingerprint, but none achieves the combination needed for classification: high Cohen’s d + low overlap. HiCompress shows the highest spectral gap but worst Cohen’s d.

diagram 6 Spectral Radar

Spectral-Performance Correlation Matrix

The correlation analysis reveals a negative correlation (-0.73) between best alpha and best hybrid F1. Lambda std (+0.75) and lambda mean (+0.60) correlate positively with F1—but this reflects that configs with wider lambda spread allow the cosine component to do more work at higher alpha, not that spectral features help.

diagram 14 Spectral Correlation Matrix

Rank Stability Across Configs

The Spearman rank-order correlation between lambda orderings across configs is near zero (ρ ∈ [-0.05, 0.11]). This means the manifold Laplacian assigns items to different spectral positions depending on config—lambda is not a stable item property on Dorothea.

diagram 20 Spearman Rank-order correlation

Effective Rank vs Participation Ratio

Both metrics sit near N=800 (the number of items), indicating the Laplacian spectrum is near full-rank with minimal compression. The manifold captures little low-dimensional structure from the 100k sparse features—there’s no spectral bottleneck to exploit.

diagram 17 Rank vs Participation
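Both diagnostics are standard spectrum summaries; a small sketch, assuming the Laplacian eigenvalues are available as an array:

```python
import numpy as np

def effective_rank(eigenvalues):
    """Effective rank = exp(Shannon entropy of the normalised spectrum)."""
    lam = np.asarray(eigenvalues, dtype=float)
    lam = lam[lam > 0]
    p = lam / lam.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

def participation_ratio(eigenvalues):
    """PR = (sum lambda)^2 / sum(lambda^2); near N means a flat spectrum."""
    lam = np.asarray(eigenvalues, dtype=float)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

# Both sitting near the number of items (~800 here) means there is little
# low-dimensional structure for lambda to compress.
```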

Information Bottleneck Pipeline

The pipeline visualization traces dimensionality from 100k → 220 (JL projection) → 50 (clusters) → 220 (Laplacian) → 1 (lambda). Classification signal survives through the cosine path but is completely lost at the lambda compression stage.

diagram 21 Information Bottleneck

Method Dominance Sankey

The Sankey flow visualizes how all five configs route through four method types into performance tiers. Only cosine reaches the “High F1” tier; lambda and UMAP flow entirely to “Failed.”

diagram 23 Sankey method flows

Spectral Diagnostic Bubble

The bubble chart places each config in Cohen’s d × F1 space, sized by Fiedler value and colored by overlap. All configs cluster in the low-d, moderate-F1 region—confirming that spectral class separation is the binding constraint.

diagram 9 Spectral Diagnostic bubble

Parallel Coordinates: Config Signatures

Each config traces a unique path through spectral metrics but converges to identical cosine F1 (0.867) while diverging on hybrid F1. This confirms that spectral variation affects only the spectral component, which is net-negative for this task.


Conclusion

Consider joining the next online event about all this.

Takeaways, according to my current understanding:

⚠️ arrowspace is not meant for, or fails on, sparse one-hot non-embedded data such as Dorothea 100K: the per-class lambda distributions overlap almost completely (Cohen’s d below 0.1), so pure spectral classification collapses to chance level and hybrid weighting only costs compute.

✅ STRONG product potential on dense semantic embeddings such as the CVE Sentence Transformer vectors: Taumode lifts scores at every rank, improves tail stability, and keeps re-ranking structured rather than random, which is exactly the profile needed for search and RAG retrieval.