Proposal

Main paper can be found here.

An indexing technique for NN queries
It scales sublinearly ($$O(\sqrt n)$$) on the dataset size
It generalizes to all distance metrics
Depends only on the intrinsic dimensionality of the dataset
Easy to parallelize on multi/many core systems

Summary

techniques like LSH do scale sublinearly wrt dataset sizes. However, they are hard to tune, approximate and work on omly certain metrics
brute force NN search is easy to parallelize but does more work
$$\rho$$ is the distance metric
Data structure introduced is called Random Ball Cover (RBC):
- randomly pick $$n_r$$ representatives from the dataset
- for each rep, it "owns" a set of points in the dataset $$L_r$$. They typically are the closest points to this rep.
- for each rep, also store the max distance from it to the points in its cover, $$\psi_r$$
there are 2 search algos based on RBC: exact and one-shot. And for both these algos, building RBC boils down to calling brute-force NN search.
one-shot
- $$L_r$$ contains a suitably sized $$s$$ NNs of that rep from the dataset
- when given a query, it is first brute-forced through all the representative points to find the nearest rep
- then again brute-forced through its $$L_r$$ for the NNs
- it can be proved that it's complexity is $$O(\sqrt n)$$
- it sometimes can give approximate results, but can be very fast in practice
exact
- $$L_r$$ contains all points in the dataset for which this rep is the closest
- this relies on triangle inequality in order to guarantee exact results
- the closest point to $$q$$ called $$r_q$$ is found from the set of reps
- let $$\gamma = \rho(q, r_q)$$. This is an upper bound on q's NNs.
- all reps with $$\rho(r, q) > \gamma + \psi_r$$ are discarded
- all reps with $$\rho(r, q) > 3 \rho(r_q, q)$$ are also discarded
- after this, do a brute force on all the points belonging to the remaining reps
- it can be proved that it's complexity is $$O(\sqrt n)$$
for parallelization strategy, they simply come up with an effective way to parallelize brute force search, which is very similar to a gemm implementation