Computing Extremely Accurate Quantiles Using t-Digests
Proposal
Main paper can be found [here](https://arxiv.org/abs/1902.04023).
- online approximate quantile computation
- error of the q-th quantile estimate scales with $q(1-q)$, so extreme quantiles (q near 0 or 1) are estimated very accurately
- merging of summaries (aka digests) results in only a small loss of accuracy
- can also be extended to weighted samples
Summary
- digest = a partition of the samples into groups; aka clusters
- t-digest = a digest in which every cluster either has unit weight or has its weight bounded by some scale function
- weight = number of samples in the cluster C; aka |C|
- fully merged t-digest = no two consecutive clusters can be merged without violating the weight bound
- a cluster is represented by its mean and number of samples in it
- clusters at both ends of the quantile range are kept small while the ones in the middle are larger
- scale function $k(q)$
- it must be a non-decreasing function of $q$
- the main one proposed: $k_1(q) = \frac{\delta}{2\pi} \sin^{-1}(2q - 1)$
- $\delta$ = compression parameter. At least for $k_1$, it ends up bounding the number of clusters to at most $\lceil \delta \rceil$
- $k$ is the notional index
- $n$ = number of samples. It is assumed to be much larger than $\delta$
- $k(q_{right}) - k(q_{left}) \le 1$, for any cluster with weight > 1
- $q_{left}$ = sum of weights of all clusters to the left of it, divided by $n$; $q_{right} = q_{left} + |C|/n$
- this function leads to estimates which are very accurate near extreme quantiles and modestly accurate in the middle
- another scale function is linear: $k_0(q) = \frac{\delta}{2} q$. It can work but trades off tail accuracy for cheaper computation
- other scale functions are: $k_2(q) \propto \log\frac{q}{1-q}$ and $k_3(q) \propto \begin{cases}\log 2q & q \le 1/2 \\ -\log 2(1-q) & q > 1/2\end{cases}$, with normalization factors depending on $n$ and $\delta$; these give even stronger tail accuracy (a small Python sketch of $k_1$ follows this list)
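To make the size bound concrete, here is a minimal Python sketch (my own illustration, not code from the paper) of $k_1$, its inverse, and the largest weight a cluster starting at quantile $q_{left}$ may hold under the $k(q_{right}) - k(q_{left}) \le 1$ rule. The names `k1`, `k1_inv`, and `max_cluster_weight` are assumptions.

```python
import math

def k1(q: float, delta: float) -> float:
    """Scale function k1 from the paper: delta / (2*pi) * asin(2q - 1)."""
    return delta / (2 * math.pi) * math.asin(2 * q - 1)

def k1_inv(k: float, delta: float) -> float:
    """Inverse of k1: maps a notional index back to a quantile."""
    return (math.sin(2 * math.pi * k / delta) + 1) / 2

def max_cluster_weight(q_left: float, delta: float, n: int) -> float:
    """Largest weight a cluster starting at quantile q_left may hold,
    i.e. the largest w with k1(q_left + w/n) - k1(q_left) <= 1."""
    q_right = k1_inv(k1(q_left, delta) + 1.0, delta)
    return (q_right - q_left) * n

if __name__ == "__main__":
    delta, n = 100.0, 1_000_000
    # Clusters near the tails are forced to stay small; middle ones grow large.
    for q in (0.0001, 0.01, 0.25, 0.5):
        print(f"q={q}: max weight ~ {max_cluster_weight(q, delta, n):.0f}")
```

With $\delta = 100$ and $n = 10^6$, this caps tail clusters at a few thousand samples while clusters near the median may hold tens of thousands, which is exactly the taper described above.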
- merging 2 t-digests is still possible but might end up with weakly ordered clusters, whose error bounds are hard to prove
- but in practice merging doesn't result in significant loss of accuracy, especially if we do stratified merging, where we use a higher value of $\delta$ while computing the clusters and a lower value while merging 2 t-digests
- merging new data with an existing t-digest
- C = input t-digest to be merged
- X = input data to be merged with the t-digest; each data point is treated as a cluster of weight 1
- m = number of clusters in C; n = number of points in X
- algo as follows (a runnable Python sketch appears at the end of this section):
- X = sort(C union X)
- S = $\sum_i |X_i|$, the total weight
- C' = []
- $q_0$ = 0; $\sigma = X_1$ ($\sigma$ is the cluster currently being grown)
- for i = 2 ... (m + n):
- q = $q_0 + (|\sigma| + |X_i|)/S$
- if q $\le k^{-1}(k(q_0) + 1)$: merge $X_i$ into $\sigma$
- else: C'.append($\sigma$); $q_0$ += $|\sigma|/S$; $\sigma = X_i$
- C'.append($\sigma$)
- this merge could cause clusters near q = 0 to shift their centroids, thereby causing weak ordering!
- the effect of this issue can be reduced by alternating the scan order of the merge, i.e. ascending order first, then descending order, and so on
- if instead we set the data buffer size to just 1, the above merge algo turns into a clustering variant that absorbs one point at a time
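Below is a self-contained Python sketch of the merge described above, using the $k_1$ scale function; `Cluster` and `merge` are my own names, and details such as the alternating scan order are omitted for brevity.

```python
import math
from dataclasses import dataclass

# Scale function k1 and its inverse (same as in the earlier sketch).
def k1(q: float, delta: float) -> float:
    return delta / (2 * math.pi) * math.asin(2 * q - 1)

def k1_inv(k: float, delta: float) -> float:
    return (math.sin(2 * math.pi * k / delta) + 1) / 2

@dataclass
class Cluster:
    mean: float
    weight: float

def merge(clusters: list[Cluster], points: list[float], delta: float) -> list[Cluster]:
    """Merge a buffer of raw points into an existing list of clusters.

    Each point enters as a unit-weight cluster; the combined list is sorted
    by mean and re-clustered greedily under the k1 size bound.
    """
    xs = sorted(clusters + [Cluster(p, 1.0) for p in points], key=lambda c: c.mean)
    S = sum(c.weight for c in xs)                      # total weight
    merged: list[Cluster] = []
    q0 = 0.0                                           # quantile at sigma's left edge
    q_limit = k1_inv(k1(q0, delta) + 1.0, delta)
    sigma = xs[0]                                      # cluster currently being grown
    for x in xs[1:]:
        q = q0 + (sigma.weight + x.weight) / S
        if q <= q_limit:
            # Still within the size bound: absorb x into sigma (weighted mean).
            total = sigma.weight + x.weight
            sigma = Cluster((sigma.mean * sigma.weight + x.mean * x.weight) / total, total)
        else:
            # Emit sigma, advance q0, recompute the limit, start a new cluster.
            merged.append(sigma)
            q0 += sigma.weight / S
            q_limit = k1_inv(k1(q0, delta) + 1.0, delta)
            sigma = x
    merged.append(sigma)
    return merged

if __name__ == "__main__":
    import random
    random.seed(42)
    digest: list[Cluster] = []
    for _ in range(100):                               # feed data in buffers of 10k points
        digest = merge(digest, [random.random() for _ in range(10_000)], delta=100.0)
    print(len(digest), "clusters for 1M points")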