Proposal

Main paper can be found here.

  • online approximate quantile computation
  • error of the q-th quantile estimate is proportional to q(1 - q), so estimates are very accurate near the extreme quantiles
  • merging of summaries (aka digests) results in only a modest loss of accuracy
  • can also be extended for weighted samples

Summary

  • digest = partition of samples; aka clusters
  • t-digest = a digest in which every cluster either has unit weight or has its weight bounded by some scale function
  • weight = number of samples in the cluster C; aka |C|
  • fully merged t-digest = no two consecutive clusters can be merged without violating the weight bound
  • a cluster is represented by its mean and number of samples in it
  • clusters at both ends of the quantile range are kept smaller while the ones in the middle are larger
  • scale function
    • it must be a non-decreasing function
    • $$k_1(q) = \frac{\delta}{2 \pi} \arcsin(2q - 1)$$
      • $$\delta$$ = compression parameter. At least for $$k_1$$, it ends up bounding the number of clusters to $$[\lfloor \frac{\delta}{2} \rfloor, \lceil \delta \rceil]$$
      • $$k$$ is the notional index
      • $$n$$ = number of samples. It is assumed to be much larger than $$\delta$$
    • the k-size of a cluster, $$K(C) = k(q_r) - k(q_l)$$, must be $$\le 1$$ for any cluster with more than one sample (unit-weight clusters are exempt)
    • $$q_l = \frac{W_l(C)}{n}$$
    • $$q_r = q_l + \frac{|C|}{n}$$
    • $$W_l(C)$$ = sum of weights of all clusters to the left of it
    • this function leads to estimates which are very accurate near extreme quantiles and modestly accurate in the middle
    • another scale function is linear: $$k_0(q) = \frac{\delta q}{2}$$. It works, but trades accuracy at the extremes for cheaper computation
  • other scale functions are:
    • $$k_2(q) = \frac{\delta}{4 \log(\frac{n}{\delta}) + 24} \log \left( \frac{q}{1-q} \right)$$
    • $$k_3(q) = \frac{\delta}{4 \log(\frac{n}{\delta}) + 21} \cdot \begin{cases} \log(2q) & \text{if } q \le 0.5 \\ -\log(2(1 - q)) & \text{if } q > 0.5 \end{cases}$$
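
A minimal Python sketch of the scale functions above, plus the inverse of $$k_1$$ (needed later to compute $$q_{limit}$$). The function names are mine, not the paper's:

```python
import math

def k0(q: float, delta: float) -> float:
    # linear scale function: uniform cluster sizes, cheapest to compute
    return delta * q / 2

def k1(q: float, delta: float) -> float:
    # arcsine scale function: keeps clusters near q = 0 and q = 1 small
    return (delta / (2 * math.pi)) * math.asin(2 * q - 1)

def k1_inv(k: float, delta: float) -> float:
    # inverse of k1; used in the merge loop to find q_limit
    return (math.sin(2 * math.pi * k / delta) + 1) / 2

def k2(q: float, delta: float, n: int) -> float:
    # only defined for 0 < q < 1
    return delta / (4 * math.log(n / delta) + 24) * math.log(q / (1 - q))

def k3(q: float, delta: float, n: int) -> float:
    # only defined for 0 < q < 1
    z = 4 * math.log(n / delta) + 21
    return (delta / z) * (math.log(2 * q) if q <= 0.5 else -math.log(2 * (1 - q)))
```

As a sanity check, with $$\delta = 100$$ the first boundary $$k_1^{-1}(k_1(0) + 1)$$ comes out to roughly 0.001, i.e. the leftmost cluster is allowed only about 0.1% of the total weight.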
  • merging 2 t-digests is still possible but might end up with weakly ordered clusters, whose error bounds are hard to prove
  • but in practice merging doesn't result in significant loss of accuracy, especially with stratified merging: use a higher value of $$\delta$$ while computing the clusters and a lower value while merging 2 t-digests
  • merging new data with an existing t-digest
    • C = input t-digest to be merged
    • X = input data to be merged with the t-digest (each sample is a unit-weight cluster)
    • m = number of clusters in C
    • n = number of samples in X
  • algo as follows (a runnable Python sketch appears after this list):
    • X = sort(C union X)
    • S = $$\sum_i x_i.count$$
    • C' = []
    • $$q_0$$ = 0
    • $$q_{limit} = k^{-1}(k(q_0, \delta) + 1, \delta)$$
    • $$\sigma = x_1$$
    • for i = 2 ... (m + n)
      • q = $$q_0 + \frac{\sigma.count + x_i.count}{S}$$
      • if q $$\le$$ $$q_{limit}$$: $$\sigma += x_i$$
      • else:
        • C'.append($$\sigma$$)
        • $$q_0 += \frac{\sigma.count}{S}$$
        • $$q_{limit} = k^{-1}(k(q_0, \delta) + 1, \delta)$$
        • $$\sigma = x_i$$
    • C'.append($$\sigma$$)
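
The merge loop above as a self-contained, runnable Python sketch, reusing k1 and k1_inv from the earlier block. Centroid, absorb, and merge are my own names for illustration, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class Centroid:
    mean: float
    count: float

    def absorb(self, other: "Centroid") -> None:
        # the "sigma += x_i" step: weighted-mean update of the open cluster
        total = self.count + other.count
        self.mean += (other.mean - self.mean) * other.count / total
        self.count = total

def merge(clusters: list, xs: list, delta: float) -> list:
    """Merge raw samples `xs` into the centroids of an existing t-digest."""
    buf = sorted(
        [Centroid(c.mean, c.count) for c in clusters]   # copy, don't mutate C
        + [Centroid(x, 1.0) for x in xs],               # unit-weight samples
        key=lambda c: c.mean,
    )
    if not buf:
        return []
    S = sum(c.count for c in buf)  # total weight across both inputs
    out = []
    q0 = 0.0
    q_limit = k1_inv(k1(q0, delta) + 1, delta)
    sigma = buf[0]
    for x in buf[1:]:
        q = q0 + (sigma.count + x.count) / S  # note: divided by S, not delta
        if q <= q_limit:
            sigma.absorb(x)                   # still within the weight bound
        else:
            out.append(sigma)                 # close the current cluster
            q0 += sigma.count / S
            q_limit = k1_inv(k1(q0, delta) + 1, delta)
            sigma = x
    out.append(sigma)
    return out
```

Building a digest from scratch is then just `merge([], data, delta=100.0)`.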
  • this merge could cause clusters near q = 0 to shift their centroids, resulting in weak ordering!
  • the effect of this issue can be reduced by alternating the scan order of the merge, i.e. ascending order first, then descending order, and so on
  • if instead the data buffer is limited to just 1 sample, the above merge algo turns into a progressive clustering variant
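
A sketch of how the buffering and the alternating scan order could be wired together, assuming the `merge` and `Centroid` helpers above. Implementing the descending pass by negating all values is my own shortcut, not the paper's; it works because $$k_1$$ is symmetric around q = 0.5, so reflecting $$q \to 1 - q$$ leaves every cluster's k-size unchanged:

```python
class TDigest:
    def __init__(self, delta: float = 100.0, buffer_size: int = 1000):
        self.delta = delta
        self.buffer_size = buffer_size  # buffer_size = 1 gives the clustering variant
        self.clusters = []              # list of Centroid, ordered by mean
        self.buffer = []                # raw samples awaiting a merge
        self.ascending = True           # scan direction for the next merge

    def add(self, x: float) -> None:
        self.buffer.append(x)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        if self.ascending:
            self.clusters = merge(self.clusters, self.buffer, self.delta)
        else:
            # descending scan: negate values, merge ascending, then undo.
            # valid for symmetric scale functions like k1 (see lead-in note).
            neg = merge(
                [Centroid(-c.mean, c.count) for c in self.clusters],
                [-x for x in self.buffer],
                self.delta,
            )
            self.clusters = [Centroid(-c.mean, c.count) for c in reversed(neg)]
        self.ascending = not self.ascending  # alternate direction each merge
        self.buffer = []
```

Usage: create `d = TDigest(delta=100.0)`, feed samples with `d.add(x)`, and call `d.flush()` before reading quantiles off the cluster means and counts.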