SegmentsCostCache provides faster way to calculate cost function proposed in
CostBalancerStrategy
.
See https://github.com/druid-io/druid/pull/2972 for more details about the cost function.
Joint cost for two segments (you can make formulas below readable by copy-pasting to
https://www.codecogs.com/latex/eqneditor.php):
cost(X, Y) = \int_{x_0}^{x_1} \int_{y_0}^{y_1} e^{-\lambda |x-y|}dxdy
or
cost(X, Y) = e^{y_0 + y_1} (e^{x_0} - e^{x_1})(e^{y_0} - e^{y_1}) (*)
if x_0 <= x_1 <= y_0 <= y_1
(*) lambda coefficient is omitted for simplicity.
For a group of segments {S_xi}, i = {0, n} total joint cost with segment S_y could be calculated as:
cost(X, Y) = \sum cost(X_i, Y) = e^{y_0 + y_1} (e^{y_0} - e^{y_1}) \sum (e^{xi_0} - e^{xi_1})
if xi_0 <= xi_1 <= y_0 <= y_1
and
cost(X, Y) = \sum cost(X_i, Y) = (e^{y_0} - e^{y_1}) \sum e^{xi_0 + xi_1} (e^{xi_0} - e^{xi_1})
if y_0 <= y_1 <= xi_0 <= xi_1
SegmentsCostCache stores pre-computed sums for a group of segments {S_xi}:
1) \sum (e^{xi_0} - e^{xi_1}) -> leftSum
2) \sum e^{xi_0 + xi_1} (e^{xi_0} - e^{xi_1}) -> rightSum
so that calculation of joint cost function for segment S_y became a O(1 + m) complexity task, where m
is the number of segments in {S_xi} that overlaps S_y.
Segments are stored in buckets. Bucket is a subset of segments contained in SegmentsCostCache, so that
startTime of all segments inside a bucket are in the same time interval (with some granularity):
|------------------------|--------------------------|-----------------------|-------- ....
t_0 t_0+D t_0 + 2D t0 + 3D ....
S_x1 S_x2 S_x3 S_x4 S_x5 S_x6 S_x7 S_x8 S_x9
bucket1 bucket2 bucket3
Reasons to store segments in Buckets:
1) Cost function tends to 0 as distance between segments' intervals increases; buckets
are used to avoid redundant 0 calculations for thousands of times
2) To reduce number of calculations when segment is added or removed from SegmentsCostCache
3) To avoid infinite values during exponents calculations