Sparse Index - StarTree Docs

Overview and Purpose

A sparse index is a specialized data structure that accelerates equality-based filter operations on columns with extremely high cardinality. It represents a hybrid approach between inverted indexes and Bloom filters, designed specifically for scenarios where traditional indexes become inefficient due to the sheer number of unique values. Sparse indexes are particularly valuable for:

Columns storing hundreds of thousands of unique values
Random identifiers like UUIDs or session IDs
High-cardinality fields such as public IP addresses
Any query pattern involving equality filters (column = value or column IN (...)) on high-cardinality columns
Deployments using tiered storage with data in cloud object stores

Sparse index is a StarTree extension that shows particular performance benefits with tiered storage because it can significantly reduce data transferred from cloud storage during query execution.

How the Index Works

Core Concepts

Traditional inverted indexes create a direct mapping between each unique value and the rows containing that value. While effective for low- to medium-cardinality columns, this approach becomes inefficient as cardinality increases, leading to large index sizes and degraded performance. The sparse index in StarTree Cloud takes a different approach:

Chunking: Rows are grouped into contiguous chunks of a predefined size.
Partitioning: Column values are classified into a fixed number of partitions using one or more hash functions (mappers).
Pseudo-Inverted Index: Instead of mapping values directly to rows, the index maps partitions to chunks that contain values from that partition.

Example Illustration

For a column storing device IDs with millions of unique values:

Traditional Approach: An inverted index would map each unique device ID to its corresponding rows, potentially creating an enormous index.
Sparse Index Approach:
- Rows are grouped into chunks
- Device IDs are hashed into partitions using multiple hash functions
- The index maps each partition to the chunks that contain values from that partition

When processing a query like WHERE deviceId = 'abc123':

The system determines which partition contains ‘abc123’
It identifies the chunks associated with that partition
Only those chunks are scanned, dramatically reducing I/O operations

Sparse indexes are probabilistic at chunk level. They reduce scan work but may include extra chunks that are removed by final predicate evaluation. This approach is particularly beneficial with tiered storage, as it minimizes the amount of data downloaded from cloud storage.

Configuration

Enabling Sparse Index at Server Level

To use sparse indexes, you must first enable them at the server level:

pinot.server.instance.index.sparse.enabled=true

This can be set via the Swagger UI using the /cluster/configs API, or through your server configuration files. Note: After adding this configuration, restart the servers for the changes to take effect.

Configuring Sparse Index for a Column

To enable a sparse index on a column in your StarTree Cloud table, add the following configuration to your table definition:

{
  "fieldConfigList": [
    {
      "name": "deviceId",
      "indexes": {
        "sparse": {
          "chunkSize": 1024,
          "partitions": 10000,
          "hashFunctionCount": 3
        }
      }
    }
  ]
}

Configuration Parameters

chunkSize: The number of documents per chunk.
- Must be a power of 2
- Default: 1024
- Recommended range: 128 to 8192
- Impact: Smaller chunks typically reduce false positives and numEntriesScannedInFilter, but can increase index size
- Usually the strongest tuning knob for scan reduction
partitions: The number of partitions used for classifying values.
- Default: 10000
- Recommended: Depends on cardinality and target false-positive rate
- Impact: Higher values improve selectivity and generally increase index size
- Size impact is often sub-linear when partitions are already large
hashFunctionCount: The number of partition mapping functions used.
- Default: 3
- Recommended range: 1 to 8
- Impact: Higher values increase selectivity and index lookups
- Usually increases index size close to linearly

Sparse index cannot be created on a column that also has an inverted index enabled.

Sparse index is not supported on MAP columns.

Advanced Configuration Parameters

These parameters are rarely needed and should only be used in specific scenarios:

seedGenerator: An integer seed for generating hash functions.
- Only necessary if hash collisions are unusually frequent
mapperId: Specifies the hash function type.
- Currently, only “murmur” is supported

False-Positive and Size Formulas

The following formulas are practical approximations for tuning.
Assume:

$R$ : rows in segment
$c$ : chunkSize
$m$ : partitions
$k$ : hashFunctionCount
$v$ : average values per row for the indexed column ( $v = 1$ for single-value)
$N_c = \lceil R / c \rceil$ and $n = c \cdot v$

Expected unique partitions touched in one chunk for one hash:

u = m \left(1 - e^{-n/m}\right)

Chunk-level false-positive rate for non-existing keys:

p_{\mathrm{fp},\mathrm{chunk}} \approx \left(\frac{u}{m}\right)^k \approx \left(1 - e^{-n/m}\right)^k

Expected false-positive chunks per segment:

E_{\mathrm{fp},\mathrm{chunks}} \approx N_c \cdot p_{\mathrm{fp},\mathrm{chunk}}

Candidate-docs proxy:

E_{\mathrm{candidateDocs}} \approx c \cdot E_{\mathrm{fp},\mathrm{chunks}}

Index-size proxy:

E_{\mathrm{entries}} \approx N_c \cdot k \cdot u

\mathrm{SizeBytes} \approx b \cdot E_{\mathrm{entries}}

Where

b

is calibrated bytes per entry for your dataset. For single-value columns (

v = 1

) and fixed

c, m, k

, you can derive:

S_{\mathrm{required}} \approx \left(\frac{b \cdot R \cdot k \cdot m}{c}\right) \cdot p_{\mathrm{fp},\mathrm{chunk,target}}^{1/k}

This gives required index size for a target chunk-level false-positive rate.

Optimization Workflow

Use this workflow to tune sparse index safely:

Pick the right columns
- High-cardinality columns queried with = or IN.
- Especially useful when an inverted index would be too large.
Benchmark baseline
- Run representative queries and capture:
  - timeUsedMs
  - numEntriesScannedInFilter
  - numDocsScanned
Tune in this order
- chunkSize first (often largest impact on scan reduction)
- partitions second (improve selectivity with moderate size growth)
- hashFunctionCount third (precision gains with near-linear size growth)
Evaluate each candidate
- Compare p50/p95 latency, numEntriesScannedInFilter, and index size.
- Choose the smallest index that meets your latency and scan targets.
Rebuild/reload segments after config updates
- Ensure segments are reloaded so new index settings take effect.

Performance Considerations

Tiered Storage Benefits: Sparse indexes provide the greatest performance improvements with tiered storage because they reduce object-store reads.
Cardinality Trade-offs: For low-cardinality columns, traditional inverted indexes are often more efficient.
Predicate Coverage: Sparse indexes are designed for equality/IN filter acceleration. Final predicate evaluation still runs on candidate rows.
Storage Impact:
- hashFunctionCount often increases index size close to linearly.
- partitions increases size, but not always linearly in practice.
- chunkSize trades off index size versus scan precision.

Documentation Index

​Overview and Purpose

​How the Index Works

​Core Concepts

​Example Illustration

​Configuration

​Enabling Sparse Index at Server Level

​Configuring Sparse Index for a Column

​Configuration Parameters

​Advanced Configuration Parameters

​False-Positive and Size Formulas

​Optimization Workflow

​Performance Considerations

Overview and Purpose

How the Index Works

Core Concepts

Example Illustration

Configuration

Enabling Sparse Index at Server Level

Configuring Sparse Index for a Column

Configuration Parameters

Advanced Configuration Parameters

False-Positive and Size Formulas

Optimization Workflow

Performance Considerations