Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.startree.ai/llms.txt

Use this file to discover all available pages before exploring further.

Overview and Purpose

A sparse index is a specialized data structure that accelerates equality-based filter operations on columns with extremely high cardinality. It represents a hybrid approach between inverted indexes and Bloom filters, designed specifically for scenarios where traditional indexes become inefficient due to the sheer number of unique values. Sparse indexes are particularly valuable for:
  • Columns storing hundreds of thousands of unique values
  • Random identifiers like UUIDs or session IDs
  • High-cardinality fields such as public IP addresses
  • Any query pattern involving equality filters (column = value or column IN (...)) on high-cardinality columns
  • Deployments using tiered storage with data in cloud object stores
Sparse index is a StarTree extension that shows particular performance benefits with tiered storage because it can significantly reduce data transferred from cloud storage during query execution.

How the Index Works

Core Concepts

Traditional inverted indexes create a direct mapping between each unique value and the rows containing that value. While effective for low- to medium-cardinality columns, this approach becomes inefficient as cardinality increases, leading to large index sizes and degraded performance. The sparse index in StarTree Cloud takes a different approach:
  1. Chunking: Rows are grouped into contiguous chunks of a predefined size.
  2. Partitioning: Column values are classified into a fixed number of partitions using one or more hash functions (mappers).
  3. Pseudo-Inverted Index: Instead of mapping values directly to rows, the index maps partitions to chunks that contain values from that partition.

Example Illustration

For a column storing device IDs with millions of unique values:
  1. Traditional Approach: An inverted index would map each unique device ID to its corresponding rows, potentially creating an enormous index.
  2. Sparse Index Approach:
    • Rows are grouped into chunks
    • Device IDs are hashed into partitions using multiple hash functions
    • The index maps each partition to the chunks that contain values from that partition
When processing a query like WHERE deviceId = 'abc123':
  1. The system determines which partition contains ‘abc123’
  2. It identifies the chunks associated with that partition
  3. Only those chunks are scanned, dramatically reducing I/O operations
Sparse indexes are probabilistic at chunk level. They reduce scan work but may include extra chunks that are removed by final predicate evaluation. This approach is particularly beneficial with tiered storage, as it minimizes the amount of data downloaded from cloud storage.

Configuration

Enabling Sparse Index at Server Level

To use sparse indexes, you must first enable them at the server level:
pinot.server.instance.index.sparse.enabled=true
This can be set via the Swagger UI using the /cluster/configs API, or through your server configuration files. Note: After adding this configuration, restart the servers for the changes to take effect.

Configuring Sparse Index for a Column

To enable a sparse index on a column in your StarTree Cloud table, add the following configuration to your table definition:
{
  "fieldConfigList": [
    {
      "name": "deviceId",
      "indexes": {
        "sparse": {
          "chunkSize": 1024,
          "partitions": 10000,
          "hashFunctionCount": 3
        }
      }
    }
  ]
}

Configuration Parameters

  1. chunkSize: The number of documents per chunk.
    • Must be a power of 2
    • Default: 1024
    • Recommended range: 128 to 8192
    • Impact: Smaller chunks typically reduce false positives and numEntriesScannedInFilter, but can increase index size
    • Usually the strongest tuning knob for scan reduction
  2. partitions: The number of partitions used for classifying values.
    • Default: 10000
    • Recommended: Depends on cardinality and target false-positive rate
    • Impact: Higher values improve selectivity and generally increase index size
    • Size impact is often sub-linear when partitions are already large
  3. hashFunctionCount: The number of partition mapping functions used.
    • Default: 3
    • Recommended range: 1 to 8
    • Impact: Higher values increase selectivity and index lookups
    • Usually increases index size close to linearly
Sparse index cannot be created on a column that also has an inverted index enabled.
Sparse index is not supported on MAP columns.

Advanced Configuration Parameters

These parameters are rarely needed and should only be used in specific scenarios:
  1. seedGenerator: An integer seed for generating hash functions.
    • Only necessary if hash collisions are unusually frequent
  2. mapperId: Specifies the hash function type.
    • Currently, only “murmur” is supported

False-Positive and Size Formulas

The following formulas are practical approximations for tuning.
Assume:
  • RR: rows in segment
  • cc: chunkSize
  • mm: partitions
  • kk: hashFunctionCount
  • vv: average values per row for the indexed column (v=1v = 1 for single-value)
  • Nc=R/cN_c = \lceil R / c \rceil and n=cvn = c \cdot v
Expected unique partitions touched in one chunk for one hash: u=m(1en/m)u = m \left(1 - e^{-n/m}\right) Chunk-level false-positive rate for non-existing keys: pfp,chunk(um)k(1en/m)kp_{\mathrm{fp},\mathrm{chunk}} \approx \left(\frac{u}{m}\right)^k \approx \left(1 - e^{-n/m}\right)^k Expected false-positive chunks per segment: Efp,chunksNcpfp,chunkE_{\mathrm{fp},\mathrm{chunks}} \approx N_c \cdot p_{\mathrm{fp},\mathrm{chunk}} Candidate-docs proxy: EcandidateDocscEfp,chunksE_{\mathrm{candidateDocs}} \approx c \cdot E_{\mathrm{fp},\mathrm{chunks}} Index-size proxy: EentriesNckuE_{\mathrm{entries}} \approx N_c \cdot k \cdot u SizeBytesbEentries\mathrm{SizeBytes} \approx b \cdot E_{\mathrm{entries}} Where bb is calibrated bytes per entry for your dataset. For single-value columns (v=1v = 1) and fixed c,m,kc, m, k, you can derive: Srequired(bRkmc)pfp,chunk,target1/kS_{\mathrm{required}} \approx \left(\frac{b \cdot R \cdot k \cdot m}{c}\right) \cdot p_{\mathrm{fp},\mathrm{chunk,target}}^{1/k} This gives required index size for a target chunk-level false-positive rate.

Optimization Workflow

Use this workflow to tune sparse index safely:
  1. Pick the right columns
    • High-cardinality columns queried with = or IN.
    • Especially useful when an inverted index would be too large.
  2. Benchmark baseline
    • Run representative queries and capture:
      • timeUsedMs
      • numEntriesScannedInFilter
      • numDocsScanned
  3. Tune in this order
    • chunkSize first (often largest impact on scan reduction)
    • partitions second (improve selectivity with moderate size growth)
    • hashFunctionCount third (precision gains with near-linear size growth)
  4. Evaluate each candidate
    • Compare p50/p95 latency, numEntriesScannedInFilter, and index size.
    • Choose the smallest index that meets your latency and scan targets.
  5. Rebuild/reload segments after config updates
    • Ensure segments are reloaded so new index settings take effect.

Performance Considerations

  1. Tiered Storage Benefits: Sparse indexes provide the greatest performance improvements with tiered storage because they reduce object-store reads.
  2. Cardinality Trade-offs: For low-cardinality columns, traditional inverted indexes are often more efficient.
  3. Predicate Coverage: Sparse indexes are designed for equality/IN filter acceleration. Final predicate evaluation still runs on candidate rows.
  4. Storage Impact:
    • hashFunctionCount often increases index size close to linearly.
    • partitions increases size, but not always linearly in practice.
    • chunkSize trades off index size versus scan precision.