OpenData Vector: MIT-licensed Vector Search on Object Storage | OpenData

▲ 48 points 5 comments by apurvamehta 1w ago HN discussion ↗


OpenData Vector fills the gap between running pgvector and paying a vendor many multiples of hardware costs to operate a search database for you.

MIT-licensed and built on SlateDB, OpenData Vector is a stateless, durable, and highly available search engine that runs anywhere with access to object storage. It is designed to be simple enough to operate yourself, and efficient enough to serve 100M vectors for roughly $350/mo.

Stateless vector search

There's growing consensus in the database community that object storage is a "good thing". Durability (99.999999999% SLAs), cost efficiency (1/4 the storage cost, $0 cross-AZ networking), and strong consistency are all distributed-systems nightmares, and object storage solves them.

Because of this, over the last decade online systems have steadily increased their dependency on object storage. First-generation systems were tiered, second-generation systems were fully disaggregated, and third-generation (current) systems are stateless.

┏━Gen 1: Tiered Storage━━━━━━━━━━┓  ┏━Gen 2: Disaggregated━━━━━━━━━━━┓
┃                                ┃  ┃                                ┃
┃    ╭──────────────────────╮    ┃  ┃    ╭──────────────────────╮    ┃
┃    │        Query         │    ┃  ┃    │        Query         │    ┃
┃    ╰───────────┬──────────╯    ┃  ┃    ╰───────────┬──────────╯    ┃
┃                │               ┃  ┃                │               ┃
┃      ┌─────────┘               ┃  ┃     ┌──────────┘               ┃
┃      │                         ┃  ┃     │     ╭─────────╮          ┃
┃   ╔══▼═╗    ╔════╗    ╔════╗   ┃  ┃  ┌──▼─┐   │ Cluster │    ┌────┐┃
┃   ║ A  ├────▶ B  ├────▶ C  ║   ┃  ┃  │ A  ├───▶ Manager ◀────┤ B  │┃
┃   ╚══┬═╝    ╚══┬═╝    ╚══┬═╝   ┃  ┃  └──┬─┘   ╰─────────╯    └──┬─┘┃
┃      └─────────┼─────────┘     ┃  ┃     └──────────┬────────────┘  ┃
┃                │               ┃  ┃                │               ┃
┃    ┌───────────▼──────────┐    ┃  ┃    ╔═══════════▼══════════╗    ┃
┃    │      S3 (cold)       │    ┃  ┃    ║  S3 (durable state)  ║    ┃
┃    └──────────────────────┘    ┃  ┃    ╚══════════════════════╝    ┃
┃                                ┃  ┃                                ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛  ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

┏━Gen 3: Stateless━━━━━━━━━━━━━━━┓
┃                                ┃
┃    ╭──────────────────────╮    ┃
┃    │        Query         │    ┃
┃    ╰───────────┬──────────╯    ┃
┃                │               ┃
┃      ┌─────────┼─────────┐     ┃
┃      │         │         │     ┃
┃   ┌──▼─┐    ┌──▼─┐    ┌──▼─┐   ┃
┃   │ A  │    │ B  │    │ C  │   ┃
┃   └──┬─┘    └──┬─┘    └──┬─┘   ┃
┃      └─────────┼─────────┘     ┃
┃                │               ┃
┃    ╔═══════════▼══════════╗    ┃
┃    ║   S3 (only truth)    ║    ┃
┃    ╚══════════════════════╝    ┃
┃                                ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Second-generation systems (such as Chroma and Milvus) improved on the first by delegating durability and replication to object storage, but their nodes are still statefully assigned shards to serve from local data, and the nodes must coordinate with each other to rebalance work. The third generation improves on the second by delegating metadata management to object storage as well, enabling any node to serve any data.

These stateless systems are significantly simpler, and therefore cheaper, more reliable, and easier to operate than their predecessors.

It is challenging to adapt a second-generation system to be truly stateless; instead, systems must be designed from the ground up (leader election, caching, compute allocation, etc.) to natively support object storage.

As far as we are aware, OpenData Vector is the only open-source, third-generation online vector database that is generally available (Turbopuffer is a proprietary system that shares a similar architecture).

Vector's stateless architecture

Vector makes a stateless vector search engine possible through three key architectural decisions: IVF indexing, LIRE compaction on an LSM tree, and share-everything state. We detail these decisions in our public RFC on GitHub.

IVF/SPANN Indexing

To be competitive on performance and cost with a stateless architecture, Vector's index needs to be optimized for object storage's high latency and expensive GET requests. This means fetching index data in batches.

Vector does this by maintaining an inverted-file index (IVF) based on SPFresh. The index groups vectors into clusters using k-means. Each cluster is represented by a "central" vector called a centroid, which holds references to the vectors in its cluster via a "posting list" in SlateDB. Search proceeds by finding the centroids nearest to the query, and then exhaustively scoring the vectors from their posting lists.

The main alternatives to IVF-style indexes are graph-based indexes like HNSW or Vamana. We chose an inverted index for two main reasons. First, graph indexes require a traversal that hops node by node across the index, which means sequential GET requests to object storage (where first-byte latency can be up to 100ms). IVF, despite its relatively imprecise retrieval algorithm, can batch-load data in a single round trip. On cache misses, which must go all the way to object storage, the saved round trips more than make up for IVF's extra compute. Second, graph-based indexes are very expensive to maintain incrementally as new writes arrive.

LSM-based LIRE Compaction

Since object storage doesn't tolerate read-modify-write loops, data ingestion and index maintenance must use an append-only model, with reconciliation happening during compaction.

To support append-only incremental updates to the index, Vector adapts the LIRE protocol from SPFresh. For each batch of writes, the writer adds the new vectors to the postings of their closest centroids. When a posting grows too large, the writer splits it by re-running k-means and computing new centroids.

LIRE requires many incremental updates, both to add vectors to posting lists and to move them between posting lists.
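
The indexing scheme described in the two sections above can be illustrated with a minimal in-memory sketch. It makes no claim about Vector's actual code: `TinyIVF`, `max_posting`, and `nprobe` are hypothetical names, posting lists are plain Python lists rather than SlateDB entries, the split is a simple 2-means, and vectors are never moved between neighbouring posting lists as LIRE would do.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class TinyIVF:
    """Toy IVF index: centroids plus one posting list per centroid."""

    def __init__(self, centroids, max_posting=4):
        self.centroids = [list(c) for c in centroids]
        self.postings = [[] for _ in self.centroids]  # lists of (id, vector)
        self.max_posting = max_posting  # hypothetical split threshold

    def nearest_centroids(self, v, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dist2(v, self.centroids[i]))
        return order[:n]

    def insert(self, vid, v):
        # Append to the closest centroid's posting; split if it grew too large.
        c = self.nearest_centroids(v, 1)[0]
        self.postings[c].append((vid, list(v)))
        if len(self.postings[c]) > self.max_posting:
            self._split(c)

    def _split(self, c):
        # 2-means over the oversized posting, seeded with the first vector
        # and the vector farthest from it.
        items = self.postings[c]
        a = items[0][1]
        b = max(items, key=lambda it: dist2(a, it[1]))[1]
        left = right = None
        for _ in range(10):
            near_a = [it for it in items if dist2(it[1], a) <= dist2(it[1], b)]
            near_b = [it for it in items if dist2(it[1], a) > dist2(it[1], b)]
            if not near_a or not near_b:
                break
            left, right = near_a, near_b
            a = mean([v for _, v in left])
            b = mean([v for _, v in right])
        if left is None:
            return  # all vectors identical, nothing to split
        self.centroids[c], self.postings[c] = a, left
        self.centroids.append(b)
        self.postings.append(right)

    def search(self, query, k=3, nprobe=2):
        # Probe the nprobe nearest centroids, then score their postings
        # exhaustively. Each posting stands in for one batched read.
        candidates = []
        for c in self.nearest_centroids(query, nprobe):
            candidates.extend(self.postings[c])
        candidates.sort(key=lambda it: dist2(query, it[1]))
        return [vid for vid, _ in candidates[:k]]


index = TinyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]], max_posting=4)
rng = random.Random(7)
points = {i: [rng.uniform(0, 10), rng.uniform(0, 10)] for i in range(40)}
for vid, v in points.items():
    index.insert(vid, v)

query = [3.0, 4.0]
approximate = index.search(query, k=5, nprobe=2)
exact = index.search(query, k=5, nprobe=len(index.centroids))
```

Raising `nprobe` trades more posting-list fetches and more scoring work for better recall; when `nprobe` equals the number of centroids, the search degenerates into an exact scan.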

Vector applies a novel, lazy adaptation of LIRE's incremental updates using SlateDB merges to minimize write amplification and maintain high ingest throughput.

Share Everything

Vector stores all of its state in SlateDB, which means that any deployed node has full access to both metadata and data on object storage. Nodes never communicate with each other directly. Readers always operate against a SlateDB snapshot, ensuring they see a consistent view of state without relying on an external consistent database.

Simple enough to DIY deploy

As with all OpenData systems, the stateless architecture of Vector makes it feasible to run a production system on a single Kubernetes pod.

To handle a variety of requirements, however, Vector can run in several configurations that provide higher levels of availability without sacrificing the simplicity of the deployment:

┏━Topology 1: Embedded━━━━━━━━━━━━━┓  ┏━Topology 2: Single-Node━━━━━━━━━━┓
┃                                  ┃  ┃                                  ┃
┃ ╭───────Your Application───────╮ ┃  ┃ ╭────────Vector Process────────╮ ┃
┃ │ ╭──────╮ ╭──────╮ ╭────────╮ │ ┃  ┃ │      ╭──────╮  ╭──────╮      │ ┃
┃ │ │Writer│ │Reader│ │App Code│ │ ┃  ┃ │      │Writer│  │Reader│      │ ┃
┃ │ ╰──────╯ ╰──────╯ ╰────────╯ │ ┃  ┃ │      ╰──────╯  ╰──────╯      │ ┃
┃ ╰───────────────┬──────────────╯ ┃  ┃ ╰───────────────┬──────────────╯ ┃
┃                 │                ┃  ┃                 │                ┃
┃                 │                ┃  ┃                 │                ┃
┃       ╔═════════▼════════╗       ┃  ┃       ╔═════════▼════════╗       ┃
┃       ║        S3        ║       ┃  ┃       ║        S3        ║       ┃
┃       ╚══════════════════╝       ┃  ┃       ╚══════════════════╝       ┃
┃                                  ┃  ┃                                  ┃
┃                                  ┃  ┃                                  ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛  ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

┏━Topology 3: Writer + Readers━━━━━┓  ┏━Topology 4: Buffered Ingest━━━━━━┓
┃                                  ┃  ┃                                  ┃
┃             ╭──────╮             ┃  ┃       ╭──────────────────╮       ┃
┃             │Writer│             ┃  ┃       │ Buffer Producer  │       ┃
┃             ╰───┬──╯             ┃  ┃       ╰─────────┬────────╯       ┃
┃                 │                ┃  ┃                 │                ┃
┃ ╔═══════════════▼══════════════╗ ┃  ┃ ╔═══════════════▼══════════════╗ ┃
┃ ║              S3              ║ ┃  ┃ ║              S3              ║ ┃
┃ ╚═══════════════┬══════════════╝ ┃  ┃ ╚═══════════════▲══════════════╝ ┃
┃                 │                ┃  ┃          ┌──────┴──────┐         ┃
┃             ╭───▼──╮             ┃  ┃     ╭────▼───╮    ╭────▼───╮     ┃
┃             │Reader│             ┃  ┃     │ Writer │    │ Reader │     ┃
┃             ╰──────╯             ┃  ┃     ╰────────╯    ╰────────╯     ┃
┃                                  ┃  ┃                                  ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛  ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛