$29
Vector Search Setup
Python vector search with index building, cosine similarity, and approximate nearest neighbor search.
Markdown · Python
📄 Product Preview
📁 File Structure 11 files
vector-search-setup/
├── LICENSE
├── README.md
├── examples/
│   ├── basic_usage.py
│   └── sample_documents.jsonl
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_project-structure.md
│   ├── 03_data-format.md
│   └── 04_faq.md
├── index.html
└── src/
    └── vector_search_setup.py
📖 Documentation Preview README excerpt
Vector Search Setup
Python vector search engine with index building, cosine similarity, LSH-powered approximate nearest neighbors, and a query API. All math from scratch. Zero dependencies.
Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).
Features
- Cosine similarity — Manual implementation of dot product, magnitude, and cosine similarity
- LSH indexing — Locality-sensitive hashing for sub-linear approximate nearest neighbor search
- Exact search — Brute-force search for small indexes where perfect recall matters
- Text-to-vector — Hashed bag-of-words encoder converts text to fixed-dimension vectors
- Vocabulary builder — Automatic vocabulary extraction from your document corpus
- Index persistence — Save and load indexes to/from JSON files
- Performance benchmark — Compare exact vs. approximate search speed and recall
- CLI interface — Build indexes, query them, and run benchmarks from the terminal
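The text-to-vector, cosine similarity, and exact search features can be illustrated with a minimal sketch. This is not the product's code: the function names (`encode`, `cosine_similarity`, `exact_search`), the use of MD5 for bucket selection, and the whitespace tokenizer are all assumptions chosen for illustration.

```python
import hashlib
import math

DIM = 64          # assumed vector dimension for this sketch
EPSILON = 1e-10   # guards against division by zero for all-zero vectors


def encode(text: str, dim: int = DIM) -> list[float]:
    """Hashed bag-of-words: each token increments one bucket picked by a stable hash."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    return vec


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def magnitude(a: list[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 when either vector is all zeros."""
    return dot(a, b) / (magnitude(a) * magnitude(b) + EPSILON)


def exact_search(query: str, docs: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Brute force: score every document against the query and keep the top_k best."""
    q = encode(query)
    scored = [(cosine_similarity(q, encode(doc)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because every document is scored, the cost grows linearly with the index size, which is the trade-off the LSH feature is meant to address. Hash collisions between unrelated tokens are possible at small dimensions, so a larger `dim` generally gives cleaner similarity scores.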
Quick Start
# Run the interactive demo with built-in sample data
python src/vector_search_setup.py --demo
# Build an index from a JSONL file
python src/vector_search_setup.py --build-index data.jsonl --output my_index.json
# Query an existing index
python src/vector_search_setup.py --query "machine learning algorithms" --index my_index.json --top-k 5
# Run performance benchmark
python src/vector_search_setup.py --benchmark --dim 128 --num-vectors 5000
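The `--build-index` command expects a JSONL file, meaning one JSON object per line. The preview does not show the expected schema (that is covered in `guide/03_data-format.md`), so the field names `id` and `text` below are purely hypothetical. The sketch only shows how such a file could be written and read back with the standard library.

```python
import json
from pathlib import Path

# Hypothetical records: the field names "id" and "text" are assumptions,
# not the schema documented in guide/03_data-format.md.
documents = [
    {"id": "doc-1", "text": "Machine learning algorithms learn patterns from data."},
    {"id": "doc-2", "text": "Vector search finds the nearest neighbors of a query."},
]


def write_jsonl(path: str, records: list[dict]) -> None:
    """Write one JSON object per line, which is what makes a file JSONL."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling `write_jsonl("data.jsonl", documents)` would produce a file shaped like the one passed to `--build-index` above, provided the real schema matches these assumed fields.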
Project Structure
vector-search-setup/
├── README.md
├── LICENSE
├── src/
│   └── vector_search_setup.py   # Core engine (~400 lines)
└── examples/
    ├── basic_usage.py           # Programmatic usage example
    └── sample_documents.jsonl   # Sample data for index building
CLI Reference
| Flag | Description |
|---|---|
| `--demo` | Run demo with built-in sample data |
| `--build-index FILE` | Build index from JSONL file |
| `--output FILE` | Output path for built index (default: index.json) |
| `--query TEXT` | Search query text |
| `--index FILE` | Path to a saved index file |
| `--top-k N` | Number of results (default: 5) |
| `--exact` | Use exact (brute force) search |
| `--benchmark` | Run performance benchmark |
| `--dim N` | Vector dimension for benchmark (default: 64) |
| `--num-vectors N` | Number of vectors for benchmark (default: 1000) |
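For readers who want to see how the flags in this table could map onto `argparse` (which the source file imports), here is a sketch. The flag names and defaults come from the table; the function name `build_parser`, the help strings, and the way the product actually wires its parser are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser mirroring the CLI table; not the product's actual code."""
    parser = argparse.ArgumentParser(prog="vector_search_setup.py")
    parser.add_argument("--demo", action="store_true",
                        help="Run demo with built-in sample data")
    parser.add_argument("--build-index", metavar="FILE",
                        help="Build index from JSONL file")
    parser.add_argument("--output", metavar="FILE", default="index.json",
                        help="Output path for built index")
    parser.add_argument("--query", metavar="TEXT", help="Search query text")
    parser.add_argument("--index", metavar="FILE",
                        help="Path to a saved index file")
    parser.add_argument("--top-k", metavar="N", type=int, default=5,
                        help="Number of results")
    parser.add_argument("--exact", action="store_true",
                        help="Use exact (brute force) search")
    parser.add_argument("--benchmark", action="store_true",
                        help="Run performance benchmark")
    parser.add_argument("--dim", metavar="N", type=int, default=64,
                        help="Vector dimension for benchmark")
    parser.add_argument("--num-vectors", metavar="N", type=int, default=1000,
                        help="Number of vectors for benchmark")
    return parser
```

`argparse` converts dashes in flag names to underscores, so `--top-k 3` is read back as `args.top_k` and `--num-vectors` as `args.num_vectors`.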
... continues with setup instructions, usage examples, and more.
📄 Code Sample .py preview
src/vector_search_setup.py
#!/usr/bin/env python3
"""
Vector Search Setup — AI Toolkit (DataNest)
A self-contained vector search engine with index building, cosine similarity,
approximate nearest neighbors (locality-sensitive hashing), and a query API.
Implements all math from scratch — zero external dependencies.
Python 3.10+ stdlib only.
Usage:
    python vector_search_setup.py --build-index data.jsonl --output index.json
    python vector_search_setup.py --query "search text" --index index.json --top-k 5
    python vector_search_setup.py --demo
    python vector_search_setup.py --benchmark --dim 128 --num-vectors 1000
"""
from __future__ import annotations
import argparse
import hashlib
import json
import logging
import math
import random
import sys
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("vector_search_setup")
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
DEFAULT_DIM: int = 64 # default embedding dimension
DEFAULT_TOP_K: int = 5 # default number of results to return
LSH_NUM_TABLES: int = 8 # number of LSH hash tables for ANN
LSH_NUM_BITS: int = 16 # bits per hash (trade-off: precision vs speed)
EPSILON: float = 1e-10 # avoid division by zero in similarity calcs
# ---------------------------------------------------------------------------
# ... 563 more lines ...
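The preview cuts off before the LSH code, so the sketch below shows one common way the `LSH_NUM_TABLES` and `LSH_NUM_BITS` constants could be used: random-hyperplane hashing (signed random projections), a standard LSH family for cosine similarity. Whether the product uses this exact scheme is an assumption, and the names `make_hyperplanes`, `signature`, and `LSHIndex` are hypothetical.

```python
import random

LSH_NUM_TABLES = 8   # same value as the constant in the preview
LSH_NUM_BITS = 16    # same value as the constant in the preview
DIM = 64             # matches DEFAULT_DIM in the preview


def make_hyperplanes(dim: int, num_tables: int, num_bits: int, seed: int = 0):
    """One set of num_bits random Gaussian hyperplanes per hash table."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
        for _ in range(num_tables)
    ]


def signature(vec: list[float], planes: list[list[float]]) -> int:
    """Pack one bit per hyperplane: 1 if vec lies on its non-negative side."""
    bits = 0
    for plane in planes:
        side = sum(v * p for v, p in zip(vec, plane)) >= 0.0
        bits = (bits << 1) | int(side)
    return bits


class LSHIndex:
    """Hypothetical index: vectors sharing a signature share a bucket."""

    def __init__(self, dim: int = DIM, num_tables: int = LSH_NUM_TABLES,
                 num_bits: int = LSH_NUM_BITS, seed: int = 0) -> None:
        self.planes = make_hyperplanes(dim, num_tables, num_bits, seed)
        self.tables: list[dict[int, list[str]]] = [{} for _ in range(num_tables)]

    def add(self, item_id: str, vec: list[float]) -> None:
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(signature(vec, planes), []).append(item_id)

    def candidates(self, vec: list[float]) -> set[str]:
        """Union of the query's bucket in every table; re-rank these exactly."""
        found: set[str] = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(signature(vec, planes), []))
        return found
```

In this scheme, more bits per hash make buckets smaller (fewer candidates, faster re-ranking, lower recall), while more tables give a near neighbor more chances to collide with the query (higher recall, more memory). The candidate set would then typically be re-ranked with exact cosine similarity to produce the final top-k.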