$29
Embedding Generator
Python text embedding pipeline with tokenization, vector generation, similarity search, and caching.
📁 File Structure (11 files)
embedding-generator/
├── LICENSE
├── README.md
├── examples/
│   ├── basic_usage.py
│   └── sample_corpus.txt
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_quick-start.md
│   ├── 03_configuration.md
│   └── 04_license.md
├── index.html
└── src/
    └── embedding_generator.py
📖 Documentation Preview README excerpt
Embedding Generator
Python text embedding pipeline with tokenization, vector generation, similarity search, and caching. Zero dependencies.
Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).
Features
- Multiple embedding methods — Hash-trick, Bag-of-Words, and TF-IDF
- Tokenizer — Configurable tokenization with stop-word removal
- Similarity search — Cosine similarity ranking over a corpus
- LRU caching — Automatic embedding cache with configurable size
- Corpus builder — Build vocabulary and IDF from your documents
- CLI interface — Embed text, load corpora, and search from terminal
- Zero dependencies — Python stdlib only
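To make the feature list concrete, here is a minimal sketch of how a hash-trick embedding, stop-word filtering, an LRU cache, and cosine ranking can fit together using only the standard library. It is not the product's implementation: the function names (`tokenize`, `hash_embed`, `search`), the SHA-256 bucket choice, and the shortened stop-word set are all assumptions made for illustration.

```python
import hashlib
import math
import re
from functools import lru_cache

# Illustrative subset only; the shipped module defines a longer list.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "in", "to", "and"})


def tokenize(text: str) -> list[str]:
    """Lowercase, split on alphanumeric runs, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


@lru_cache(maxsize=512)
def hash_embed(text: str, dim: int = 128) -> tuple[float, ...]:
    """Hash-trick embedding: each token increments one of `dim` buckets."""
    vec = [0.0] * dim
    for token in tokenize(text):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    # L2-normalize so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return tuple(v / norm for v in vec) if norm else tuple(vec)


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity of two vectors that are unit length or all zero."""
    return sum(x * y for x, y in zip(a, b))


def search(query: str, corpus: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Return the `top_k` (score, document) pairs, best match first."""
    q = hash_embed(query)
    scored = [(cosine(q, hash_embed(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

Returning a tuple rather than a list keeps the cached value immutable, so a caller cannot accidentally modify an embedding that later cache hits would reuse. Distinct tokens can collide in the same bucket, which is the usual trade-off of the hash trick for a fixed vector size.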
Quick Start
# Embed a single text
python src/embedding_generator.py --text "machine learning is great"
# Load a corpus and search
python src/embedding_generator.py --file examples/sample_corpus.txt --query "neural networks"
# Interactive mode
python src/embedding_generator.py
# Use TF-IDF method
python src/embedding_generator.py --method tfidf --file examples/sample_corpus.txt --query "data science"
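For readers unfamiliar with what a TF-IDF method computes, the sketch below builds IDF weights from a corpus and scores a query against each document. It is an assumption-laden illustration, not the code behind `--method tfidf`: the helper names and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are choices made here, and the shipped module may weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on alphanumeric runs (no stop-word removal here)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq: Counter[str] = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(text: str, idf: dict[str, float]) -> dict[str, float]:
    """Sparse TF-IDF vector; terms missing from the vocabulary are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: (count / total) * idf[term] for term, count in counts.items()}


def cosine_sparse(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A term that appears in many documents receives a lower IDF weight than a rare one, so rare query terms dominate the ranking.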
Configuration
| Flag | Default | Description |
|---|---|---|
| `--method` | `hash` | Embedding method: `hash`, `bow`, or `tfidf` |
| `--dim` | `128` | Vector dimension (hash method only) |
| `--top-k` | `5` | Number of search results |
| `--text` | (none) | Single text to embed |
| `--file` | (none) | Corpus file (one document per line) |
| `--query` | (none) | Search query (requires `--file`) |
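The flags in the table map naturally onto an `argparse` parser. The following is a reconstruction from the table alone, not the parser shipped in `src/embedding_generator.py`; the function names `build_parser` and `parse_args`, the help strings, and the exact error message are assumptions.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Text embedding pipeline")
    parser.add_argument("--method", choices=["hash", "bow", "tfidf"],
                        default="hash", help="embedding method")
    parser.add_argument("--dim", type=int, default=128,
                        help="vector dimension (hash method only)")
    parser.add_argument("--top-k", type=int, default=5,
                        help="number of search results")
    parser.add_argument("--text", help="single text to embed")
    parser.add_argument("--file", help="corpus file, one document per line")
    parser.add_argument("--query", help="search query (requires --file)")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse arguments and enforce that --query is only used with --file."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.query and not args.file:
        parser.error("--query requires --file")  # exits with status 2
    return args
```

`argparse` converts `--top-k` to the attribute `args.top_k`, so downstream code reads the value with an underscore.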
License
MIT — use in personal, commercial, or client projects. No attribution required.
📄 Code Sample .py preview
src/embedding_generator.py
#!/usr/bin/env python3
"""
Embedding Generator — AI Toolkit (DataNest)
A self-contained text embedding pipeline with tokenization, vector generation
using hash-trick, bag-of-words, and TF-IDF approaches, similarity search, and LRU caching.
Zero external dependencies — Python 3.10+ stdlib only.
Usage:
python embedding_generator.py # interactive demo
python embedding_generator.py --text "your text here" # embed single text
python embedding_generator.py --file corpus.txt --query "search term"
"""
from __future__ import annotations
import argparse
import hashlib
import json
import logging
import math
import re
import sys
import time
from collections import Counter
from dataclasses import dataclass, field
from functools import lru_cache
from pathlib import Path
from typing import Any, Sequence
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
logger = logging.getLogger("embedding_generator")
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
CACHE_SIZE: int = 512 # max cached embeddings
DEFAULT_DIM: int = 128 # default vector dimension for hash embeddings
STOP_WORDS: frozenset[str] = frozenset(
"a an the is are was were be been being have has had do does did "
"will would shall should may might can could of in to for with on at by "
"from as into through during before after above below between out off "
"over under again further then once here there when where why how all "
"each every both few more most other some such no nor not only own same "
"so than too very and but or if while because until about against".split()
)
# ---------------------------------------------------------------------------
# Tokenizer
# ---------------------------------------------------------------------------
# ... 275 more lines ...