
Embedding Generator

$29

Python text embedding pipeline with tokenization, vector generation, similarity search, and caching.

📁 11 files
Markdown, Python

📄 Product Preview


📁 File Structure 11 files

embedding-generator/
├── LICENSE
├── README.md
├── examples/
│   ├── basic_usage.py
│   └── sample_corpus.txt
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_quick-start.md
│   ├── 03_configuration.md
│   └── 04_license.md
├── index.html
└── src/
    └── embedding_generator.py

📖 Documentation Preview README excerpt

Embedding Generator

Python text embedding pipeline with tokenization, vector generation, similarity search, and caching. Zero dependencies.

Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).

Features

  • Multiple embedding methods — Hash-trick, Bag-of-Words, and TF-IDF
  • Tokenizer — Configurable tokenization with stop-word removal
  • Similarity search — Cosine similarity ranking over a corpus
  • LRU caching — Automatic embedding cache with configurable size
  • Corpus builder — Build vocabulary and IDF from your documents
  • CLI interface — Embed text, load corpora, and search from terminal
  • Zero dependencies — Python stdlib only
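To make the hash-trick and cosine-similarity features concrete, here is a minimal stdlib-only sketch. It is not the product's implementation: the function names `hash_embed` and `cosine`, the use of MD5 as the bucket hash, and the L2 normalisation step are all assumptions made for illustration.

```python
import hashlib
import math


def hash_embed(tokens: list[str], dim: int = 128) -> list[float]:
    """Hashing trick: bucket each token into one of `dim` slots and count it.

    Hypothetical helper; MD5 is used only as a stable, non-cryptographic
    bucket hash so results are identical across runs and machines.
    """
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % dim
        vec[index] += 1.0
    # L2-normalise so vector length does not depend on text length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


if __name__ == "__main__":
    v1 = hash_embed("machine learning is great".split())
    v2 = hash_embed("machine learning is fun".split())
    print(len(v1), cosine(v1, v2))
```

Because different tokens can land in the same bucket, a small `dim` trades accuracy for memory; that is the usual reason a hash method exposes the dimension as a setting.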

Quick Start


# Embed a single text
python src/embedding_generator.py --text "machine learning is great"

# Load a corpus and search
python src/embedding_generator.py --file examples/sample_corpus.txt --query "neural networks"

# Interactive mode
python src/embedding_generator.py

# Use TF-IDF method
python src/embedding_generator.py --method tfidf --file examples/sample_corpus.txt --query "data science"
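For readers who want to see roughly what a TF-IDF corpus search involves, the following is a hedged sketch of the general technique, not the code behind the `--method tfidf` flag. The function names, the whitespace tokenization, the smoothed IDF formula, and the three-line corpus are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def build_idf(docs: list[list[str]]) -> dict[str, float]:
    """Smoothed inverse document frequency for each term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Sparse TF-IDF vector; terms outside the corpus vocabulary are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()} if total else {}


def cosine_sparse(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query: str, corpus: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Rank corpus lines (one document per line) against the query."""
    docs = [line.lower().split() for line in corpus]
    idf = build_idf(docs)
    q = tfidf(query.lower().split(), idf)
    scored = [(cosine_sparse(q, tfidf(d, idf)), text) for d, text in zip(docs, corpus)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical corpus, standing in for a file with one document per line.
    corpus = [
        "neural networks learn representations",
        "cats sleep all day",
        "deep neural networks need data",
    ]
    for score, text in search("neural networks", corpus, top_k=2):
        print(f"{score:.3f}  {text}")
```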

Configuration

| Flag | Default | Description |
|------|---------|-------------|
| --method | hash | Embedding method: hash, bow, tfidf |
| --dim | 128 | Vector dimension (hash method only) |
| --top-k | 5 | Number of search results |
| --text | | Single text to embed |
| --file | | Corpus file (one doc per line) |
| --query | | Search query (requires --file) |

License

MIT — use in personal, commercial, or client projects. No attribution required.

📄 Code Sample .py preview

src/embedding_generator.py

#!/usr/bin/env python3
"""
Embedding Generator — AI Toolkit (DataNest)

A self-contained text embedding pipeline with tokenization, vector generation
using bag-of-words and TF-IDF approaches, similarity search, and LRU caching.
Zero external dependencies — Python 3.10+ stdlib only.

Usage:
    python embedding_generator.py                          # interactive demo
    python embedding_generator.py --text "your text here"  # embed single text
    python embedding_generator.py --file corpus.txt --query "search term"
"""
from __future__ import annotations

import argparse
import hashlib
import json
import logging
import math
import re
import sys
import time
from collections import Counter
from dataclasses import dataclass, field
from functools import lru_cache
from pathlib import Path
from typing import Any, Sequence

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
logger = logging.getLogger("embedding_generator")

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
CACHE_SIZE: int = 512   # max cached embeddings
DEFAULT_DIM: int = 128  # default vector dimension for hash embeddings

STOP_WORDS: frozenset[str] = frozenset(
    "a an the is are was were be been being have has had do does did "
    "will would shall should may might can could of in to for with on at by "
    "from as into through during before after above below between out off "
    "over under again further then once here there when where why how all "
    "each every both few more most other some such no nor not only own same "
    "so than too very and but or if while because until about against".split()
)

# ---------------------------------------------------------------------------
# Tokenizer
# ---------------------------------------------------------------------------
# ... 275 more lines ...
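The preview stops before the tokenizer, so the following is not the product's code. It is only a rough sketch of how stop-word removal and LRU caching can be combined using the `re` and `functools.lru_cache` imports seen above. The regex, the function name `tokenize`, and the shortened stop-word set are assumptions; `CACHE_SIZE` mirrors the value in the preview.

```python
import re
from functools import lru_cache

CACHE_SIZE = 512  # same value as the constant shown in the preview

# Shortened, hypothetical stop-word set; the real list in the preview is longer.
STOP_WORDS = frozenset("a an the is are of in to and or".split())

# Assumed token pattern: runs of lowercase letters and digits.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


@lru_cache(maxsize=CACHE_SIZE)
def tokenize(text: str) -> tuple[str, ...]:
    """Lowercase, split on non-alphanumerics, and drop stop words.

    Returns a tuple so cached results are immutable and cannot be
    modified by one caller and then served, altered, to the next.
    """
    return tuple(t for t in _TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS)


if __name__ == "__main__":
    print(tokenize("Machine learning is great"))
    print(tokenize("Machine learning is great"))  # second call is served from cache
    print(tokenize.cache_info())
```

Caching at the tokenizer level only pays off when the same strings recur; caching the final embedding, as the feature list describes, follows the same pattern with the embedding function as the decorated target.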