$29
Fine-Tuning Pipeline
Python data pipeline for LLM fine-tuning with data cleaning, formatting, and quality checks.
Tags: Markdown, Python, LLM, OpenAI
📁 File Structure (11 files)
fine-tuning-pipeline/
├── LICENSE
├── README.md
├── examples/
│   ├── basic_usage.py
│   └── sample_training_data.jsonl
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_project-structure.md
│   ├── 03_supported-formats.md
│   └── 04_faq.md
├── index.html
└── src/
    └── fine_tuning_pipeline.py
📖 Documentation Preview README excerpt
Fine-Tuning Pipeline
Python data pipeline for LLM fine-tuning: data cleaning, JSONL formatting, validation, train/test splitting, quality checks, and dataset statistics. No external dependencies (Python 3.10+ standard library only).
Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).
Features
- Format detection — Auto-detects chat, completion, and instruction (Alpaca) formats
- Format conversion — Convert between OpenAI chat, legacy completion, and Alpaca formats
- Text cleaning — Normalize whitespace, smart quotes, control characters, and unicode
- Validation — Check for missing fields, empty content, token limits, and format errors
- Token counting — Approximate token counts for budget estimation
- Train/test split — Reproducible random splitting with configurable ratio
- Dataset statistics — Token distributions, format breakdown, and system message analysis
- CLI interface — Full pipeline from raw data to fine-tuning-ready output
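The format detection and text cleaning features above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names `detect_format` and `clean_text` and the key-based heuristics are assumptions. Only the three format names (chat, completion, instruction) come from the feature list.

```python
import json
import re
import unicodedata


def detect_format(record: dict) -> str:
    """Guess the training-example format from the keys present (hypothetical helper)."""
    if isinstance(record.get("messages"), list):
        return "chat"          # OpenAI chat: {"messages": [{"role": ..., "content": ...}]}
    if "instruction" in record and "output" in record:
        return "instruction"   # Alpaca: {"instruction": ..., "input": ..., "output": ...}
    if "prompt" in record and "completion" in record:
        return "completion"    # legacy completion: {"prompt": ..., "completion": ...}
    return "unknown"


# Map typographic quotes to their ASCII equivalents.
_SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def clean_text(text: str) -> str:
    """Normalize unicode, smart quotes, control characters, and whitespace (hypothetical helper)."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_SMART_QUOTES)
    # Drop control characters, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Usage: classify one JSONL line and clean its fields.
line = '{"prompt": "  Translate  to French: \\u201cHello\\u201d ", "completion": " Bonjour "}'
record = json.loads(line)
kind = detect_format(record)
cleaned = {key: clean_text(value) for key, value in record.items()}
```

Checking `messages` first means a record that happens to carry extra keys is still treated as chat data; a real implementation may order or weight these checks differently.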
Quick Start
# Run demo with sample data
python src/fine_tuning_pipeline.py --demo
# Run the full pipeline: clean → convert → validate → split
python src/fine_tuning_pipeline.py --input raw_data.jsonl --output prepared/
# Validate a dataset
python src/fine_tuning_pipeline.py --validate dataset.jsonl
# Show dataset statistics
python src/fine_tuning_pipeline.py --stats dataset.jsonl
# Split with custom ratio
python src/fine_tuning_pipeline.py --split dataset.jsonl --ratio 0.9 --output prepared/
Project Structure
fine-tuning-pipeline/
├── README.md
├── LICENSE
├── src/
│   └── fine_tuning_pipeline.py       # Core engine (~430 lines)
└── examples/
    ├── basic_usage.py                # Programmatic usage example
    └── sample_training_data.jsonl    # Sample data in mixed formats
CLI Reference
| Flag | Description |
|---|---|
| `--demo` | Run demo with sample data |
| `--input FILE` | Input data file (JSONL) |
| `--output DIR` | Output directory (default: ./prepared) |
| `--validate FILE` | Validate a dataset file |
| `--stats FILE` | Show dataset statistics |
| `--split FILE` | Split a dataset into train/test |
| `--ratio FLOAT` | Train/test split ratio (default: 0.8) |
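For readers who want to see how such a flag set maps onto `argparse`, here is a rough sketch. The flag names and defaults are taken from the table above; the function name `build_parser`, the help strings, and the overall layout are assumptions, and the real script may wire things differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="fine_tuning_pipeline.py",
        description="Prepare JSONL data for LLM fine-tuning.",
    )
    parser.add_argument("--demo", action="store_true",
                        help="Run demo with sample data")
    parser.add_argument("--input", metavar="FILE",
                        help="Input data file (JSONL)")
    parser.add_argument("--output", metavar="DIR", default="./prepared",
                        help="Output directory")
    parser.add_argument("--validate", metavar="FILE",
                        help="Validate a dataset file")
    parser.add_argument("--stats", metavar="FILE",
                        help="Show dataset statistics")
    parser.add_argument("--split", metavar="FILE",
                        help="Split a dataset into train/test")
    parser.add_argument("--ratio", type=float, default=0.8,
                        help="Train/test split ratio")
    return parser


# Usage: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--split", "dataset.jsonl", "--ratio", "0.9"])
```

Passing an explicit list to `parse_args` keeps the sketch testable without touching the process's real command line.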
... continues with setup instructions, usage examples, and more.
📄 Code Sample .py preview
src/fine_tuning_pipeline.py
#!/usr/bin/env python3
"""
Fine-Tuning Pipeline — AI Toolkit (DataNest)
A complete fine-tuning data pipeline: data cleaning, JSONL formatting,
validation, train/test splitting, quality checks, token counting, and
dataset statistics. Prepares data for fine-tuning any LLM (OpenAI,
Anthropic, open-source). Zero external dependencies — Python 3.10+ stdlib only.
Usage:
python fine_tuning_pipeline.py --input raw_data.jsonl --output prepared/
python fine_tuning_pipeline.py --validate dataset.jsonl
python fine_tuning_pipeline.py --stats dataset.jsonl
python fine_tuning_pipeline.py --demo
python fine_tuning_pipeline.py --split dataset.jsonl --ratio 0.8 --output prepared/
"""
from __future__ import annotations
import argparse
import json
import logging
import random
import re
import statistics
import sys
import time
from dataclasses import dataclass, field
from enum import Enum, auto
from pathlib import Path
from typing import Any
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("fine_tuning_pipeline")
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
MAX_TOKENS_PER_EXAMPLE: int = 4096 # safety limit per training example
MIN_COMPLETION_LENGTH: int = 5 # completions shorter than this are suspect
MAX_COMPLETION_LENGTH: int = 8000 # upper bound for most fine-tuning APIs
MIN_PROMPT_LENGTH: int = 10 # prompts shorter than this are too vague
CHARS_PER_TOKEN: float = 4.0 # rough estimate for token counting
DEFAULT_TRAIN_RATIO: float = 0.8 # 80/20 train/test split
SEED: int = 42 # reproducible shuffling
# ... 626 more lines ...
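The preview stops before the constants are used. As a hedged sketch of how they might drive token estimation, length checks, and a reproducible split, consider the code below. The constants are copied from the preview; the functions `estimate_tokens`, `check_lengths`, and `split_dataset` are hypothetical, and treating the `MIN_*` limits as character counts is an assumption the preview does not confirm.

```python
import random

# Values copied from the constants shown in the preview.
CHARS_PER_TOKEN = 4.0
MAX_TOKENS_PER_EXAMPLE = 4096
MIN_COMPLETION_LENGTH = 5
MIN_PROMPT_LENGTH = 10
DEFAULT_TRAIN_RATIO = 0.8
SEED = 42


def estimate_tokens(text: str) -> int:
    """Approximate token count as characters / CHARS_PER_TOKEN (hypothetical helper)."""
    if not text:
        return 0
    return max(1, int(len(text) / CHARS_PER_TOKEN))


def check_lengths(prompt: str, completion: str) -> list[str]:
    """Return human-readable issues; limits are assumed to be in characters."""
    issues = []
    if len(prompt) < MIN_PROMPT_LENGTH:
        issues.append("prompt too short")
    if len(completion) < MIN_COMPLETION_LENGTH:
        issues.append("completion too short")
    if estimate_tokens(prompt) + estimate_tokens(completion) > MAX_TOKENS_PER_EXAMPLE:
        issues.append("example exceeds token limit")
    return issues


def split_dataset(examples, ratio=DEFAULT_TRAIN_RATIO, seed=SEED):
    """Shuffle with a private seeded RNG, then cut at the given train fraction."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # does not disturb the global RNG state
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


# Usage: ten toy examples split 80/20.
train, test = split_dataset(range(10))
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed` keeps the split reproducible without changing random state elsewhere in a program.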