← Back to all products

Fine-Tuning Pipeline

$29

Python data pipeline for LLM fine-tuning with data cleaning, formatting, and quality checks.

📁 11 files
MarkdownPythonLLMOpenAI

📄 Product Preview

Try the interactive reader and demo tools below, or get the full product with all content unlocked.

📖 Interactive Reader (Free Preview) ⚙ Try Demo Tools 📦 Download Free Sample

📁 File Structure 11 files

fine-tuning-pipeline/ ├── LICENSE ├── README.md ├── examples/ │ ├── basic_usage.py │ └── sample_training_data.jsonl ├── free-sample.zip ├── guide/ │ ├── 01_features.md │ ├── 02_project-structure.md │ ├── 03_supported-formats.md │ └── 04_faq.md ├── index.html └── src/ └── fine_tuning_pipeline.py

📖 Documentation Preview README excerpt

Fine-Tuning Pipeline

Python data pipeline for LLM fine-tuning: data cleaning, JSONL formatting, validation, train/test splitting, quality checks, and dataset statistics. Zero dependencies.

Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).

Features

  • Format detection — Auto-detects chat, completion, and instruction (Alpaca) formats
  • Format conversion — Convert between OpenAI chat, legacy completion, and Alpaca formats
  • Text cleaning — Normalize whitespace, smart quotes, control characters, and unicode
  • Validation — Check for missing fields, empty content, token limits, and format errors
  • Token counting — Approximate token counts for budget estimation
  • Train/test split — Reproducible random splitting with configurable ratio
  • Dataset statistics — Token distributions, format breakdown, and system message analysis
  • CLI interface — Full pipeline from raw data to fine-tuning-ready output

Quick Start


# Run demo with sample data
python src/fine_tuning_pipeline.py --demo

# Run the full pipeline: clean → convert → validate → split
python src/fine_tuning_pipeline.py --input raw_data.jsonl --output prepared/

# Validate a dataset
python src/fine_tuning_pipeline.py --validate dataset.jsonl

# Show dataset statistics
python src/fine_tuning_pipeline.py --stats dataset.jsonl

# Split with custom ratio
python src/fine_tuning_pipeline.py --split dataset.jsonl --ratio 0.9 --output prepared/

Project Structure


fine-tuning-pipeline/
├── README.md
├── LICENSE
├── src/
│   └── fine_tuning_pipeline.py    # Core engine (~430 lines)
└── examples/
    ├── basic_usage.py              # Programmatic usage example
    └── sample_training_data.jsonl  # Sample data in mixed formats

CLI Reference

FlagDescription
--demoRun demo with sample data
--input FILEInput data file (JSONL)
--output DIROutput directory (default: ./prepared)
--validate FILEValidate a dataset file
--stats FILEShow dataset statistics
--split FILESplit a dataset into train/test
--ratio FLOATTrain/test split ratio (default: 0.8)

... continues with setup instructions, usage examples, and more.

📄 Code Sample .py preview

src/fine_tuning_pipeline.py #!/usr/bin/env python3 """ Fine-Tuning Pipeline — AI Toolkit (DataNest) A complete fine-tuning data pipeline: data cleaning, JSONL formatting, validation, train/test splitting, quality checks, token counting, and dataset statistics. Prepares data for fine-tuning any LLM (OpenAI, Anthropic, open-source). Zero external dependencies — Python 3.10+ stdlib only. Usage: python fine_tuning_pipeline.py --input raw_data.jsonl --output prepared/ python fine_tuning_pipeline.py --validate dataset.jsonl python fine_tuning_pipeline.py --stats dataset.jsonl python fine_tuning_pipeline.py --demo python fine_tuning_pipeline.py --split dataset.jsonl --ratio 0.8 --output prepared/ """ from __future__ import annotations import argparse import json import logging import random import re import statistics import sys import time from dataclasses import dataclass, field from enum import Enum, auto from pathlib import Path from typing import Any # --------------------------------------------------------------------------- # Logging # --------------------------------------------------------------------------- logging.basicConfig( level=logging.INFO, format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", ) logger = logging.getLogger("fine_tuning_pipeline") # --------------------------------------------------------------------------- # Constants # --------------------------------------------------------------------------- MAX_TOKENS_PER_EXAMPLE: int = 4096 # safety limit per training example MIN_COMPLETION_LENGTH: int = 5 # completions shorter than this are suspect MAX_COMPLETION_LENGTH: int = 8000 # upper bound for most fine-tuning APIs MIN_PROMPT_LENGTH: int = 10 # prompts shorter than this are too vague CHARS_PER_TOKEN: float = 4.0 # rough estimate for token counting DEFAULT_TRAIN_RATIO: float = 0.8 # 80/20 train/test split SEED: int = 42 # reproducible shuffling # ... 626 more lines ...
Buy Now — $29 Back to Products