$19
Model Evaluation Tool
Python model evaluation suite with accuracy, precision, recall, F1, and per-class reports.
Markdown, Python
File Structure (11 files)
model-evaluation-tool/
├── LICENSE
├── README.md
├── examples/
│   ├── basic_usage.py
│   └── sample_predictions.jsonl
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_project-structure.md
│   ├── 03_usage-examples.md
│   └── 04_license.md
├── index.html
└── src/
    └── model_evaluation_tool.py
Documentation Preview (README excerpt)
Model Evaluation Tool
Python model evaluation suite: accuracy, precision, recall, F1, confusion matrix, per-class reports, and benchmark runner. All metrics computed from scratch. Zero dependencies.
Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).
Features
- Core metrics — Accuracy, precision, recall, F1 score (macro and weighted)
- Confusion matrix — Multi-class confusion matrix with ASCII table rendering
- Per-class report — Precision, recall, F1, and support for every class
- Benchmark runner — Evaluate across multiple test splits with mean/std statistics
- File loaders — Load labels from JSONL, plain text, or structured JSON
- JSON export — Export full reports for dashboards and CI pipelines
- CLI interface — Evaluate, benchmark, and export from the terminal
- No dependencies — All math implemented from scratch using stdlib only
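
As a rough illustration of the per-class report and the macro and weighted F1 averages listed above, here is a minimal stdlib-only sketch. The function names (`per_class_report`, `macro_f1`, `weighted_f1`) are hypothetical and are not taken from the tool's source. The sketch guards against empty classes with explicit zero checks, which may differ from how the tool itself handles division by zero.

```python
from collections import Counter


def per_class_report(y_true: list[str], y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Precision, recall, F1, and support per class (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # TP + FP per class
    support = Counter(y_true)    # TP + FN per class
    report: dict[str, dict[str, float]] = {}
    for label in labels:
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support[label],
        }
    return report


def macro_f1(report: dict[str, dict[str, float]]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    if not report:
        return 0.0
    return sum(r["f1"] for r in report.values()) / len(report)


def weighted_f1(report: dict[str, dict[str, float]]) -> float:
    """Mean of per-class F1 weighted by support (number of true instances)."""
    total = sum(r["support"] for r in report.values())
    if not total:
        return 0.0
    return sum(r["f1"] * r["support"] for r in report.values()) / total


# Made-up labels for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]
report = per_class_report(y_true, y_pred)
print(report["neg"], macro_f1(report), weighted_f1(report))
```

Macro averaging treats a rare class the same as a common one, while weighted averaging lets frequent classes dominate, so the two can diverge noticeably on imbalanced data.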
Quick Start
# Run demo with sample sentiment analysis data
python src/model_evaluation_tool.py --demo
# Evaluate predictions vs. labels
python src/model_evaluation_tool.py --predictions preds.txt --labels true.txt
# Export report as JSON
python src/model_evaluation_tool.py --predictions preds.txt --labels true.txt --export report.json
# Print confusion matrix
python src/model_evaluation_tool.py --confusion-matrix preds.txt true.txt
# Run benchmark across splits
python src/model_evaluation_tool.py --benchmark preds.txt true.txt --export bench.json
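
The benchmark mode reports mean and standard deviation across test splits. The sketch below shows one way such a summary could be computed with the stdlib `statistics` module. It assumes contiguous, roughly equal-sized splits and the population standard deviation; how the tool actually forms its splits is not shown in this preview, and `simple_accuracy` and `benchmark_accuracy` are hypothetical names.

```python
import statistics


def simple_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def benchmark_accuracy(y_true: list[str], y_pred: list[str], n_splits: int = 3) -> dict:
    """Score contiguous splits and summarise them (assumed splitting scheme)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    size = len(y_true) // n_splits
    if n_splits < 1 or size == 0:
        raise ValueError("need at least one sample per split")
    scores = []
    for i in range(n_splits):
        start = i * size
        # The last split absorbs any remainder.
        end = start + size if i < n_splits - 1 else len(y_true)
        scores.append(simple_accuracy(y_true[start:end], y_pred[start:end]))
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),  # population std; an assumption
    }


# Made-up labels for illustration only.
result = benchmark_accuracy(
    ["a", "a", "b", "b", "a", "b"],
    ["a", "b", "b", "b", "a", "a"],
    n_splits=3,
)
print(result)
```

A large standard deviation relative to the mean suggests the score depends heavily on which samples land in a split.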
Project Structure
model-evaluation-tool/
├── README.md
├── LICENSE
├── src/
│   └── model_evaluation_tool.py    # Core engine (~350 lines)
└── examples/
    ├── basic_usage.py              # Programmatic usage example
    └── sample_predictions.jsonl    # Sample prediction data
CLI Reference
| Flag | Description |
|---|---|
| `--demo` | Run demo with built-in sample data |
| `--predictions FILE` | Predictions file (JSONL or plain text) |
| `--labels FILE` | True labels file (JSONL or plain text) |
| `--confusion-matrix PREDS LABELS` | Print confusion matrix |
| `--benchmark PREDS LABELS` | Run benchmark evaluation |
| `--export FILE` | Export report to JSON |
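
To give a sense of what a multi-class confusion matrix with ASCII rendering involves, here is a small stdlib-only sketch. It assumes rows are true labels and columns are predicted labels; the function names and the table layout are illustrative and do not reflect the tool's actual output format.

```python
from collections import Counter


def confusion_matrix(y_true: list[str], y_pred: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return sorted labels and a matrix where matrix[i][j] counts
    samples with true label i predicted as label j."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def render_ascii(labels: list[str], matrix: list[list[int]]) -> str:
    """Render the matrix as a right-aligned plain-text table."""
    corner = "true/pred"
    cells = [str(c) for row in matrix for c in row]
    width = max(len(s) for s in [corner, *labels, *cells])
    header = " | ".join(s.rjust(width) for s in [corner, *labels])
    lines = [header, "-" * len(header)]
    for label, row in zip(labels, matrix):
        lines.append(" | ".join(s.rjust(width) for s in [label, *map(str, row)]))
    return "\n".join(lines)


# Made-up labels for illustration only.
labels, matrix = confusion_matrix(
    ["pos", "pos", "neg", "neg", "neu"],
    ["pos", "neg", "neg", "neg", "pos"],
)
table = render_ascii(labels, matrix)
print(table)
```

The diagonal holds correct predictions; off-diagonal cells show which classes are being confused with which.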
... continues with setup instructions, usage examples, and more.
Code Sample (.py preview)
src/model_evaluation_tool.py
#!/usr/bin/env python3
"""
Model Evaluation Tool — AI Toolkit (DataNest)
A complete model evaluation suite: accuracy, precision, recall, F1 score,
confusion matrix, per-class metrics, and a benchmark runner — all computed
from scratch. Zero external dependencies — Python 3.10+ stdlib only.
Usage:
    python model_evaluation_tool.py --predictions preds.jsonl --labels labels.jsonl
    python model_evaluation_tool.py --demo
    python model_evaluation_tool.py --confusion-matrix preds.jsonl labels.jsonl
    python model_evaluation_tool.py --benchmark preds.jsonl labels.jsonl --export report.json
"""
from __future__ import annotations
import argparse
import json
import logging
import statistics
import sys
import time
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("model_evaluation_tool")
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
EPSILON: float = 1e-10 # avoid division by zero
# ---------------------------------------------------------------------------
# Core Metrics — implemented from scratch
# ---------------------------------------------------------------------------
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """
    Compute accuracy: fraction of predictions that match the true labels.
    This is the simplest metric but can be misleading with imbalanced
# ... 465 more lines ...