Model Evaluation Tool

$19

Python model evaluation suite with accuracy, precision, recall, F1, and per-class reports.

📁 11 files

Markdown, Python

📄 Product Preview


📁 File Structure (11 files)

model-evaluation-tool/
├── LICENSE
├── README.md
├── examples/
│   ├── basic_usage.py
│   └── sample_predictions.jsonl
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_project-structure.md
│   ├── 03_usage-examples.md
│   └── 04_license.md
├── index.html
└── src/
    └── model_evaluation_tool.py

📖 Documentation Preview (README excerpt)

Model Evaluation Tool

Python model evaluation suite: accuracy, precision, recall, F1, confusion matrix, per-class reports, and benchmark runner. All metrics computed from scratch. Zero dependencies.

Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).

Features

Core metrics — Accuracy, precision, recall, F1 score (macro and weighted)
Confusion matrix — Multi-class confusion matrix with ASCII table rendering
Per-class report — Precision, recall, F1, and support for every class
Benchmark runner — Evaluate across multiple test splits with mean/std statistics
File loaders — Load labels from JSONL, plain text, or structured JSON
JSON export — Export full reports for dashboards and CI pipelines
CLI interface — Evaluate, benchmark, and export from the terminal
No dependencies — All math implemented from scratch using stdlib only
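The per-class report and the macro and weighted averages listed above can be sketched with the standard library alone. The block below is an illustration only, not the product's code: the names `per_class_report`, `macro_average`, and `weighted_average` are hypothetical, and it assumes labels are plain strings and that a class with no predictions (or no true examples) scores 0.0 rather than raising an error.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1", "support"}} for every label seen."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # pred was claimed but wrong
            fn[truth] += 1  # truth was missed
    support = Counter(y_true)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support[label],
        }
    return report


def macro_average(report, metric):
    """Unweighted mean of a metric over classes (every class counts equally)."""
    if not report:
        return 0.0
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_average(report, metric):
    """Mean of a metric over classes, weighted by each class's support."""
    total = sum(row["support"] for row in report.values())
    if total == 0:
        return 0.0
    return sum(row[metric] * row["support"] for row in report.values()) / total


report = per_class_report(
    ["pos", "pos", "neg", "neg", "neu"],
    ["pos", "neg", "neg", "neg", "pos"],
)
print(round(macro_average(report, "f1"), 4), round(weighted_average(report, "f1"), 4))
```

In this sketch a label that appears only in the predictions still counts toward the macro average with a support of 0, which lowers the macro score; other conventions exist, and the product may handle that case differently.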

Quick Start


# Run demo with sample sentiment analysis data
python src/model_evaluation_tool.py --demo

# Evaluate predictions vs. labels
python src/model_evaluation_tool.py --predictions preds.txt --labels true.txt

# Export report as JSON
python src/model_evaluation_tool.py --predictions preds.txt --labels true.txt --export report.json

# Print confusion matrix
python src/model_evaluation_tool.py --confusion-matrix preds.txt true.txt

# Run benchmark across splits
python src/model_evaluation_tool.py --benchmark preds.txt true.txt --export bench.json
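The README does not show how the benchmark runner forms its splits, so the following is only a rough sketch of the idea of scoring several splits and reporting mean and standard deviation. It assumes contiguous, near-equal chunks and the sample standard deviation from `statistics.stdev`; the function names `contiguous_splits` and `benchmark` and the parameter `k` are hypothetical.

```python
import statistics


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def contiguous_splits(n_items, k):
    """Yield (start, stop) ranges cutting n_items into k near-equal chunks."""
    if not 1 <= k <= n_items:
        raise ValueError("k must be between 1 and the number of items")
    base, extra = divmod(n_items, k)
    start = 0
    for i in range(k):
        stop = start + base + (1 if i < extra else 0)  # first `extra` chunks get one more item
        yield start, stop
        start = stop


def benchmark(y_true, y_pred, k=5, metric=accuracy):
    """Score each split with `metric` and summarise with mean and sample std."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = [
        metric(y_true[a:b], y_pred[a:b])
        for a, b in contiguous_splits(len(y_true), k)
    ]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


result = benchmark(
    ["a", "a", "b", "b", "a", "b"],
    ["a", "b", "b", "b", "a", "a"],
    k=3,
)
print(result)
```

Passing the metric as an argument keeps the split logic independent of what is being measured, so the same runner could report F1 or recall per split.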

Project Structure


model-evaluation-tool/
├── README.md
├── LICENSE
├── src/
│   └── model_evaluation_tool.py    # Core engine (~350 lines)
└── examples/
    ├── basic_usage.py               # Programmatic usage example
    └── sample_predictions.jsonl     # Sample prediction data

CLI Reference

Flag	Description
`--demo`	Run demo with built-in sample data
`--predictions FILE`	Predictions file (JSONL or plain text)
`--labels FILE`	True labels file (JSONL or plain text)
`--confusion-matrix PREDS LABELS`	Print confusion matrix
`--benchmark PREDS LABELS`	Run benchmark evaluation
`--export FILE`	Export report to JSON
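For readers curious what a multi-class confusion matrix with ASCII rendering involves, here is a minimal sketch. It is not taken from the product: `confusion_matrix` and `render_ascii` are hypothetical names, rows are assumed to be true labels and columns predicted labels, and the table layout is one arbitrary choice among many.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true=labels[i], pred=labels[j]."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def render_ascii(labels, matrix):
    """Render the matrix as a fixed-width text table (rows: true, columns: predicted)."""
    if not labels:
        raise ValueError("nothing to render")
    cells = [str(c) for row in matrix for c in row]
    width = max(len(s) for s in [*labels, *cells])
    header = " " * width + " | " + " | ".join(name.rjust(width) for name in labels)
    lines = [header, "-" * len(header)]
    for name, row in zip(labels, matrix):
        body = " | ".join(str(c).rjust(width) for c in row)
        lines.append(name.ljust(width) + " | " + body)
    return "\n".join(lines)


labels, matrix = confusion_matrix(
    ["pos", "pos", "neg", "neg", "neu"],
    ["pos", "neg", "neg", "neg", "pos"],
)
print(render_ascii(labels, matrix))
```

Correct predictions sit on the diagonal, so off-diagonal cells show which classes are being confused with which.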

... continues with setup instructions, usage examples, and more.

📄 Code Sample (.py preview)

src/model_evaluation_tool.py

#!/usr/bin/env python3
"""
Model Evaluation Tool - AI Toolkit (DataNest)

A complete model evaluation suite: accuracy, precision, recall, F1 score,
confusion matrix, per-class metrics, and a benchmark runner, all computed
from scratch.

Zero external dependencies. Python 3.10+ stdlib only.

Usage:
    python model_evaluation_tool.py --predictions preds.jsonl --labels labels.jsonl
    python model_evaluation_tool.py --demo
    python model_evaluation_tool.py --confusion-matrix preds.jsonl labels.jsonl
    python model_evaluation_tool.py --benchmark preds.jsonl labels.jsonl --export report.json
"""

from __future__ import annotations

import argparse
import json
import logging
import statistics
import sys
import time
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("model_evaluation_tool")

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

EPSILON: float = 1e-10  # avoid division by zero

# ---------------------------------------------------------------------------
# Core Metrics: implemented from scratch
# ---------------------------------------------------------------------------


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """
    Compute accuracy: fraction of predictions that match the true labels.

    This is the simplest metric but can be misleading with imbalanced
# ... 465 more lines ...
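The preview cuts off partway through the `accuracy` docstring, just as it warns that accuracy can mislead on imbalanced data. The sketch below illustrates that warning with made-up data; it is not the product's implementation. The function bodies, the helper `recall_for`, and the "ham"/"spam" labels are all assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (0.0 for empty input)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(label, y_true, y_pred):
    """Share of true `label` examples that were predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not preds_for_label:
        return 0.0
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


# Hypothetical imbalanced dataset: 95 "ham" messages, 5 "spam" messages.
y_true = ["ham"] * 95 + ["spam"] * 5
# A degenerate model that always answers with the majority class.
y_pred = ["ham"] * 100

print(accuracy(y_true, y_pred))            # high, despite the model being useless for spam
print(recall_for("spam", y_true, y_pred))  # no spam message is ever caught
```

A model that never predicts the minority class still scores 95% accuracy here while its recall on that class is zero, which is why per-class precision, recall, and F1 are usually reported alongside accuracy.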
