
Model Evaluation Tool

$19

Python model evaluation suite with accuracy, precision, recall, F1, and per-class reports.

📁 11 files
Markdown, Python

📄 Product Preview


📁 File Structure 11 files

model-evaluation-tool/
├── LICENSE
├── README.md
├── examples/
│   ├── basic_usage.py
│   └── sample_predictions.jsonl
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_project-structure.md
│   ├── 03_usage-examples.md
│   └── 04_license.md
├── index.html
└── src/
    └── model_evaluation_tool.py

📖 Documentation Preview README excerpt

Model Evaluation Tool

Python model evaluation suite: accuracy, precision, recall, F1, confusion matrix, per-class reports, and benchmark runner. All metrics computed from scratch. Zero dependencies.

Part of the AI Toolkit collection by [CodeVault](https://ai-toolkit.codevault.dev).

Features

  • Core metrics — Accuracy, precision, recall, F1 score (macro and weighted)
  • Confusion matrix — Multi-class confusion matrix with ASCII table rendering
  • Per-class report — Precision, recall, F1, and support for every class
  • Benchmark runner — Evaluate across multiple test splits with mean/std statistics
  • File loaders — Load labels from JSONL, plain text, or structured JSON
  • JSON export — Export full reports for dashboards and CI pipelines
  • CLI interface — Evaluate, benchmark, and export from the terminal
  • No dependencies — All math implemented from scratch using stdlib only
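
The core metrics and per-class report listed above can be illustrated with a short stdlib-only sketch. This is not the tool's own code: the function names `per_class_report`, `macro_f1`, and `weighted_f1` are hypothetical, and the choice to score a class with no predictions or no true samples as 0.0 is an assumption.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1", "support"}} for every class."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter()  # prediction matches the true label
    fp = Counter()  # class was predicted, but the true label differs
    fn = Counter()  # class was the true label, but something else was predicted
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]  # times the class was predicted
        actual = tp[label] + fn[label]     # times the class truly occurred (support)
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return report


def macro_f1(report):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(r["f1"] for r in report.values()) / len(report)


def weighted_f1(report):
    """Mean of per-class F1 weighted by support: frequent classes count more."""
    total = sum(r["support"] for r in report.values())
    return sum(r["f1"] * r["support"] for r in report.values()) / total


# Made-up sentiment labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, {k: round(v, 3) for k, v in row.items()})
print("macro F1:", round(macro_f1(report), 3))
print("weighted F1:", round(weighted_f1(report), 3))
```

Macro averaging treats every class equally, while weighted averaging follows the class distribution, so the two can diverge noticeably on imbalanced data.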

Quick Start


# Run demo with sample sentiment analysis data
python src/model_evaluation_tool.py --demo

# Evaluate predictions vs. labels
python src/model_evaluation_tool.py --predictions preds.txt --labels true.txt

# Export report as JSON
python src/model_evaluation_tool.py --predictions preds.txt --labels true.txt --export report.json

# Print confusion matrix
python src/model_evaluation_tool.py --confusion-matrix preds.txt true.txt

# Run benchmark across splits
python src/model_evaluation_tool.py --benchmark preds.txt true.txt --export bench.json
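
The README excerpt does not show how the benchmark runner builds its splits. As a rough sketch of what evaluating across multiple splits with mean/std statistics can look like, the example below scores contiguous chunks with plain accuracy. The `benchmark` function, its `n_splits` parameter, the chunking scheme, and the use of the sample standard deviation are all assumptions, not the tool's documented behavior.

```python
import statistics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def benchmark(y_true, y_pred, n_splits=3):
    """Score n_splits contiguous chunks and summarize with mean and std."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not 1 <= n_splits <= len(y_true):
        raise ValueError("n_splits must be between 1 and the number of samples")

    # Spread any remainder over the first chunks so sizes differ by at most 1.
    size, extra = divmod(len(y_true), n_splits)
    scores, start = [], 0
    for i in range(n_splits):
        end = start + size + (1 if i < extra else 0)
        scores.append(accuracy(y_true[start:end], y_pred[start:end]))
        start = end

    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation; undefined for a single split, so report 0.0.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


# Made-up labels, for illustration only.
y_true = ["a", "b", "a", "b", "a", "b"]
y_pred = ["a", "b", "a", "a", "b", "b"]
result = benchmark(y_true, y_pred, n_splits=3)
print(result)
```

A large standard deviation relative to the mean suggests the score depends heavily on which samples land in a split.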

Project Structure


model-evaluation-tool/
├── README.md
├── LICENSE
├── src/
│   └── model_evaluation_tool.py    # Core engine (~350 lines)
└── examples/
    ├── basic_usage.py               # Programmatic usage example
    └── sample_predictions.jsonl     # Sample prediction data
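
The preview does not show the layout of `sample_predictions.jsonl` or of the plain-text label files. The sketch below shows one way such files could be read, assuming one label per line for plain text and one JSON object per line with a `label` field for JSONL. The `load_labels` helper and the `label` key are hypothetical; check the bundled sample file for the real schema.

```python
import json
import tempfile
from pathlib import Path


def load_labels(path, key="label"):
    """Load one label per line from a .jsonl or plain-text file."""
    path = Path(path)
    lines = [ln.strip() for ln in path.read_text(encoding="utf-8").splitlines()]
    lines = [ln for ln in lines if ln]  # skip blank lines
    if path.suffix == ".jsonl":
        # Assumed schema: {"label": "..."} on every line.
        return [str(json.loads(ln)[key]) for ln in lines]
    return lines


# Write two tiny throwaway files and read them back.
with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "true.txt"
    txt.write_text("pos\nneg\n\npos\n", encoding="utf-8")

    jsonl = Path(tmp) / "preds.jsonl"
    jsonl.write_text(
        '{"label": "pos"}\n{"label": "neg"}\n{"label": "neg"}\n',
        encoding="utf-8",
    )

    true_labels = load_labels(txt)
    pred_labels = load_labels(jsonl)

print(true_labels)
print(pred_labels)
```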

CLI Reference

| Flag | Description |
|------|-------------|
| --demo | Run demo with built-in sample data |
| --predictions FILE | Predictions file (JSONL or plain text) |
| --labels FILE | True labels file (JSONL or plain text) |
| --confusion-matrix PREDS LABELS | Print confusion matrix |
| --benchmark PREDS LABELS | Run benchmark evaluation |
| --export FILE | Export report to JSON |
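
The code preview imports `argparse`, so the flags above are presumably declared with it. The sketch below shows how a parser with these flags could be wired. The flag names and help strings come from the table; everything else, including the `build_parser` name and the use of `nargs=2` for the two-argument flags, is an assumption rather than the tool's actual source.

```python
import argparse


def build_parser():
    """Declare the documented flags; dispatch logic is intentionally omitted."""
    parser = argparse.ArgumentParser(prog="model_evaluation_tool.py")
    parser.add_argument("--demo", action="store_true",
                        help="Run demo with built-in sample data")
    parser.add_argument("--predictions", metavar="FILE",
                        help="Predictions file (JSONL or plain text)")
    parser.add_argument("--labels", metavar="FILE",
                        help="True labels file (JSONL or plain text)")
    # Two positional values after the flag: predictions file, then labels file.
    parser.add_argument("--confusion-matrix", nargs=2, metavar=("PREDS", "LABELS"),
                        help="Print confusion matrix")
    parser.add_argument("--benchmark", nargs=2, metavar=("PREDS", "LABELS"),
                        help="Run benchmark evaluation")
    parser.add_argument("--export", metavar="FILE",
                        help="Export report to JSON")
    return parser


# Parse the benchmark command from the Quick Start section.
args = build_parser().parse_args(
    ["--benchmark", "preds.txt", "true.txt", "--export", "bench.json"]
)
print(args.benchmark, args.export)
```

With `nargs=2`, argparse rejects the command if either file argument is missing, which gives a usage error without any custom validation.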

... continues with setup instructions, usage examples, and more.

📄 Code Sample .py preview

src/model_evaluation_tool.py

#!/usr/bin/env python3
"""
Model Evaluation Tool — AI Toolkit (DataNest)

A complete model evaluation suite: accuracy, precision, recall, F1 score,
confusion matrix, per-class metrics, and a benchmark runner — all computed
from scratch.

Zero external dependencies — Python 3.10+ stdlib only.

Usage:
    python model_evaluation_tool.py --predictions preds.jsonl --labels labels.jsonl
    python model_evaluation_tool.py --demo
    python model_evaluation_tool.py --confusion-matrix preds.jsonl labels.jsonl
    python model_evaluation_tool.py --benchmark preds.jsonl labels.jsonl --export report.json
"""

from __future__ import annotations

import argparse
import json
import logging
import statistics
import sys
import time
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("model_evaluation_tool")

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

EPSILON: float = 1e-10  # avoid division by zero

# ---------------------------------------------------------------------------
# Core Metrics — implemented from scratch
# ---------------------------------------------------------------------------


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """
    Compute accuracy: fraction of predictions that match the true labels.

    This is the simplest metric but can be misleading with imbalanced
# ... 465 more lines ...