Web Scraper

$29

Python web scraping framework with CSS selectors, pagination, rate limiting, proxy rotation, and caching.

📁 10 files
JSON, Markdown, Python

📁 File Structure (10 files)

web-scraper/
├── LICENSE
├── README.md
├── examples/
│   └── scrape_config.json
├── free-sample.zip
├── guide/
│   ├── 01_features.md
│   ├── 02_quick-start.md
│   ├── 03_configuration-reference.md
│   └── 04_faq.md
├── index.html
└── src/
    └── web_scraper.py

📖 Documentation Preview (README excerpt)

Web Scraper

A complete web scraping framework built on Python stdlib. CSS-style selectors, automatic pagination, rate limiting, proxy rotation, and multi-format export — no pip installs needed.

Features

  • CSS-like selectors — Extract elements by tag, .class, or #id
  • Automatic pagination — Follow next-page links across multiple pages
  • Rate limiting — Configurable delay between requests (be a polite scraper)
  • Proxy rotation — Round-robin through a list of proxy servers
  • Retry with backoff — Exponential backoff on 429/5xx errors
  • Multi-format export — Save to JSON, CSV, or SQLite
  • Deduplication — Content hashing prevents duplicate entries
  • Config file support — Define scrape jobs in JSON for repeatability
  • Metadata tracking — Each result tagged with source URL, page number, timestamp
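
As a rough illustration of two of the features above, deduplication by content hashing and round-robin proxy rotation can each be sketched in a few lines of stdlib Python. The function names (`content_hash`, `deduplicate`, `proxy_cycle`) are hypothetical and are not taken from `web_scraper.py`. The sketch also assumes the hash covers only the extracted fields, not per-result metadata such as timestamps.

```python
import hashlib
import itertools
import json
from typing import Iterable, Iterator


def content_hash(item: dict) -> str:
    """Hash an item's content; key order does not affect the digest."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(items: Iterable[dict]) -> list:
    """Keep the first occurrence of each distinct item, preserving order."""
    seen = set()
    unique = []
    for item in items:
        digest = content_hash(item)
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique


def proxy_cycle(proxies: list) -> Iterator[str]:
    """Round-robin iterator: yields each proxy in turn, then starts over."""
    return itertools.cycle(proxies)
```

Hashing a canonical JSON form (sorted keys) means two results with the same fields in a different order count as duplicates. A timestamp inside the hashed data would defeat that, which is why the sketch keeps metadata out of the hash.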

Requirements

  • Python 3.10+
  • No external dependencies (stdlib only)

Quick Start


# Scrape all h2 elements from a page
python src/web_scraper.py --url "https://example.com" --selector "h2"

# Scrape product cards with pagination
python src/web_scraper.py --url "https://example.com/shop" --selector ".product" \
    --follow --next-selector "a.next" --max-pages 5

# Export to CSV
python src/web_scraper.py --url "https://example.com" --selector ".item" \
    --format csv --output items.csv

# Use a config file for complex jobs
python src/web_scraper.py --config examples/scrape_config.json
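
The pagination command above combines `--follow`, `--next-selector`, and `--max-pages`. A minimal sketch of the loop such flags imply is shown here, assuming the caller supplies a `fetch` function and a `find_next` function that returns the next-page href (or nothing). Those two callables and the name `crawl_pages` are hypothetical; the product's own implementation is not visible in this preview.

```python
import time
import urllib.parse
from typing import Callable, Iterator, Optional, Tuple


def crawl_pages(
    start_url: str,
    fetch: Callable[[str], str],
    find_next: Callable[[str], Optional[str]],
    max_pages: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, str, str]]:
    """Yield (page_number, url, body) while following next-page links."""
    url: Optional[str] = start_url
    seen = set()
    page_number = 0
    while url is not None and url not in seen and page_number < max_pages:
        if page_number > 0:
            sleep(delay)  # rate limit: pause before every request after the first
        seen.add(url)
        page_number += 1
        body = fetch(url)
        yield page_number, url, body
        href = find_next(body)
        # Resolve relative links such as "/shop?page=2" against the current URL.
        url = urllib.parse.urljoin(url, href) if href else None
```

The `seen` set stops the loop if a site's "next" link points back to a page already visited, and `max_pages` caps the crawl even when the links never run out.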

Selector Syntax

Selector      Matches
div           All <div> elements
.product      Elements with class="product"
#main         Element with id="main"
a.link        <a> elements with class="link"
span.price    <span> elements with class="price"
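
The selector forms listed above (a tag, `.class`, `#id`, or a tag combined with a class) could be matched roughly as follows. This is a sketch under assumptions, not the parser shipped in `web_scraper.py`: the names `parse_selector` and `matches` are hypothetical, and only the forms shown above are handled.

```python
import re
from typing import Dict, Optional

# Optional tag name, followed by at most one ".class" or "#id" suffix.
_SELECTOR_RE = re.compile(
    r"^(?P<tag>[A-Za-z][\w-]*)?(?:\.(?P<cls>[\w-]+)|#(?P<id>[\w-]+))?$"
)


def parse_selector(selector: str) -> Dict[str, Optional[str]]:
    """Split a selector into tag, class, and id parts (missing parts are None)."""
    text = selector.strip()
    match = _SELECTOR_RE.match(text)
    if not text or match is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    return match.groupdict()


def matches(selector: str, tag: str, attrs: Dict[str, str]) -> bool:
    """Return True if an element with this tag and attributes fits the selector."""
    parts = parse_selector(selector)
    if parts["tag"] and parts["tag"].lower() != tag.lower():
        return False
    # class="a b" holds several names; compare against each whole name.
    if parts["cls"] and parts["cls"] not in attrs.get("class", "").split():
        return False
    if parts["id"] and parts["id"] != attrs.get("id"):
        return False
    return True
```

Splitting the `class` attribute on whitespace means `.product` matches `class="product featured"` but not `class="products"`.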

Configuration Reference

Create a JSON config file for repeatable scrape jobs:


{
    "url": "https://api.example.com/products",
    "selector": ".product-card",
    "follow_links": true,
    "next_page_selector": "a.next-page",
    "max_pages": 5,

*... continues with setup instructions, usage examples, and more.*
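
For readers who want to validate a job file before running it, the sketch below loads only the five keys visible in the excerpt above. The rest of the config is elided in this preview and is not guessed at here. The `ScrapeJob` class, the `load_job` function, the default values, and the choice of which keys are required are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapeJob:
    """Hypothetical container for the keys shown in the README excerpt."""
    url: str
    selector: str
    follow_links: bool = False                 # assumed default
    next_page_selector: Optional[str] = None   # assumed default
    max_pages: int = 1                         # assumed default


def load_job(text: str) -> ScrapeJob:
    """Parse a JSON config string and check the visible keys for consistency."""
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise ValueError("config must be a JSON object")
    missing = [key for key in ("url", "selector") if not raw.get(key)]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    job = ScrapeJob(
        url=raw["url"],
        selector=raw["selector"],
        follow_links=bool(raw.get("follow_links", False)),
        next_page_selector=raw.get("next_page_selector"),
        max_pages=int(raw.get("max_pages", 1)),
    )
    if job.follow_links and not job.next_page_selector:
        raise ValueError("follow_links requires next_page_selector")
    if job.max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return job
```

Unknown keys are ignored rather than rejected, so a complete config file with additional fields would still load.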

📄 Code Sample (.py preview)

src/web_scraper.py

#!/usr/bin/env python3
"""
Web Scraper — Automation Hub (DataNest)

A complete web scraping framework built on Python stdlib. Supports CSS-style
selector parsing, XPath-like path extraction, automatic pagination, rate
limiting, proxy rotation, and multi-format export (JSON, CSV, SQLite).

Usage:
    python web_scraper.py --url "https://api.example.com/items" --selector ".item"
    python web_scraper.py --config scrape_config.json
    python web_scraper.py --url "https://example.com/page" --output data.csv

Dependencies: Python 3.10+ stdlib only (no pip packages)
License: MIT
"""

from __future__ import annotations

import argparse
import csv
import hashlib
import html.parser
import json
import logging
import random
import re
import sqlite3
import time
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (compatible; DataNestScraper/1.0; "
    "+https://datanest-stores.pages.dev/automation-hub)"
)
DEFAULT_DELAY_SECONDS = 1.0  # Polite crawling — don't hammer the server
MAX_RETRIES = 3
RETRY_BACKOFF_BASE = 2.0  # Exponential backoff: 2s, 4s, 8s
DEFAULT_TIMEOUT = 30  # HTTP request timeout in seconds

# ... 485 more lines ...
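
The constants in the preview (`MAX_RETRIES`, `RETRY_BACKOFF_BASE`, `DEFAULT_TIMEOUT`) and the comment "2s, 4s, 8s" suggest a delay of `RETRY_BACKOFF_BASE ** attempt`. The product's actual retry loop sits in the elided part of the file, so the following is only a sketch of how those constants might be used. The functions `backoff_delays`, `is_retryable`, and `fetch_with_retry` are hypothetical names, and the injectable `fetch` and `sleep` parameters are there so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Optional

MAX_RETRIES = 3
RETRY_BACKOFF_BASE = 2.0  # delays of 2s, 4s, 8s
DEFAULT_TIMEOUT = 30      # seconds


def backoff_delays(retries: int = MAX_RETRIES, base: float = RETRY_BACKOFF_BASE) -> List[float]:
    """Delay before each retry: base**1, base**2, ..., base**retries."""
    return [base ** attempt for attempt in range(1, retries + 1)]


def is_retryable(status: int) -> bool:
    """Retry on 429 (Too Many Requests) and any 5xx server error."""
    return status == 429 or 500 <= status < 600


def _default_fetch(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=DEFAULT_TIMEOUT) as response:
        return response.read()


def fetch_with_retry(
    url: str,
    fetch: Optional[Callable[[str], bytes]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Fetch a URL, sleeping with exponential backoff between retryable failures."""
    fetch = fetch or _default_fetch
    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as exc:
            # Give up on non-retryable statuses or once all retries are used.
            if not is_retryable(exc.code) or attempt == len(delays):
                raise
            sleep(delays[attempt])
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

With the values from the preview this makes one initial attempt plus up to three retries, waiting 2, 4, and 8 seconds before each retry. Errors such as 404 are re-raised immediately.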