Architecture

Module Structure

All source code lives under src/link_checker/.

Module          Responsibility
--------------  -----------------------------------------------------
__init__.py     Package init, __version__
cli.py          Entry point, argparse, logging setup
config.py       CrawlConfig dataclass, YAML loading, merge
url_utils.py    URL normalization, prefix matching, depth calculation
classifier.py   URL decision tree, AssetType enum
http_client.py  HTTP requests with retry, redirect, SSL handling
html_parser.py  Link extraction, anchor collection, base href handling
results.py      Thread-safe results aggregation
crawler.py      Main crawl engine with ThreadPoolExecutor
report.py       Plain-text report generator
progress.py     Periodic stderr progress updates

Data Flow

flowchart TD
    CLI["cli.py<br/>main(), argparse"] --> Config["config.py<br/>CrawlConfig"]
    CLI --> Crawler["crawler.py<br/>Crawler class"]
    Crawler --> HttpClient["http_client.py<br/>HEAD/GET, retry, SSL"]
    Crawler --> HtmlParser["html_parser.py<br/>extract_links()"]
    Crawler --> Classifier["classifier.py<br/>URL decision tree"]
    Crawler --> UrlUtils["url_utils.py<br/>normalize, prefix match"]
    Crawler --> Results["results.py<br/>CrawlResults"]
    Crawler --> Progress["progress.py<br/>stderr updates"]
    Results --> Report["report.py<br/>generate_report()"]
    Report --> CLI

Threading Model

The crawler uses a ThreadPoolExecutor with at most --max-threads workers. Shared state (visited URL set, anchor registry, results) is protected by threading.Lock instances. The visit-once guarantee is enforced atomically: if two threads discover the same URL simultaneously, exactly one thread issues the HTTP request.
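
The visit-once guarantee described above can be illustrated with a minimal sketch. It is not the actual code in crawler.py: the names VisitedSet, try_claim, and claim_concurrently are hypothetical, and the real crawler may organize its locks differently. The idea is that the membership check and the insertion happen under the same threading.Lock, so only one thread can ever "win" a given URL.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class VisitedSet:
    """Hypothetical helper: a set of URLs guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def try_claim(self, url):
        """Return True for exactly one caller per URL, False for all others.

        The check and the add happen inside the same critical section,
        so two threads cannot both observe the URL as unvisited.
        """
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


def claim_concurrently(url, attempts, max_threads):
    """Let many workers race to claim one URL; return how many succeeded."""
    visited = VisitedSet()
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Each task stands in for a thread that has just discovered `url`.
        outcomes = list(pool.map(lambda _: visited.try_claim(url), range(attempts)))
    return sum(outcomes)


if __name__ == "__main__":
    winners = claim_concurrently("https://example.com/page", attempts=50, max_threads=8)
    print(f"threads allowed to fetch: {winners}")
```

In a crawler built this way, a worker would issue the HTTP request only when try_claim returns True. A separate "check, then add" sequence without the lock would leave a window in which two threads both see the URL as unvisited and both fetch it.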