Architecture ============ Module Structure ---------------- All source code lives under ``src/link_checker/``. .. list-table:: :widths: 20 80 :header-rows: 1 * - Module - Responsibility * - ``__init__.py`` - Package init, ``__version__`` * - ``cli.py`` - Entry point, argparse, logging setup * - ``config.py`` - ``CrawlConfig`` dataclass, YAML loading, merge * - ``url_utils.py`` - URL normalization, prefix matching, depth calculation * - ``classifier.py`` - URL decision tree, ``AssetType`` enum * - ``http_client.py`` - HTTP requests with retry, redirect, SSL handling * - ``html_parser.py`` - Link extraction, anchor collection, base href handling * - ``results.py`` - Thread-safe results aggregation * - ``crawler.py`` - Main crawl engine with ``ThreadPoolExecutor`` * - ``report.py`` - Plain-text report generator * - ``progress.py`` - Periodic stderr progress updates Data Flow --------- .. mermaid:: flowchart TD CLI["cli.py
main(), argparse"] --> Config["config.py
CrawlConfig"] CLI --> Crawler["crawler.py
Crawler class"] Crawler --> HttpClient["http_client.py
HEAD/GET, retry, SSL"] Crawler --> HtmlParser["html_parser.py
extract_links()"] Crawler --> Classifier["classifier.py
URL decision tree"] Crawler --> UrlUtils["url_utils.py
normalize, prefix match"] Crawler --> Results["results.py
CrawlResults"] Crawler --> Progress["progress.py
stderr updates"] Results --> Report["report.py
generate_report()"] Report --> CLI Threading Model --------------- The crawler uses a ``ThreadPoolExecutor`` with at most ``--max-threads`` workers. Shared state (visited URL set, anchor registry, results) is protected by ``threading.Lock`` instances. The visit-once guarantee is enforced atomically: if two threads discover the same URL simultaneously, exactly one thread issues the HTTP request.