Architecture
============
Module Structure
----------------
All source code lives under ``src/link_checker/``.
.. list-table::
:widths: 20 80
:header-rows: 1
* - Module
- Responsibility
* - ``__init__.py``
- Package init, ``__version__``
* - ``cli.py``
- Entry point, argparse, logging setup
* - ``config.py``
- ``CrawlConfig`` dataclass, YAML loading, merge
* - ``url_utils.py``
- URL normalization, prefix matching, depth calculation
* - ``classifier.py``
- URL decision tree, ``AssetType`` enum
* - ``http_client.py``
- HTTP requests with retry, redirect, SSL handling
* - ``html_parser.py``
- Link extraction, anchor collection, base href handling
* - ``results.py``
- Thread-safe results aggregation
* - ``crawler.py``
- Main crawl engine with ``ThreadPoolExecutor``
* - ``report.py``
- Plain-text report generator
* - ``progress.py``
- Periodic stderr progress updates
Data Flow
---------
.. mermaid::
flowchart TD
CLI["cli.py
main(), argparse"] --> Config["config.py
CrawlConfig"]
CLI --> Crawler["crawler.py
Crawler class"]
Crawler --> HttpClient["http_client.py
HEAD/GET, retry, SSL"]
Crawler --> HtmlParser["html_parser.py
extract_links()"]
Crawler --> Classifier["classifier.py
URL decision tree"]
Crawler --> UrlUtils["url_utils.py
normalize, prefix match"]
Crawler --> Results["results.py
CrawlResults"]
Crawler --> Progress["progress.py
stderr updates"]
Results --> Report["report.py
generate_report()"]
Report --> CLI
Threading Model
---------------
The crawler uses a ``ThreadPoolExecutor`` with at most ``--max-threads``
workers. Shared state (visited URL set, anchor registry, results) is protected
by ``threading.Lock`` instances. The visit-once guarantee is enforced
atomically: if two threads discover the same URL simultaneously, exactly one
thread issues the HTTP request.