Architecture

Module Structure

All source code lives under src/link_checker/.

Module	Responsibility
`__init__.py`	Package init, `__version__`
`cli.py`	Entry point, argparse, logging setup
`config.py`	`CrawlConfig` dataclass, YAML loading, merge
`url_utils.py`	URL normalization, prefix matching, depth calculation
`classifier.py`	URL decision tree, `AssetType` enum
`http_client.py`	HTTP requests with retry, redirect, SSL handling
`html_parser.py`	Link extraction, anchor collection, base href handling
`results.py`	Thread-safe results aggregation
`crawler.py`	Main crawl engine with `ThreadPoolExecutor`
`report.py`	Plain-text report generator
`progress.py`	Periodic stderr progress updates

Data Flow

        flowchart TD
    CLI["cli.py<br/>main(), argparse"] --> Config["config.py<br/>CrawlConfig"]
    CLI --> Crawler["crawler.py<br/>Crawler class"]
    Crawler --> HttpClient["http_client.py<br/>HEAD/GET, retry, SSL"]
    Crawler --> HtmlParser["html_parser.py<br/>extract_links()"]
    Crawler --> Classifier["classifier.py<br/>URL decision tree"]
    Crawler --> UrlUtils["url_utils.py<br/>normalize, prefix match"]
    Crawler --> Results["results.py<br/>CrawlResults"]
    Crawler --> Progress["progress.py<br/>stderr updates"]
    Results --> Report["report.py<br/>generate_report()"]
    Report --> CLI

Threading Model

The crawler uses a ThreadPoolExecutor with at most --max-threads workers. Shared state (visited URL set, anchor registry, results) is protected by threading.Lock instances. The visit-once guarantee is enforced atomically: if two threads discover the same URL simultaneously, exactly one thread issues the HTTP request.