# Architecture

## Module Structure

All source code lives under `src/link_checker/`.

| Module | Responsibility |
|---|---|
| `__init__.py` | Package init |
| `cli.py` | Entry point, argparse, logging setup |
| `config.py` | `CrawlConfig` |
| `url_utils.py` | URL normalization, prefix matching, depth calculation |
| `classifier.py` | URL decision tree |
| `http_client.py` | HTTP requests with retry, redirect, SSL handling |
| `html_parser.py` | Link extraction, anchor collection, base href handling |
| `results.py` | Thread-safe results aggregation |
| `crawler.py` | Main crawl engine with `ThreadPoolExecutor` |
| `report.py` | Plain-text report generator |
| `progress.py` | Periodic stderr progress updates |
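
The URL helpers listed for `url_utils.py` could look roughly like the sketch below. The function names (`normalize_url`, `has_prefix`, `path_depth`) and the exact normalization rules are assumptions made for illustration, not the project's actual API. In particular, "depth" is assumed here to mean the number of path segments below the prefix.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helpers; names and rules are illustrative assumptions.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop default ports and the fragment,
    and use "/" for an empty path. Userinfo and IPv6 literals are ignored
    in this simplified sketch."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or _DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def has_prefix(url: str, prefix: str) -> bool:
    """True if url is at or below prefix, matching on a path-segment boundary."""
    u = urlsplit(normalize_url(url))
    p = urlsplit(normalize_url(prefix))
    if (u.scheme, u.netloc) != (p.scheme, p.netloc):
        return False
    base = p.path if p.path.endswith("/") else p.path + "/"
    return u.path == p.path or u.path.startswith(base)


def path_depth(url: str, prefix: str) -> int:
    """Number of non-empty path segments of url below prefix (assumed meaning of depth)."""
    if not has_prefix(url, prefix):
        raise ValueError(f"{url!r} is not under {prefix!r}")
    u_path = urlsplit(normalize_url(url)).path
    p_path = urlsplit(normalize_url(prefix)).path
    remainder = u_path[len(p_path):]
    return len([segment for segment in remainder.split("/") if segment])
```

Matching on a segment boundary keeps a prefix such as `/docs` from accidentally covering `/docs-old/`, which a plain string `startswith` would allow.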

## Data Flow

```mermaid
flowchart TD
    CLI["cli.py<br/>main(), argparse"] --> Config["config.py<br/>CrawlConfig"]
    CLI --> Crawler["crawler.py<br/>Crawler class"]
    Crawler --> HttpClient["http_client.py<br/>HEAD/GET, retry, SSL"]
    Crawler --> HtmlParser["html_parser.py<br/>extract_links()"]
    Crawler --> Classifier["classifier.py<br/>URL decision tree"]
    Crawler --> UrlUtils["url_utils.py<br/>normalize, prefix match"]
    Crawler --> Results["results.py<br/>CrawlResults"]
    Crawler --> Progress["progress.py<br/>stderr updates"]
    Results --> Report["report.py<br/>generate_report()"]
    Report --> CLI
```
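
To make the `html_parser.py` step of this flow more concrete, here is a minimal sketch of link extraction with anchor collection and base href handling, using only the standard library. The name `extract_links()` comes from the diagram, but its signature, its return value, and the `_LinkCollector` helper are assumptions for illustration. The real module may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Hypothetical helper: gathers raw hrefs, anchor names, and the first <base href>."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Only the first <base href> in a document takes effect.
        if tag == "base" and self.base_href is None and attr.get("href"):
            self.base_href = attr["href"]
        if tag == "a" and attr.get("href"):
            self.hrefs.append(attr["href"])
        # Any element id, or a legacy <a name>, can be a fragment target.
        if attr.get("id"):
            self.anchors.add(attr["id"])
        if tag == "a" and attr.get("name"):
            self.anchors.add(attr["name"])


def extract_links(html, page_url):
    """Return (absolute_links, anchors) for one page (assumed signature)."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    # A relative <base href> is itself resolved against the page URL.
    if collector.base_href:
        base = urljoin(page_url, collector.base_href)
    else:
        base = page_url
    links = [urljoin(base, href) for href in collector.hrefs]
    return links, collector.anchors
```

In this sketch the crawler would pass the returned links on for classification and record the anchors so that fragment links can be checked later.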

## Threading Model

The crawler uses a `ThreadPoolExecutor` with at most `--max-threads`
workers. Shared state (the visited-URL set, the anchor registry, and the
results) is protected by `threading.Lock` instances. The visit-once guarantee
is enforced atomically: if two threads discover the same URL simultaneously,
exactly one thread issues the HTTP request.
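
The visit-once guarantee can be illustrated with a check-and-insert performed under a single lock. This is a minimal sketch, not the crawler's actual code: `VisitedSet`, `claim()`, `crawl_once()`, and the `fetch` callback are hypothetical names introduced for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class VisitedSet:
    """Hypothetical visited-URL set with an atomic claim operation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, url):
        """Return True for exactly one caller per URL, False for all others."""
        # The membership test and the insert happen under the same lock,
        # so two threads can never both see the URL as unvisited.
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


def crawl_once(urls, fetch, max_threads=4):
    """Call fetch(url) at most once per distinct URL using a bounded pool."""
    visited = VisitedSet()

    def worker(url):
        if visited.claim(url):
            fetch(url)

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Consuming the iterator surfaces any exception raised in a worker.
        list(pool.map(worker, urls))
```

Checking membership and inserting in two separately locked steps would reintroduce the race, because another thread could claim the URL between the check and the insert.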