Configuration File ================== An optional YAML file may be passed with ``--config-file`` to set defaults for all CLI options and define URL classification lists. YAML Schema ----------- .. code-block:: yaml # --- CLI option overrides (all optional) --- root_url: "https://example.com/docs" timeout: 15 retries: 5 max_requests: 1000 max_depth: 8 max_threads: 20 max_referencing_pages: 20 log_level: "debug" # case-insensitive output: "report.txt" log_file: "crawl.log" ignore_http_to_https_redirects: true # --- URL classification lists (all optional) --- asset_urls: - "https://example.com/static/images" - "https://example.com/static/docs" no_crawl_urls: - "https://example.com/archive" ignore_urls: - "https://example.com/legacy" .. note:: Any unrecognised key in the YAML file raises an error immediately, listing all unknown keys and the complete set of valid key names. This catches typos such as ``non_crawl_urls`` instead of ``no_crawl_urls`` at startup rather than silently ignoring the setting. HTTP-to-HTTPS Redirect Filtering -------------------------------- ``ignore_http_to_https_redirects`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When set to ``true``, any redirect where the **only** difference between the original URL and the final URL is a scheme upgrade from ``http`` to ``https`` (same host, path, and query) is **silently omitted** from the Redirects section of the report. This is useful when your site has been fully migrated to HTTPS but some pages still contain ``http://`` links: those links trigger a redirect but are otherwise harmless and need not be actioned. **CLI equivalent:** ``--ignore-http-to-https-redirects`` .. note:: Only *pure* scheme upgrades are suppressed. A redirect from ``http://example.com/old`` → ``https://example.com/new`` (path differs) is still reported. URL Classification Lists ------------------------ ``asset_urls`` ~~~~~~~~~~~~~~ Expected locations for asset files (images, PDFs, etc.). When defined and non-empty, any asset discovered outside these prefixes is reported as **misplaced**. ``no_crawl_urls`` ~~~~~~~~~~~~~~~~~ URLs matching these prefixes are **checked for existence** (HTTP request issued) but are **never crawled** — their HTML content is not parsed and links are not extracted. ``ignore_urls`` ~~~~~~~~~~~~~~~ URLs matching these prefixes are **completely skipped** — no HTTP request is made. They appear in the report's "Ignore URL Matches" section so you can see which ignored URLs are still being referenced. Prefix Matching Rules --------------------- All prefix lists use **path-segment boundary** matching. A prefix ``P`` matches a candidate URL ``C`` if: 1. The hosts are identical (case-insensitive, scheme ignored). 2. The path of ``C`` equals the path of ``P``, or starts with ``P``'s path followed by ``/``. For example, prefix ``https://example.com/dir1`` matches: - ``https://example.com/dir1`` ✓ - ``https://example.com/dir1/`` ✓ - ``https://example.com/dir1/foo/bar`` ✓ - ``http://example.com/dir1/foo`` ✓ (scheme-insensitive) But does **not** match: - ``https://example.com/dir1-foo`` ✗ - ``https://example.com/dir10`` ✗ CLI Precedence -------------- When an option appears in both the config file and on the command line, the **command-line value takes precedence**.