Configuration File

An optional YAML file may be passed with --config-file to set defaults for all CLI options and define URL classification lists.

YAML Schema

# --- CLI option overrides (all optional) ---
root_url: "https://example.com/docs"
timeout: 15
retries: 5
max_requests: 1000
max_depth: 8
max_threads: 20
max_referencing_pages: 20
log_level: "debug"        # case-insensitive
output: "report.txt"
log_file: "crawl.log"
ignore_http_to_https_redirects: true

# --- URL classification lists (all optional) ---
asset_urls:
  - "https://example.com/static/images"
  - "https://example.com/static/docs"

no_crawl_urls:
  - "https://example.com/archive"

ignore_urls:
  - "https://example.com/legacy"

Note

Any unrecognised key in the YAML file raises an error immediately, listing all unknown keys and the complete set of valid key names. This catches typos such as non_crawl_urls instead of no_crawl_urls at startup rather than silently ignoring the setting.

HTTP-to-HTTPS Redirect Filtering

`ignore_http_to_https_redirects`

When set to true, any redirect where the only difference between the original URL and the final URL is a scheme upgrade from http to https (same host, path, and query) is silently omitted from the Redirects section of the report.

This is useful when your site has been fully migrated to HTTPS but some pages still contain http:// links: those links trigger a redirect but are otherwise harmless and need not be actioned.

CLI equivalent: --ignore-http-to-https-redirects

Note

Only pure scheme upgrades are suppressed. A redirect from http://example.com/old → https://example.com/new (path differs) is still reported.

URL Classification Lists

`asset_urls`

Expected locations for asset files (images, PDFs, etc.). When defined and non-empty, any asset discovered outside these prefixes is reported as misplaced.

`no_crawl_urls`

URLs matching these prefixes are checked for existence (HTTP request issued) but are never crawled — their HTML content is not parsed and links are not extracted.

`ignore_urls`

URLs matching these prefixes are completely skipped — no HTTP request is made. They appear in the report’s “Ignore URL Matches” section so you can see which ignored URLs are still being referenced.

Prefix Matching Rules

All prefix lists use path-segment boundary matching. A prefix P matches a candidate URL C if:

The hosts are identical (case-insensitive, scheme ignored).
The path of C equals the path of P, or starts with P’s path followed by /.

For example, prefix https://example.com/dir1 matches:

https://example.com/dir1 ✓
https://example.com/dir1/ ✓
https://example.com/dir1/foo/bar ✓
http://example.com/dir1/foo ✓ (scheme-insensitive)

But does not match:

https://example.com/dir1-foo ✗
https://example.com/dir10 ✗

CLI Precedence

When an option appears in both the config file and on the command line, the command-line value takes precedence.

Configuration File

YAML Schema

HTTP-to-HTTPS Redirect Filtering

ignore_http_to_https_redirects

URL Classification Lists

asset_urls

no_crawl_urls

ignore_urls