Configuration File
An optional YAML file may be passed with --config-file to set defaults for
all CLI options and define URL classification lists.
YAML Schema
# --- CLI option overrides (all optional) ---
root_url: "https://example.com/docs"
timeout: 15
retries: 5
max_requests: 1000
max_depth: 8
max_threads: 20
max_referencing_pages: 20
log_level: "debug" # case-insensitive
output: "report.txt"
log_file: "crawl.log"
ignore_http_to_https_redirects: true
# --- URL classification lists (all optional) ---
asset_urls:
- "https://example.com/static/images"
- "https://example.com/static/docs"
no_crawl_urls:
- "https://example.com/archive"
ignore_urls:
- "https://example.com/legacy"
Note
Any unrecognised key in the YAML file raises an error immediately, listing
all unknown keys and the complete set of valid key names. This catches
typos such as non_crawl_urls instead of no_crawl_urls at startup
rather than silently ignoring the setting.
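The check described in this note can be sketched as follows. This is a minimal illustration, not the tool's actual code: the names `VALID_KEYS` and `validate_config_keys` are hypothetical, the key set is copied from the schema example above, and the config is assumed to have already been parsed into a plain dict.

```python
# Hypothetical set of valid keys, copied from the schema example above.
VALID_KEYS = frozenset({
    "root_url", "timeout", "retries", "max_requests", "max_depth",
    "max_threads", "max_referencing_pages", "log_level", "output",
    "log_file", "ignore_http_to_https_redirects",
    "asset_urls", "no_crawl_urls", "ignore_urls",
})


def validate_config_keys(config: dict) -> None:
    """Raise ValueError if the parsed config contains any unrecognised key.

    The message lists every unknown key and the full set of valid keys,
    so a typo can be spotted and corrected in one pass.
    """
    unknown = sorted(set(config) - VALID_KEYS)
    if unknown:
        raise ValueError(
            f"Unknown config keys: {', '.join(unknown)}. "
            f"Valid keys: {', '.join(sorted(VALID_KEYS))}"
        )
```

With this sketch, a config containing non_crawl_urls fails at startup with a message that names both the offending key and the valid no_crawl_urls.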
HTTP-to-HTTPS Redirect Filtering
ignore_http_to_https_redirects
When set to true, any redirect where the only difference between the
original URL and the final URL is a scheme upgrade from http to https
(same host, path, and query) is silently omitted from the Redirects section
of the report.
This is useful when your site has been fully migrated to HTTPS but some pages
still contain http:// links: those links trigger a redirect but are otherwise
harmless and need not be actioned.
CLI equivalent: --ignore-http-to-https-redirects
Note
Only pure scheme upgrades are suppressed. A redirect from
http://example.com/old → https://example.com/new (path differs)
is still reported.
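The rule above can be expressed as a small predicate. This is a sketch under assumptions rather than the tool's implementation: the function name `is_pure_scheme_upgrade` is hypothetical, hosts are compared case-insensitively with ports ignored, and paths are compared exactly (so an empty path and "/" count as different).

```python
from urllib.parse import urlsplit


def is_pure_scheme_upgrade(original: str, final: str) -> bool:
    """Return True if `final` differs from `original` only by http -> https."""
    a, b = urlsplit(original), urlsplit(final)
    return (
        a.scheme == "http"
        and b.scheme == "https"
        # .hostname is lowercased by urlsplit; ports are ignored (assumption).
        and (a.hostname or "") == (b.hostname or "")
        and a.path == b.path
        and a.query == b.query
    )
```

Under this sketch, http://example.com/old → https://example.com/old would be suppressed, while http://example.com/old → https://example.com/new would still be reported.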
URL Classification Lists
asset_urls
Expected locations for asset files (images, PDFs, etc.). When defined and non-empty, any asset discovered outside these prefixes is reported as misplaced.
no_crawl_urls
URLs matching these prefixes are checked for existence (HTTP request issued) but are never crawled — their HTML content is not parsed and links are not extracted.
ignore_urls
URLs matching these prefixes are completely skipped — no HTTP request is made. They appear in the report’s “Ignore URL Matches” section so you can see which ignored URLs are still being referenced.
Prefix Matching Rules
All prefix lists use path-segment boundary matching. A prefix P matches
a candidate URL C if:
1. The hosts are identical (case-insensitive, scheme ignored).
2. The path of C equals the path of P, or starts with P’s path followed by /.
For example, prefix https://example.com/dir1 matches:
https://example.com/dir1 ✓
https://example.com/dir1/ ✓
https://example.com/dir1/foo/bar ✓
http://example.com/dir1/foo ✓ (scheme-insensitive)
But does not match:
https://example.com/dir1-foo ✗
https://example.com/dir10 ✗
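The matching rules above can be sketched in a few lines. This is an illustrative approximation, not the tool's code: the function name `prefix_matches` is hypothetical, ports are ignored when comparing hosts, and a trailing slash on the prefix is stripped before comparison (both are assumptions the rules above do not spell out).

```python
from urllib.parse import urlsplit


def prefix_matches(prefix: str, candidate: str) -> bool:
    """Path-segment boundary match of `candidate` against `prefix`."""
    p, c = urlsplit(prefix), urlsplit(candidate)
    # Rule 1: hosts must be identical; .hostname is already lowercased,
    # and the scheme is never compared.
    if (p.hostname or "") != (c.hostname or ""):
        return False
    # Assumption: a trailing slash on the prefix is not significant.
    p_path = p.path.rstrip("/")
    # Rule 2: same path, or the prefix path followed by "/".
    return c.path == p_path or c.path.startswith(p_path + "/")
```

Requiring the "/" after the prefix path is what keeps https://example.com/dir10 and https://example.com/dir1-foo from matching a prefix of https://example.com/dir1, which a plain string-prefix test would accept.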
CLI Precedence
When an option appears in both the config file and on the command line, the command-line value takes precedence.
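One way to picture this precedence is a simple dictionary merge. The sketch below is hypothetical: the function name `merge_options` is invented for illustration, and it assumes that CLI options the user did not supply are represented as None.

```python
def merge_options(config_values: dict, cli_values: dict) -> dict:
    """Combine config-file defaults with CLI values; the CLI wins on conflict."""
    merged = dict(config_values)
    # Assumption: an option not given on the command line is None,
    # so it must not overwrite the value from the config file.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged
```

For example, with timeout: 15 in the config file and a timeout of 30 on the command line, the merged value would be 30, while options set only in the file keep their file values.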