rms-link-checker

API Reference

rms-link-checker package.

Website crawler, link checker, and content analyzer.

Command-line entry point for rms-link-checker.

main() → None

Entry point for the link_check CLI command.

Configuration dataclass and YAML loading for rms-link-checker.

class CrawlConfig(root_url: str, timeout: int = 10, retries: int = 3, max_requests: int | None = None, max_depth: int | None = None, max_threads: int = 10, max_referencing_pages: int = 10, log_level: str = 'INFO', output: str | None = None, log_file: str | None = None, asset_urls: tuple[str, ...] = <factory>, no_crawl_urls: tuple[str, ...] = <factory>, ignore_urls: tuple[str, ...] = <factory>, verify: bool | str = True, ignore_http_to_https_redirects: bool = False)

Bases: object

Immutable crawl configuration.

root_url

The root URL to begin crawling from.

Type:

str

timeout

Timeout in seconds for each HTTP request.

Type:

int

retries

Number of retry attempts for transient failures.

Type:

int

max_requests

Maximum total HTTP requests (None = unlimited).

Type:

int | None

max_depth

Maximum directory depth (None = unlimited).

Type:

int | None

max_threads

Maximum concurrent threads.

Type:

int

max_referencing_pages

Maximum number of referencing pages listed per URL in the report.

Type:

int

log_level

Minimum log level, given as a string (default 'INFO').

Type:

str

output

File path for report (None = stdout).

Type:

str | None

log_file

File path for log messages (None = stderr).

Type:

str | None

asset_urls

Expected URL prefixes for asset files.

Type:

tuple[str, …]

no_crawl_urls

URL prefixes to check but not crawl.

Type:

tuple[str, …]

ignore_urls

URL prefixes to skip entirely.

Type:

tuple[str, …]

verify

TLS certificate verification. True (default) uses the system CA bundle; False disables verification (insecure); a string is treated as a path to a CA-bundle file.

Type:

bool | str

ignore_http_to_https_redirects

When True, redirects where only the scheme changes from http to https (same host, path, and query) are silently dropped from the Redirects section of the report. Defaults to False.

Type:

bool

load_config(cli_namespace: Namespace, *, config_path: str | None = None) → CrawlConfig

Build a CrawlConfig by merging a YAML file and CLI overrides.

Precedence (highest to lowest): CLI arguments > YAML file > defaults.

Parameters:
  • cli_namespace – Parsed argparse namespace. Fields set to None are treated as “not specified” and do not override YAML or defaults.

  • config_path – Path to a YAML configuration file. If None, no file is loaded. May also be read from cli_namespace.config_file.

Returns:

A fully populated CrawlConfig instance.

Raises:

ValueError – If required fields are missing, the config file cannot be parsed, or any value fails validation.
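The precedence rule can be illustrated with a small standalone sketch. This is not the library's implementation: merge_settings and DEFAULTS are hypothetical names, the YAML file is assumed to be already parsed into a dict, and only three fields are shown.

```python
from argparse import Namespace

# Hypothetical defaults for illustration; only a few CrawlConfig fields appear.
DEFAULTS = {"timeout": 10, "retries": 3, "max_threads": 10}


def merge_settings(cli: Namespace, yaml_values: dict) -> dict:
    """Merge settings with precedence CLI > YAML > defaults."""
    merged = dict(DEFAULTS)
    merged.update(yaml_values)  # YAML values override the defaults
    for key, value in vars(cli).items():
        if value is not None:  # None means "not specified on the command line"
            merged[key] = value
    return merged


cli = Namespace(timeout=30, retries=None, max_threads=None)
yaml_values = {"timeout": 20, "retries": 5}
merged = merge_settings(cli, yaml_values)
print(merged)  # {'timeout': 30, 'retries': 5, 'max_threads': 10}
```

Here timeout comes from the CLI, retries from the YAML dict, and max_threads from the defaults.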

URL normalization, prefix matching, and classification utilities.

add_trailing_slash(url: str) → str

Ensure a URL path that has no file extension ends with /.

This should be applied to internal URLs (same domain as the crawl root) to canonicalize bare directory paths:

  • /cassini → /cassini/

  • /cassini/ → /cassini/ (unchanged)

  • /cassini/page.html → /cassini/page.html (has extension, unchanged)

  • /data.csv → /data.csv (has extension, unchanged)

  • /path/.htaccess → /path/.htaccess (dotfile, treated as a file)

Leading-dot filenames (e.g. .htaccess, .gitignore) are considered files and do not receive a trailing slash.

Apply after normalize_url() so that index-file stripping has already run (/cassini/index.html → /cassini/ → unchanged here).

Parameters:

url – An already-normalized http/https URL.

Returns:

The URL with a trailing slash added to any extension-free path.
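The rules listed above can be reproduced with the standard library. The following is a minimal sketch, not the library's code; add_trailing_slash_sketch is a hypothetical name, and treating any dot in the last path segment as "has an extension" is an assumption that happens to cover the dotfile case.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def add_trailing_slash_sketch(url: str) -> str:
    """Append '/' to the path when its last segment looks like a directory."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/"):
        return url  # already a directory-style path
    last_segment = posixpath.basename(path)
    # Assumption: any dot in the last segment marks a file, which also
    # covers dotfiles such as .htaccess.
    if "." in last_segment:
        return url
    return urlunsplit(parts._replace(path=path + "/"))


print(add_trailing_slash_sketch("https://example.com/cassini"))
print(add_trailing_slash_sketch("https://example.com/path/.htaccess"))
```

The query string, if any, stays attached after the added slash because only the path component is modified.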

get_depth(url_path: str, root_path: str) → int

Return the directory depth of url_path relative to root_path.

Depth 0 is the root page itself (or its trailing-slash variant). Each additional directory segment adds 1.

Parameters:
  • url_path – The path to measure.

  • root_path – The root path to measure from.

Returns:

Integer depth >= 0.

get_file_extension(url: str) → str | None

Return the lowercase file extension from the URL path, or None.

Only inspects the path component (ignores query string and fragment). Returns None for paths ending in / or having no dot in the last segment.

Parameters:

url – The URL to inspect.

Returns:

Lowercase extension including the leading dot (e.g. '.pdf'), or None if there is no extension.
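As an illustration of the behavior described above, here is a standalone sketch built on urllib.parse. The name file_extension_sketch is hypothetical and the code is an approximation, not the library's implementation.

```python
import posixpath
from typing import Optional
from urllib.parse import urlsplit


def file_extension_sketch(url: str) -> Optional[str]:
    """Return the lowercase extension of the last path segment, or None."""
    path = urlsplit(url).path  # the query string and fragment are not part of path
    last_segment = posixpath.basename(path)  # '' when the path ends in '/'
    if "." not in last_segment:
        return None
    return "." + last_segment.rsplit(".", 1)[1].lower()


print(file_extension_sketch("https://example.com/docs/report.PDF?download=1#p2"))
print(file_extension_sketch("https://example.com/cassini/"))
```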

is_html_extension(ext: str | None) → bool

Return True if ext indicates an HTML-like page (or no extension at all).

Parameters:

ext – Lowercase file extension including leading dot, or None.

Returns:

True for extensions in HTML_EXTENSIONS and for None.

is_http_to_https_redirect(original_url: str, final_url: str) → bool

Return True if original_url and final_url differ only by an http-to-https scheme upgrade.

A pure HTTP-to-HTTPS redirect is one where the original URL uses http and the final URL uses https, with the same host (case-insensitive), path, and query string. Any other difference (different host, path, query, or port) returns False.

Parameters:
  • original_url – The URL before the redirect.

  • final_url – The URL after the redirect.

Returns:

True if the redirect is a simple HTTP-to-HTTPS scheme upgrade.
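A check of this kind could be written as follows. This is a sketch under stated assumptions rather than the library's code: is_scheme_upgrade_sketch is a hypothetical name, and ports are compared as written in the URL, so an explicit port on one side and none on the other counts as a difference.

```python
from urllib.parse import urlsplit


def is_scheme_upgrade_sketch(original_url: str, final_url: str) -> bool:
    """True when only the scheme changes, and it changes from http to https."""
    before, after = urlsplit(original_url), urlsplit(final_url)
    if (before.scheme, after.scheme) != ("http", "https"):
        return False
    return (
        before.hostname == after.hostname  # .hostname is already lowercased
        and before.port == after.port      # assumption: explicit ports must agree
        and before.path == after.path
        and before.query == after.query
    )


print(is_scheme_upgrade_sketch("http://Example.com/a?x=1", "https://example.com/a?x=1"))
print(is_scheme_upgrade_sketch("http://example.com/a", "https://www.example.com/a"))
```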

is_http_url(url: str) → bool

Return True if url uses the http or https scheme.

Parameters:

url – The URL string to test.

Returns:

True for HTTP/HTTPS URLs, False for everything else.

is_same_domain(url: str, root_url: str) → bool

Return True if url has the same host (including port) as root_url.

Comparison is case-insensitive. Scheme is ignored. Subdomains are considered different domains.

Parameters:
  • url – The URL to check.

  • root_url – The reference URL whose host to compare against.

Returns:

True if both URLs share the same host.
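The comparison described above amounts to matching the lowercased network location of both URLs. A minimal sketch, with the hypothetical name same_host_sketch and no claim to match the library's handling of edge cases such as default ports:

```python
from urllib.parse import urlsplit


def same_host_sketch(url: str, root_url: str) -> bool:
    """Compare host and port case-insensitively, ignoring the scheme."""
    return urlsplit(url).netloc.lower() == urlsplit(root_url).netloc.lower()


print(same_host_sketch("http://Example.com/a", "https://example.com/"))
print(same_host_sketch("https://www.example.com/", "https://example.com/"))
```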

is_under_root(url_path: str, root_path: str) → bool

Return True if url_path is at or under root_path on a segment boundary.

Parameters:
  • url_path – The path component of the URL to test.

  • root_path – The root path to test containment against.

Returns:

True if url_path is the same as or under root_path.

matches_prefix(candidate_url: str, prefix_url: str) → bool

Return True if candidate_url matches prefix_url per spec §5.7 rules.

Matching rules:

  • Scheme is ignored (http and https are equivalent).

  • Host comparison is case-insensitive (and must be equal).

  • Path of candidate must equal path of prefix, or start with prefix path followed by / (segment-boundary matching).

Parameters:
  • candidate_url – The URL to test.

  • prefix_url – The prefix URL to match against.

Returns:

True if candidate_url matches prefix_url.
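Segment-boundary matching is what keeps a prefix of /data from matching /database. The sketch below shows one way to express the three rules; matches_prefix_sketch is a hypothetical name, and ignoring a trailing slash on the prefix is an assumption that the source does not state.

```python
from urllib.parse import urlsplit


def matches_prefix_sketch(candidate_url: str, prefix_url: str) -> bool:
    """Scheme-insensitive, host-exact, segment-boundary prefix match."""
    candidate, prefix = urlsplit(candidate_url), urlsplit(prefix_url)
    if candidate.netloc.lower() != prefix.netloc.lower():
        return False
    # Assumption: a trailing slash on the prefix is not significant.
    prefix_path = prefix.path.rstrip("/")
    return candidate.path == prefix_path or candidate.path.startswith(prefix_path + "/")


print(matches_prefix_sketch("http://EXAMPLE.com/data/file.csv", "https://example.com/data"))
print(matches_prefix_sketch("https://example.com/database", "https://example.com/data"))
```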

normalize_internal_url(url: str) → tuple[str, str | None]

Normalize an internal (same-domain) URL to its canonical form.

Applies all transformations from normalize_url() and additionally adds a trailing slash to directory-like paths (no file extension):

  • /cassini → /cassini/

  • /cassini/ → /cassini/ (unchanged)

  • /cassini/index.html → /cassini/ (index file stripped; the result already ends with a slash)

  • /cassini/page.html → /cassini/page.html (has extension, unchanged)

Use this for deduplication and request URLs of internal crawl targets. For external URLs use normalize_url() alone to avoid altering the request path in ways the server may not expect.

Parameters:

url – The URL to normalize.

Returns:

A tuple of (canonical_url, fragment_or_none).

normalize_url(url: str) → tuple[str, str | None]

Normalize a URL to its canonical form.

Transformations applied:

  • Scheme is preserved (http remains http, https remains https).

  • Host is lowercased.

  • Fragment is stripped (returned separately).

  • Query string is preserved as part of the URL identity.

  • Directory-index filenames (index.html, index.php, etc.) are stripped so /cassini/index.html → /cassini/.

Note that bare directory paths without a trailing slash (e.g. /cassini) are left unchanged by this function. Use add_trailing_slash() to add the trailing slash when you know the URL refers to a directory (e.g. for internal crawl targets).

Parameters:

url – The URL to normalize.

Returns:

A tuple of (canonical_url, fragment_or_none). For non-HTTP URLs the original string is returned unchanged with no fragment.
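The transformations listed above can be approximated with urllib.parse. This is a hedged sketch, not the library's implementation: normalize_url_sketch and INDEX_NAMES are hypothetical names, and the index-file set contains only the two names the text mentions explicitly.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumption: the real list of directory-index names is longer; the text
# names only index.html and index.php explicitly.
INDEX_NAMES = {"index.html", "index.php"}


def normalize_url_sketch(url: str):
    """Return (canonical_url, fragment_or_none)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url, None  # non-HTTP URLs pass through unchanged
    path = parts.path
    last_segment = posixpath.basename(path)
    if last_segment in INDEX_NAMES:
        path = path[: -len(last_segment)]  # keeps the trailing '/'
    fragment = parts.fragment or None
    # Scheme and query are preserved, the host is lowercased, the fragment is dropped.
    canonical = urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))
    return canonical, fragment


print(normalize_url_sketch("https://Example.COM/cassini/index.html?x=1#top"))
```

As in the description, a bare /cassini is returned unchanged; adding the trailing slash is a separate step.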

URL decision tree and asset type classification.

class AssetType(*values)

Bases: Enum

Category of a non-HTML asset URL.

DATA = 'Data'
DOCUMENT = 'Document'
IMAGE = 'Image'
INFRASTRUCTURE = 'Infrastructure'
OTHER = 'Other'

class UrlDisposition(*values)

Bases: Enum

How the crawler should handle a given URL (spec §6 decision tree).

ALREADY_VISITED = 'already_visited'
DEPTH_LIMITED = 'depth_limited'
EXTERNAL = 'external'
IGNORED = 'ignored'
INTERNAL_ASSET = 'internal_asset'
INTERNAL_CRAWL = 'internal_crawl'
NON_HTTP = 'non_http'
NO_CRAWL = 'no_crawl'

classify_asset(extension: str) → AssetType

Return the AssetType for a file extension.

Parameters:

extension – Lowercase file extension including leading dot (e.g. '.jpg').

Returns:

The AssetType for the extension.

classify_url(url: str, *, config: CrawlConfig, root_url: str, root_path: str, visited_set: set[str], depth: int) → UrlDisposition

Classify a URL per the spec §6 decision tree.

Steps (in order):

  1. Non-HTTP scheme → UrlDisposition.NON_HTTP

  2. Matches ignore_urls → UrlDisposition.IGNORED

  3. Already visited → UrlDisposition.ALREADY_VISITED

  4. Matches no_crawl_urls → UrlDisposition.NO_CRAWL

  5. External (different domain or above root path) → UrlDisposition.EXTERNAL

  6. Exceeds depth limit → UrlDisposition.DEPTH_LIMITED

  7. Internal HTML/no-extension → UrlDisposition.INTERNAL_CRAWL

  8. Internal asset → UrlDisposition.INTERNAL_ASSET

Parameters:
  • url – The URL to classify (absolute, not yet normalized for query/fragment).

  • config – Current crawl configuration.

  • root_url – Normalized root URL.

  • root_path – Path component of the root URL.

  • visited_set – Set of already-visited canonical URLs.

  • depth – Directory depth of url relative to root.

Returns:

The UrlDisposition for the URL.
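The ordering of the decision tree matters: an ignored URL is reported as ignored even if it has also been visited. The following standalone sketch mirrors the eight steps with simplified inputs. Disposition, under, classify_sketch, and HTML_EXTENSIONS are stand-ins defined here for illustration, and "exceeds depth limit" is assumed to mean depth > max_depth.

```python
import posixpath
from enum import Enum
from urllib.parse import urlsplit


class Disposition(Enum):
    NON_HTTP = "non_http"
    IGNORED = "ignored"
    ALREADY_VISITED = "already_visited"
    NO_CRAWL = "no_crawl"
    EXTERNAL = "external"
    DEPTH_LIMITED = "depth_limited"
    INTERNAL_CRAWL = "internal_crawl"
    INTERNAL_ASSET = "internal_asset"


# Assumption: a stand-in for the library's HTML_EXTENSIONS constant.
HTML_EXTENSIONS = {".html", ".htm"}


def under(url: str, prefix_url: str) -> bool:
    """Scheme-insensitive, segment-boundary prefix match."""
    u, p = urlsplit(url), urlsplit(prefix_url)
    prefix_path = p.path.rstrip("/")
    return u.netloc.lower() == p.netloc.lower() and (
        u.path == prefix_path or u.path.startswith(prefix_path + "/")
    )


def classify_sketch(url, *, root_url, ignore_urls=(), no_crawl_urls=(),
                    visited=frozenset(), depth=0, max_depth=None):
    """Walk the eight steps in order and return the first match."""
    if urlsplit(url).scheme not in ("http", "https"):
        return Disposition.NON_HTTP
    if any(under(url, prefix) for prefix in ignore_urls):
        return Disposition.IGNORED
    if url in visited:
        return Disposition.ALREADY_VISITED
    if any(under(url, prefix) for prefix in no_crawl_urls):
        return Disposition.NO_CRAWL
    if not under(url, root_url):  # different host, or outside the root path
        return Disposition.EXTERNAL
    if max_depth is not None and depth > max_depth:
        return Disposition.DEPTH_LIMITED
    last = posixpath.basename(urlsplit(url).path)
    ext = "." + last.rsplit(".", 1)[1].lower() if "." in last else None
    if ext is None or ext in HTML_EXTENSIONS:
        return Disposition.INTERNAL_CRAWL
    return Disposition.INTERNAL_ASSET


ROOT = "https://example.com/docs/"
print(classify_sketch("https://example.com/docs/guide.html", root_url=ROOT))
print(classify_sketch("https://example.com/docs/img/logo.png", root_url=ROOT))
```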

is_misplaced_asset(url: str, *, config: CrawlConfig, root_url: str, root_path: str) → bool

Return True if url is a misplaced asset per spec §8.6.

An asset is misplaced when all of the following hold:

  1. asset_urls is defined and non-empty.

  2. The asset does not fall under any asset_urls prefix.

  3. The asset is not external (is on the same domain and under root).

  4. The asset is not matched by an ignore_urls prefix.

  5. The asset is not matched by a no_crawl_urls prefix.

Parameters:
  • url – The asset URL to test.

  • config – Current crawl configuration.

  • root_url – Normalized root URL.

  • root_path – Path component of the root URL.

Returns:

True if the asset is misplaced.
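The five conditions translate directly into a chain of early returns. The sketch below is illustrative only; under and is_misplaced_sketch are hypothetical helpers, and the prefix tuples are passed directly instead of through a CrawlConfig.

```python
from urllib.parse import urlsplit


def under(url: str, prefix_url: str) -> bool:
    """Scheme-insensitive, segment-boundary prefix match."""
    u, p = urlsplit(url), urlsplit(prefix_url)
    prefix_path = p.path.rstrip("/")
    return u.netloc.lower() == p.netloc.lower() and (
        u.path == prefix_path or u.path.startswith(prefix_path + "/")
    )


def is_misplaced_sketch(url, *, root_url, asset_urls=(), ignore_urls=(), no_crawl_urls=()):
    if not asset_urls:                                   # condition 1
        return False
    if any(under(url, prefix) for prefix in asset_urls):     # condition 2
        return False
    if not under(url, root_url):                         # condition 3
        return False
    if any(under(url, prefix) for prefix in ignore_urls):    # condition 4
        return False
    if any(under(url, prefix) for prefix in no_crawl_urls):  # condition 5
        return False
    return True


ROOT = "https://example.com/"
ASSETS = ("https://example.com/assets",)
print(is_misplaced_sketch("https://example.com/img/logo.png", root_url=ROOT, asset_urls=ASSETS))
print(is_misplaced_sketch("https://example.com/assets/logo.png", root_url=ROOT, asset_urls=ASSETS))
```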

HTTP client with retry, redirect, SSL, and User-Agent support.

class HttpClient(*, timeout: int, retries: int, user_agent: str, verify: bool | str = True, sleep: Callable[[float], None] | None = None)

Bases: object

HTTP client wrapping requests with retry, redirect, and SSL handling.

Parameters:
  • timeout – Timeout in seconds for each individual request attempt.

  • retries – Maximum number of retry attempts for transient failures.

  • user_agent – User-Agent header value to send with every request.

  • verify – TLS certificate verification. Pass False to disable (e.g. for internal environments with self-signed certificates), or a path to a CA bundle. Defaults to True.

  • sleep – Callable used to pause between retry attempts. Defaults to time.sleep(). Pass a no-op (e.g. lambda _: None) in tests to avoid real waits.

request(url: str, *, method: str = 'HEAD') → RequestResult

Issue an HTTP request, following redirects manually and retrying on transient errors.

If method is 'HEAD' and the server returns 405, automatically retries with 'GET'.

Parameters:
  • url – The URL to request.

  • method – HTTP method string ('HEAD' or 'GET').

Returns:

A RequestResult describing the outcome.
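The HEAD-to-GET fallback and the injectable sleep can be shown without any network access. The sketch below is not the HttpClient implementation: request_sketch and the send callable are hypothetical, retries is assumed to count attempts after the first, the one-second pause is an arbitrary placeholder, and redirect handling is left out.

```python
import time


def request_sketch(send, url, *, method="HEAD", retries=3, sleep=time.sleep):
    """Return (method_used, status_code); `send(method, url)` is a hypothetical transport."""
    attempt = 0
    while True:
        try:
            status = send(method, url)
        except ConnectionError:
            if attempt >= retries:
                raise  # retry budget exhausted
            attempt += 1
            sleep(1.0)  # assumed fixed pause; the real delay policy is not documented here
            continue
        if method == "HEAD" and status == 405:
            method = "GET"  # the server rejects HEAD, so ask again with GET
            continue
        return method, status


# Fake transports keep the demonstration offline.
def head_not_allowed(method, url):
    return 405 if method == "HEAD" else 200


fallback_result = request_sketch(head_not_allowed, "https://example.com/")

pauses = []
outcomes = iter([ConnectionError("reset"), ConnectionError("reset"), 200])


def flaky(method, url):
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


retry_result = request_sketch(flaky, "https://example.com/", sleep=pauses.append)
print(fallback_result, retry_result, pauses)
```

Passing pauses.append as the sleep callable records the requested delays instead of waiting, which is the same idea as the lambda _: None suggested for tests.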

property ssl_warned_domains: set[str]

Domains that have already triggered an SSL warning.

class RedirectHop(url: str, status_code: int)

Bases: object

A single hop in an HTTP redirect chain.

url

The URL of this hop.

Type:

str

status_code

The HTTP status code returned for this hop.

Type:

int

class RequestResult(final_url: str, status_code: int, headers: dict[str, str], body: str | None, content_type: str | None, redirect_chain: list[RedirectHop] = <factory>, error: str | None = None, bytes_downloaded: int = 0)

Bases: object

Result of an HTTP request, including all redirect hops.

final_url

The final URL after following all redirects.

Type:

str

status_code

The final HTTP status code.

Type:

int

headers

Response headers from the final response.

Type:

dict[str, str]

body

Response body text (only populated for GET requests).

Type:

str | None

content_type

Value of the Content-Type header, if present.

Type:

str | None

redirect_chain

List of redirect hops leading to the final URL.

Type:

list[link_checker.http_client.RedirectHop]

error

Human-readable error string if the request failed.

Type:

str | None

bytes_downloaded

Approximate bytes received.

Type:

int

HTML link extraction, anchor collection, and base-href handling.

class ExtractedLink(url: str, source_element: str, source_attribute: str, is_asset: bool)

Bases: object

A link extracted from an HTML page.

url

Absolute URL of the link.

Type:

str

source_element

HTML element name (e.g. "a", "img").

Type:

str

source_attribute

HTML attribute name (e.g. "href", "src").

Type:

str

is_asset

True if from an asset-only element (never crawled).

Type:

bool

extract_anchors(html: str) → frozenset[str]

Extract all anchor IDs from html (id attributes and <a name>).

Parameters:

html – Raw HTML string.

Returns:

Frozen set of anchor identifier strings.
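Collecting anchors in this way can be done with the standard library's HTMLParser. The sketch below is an assumption about one possible approach, not a description of the parser the library uses; AnchorCollector and extract_anchors_sketch are hypothetical names.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Gather id attributes from any element and name attributes from <a>."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.anchors.add(attributes["id"])
        if tag == "a" and attributes.get("name"):
            self.anchors.add(attributes["name"])


def extract_anchors_sketch(html: str) -> frozenset:
    collector = AnchorCollector()
    collector.feed(html)
    return frozenset(collector.anchors)


sample = '<h1 id="intro">Hi</h1><a name="legacy"></a><p>text</p>'
print(sorted(extract_anchors_sketch(sample)))  # ['intro', 'legacy']
```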

extract_links(html: str, page_url: str, *, base_url: str | None = None) → list[ExtractedLink]

Extract all links from html and return them as ExtractedLink objects.

Relative URLs are resolved against base_url (if given) or the <base href> tag in the document, falling back to page_url.

Parameters:
  • html – Raw HTML string to parse.

  • page_url – URL of the page (used for relative URL resolution).

  • base_url – Override for relative URL resolution. Supersedes any <base href> found in the document.

Returns:

List of ExtractedLink with absolute URLs.
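The resolution order (explicit base_url, then <base href>, then page_url) can be sketched with HTMLParser and urljoin. This simplified version only looks at <a href> and returns plain strings; resolve_links_sketch and _LinkParser are hypothetical names and the code is not the library's extractor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "base" and self.base_href is None and attributes.get("href"):
            self.base_href = attributes["href"]  # only the first <base href> counts
        elif tag == "a" and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def resolve_links_sketch(html, page_url, *, base_url=None):
    parser = _LinkParser()
    parser.feed(html)
    # Precedence: explicit base_url, then <base href>, then the page URL.
    # A relative <base href> is itself resolved against the page URL.
    base = base_url or urljoin(page_url, parser.base_href or "")
    return [urljoin(base, href) for href in parser.hrefs]


page = "https://example.com/docs/page.html"
html = '<base href="https://cdn.example.com/site/"><a href="a.html">A</a><a href="/b">B</a>'
print(resolve_links_sketch(html, page))
print(resolve_links_sketch(html, page, base_url="https://override.example/x/"))
```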

find_base_href(html: str) → str | None

Return the first <base href> value from the document <head>.

Parameters:

html – Raw HTML string.

Returns:

The href value, or None if no <base href> is found.

Thread-safe crawl results aggregation.

class BrokenAnchor(target_url: str, referencing_pages: list[str] = <factory>)

Bases: object

A fragment reference that could not be resolved.

target_url

The full URL including fragment.

Type:

str

referencing_pages

Pages containing the broken anchor link.

Type:

list[str]

class BrokenLink(url: str, status_code: int, error: str, referencing_pages: list[str] = <factory>)

Bases: object

A link that returned an error or non-success status.

url

The broken URL.

Type:

str

status_code

HTTP status code (0 = network error).

Type:

int

error

Error description string.

Type:

str

referencing_pages

Pages that contained this broken link.

Type:

list[str]

class CrawlResults

Bases: object

Thread-safe container for all crawl results.

All add_* and record_* methods are safe to call from multiple threads simultaneously.

add_broken_anchor(*, target_url: str, referrer: str) → None

Record a broken anchor (fragment not found in target page).

Parameters:
  • target_url – Full URL including the missing fragment.

  • referrer – Page that contained the link.

add_broken_link(*, url: str, status_code: int, error: str, referrer: str) → None

Record a broken link.

Parameters:
  • url – The broken URL.

  • status_code – HTTP status code.

  • error – Error or status description.

  • referrer – Page that linked to this URL.

add_ignore_match(*, url: str, referrer: str) → None

Record a URL that was ignored.

Parameters:
  • url – The ignored URL.

  • referrer – Page containing the link.

add_misplaced_asset(*, url: str, asset_type: str, referrer: str) → None

Record a misplaced asset.

Parameters:
  • url – The asset URL.

  • asset_type – String label of the asset type.

  • referrer – Page that referenced the asset.

add_no_crawl_match(*, url: str, referrer: str) → None

Record a URL that matched a no-crawl prefix.

Parameters:
  • url – The URL.

  • referrer – Page containing the link.

add_non200(*, url: str, status_code: int, referrer: str) → None

Record a URL that returned a non-200 final status.

Parameters:
  • url – The URL.

  • status_code – HTTP status code.

  • referrer – Page that linked to the URL.

add_non_http_link(*, url: str, scheme: str, referrer: str) → None

Record a non-HTTP scheme link.

Parameters:
  • url – The full non-HTTP URL.

  • scheme – The URL scheme (e.g. 'mailto').

  • referrer – Page containing the link.

add_redirect(*, original_url: str, final_url: str, status_code: int, referrer: str) → None

Record a redirect.

Parameters:
  • original_url – The URL that redirected.

  • final_url – Destination URL after all redirects.

  • status_code – Final HTTP status code.

  • referrer – Page that linked to original_url.

add_ssl_warning(*, url: str, domain: str, error: str, referrer: str) → None

Record an SSL certificate error.

Parameters:
  • url – The URL that triggered the SSL error.

  • domain – The domain of the URL.

  • error – SSL error description.

  • referrer – Page that referenced the URL.

add_unvalidated_anchor(*, target_url: str, reason: str, referrer: str) → None

Record an anchor that could not be validated.

Parameters:
  • target_url – Full URL including fragment.

  • reason – Why validation was skipped ('no-crawl', 'external', 'depth-limited').

  • referrer – Page containing the link.

property broken_anchors: list[BrokenAnchor]

List of broken anchors sorted by target URL.

property broken_links: list[BrokenLink]

List of broken links sorted by URL.

has_problems() → bool

Return True if any crawl problems were found (exit code 1).

Returns:

True if there are broken links, non-200 responses, broken anchors, misplaced assets, or SSL warnings.

property ignore_matches: list[IgnoreMatch]

List of ignore matches sorted by URL.

merge_referrer(url: str, referrer: str) → None

Add referrer to every existing result entry that tracks url.

If no entry for url exists yet (the fetch is still in-flight), the referrer is queued and will be applied automatically when the entry is created by the corresponding add_* call.

Parameters:
  • url – Canonical URL to look up.

  • referrer – Page that linked to url.
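The queue-then-apply behavior can be pictured with a lock-protected container. The sketch below tracks a single result type and is only an illustration of the idea; ResultsSketch and its attributes are hypothetical and do not reflect the internal structure of CrawlResults.

```python
import threading
from collections import defaultdict


class ResultsSketch:
    def __init__(self):
        self._lock = threading.Lock()
        self._broken = {}                  # url -> list of referrers
        self._pending = defaultdict(list)  # url -> referrers waiting for an entry

    def add_broken_link(self, *, url, referrer):
        with self._lock:
            pages = self._broken.setdefault(url, [])
            # Apply the new referrer plus any that were queued earlier.
            for page in [referrer, *self._pending.pop(url, [])]:
                if page not in pages:
                    pages.append(page)

    def merge_referrer(self, url, referrer):
        with self._lock:
            if url in self._broken:
                if referrer not in self._broken[url]:
                    self._broken[url].append(referrer)
            else:
                self._pending[url].append(referrer)  # the fetch is still in flight

    def referrers(self, url):
        with self._lock:
            return list(self._broken.get(url, []))


results = ResultsSketch()
target = "https://example.com/x"
results.merge_referrer(target, "https://example.com/b")  # queued: no entry yet
results.add_broken_link(url=target, referrer="https://example.com/a")
results.merge_referrer(target, "https://example.com/c")  # entry exists: applied at once
print(results.referrers(target))
```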

property misplaced_assets: list[MisplacedAsset]

List of misplaced assets sorted by asset type then URL.

property no_crawl_matches: list[NoCrawlMatch]

List of no-crawl matches sorted by URL.

property non200_responses: list[Non200Response]

List of non-200 responses sorted by status code then URL.

property non_http_links: list[NonHttpLink]

List of non-HTTP scheme links sorted by URL.

record_request(url: str, *, bytes_downloaded: int = 0, crawled: bool = False, external: bool = False) → None

Update statistics for a completed HTTP request.

Parameters:
  • url – The URL that was requested.

  • bytes_downloaded – Bytes received.

  • crawled – True if the page was crawled (GET + parsed).

  • external – True if the URL was external.

property redirects: list[RedirectInfo]

List of redirects sorted by original URL.

property ssl_warnings: list[SslWarning]

List of SSL warnings sorted by domain.

property statistics: CrawlStatistics

Snapshot of the crawl statistics.

property unvalidated_anchors: list[UnvalidatedAnchor]

List of unvalidated anchors sorted by target URL.

class CrawlStatistics(start_time: float = <factory>, total_requests: int = 0, bytes_downloaded: int = 0, pages_crawled: int = 0, pages_checked: int = 0, external_checked: int = 0, per_domain_requests: dict[str, int] = <factory>)

Bases: object

Aggregated crawl statistics.

start_time

UNIX timestamp when the crawl started.

Type:

float

total_requests

Total HTTP requests issued.

Type:

int

bytes_downloaded

Total bytes received.

Type:

int

pages_crawled

Internal pages fetched with GET and parsed.

Type:

int

pages_checked

Internal pages checked without parsing.

Type:

int

external_checked

External URLs checked.

Type:

int

per_domain_requests

Dict mapping domain → request count.

Type:

dict[str, int]

class IgnoreMatch(url: str, referencing_pages: list[str] = <factory>)

Bases: object

A URL that was ignored due to matching an ignore_urls prefix.

url

The ignored URL.

Type:

str

referencing_pages

Pages containing this URL.

Type:

list[str]

class MisplacedAsset(url: str, asset_type: str, referencing_pages: list[str] = <factory>)

Bases: object

An asset found outside its expected asset_urls prefixes.

url

The asset URL.

Type:

str

asset_type

String name of the asset type category.

Type:

str

referencing_pages

Pages that linked to this asset.

Type:

list[str]

class NoCrawlMatch(url: str, referencing_pages: list[str] = <factory>)

Bases: object

A URL that matched a no_crawl_urls prefix.

url

The URL.

Type:

str

referencing_pages

Pages containing this URL.

Type:

list[str]

class Non200Response(url: str, status_code: int, referencing_pages: list[str] = <factory>)

Bases: object

A URL that returned a non-200 final status.

url

The URL.

Type:

str

status_code

HTTP status code.

Type:

int

referencing_pages

Pages that contained this link.

Type:

list[str]

class NonHttpLink(url: str, scheme: str, referencing_pages: list[str] = <factory>)

Bases: object

A non-HTTP scheme link encountered during crawl.

url

The non-HTTP URL.

Type:

str

scheme

The scheme (e.g. 'mailto', 'tel').

Type:

str

referencing_pages

Pages that contained this link.

Type:

list[str]

class RedirectInfo(original_url: str, final_url: str, status_code: int, referencing_pages: list[str] = <factory>)

Bases: object

A URL that redirected to another location.

original_url

The URL that redirected.

Type:

str

final_url

The URL after all redirects.

Type:

str

status_code

The HTTP status code of the first redirect hop (e.g. 301, 302).

Type:

int

referencing_pages

Pages that contained the original URL.

Type:

list[str]

class SslWarning(domain: str, error: str, affected_urls: list[tuple[str, list[str]]] = <factory>)

Bases: object

An SSL error recorded for a domain.

domain

The domain that produced the SSL error.

Type:

str

error

Error message.

Type:

str

affected_urls

List of (url, referencing_pages) tuples.

Type:

list[tuple[str, list[str]]]

class UnvalidatedAnchor(target_url: str, reason: str, referencing_pages: list[str] = <factory>)

Bases: object

A fragment reference that could not be validated (no HTML was parsed).

target_url

The full URL including fragment.

Type:

str

reason

Why validation was skipped ('no-crawl', 'external', 'depth-limited').

Type:

str

referencing_pages

Pages containing this anchor link.

Type:

list[str]

Main crawl engine with thread pool, visit-once logic, and result aggregation.

class Crawler(config: CrawlConfig, progress: ProgressReporter | None = None, sleep: Callable[[float], None] | None = None)

Bases: object

Main crawl engine.

Uses a ThreadPoolExecutor to process URLs concurrently. Enforces visit-once semantics, depth limits, and the remaining rules from spec §5 through §9.

Parameters:
  • config – Crawl configuration.

  • progress – Optional progress reporter to update during the crawl.

  • sleep – Callable used for inter-retry pauses inside the HTTP client. Defaults to time.sleep(). Pass lambda _: None in tests to make retries instantaneous.

abort() → None

Signal the crawl to stop after in-flight requests complete.

Safe to call from any thread (e.g. a signal handler). Already-submitted workers finish naturally; no new URLs are dequeued or requested.
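A cooperative stop of this kind is commonly built on threading.Event. The sketch below is a single-threaded illustration of the pattern, not the Crawler's implementation; AbortableLoop and its attributes are hypothetical names.

```python
import threading
from collections import deque


class AbortableLoop:
    def __init__(self, urls):
        self._queue = deque(urls)
        self._abort = threading.Event()
        self.processed = []

    def abort(self):
        self._abort.set()  # Event.set() may be called from any thread

    def run(self, handle):
        # The flag is checked before each dequeue, so work already started
        # finishes while nothing new is taken from the queue.
        while self._queue and not self._abort.is_set():
            url = self._queue.popleft()
            handle(url)
            self.processed.append(url)


loop = AbortableLoop(["a", "b", "c", "d"])


def handle(url):
    if url == "b":
        loop.abort()  # request a stop while "b" is still being handled


loop.run(handle)
print(loop.processed)  # ['a', 'b']
```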

crawl() → CrawlResults

Run the full crawl starting from config.root_url.

Returns:

CrawlResults with all findings.

property results: CrawlResults

Return the accumulated crawl results.

May be partial if called after abort() before crawl() has returned.

Plain-text report generator for the 11 report sections.

generate_report(results: CrawlResults, config: CrawlConfig) → str

Generate the full plain-text report for a completed crawl.

Produces 11 sections covering configuration, statistics, broken links, broken anchors, non-200 responses, redirects, misplaced assets, ignored URLs, non-HTTP links, SSL warnings, and unvalidated anchors.

Parameters:
  • results – Completed crawl results.

  • config – Crawl configuration used.

Returns:

The full report as a multi-line string.

Periodic stderr progress reporting during crawl.

class ProgressReporter(interval: float = 5.0, output: Callable[[str], None] | None = None)

Bases: object

Emits periodic progress updates to stderr.

Updates are written approximately every interval seconds.

Parameters:
  • interval – Time in seconds between progress updates.

  • output – Callable that accepts a string and writes it somewhere. Defaults to printing to sys.stderr.

start() → None

Start emitting periodic progress updates.

stop() → None

Stop emitting progress updates.

update(*, checked: int, queued: int, active_threads: int, elapsed: float) → None

Update the current progress values.

Parameters:
  • checked – Number of URLs checked so far.

  • queued – Number of URLs currently in queue.

  • active_threads – Number of active worker threads.

  • elapsed – Elapsed time in seconds.


© Copyright 2026, SETI Institute.
