API Reference
rms-link-checker package.
Website crawler, link checker, and content analyzer.
Command-line entry point for rms-link-checker.
Configuration dataclass and YAML loading for rms-link-checker.
- class CrawlConfig(root_url: str, timeout: int = 10, retries: int = 3, max_requests: int | None = None, max_depth: int | None = None, max_threads: int = 10, max_referencing_pages: int = 10, log_level: str = 'INFO', output: str | None = None, log_file: str | None = None, asset_urls: tuple[str, ...]=<factory>, no_crawl_urls: tuple[str, ...]=<factory>, ignore_urls: tuple[str, ...]=<factory>, verify: bool | str = True, ignore_http_to_https_redirects: bool = False)[source]
Bases:
object
Immutable crawl configuration.
- root_url
The root URL to begin crawling from.
- Type:
str
- timeout
Timeout in seconds for each HTTP request.
- Type:
int
- retries
Number of retry attempts for transient failures.
- Type:
int
- max_requests
Maximum total HTTP requests (None = unlimited).
- Type:
int | None
- max_depth
Maximum directory depth (None = unlimited).
- Type:
int | None
- max_threads
Maximum concurrent threads.
- Type:
int
- max_referencing_pages
Max referencing pages per URL in report.
- Type:
int
- log_level
Minimum log level string.
- Type:
str
- output
File path for report (None = stdout).
- Type:
str | None
- log_file
File path for log messages (None = stderr).
- Type:
str | None
- verify
TLS certificate verification. True (default) uses the system CA bundle; False disables verification (insecure); a string is treated as a path to a CA-bundle file.
- ignore_http_to_https_redirects
When True, redirects where only the scheme changes from http to https (same host, path, and query) are silently dropped from the Redirects section of the report. Defaults to False.
- Type:
bool
- ignore_http_to_https_redirects: bool = False
- log_level: str = 'INFO'
- max_referencing_pages: int = 10
- max_threads: int = 10
- retries: int = 3
- root_url: str
- timeout: int = 10
- load_config(cli_namespace: Namespace, *, config_path: str | None = None) CrawlConfig[source]
Build a CrawlConfig by merging a YAML file and CLI overrides.
Precedence (highest to lowest): CLI arguments > YAML file > defaults.
- Parameters:
cli_namespace – Parsed argparse namespace. Fields set to None are treated as “not specified” and do not override YAML or defaults.
config_path – Path to a YAML configuration file. If None, no file is loaded. May also be read from cli_namespace.config_file.
- Returns:
A fully populated CrawlConfig instance.
- Raises:
ValueError – If required fields are missing, the config file cannot be parsed, or any value fails validation.
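The precedence rule can be illustrated with plain dictionaries. This is a minimal sketch, not the package's implementation: `merge_settings` and `DEFAULTS` are hypothetical names, the YAML file is assumed to be already parsed into a dict, and only three fields are shown.

```python
import argparse
from typing import Any

# Hypothetical defaults, copied from the CrawlConfig signature.
DEFAULTS: dict[str, Any] = {"timeout": 10, "retries": 3, "max_threads": 10}


def merge_settings(cli: argparse.Namespace, yaml_values: dict[str, Any]) -> dict[str, Any]:
    """Merge settings with precedence CLI > YAML > defaults."""
    merged = dict(DEFAULTS)
    merged.update(yaml_values)  # YAML values override defaults
    for key, value in vars(cli).items():
        if value is not None:  # None means "not specified on the command line"
            merged[key] = value
    return merged


cli = argparse.Namespace(timeout=30, retries=None)
settings = merge_settings(cli, {"timeout": 20, "retries": 5})
print(settings)
```

Here `timeout` comes from the CLI, `retries` from the YAML values (the CLI left it as None), and `max_threads` from the defaults.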
URL normalization, prefix matching, and classification utilities.
- add_trailing_slash(url: str) str[source]
Ensure a URL path that has no file extension ends with /.
This should be applied to internal URLs (same domain as the crawl root) to canonicalize bare directory paths:
- /cassini → /cassini/
- /cassini/ → /cassini/ (unchanged)
- /cassini/page.html → /cassini/page.html (has extension, unchanged)
- /data.csv → /data.csv (has extension, unchanged)
- /path/.htaccess → /path/.htaccess (dotfile, treated as a file)
Leading-dot filenames (e.g. .htaccess, .gitignore) are considered files and do not receive a trailing slash.
Apply after normalize_url() so that index-file stripping has already run (/cassini/index.html → /cassini/ → unchanged here).
- Parameters:
url – An already-normalized http/https URL.
- Returns:
The URL with a trailing slash added to any extension-free path.
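The rule described above can be approximated with the standard library. The following is a sketch under the assumption that any dot in the last path segment marks a file; `add_trailing_slash_sketch` is a hypothetical name and the real function may differ in edge cases.

```python
from urllib.parse import urlsplit, urlunsplit


def add_trailing_slash_sketch(url: str) -> str:
    """Append "/" to the path when its last segment looks like a directory."""
    parts = urlsplit(url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    # Assumption: a dot anywhere in the last segment (including a leading
    # dot, as in ".htaccess") means the segment names a file.
    if path.endswith("/") or "." in last_segment:
        return url
    return urlunsplit(parts._replace(path=path + "/"))


for example in ("https://example.org/cassini",
                "https://example.org/cassini/page.html",
                "https://example.org/path/.htaccess"):
    print(example, "->", add_trailing_slash_sketch(example))
```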
- get_depth(url_path: str, root_path: str) int[source]
Return the directory depth of url_path relative to root_path.
Depth 0 is the root page itself (or its trailing-slash variant). Each additional directory segment adds 1.
- Parameters:
url_path – The path to measure.
root_path – The root path to measure from.
- Returns:
Integer depth >= 0.
- get_file_extension(url: str) str | None[source]
Return the lowercase file extension from the URL path, or None.
Only inspects the path component (ignores query string and fragment). Returns None for paths ending in / or having no dot in the last segment.
- Parameters:
url – The URL to inspect.
- Returns:
Lowercase extension including the leading dot (e.g. '.pdf'), or None if there is no extension.
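A possible reading of this behaviour, sketched with `urllib.parse`. The name `file_extension_sketch` is hypothetical, and how the real function treats dotfiles or trailing dots is not stated here, so those cases are left out.

```python
from typing import Optional
from urllib.parse import urlsplit


def file_extension_sketch(url: str) -> Optional[str]:
    """Return the lowercase extension of the last path segment, or None."""
    path = urlsplit(url).path  # query string and fragment are ignored
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return None  # also covers paths ending in "/" (empty last segment)
    return "." + last_segment.rsplit(".", 1)[-1].lower()


print(file_extension_sketch("https://example.org/docs/Report.PDF?x=1#top"))
print(file_extension_sketch("https://example.org/docs/"))
```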
- is_html_extension(ext: str | None) bool[source]
Return True if ext indicates an HTML-like page (or no extension at all).
- Parameters:
ext – Lowercase file extension including leading dot, or None.
- Returns:
True for extensions in HTML_EXTENSIONS and for None.
- is_http_to_https_redirect(original_url: str, final_url: str) bool[source]
Return True if original_url and final_url differ only in scheme upgrade.
A pure HTTP-to-HTTPS redirect is one where the original URL uses http and the final URL uses https, with the same host (case-insensitive), path, and query string. Any other difference (different host, path, query, or port) returns False.
- Parameters:
original_url – The URL before the redirect.
final_url – The URL after the redirect.
- Returns:
True if the redirect is a simple HTTP-to-HTTPS scheme upgrade.
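The comparison can be sketched component by component. `is_scheme_upgrade` is a hypothetical name, and treating an explicit port as different from an omitted one is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def is_scheme_upgrade(original_url: str, final_url: str) -> bool:
    """True when the only change is http -> https."""
    before, after = urlsplit(original_url), urlsplit(final_url)
    return (
        before.scheme == "http"
        and after.scheme == "https"
        and before.hostname == after.hostname  # urlsplit lowercases hostname
        and before.port == after.port          # assumption: compared as written
        and before.path == after.path
        and before.query == after.query
    )


print(is_scheme_upgrade("http://Example.org/a?x=1", "https://example.org/a?x=1"))
print(is_scheme_upgrade("http://example.org/a", "https://www.example.org/a"))
```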
- is_http_url(url: str) bool[source]
Return True if url uses the http or https scheme.
- Parameters:
url – The URL string to test.
- Returns:
True for HTTP/HTTPS URLs, False for everything else.
- is_same_domain(url: str, root_url: str) bool[source]
Return True if url has the same host (including port) as root_url.
Comparison is case-insensitive. Scheme is ignored. Subdomains are considered different domains.
- Parameters:
url – The URL to check.
root_url – The reference URL whose host to compare against.
- Returns:
True if both URLs share the same host.
- is_under_root(url_path: str, root_path: str) bool[source]
Return True if url_path is at or under root_path on a segment boundary.
- Parameters:
url_path – The path component of the URL to test.
root_path – The root path to test containment against.
- Returns:
True if url_path is the same as or under root_path.
- matches_prefix(candidate_url: str, prefix_url: str) bool[source]
Return True if candidate_url matches prefix_url per spec §5.7 rules.
Matching rules:
Scheme is ignored (http and https are equivalent).
Host comparison is case-insensitive (and must be equal).
Path of candidate must equal path of prefix, or start with prefix path followed by / (segment-boundary matching).
- Parameters:
candidate_url – The URL to test.
prefix_url – The prefix URL to match against.
- Returns:
True if candidate_url matches prefix_url.
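Segment-boundary matching is what keeps /docs from matching /docs-old. A minimal sketch of the three rules follows; `matches_prefix_sketch` is a hypothetical name, and stripping a trailing slash from the prefix path before comparing is an assumption.

```python
from urllib.parse import urlsplit


def matches_prefix_sketch(candidate_url: str, prefix_url: str) -> bool:
    """Scheme-insensitive, host-insensitive, segment-boundary prefix match."""
    candidate, prefix = urlsplit(candidate_url), urlsplit(prefix_url)
    if candidate.netloc.lower() != prefix.netloc.lower():
        return False  # hosts must be equal; scheme is never compared
    prefix_path = prefix.path.rstrip("/")  # assumption: "/docs/" behaves like "/docs"
    return candidate.path == prefix.path or candidate.path.startswith(prefix_path + "/")


print(matches_prefix_sketch("https://Example.org/docs/a.html", "http://example.org/docs"))
print(matches_prefix_sketch("http://example.org/docs-old/a.html", "http://example.org/docs"))
```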
- normalize_internal_url(url: str) tuple[str, str | None][source]
Normalize an internal (same-domain) URL to its canonical form.
Applies all transformations from normalize_url() and additionally adds a trailing slash to directory-like paths (no file extension):
- /cassini → /cassini/
- /cassini/ → /cassini/ (unchanged)
- /cassini/index.html → /cassini/ (index stripped + already slash)
- /cassini/page.html → /cassini/page.html (has extension, unchanged)
Use this for deduplication and request URLs of internal crawl targets. For external URLs use normalize_url() alone to avoid altering the request path in ways the server may not expect.
- Parameters:
url – The URL to normalize.
- Returns:
A tuple of (canonical_url, fragment_or_none).
- normalize_url(url: str) tuple[str, str | None][source]
Normalize a URL to its canonical form.
Transformations applied:
- Scheme is preserved (http remains http, https remains https).
- Host is lowercased.
- Fragment is stripped (returned separately).
- Query string is preserved as part of the URL identity.
- Directory-index filenames (index.html, index.php, etc.) are stripped so /cassini/index.html → /cassini/.
Note that bare directory paths without a trailing slash (e.g. /cassini) are left unchanged by this function. Use add_trailing_slash() to add the trailing slash when you know the URL refers to a directory (e.g. for internal crawl targets).
- Parameters:
url – The URL to normalize.
- Returns:
A tuple of (canonical_url, fragment_or_none). For non-HTTP URLs the original string is returned unchanged with no fragment.
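The listed transformations could be sketched as follows. `normalize_url_sketch` and `INDEX_FILES` are hypothetical names; the set of index filenames is an assumption, since only index.html and index.php are named explicitly.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumed set of directory-index filenames (the full list is not given here).
INDEX_FILES = {"index.html", "index.php"}


def normalize_url_sketch(url: str) -> tuple[str, Optional[str]]:
    """Return (canonical_url, fragment_or_none)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url, None  # non-HTTP URLs are returned unchanged
    directory, _, filename = parts.path.rpartition("/")
    path = directory + "/" if filename in INDEX_FILES else parts.path
    # Scheme and query are kept, host is lowercased, fragment is dropped.
    canonical = urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))
    return canonical, parts.fragment or None


print(normalize_url_sketch("http://Example.ORG/cassini/index.html?x=1#top"))
```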
URL decision tree and asset type classification.
- class AssetType(*values)[source]
Bases:
Enum
Category of a non-HTML asset URL.
- DATA = 'Data'
- DOCUMENT = 'Document'
- IMAGE = 'Image'
- INFRASTRUCTURE = 'Infrastructure'
- OTHER = 'Other'
- class UrlDisposition(*values)[source]
Bases:
Enum
How the crawler should handle a given URL (spec §6 decision tree).
- ALREADY_VISITED = 'already_visited'
- DEPTH_LIMITED = 'depth_limited'
- EXTERNAL = 'external'
- IGNORED = 'ignored'
- INTERNAL_ASSET = 'internal_asset'
- INTERNAL_CRAWL = 'internal_crawl'
- NON_HTTP = 'non_http'
- NO_CRAWL = 'no_crawl'
- classify_asset(extension: str) AssetType[source]
Return the AssetType for a file extension.
- Parameters:
extension – Lowercase file extension including leading dot (e.g. '.jpg').
- Returns:
The AssetType for the extension.
- classify_url(url: str, *, config: CrawlConfig, root_url: str, root_path: str, visited_set: set[str], depth: int) UrlDisposition[source]
Classify a URL per the spec §6 decision tree.
Steps (in order):
1. Non-HTTP scheme → UrlDisposition.NON_HTTP
2. Matches ignore_urls → UrlDisposition.IGNORED
3. Already visited → UrlDisposition.ALREADY_VISITED
4. Matches no_crawl_urls → UrlDisposition.NO_CRAWL
5. External (different domain or above root path) → UrlDisposition.EXTERNAL
6. Exceeds depth limit → UrlDisposition.DEPTH_LIMITED
7. Internal HTML/no-extension → UrlDisposition.INTERNAL_CRAWL
8. Internal asset → UrlDisposition.INTERNAL_ASSET
- Parameters:
url – The URL to classify (absolute, not yet normalized for query/fragment).
config – Current crawl configuration.
root_url – Normalized root URL.
root_path – Path component of the root URL.
visited_set – Set of already-visited canonical URLs.
depth – Directory depth of url relative to root.
- Returns:
The UrlDisposition for the URL.
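The ordering of the eight steps matters: an ignored URL is reported as ignored even if it has already been visited. The sketch below mirrors that order with flattened keyword arguments instead of a CrawlConfig. `Disposition`, `under`, and `classify_sketch` are hypothetical names, the HTML extension tuple is an assumption, and the visited check compares the URL string as given.

```python
from enum import Enum
from urllib.parse import urlsplit


class Disposition(Enum):
    NON_HTTP = "non_http"
    IGNORED = "ignored"
    ALREADY_VISITED = "already_visited"
    NO_CRAWL = "no_crawl"
    EXTERNAL = "external"
    DEPTH_LIMITED = "depth_limited"
    INTERNAL_CRAWL = "internal_crawl"
    INTERNAL_ASSET = "internal_asset"


def under(url: str, prefix: str) -> bool:
    """Segment-boundary prefix match on host and path (scheme ignored)."""
    u, p = urlsplit(url), urlsplit(prefix)
    base = p.path.rstrip("/")
    return u.netloc.lower() == p.netloc.lower() and (
        u.path == p.path or u.path.startswith(base + "/")
    )


def classify_sketch(url, *, root_url, ignore_urls=(), no_crawl_urls=(),
                    visited=frozenset(), depth=0, max_depth=None,
                    html_extensions=(".html", ".htm")):
    """Walk the decision steps in order and return the first match."""
    if urlsplit(url).scheme not in ("http", "https"):
        return Disposition.NON_HTTP
    if any(under(url, prefix) for prefix in ignore_urls):
        return Disposition.IGNORED
    if url in visited:
        return Disposition.ALREADY_VISITED
    if any(under(url, prefix) for prefix in no_crawl_urls):
        return Disposition.NO_CRAWL
    if not under(url, root_url):  # different host or above the root path
        return Disposition.EXTERNAL
    if max_depth is not None and depth > max_depth:
        return Disposition.DEPTH_LIMITED
    last = urlsplit(url).path.rsplit("/", 1)[-1]
    ext = "." + last.rsplit(".", 1)[-1].lower() if "." in last else None
    if ext is None or ext in html_extensions:
        return Disposition.INTERNAL_CRAWL
    return Disposition.INTERNAL_ASSET


ROOT = "https://example.org/site/"
print(classify_sketch("https://example.org/site/img/logo.png", root_url=ROOT))
print(classify_sketch("https://example.org/site/about", root_url=ROOT))
```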
- is_misplaced_asset(url: str, *, config: CrawlConfig, root_url: str, root_path: str) bool[source]
Return True if url is a misplaced asset per spec §8.6.
An asset is misplaced when all of the following hold:
1. asset_urls is defined and non-empty.
2. The asset does not fall under any asset_urls prefix.
3. The asset is not external (is on the same domain and under root).
4. The asset is not matched by an ignore_urls prefix.
5. The asset is not matched by a no_crawl_urls prefix.
- Parameters:
url – The asset URL to test.
config – Current crawl configuration.
root_url – Normalized root URL.
root_path – Path component of the root URL.
- Returns:
True if the asset is misplaced.
HTTP client with retry, redirect, SSL, and User-Agent support.
- class HttpClient(*, timeout: int, retries: int, user_agent: str, verify: bool | str = True, sleep: Callable[[float], None] | None = None)[source]
Bases:
object
HTTP client wrapping requests with retry, redirect, and SSL handling.
- Parameters:
timeout – Timeout in seconds for each individual request attempt.
retries – Maximum number of retry attempts for transient failures.
user_agent – User-Agent header value to send with every request.
verify – TLS certificate verification. Pass False to disable (e.g. for internal environments with self-signed certificates), or a path to a CA bundle. Defaults to True.
sleep – Callable used to pause between retry attempts. Defaults to time.sleep(). Pass a no-op (e.g. lambda _: None) in tests to avoid real waits.
- request(url: str, *, method: str = 'HEAD') RequestResult[source]
Issue an HTTP request, following redirects manually and retrying on transient errors.
If method is 'HEAD' and the server returns 405, automatically retries with 'GET'.
- Parameters:
url – The URL to request.
method – HTTP method string ('HEAD' or 'GET').
- Returns:
A RequestResult describing the outcome.
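The HEAD-to-GET fallback and the injectable sleep can be shown without any network access. In this sketch `request_with_fallback` and the `send` transport callable are hypothetical, the exponential backoff schedule is an assumption, and redirect handling is omitted.

```python
import time
from typing import Callable


def request_with_fallback(
    send: Callable[[str, str], int],
    url: str,
    *,
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, int]:
    """Return (method_used, status_code), retrying transient failures."""
    method = "HEAD"
    attempt = 0
    while True:
        try:
            status = send(method, url)
        except ConnectionError:
            if attempt >= retries:
                raise
            attempt += 1
            sleep(2.0 ** attempt)  # assumed backoff; the real schedule may differ
            continue
        if status == 405 and method == "HEAD":
            method = "GET"  # server rejects HEAD, so try once more with GET
            continue
        return method, status


calls: list[str] = []
waits: list[float] = []


def fake_send(method: str, url: str) -> int:
    """Fail once, then answer 405 to HEAD and 200 to GET."""
    calls.append(method)
    if len(calls) == 1:
        raise ConnectionError("transient failure")
    return 405 if method == "HEAD" else 200


result = request_with_fallback(fake_send, "https://example.org/", sleep=waits.append)
print(result, calls, waits)
```

Passing `waits.append` as the sleep callable records the requested pauses instead of waiting, which is the same testing idea as the `lambda _: None` no-op mentioned for the `sleep` parameter.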
- class RedirectHop(url: str, status_code: int)[source]
Bases:
object
A single hop in an HTTP redirect chain.
- url
The URL of this hop.
- Type:
str
- status_code
The HTTP status code returned for this hop.
- Type:
int
- status_code: int
- url: str
- class RequestResult(final_url: str, status_code: int, headers: dict[str, str], body: str | None, content_type: str | None, redirect_chain: list[~link_checker.http_client.RedirectHop] = <factory>, error: str | None = None, bytes_downloaded: int = 0)[source]
Bases:
object
Result of an HTTP request, including all redirect hops.
- final_url
The final URL after following all redirects.
- Type:
str
- status_code
The final HTTP status code.
- Type:
int
- body
Response body text (only populated for GET requests).
- Type:
str | None
- content_type
Value of the Content-Type header, if present.
- Type:
str | None
- redirect_chain
List of redirect hops leading to the final URL.
- Type:
list[link_checker.http_client.RedirectHop]
- error
Human-readable error string if the request failed.
- Type:
str | None
- bytes_downloaded
Approximate bytes received.
- Type:
int
- bytes_downloaded: int = 0
- final_url: str
- redirect_chain: list[RedirectHop]
- status_code: int
HTML link extraction, anchor collection, and base-href handling.
- class ExtractedLink(url: str, source_element: str, source_attribute: str, is_asset: bool)[source]
Bases:
object
A link extracted from an HTML page.
- url
Absolute URL of the link.
- Type:
str
- source_element
HTML element name (e.g. "a", "img").
- Type:
str
- source_attribute
HTML attribute name (e.g. "href", "src").
- Type:
str
- is_asset
True if from an asset-only element (never crawled).
- Type:
bool
- is_asset: bool
- source_attribute: str
- source_element: str
- url: str
- extract_anchors(html: str) frozenset[str][source]
Extract all anchor IDs from html (id attributes and <a name>).
- Parameters:
html – Raw HTML string.
- Returns:
Frozen set of anchor identifier strings.
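Anchor collection of this kind can be sketched with the standard-library `html.parser`. The parser the package actually uses is not stated here, and `extract_anchors_sketch` is a hypothetical name.

```python
from html.parser import HTMLParser


class _AnchorCollector(HTMLParser):
    """Collect id attributes on any element and name attributes on <a>."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue  # skip empty or valueless attributes
            if name == "id" or (tag == "a" and name == "name"):
                self.anchors.add(value)


def extract_anchors_sketch(html: str) -> frozenset[str]:
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.anchors)


print(sorted(extract_anchors_sketch('<h1 id="top">T</h1><a name="old"></a>')))
```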
- extract_links(html: str, page_url: str, *, base_url: str | None = None) list[ExtractedLink][source]
Extract all links from html and return them as ExtractedLink objects.
Relative URLs are resolved against base_url (if given) or the <base href> tag in the document, falling back to page_url.
- Parameters:
html – Raw HTML string to parse.
page_url – URL of the page (used for relative URL resolution).
base_url – Override for relative URL resolution. Supersedes any <base href> found in the document.
- Returns:
List of ExtractedLink with absolute URLs.
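The resolution precedence (explicit base_url, then the document's <base href>, then page_url) can be sketched with `urllib.parse.urljoin`. `resolve_link` and its `document_base` parameter are hypothetical; the sketch assumes the <base href> value is already absolute.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_link(href: str, page_url: str, *,
                 document_base: Optional[str] = None,
                 base_url: Optional[str] = None) -> str:
    """Resolve href against the highest-precedence base that is available."""
    base = base_url or document_base or page_url
    return urljoin(base, href)


page = "https://example.org/docs/page.html"
print(resolve_link("img/a.png", page))
print(resolve_link("img/a.png", page, document_base="https://cdn.example.org/assets/"))
```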
- find_base_href(html: str) str | None[source]
Return the first <base href> value from the document <head>.
- Parameters:
html – Raw HTML string.
- Returns:
The href value, or None if no <base href> is found.
Thread-safe crawl results aggregation.
- class BrokenAnchor(target_url: str, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A fragment reference that could not be resolved.
- target_url
The full URL including fragment.
- Type:
str
- target_url: str
- class BrokenLink(url: str, status_code: int, error: str, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A link that returned an error or non-success status.
- url
The broken URL.
- Type:
str
- status_code
HTTP status code (0 = network error).
- Type:
int
- error
Error description string.
- Type:
str
- error: str
- status_code: int
- url: str
- class CrawlResults[source]
Bases:
object
Thread-safe container for all crawl results.
All add_* and record_* methods are safe to call from multiple threads simultaneously.
- add_broken_anchor(*, target_url: str, referrer: str) None[source]
Record a broken anchor (fragment not found in target page).
- Parameters:
target_url – Full URL including the missing fragment.
referrer – Page that contained the link.
- add_broken_link(*, url: str, status_code: int, error: str, referrer: str) None[source]
Record a broken link.
- Parameters:
url – The broken URL.
status_code – HTTP status code.
error – Error or status description.
referrer – Page that linked to this URL.
- add_ignore_match(*, url: str, referrer: str) None[source]
Record a URL that was ignored.
- Parameters:
url – The ignored URL.
referrer – Page containing the link.
- add_misplaced_asset(*, url: str, asset_type: str, referrer: str) None[source]
Record a misplaced asset.
- Parameters:
url – The asset URL.
asset_type – String label of the asset type.
referrer – Page that referenced the asset.
- add_no_crawl_match(*, url: str, referrer: str) None[source]
Record a URL that matched a no-crawl prefix.
- Parameters:
url – The URL.
referrer – Page containing the link.
- add_non200(*, url: str, status_code: int, referrer: str) None[source]
Record a URL that returned a non-200 final status.
- Parameters:
url – The URL.
status_code – HTTP status code.
referrer – Page that linked to the URL.
- add_non_http_link(*, url: str, scheme: str, referrer: str) None[source]
Record a non-HTTP scheme link.
- Parameters:
url – The full non-HTTP URL.
scheme – The URL scheme (e.g. 'mailto').
referrer – Page containing the link.
- add_redirect(*, original_url: str, final_url: str, status_code: int, referrer: str) None[source]
Record a redirect.
- Parameters:
original_url – The URL that redirected.
final_url – Destination URL after all redirects.
status_code – Final HTTP status code.
referrer – Page that linked to original_url.
- add_ssl_warning(*, url: str, domain: str, error: str, referrer: str) None[source]
Record an SSL certificate error.
- Parameters:
url – The URL that triggered the SSL error.
domain – The domain of the URL.
error – SSL error description.
referrer – Page that referenced the URL.
- add_unvalidated_anchor(*, target_url: str, reason: str, referrer: str) None[source]
Record an anchor that could not be validated.
- Parameters:
target_url – Full URL including fragment.
reason – Why validation was skipped ('no-crawl', 'external', 'depth-limited').
referrer – Page containing the link.
- property broken_anchors: list[BrokenAnchor]
List of broken anchors sorted by target URL.
- property broken_links: list[BrokenLink]
List of broken links sorted by URL.
- has_problems() bool[source]
Return True if any crawl problems were found (exit code 1).
- Returns:
True if there are broken links, non-200 responses, broken anchors, misplaced assets, or SSL warnings.
- property ignore_matches: list[IgnoreMatch]
List of ignore matches sorted by URL.
- merge_referrer(url: str, referrer: str) None[source]
Add referrer to every existing result entry that tracks url.
If no entry for url exists yet (the fetch is still in-flight), the referrer is queued and will be applied automatically when the entry is created by the corresponding add_* call.
- Parameters:
url – Canonical URL to look up.
referrer – Page that linked to url.
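The queue-then-apply behaviour can be modelled with a lock and a pending map. This is only an illustration of the idea: `ReferrerTracker`, `add_entry`, and `referrers` are hypothetical names, and a single entry list stands in for the various result types.

```python
import threading
from collections import defaultdict


class ReferrerTracker:
    """Track referrers per URL, queueing those that arrive before the entry."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[str, list[str]] = {}
        self._pending: dict[str, list[str]] = defaultdict(list)

    def add_entry(self, url: str, referrer: str) -> None:
        with self._lock:
            pages = self._entries.setdefault(url, [])
            # Apply the creating referrer first, then any queued ones.
            for page in [referrer, *self._pending.pop(url, [])]:
                if page not in pages:
                    pages.append(page)

    def merge_referrer(self, url: str, referrer: str) -> None:
        with self._lock:
            if url in self._entries:
                if referrer not in self._entries[url]:
                    self._entries[url].append(referrer)
            else:
                self._pending[url].append(referrer)  # fetch still in flight

    def referrers(self, url: str) -> list[str]:
        with self._lock:
            return list(self._entries.get(url, []))


tracker = ReferrerTracker()
tracker.merge_referrer("https://example.org/x", "/b")  # queued: no entry yet
tracker.add_entry("https://example.org/x", "/a")       # entry created, queue applied
tracker.merge_referrer("https://example.org/x", "/c")  # appended directly
print(tracker.referrers("https://example.org/x"))
```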
- property misplaced_assets: list[MisplacedAsset]
List of misplaced assets sorted by asset type then URL.
- property no_crawl_matches: list[NoCrawlMatch]
List of no-crawl matches sorted by URL.
- property non200_responses: list[Non200Response]
List of non-200 responses sorted by status code then URL.
- property non_http_links: list[NonHttpLink]
List of non-HTTP scheme links sorted by URL.
- record_request(url: str, *, bytes_downloaded: int = 0, crawled: bool = False, external: bool = False) None[source]
Update statistics for a completed HTTP request.
- Parameters:
url – The URL that was requested.
bytes_downloaded – Bytes received.
crawled – True if the page was crawled (GET + parsed).
external – True if the URL was external.
- property redirects: list[RedirectInfo]
List of redirects sorted by original URL.
- property ssl_warnings: list[SslWarning]
List of SSL warnings sorted by domain.
- property statistics: CrawlStatistics
Snapshot of the crawl statistics.
- property unvalidated_anchors: list[UnvalidatedAnchor]
List of unvalidated anchors sorted by target URL.
- class CrawlStatistics(start_time: float = <factory>, total_requests: int = 0, bytes_downloaded: int = 0, pages_crawled: int = 0, pages_checked: int = 0, external_checked: int = 0, per_domain_requests: dict[str, int]=<factory>)[source]
Bases:
object
Aggregated crawl statistics.
- start_time
UNIX timestamp when the crawl started.
- Type:
float
- total_requests
Total HTTP requests issued.
- Type:
int
- bytes_downloaded
Total bytes received.
- Type:
int
- pages_crawled
Internal pages fetched with GET and parsed.
- Type:
int
- pages_checked
Internal pages checked without parsing.
- Type:
int
- external_checked
External URLs checked.
- Type:
int
- bytes_downloaded: int = 0
- external_checked: int = 0
- pages_checked: int = 0
- pages_crawled: int = 0
- start_time: float
- total_requests: int = 0
- class IgnoreMatch(url: str, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A URL that was ignored due to matching an ignore_urls prefix.
- url
The ignored URL.
- Type:
str
- url: str
- class MisplacedAsset(url: str, asset_type: str, referencing_pages: list[str] = <factory>)[source]
Bases:
object
An asset found outside its expected asset_urls prefixes.
- url
The asset URL.
- Type:
str
- asset_type
String name of the asset type category.
- Type:
str
- asset_type: str
- url: str
- class NoCrawlMatch(url: str, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A URL that matched a no_crawl_urls prefix.
- url
The URL.
- Type:
str
- url: str
- class Non200Response(url: str, status_code: int, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A URL that returned a non-200 final status.
- url
The URL.
- Type:
str
- status_code
HTTP status code.
- Type:
int
- status_code: int
- url: str
- class NonHttpLink(url: str, scheme: str, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A non-HTTP scheme link encountered during crawl.
- url
The non-HTTP URL.
- Type:
str
- scheme
The scheme (e.g. 'mailto', 'tel').
- Type:
str
- scheme: str
- url: str
- class RedirectInfo(original_url: str, final_url: str, status_code: int, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A URL that redirected to another location.
- original_url
The URL that redirected.
- Type:
str
- final_url
The URL after all redirects.
- Type:
str
- status_code
The HTTP status code of the first redirect hop (e.g. 301, 302).
- Type:
int
- final_url: str
- original_url: str
- status_code: int
- class SslWarning(domain: str, error: str, affected_urls: list[tuple[str, list[str]]]=<factory>)[source]
Bases:
object
An SSL error recorded for a domain.
- domain
The domain that produced the SSL error.
- Type:
str
- error
Error message.
- Type:
str
- domain: str
- error: str
- class UnvalidatedAnchor(target_url: str, reason: str, referencing_pages: list[str] = <factory>)[source]
Bases:
object
A fragment reference that could not be validated (no HTML was parsed).
- target_url
The full URL including fragment.
- Type:
str
- reason
Why validation was skipped ('no-crawl', 'external', 'depth-limited').
- Type:
str
- reason: str
- target_url: str
Main crawl engine with thread pool, visit-once logic, and result aggregation.
- class Crawler(config: CrawlConfig, progress: ProgressReporter | None = None, sleep: Callable[[float], None] | None = None)[source]
Bases:
object
Main crawl engine.
Uses a ThreadPoolExecutor to process URLs concurrently. Enforces visit-once semantics, depth limits, and all other spec 5-9 rules.
- Parameters:
config – Crawl configuration.
progress – Optional progress reporter to update during the crawl.
sleep – Callable used for inter-retry pauses inside the HTTP client. Defaults to time.sleep(). Pass lambda _: None in tests to make retries instantaneous.
- abort() None[source]
Signal the crawl to stop after in-flight requests complete.
Safe to call from any thread (e.g. a signal handler). Already-submitted workers finish naturally; no new URLs are dequeued or requested.
- crawl() CrawlResults[source]
Run the full crawl starting from config.root_url.
- Returns:
CrawlResults with all findings.
- property results: CrawlResults
Return the accumulated crawl results.
May be partial if called after abort() before crawl() has returned.
Plain-text report generator for the 11 report sections.
- generate_report(results: CrawlResults, config: CrawlConfig) str[source]
Generate the full plain-text report for a completed crawl.
Produces 11 sections covering configuration, statistics, broken links, broken anchors, non-200 responses, redirects, misplaced assets, ignored URLs, non-HTTP links, SSL warnings, and unvalidated anchors.
- Parameters:
results – Completed crawl results.
config – Crawl configuration used.
- Returns:
The full report as a multi-line string.
Periodic stderr progress reporting during crawl.
- class ProgressReporter(interval: float = 5.0, output: Callable[[str], None] | None = None)[source]
Bases:
object
Emits periodic progress updates to stderr.
Updates are written approximately every interval seconds.
- Parameters:
interval – Time in seconds between progress updates.
output – Callable that accepts a string and writes it somewhere. Defaults to printing to sys.stderr.
- update(*, checked: int, queued: int, active_threads: int, elapsed: float) None[source]
Update the current progress values.
- Parameters:
checked – Number of URLs checked so far.
queued – Number of URLs currently in queue.
active_threads – Number of active worker threads.
elapsed – Elapsed time in seconds.