Welcome to link-checker’s documentation!
Link Checker
A Python tool that checks websites for broken links and catalogs internal assets.
Features
Crawls websites starting from a root URL, respecting URL hierarchy boundaries (won’t crawl “up” from the starting URL)
Detects broken internal links
Catalogs references to non-HTML assets (images, text files, etc.)
Only visits each page once
Checks external links but does not crawl them
Provides detailed logging
Allows specifying paths to exclude from internal asset reporting
Supports checking but not crawling specific website sections
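The hierarchy-boundary behavior described above can be pictured with a small sketch. This is an illustrative helper, not link_checker’s actual implementation; the function name `within_hierarchy` and its exact rules are assumptions:

```python
from urllib.parse import urlparse

def within_hierarchy(root_url: str, candidate_url: str) -> bool:
    """Illustrative boundary check: is candidate_url at or below root_url?

    Hypothetical helper -- the real tool's rules may differ in detail.
    """
    root, cand = urlparse(root_url), urlparse(candidate_url)
    if cand.netloc != root.netloc:
        # Different host: the link is external, so it would be checked
        # for validity but never crawled.
        return False
    # Treat the root URL's path as the crawl boundary; only URLs whose
    # path is at or below it are eligible for crawling.
    boundary = root.path.rstrip("/") + "/"
    return (cand.path.rstrip("/") + "/").startswith(boundary)
```

For example, with a root of https://example.com/section/subsection, a page under /section/subsection is in scope, while a sibling under /section/other is not.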
Installation
pip install rms-link-checker
Or from source:
git clone https://github.com/SETI/rms-link-checker.git
cd rms-link-checker
pip install -e .
You can also install using pipx, which allows you to install the software and its dependencies in isolation without needing to set up a virtual environment:
pipx install rms-link-checker
Usage
link_checker https://example.com
Options
--verbose or -v: Increase verbosity (can be used multiple times)
--output or -o: Specify output file for results (default: stdout)
--log-file: Write log messages to a file (in addition to console output)
--log-level: Set the minimum level for messages in the log file (DEBUG, INFO, WARNING, ERROR, CRITICAL)
--timeout: Timeout in seconds for HTTP requests (default: 10.0)
--max-requests: Maximum number of requests to make (default: unlimited)
--max-depth: Maximum depth to crawl (default: unlimited)
--max-threads: Maximum number of concurrent threads for requests (default: 10)
--ignore-asset-paths-file: Specify a file containing paths to ignore when reporting internal assets (one per line)
--ignore-internal-paths-file: Specify a file containing paths to check once but not crawl (one per line)
--ignore-external-links-file: Specify a file containing external links to ignore in reporting (one per line)
Examples
Simple check:
link_checker https://example.com
Check a specific section of a website (won’t crawl to parent directories):
link_checker https://example.com/section/subsection
Ignore specific asset paths:
# Create a file with paths to ignore
echo "/images" > ignore_assets.txt
echo "css" >> ignore_assets.txt # Leading slash is optional
echo "scripts" >> ignore_assets.txt
link_checker https://example.com --ignore-asset-paths-file ignore_assets.txt
Check but don’t crawl specific sections:
# Create a file with paths to check but not crawl
echo "docs" > ignore_crawl.txt # Leading slash is optional
echo "/blog" >> ignore_crawl.txt
link_checker https://example.com --ignore-internal-paths-file ignore_crawl.txt
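The optional leading slash noted in these ignore files can be sketched in a few lines of Python. The helper names (`normalize_ignore_path`, `is_ignored`) are hypothetical, chosen only to illustrate the path-matching idea:

```python
def normalize_ignore_path(line: str) -> str:
    """Add the leading slash if an ignore-file entry omits it (illustrative)."""
    path = line.strip()
    return path if path.startswith("/") else "/" + path

def is_ignored(url_path: str, ignored: set[str]) -> bool:
    """True if url_path equals an ignored path or lies beneath one."""
    return any(url_path == p or url_path.startswith(p + "/") for p in ignored)

# Entries "docs" and "/blog" both end up as absolute paths.
ignored = {normalize_ignore_path(line) for line in ["docs", "/blog"]}
```

Under this scheme, /docs/intro.html matches the entry "docs", but an unrelated path such as /documents does not.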
Verbose output with detailed logging:
link_checker https://example.com -vv
Verbose output with logs written to a file:
link_checker https://example.com -vv --log-file=link_checker.log
Verbose output with logs written to a file, but only warnings and errors:
link_checker https://example.com -vv --log-file=link_checker.log --log-level=WARNING
Limit crawl depth and set a longer timeout:
link_checker https://example.com --max-depth=3 --timeout=30.0
Limit the number of requests to avoid overwhelming the server:
link_checker https://example.com --max-requests=50
Control the number of concurrent threads for faster checking on a powerful system:
link_checker https://example.com --max-threads=20
Or reduce threads to be more gentle on the server:
link_checker https://example.com --max-threads=4
Report Format
The report includes:
Configuration summary (root URL, hierarchy boundary, and ignored paths)
Broken links found (grouped by page)
Internal assets (grouped by type)
Summary with counts (visited pages, broken links, assets)
Stats on ignored assets, limited-crawl sections, and URLs outside hierarchy
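As a rough illustration of how such groupings could be assembled (the tool’s internal data structures are not documented here, so the tuple layout below is an assumption), crawl results might be bucketed like this:

```python
from collections import defaultdict

# Hypothetical crawl results: (source_page, target_url, http_status, kind)
results = [
    ("/index.html", "/about.html", 200, "page"),
    ("/index.html", "/missing.html", 404, "page"),
    ("/about.html", "/logo.png", 200, "asset"),
    ("/about.html", "/missing.html", 404, "page"),
]

broken_by_page = defaultdict(list)  # broken links, grouped by source page
assets_by_type = defaultdict(list)  # internal assets, grouped by extension
for page, target, status, kind in results:
    if status >= 400:
        broken_by_page[page].append(target)
    elif kind == "asset":
        assets_by_type[target.rsplit(".", 1)[-1]].append(target)

pages_visited = {page for page, _, _, _ in results}
```

This mirrors the report’s structure: broken links grouped by the page that references them, assets grouped by type, and summary counts derived from the visited set.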
Contributing
Information on contributing to this package can be found in the Contributing Guide.
Licensing
This code is licensed under the Apache License v2.0.