Web crawling and link discovery

The web crawler starts from the URI defined in the web application configuration, parses HTML, and extracts the links it finds. By default, it crawls all domains and sub-domains discovered from the starting URI.

Following Links

The web crawler automatically balances crawling down a web site branch (depth, measured in clicks from the starting URI) and across a branch (links at the same level), and it tracks the unique links it has already crawled. This lets the crawler achieve a high degree of site coverage while avoiding the re-scanning of redundant and recursive links. The list of links crawled is identified by QID 150009 Links Crawled.
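
The crawl scheduler itself is internal to the scanning engine, but the core idea of expanding level by level while tracking unique links can be illustrated with a simple breadth-first crawl. This is a minimal sketch under that assumption, not the product's implementation; fetch_links is a hypothetical helper that returns the href values found on a page.

    import collections
    from urllib.parse import urljoin, urldefrag

    def crawl(start_uri, fetch_links, max_depth=5):
        """Breadth-first crawl: visit each unique link once, expanding
        across a level (breadth) before descending further (depth)."""
        seen = {start_uri}                 # unique links already queued
        queue = collections.deque([(start_uri, 0)])
        crawled = []                       # analogous to QID 150009 Links Crawled
        while queue:
            url, depth = queue.popleft()
            crawled.append(url)
            if depth >= max_depth:
                continue
            for href in fetch_links(url):  # hypothetical page-parsing helper
                link, _ = urldefrag(urljoin(url, href))  # resolve, drop fragment
                if link not in seen:       # skip redundant/recursive links
                    seen.add(link)
                    queue.append((link, depth + 1))
        return crawled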

Maximum number of links to crawl

The web crawler crawls up to 8,000 links per web application. This count includes form submissions, links requested as an anonymous user, and links requested as an authenticated user. You can configure this setting.
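
As a sketch of how such a cap behaves, the helper below counts every request type against a single budget. The 8,000 default mirrors the documented limit; the category names are illustrative only.

    # Default per-web-application limit; configurable in the settings.
    MAX_LINKS = 8_000

    def within_budget(counts):
        """True while the crawl may continue. Form submissions and links
        requested anonymously or as an authenticated user all count
        against the same limit (category names are hypothetical)."""
        return sum(counts.get(k, 0)
                   for k in ("form_submission", "anonymous", "authenticated")
                   ) < MAX_LINKS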

External links

Any external links and external form actions found for a web application are not crawled. We use the term "external" to refer to links discovered on a host (FQDN or IP address) that is not the virtual host (starting host) or a domain added for multi-site support. External links that are not crawled are identified as information gathered by QID 150010 External Links Discovered and QID 150014 External Form Actions Discovered.
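
The internal-versus-external decision comes down to comparing a link's host against the starting host and any hosts added for multi-site support. A minimal sketch, assuming those hosts are known up front (the parameter names are hypothetical):

    from urllib.parse import urlparse

    def is_external(link, starting_host, multisite_hosts=()):
        """A link is external when its host (FQDN or IP address) is
        neither the virtual (starting) host nor a host added for
        multi-site support."""
        host = (urlparse(link).hostname or "").lower()
        allowed = {starting_host.lower(), *(h.lower() for h in multisite_hosts)}
        return host not in allowed

For example, is_external("https://cdn.example.net/app.js", "www.example.com") returns True: that link would be recorded (QID 150010) rather than crawled.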

Exclude List/Allow List

The exclude list feature allows you to prevent the web crawler from requesting certain links in your web application.

Want to create an exclude list and/or allow list? It's easy: just edit the web application settings.

Important! Automated web application scanning has the potential to cause data loss.
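
A sketch of how exclude and allow lists could gate requests follows. The regular-expression matching and the rule ordering (exclusions checked first, then the allow list when one exists) are illustrative assumptions, not a statement of the product's exact semantics.

    import re

    def should_request(url, exclude_patterns=(), allow_patterns=()):
        """Gate a request against exclude/allow lists. Pattern syntax
        and rule ordering here are assumptions for illustration."""
        if any(re.search(p, url) for p in exclude_patterns):
            return False       # excluded links are never requested
        if allow_patterns:     # when an allow list exists, require a match
            return any(re.search(p, url) for p in allow_patterns)
        return True

Excluding destructive links is one way to reduce the data-loss risk noted above; for example, should_request("https://example.com/account/delete?id=7", exclude_patterns=[r"/delete"]) returns False.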

How to change the scope of scanning

You can change the scope of scanning in the web application settings. The options are: limit crawling to the starting URI and its sub-directories, crawl only sub-domains, or crawl specified domains.
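
The three scope options map naturally onto a host-and-path check. A minimal sketch, with abbreviated option names; in the product, the scope is chosen in the web application settings, not in code.

    from urllib.parse import urlparse

    def in_scope(url, start_uri, scope, domains=()):
        """scope: 'directory'  - starting URI and its sub-directories
                  'subdomains' - the starting host and its sub-domains
                  'domains'    - an explicit list of specified domains"""
        u, s = urlparse(url), urlparse(start_uri)
        host, root = (u.hostname or "").lower(), (s.hostname or "").lower()
        if scope == "directory":
            base = s.path if s.path.endswith("/") else s.path + "/"
            return host == root and (u.path == s.path or u.path.startswith(base))
        if scope == "subdomains":
            return host == root or host.endswith("." + root)
        if scope == "domains":
            return any(host == d.lower() or host.endswith("." + d.lower())
                       for d in domains)
        return False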