Manual for Integrity 9 - Settings

Blacklists and whitelists - Do not check / Only follow / Do not follow

In a nutshell, 'check' means ask the server for the status of that page without actually visiting the page. 'Follow' means visit that page and scrape the links off it.

Checking a link means sending a request and receiving a status code (200, 404, whatever). Integrity will check all of the links it finds on your starting page. If you've checked 'Check this page only' then it stops there.

Otherwise, it'll take each of those links it's found on your first page and 'follow' the internal ones. That means it'll request and load the content of the page, then trawl the content to find the links on that page. Then follow those in turn.

Note that it won't 'follow' external links, because it would then be crawling someone else's site - it just needs to 'check' external links.
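To illustrate the difference (a rough sketch in Python, not Integrity's actual code; example.com and the helper names are hypothetical), a 'check' asks only for a status code, while a 'follow' loads the page content and trawls it for links:

    import urllib.error
    import urllib.request
    from html.parser import HTMLParser

    def check(url):
        # 'Check': request the status code without loading the page body.
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.status          # e.g. 200
        except urllib.error.HTTPError as e:
            return e.code                   # e.g. 404

    class LinkParser(HTMLParser):
        # Collects the href of every <a> tag on the page.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def follow(url):
        # 'Follow': load the content, then scrape the links off it.
        with urllib.request.urlopen(url, timeout=30) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            html = resp.read().decode(charset, "replace")
        parser = LinkParser()
        parser.feed(html)
        return parser.links     # these are checked (and the internal ones followed) in turn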

You can ask Integrity to list but not check certain links, to ignore certain links completely, or not to follow certain links. You do this by setting up a rule and typing part of a url into the relevant box. For example, if you want to check only the section of your site below /engineering, you would choose 'Only follow...' and type '/engineering' (without quotes). You don't need to know about pattern matching such as regex or wildcards; just type a part of the url. Separate multiple keywords or phrases with commas.

You don't need to use complex pattern-matching; just type the word or phrase. A match is made if any of the phrases appear in the url.

These rules do allow certain pattern characters. An asterisk means 'any number of any character' and a dollar sign means 'at the end of the url'.
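As a sketch of how such a rule could be applied (an illustration in Python, not Integrity's actual matching code):

    import re

    def rule_matches(rule, url):
        # Plain text matches anywhere in the url; '*' matches any run of
        # characters; '$' anchors the match to the end of the url.
        pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
        return re.search(pattern, url) is not None

    assert rule_matches("/engineering", "https://example.com/engineering/page.html")
    assert rule_matches("*.pdf$", "https://example.com/files/report.pdf")
    assert not rule_matches(".pdf$", "https://example.com/report.pdf?x=1")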

You can highlight pages that are matched by the 'do not follow' or 'only follow' rules. This option is on the first tab of Preferences.

Number of threads

This slider sets the number of requests that Integrity can make at once. Using more threads may crawl your site faster, but it will use more of your computer's resources and your internet bandwidth. More importantly, it'll hit your website harder.

The default is 12, the minimum is one and the maximum is 40. I've found that using more than this has little effect. If your site is fast to respond, then you may get maximum crawl speed with the slider halfway.

Beware - your site may start to give timeouts if this setting is too high. In some cases, too many threads may stop the server from responding, or it may block your IP. If moving the number of threads to the minimum doesn't cure this problem, see 'Timeout and Delay' below.
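To illustrate the principle (a toy Python sketch, not Integrity's implementation; the urls are hypothetical), the thread count is essentially the size of a pool of workers making simultaneous requests:

    import concurrent.futures
    import urllib.request

    def status(url):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.status
        except Exception as e:
            return e

    urls = ["https://example.com/page%d" % i for i in range(100)]

    # max_workers plays the role of the slider: here, 12 requests at once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=12) as pool:
        for url, result in zip(urls, pool.map(status, urls)):
            print(url, result)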

Timeout and Delay

If you're getting timeouts you may first try reducing the number of threads you're using.

Integrity now has a setting for accurate rate-limiting: "Limit requests to X per minute". You don't need to do any calculations involving the number of threads; Integrity will take that into account and reduce the number of threads if necessary.

If your server is simply being slow to respond or your connection is busy, you can increase the timeout (in seconds).
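As an illustration of how a per-minute limit can work independently of the thread count (an assumption about the mechanism, not Integrity's actual code): each worker reserves the next free time slot, so requests end up spaced 60/X seconds apart however many threads are running.

    import threading
    import time

    class RateLimiter:
        # Allows at most per_minute requests per minute across all threads.
        def __init__(self, per_minute):
            self.interval = 60.0 / per_minute
            self.lock = threading.Lock()
            self.next_slot = time.monotonic()

        def wait(self):
            # Each worker calls this before issuing a request.
            with self.lock:
                now = time.monotonic()
                delay = max(0.0, self.next_slot - now)
                self.next_slot = max(now, self.next_slot) + self.interval
            time.sleep(delay)

    limiter = RateLimiter(per_minute=30)    # 'Limit requests to 30 per minute'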

Check for broken images / Check linked js and css files

Hyperlinks appear on the page in the form <a href=... . Integrity can also check for broken images (<img src=... and srcset=...) and other linked files (<link href=...).

In the case of images, their alt text will be reported in place of link text (Integrity Pro's SEO section can report images without alt text).

If you check 'load images' then the data for each image is loaded so that the actual size can be determined (reported in the SEO table). If that switch is off, then Integrity can only go by the size reported in the response header, if present.
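For illustration (a Python sketch, not Integrity's parser), these are the attributes being read, and how a size can be taken from the response header when 'load images' is off:

    import urllib.request
    from html.parser import HTMLParser

    class AssetParser(HTMLParser):
        # Collects image and linked-file urls from img src/srcset and link href.
        def __init__(self):
            super().__init__()
            self.assets = []
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "img":
                if a.get("src"):
                    self.assets.append(a["src"])
                for candidate in (a.get("srcset") or "").split(","):
                    url = candidate.strip().split(" ")[0]
                    if url:
                        self.assets.append(url)
            elif tag == "link" and a.get("href"):
                self.assets.append(a["href"])

    def header_size(url):
        # With 'load images' off: only the size reported in the header, if present.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.headers.get("Content-Length")   # may be None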

Ignore external

If you're only interested in your internal pages (if you're generating a sitemap, for example) then checking this option can speed up the scan.

Don't follow 'nofollow'

Integrity can check links for the 'rel=nofollow' attribute. By default, Integrity will follow all links, but if you want to see which links have nofollow, you can do so. Or you can ask Integrity not to follow those 'nofollow' links.

To check for nofollow in links, use the column selector above the links tables and select 'nofollow'. Integrity will check for the attribute in the links and show Yes or No in the table columns. (You can of course re-order the table columns by dragging and dropping.)

You'll find the checkbox 'don't follow 'nofollow' links' on the Settings screen. With that unchecked, Integrity will still follow those links.

Integrity can also check for nofollow in the robots meta tag. If it finds it then it'll treat and mark all links on that page as being nofollow and won't follow those if the 'don't follow' checkbox is checked.
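A Python sketch (an illustration, not Integrity's code) of the two places nofollow can appear:

    from html.parser import HTMLParser

    class NofollowParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.page_nofollow = False      # set by <meta name="robots" content="...nofollow...">
            self.links = []                 # (href, nofollow) pairs
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and (a.get("name") or "").lower() == "robots":
                if "nofollow" in (a.get("content") or "").lower():
                    self.page_nofollow = True   # treat every link on the page as nofollow
            elif tag == "a" and a.get("href"):
                rel = (a.get("rel") or "").lower()
                self.links.append((a["href"], self.page_nofollow or "nofollow" in rel))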

Check links on error pages

With this setting on, Integrity will parse the content of custom error pages and any other page which is returned following a 4xx or 5xx. A relative link on such a page can be a source of an infinite loop (a relative url like 'pages/help.html' on an error page resolves to a new, deeper url each time the error page is served, generating endless urls to crawl).

Treat subdomains of root domain as internal

A page is considered internal if it has the same domain, ie peacockmedia.software. This checkbox chooses whether a subdomain should be considered part of the same site, ie whether blog.peacockmedia.software is the same site as www.peacockmedia.software.
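As a sketch of the decision this checkbox controls (illustrative Python, not Integrity's code, with peacockmedia.software standing in for your root domain):

    from urllib.parse import urlparse

    def is_internal(url, root="peacockmedia.software", subdomains_internal=True):
        host = urlparse(url).hostname or ""
        if host == root:
            return True
        # blog.peacockmedia.software etc. only count if the box is ticked.
        return subdomains_internal and host.endswith("." + root)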

Ignore querystrings

The querystring is information within the url of a page. It follows a '?' - for example www.mysite.co.uk/index.html?thisis=thequerystring. If you don't use querystrings on your site, then it won't matter whether you set this option. If your page is the same with or without the querystring (for example, if it contains a session id) then check 'ignore querystrings'. If the querystring determines which page appears (for example, if it contains the page id) then you shouldn't ignore querystrings, because Integrity won't crawl your site properly. (A sketch after the next setting illustrates the normalisation that this checkbox and the next one control.)

Ignore trailing slashes

This setting should be on by default, and it should only be switched off in very rare cases where the trailing slash makes a difference. ie if mysite.com/mypage returns a different page from mysite.com/mypage/ or gives an error.
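A sketch of the normalisation these two checkboxes control (illustrative Python, not Integrity's code):

    from urllib.parse import urlsplit, urlunsplit

    def normalise(url, ignore_query=True, ignore_trailing_slash=True):
        scheme, netloc, path, query, fragment = urlsplit(url)
        if ignore_query:
            query = ""                      # drop ?thisis=thequerystring
        if ignore_trailing_slash and path != "/" and path.endswith("/"):
            path = path.rstrip("/")         # /mypage/ becomes /mypage
        return urlunsplit((scheme, netloc, path, query, ""))

    # With both options on, these normalise to the same url:
    print(normalise("http://www.mysite.co.uk/index.html?thisis=thequerystring"))
    print(normalise("http://www.mysite.co.uk/index.html"))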

Pages have no file extension

For the most part, this setting has little effect on the crawl, but if your starting url is the kind without a file extension (mysite.com/mypage/) then there's no way to know that /mypage is a page rather than a directory. Integrity is limited to the directory that you start in, so it's important to check this option in order for your site to be crawled fully.

Archive pages while crawling

When Integrity crawls the site, it has to pull in the html code for each page in order to find the links. With the archive option switched on, it simply dumps the html as a file in a location that you specify at the end of the crawl.

Since v6.3, the archive functionality has been enhanced (in Integrity Plus and Integrity Pro, not Integrity). You have the option to dump your html pages without changing them (as before) or to process the files somewhat so that the saved files can be viewed and browsed more easily. This option is within a panel of options accessed from the 'options' button beside the Archive checkbox. The options also appear in the Save dialog (if shown) at the end of the crawl.

If you're looking for more thorough archiving, please see Website Watchman.

Preferences

Limiting Crawl

If your site is big enough or if Integrity hits some kind of loop (more common than you'd think) it would eventually run out of memory and just crash. Setting a limit is better than a crash.

By default it's set to stop at 200,000 links, but you can alter this limit in Prefs. It'll probably handle many more, but be aware that if you increase the limit and your site is big enough, it may fail at some point. If it does, then it'll be necessary to break down the site using blacklisting / whitelisting.

You can also alter the number of 'levels' that Integrity will crawl (clicks from home). Some people prefer to limit a crawl this way.

User agent string

You can change the user-agent string to make Integrity appear to be a browser to the server (known as 'spoofing').

Go to Preferences and select from one of the pre-defined browsers or paste your chosen user-agent string into the box.

There is an incredibly comprehensive list of browser user-agent strings on this page: http://www.zytrax.com/tech/web/browser_ids.htm

If you would like to find the user-agent string of the browser you're using now, just hit this link:
What's my user-agent string?
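For illustration, spoofing amounts to no more than sending the chosen string as the User-Agent header of each request (a Python sketch; the string below is one example Safari user-agent):

    import urllib.request

    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                               "AppleWebKit/605.1.15 (KHTML, like Gecko) Safari/605.1.15"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(resp.status)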

Generally be tolerant rather than strict

When Integrity crawls the site, it has to pull in the html code for each page in order to find the links. If code is hand-written it may not be perfectly formed and may contain certain problems, but the page will still appear to work properly in a web browser.

This option will allow Integrity to be more forgiving and overlook certain problems (as web browsers tend to be) or be less forgiving and flag up these problems so that you can fix them.

Ignore ../ that travels above the domain

This is a very common problem. '../' means 'the directory above', and relative links may start with one or more, ie ../../../, which may point to a folder above the starting point of the domain. Browsers tend to be tolerant of this. Some people feel that if a link works in a browser then they don't want it reported as a problem, but I do recommend that you leave the box unchecked and fix such links.
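To see what's being tolerated: browsers resolve such links as per RFC 3986, which silently discards any '../' that climbs above the root, so the link may 'work' even though it's wrong. Python's standard resolver behaves the same way (the urls here are hypothetical):

    from urllib.parse import urljoin

    base = "https://example.com/a/b/page.html"
    print(urljoin(base, "../../../style.css"))
    # -> https://example.com/style.css  (the third '../' had nowhere to go)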

Sitemap > XML / Template (Plus and Pro)

Use these settings to configure the output XML.

Sitemap > Check for robots.txt and robots meta tag (Plus and Pro)

The robots.txt file and robots meta tag allow your site to indicate to web robots such as the Googlebot and Integrity that you wish them to ignore certain pages. The preference in Integrity is off by default and can be switched on in Preferences. All links are followed and checked regardless of this setting, but if a page is marked as 'noindex' in the robots meta tag or disallowed in the robots.txt file, it will not be included in the sitemap results. robots.txt must have a lowercase filename, be placed in the root directory of your website and be constructed as described at http://www.robotstxt.org/robotstxt.html
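A sketch of the two checks this preference enables, using Python's standard robots.txt parser (example.com is hypothetical; remember a page failing either check is still crawled and checked, just left out of the sitemap):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")    # lowercase name, in the root directory
    rp.read()

    def include_in_sitemap(url, page_meta_robots=""):
        if not rp.can_fetch("*", url):              # disallowed in robots.txt
            return False
        if "noindex" in page_meta_robots.lower():   # noindex in the robots meta tag
            return False
        return True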

SEO > Parameters (Pro)

Use these settings to set the thresholds for the various SEO checks - optimum length for title, description, keyword density, minimum number of words per page, maximum number of links etc.

SEO > Options > Keyword analysis while scanning (Pro)

This checkbox enables the 'High keyword density' filter in the SEO results. It demands quite a bit of work while the scan is running.
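For illustration only - this uses an assumed, simplified definition of keyword density (occurrences of the phrase as a share of the words on the page), which is not necessarily the calculation Integrity Pro performs:

    def keyword_density(text, phrase):
        words = text.lower().split()
        phrase_words = phrase.lower().split()
        n = len(phrase_words)
        hits = sum(1 for i in range(len(words) - n + 1)
                   if words[i:i + n] == phrase_words)
        return 100.0 * hits * n / max(len(words), 1)

    print(keyword_density("link checker for the mac - a link checker", "link checker"))
    # -> about 44 (percent)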

You are viewing part of the manual / help for Integrity for Mac OS X by Peacockmedia.