Scrutiny link checker and webmaster tools for Apple Macintosh

Blacklists and whitelists - Do not check / Only follow / Do not follow

In a nutshell, 'check' means ask the server for the status of that page without actually visiting the page. 'Follow' means visit that page and scrape the links off it.

Checking a link means sending a request and receiving a status code (200, 404, whatever). Scrutiny will check all of the links it finds on your starting page. If you've checked 'Check this page only' then it stops there.

Otherwise, it'll take each of those links it's found on your first page and 'follow' them. That means it'll request and load the content of the page, then trawl the content to find the links on that page. It adds all of the links it finds to its list and then goes through those checking them, and if appropriate, following them in turn.

Note that it won't 'follow' external links, because it would then be crawling someone else's site - it just needs to 'check' external links.
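The difference between checking and following can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Scrutiny's actual code: the SITE dictionary stands in for a real web server, and the names check, follow and crawl are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical stand-in for a web server: url -> (status code, html)
SITE = {
    "http://example.com/": (
        200,
        '<a href="/about.html">About</a> <a href="http://other.example/">Other</a>',
    ),
    "http://example.com/about.html": (200, '<a href="/missing.html">Broken</a>'),
    "http://other.example/": (200, '<a href="/deep.html">Deep</a>'),
}


class LinkScraper(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def check(url):
    """'Check': ask only for the status of the url."""
    return SITE.get(url, (404, ""))[0]


def follow(url):
    """'Follow': load the page content and scrape the links off it."""
    scraper = LinkScraper()
    scraper.feed(SITE.get(url, (404, ""))[1])
    return [urljoin(url, href) for href in scraper.hrefs]


def crawl(start):
    """Check every link found; follow only internal pages that loaded."""
    host = urlsplit(start).netloc
    statuses = {}
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in statuses:
            continue  # already checked
        statuses[url] = check(url)
        if statuses[url] == 200 and urlsplit(url).netloc == host:
            queue.extend(follow(url))
    return statuses
```

In this sketch the external page at other.example is checked (it gets a status) but its own links are never scraped, which is the behaviour described above.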

You can ask Scrutiny to not check certain links, to only follow or not to follow certain links. You do this by setting up a rule and typing part of a url into the relevant box. For example, if you want to only check the section of your site below /engineering you would choose 'Only follow...' and type '/engineering' (without quotes). You don't need to know about pattern matching such as regex or wildcards, just type a part of the url. Separate multiple keywords or phrases with commas.

You can now limit the crawl based on keywords or a phrase in the content, or highlight certain pages based on content.

No complex pattern-matching, just type the word or phrase and check 'Check content as well as url'. A match is made if any of the phrases appear in the url or the content.
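As a rough sketch of how this kind of matching could work (an assumption for illustration, not Scrutiny's own implementation), the rule box can be treated as a comma-separated list of plain phrases, with a match made when any phrase appears in the url or, optionally, in the content. The function names are hypothetical, and the comparison here is case-sensitive, which may differ from the app.

```python
def parse_rule(rule_text):
    """Split the comma-separated box contents into trimmed, non-empty phrases."""
    return [part.strip() for part in rule_text.split(",") if part.strip()]


def rule_matches(rule_text, url, content="", check_content=False):
    """True if any phrase appears in the url, or (optionally) in the content."""
    haystacks = [url]
    if check_content:
        haystacks.append(content)
    return any(
        phrase in text
        for phrase in parse_rule(rule_text)
        for text in haystacks
    )
```

For example, rule_matches("/engineering", "http://mysite.com/engineering/pumps.html") is true, while a url under /sales would not match.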

You can highlight pages that are matched by the 'do not follow' or 'only follow' rules. This option is on the first tab of Preferences.

Note if you wish to see a list of pages containing certain text, then it's easier to click through to the task screen and choose 'Search pages'. A dialog will pop up allowing you to enter the search text.

Number of threads

This slider sets the number of requests that Scrutiny can make at once. Using more threads may crawl your site faster, but it will use more of your computer's resources and your internet bandwidth, and also hit your website harder.

Using fewer will allow you to use your computer while the crawl is going on with the minimum disruption.

The default is 12, minimum is one and maximum is 40. I've found that using more than this has little effect. If your site is fast to respond, then you may get maximum crawl speed with the slider halfway.

Beware - your site may start to give timeouts if you have this setting too high. In some cases, too many threads may stop the server from responding or responding to your IP. If moving the number of threads to the minimum doesn't cure this problem, see 'Timeout and Delay' below.
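To illustrate what the thread count means, here is a minimal Python sketch using a thread pool. It is not how Scrutiny is written. The function check_all and the check callable passed into it are hypothetical, and the range of 1 to 40 with a default of 12 simply mirrors the slider described above.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check, threads=12):
    """Check every url using at most `threads` simultaneous requests.

    `check` is a hypothetical callable that takes a url and returns its status.
    """
    urls = list(urls)
    threads = max(1, min(40, threads))  # clamp to the slider's range
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the input urls
        return dict(zip(urls, pool.map(check, urls)))
```

With threads=1 the requests go out one at a time, which is the gentlest setting for the server.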

Timeout and Delay

If you're getting timeouts you may first reduce the number of threads you're using.

Your server may not respond to many simultaneous requests - it may have trouble coping or may deliberately stop responding if being bombarded from the same IP. If you get many timeouts at the same time, there are a couple of things you can do. First of all, move the number of threads to the extreme left. Scrutiny will then send one request at a time, and process the result before sending the next. This alone may work. If not, then the delay field allows you to set a delay (in seconds). You can set this to what you like, but a fraction of a second may be enough.

If your server is simply being slow to respond or your connection is busy, you can increase the timeout (in seconds).
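The single-thread-plus-delay approach looks roughly like this in Python. This is a sketch only: the function names are hypothetical, and the 0.5 second delay and 30 second timeout are example values, not Scrutiny's defaults.

```python
import time
import urllib.error
import urllib.request


def check_status(url, timeout=30):
    """Return the HTTP status of url, or None if no response arrives in time."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, e.g. with a 404
    except OSError:
        return None  # timed out or could not connect


def check_politely(urls, delay=0.5, timeout=30, check=check_status):
    """Send one request at a time, pausing `delay` seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # no pause needed before the first request
        results[url] = check(url, timeout)
    return results
```

Raising timeout helps with a server that is slow but working. Raising delay helps with a server that objects to rapid requests from the same IP.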

Archive pages while crawling

When Scrutiny crawls the site, it has to pull in the html code for each page in order to find the links. With the archive option switched on, it simply dumps the html as a file in a location that you specify at the end of the crawl.

Since v6.3, the archive functionality has been enhanced. (Integrity Plus and Scrutiny only, not Integrity.) You have the option to dump your html pages without changing them (as before) or to process the files so that the archive can be viewed in a browser (in a sitesucker-type way). This option is within a panel of options accessed from the 'options' button beside the Archive checkbox. In Scrutiny the options also appear in the Save dialog (if shown) at the end of the crawl.

Ignore querystrings

The querystring is information within the url of a page. It follows a '?' - for example www.mysite.co.uk/index.html?thisis=thequerystring. If you don't use querystrings on your site, then it won't matter whether you set this option. If your page is the same with or without the querystring (for example, if it contains a session id) then check 'ignore querystrings'. If the querystring determines which page appears (for example, if it contains the page id) then you shouldn't ignore querystrings, because Scrutiny won't crawl your site properly.
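The effect of ignoring querystrings can be shown with Python's standard url tools. This is an illustration of the idea rather than Scrutiny's code. The function name strip_querystring and the example urls are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_querystring(url):
    """Return url without its '?...' part, so variants collapse to one page."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


# Session ids: both urls are the same page, so collapsing them is helpful
session_urls = [
    "http://www.mysite.co.uk/index.html?sid=abc123",
    "http://www.mysite.co.uk/index.html?sid=xyz789",
]
unique_session_pages = {strip_querystring(u) for u in session_urls}

# Page ids: these are different pages, so collapsing them loses one
page_urls = [
    "http://www.mysite.co.uk/index.php?page=1",
    "http://www.mysite.co.uk/index.php?page=2",
]
unique_id_pages = {strip_querystring(u) for u in page_urls}
```

In both cases the two urls collapse to one. That is what you want for session ids, and exactly what you don't want when the querystring selects the page.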

Don't follow 'nofollow'

Scrutiny can check links for the 'rel = nofollow' attribute. By default, Scrutiny will follow all links, but if you want to see which links have nofollow, you can do so. Or you can ask Scrutiny not to follow those 'nofollow' links.

To check for nofollow in links, go to Preferences > Views and switch on nofollow in one or both of the links tables. With the column showing in either view, Scrutiny will check for the attribute in the links it finds, and show Yes or No in the table column. (You can of course re-order the table columns by dragging and dropping).

You'll find the checkbox for 'don't follow 'nofollow' links' on the Sites and Settings screen. With that unchecked, Scrutiny will still follow those links.

Scrutiny can also check for nofollow in the robots meta tag. If it finds it then it'll treat and mark all links on that page as being nofollow and won't follow those if the 'don't follow' checkbox is checked.
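For the curious, here is a small Python sketch of detecting nofollow, both on individual links and in the robots meta tag. It is an assumption about how such a check might be written, not Scrutiny's implementation, and all of the names are hypothetical.

```python
from html.parser import HTMLParser


class NofollowScanner(HTMLParser):
    """Records each link's href and whether it carries rel=nofollow."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="...nofollow...">
        self.links = []  # (href, has_nofollow) in document order

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            tokens = [t.strip().lower() for t in attrs.get("content", "").split(",")]
            if "nofollow" in tokens:
                self.page_nofollow = True
        elif tag == "a" and "href" in attrs:
            rel_tokens = attrs.get("rel", "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel_tokens))


def scan_links(html):
    """Return (href, nofollow) pairs; a robots meta nofollow marks every link."""
    scanner = NofollowScanner()
    scanner.feed(html)
    scanner.close()
    return [(href, flag or scanner.page_nofollow) for href, flag in scanner.links]


def links_to_follow(html, dont_follow_nofollow=False):
    """With the option off, every link is followed; with it on, nofollow links are skipped."""
    return [
        href
        for href, nofollow in scan_links(html)
        if not (dont_follow_nofollow and nofollow)
    ]
```

The (href, nofollow) pairs correspond to the Yes or No shown in the table column, and dont_follow_nofollow plays the role of the checkbox.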

Wordpress / SEO-friendly urls (no file extensions)

For the most part, this setting has little effect on the crawl, but if your starting url is the kind without a file extension (mysite.com/mypage/) then there's no way to know that mypage is a page rather than a directory. Integrity and Scrutiny are limited to the directory that you start in, so in this case it's important to check this option in order for your site to be crawled fully.

Run javascript on page before scanning

Some pages require javascript and will display 'noscript' text to browsers with javascript disabled. Scrutiny will see this text by default. A page requiring javascript may use javascript to populate some or all content. This may not be good practice (although the Googlebot can now execute javascript in some cases) but if your site requires javascript to be switched on then Scrutiny 5 can run javascript before scanning the page.

Check 'Run javascript on page before scanning' to switch on this feature. The scan will be slower, so only use this option if absolutely necessary.

Check links within PDF documents

With this option checked, Scrutiny will load in pdf files on your website and scan them for links. Links found within those pdfs will be checked and included in the Links list and internal links will be followed but the pdf itself will not be included as a page in your sitemap.

PDF files can be large and so using this option can increase the memory used by Scrutiny while it's scanning.

Limiting Crawl

If your site is big enough or if Scrutiny hits some kind of loop (more common than you'd think) it would eventually run out of memory and just crash. Setting a limit is better than a crash.

By default it's set to stop at 200,000 links, but you can alter this limit in Prefs. It'll probably handle more, but be aware that if you increase the limit and if your site is big enough, it may fail at some point. If it does, then it'll be necessary to break down the site using blacklisting / whitelisting.

You can also alter the number of 'levels' that Scrutiny will crawl (clicks from home). Some people prefer to limit a crawl this way.
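Both kinds of limit can be sketched as a breadth-first crawl in Python. This is a simplified illustration, not Scrutiny's code: limited_crawl and the links_on callable are hypothetical, and the sketch counts unique urls rather than every link occurrence.

```python
from collections import deque


def limited_crawl(start, links_on, max_links=200_000, max_levels=None):
    """Breadth-first crawl that stops at a link limit or a depth limit.

    `links_on` is a hypothetical callable returning the urls linked from a page.
    Returns a dict of url -> level (clicks from the starting page).
    """
    level = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if max_levels is not None and level[page] >= max_levels:
            continue  # deep enough: keep the page but go no further
        for url in links_on(page):
            if url in level:
                continue  # already seen
            if len(level) >= max_links:
                return level  # hit the limit: stop rather than run out of memory
            level[url] = level[page] + 1
            queue.append(url)
    return level


def endless_site(url):
    """A made-up 'loop': every page links to one brand new url, forever."""
    return [url + "x"]
```

Calling limited_crawl("http://mysite.com/", endless_site, max_links=5) returns after five urls, where an unlimited crawl of the same site would never finish.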

User agent string

You can change the user-agent string to make it appear to the server to be a browser (known as 'spoofing').

Go to Preferences and select from one of the pre-defined browsers or paste your chosen user-agent string into the box.
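If you're wondering what spoofing amounts to technically, it is just a matter of sending a different User-Agent header with each request. Here is a minimal Python sketch. The function name is hypothetical and the Safari-style string is only an example, so substitute whichever string you want the server to see.

```python
import urllib.request

# Example browser string only; paste in any user-agent string you prefer
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that identifies itself with the given user-agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The server sees only the header, so a request built this way looks to it like one from that browser.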

There is an incredibly comprehensive list of browser user-agent strings on this page: http://www.zytrax.com/tech/web/browser_ids.htm

If you would like to find the user-agent string of the browser you're using now, just hit this link:
What's my user-agent string?

Ignore leading / trailing slashes and mismatched quotes around urls

When Scrutiny crawls the site, it has to pull in the html code for each page in order to find the links. If code is hand-written it may not be perfectly formed and may contain certain problems, but the page will still appear to work properly in a web browser.

This option will allow the app to be more forgiving and overlook these problems as the web browser does (if checked) or be less forgiving and flag up these problems so that you can fix them.

Sitemap > Check for robots.txt and robots meta tag (Scrutiny feature)

The robots.txt file and robots meta tag allow you to indicate to web robots such as the Google robot and Scrutiny that you wish them to ignore certain pages. This option is off by default and can be switched on in Preferences. All links are followed and checked regardless of this setting, but if a page is marked as 'noindex' in the robots meta tag or disallowed in the robots.txt file, it will not be included in the sitemap results. robots.txt must have a lowercase filename, be placed in the root directory of your website and be constructed as at http://www.robotstxt.org/robotstxt.html
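As an illustration of how a robots.txt file excludes pages from a sitemap, here is a short Python sketch using the standard library's robots.txt parser. It is not Scrutiny's code: the function name, the example file, and the robot name 'Scrutiny' passed to the parser are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: every robot is asked to stay out of /private/
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def sitemap_pages(pages, robots_txt, agent="Scrutiny"):
    """Keep only the pages that robots.txt allows for the given robot name."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in pages if parser.can_fetch(agent, url)]
```

Note that the sketch only filters the sitemap list. As described above, the links themselves would still be followed and checked.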

Validation > Location of Validator (Scrutiny feature)

By default, this screen uses W3C's HTML validation service. Since the w3c validator switched to the 'nu' engine it hasn't been possible to support validating all pages. But pages can still be validated individually via a context menu on the SEO results table.

You can download, install and run the validator on your Mac for free. There's a free app that can make this installation easy. If you are successful, you can enter the url of your instance of the validator in the appropriate box in Preferences. More information is here

Users with the w3c validator installed locally who find that it still works with Scrutiny should continue to use Scrutiny v5.9.18, which is the latest version that can validate all pages.

Manual for Scrutiny 6 - Settings