Integrity and Scrutiny software support - FAQs

Before emailing your question, please take a quick look at the FAQ's below to see whether your question's answered.

These FAQ's accompany the manuals for Integrity and Scrutiny.

Also see Integrity's home page or Scrutiny's home page for full version history and other information.

If your problem isn't answered below, please use this form.

The crawl is incomplete, finishes early, or "unexpectedly few results".

Try these things, one at a time.

In Preferences > Advanced, check that 'Handle cookies by default' is enabled. This should be enabled by default on newer installations of Integrity, but may be off if you're a longer-time user.
In Preferences > Global, make sure that the User agent string is set to one of the real browsers, rather than Integrity's own.
If the problem is that the crawl runs for a while, but then starts timing out or giving errors, see below "Pages time out / the web server stops responding etc"
See whether your starting url is listed in the Link URLs view, if so, see what the status is. This may give you a clue as to what's wrong
Are you running any kind of protection such as Cloudflare, Incapsula etc? These systems are designed to allow real users access, but block the kind of activity that Integrity and Scrutiny produce. One simple trick that sometimes works is to switch the User-agent string to Googlebot. If all of these steps fail, you'll have to see whether you can get your IP address whitelisted or find another way to bypass that system when you scan.
If the crawl is finding nothing on the starting url (and you've checked that this is entered correctly) Or it's finding some links but not important navigation links etc, try Render Page (run js). (Integrity Pro only). This is a last resort, don't leave it switched on if it makes no difference.
Contact support, we'll be happy to look into specific cases and help if possible.

The crawl stops right away, and the status of the starting url mentions "Protocol error".

Are you using an earlier system, 10.14, 10.13 or earlier?

Some websites give this error when using Integrity versions 10 or earlier on systems 10.14 or earlier.

The reason isn't fully understood but it may relate to Integrity using older system APIs. The new version of Integrity, v12, seems fine with the sites in question, but is supported on system 10.14 and higher.

Many links are reported as "An SSL error has occurred and a secure connection to the server cannot be made".

Is the site on shared hosting?

The server may not be able to keep up with the fast and multithreaded requests. The answer to this is usually rate limiting.

In Settings &ht; Advanced (if you're using v12) first try sliding the 'Threads' slider down to 3.

If this doesn't help, then switch on 'limit requests to X per minute'. First try 500, then 250, then 100 and if necessary fewer (The box won't accept numbers below 30s).

Can I put in a username and password to crawl pages that require authentication?

This can have disastrous results if used without care, because some website systems have a web interface with controls (including 'delete' buttons) that look to web crawlers like links.

Scrutiny is able to do this with the necessary advice, warnings and disclaimers. Click 'Advanced settings' on the settings screen. Full information is here.

Should I set "ignore querystrings"?

The querystring is information within the url of a page. It follows a '?' - for example www.mysite.co.uk/index.html?thisis=thequerystring. If you don't use querystrings on your site, then it won't matter whether you set this option. If your page is the same with or without the querystring (for example, if it contains a session id) then check 'ignore querystrings'. If the querystring determines which page appears (for example, if it contains the page id) then you shouldn't ignore querystrings, because Integrity or Scrutiny won't crawl your site properly.

What does altering the number of threads do?

Using more threads may crawl your site faster, but it will use more of your computer's resources and your internet bandwidth. More importantly, more threads will bombard your server with more requests at a time, which some may not be able to handle (see 'pages time out / stop responding / server error' below).

Using fewer will allow you to use your computer while the crawl is going on with the minimum disruption, and make less of a demand on your web server.

The default is twelve, minimum is one and maximum is 40. (Before v4, max was 30 and default was seven.) I've found that using more than this has little effect. The optimum number will depend on your server and your connection. Experiment!

Pages time out / the web server stops responding / 509 / 429 / 999 server error

This isn't uncommon. Some servers will respond to many simultaneous requests, but some will have trouble coping, or may deliberately stop responding if being bombarded from the same IP.

There are a couple of things you can do. First of all, the 'threads' slider sets the number of requests that Scrutiny/Integrity can make at once. If you move this to the extreme left, then Scrutiny/Integrity will send one request at a time, and process the result before sending the next. This alone may work.

Since version 8, you have a neat control for the maximum number of requests per minute.

You don't need to do any maths; it's not 'per thread'. Scrutiny will calculate things according to the number of threads you've set (and using a few threads will help to keep things running smoothly). It will reduce the number of threads if appropriate for your specified maximum requests.

If your server is simply being slow to respond, you can increase the timeout.

Many 404s are reported for links that are fine

This can happen with certain servers where both http:// links and https:// links appear on the site. It appears that some servers don't like rapid requests for both http and https urls. Try starting at a https:// url and blacklisting http:// links (make a rule 'do not check urls containing http://') and see whether the https:// links then return the correct code.

An alternative answer is that a request isn't just a request. A lot of information is exchanged during an http request / response, and different clients will send a different set of information. The user-agent string is just one variable.

Another possible answer is that the server may be returning a bad code, followed by a page that appears fine. Just because you see a page, doesn't mean that the browser has received a good response code.

A link to [a social networking site ie Youtube, Facebook] is reported as a bad link or an error in Scrutiny, but the link works fine in my browser?

In your browser, log out of the site in question, then visit the link. You'll then be seeing the same page that Scrutiny sees because, by default, it doesn't attempt to authenticate.

If you see a page that says something like 'you need to be logged in to see this content' then this is the answer. It's debatable whether a site should return a 404 if the page is asking you to log in, but that should be taken up with the site in question.

You have several options. You could switch on authentication in Scrutiny (you may not need to give Scrutiny the username and password, only need to be logged in using Safari). You could set up a rule so that Scrutiny does not check these links, or you could change your profile on the social networking site so that the content is visible to everyone.

Limitations

If your site is a larger site then the memory use and demand on the processor and HD (virtual memory) will increase as the lists of pages crawled and links checked get longer.

Scrutiny has become more efficient over the last couple of versions, but if the site is large enough (millions of links) then the app will eventually run out of resources and obviously can't continue.

- You can crawl the site in parts. You can do this by scanning by subdomain, scanning by directory or using blacklisting or whitelisting.

If you start within a subdomain (eg engineering.mysite.com ) the scan will be limited to that subdomain
If you start within a 'directory' (eg mysite.com/engineering ) the scan will be limited to that directory
if you create a whitelist rule which says 'only follow links containing /engineering the scan will be limited to urls which contain that fragment.

- Make sure Integrity isn't going into a loop or crawling the same page multiple times because of a session id or date in a querystring - you can switch off querystrings in the settings, but make sure that content that you want to crawl isn't controlled by information in the querystring (eg a page id)
- See if you're crawling unnecessary pages, such as a messageboard. To Integrity and Scrutiny, a well-used messageboard can look like tens of thousands of unique pages and it will try to list and check all of those pages. Again, you can exclude these pages by blacklisting part of the url or querystring or ignoring querystrings.

I use Google advertising on my pages and don't want hits on these ads from my IP address

The Google Adsense code on your page is just a snippet of javascript and doesn't contain the adverts or the links. When a browser loads the page, it runs the javascript and the ads are then pulled in. Integrity and Scrutiny don't run javascript by default (make sure the option is turned off in Scrutiny) so it won't see any ads or find the links within them.

A link that's listed in the form "www.mysite.com/../page.html" is reported as an error but when I click it in the browser it works perfectly well

Sometimes a link is written in the html as '../mypage.html'. This means that the page is to be found in the directory above, which is fine as long as the link is deep in the site. If it appears on a top-level page in that form, then it's technically incorrect because no-one should have access to the directory above your domain. Browsers tend to tolerate this and assume the link is supposed to point to the root of your site. At present, Integrity and Scrutiny do not make this assumption by default and report the error. Since v6.8.1 there is a preference to "e;tolerate ../ that travels above the domain"e;, you'll find that under Preferences > Global (or General) > Tolerance

A link that uses non-ascii or unicode characters is reported as an error but when I click it in the browser it works perfectly well

I need to make Integrity or Scrutiny appear to be a 'real' browser

You can change the user-agent string to make it appear to the server to be a browser (known as 'spoofing').

Go to Preferences and paste your chosen user-agent string into the box, or choose one from the drop-down list.

There is an incredibly comprehensive list of browser user-agent strings on this page: http://www.zytrax.com/tech/web/browser_ids.htm

If you would like to find the user-agent string of the browser you're using now, just hit this link:
What's my user-agent string?

There are many other server request fields, and Scrutiny allows you to specify some of those. To experiment with the user-agent string and other request parameters, you can use this http request / response tool

What's the difference between 'checking' and 'following'?

In a nutshell, checking means just asking the server for the status of that page without actually visiting the page. Following means visiting that page and scraping all the links off it.

Checking a link is sending a request and receiving a status code (200, 404, whatever). Integrity and Scrutiny will check all of the links it finds on your starting page. If you've checked 'Check this page only' then it stops there.

But otherwise, it'll take each of those links it's found on your first page and 'follow' them. That means requesting and loading the content of the page, then going through the content finding the links on that page. It adds all the links it finds to its list and then goes through those checking them, and if appropriate, following them in turn. Note that it won't 'follow' external links, because it would then be crawling someone else's site - it just needs to 'check' external links

You can ask Integrity or Scrutiny to not check certain links, to only follow or not to follow certain links. You do this by typing part of a url into the relevant box. For example, if you want to only check the section of your site below /engineering you would type '/engineering' (without quotes) into the 'Only follow urls containing...' box. (You will also need to start your crawl at a page containing that term).

You don't need to know about pattern matching such as regex or wildcards, just type a part of the url.

What do the red and orange colours mean in the list?

To check a link, Integrity sends a request and receives a status code back from your server (200, 404, whatever).

The 'status' column tells you the code that the server returns to Integrity when it checks each link. 200 means that the link is good, 300 means there's something not quite right (usually a redirection) but the link still works, 400 codes mean that the link is bad and the page can't be accessed and 500 codes mean some kind of error with the server. So the higher the number, the worse the error and Integrity colours these (by default) white, orange and red.

If you don't consider a redirection a problem, then you can now switch the orange colour off in Preferences.

Integrity Pro and Scrutiny now make a number of HTML validation checks, and pages with one or more validation errors will be coloured orange. You can switch off the orange colouring for html validation warnings in Preferences > Views, or disabled html validation completely in Preferences > Validation.

(There's a full list of all the possible status codes here: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes) but Integrity helpfully gives you a description of the status as well as the code number.

a 200 is shown for a link where the server doesn't exist

Your provider may be recognising the fact and inserting a page of their own (possibly with a search box and some ads) and returning a 200 code. They call this a helpful service, but it's unhelpful when we're trying to find bad links.

You may be able to ask your service provider to turn this behaviour off (either via a page on their website or by contacting them). Failing that you can use the 'soft 404' feature to raise a problem for such urls. There is a longer explanation of this problem and solution here

It crashes

This is rare, as far as we're aware, and when it does happen we really would like to know. I'm happy to investigate. Please use this form

The details within the crash report may be helpful, please send that if possible. Even more important than the report itself is exactly what we need to do to experience the same problem.

Disc space is eaten while Scrutiny runs

This should only happen for very large sites, and since version 6, Integrity and Scrutiny will be much less resource-hungry. Here are some measures to make Scrutiny more efficient.

Uncheck Preferences > SEO > Count occurrences of keywords in content

Also uncheck Settings > Archive pages while crawling

If either of these boxes are checked, Scrutiny necessarily caches the page content. Depending on the size and number of your pages this can mean a significant amount of space. Unless you save the archive after the scan, this cache will be deleted when you quit or failing that, when you start the next scan.

What does the archive feature do?

When Integrity crawls the site, it has to pull in the html code for each page in order to find the links. With the archive mode switched on, it simply saves that html as a file in a location that you specify at the end of the crawl.

Since version 6.3, the archive functionality is enhanced. You have the option to dump your html pages without changing them, or to process the files so that the archive can be viewed in a browser (in a sitesucker-type way). This option is within the save dialog at the end of the crawl, or from a button beside the archive checkbox.

What does 'Check for robots.txt and robots meta tag' do? (Scrutiny feature)

The robots.txt file and robots meta tag allow you to indicate to web robots such as the Google robot and Scrutiny that you wish them to ignore certain pages. The preference in Scrutiny's preferences is off by default. All links are followed and checked regardless of this setting, but if a page is marked as 'noindex' in the robots meta tag or disallowed in the robots.txt file, it will not be included in the sitemap, SEO or validation checks. robots.txt must have a lowercase filename, be placed in the root directory of your website and be constructed as shown at http://www.robotstxt.org/robotstxt.html