WebScraper

Version History.

Version 4.15.8 released Sept 2023

Updates the Paddle (licensing) framework for compatibility with macOS 26.

Version 4.15.7 released Sep 2024

Fixes a hang that could occur at the start of a scan under certain circumstances
Increases minimum supported system to 10.14

Version 4.15.6 released March 2023

Fixes a crash that could occur under certain circumstances

Version 4.15.5 released October 2022

Fix for output filtering based on term in the content.

Version 4.15.4 released May 2022

Important fix for those wanting to output JSON. Backslashes in the data weren't being escaped in the output, potentially leading to invalid JSON.
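As a rough illustration of this failure mode (not WebScraper's actual code), the sketch below contrasts hand-built JSON with Python's standard json module. The function names and the sample value are hypothetical.

```python
import json

def naive_json_field(key, value):
    # Builds a JSON object by plain concatenation, with no escaping.
    # A backslash followed by a character that is not a valid JSON
    # escape (such as \i or \p) makes the result invalid.
    return '{"' + key + '": "' + value + '"}'

def safe_json_field(key, value):
    # json.dumps escapes backslashes and quotes in the value.
    return json.dumps({key: value})

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical scraped value containing backslashes
sample = r"C:\images\photo.png"

print(is_valid_json(naive_json_field("path", sample)))  # False
print(is_valid_json(safe_json_field("path", sample)))   # True
```

Parsing the escaped output with json.loads gives back the original string, backslashes included.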

Version 4.15.3 released October 2021

Adds setting 'Legacy webview'. The new default is to use the up-to-date WebKit webview for rendering. However, the legacy version may work better in some cases and so is retained as an option.
The setting 'Attempt authentication' has been relabeled 'handle cookies' (the button's function remains unchanged). It's sometimes advantageous to have cookie handling switched on regardless of whether you're attempting to authenticate, and the new label describes what the button actually does.

Version 4.15.2 released August 2021

Small enhancement concerning downloading of images where the 'single folder' option is chosen. Images are saved by their filename into a chosen download directory. If there were two images with the same filename but different paths in the url, one would have overwritten the other in the download directory. A check is now made and _1, _2 etc. is added to the filename when saving.
First build as Universal binary to run natively on Intel 64 or Apple Silicon M1 Macs
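The filename check described for the 'single folder' image option can be sketched roughly as follows. This is a minimal illustration, not WebScraper's implementation. The function names are hypothetical, and placing the suffix before the file extension is an assumption.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    # Take only the last path component, as the 'single folder' option does.
    return posixpath.basename(urlparse(url).path)

def unique_filename(filename, used_names):
    # Return filename unchanged if unused; otherwise insert _1, _2, ...
    # before the extension until an unused name is found (assumed placement).
    stem, ext = posixpath.splitext(filename)
    candidate = filename
    counter = 1
    while candidate in used_names:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    used_names.add(candidate)
    return candidate

# Hypothetical image urls sharing one filename under different paths
urls = [
    "https://example.com/a/300w.png",
    "https://example.com/b/300w.png",
    "https://example.com/c/300w.png",
]
used = set()
names = [unique_filename(filename_from_url(u), used) for u in urls]
print(names)  # ['300w.png', '300w_1.png', '300w_2.png']
```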

Version 4.15.1 released April 2021

Adds timeout control under Advanced Scan Settings. Previously this was internally set to 30s. If you experience timeouts, it's important to use the threads slider and/or 'limit requests to X per minute' to limit the crawl. This will usually cure the problem. Context help added by the timeout field to this effect.

Version 4.15.0 released March 2021

Adds option to 'render page / run js' before parsing it for links. This has a serious impact on resources and slows the scan, and it is almost never necessary. It will only run javascript which runs when the page loads. It will not trawl javascript code for urls, or run javascript which is triggered by a user action such as 'onClick' or scrolling.

Version 4.14.4 released January 2021

Fixes a problem preventing scanning of a list of urls

Version 4.14.3 released December 2020

Updates the selectable user-agent strings and adds more (in particular, Edge and some more mobile browsers)
Changes default setting for treating http:// links on the same domain (when starting with an https:// url). Now treats them as internal, which is probably what's expected.
Updates Paddle's licensing framework to the latest Big Sur/M1 compatible version

Version 4.14.2 released October 2020

Fixes a problem with the 'plain text' content option
Inherits some general updates in the crawling engine

Version 4.14.1 released August 2020

Adds option to recreate directory structure when downloading pdfs or images to a local folder

Version 4.13.1 released August 2020

Fixes crash where a regex is found on the page but the collecting part is an empty string.

Version 4.13.0 released August 2020

Improvements to crawling engine, particularly with regard to image discovery; now finds image urls within inline styles

Version 4.12.1 released July 2020

Adds support for charset=GBK, charset=koi8-r, charset=euc-kr and some other Latin and non-Latin character encodings.
Some changes to licensing functionality; a fairly major update to the Paddle licensing framework and Webscraper's program flow at startup, but should be invisible to the user.

Version 4.12.0 released April 2020

Adds option to download and save pdf files to a folder as it scans.

Version 4.11.0 released March 2020

Adds option in simple setup and complex setup for scraping email addresses.
Adds field in Preferences for editing the regular expression that is used when scraping email addresses.
Note that web pages may obfuscate email addresses to prevent scraping. Even if the email address appears normally on the page, it may not appear in the page's source.

Version 4.10.2 released December 2019

Small fix to the 'split into multiple files if output exceeds 64,000 rows' option
Splitting files at 64000 rows now takes account of the header row
System requirement increases to MacOS 10.10
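To illustrate what "takes account of the header row" means for the 64,000-row split, here is a minimal sketch. It assumes the header is repeated in every part and counts towards the limit. The function name and parameters are hypothetical, not taken from WebScraper.

```python
def split_rows(header, rows, max_rows=64000):
    # Split data rows into parts so that each part, INCLUDING its header
    # row, contains at most max_rows rows.
    if max_rows < 2:
        raise ValueError("max_rows must leave room for a header and one data row")
    per_file = max_rows - 1  # one row of each part is reserved for the header
    parts = []
    for start in range(0, len(rows), per_file):
        parts.append([header] + rows[start:start + per_file])
    return parts

# Small demonstration with a limit of 4 rows per part
parts = split_rows(["url"], [[str(i)] for i in range(10)], max_rows=4)
print([len(p) for p in parts])  # [4, 4, 4, 2]
```

Each part could then be written out with the standard csv module. Without the minus one, every part would exceed the limit by a single row once the header is added.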

Version 4.10.1 released December 2019

Adds option to strip html markup from results of class/id or regex extraction
Adds option to leave hash (#) in url. By default this is trimmed, as it is assumed to mark a location within the document (a fragment). But for some sites which use the hash in their urls incorrectly, it may be an important part of the page url. The option should be left switched off unless you are sure that it needs to be turned on
System requirement increases to MacOS 10.10
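The effect of the hash option can be sketched with Python's standard urllib.parse.urldefrag. This is only an illustration of the idea: the function name and the keep_hash parameter are hypothetical, and the example urls are made up.

```python
from urllib.parse import urldefrag

def normalise_url(url, keep_hash=False):
    # Default behaviour: drop everything from '#' onwards, treating it as
    # a location within the document rather than part of the page url.
    if keep_hash:
        return url
    return urldefrag(url).url

# A conventional fragment: trimming loses nothing important.
print(normalise_url("https://example.com/page#section"))
# https://example.com/page

# A site that puts the page identity after the hash: with trimming,
# different pages collapse to the same url.
print(normalise_url("https://example.com/#/products/42"))
# https://example.com/
print(normalise_url("https://example.com/#/products/42", keep_hash=True))
# https://example.com/#/products/42
```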

Version 4.9.2 released October 2019

Fixes a bug causing the class list to not show in the class helper window immediately after the user has scanned a non-html starting url.
Adds scrolling to the test results field in the regex helper window. (remember that it's possible to adjust the size of the split panes in the helper window in order to make the test results field larger)

Version 4.9.1 released September 2019

Adds option for regex columns to only take the first match and ignore further matches
Tested and supported on Catalina

Version 4.9.0 released July 2019

Adds td to the list of tags which are searched when you specify a class or id (it already searched span, div, p, dd, h1-h7)
Inherits improvements to the crawling engine (as used in the recently-released Integrity and Scrutiny v9)
App is notarized by Apple from version 4.9.0

Version 4.8.4 released March 2019

New option for when saving images - option for a longer filename based on the image's url path. This may not usually be necessary, but if an image appears on many pages with the same filename (eg /300w.png) then this leads to the image being overwritten every time it's saved, if the filename only is used.
'New Project' clears any results in the results tab and resets a number of other things.

Version 4.8.3 released March 2019

'New Project' wasn't clearing the 'stop at X rows in the results' checkbox, leading to confusion. It does now.

Version 4.8.2 released January 2019

Important fix, bug affecting extracting contents of named class / id where <div id = "...

Version 4.8.1 released January 2019

Fixes problem experienced when using proxycrawl service and starting with a local list of urls.

Version 4.8.0 released January 2019

Can use the ProxyCrawl service to use different proxy servers and user-agent string etc for each request. Simply set up an account with ProxyCrawl (free up to 1000 successful requests per month), enter your token in Preferences, switch on "Use ProxyCrawl" in your site's advanced settings.
File menu now has a 'Save Project' option as well as a 'Save Project As...' option which work as you'd expect.
Fixes issue causing black / whitelist rules from a previously open project to appear in a project after a certain sequence of events.
Fixes main tab view switching to empty results tab after a saved project is opened.

Version 4.7.4 released January 2019

When opening a project file after running a scan or partial scan, any existing results are cleared from the table and Go button is reset
Fixes and improvements to class/id helper:
- if search box had been used to filter the list, the incorrect item could be sent after double-clicking a result to choose it
- search box above list is now case-insensitive and searches the class names as well as the contents

Version 4.7.3 unreleased

Inherits some minor improvements to the scanning engine
Some 'under the hood' changes to enable some advanced options

Version 4.7.2 released November 2018

Small but important enhancement to whitelisting rules. If a page meets the 'output filter' rules (which means that it's an 'information page' or 'detail page') it'll be included in the crawl regardless of the rules that are set up in the scan blacklist / whitelist rules.
This makes it easier to set up WebScraper where you want to limit the scan to search results or a certain section of the site, but gather information from detail pages which don't meet those scan rules.
Some updates to the context help and other small fixes / enhancements.
(4.7.22) when user clicks on application icon, main window is brought to front, re-opening it if it has been closed.
(4.7.23) fixes problem with resizing in the source panel of the regex helper

Version 4.7.1 released November 2018

If the option to split multiple values onto separate rows is used, and data in any cell exceeded 1000 characters, then the data would be truncated. This limit is now increased to 10 KB
If the option to split multiple values onto separate rows is used, and the multi-value data in a cell contains the same return character that denotes the end-of-line in the CSV, then the 'split rows' function would fail
Other small bug-fixes

Version 4.7.0 released October 2018

Adds a new tab, 'Post process output file'. A couple of options have been shifted there, relating to the CSV file (splitting multiple values onto separate rows, and splitting the output file into 64k chunks).
A new option added to the 'post process' tab; 'remove rows where this column is empty...'.
A new option added to 'stop at X rows in the results' which is more relatable than the existing 'stop at X links' (which is a safety valve and is still present. That one should contain a number which is bigger than the number of links on the site that you're scanning. The default of 200,000 should be fine but add a zero if necessary.)
If a large csv file was being split into parts with max 64k rows, files after the first one wouldn't contain headings. They do now.
Temp files are cleaned up when Webscraper quits normally.

Version 4.6.0 released October 2018

Class and Regex helpers are more helpful:
- Able to select text in Class helper to see which classes apply and choose the most appropriate one
- Able to press 'Use this' in Regex helper to insert the current expression into the 'add column' dialog and return to that dialog
- Replaces the 'press return to test' with a 'test' button to avoid the unexpected
- Related to the above, fixes a problem that has existed since the class and regex helper windows became one. If the helper is switched from class to regex or vice versa before an expression or class is chosen, then the 'add column' dialog will now show the appropriate tab with the expression or class filled in.
Helps you to write the regular expression:
- Copy and paste a suitable chunk from the source code, select part that you want to actually collect, press the new "(xyz)" button
- If the part you want to collect is a decimal number, press the "(123)" button
- Press the "XYZ" button to replace selected parts of the pasted code that you don't want to collect but that may be different on each page
- Press the "↵" button to replace all whitespace with a suitable expression fragment. This makes the expression more robust by allowing for invisible space to vary between pages
Expression field within Regex helper is smarter:
- Automatically trims whitespace and return characters from each end of pasted source code. With a single-line text field this is often a cause of frustration and confusion
- Automatically replaces return characters within multi-line pasted code to make it more reliable and make sure that everything is visible in the single-line text field
Adds user preferences for many of these things
Adds preference for when exporting to CSV format, if the output file is bigger than 64k rows, to save multiple files. Older versions of Excel (and current versions of Numbers) have a limit of 64k rows. This preference is off by default and the decision needs to be made before running the scan because the output is split while scanning.
Small fixes
- When a project was saved, the output filter switch ("all of the following / one of the following") wasn't being saved. This is now fixed

Version 4.5.0 released October 2018

Adds option to simply add a column for 'h1' through to 'h4' (under 'Content' in the 'Add a column' dialog). Useful for info within a heading that doesn't have a class or id.
Enhances the 'list of urls' functionality. If your starting point is a local list of urls in plain text, and if they are 'deep links' rather than domains, then the 'down but not up' rule will apply unless the 'crawl above starting directory' checkbox is ticked.
Small interface glitch corrected. If a scan was run to completion, small changes were made to the configuration and Go was pressed again, the scan would proceed, clearing previous data, but the counter within the url / progress field (to the right) would not be reset.
Inherits any recent changes within the Integrity crawling engine
(4.5.1) A small but important bug fix - the output filter table would sometimes not allow deletion or addition of criteria. This was inconsistent, possibly related to saving and re-loading a project.

Version 4.4.2 released September 2018

dark-mode-ready
Fixes bug that prevented information being accessed within headings below h3 (h4, h5 etc) by class or id.

Version 4.4.1 released August 2018

dark-mode-ready
Fixes bug that could result in column information (complex setup) becoming misaligned after dragging and dropping to reorder the columns.

Version 4.4.0 released August 2018

dark-mode-ready
Adds 'crawl above starting directory' control (below blacklist / whitelist table on Scan tab). This is useful in cases where you want to start at a deep url, but to collect data from linked pages which aren't necessarily within the starting directory. You will then probably want to limit your scan using 'crawl maximum links from home' or blacklisting / whitelisting.

Version 4.3.2 released August 2018

Built to take advantage of 10.14's new dark mode. Respects the user's system-wide setting
Updates the Integrity crawling engine within the app to the latest version

Version 4.3.1 released May 2018

Fixes issue where extraneous dashes could appear within markdown
Fixes bug causing some text to be missing from markdown immediately following comments in the html
Takes out some diagnostic messages and some directory structure creation when downloading images to a folder

Version 4.3 released May 2018

Improves the scan 'blacklist / whitelist rules' in the UI. Previously a couple of fields for comma-separated lists of terms. Now a table allowing you to create rules with more options which are more human-readable, such as "ignore urls that contain..." or "only follow urls that don't contain....". Existing rules in your saved projects should appear in the new format
Now detects the situation where the scan doesn't proceed because the starting url returns a bad status. Previously nothing happened at that point, now a dialog is shown, displaying the starting url and the status returned, and gives some advice.

Version 4.2.3 (not released)

Adds support for the tag, its src is included in the link results

Version 4.2.2 released May 2018

Fixes a problem with the scan > ignore urls containing / only follow urls containing

Version 4.2.1 released May 2018

Adds option to ignore <nav> and <header> / <footer> when extracting content as plain text / markdown (defaults for a new project are to *include* the contents of the nav, header and footer)
Fixes bug in engine relating to the 'output filter' (only scraping data from pages containing or not containing X)
Slight change to the way that 'ignore urls containing' under scan works. It wasn't ignoring these urls completely, but 'not following'. Practically this probably makes little difference but the operation more accurately matches the wording now, and may make the scan more efficient.
Too much information was being sent to the console in recent version(s). This is tidied up a bit.

Version 4.2.0 released May 2018

Improves the ' output file column builder' table - columns appear in columns rather than rows as before, so hopefully easier to use. You can drag the columns to re-order them, edit their headings, edit the configuration of that column or delete the column.
Improves the output file filter (Used to be called 'information page contains'). This can now be regarded as a 'select where' and allows for setting up a number of rules, AND'd or OR'd. These can be based on a 'contains' partial match, or regex. With more options for each rule such as contains / doesn't contain, and applying the rule to the url or the entire content.
Adds a proper links table, this can be used to collect / list all links discovered on the way, and optionally image urls too. This list can be filtered for just links / just images / internal / external / redirected / pdf documents
Adds capability to easily extract headings (h1-h7) with particular class / id (previously the class or id method was limited to divs, spans, p's and dd's)
Alters the 'count' that is displayed at the right of the address bar. Now it literally displays the number of pages scraped which = rows in the output table. Previously it was a count of pages discovered, which may not be the same number now that you can make rules that act as an output filter.
Fixes recently-introduced bug which prevented your output columns from saving properly in a saved project

Version 4.1.1 released April 2018

Allows editing of your table columns (previously, to change anything other than the column heading, it was necessary to delete the row and add a new one).
Also allows re-ordering of columns, by dragging and dropping in the 'preview' table lower down
Unifies the helper windows. Also now available from the View menu. This makes the helper window a potentially useful standalone tool
Adds 'copy' button to the regex helper, copies the expression to the clipboard so that it can be used in the 'add column' dialog

Version 4.1.0 released April 2018

Adds capability of downloading images to a folder during the scan. See Complex setup > Output file columns > Also download images to folder.
- Images can optionally be downloaded only if they match a pattern, either partial url or regex match. (leave box blank to download all images discovered)
Adds option to filter output file - ie only include data in output file from certain pages (eg information pages or product pages). This is done by matching the url of the page being scraped, either by partial url (eg /product/) or a regex match
Fixes issue with saving project. (note that saving project does not save data, only settings and configuration. Save data separately using Export from the Results screen or File > Export)

Version 4.0.0 released as beta April 2018

Incorporates the version 8 crawling engine which has many improvements
Adds 'limit requests to X per minute' control
Updates pre-defined user-agent strings

Version 3.0.2 released as full release Oct 2017

Adds 'text' as an option for output file format. This is designed for archiving website content (markdown or plain text) in a single text file
Fixes some issues with the Markdown conversion - adds options to include images, and include link urls within the markdown.
Fixes some odd things happening in the new interface with certain 'simple interface' settings
Adds a few checks and balances to prevent the user doing anything illogical like changing the file format and pressing 'export' again without re-scanning. (The output file is built while the scan is running so the format and options can't be changed.)

Version 3.0.1 released as beta Oct 2017

Fixes a couple of issues with the new interface which might have given unexpected results

Version 3.0 released as beta Oct 2017

Improves interface - 'Integrity-like'
Removes the 'preview' and puts the results in a table within the app, with an export button
Adds a 'simple setup' for very quick and easy grab of a single item from each page. For more columns in the output you can switch to 'complex setup' for the column selection options as per version 2.

Version 2.1.0 released Sep 2017

Adds option to open a list of urls to be scraped in a local text file (or csv with a single column containing the urls)
Adds File > Open List of Links for this purpose
Updates the engine to inherit many improvements and fixes from the Integrity crawler

Version 2.0.4 released May 2017

Fixes a problem causing regex extraction and helper to fail if page uses ISO-8859-1 encoding

Version 2.0.3 released Apr 2017

Fixes problem causing scan to continue with the previous scan rather than starting a new one, if the previous one has been paused and then a new project opened or started
Fixes a problem with scanning locally (file://) - scan was extending beyond the local files if there were external links on the pages
Fixes helpers not working with local html files
Fixes problem with class helper, if the class /id list was too long and needed scrolling, the 'hover to highlight' functionality wasn't working beyond the area originally visible
Live view also greys out external links - they won't be scraped but they are in the live view table because they're part of the crawling process.
Adds field "information page" alongside blacklist and whitelist fields. This is useful if your information appears on detail pages which can be identified with a partial url (/mip/ in the case of yellowpages.com) and you want to collect information from such pages, but not parse the links on those pages.

Version 2.0.2 released Apr 2017

Adds option to split multiple values in csv column - the row is multiplied to show each separate value in a separate row
Adds option to ignore session id - this is important when you want to scan a site where parameters in the querystring are important, but it contains a session id which may change as the scan progresses, preventing the scan from finishing.
Adds display of number of pages scanned to the 'progress bar' field while the scan is running
Fixes bug messing up output file if regular expression contains quotes
Crawl is paused if user uses the breadcrumb widget to navigate away from the crawl screen
Preview button is prevented if scan is running - this could cause chaotic results
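The 'split multiple values' option in this release multiplies a row so that each value gets its own row. A minimal sketch of that idea follows. It is not WebScraper's code: the function name, the column index argument and the "|" separator are all assumptions made for the example.

```python
def split_multi_value_rows(rows, column, separator="|"):
    # For each row, emit one copy per value found in the chosen column.
    # All other cells are repeated unchanged in every copy.
    out = []
    for row in rows:
        for value in row[column].split(separator):
            new_row = list(row)
            new_row[column] = value
            out.append(new_row)
    return out

# Hypothetical scraped rows: url, colours (multi-value)
rows = [
    ["https://example.com/a", "red|green"],
    ["https://example.com/b", "blue"],
]
for r in split_multi_value_rows(rows, column=1):
    print(r)
# ['https://example.com/a', 'red']
# ['https://example.com/a', 'green']
# ['https://example.com/b', 'blue']
```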

Version 2.0.1 released Apr 2017

Out of beta, adds 30 day trial period and purchase options.
Fixes login window (for authentication)
Some small fixes

Version 2.0 (free beta) released Apr 2017

Improves navigation, logical progression of screens with breadcrumb widget
Adds save / load project (saves settings / configuration, not data)
Changes logical flow of program - instead of performing the scan (storing the data) and then generating the output file, the output file is configured, and then the scan is performed.
- pro: scan is far more efficient (larger sites can be scanned)
- con: if you want to reconfigure your output file, another scan must be performed
Live view (for debugging / helping with configuration) now clearly shows which urls it will include in the output file, and which meet the blacklist rules / don't meet the whitelist rules
Fixes problem with some classes not being recognised
Improves class helper - improves red box highlighting where itemprop is being used within div, span etc
New icon for application and document icon for saved projects
Many small fixes and improvements

Version 1.4.6 released Mar 2017

Rolls out fix to class parser where end comment like this -------> would not be recognised as an end comment
Fixes problem causing some id's not to be picked up
Adds recognition of itemprop="" to id & class

Version 1.4.4 / 1.4.5 released Mar 2017

Corrects an incorrect combination of control characters. If the user had selected the Windows-type line separator (CR+LF), WebScraper was inserting LF+CR.
Fixes problem preventing user from selecting plain text as a page content option when adding a column to their output file.

Version 1.4.3 released Mar 2017

Important fix to the crawling engine around auto-detection of whether starting url is a page or directory in ambiguous cases (this affects the scope of the scan)

Version 1.4.2 released Feb 2017

Changes to the output file builder to make that neater and more user-friendly
Fixes crash/hang when app is unlicensed and fewer than 5 pages have been scanned
Inherits a fix to the engine, not always recognising an end comment where it looks like this: -------------->
Fixes occasional issue with changes to starting url field not being recognised right away

Version 1.4.1 released Jan 2017

Fixes bug causing some data to not appear in output file sometimes where classes are being used.

Version 1.4 released Jan 2017

Adds much better output file builder, allows user to select and add columns, then preview the result using quicklook
Gives columns in output file human-readable column headings rather than WebScraper internal field names. You can edit these column headings (or keys if you're outputting json)
If you use the json output, for quicklook to show a preview of the output, you may need a quicklook json plugin such as http://www.sagtau.com/quicklookjson.html
Adds regex helper (allows you to mess around with your pattern and see the result)

Improvement to class helper window, handles clicks in the web preview pane, user can click to deeper pages (and back up) and displayed url and class list is updated accordingly

Improvement to properly handle the situation where the user's classes or ids have the same name as one of the core fields like 'headings', 'description', 'title' etc.

Version 1.3 released Jan 2017

Adds ability to log into a site and perform the scan as an authenticated user (must be used with caution - only log in as a user with read access - not admin rights to the website.) Don't forget to blacklist urls containing 'logout' or what have you.
Adds 'crawl blacklist' field (don't follow urls containing...). Can be used with the 'crawl whitelist' (only follow urls containing...) or as an alternative. Can help to limit the crawl to the pages you're interested in.
Crawl whitelist and blacklist can accept multiple values - separate with a comma (whitespace optional but harmless)
Adds Regex checkbox / field to output options. If brought into play, the whole source for each page is checked and matches are included in the output file. 'Capture groups' are respected (only capture groups are concatenated and included in the output file - unless no groups are included in the expression, in which case the whole match is used.)
Adds low disk space detection - offers to stop or continue before space (on the system disk '/' ) becomes critical
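The capture-group rule described for the Regex field in this release can be sketched with Python's re module. This is an illustration under assumptions, not the app's code: the function name is hypothetical, and joining groups with an empty string is a guess at what "concatenated" means.

```python
import re

def extract(pattern, source, joiner=""):
    # For each match: if the expression has capture groups, keep only the
    # groups, concatenated; otherwise keep the whole match.
    compiled = re.compile(pattern)
    results = []
    for match in compiled.finditer(source):
        if compiled.groups:
            parts = [g for g in match.groups() if g is not None]
            results.append(joiner.join(parts))
        else:
            results.append(match.group(0))
    return results

# Hypothetical page source
source = '<span class="price">12.50</span> <span class="price">7.00</span>'

# With groups: only the captured parts are kept and concatenated.
print(extract(r'class="price">(\d+)\.(\d+)<', source))  # ['1250', '700']

# Without groups: the whole match is used.
print(extract(r'\d+\.\d+', source))  # ['12.50', '7.00']
```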

Version 1.2 (no longer beta) released Jan 2017

Improves class helper - it now works more smoothly
Adds purchase price, 30 day trial period, limits output file to 10 rows while in trial mode

Version 1.1 (still beta) released Dec 2016

Now can extract data within <dd> and <p> (as long as the tags have a class or id)
Allows you to add multiple classes or ids to your output file. In the 'class or id' box, type the class / id names, separated by comma
Allows you to limit your scan by whitelisting, ie if you type a partial url into the whitelist box, after the homepage, Webscraper will only follow links matching your whitelist term. This allows for some limited pattern matching.
Adds a context help system, click the 'i' buttons for information about specific controls.

Version 1.0.3 released Nov 2016

Adds 'Class helper' window - shows you all the classes / id's that exist, and the contents of those, for your starting url or any other page. Plus a preview, highlighting the hovered class / id in red. Thus allowing you to easily configure your output file without ploughing through html source code yourself.
Important fix - if a class name ended just before the closing angled bracket, with no quotes around it, then that class name would be messed up and following ones may be missed
Adds option and configurable separator character for separating multiple data in the output file (eg where multiple divs appear on the same page with the same class)

Version 1.0.2 released Oct 2016

Version 1.0.1 released June 2016

Version 1.0 released May 2016