Scrape data or archive content from a website.
- Fast and easy to scan and screen-scrape a site
- Can use different IP, user-agent etc for each request via the ProxyCrawl service
- A native MacOS application, runs on your desktop
- Plenty of ways to extract data: various metadata, content (as text, html or markdown), elements with certain classes / ids, regular expressions
- Easy to export data - choose the columns you want
- Output data as csv or json
- Options to download all images to a folder / collect and export all links
- Option to output a single text file (designed for archiving text content, markdown or plain text)
- Plenty of options / configuration
- Haven't got the time to do the job? Let me do it.
- Free Trial
Category:Developer Tools / Web Scrapers
You can buy from within WebScraper after trying it, or use this secure link: https://pay.paddle.com/checkout/509735
Current version requires Mac OS 10.8 or higher
What should I do with the downloaded file?
Open the .dmg file and find the application inside. If you want to keep using WebScraper, drag and drop it into your Applications folder. To keep it in your dock, right-click or click-and-hold on its dock icon and choose 'Keep in dock'.
Developer: S P Dixon
Version 4.8.3 released March 2019
- 'New Project' wasn't clearing the 'stop at X rows in the results' checkbox, leading to confusion. It does now.
Version 4.8.2 released January 2019
- Important fix, bug affecting extracting contents of named class / id where <div id = "...
Version 4.8.1 released January 2019
- Fixes problem experienced when using the ProxyCrawl service and starting with a local list of urls.
Version 4.8.0 released January 2019
- Can use the ProxyCrawl service to use different proxy servers and user-agent string etc for each request. Simply set up an account with ProxyCrawl (free up to 1000 successful requests per month), enter your token in Preferences, switch on "Use ProxyCrawl" in your site's advanced settings.
- File menu now has a 'Save Project' option as well as a 'Save Project As...' option which work as you'd expect.
- Fixes issue causing black / whitelist rules from a previously open project to appear in a project after a certain sequence of events.
- Fixes main tab view switching to empty results tab after a saved project is opened.
Version 4.7.4 released January 2019
- When opening a project file after running a scan or partial scan, any existing results are cleared from the table and Go button is reset
- Fixes and improvements to class/id helper:
  - If the search box had been used to filter the list, the incorrect item could be sent after double-clicking a result to choose it
  - Search box above list is now case-insensitive and searches the class names as well as the contents
Version 4.7.3 unreleased
- Inherits some minor improvements to the scanning engine
- Some 'under the hood' changes to enable some advanced options
Version 4.7.2 released November 2018
- Small but important enhancement to whitelisting rules. If a page meets the 'output filter' rules (which means that it's an 'information page' or 'detail page') it'll be included in the crawl regardless of the rules that are set up in the scan blacklist / whitelist rules.
  - This makes it easier to set up WebScraper where you want to limit the scan to search results or a certain section of the site, but gather information from detail pages which don't meet those scan rules.
- Some updates to the context help and other small fixes / enhancements.
- (4.7.22) when user clicks on application icon, main window is brought to front, re-opening it if it has been closed.
- (4.7.23) fixes problem with resizing in the source panel of the regex helper
Version 4.7.1 released November 2018
- If the option to split multiple values onto separate rows was used and the data in any cell exceeded 1000 characters, the data would be truncated. The limit is now increased to 10kb
- Fixes failure of the 'split rows' function when the option to split multiple values onto separate rows is used and the multi-value data in a cell contains the same return character that denotes the end-of-line in the CSV
- Other small bug-fixes
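
The 'split rows' behaviour mentioned in these notes can be sketched roughly as follows. This is an illustrative Python sketch, not WebScraper's own code: the function name `split_rows`, the `|` separator and the sample column names are all assumptions. It relies on a CSV parser that understands quoted fields, so a return character inside a cell is not mistaken for the end of a row.

```python
import csv
import io


def split_rows(csv_text, column, separator="|"):
    """Return CSV text with one output row per value found in `column`."""
    # newline="" leaves embedded return characters untouched for the csv module
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for value in row[column].split(separator):
            # Copy the whole row, replacing the multi-value cell with one value
            writer.writerow({**row, column: value})
    return out.getvalue()


# Hypothetical sample: the "note" cell contains a return character inside quotes
sample = 'url,note,tag\nhttp://example.com/a,"line one\nline two",red|blue\n'
print(split_rows(sample, "tag"))
```

The row is repeated once per value, and the quoted multi-line cell survives intact in each copy.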
Version 4.7.0 released October 2018
- Adds a new tab, 'Post process output file'. A couple of options have been shifted there, relating to the CSV file (splitting multiple values onto separate rows, and splitting the output file into 64k chunks).
- A new option added to the 'post process' tab; 'remove rows where this column is empty...'.
- A new option, 'stop at X rows in the results', which is more relatable than the existing 'stop at X links'. The latter is a safety valve and is still present; it should contain a number which is bigger than the number of links on the site that you're scanning. The default of 200,000 should be fine but add a zero if necessary.
- If a large csv file was being split into parts with max 64k rows, files after the first one wouldn't contain headings. They do now.
- Temp files are cleaned up when WebScraper quits normally.
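
Splitting a large result set into parts that each carry the headings might look like the sketch below. It is only an illustration in Python: `chunk_rows` is a hypothetical name, and whether the 64k limit counts the heading row is an assumption here (this sketch does not count it).

```python
from itertools import islice


def chunk_rows(header, rows, max_rows=65536):
    """Yield lists of rows, each starting with the header, at most max_rows data rows each."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, max_rows))
        if not chunk:
            return
        # Every part gets its own copy of the headings, not just the first one
        yield [header] + chunk


# Tiny demonstration with a limit of 2 data rows per part
parts = list(chunk_rows(["url", "title"], [["a", "A"], ["b", "B"], ["c", "C"]], max_rows=2))
for number, part in enumerate(parts, start=1):
    print(number, part)
```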
Version 4.6.0 released October 2018
- Class and Regex helpers are more helpful:
  - Able to select text in Class helper to see which classes apply and choose the most appropriate one
  - Able to press 'Use this' in Regex helper to insert the current expression into the 'add column' dialog and return to that dialog
  - Replaces the 'press return to test' with a 'test' button to avoid the unexpected
  - Related to the above, fixes a problem that has existed since the class and regex helper windows became one. If the helper is switched from class to regex or vice versa before an expression or class is chosen, then the 'add column' dialog will now show the appropriate tab with the expression or class filled in.
- Helps you to write the regular expression:
  - Copy and paste a suitable chunk from the source code, select the part that you want to actually collect, press the new "(xyz)" button
  - If the part you want to collect is a decimal number, press the "(123)" button
  - Press the "XYZ" button to replace selected parts of the pasted code that you don't want to collect but that may be different on each page
  - Press the "↵" button to replace all whitespace with a suitable expression fragment. This makes the expression more robust by allowing for invisible space to vary between pages
- Expression field within Regex helper is smarter:
  - Automatically trims whitespace and return characters from each end of pasted source code. With a single-line text field this is often a cause of frustration and confusion
  - Automatically replaces return characters within multi-line pasted code to make it more reliable and make sure that everything is visible in the single-line text field
- Adds user preferences for many of these things
- Adds preference for when exporting to CSV format, if the output file is bigger than 64k rows, to save multiple files. Older versions of Excel (and current versions of Numbers) have a limit of 64k rows. This preference is off by default and the decision needs to be made before running the scan because the output is split while scanning.
- Small fixes
  - When a project was saved, the output filter switch ("all of the following / one of the following") wasn't being saved. This is now fixed
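
The idea of trimming pasted source code and replacing its whitespace with an expression fragment can be illustrated as below. This is a Python sketch under assumptions: the fragment WebScraper actually inserts is not stated in these notes, so `\s+` is a guess, and `whitespace_tolerant` is a hypothetical name. The sketch also escapes the pasted text, which the app may or may not do.

```python
import re


def whitespace_tolerant(snippet):
    """Turn pasted source into a pattern where any run of whitespace matches any other."""
    # strip() trims whitespace and return characters from each end
    parts = re.split(r"\s+", snippet.strip())
    # Escape the literal pieces, then allow variable whitespace between them
    return r"\s+".join(re.escape(part) for part in parts)


pasted = "  <td>\n      Price:</td>\n"
pattern = whitespace_tolerant(pasted)
print(pattern)
# The same pattern matches a page where the invisible space differs
print(bool(re.search(pattern, "<td> Price:</td>")))
```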
Version 4.5.0 released October 2018
- Adds option to simply add a column for 'h1' through to 'h4' (under 'Content' in the 'Add a column' dialog). Useful for info within a heading that doesn't have a class or id.
- Enhances the 'list of urls' functionality. If your starting point is a local list of urls in plain text, and if they are 'deep links' rather than domains, then the 'down but not up' rule will apply unless the 'crawl above starting directory' checkbox is ticked.
- Small interface glitch corrected. If a scan was run to completion, small changes were made to the configuration and Go was pressed again, the scan would proceed, clearing previous data, but the counter within the url / progress field (to the right) would not be reset.
- Inherits any recent changes within the Integrity crawling engine
- (4.5.1) A small but important bug fix - the output filter table would sometimes not allow deletion or addition of criteria. This was inconsistent, possibly related to saving and re-loading a project.
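
One way to picture the 'down but not up' rule for deep links is the Python sketch below. It is a simplification and not the Integrity engine's logic: `in_scope` and its `crawl_above` parameter are hypothetical names, and the sketch assumes a starting url is a page unless its path ends in a slash.

```python
from urllib.parse import urlsplit


def in_scope(start_url, candidate_url, crawl_above=False):
    """Return True if candidate_url may be crawled given the starting url."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if start.netloc != cand.netloc:
        return False  # external link
    if crawl_above:
        return True  # 'crawl above starting directory' is ticked
    # Directory part of the starting path, e.g. /shop/shoes/index.html -> /shop/shoes/
    start_dir = start.path.rsplit("/", 1)[0] + "/"
    return (cand.path or "/").startswith(start_dir)


start = "http://example.com/shop/shoes/index.html"
print(in_scope(start, "http://example.com/shop/shoes/red.html"))
print(in_scope(start, "http://example.com/shop/hats/"))
print(in_scope(start, "http://example.com/shop/hats/", crawl_above=True))
```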
Version 4.4.2 released September 2018
- Fixes bug that prevented information being accessed within headings below h3 (h4, h5 etc) by class or id.
Version 4.4.1 released August 2018
- Fixes bug that could result in column information (complex setup) becoming misaligned after dragging and dropping to reorder the columns.
Version 4.4.0 released August 2018
- Adds 'crawl above starting directory' control (below blacklist / whitelist table on Scan tab). This is useful in cases where you want to start at a deep url, but to collect data from linked pages which aren't necessarily within the starting directory. You will then probably want to limit your scan using 'crawl maximum links from home' or blacklisting / whitelisting.
Version 4.3.2 released August 2018
- Built to take advantage of 10.14's new dark mode. Respects the user's system-wide setting
- Updates the Integrity crawling engine within the app to the latest version
Version 4.3.1 released May 2018
- Fixes issue where extraneous dashes could appear within markdown
- Fixes bug causing some text to be missing from markdown immediately following comments in the html
- Takes out some diagnostic messages and some directory structure creation when downloading images to a folder
Version 4.3 released May 2018
- Improves the scan 'blacklist / whitelist rules' in the UI. Previously these were a couple of fields for comma-separated lists of terms. Now there is a table allowing you to create rules with more options which are more human-readable, such as "ignore urls that contain..." or "only follow urls that don't contain...". Existing rules in your saved projects should appear in the new format
- Now detects the situation where the scan doesn't proceed because the starting url returns a bad status. Previously nothing happened at that point, now a dialog is shown, displaying the starting url and the status returned, and gives some advice.
Version 4.2.3 (not released)
- Adds support for the
Version 4.2.2 released May 2018
- Fixes a problem with the scan > ignore urls containing / only follow urls containing
Version 4.2.1 released May 2018
- Adds option to ignore <nav> and <header> / <footer> when extracting content as plain text / markdown (defaults for a new project are to *include* the contents of the nav, header and footer)
- Fixes bug in engine relating to the 'output filter' (only scraping data from pages containing or not containing X)
- Slight change to the way that 'ignore urls containing' under scan works. It wasn't ignoring these urls completely, but 'not following' them. Practically this probably makes little difference but the operation more accurately matches the wording now, and may make the scan more efficient.
- Too much information was being sent to the console in recent version(s). This is tidied up a bit.
Version 4.2.0 released May 2018
- Improves the 'output file column builder' table - columns appear in columns rather than rows as before, so hopefully easier to use. You can drag the columns to re-order them, edit their headings, edit the configuration of that column or delete the column.
- Improves the output file filter (used to be called 'information page contains'). This can now be regarded as a 'select where' and allows for setting up a number of rules, AND'd or OR'd. These can be based on a 'contains' partial match, or regex, with more options for each rule such as contains / doesn't contain, and applying the rule to the url or the entire content.
- Adds a proper links table, this can be used to collect / list all links discovered on the way, and optionally image urls too. This list can be filtered for just links / just images / internal / external / redirected / pdf documents
- Adds capability to easily extract headings (h1-h7) with particular class / id (previously the class or id method was limited to divs, spans, p's and dd's)
- Alters the 'count' that is displayed at the right of the address bar. Now it literally displays the number of pages scraped, which equals the number of rows in the output table. Previously it was a count of pages discovered, which may not be the same number now that you can make rules that act as an output filter.
- Fixes recently-introduced bug which prevented your output columns from saving properly in a saved project
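
The 'select where' style of output filter described in these notes could be modelled along the lines below. This is a hedged Python sketch, not the app's implementation: the `Rule` class, its field names and `passes_filter` are all hypothetical, chosen only to mirror the options listed (contains or regex, contains / doesn't contain, url or entire content, AND'd or OR'd).

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    target: str            # "url" or "content"
    pattern: str
    is_regex: bool = False
    negate: bool = False   # True means "doesn't contain"

    def matches(self, url, content):
        text = url if self.target == "url" else content
        if self.is_regex:
            found = re.search(self.pattern, text) is not None
        else:
            found = self.pattern in text  # simple partial match
        return found != self.negate


def passes_filter(rules, url, content, match_all=True):
    """Combine rules with AND (match_all=True) or OR (match_all=False)."""
    results = [rule.matches(url, content) for rule in rules]
    return all(results) if match_all else any(results)


rules = [
    Rule("url", "/product/"),
    Rule("content", r"Price:\s*\d+", is_regex=True),
]
print(passes_filter(rules, "http://example.com/product/1", "Price: 10"))
print(passes_filter(rules, "http://example.com/about", "Price: 10"))
print(passes_filter(rules, "http://example.com/about", "Price: 10", match_all=False))
```

Note that with an empty rule list this sketch lets every page through in AND mode and none in OR mode; how the app treats an empty filter is not covered here.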
Version 4.1.1 released April 2018
- Allows editing of your table columns (previously, to change anything other than the column heading, it was necessary to delete the row and add a new one).
- Also allows re-ordering of columns, by dragging and dropping in the 'preview' table lower down
- Unifies the helper windows. Also now available from the View menu. This makes the helper window a potentially useful standalone tool
- Adds 'copy' button to the regex helper, copies the expression to the clipboard so that it can be used in the 'add column' dialog
Version 4.1.0 released April 2018
- Adds capability of downloading images to a folder during the scan. See Complex setup > Output file columns > Also download images to folder.
- Images can optionally be downloaded only if they match a pattern, either partial url or regex match. (leave box blank to download all images discovered)
- Adds option to filter output file - ie only include data in output file from certain pages (eg information pages or product pages). This is done by matching the url of the page being scraped, either by partial url (eg /product/) or a regex match
- Fixes issue with saving project. (note that saving project does not save data, only settings and configuration. Save data separately using Export from the Results screen or File > Export)
Version 4.0.0 released as beta April 2018
- Incorporates the version 8 crawling engine which has many improvements
- Adds 'limit requests to X per minute' control
- Updates pre-defined user-agent strings
Version 3.0.2 released as full release Oct 2017
- Adds 'text' as an option for output file format. This is designed for archiving website content (markdown or plain text) in a single text file
- Fixes some issues with the Markdown conversion - adds options to include images, and include link urls within the markdown.
- Fixes some odd things happening in the new interface with certain 'simple interface' settings
- Adds a few checks and balances to prevent the user doing anything illogical like changing the file format and pressing 'export' again without re-scanning. (The output file is built while the scan is running so the format and options can't be changed.)
Version 3.0.1 released as beta Oct 2017
- Fixes a couple of issues with the new interface which might have given unexpected results
Version 3.0 released as beta Oct 2017
- Improves interface - 'Integrity-like'
- Removes the 'preview' and puts the results in a table within the app, with an export button
- Adds a 'simple setup' for very quick and easy grab of a single item from each page. For more columns in the output you can switch to 'complex setup' for the column selection options as per version 2.
Version 2.1.0 released Sep 2017
- Adds option to open a list of urls to be scraped in a local text file (or csv with a single column containing the urls)
- Adds File > Open List of Links for this purpose
- Updates the engine to inherit many improvements and fixes from the Integrity crawler
Version 2.0.4 released May 2017
- Fixes a problem causing regex extraction and helper to fail if page uses ISO-8859-1 encoding
Version 2.0.3 released Apr 2017
- Fixes problem causing scan to continue with the previous scan rather than starting a new one, if the previous one has been paused and then a new project opened or started
- Fixes a problem with scanning locally (file://) - scan was extending beyond the local files if there were external links on the pages
- Fixes helpers not working with local html files
- Fixes problem with class helper, if the class /id list was too long and needed scrolling, the 'hover to highlight' functionality wasn't working beyond the area originally visible
- Live view also greys out external links - they won't be scraped but they are in the live view table because they're part of the crawling process.
- Adds field "information page" alongside blacklist and whitelist fields. This is useful if your information appears on detail pages which can be identified with a partial url (/mip/ in the case of yellowpages.com) and you want to collect information from such pages, but not parse the links on those pages.
Version 2.0.2 released Apr 2017
- Adds option to split multiple values in csv column - the row is multiplied to show each separate value in a separate row
- Adds option to ignore session id - this is important when you want to scan a site where parameters in the querystring are important, but it contains a session id which may change as the scan progresses, preventing the scan from finishing.
- Adds display of number of pages scanned to the 'progress bar' field while the scan is running
- Fixes bug messing up output file if regular expression contains quotes
- Crawl is paused if user uses the breadcrumb widget to navigate away from the crawl screen
- Preview button is prevented if scan is running - this could cause chaotic results
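
To show why ignoring a session id matters, here is a small Python sketch that removes session parameters from a querystring so that the same page is recognised as already visited. The parameter names in `SESSION_KEYS` and the function name `strip_session_id` are assumptions for illustration; the notes do not say which names WebScraper looks for.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameter names treated as session ids
SESSION_KEYS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url):
    """Return the url with any session id parameters removed from the querystring."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_KEYS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


first = strip_session_id("http://example.com/list?page=2&PHPSESSID=abc123")
second = strip_session_id("http://example.com/list?page=2&PHPSESSID=zzz999")
print(first)
print(first == second)
```

The other parameters (`page=2` here) are kept, so pages that genuinely differ by querystring are still treated as separate urls.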
Version 2.0.1 released Apr 2017
- Out of beta, adds 30 day trial period and purchase options.
- Fixes login window (for authentication)
- Some small fixes
Version 2.0 (free beta) released Apr 2017
- Improves navigation, logical progression of screens with breadcrumb widget
- Adds save / load project (saves settings / configuration, not data)
- Changes logical flow of program - instead of performing the scan (storing the data) and then generating the output file, the output file is configured, and then the scan is performed.
  - pro: scan is far more efficient (larger sites can be scanned)
  - con: if you want to reconfigure your output file, another scan must be performed
- Live view (for debugging / helping with configuration) now clearly shows which urls it will include in the output file, and which meet the blacklist rules / don't meet the whitelist rules
- Fixes problem with some classes not being recognised
- Improves class helper - improves red box highlighting where itemprop is being used within div, span etc
- New icon for application and document icon for saved projects
- Many small fixes and improvements
Version 1.4.6 released Mar 2017
- Rolls out fix to class parser where end comment like this -------> would not be recognised as an end comment
- Fixes problem causing some id's not to be picked up
- Adds recognition of itemprop="" to id & class
Version 1.4.4 / 1.4.5 released Mar 2017
- Corrects an incorrect combination of control characters. If the user had selected the Windows-type line separator (CR+LF), LF+CR was being inserted
- Fixes problem preventing user from selecting plain text as a page content option when adding a column to their output file.
Version 1.4.3 released Mar 2017
- Important fix to the crawling engine around auto-detection of whether starting url is a page or directory in ambiguous cases (this affects the scope of the scan)
Version 1.4.2 released Feb 2017
- Changes to the output file builder to make that neater and more user-friendly
- Fixes crash/hang when app is unlicensed and fewer than 5 pages have been scanned
- Inherits a fix to the engine, not always recognising an end comment where it looks like this: -------------->
- Fixes occasional issue with changes to starting url field not being recognised right away
Version 1.4.1 released Jan 2017
- Fixes bug causing some data to not appear in output file sometimes where classes are being used.
Version 1.4 released Jan 2017
- Adds much better output file builder, allows user to select and add columns, then preview the result using quicklook
- Gives columns in output file human-readable column headings rather than WebScraper internal field names. You can edit these column headings (or keys if you're outputting json)
- If you use the json output, for quicklook to show a preview of the output, you may need a quicklook json plugin such as http://www.sagtau.com/quicklookjson.html
- Adds regex helper (allows you to mess around with your pattern and see the result)
Version 1.3 released Jan 2017
- Adds ability to log into a site and perform the scan as an authenticated user (must be used with caution - only log in as a user with read access - not admin rights to the website.) Don't forget to blacklist urls containing 'logout' or what have you.
- Adds 'crawl blacklist' field (don't follow urls containing...). Can be used with the 'crawl whitelist' (only follow urls containing...) or as an alternative. Can help to limit the crawl to the pages you're interested in.
- Crawl whitelist and blacklist can accept multiple values - separate with a comma (whitespace optional but harmless)
- Adds Regex checkbox / field to output options. If brought into play, the whole source for each page is checked and matches are included in the output file. 'Capture groups' are respected (only capture groups are concatenated and included in the output file - unless no groups are included in the expression, in which case the whole match is used.)
- Adds low disk space detection - offers to stop or continue before space (on the system disk '/' ) becomes critical
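
The capture-group behaviour described for the Regex field can be illustrated with the Python sketch below. It is not WebScraper's code: `extract_matches` is a hypothetical name, and joining the groups with an empty separator is an assumption, since the notes only say that the groups are concatenated.

```python
import re


def extract_matches(pattern, source, separator=""):
    """Return one string per match: joined capture groups, or the whole match if none."""
    regex = re.compile(pattern)
    results = []
    for match in regex.finditer(source):
        if regex.groups == 0:
            # No capture groups in the expression: use the whole match
            results.append(match.group(0))
        else:
            # Only the captured parts are kept; groups that did not take part are skipped
            results.append(separator.join(g for g in match.groups() if g is not None))
    return results


page = '<span class="sku">AB-12</span> <span class="sku">CD-34</span>'
print(extract_matches(r'class="sku">([A-Z]+)-(\d+)<', page))
print(extract_matches(r"[A-Z]+-\d+", page))
```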
Version 1.2 (no longer beta) released Jan 2017
- Improves class helper - it now works more smoothly
- Adds purchase price, 30 day trial period, limits output file to 10 rows while in trial mode
Version 1.1 (still beta) released Dec 2016
- Now can extract data within <dd> and <p> (as long as the tags have a class or id)
- Allows you to add multiple classes or ids to your output file. In the 'class or id' box, type the class / id names, separated by comma
- Allows you to limit your scan by whitelisting, ie if you type a partial url into the whitelist box, after the homepage, WebScraper will only follow links matching your whitelist term. This allows for some limited pattern matching.
- Adds a context help system, click the 'i' buttons for information about specific controls.
Version 1.0.3 released Nov 2016
- Adds 'Class helper' window - shows you all the classes / id's that exist, and the contents of those, for your starting url or any other page. Plus a preview, highlighting the hovered class / id in red. Thus allowing you to easily configure your output file without ploughing through html source code yourself.
- Important fix - if a class name ended just before the closing angled bracket, with no quotes around it, then that class name would be messed up and following ones may be missed
- Adds option and configurable separator character for separating multiple data in the output file (eg where multiple divs appear on the same page with the same class)
Version 1.0.2 released Oct 2016
- Adds Preference window
- Adds optional progress bar on dock icon
Version 1.0.1 released June 2016
- Adds page count beneath progress bar on crawl tab
- Adds 'Consolidate whitespace' option to export file dialog
Version 1.0 released May 2016