Scrape data or archive content from a website.
- Fast and easy to scan a site
- Plenty of extraction options: various metadata, content (as text, html or markdown), elements with certain classes / ids, regular expressions (see the sketch below this list)
- Easy to export - choose the columns you want
- Output as csv or json
- New options to download all images to a folder / collect and export all links
- New option to output a single text file (designed for archiving text content, markdown or plain text)
- Plenty of options / configuration
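Purely for illustration, here is a minimal Python sketch of the kind of per-page extraction the app automates - using the requests and BeautifulSoup libraries and a hypothetical 'price' class, none of which are the app's own code:
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text
    soup = BeautifulSoup(html, "html.parser")
    row = {
        "title": soup.title.string if soup.title else "",   # metadata
        "text": soup.get_text(" ", strip=True),             # content as plain text
        "price": [el.get_text(strip=True)
                  for el in soup.select(".price")],         # elements with a certain class
    }
    print(row)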
Category: Developer Tools / Web Scrapers
10 USD (One-off purchase)
You can buy from within WebScraper or use this secure link: https://pay.paddle.com/checkout/509735
Current version requires Mac OS 10.8 or higher
What should I do with the downloaded file?
Open the .dmg file and find the application inside. If you want to keep using WebScraper, drag and drop it into your Applications folder. To keep it in your dock, right-click or click-and-hold on its dock icon and choose 'Keep in dock'.
Developer: S P Dixon
Version 4.3.1 released May 2018
- Fixes issue where extraneous dashes could appear within markdown
- Fixes bug causing some text to be missing from markdown immediately following comments in the html
- Takes out some diagnostic messages and some directory structure creation when downloading images to a folder
Version 4.3 released May 2018
- Improves the scan 'blacklist / whitelist rules' UI. Previously these were a couple of fields taking comma-separated lists of terms; now a table lets you create rules with more options and more human-readable wording, such as "ignore urls that contain..." or "only follow urls that don't contain...". Existing rules in your saved projects should appear in the new format. (A rough sketch of the rule logic follows this entry.)
- Now detects the situation where the scan doesn't proceed because the starting url returns a bad status. Previously nothing happened at that point; now a dialog is shown, displaying the starting url and the status returned, and giving some advice.
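Not the app's internals, but the new rule types boil down to simple substring tests; a rough Python sketch, with the rule wording and the '/tag/' term invented for the example:
    def should_follow(url, rules):
        # rules: list of (kind, term) pairs
        for kind, term in rules:
            if kind == "ignore urls that contain" and term in url:
                return False
            if kind == "only follow urls that don't contain" and term in url:
                return False
        return True

    should_follow("https://example.com/tag/news", [("ignore urls that contain", "/tag/")])  # False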
Version 4.2.3 (not released)
- Adds support for the
Version 4.2.2 released May 2018
- Fixes a problem with the scan > 'ignore urls containing' / 'only follow urls containing' settings
Version 4.2.1 released May 2018
- Adds option to ignore <nav> and <header> / <footer> when extracting content as plain text / markdown (defaults for a new project are to *include* the contents of the nav, header and footer; see the sketch after this list)
- Fixes bug in engine relating to the 'output filter' (only scraping data from pages containing or not containing X)
- Slight change to the way that 'ignore urls containing' under scan works. It wasn't ignoring those urls completely, but 'not following' them. In practice this probably makes little difference, but the operation now more accurately matches the wording, and may make the scan more efficient.
- Too much information was being sent to the console in recent version(s); this is tidied up a bit.
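The nav / header / footer option above amounts to dropping those elements before converting; a BeautifulSoup sketch of the concept (not the app's own converter):
    from bs4 import BeautifulSoup

    html = "<body><nav>menu</nav><p>Article text.</p><footer>(c) 2018</footer></body>"
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["nav", "header", "footer"]):
        tag.decompose()                          # remove boilerplate elements
    print(soup.get_text(" ", strip=True))        # -> Article text.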
Version 4.2.0 released May 2018
- Improves the 'output file column builder' table - columns now appear as columns rather than rows as before, so hopefully easier to use. You can drag the columns to re-order them, edit their headings, edit the configuration of a column or delete it.
- Improves the output file filter (used to be called 'information page contains'). This can now be regarded as a 'select where' and allows a number of rules to be set up, AND'd or OR'd. Each rule can be based on a 'contains' partial match or a regex, can be inverted (contains / doesn't contain), and can apply to the url or the entire content. (A sketch of this rule logic follows this entry.)
- Adds a proper links table, this can be used to collect / list all links discovered on the way, and optionally image urls too. This list can be filtered for just links / just images / internal / external / redirected / pdf documents
- Adds capability to easily extract headings (h1-h6) with a particular class / id (previously the class or id method was limited to divs, spans, p's and dd's)
- Alters the 'count' displayed at the right of the address bar. It now literally displays the number of pages scraped, which equals the number of rows in the output table. Previously it was a count of pages discovered, which may not be the same number now that you can set up rules that act as an output filter.
- Fixes recently-introduced bug which prevented your output columns from saving properly in a saved project
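As a sketch of the 'select where' filter described above - the rule shape and names here are illustrative assumptions, not the app's internals:
    import re

    def keep_page(url, content, rules, combine="AND"):
        def passes(rule):
            target, test, pattern = rule          # eg ("url", "contains", "/product/")
            subject = url if target == "url" else content
            if test == "contains":
                return pattern in subject
            if test == "doesn't contain":
                return pattern not in subject
            if test == "regex":
                return re.search(pattern, subject) is not None
        results = [passes(r) for r in rules]
        return all(results) if combine == "AND" else any(results)

    keep_page("https://example.com/product/7", "", [("url", "contains", "/product/")])  # True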
Version 4.1.1 released April 2018
- Allows editing of your table columns (previously, to change anything other than the column heading, it was necessary to delete the row and add a new one).
- Also allows re-ordering of columns, by dragging and dropping in the 'preview' table lower down
- Unifies the helper windows, which are also now available from the View menu. This makes the helper window a potentially useful standalone tool
- Adds 'copy' button to the regex helper, copies the expression to the clipboard so that it can be used in the 'add column' dialog
Version 4.1.0 released April 2018
- Adds capability of downloading images to a folder during the scan. See Complex setup > Output file columns > Also download images to folder.
- Images can optionally be downloaded only if they match a pattern, either a partial url or a regex match (leave the box blank to download all images discovered; sketched after this list)
- Adds option to filter the output file, ie only include data from certain pages (eg information pages or product pages). This is done by matching the url of the page being scraped, either by partial url (eg /product/) or a regex match
- Fixes issue with saving a project (note that saving a project saves only settings and configuration, not data. Save data separately using Export from the Results screen or File > Export)
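A rough Python sketch of that image filter logic - the requests library is assumed, and the blank-means-everything behaviour matches the description above:
    import os, re, requests
    from urllib.parse import urlparse

    def maybe_download(img_url, pattern, folder="images"):
        # blank pattern downloads everything; otherwise a partial url or regex must match
        if pattern and pattern not in img_url and not re.search(pattern, img_url):
            return
        os.makedirs(folder, exist_ok=True)
        name = os.path.basename(urlparse(img_url).path) or "image"
        with open(os.path.join(folder, name), "wb") as f:
            f.write(requests.get(img_url).content)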
Version 4.0.0 released as beta April 2018
- Incorporates the version 8 crawling engine which has many improvements
- Adds 'limit requests to X per minute' control
- Updates pre-defined user-agent strings
Version 3.0.2 released as full release Oct 2017
- Adds 'text' as an option for output file format. This is designed for archiving website content (markdown or plain text) in a single text file
- Fixes some issues with the Markdown conversion - adds options to include images, and include link urls within the markdown.
- Fixes some odd things happening in the new interface with certain 'simple interface' settings
- Adds a few checks and balances to prevent the user from doing anything illogical, like changing the file format and pressing 'export' again without re-scanning (the output file is built while the scan is running, so the format and options can't be changed afterwards)
Version 3.0.1 released as beta Oct 2017
- Fixes a couple of issues with the new interface which might have given unexpected results
Version 3.0 released as beta Oct 2017
- Improves the interface, making it 'Integrity-like'
- Removes the 'preview' and puts the results in a table within the app, with an export button
- Adds a 'simple setup' for very quick and easy grab of a single item from each page. For more columns in the output you can switch to 'complex setup' for the column selection options as per version 2.
Version 2.1.0 released Sep 2017
- Adds option to open a local text file containing a list of urls to be scraped (or a csv with a single column containing the urls; see the sketch after this list)
- Adds File > Open List of Links for this purpose
- Updates the engine to inherit many improvements and fixes from the Integrity crawler
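A minimal sketch of reading such a url list in Python, assuming one url per line or a single-column csv as described:
    import csv

    def load_urls(path):
        with open(path, newline="") as f:
            if path.endswith(".csv"):
                return [row[0] for row in csv.reader(f) if row]   # single-column csv
            return [line.strip() for line in f if line.strip()]   # plain text, one url per line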
Version 2.0.4 released May 2017
- Fixes a problem causing regex extraction and helper to fail if page uses ISO-8859-1 encoding
Version 2.0.3 released Apr 2017
- Fixes problem causing a scan to continue from the previous scan rather than starting a new one, if the previous scan had been paused and then a new project was opened or started
- Fixes a problem with scanning locally (file://) - scan was extending beyond the local files if there were external links on the pages
- Fixes helpers not working with local html files
- Fixes problem with the class helper: if the class / id list was too long and needed scrolling, the 'hover to highlight' functionality wasn't working beyond the area originally visible
- Live view also greys out external links - they won't be scraped but they are in the live view table because they're part of the crawling process.
- Adds field "information page" alongside blacklist and whitelist fields. This is useful if your information appears on detail pages which can be identified with a partial url (/mip/ in the case of yellowpages.com) and you want to collect information from such pages, but not parse the links on those pages.
Version 2.0.2 released Apr 2017
- Adds option to split multiple values in csv column - the row is multiplied to show each separate value in a separate row
- Adds option to ignore session id - this is important when you want to scan a site where parameters in the querystring are important, but the querystring contains a session id which may change as the scan progresses, preventing the scan from finishing (see the sketch after this list)
- Adds display of number of pages scanned to the 'progress bar' field while the scan is running
- Fixes bug messing up output file if regular expression contains quotes
- Crawl is paused if the user uses the breadcrumb widget to navigate away from the crawl screen
- Preview button is disabled while a scan is running - pressing it mid-scan could cause chaotic results
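Conceptually, ignoring a session id means treating urls as equal once that one volatile parameter is removed; a Python sketch, with 'PHPSESSID' as an example parameter name:
    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    def strip_session_id(url, param="PHPSESSID"):
        # keep the meaningful querystring parameters, drop the volatile session id
        parts = urlparse(url)
        query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
        return urlunparse(parts._replace(query=urlencode(query)))

    strip_session_id("https://example.com/p?id=7&PHPSESSID=abc123")  # -> https://example.com/p?id=7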
Version 2.0.1 released Apr 2017
- Out of beta; adds 30-day trial period and purchase options.
- Fixes login window (for authentication)
- Some small fixes
Version 2.0 (free beta) released Apr 2017
- Improves navigation, logical progression of screens with breadcrumb widget
- Adds save / load project (saves settings / configuration, not data)
- Changes logical flow of program - instead of performing the scan (storing the data) and then generating the output file, the output file is configured, and then the scan is performed.
- pro: scan is far more efficient (larger sites can be scanned)
- con: if you want to reconfigure your output file, another scan must be performed
- Live view (for debugging / helping with configuration) now clearly shows which urls it will include in the output file, and which meet the blacklist rules / don't meet the whitelist rules
- Fixes problem with some classes not being recognised
- Improves class helper - improves red box highlighting where itemprop is being used within div, span etc
- New icon for application and document icon for saved projects
- Many small fixes and improvements
Version 1.4.6 released Mar 2017
- Rolls out fix to the class parser where an end comment like this -------> would not be recognised as an end comment
- Fixes problem causing some id's not to be picked up
- Adds recognition of itemprop="" to the id & class feature
Version 1.4.4 / 1.4.5 released Mar 2017
- Corrects an incorrect combination of control characters: if the user had selected the Windows-type line separator (CR+LF), LF+CR was being inserted instead (see the note after this list)
- Fixes problem preventing user from selecting plain text as a page content option when adding a column to their output file.
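For reference, the Windows line separator is CR then LF (0x0D then 0x0A), so a writer should emit \r\n and not \n\r; in Python terms:
    rows = ["name,price", "Widget,5"]
    windows_text = "\r\n".join(rows)              # CR+LF, the correct order
    assert windows_text == "name,price\r\nWidget,5"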
Version 1.4.3 released Mar 2017
- Important fix to the crawling engine around auto-detection of whether the starting url is a page or a directory in ambiguous cases (this affects the scope of the scan)
Version 1.4.2 released Feb 2017
- Changes to the output file builder to make that neater and more user-friendly
- Fixes crash/hang when app is unlicensed and fewer than 5 pages have been scanned
- Inherits a fix to the engine, which was not always recognising an end comment that looks like this: -------------->
- Fixes occasional issue with changes to starting url field not being recognised right away
Version 1.4.1 released Jan 2017
- Fixes bug which sometimes caused some data not to appear in the output file where classes are being used.
Version 1.4 released Jan 2017
- Adds much better output file builder, allows user to select and add columns, then preview the result using quicklook
- Gives columns in the output file human-readable column headings rather than WebScraper internal field names. You can edit these column headings (or keys if you're outputting json)
- If you use the json output, for quicklook to show a preview of the output, you may need a quicklook json plugin such as http://www.sagtau.com/quicklookjson.html
- Adds regex helper (allows you to mess around with your pattern and see the result)
Version 1.3 released Jan 2017
- Adds ability to log into a site and perform the scan as an authenticated user (must be used with caution - only log in as a user with read access - not admin rights to the website.) Don't forget to blacklist urls containing 'logout' or what have you.
- Adds 'crawl blacklist' field (don't follow urls containing...). Can be used with the 'crawl whitelist' (only follow urls containing...) or as an alternative. Can help to limit the crawl to the pages you're interested in.
- Crawl whitelist and blacklist can accept multiple values - separate with a comma (whitespace optional but harmless)
- Adds Regex checkbox / field to output options. If brought into play, the whole source of each page is checked and matches are included in the output file. 'Capture groups' are respected: only the capture groups are concatenated and included in the output file, unless no groups are included in the expression, in which case the whole match is used. (See the sketch after this list.)
- Adds low disk space detection - offers to stop or continue before space (on the system disk '/' ) becomes critical
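Python's re.findall happens to show the same group-versus-whole-match behaviour described above (a sketch, not the app's engine; the html snippet is invented):
    import re

    html = '<span class="price">$5.99</span> <span class="price">$7.50</span>'

    # with a capture group, only the group contents are kept...
    re.findall(r'class="price">\$([0-9.]+)<', html)   # ['5.99', '7.50']

    # ...with no group, the whole match is used
    re.findall(r'\$[0-9.]+', html)                    # ['$5.99', '$7.50']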
Version 1.2 (no longer beta) released Jan 2017
- Improves class helper - it now works more smoothly
- Adds purchase price of $5 and a 30-day trial period; limits output file to 10 rows while in trial mode
Version 1.1 (still beta) released Dec 2016
- Now can extract data within <dd> and <p> (as long as the tags have a class or id)
- Allows you to add multiple classes or ids to your output file. In the 'class or id' box, type the class / id names separated by commas
- Allows you to limit your scan by whitelisting: if you type a partial url into the whitelist box then, beyond the homepage, WebScraper will only follow links matching your whitelist term. This allows for some limited pattern matching.
- Adds a context help system, click the 'i' buttons for information about specific controls.
Version 1.0.3 released Nov 2016
- Adds 'Class helper' window - shows you all the classes / id's that exist, and their contents, for your starting url or any other page, plus a preview that highlights the hovered class / id in red. This allows you to easily configure your output file without ploughing through html source code yourself. (A sketch of the idea follows this list.)
- Important fix - if a class name ended just before the closing angled bracket, with no quotes around it, then that class name would be messed up and following ones might be missed
- Adds option and configurable separator character for separating multiple data in the output file (eg where multiple divs appear on the same page with the same class)
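The Class helper's listing amounts to enumerating every element that carries a class or id; a BeautifulSoup sketch of the idea (the html is invented):
    from bs4 import BeautifulSoup

    html = '<div class="bio">Jo Bloggs</div><p id="tel">555-0100</p>'
    soup = BeautifulSoup(html, "html.parser")
    for el in soup.find_all(attrs={"class": True}):
        print("class:", " ".join(el["class"]), "->", el.get_text(strip=True))
    for el in soup.find_all(attrs={"id": True}):
        print("id:", el["id"], "->", el.get_text(strip=True))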
Version 1.0.2 released Oct 2016
- Adds Preference window
- Adds optional progress bar on dock icon
Version 1.0.1 released June 2016
- Adds page count beneath progress bar on crawl tab
- Adds 'Consolidate whitespace' option to export file dialog
Version 1.0 released May 2016