Scrape data or archive content from a website.
WebScraper uses the Integrity v6 Engine to quickly scan a website, and can currently output the data as csv or json. The output can include various metadata, the entire content of each page (as text, html or markdown), data extracted using a regex pattern, and/or the contents of divs, spans, paragraphs or dd elements selected by class or id.
WebScraper is new. Please use it for free and please get in touch with any requests, bug reports or observations.
- Easy to scan a site - just enter the starting url and press Go
- Easy to export - checkboxes for the columns you want
- Plenty of options / configuration
- New since 1.3 - data extraction by Regex pattern matching
- New since 1.0.3 - a 'helper' window giving you a list of all classes/ids on a page, and a visual tool for choosing the correct div to extract.
- Configuration of various limits on the crawl and the output file size
Current version requires Mac OS X 10.8 or higher
What should I do with the downloaded file?
Open the .dmg file and find the application inside. If you want to keep using WebScraper, drag and drop it into your Applications folder. To keep it in your dock, right-click or click-and-hold on its dock icon and choose 'Keep in dock'.
Developer: Shiela Dixon
Version 1.4.6 released Mar 2017
- Rolls out a fix to the class parser, where an end comment like this -------> would not be recognised as an end comment
- Fixes a problem causing some ids not to be picked up
- Adds recognition of itemprop="" to id & class
Version 1.4.4 / 1.4.5 released Mar 2017
- Corrects the order of control characters when the user had selected the Windows-type line separator (should be CR+LF, was inserting LF+CR)
- Fixes problem preventing user from selecting plain text as a page content option when adding a column to their output file.
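For reference, the Windows-type line separator is a carriage return followed by a line feed, in that order. The following is a minimal Python sketch of writing csv rows with that separator; it is not WebScraper's own code, and the function name and sample rows are made up for illustration.

```python
import csv
import io

CRLF = "\r\n"  # Windows-type separator: CR (0x0D) first, then LF (0x0A)

def write_rows(rows, line_separator=CRLF):
    """Write rows as csv text, ending each row with the given separator."""
    # newline="" stops Python from translating line endings on write
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator=line_separator)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample data
output = write_rows([["url", "title"], ["http://example.com/", "Example"]])
```

Every row in `output` ends with CR then LF; the reversed pair (LF then CR) never appears.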
Version 1.4.3 released Mar 2017
- Important fix to the crawling engine around auto-detection of whether starting url is a page or directory in ambiguous cases (this affects the scope of the scan)
Version 1.4.2 released Feb 2017
- Changes to the output file builder to make that neater and more user-friendly
- Fixes crash/hang when app is unlicensed and fewer than 5 pages have been scanned
- Inherits a fix to the engine, which did not always recognise an end comment that looks like this: -------------->
- Fixes occasional issue with changes to starting url field not being recognised right away
Version 1.4.1 released Jan 2017
- Fixes a bug that sometimes caused data to be missing from the output file when classes were being used.
Version 1.4 released Jan 2017
- Adds a much better output file builder, which lets you select and add columns, then preview the result using quicklook
- Gives columns in the output file human-readable headings rather than WebScraper's internal field names. You can edit these column headings (or keys if you're outputting json)
- If you use the json output, for quicklook to show a preview of the output, you may need a quicklook json plugin such as http://www.sagtau.com/quicklookjson.html
- Adds regex helper (allows you to mess around with your pattern and see the result)
Version 1.3 released Jan 2017
- Adds ability to log into a site and perform the scan as an authenticated user (must be used with caution - only log in as a user with read access - not admin rights to the website.) Don't forget to blacklist urls containing 'logout' or what have you.
- Adds 'crawl blacklist' field (don't follow urls containing...). Can be used with the 'crawl whitelist' (only follow urls containing...) or as an alternative. Can help to limit the crawl to the pages you're interested in.
- Crawl whitelist and blacklist can accept multiple values - separate with a comma (whitespace optional but harmless)
- Adds Regex checkbox / field to output options. If enabled, the whole source of each page is checked and matches are included in the output file. 'Capture groups' are respected: if the expression contains groups, only the captured groups are concatenated and included in the output file; if it contains no groups, the whole match is used.
- Adds low disk space detection - offers to stop or continue before space (on the system disk '/' ) becomes critical
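The capture-group rule for the Regex field can be sketched in Python as follows. This is an illustration of the behaviour described above, not WebScraper's implementation: the function name and the sample text are hypothetical, and joining the groups with no delimiter is an assumption.

```python
import re

def extract_matches(page_source, pattern):
    """Return one string per match: the joined capture groups,
    or the whole match when the pattern contains no groups."""
    compiled = re.compile(pattern)
    results = []
    for match in compiled.finditer(page_source):
        if compiled.groups:
            # Assumption: groups are concatenated with no delimiter;
            # optional groups that did not take part count as empty
            results.append("".join(group or "" for group in match.groups()))
        else:
            results.append(match.group(0))
    return results

# Hypothetical page text
source = "Price: 12.50 Price: 3.99"
with_groups = extract_matches(source, r"(\d+)\.(\d+)")   # groups only
without_groups = extract_matches(source, r"\d+\.\d+")    # whole match
```

Here `with_groups` holds only the digits on either side of the decimal point, while `without_groups` holds each price in full.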
Version 1.2 (no longer beta) released Jan 2017
- Improves class helper - it now works more smoothly
- Adds purchase price of $5, 30 day trial period, limits output file to 10 rows while in trial mode
Version 1.1 (still beta) released Dec 2016
- Now can extract data within <dd> and <p> (as long as the tags have a class or id)
- Allows you to add multiple classes or ids to your output file. In the 'class or id' box, type the class / id names, separated by comma
- Allows you to limit your scan by whitelisting, i.e. if you type a partial url into the whitelist box, then after the homepage, WebScraper will only follow links matching your whitelist term. This allows for some limited pattern matching.
- Adds a context help system: click the 'i' buttons for information about specific controls.
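The whitelist rule above, together with the blacklist and comma-separated values added in version 1.3, can be pictured with a short Python sketch. It assumes plain substring matching ("urls containing...") and assumes that a blacklist match wins over a whitelist match; neither detail is confirmed here, and the function names are hypothetical.

```python
def parse_terms(field_text):
    """Split a comma-separated field into terms, dropping surrounding
    whitespace and empty entries."""
    return [term.strip() for term in field_text.split(",") if term.strip()]

def should_follow(url, whitelist="", blacklist=""):
    """Decide whether a crawler would follow this url."""
    allow = parse_terms(whitelist)
    deny = parse_terms(blacklist)
    # Assumption: a blacklist hit takes precedence over the whitelist
    if any(term in url for term in deny):
        return False
    # An empty whitelist places no restriction on the crawl
    if allow and not any(term in url for term in allow):
        return False
    return True
```

For example, `should_follow("http://example.com/blog/logout", whitelist="/blog/", blacklist="logout")` returns `False`, which is the effect you want when scanning as an authenticated user.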
Version 1.0.3 released Nov 2016
- Adds 'Class helper' window - shows you all the classes / ids that exist, and the contents of those, for your starting url or any other page. It also shows a preview, highlighting the hovered class / id in red. This lets you configure your output file easily without ploughing through html source code yourself.
- Important fix - if a class name ended just before the closing angle bracket, with no quotes around it, then that class name would be messed up and following ones may be missed
- Adds option and configurable separator character for separating multiple data in the output file (eg where multiple divs appear on the same page with the same class)
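As an illustration of the separator option, the sketch below collects the text of every element with a given class or id and joins the results with a configurable separator. It uses Python's standard `html.parser` and is only a rough model under stated assumptions: the class and function names are hypothetical, the default separator is arbitrary, and the depth counting expects reasonably well-formed html.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying the target class or id."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # nesting depth inside a matching element (0 = outside)
        self.current = []   # text pieces of the element being collected
        self.values = []    # one entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.target in classes or attributes.get("id") == self.target:
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:  # the matching element has just closed
            self.values.append("".join(self.current).strip())

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)

def extract_joined(html, target, separator="|"):
    """Join the text of all matching elements with the separator."""
    parser = ClassTextExtractor(target)
    parser.feed(html)
    parser.close()
    return separator.join(parser.values)

# Hypothetical page fragment with two divs sharing a class
page = '<div class="price">10</div><p>skip</p><div class="price sale">8 <b>now</b></div>'
joined = extract_joined(page, "price")
```

With the sample fragment, both `price` divs end up in a single output cell, separated by the chosen character.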
Version 1.0.2 released Oct 2016
- Adds Preference window
- Adds optional progress bar on dock icon
Version 1.0.1 released June 2016
- Adds page count beneath progress bar on crawl tab
- Adds 'Consolidate whitespace' option to export file dialog
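What the 'Consolidate whitespace' option does is not spelled out here. One plausible reading, sketched in Python under that assumption, is that each run of spaces, tabs and line breaks is collapsed into a single space; the function name and sample string are hypothetical.

```python
import re

def consolidate_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical scraped text with stray line breaks and tabs
cleaned = consolidate_whitespace("  Product\n\n   name:\t Widget  ")
```

This kind of clean-up keeps each value on one line, which is convenient in a csv cell.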
Version 1.0 released May 2016