Scrape data or archive content from a website.
WebScraper uses the Integrity v6 Engine to scan a website quickly, and can currently output the data as csv or json. The output can include various metadata, the entire content of each page (as text, html or markdown), data extracted using a regex pattern, and/or the contents of divs, spans, paras or dd's selected by class or id.
WebScraper is new. Please use it for free, and please get in touch with any requests, bug reports or observations.
- Easy to scan a site - just enter the starting url and press Go
- Easy to export - choose the columns you want
- Plenty of options / configuration
- Plenty of extraction options, including html elements with certain classes or ids, regular expressions, or the entire page content in a number of formats
- Configuration of various limits on the crawl and the output file size
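As an illustration of the class/id extraction idea, here is a minimal stand-alone sketch using only Python's standard library. This is not WebScraper's own code; the class name and sample html are invented for the example.

```python
# Illustrative sketch: collect the text of every <div> carrying a given
# class, using only Python's standard-library HTML parser.
from html.parser import HTMLParser

class ClassExtractor(HTMLParser):
    """Collects the text content of <div> elements with `target_class`."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # >0 while inside a matching <div>
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1   # track nested divs so we close correctly
            return
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1
                self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data

# Made-up sample page fragment:
html = '<div class="price">$5</div><div class="note">beta</div><div class="price">$9</div>'
p = ClassExtractor("price")
p.feed(html)
print(p.matches)  # ['$5', '$9']
```

A real scraper would fetch each page over HTTP and feed the response body into a parser like this, once per configured class or id.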
Current version requires Mac OS X 10.8 or higher
What should I do with the downloaded file?
Open the .dmg file and find the application inside. If you want to keep using WebScraper, drag and drop it into your Applications folder. To keep it in your dock, right-click or click-and-hold on its dock icon and choose 'Keep in dock'.
Developer: Shiela Dixon
Version 2.0.3 released Apr 2017
- Fixes a problem where, if a scan was paused and a new project was then opened or started, the next scan continued the previous crawl rather than starting a new one
- Fixes a problem with scanning locally (file://) - the scan was extending beyond the local files if the pages contained external links
- Fixes helpers not working with local html files
- Fixes a problem with the class helper: if the class/id list was too long and needed scrolling, the 'hover to highlight' functionality wasn't working beyond the area originally visible
- Live view also greys out external links - they won't be scraped, but they appear in the live view table because they're part of the crawling process.
- Adds an "information page" field alongside the blacklist and whitelist fields. This is useful if your information appears on detail pages which can be identified by a partial url (/mip/ in the case of yellowpages.com) and you want to collect information from such pages, but not parse the links on those pages.
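How the blacklist, whitelist and "information page" rules might interact for a single url can be pictured with a small sketch. This is hypothetical: the function name and exact precedence are assumptions for illustration, not WebScraper's implementation.

```python
# Hypothetical crawl decision for one url: returns (scrape, follow_links).
def crawl_decision(url, whitelist=(), blacklist=(), info_pages=()):
    if any(term in url for term in blacklist):
        return (False, False)        # blacklisted: skip entirely
    if whitelist and not any(term in url for term in whitelist):
        return (False, False)        # doesn't match the whitelist: skip
    if any(term in url for term in info_pages):
        return (True, False)         # detail page: collect data, don't crawl its links
    return (True, True)              # normal page: collect data and crawl

print(crawl_decision("https://example.com/mip/acme-plumbing",
                     info_pages=("/mip/",)))   # (True, False)
print(crawl_decision("https://example.com/about"))  # (True, True)
```

The point of the "information page" rule is the middle case: the page's data is kept, but its outgoing links don't extend the crawl.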
Version 2.0.2 released Apr 2017
- Adds an option to split multiple values in a csv column - the row is duplicated so that each value appears in a separate row
- Adds an option to ignore the session id - important when you want to scan a site where the parameters in the querystring are important, but they include a session id which may change as the scan progresses, preventing the scan from ever finishing
- Adds display of number of pages scanned to the 'progress bar' field while the scan is running
- Fixes a bug that corrupted the output file if the regular expression contained quotes
- The crawl is now paused if the user uses the breadcrumb widget to navigate away from the crawl screen
- The Preview button is disabled while a scan is running - previewing mid-scan could produce chaotic results
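The "ignore session id" idea above amounts to treating two urls as the same page when they differ only in a session parameter. A sketch, assuming purely for illustration that the session parameter is named "sid":

```python
# Canonicalise a url by dropping an assumed session-id query parameter,
# so a crawler doesn't treat the same page as endlessly new.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical(url, session_param="sid"):
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != session_param]
    return urlunsplit(parts._replace(query=urlencode(query)))

a = canonical("https://example.com/item?id=42&sid=abc123")
b = canonical("https://example.com/item?id=42&sid=zzz999")
print(a == b)  # True - same page once the session id is ignored
```

The meaningful parameter (id=42 here) survives, so sites where the querystring matters can still be crawled correctly.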
Version 2.0.1 released Apr 2017
- Out of beta, adds 30 day trial period and purchase options.
- Fixes login window (for authentication)
- Some small fixes
Version 2.0 (free beta) released Apr 2017
- Improves navigation, logical progression of screens with breadcrumb widget
- Adds save / load project (saves settings / configuration, not data)
- Changes the logical flow of the program - instead of performing the scan (storing the data) and then generating the output file, the output file is configured first and then the scan is performed.
- pro: scan is far more efficient (larger sites can be scanned)
- con: if you want to reconfigure your output file, another scan must be performed
- Live view (for debugging / helping with configuration) now clearly shows which urls will be included in the output file, and which meet the blacklist rules / don't meet the whitelist rules
- Fixes problem with some classes not being recognised
- Improves the class helper - better red-box highlighting where itemprop is being used within div, span etc
- New icon for application and document icon for saved projects
- Many small fixes and improvements
Version 1.4.6 released Mar 2017
- Rolls out a fix to the class parser: an end comment like this -------> was not being recognised as an end comment
- Fixes problem causing some id's not to be picked up
- Adds recognition of itemprop="" to id & class
Version 1.4.4 / 1.4.5 released Mar 2017
- Corrects an incorrect combination of control characters: if the user had selected the Windows-type line separator (CR+LF), LF+CR was being inserted instead
- Fixes problem preventing user from selecting plain text as a page content option when adding a column to their output file.
Version 1.4.3 released Mar 2017
- Important fix to the crawling engine around auto-detection of whether starting url is a page or directory in ambiguous cases (this affects the scope of the scan)
Version 1.4.2 released Feb 2017
- Changes to the output file builder to make that neater and more user-friendly
- Fixes crash/hang when app is unlicensed and fewer than 5 pages have been scanned
- Inherits a fix to the engine, which wasn't always recognising an end comment that looks like this: -------------->
- Fixes occasional issue with changes to starting url field not being recognised right away
Version 1.4.1 released Jan 2017
- Fixes a bug that sometimes caused data to be missing from the output file when classes were being used.
Version 1.4 released Jan 2017
- Adds a much better output file builder - allows the user to select and add columns, then preview the result using quicklook
- Gives columns in the output file human-readable column headings rather than WebScraper's internal field names. You can edit these column headings (or keys if you're outputting json)
- If you use the json output, for quicklook to show a preview of the output, you may need a quicklook json plugin such as http://www.sagtau.com/quicklookjson.html
- Adds regex helper (allows you to mess around with your pattern and see the result)
Version 1.3 released Jan 2017
- Adds ability to log into a site and perform the scan as an authenticated user (must be used with caution - only log in as a user with read access - not admin rights to the website.) Don't forget to blacklist urls containing 'logout' or what have you.
- Adds 'crawl blacklist' field (don't follow urls containing...). Can be used with the 'crawl whitelist' (only follow urls containing...) or as an alternative. Can help to limit the crawl to the pages you're interested in.
- Crawl whitelist and blacklist can accept multiple values - separate with a comma (whitespace optional but harmless)
- Adds a Regex checkbox / field to the output options. If brought into play, the whole source of each page is checked and matches are included in the output file. 'Capture groups' are respected: only the capture groups are concatenated and included in the output file, unless the expression contains no groups, in which case the whole match is used.
- Adds low disk space detection - offers to stop or continue before space (on the system disk '/' ) becomes critical
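The capture-group rule described above can be sketched as follows. This is illustrative Python, not the app's code; the sample pattern and page text are made up.

```python
# For each match: if the pattern has capture groups, concatenate the groups;
# otherwise keep the whole match.
import re

def extract(pattern, source):
    results = []
    for m in re.finditer(pattern, source):
        if m.groups():
            results.append("".join(g or "" for g in m.groups()))
        else:
            results.append(m.group(0))
    return results

page = "Price: $5.00 ... Price: $9.50"
print(extract(r"Price: \$(\d+)\.(\d+)", page))  # ['500', '950']
print(extract(r"\$\d+\.\d+", page))             # ['$5.00', '$9.50']
```

With groups, only the bracketed parts survive; without them, the full matched text is used.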
Version 1.2 (no longer beta) released Jan 2017
- Improves class helper - it now works more smoothly
- Adds purchase price of $5, 30 day trial period, limits output file to 10 rows while in trial mode
Version 1.1 (still beta) released Dec 2016
- Can now extract data within <dd> and <p> tags (as long as the tags have a class or id)
- Allows you to add multiple classes or ids to your output file. In the 'class or id' box, type the class / id names, separated by comma
- Allows you to limit your scan by whitelisting: if you type a partial url into the whitelist box, then after the homepage WebScraper will only follow links matching your whitelist term. This allows for some limited pattern matching.
- Adds a context help system, click the 'i' buttons for information about specific controls.
Version 1.0.3 released Nov 2016
- Adds a 'Class helper' window - shows you all the classes / id's that exist, and their contents, for your starting url or any other page, plus a preview highlighting the hovered class / id in red. This lets you configure your output file easily without ploughing through html source code yourself.
- Important fix - if a class name ended just before the closing angled bracket, with no quotes around it, that class name would be mangled and following ones could be missed
- Adds option and configurable separator character for separating multiple data in the output file (eg where multiple divs appear on the same page with the same class)
Version 1.0.2 released Oct 2016
- Adds Preference window
- Adds optional progress bar on dock icon
Version 1.0.1 released June 2016
- Adds page count beneath progress bar on crawl tab
- Adds 'Consolidate whitespace' option to export file dialog
Version 1.0 released May 2016