The Who, the What, and the “With What” of Web Scraping

Martin Magdinier  |  12 March 2020

Data is the new differentiator. It’s what you, a product owner, a marketing strategist, your local journalist, and a multimillionaire who already owns twelve successful companies all need. And web scraping is one way to get that data.

But where to start? Sure, the Internet will give you everything you need to know. Soon, you’ll come across lists of the best tools available, each with a name that will never make the list for best marketing decision of the year: Octoparse, Scrapy, BeautifulSoup, ParseHub, Mozenda … But how to choose? Looking at the description and the ratings is a good system when you’re buying shoes, but not when you’re trying to find the best scraper for your project.

Think about it: if you’re about to send something out there on the web to gather the reliable data you need, you have to make sure that the tool you’re using is the best for your project and your specific goals. A pair of flip-flops, no matter how good the ratings, won’t get you far if you’re visiting Norway in December. And that’s why we put so much energy into helping our clients find the right tools for their projects. Over the years, we’ve tested many of them, and you can now benefit from our experience. Here’s what we know.

The crawler crawling and the scraper scraping

Web Scraping 101

Web scraping is the process of fetching and extracting data from websites and downloading it in a usable format. It’s also referred to as “web harvesting,” “crawling,” “spidering,” and “web data extraction.” The process usually involves a scraper, the tool designed to extract the data from webpages, and a crawler, or spider, whose job is to browse the Internet to index and search for relevant content. Web scraping saves you the trouble of manually searching, downloading, and copying the data you need, and will work regardless of format. And by gathering big sets of data, you can help your organization grow by creating new products and innovating faster.
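
To make the extraction step concrete, here is a minimal sketch using only Python’s standard library. The sample page and the names `SAMPLE_HTML`, `LinkExtractor`, and `extract_links` are hypothetical and exist only for illustration; a real scraper would first download the page over HTTP, and the tools discussed later in this article hide most of this work.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Products</h1>
  <a href="/shoes">Shoes</a>
  <a href="/boots">Boots</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Return the list of (href, text) pairs found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

A crawler would then follow the extracted links and repeat the process on each new page.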

What Is Web Scraping For?

Web scraping is used across an extensive range of industries, including risk management, retail, finance, sales and marketing, insurance, artificial intelligence, and journalism, to name only a few.

Web scraping can help you:

  • Gather data on your competitors and their products
  • Optimize your customer relationship management
  • Detect fraud
  • Feed a natural language system
  • Train machine learning models
  • Generate more and better leads
  • Perform reliable competitive analysis
  • Monitor your reputation
  • And so much more

Web scraping has opened the door to big data in a world where every market research and business strategy relies on it. Think about it: all your strategies, plans, and insights into the future rely on data. Web scraping offers the privilege of non-discrimination.

As long as you adhere to the standards in place and use the right system with enough data warehousing capacity, there are no limits to the amount of data you can collect.
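
One standard a scraper is commonly expected to respect is a site’s robots.txt file. The article does not say which standards it has in mind, so treat this as an assumption. The sketch below checks whether a URL may be fetched, using Python’s standard `urllib.robotparser`; the robots.txt content, the `my-scraper` user agent, and the `is_allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at the site's /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/report"))
```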

But whether you’re trying to figure out how your consumers feel about your product or trying to gather data from hundreds of websites in real-time, you’ll need different tools and capacities. The first step of every web scraping project is to decide whether you’d like to take care of everything internally or with a partner, like us. And that’s only the first question of a long list of things you need to figure out before you decide on a scraper.

Our 10 Criteria to Evaluate Web Scraping Software

When the time comes to choose the right scraper, you should base your decision on specific criteria, and not on benchmarks. These criteria are largely defined by your project and capacities, and by your choice of externalizing or keeping everything in-house. To make the comparison easier, we prepared a ready-to-fill PDF form to evaluate three web scraping solutions.

1. The solution maturity. You want to invest in a technology that will be around in the long run and is actively supported. Check when the software was initially released and how often it has been updated. Make sure documentation and customer support are available. Also decide whether you want an open source or a proprietary solution.

2. The development environment. For example, Windows, Mac, or Linux for desktop software, or browser-based for SaaS. (SaaS stands for “Software as a Service,” a software model licensed on a subscription basis and hosted by a third-party provider.) Also take into account how complex the software is and what skills you need to write a scraper. How fast can you ramp up a new employee?

3. The execution platform and hosting options. Once the project is developed, where can you execute it? Options include public, private, or hybrid clouds. The main question here is: do you want to rely on a third-party infrastructure to collect critical data, or do you need to keep everything in-house?

4. The possibility to circumvent CAPTCHA. That can be done using a third-party service like 2Captcha.

5. The ability to fine-tune proxy management and rotation. This could, for example, let you select the countries from which requests will come and use residential IP addresses. A good web scraper lets you connect with third-party providers like Bright Data (formerly Luminati).

6. The ability to handle advanced anti-scraping features, including device fingerprint anonymization and fine-tuned browser/profile management.

7. The capacity to add custom scripts. By adding new pieces of code, we can extend the software capabilities.

8. API and Workflow integration. How easily can you integrate your web scraping project into your workflow? Check for an API or external connectors that let you configure the project and retrieve the data.

9. Scheduling, monitoring, and maintaining. These are crucial to any web scraping project, so your tool should allow you to perform all three according to your needs.

10. Pricing. At the end of the day, we all have a budget to respect.
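
As an illustration of criterion 5, proxy rotation with country selection can be sketched in a few lines. This is a simplified round-robin example, not how any particular tool or provider implements it; the proxy addresses, `PROXIES_BY_COUNTRY`, and `ProxyRotator` are hypothetical.

```python
from itertools import cycle

# Hypothetical proxy endpoints grouped by country; a provider would supply real ones.
PROXIES_BY_COUNTRY = {
    "ca": ["http://proxy-ca-1.example:8080", "http://proxy-ca-2.example:8080"],
    "no": ["http://proxy-no-1.example:8080"],
}


class ProxyRotator:
    """Hands out the proxies configured for one country in round-robin order."""

    def __init__(self, proxies_by_country, country):
        proxies = proxies_by_country.get(country)
        if not proxies:
            raise ValueError(f"no proxies configured for country {country!r}")
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        return next(self._pool)


rotator = ProxyRotator(PROXIES_BY_COUNTRY, "ca")
for _ in range(3):
    print(rotator.next_proxy())
```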

So, ultimately, it’s your project, and most importantly, its schedule, scale, and eventually budget, that will determine what tools you should use.
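
One way to turn the ten criteria into a comparison is a weighted score, where the weights reflect what matters most to your project. The sketch below is only an illustration of that idea: the 1-to-5 scale, the weights, and the “Tool A” and “Tool B” scores are placeholders, not measurements of any real product.

```python
# The ten criteria from the list above, used as dictionary keys.
CRITERIA = [
    "maturity", "development environment", "execution platform", "captcha",
    "proxy management", "anti-scraping", "custom scripts",
    "api and workflow", "scheduling and monitoring", "pricing",
]


def weighted_score(scores, weights):
    """Return the weighted average of the per-criterion scores."""
    missing = [c for c in CRITERIA if c not in scores or c not in weights]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


# Placeholder numbers for illustration only (1 = poor fit, 5 = excellent fit).
weights = {c: 1 for c in CRITERIA}
weights["pricing"] = 3  # e.g. a budget-constrained project

tool_a = {c: 3 for c in CRITERIA}
tool_b = {**tool_a, "pricing": 5}

ranking = sorted(
    {"Tool A": tool_a, "Tool B": tool_b}.items(),
    key=lambda item: weighted_score(item[1], weights),
    reverse=True,
)
for name, scores in ranking:
    print(name, round(weighted_score(scores, weights), 2))
```

Changing the weights to match a different project (for example, raising “anti-scraping” for heavily protected websites) can change the ranking, which is the point of evaluating against your own criteria.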


Web Scrapers: Our Favorites

Once you know more precisely the kind of tool you’re looking for, it’s time to start shopping. You’ll find a plethora of websites listing all the best tools with their pros and cons. The truth is, however, that few of them have tested those tools as much as we have, with projects that differ in goals, scale, and complexity. So, to make it easier for you, we’ve gathered a list of our three favorites, with a short description and comparative chart. If you’d like to know more, or you’re not sure how to decide, contact us!

ParseHub


ParseHub is a point-and-click web scraping software that manages all projects on its own infrastructure. This means that you’re completely dependent on their infrastructure, yet you benefit from not having to make any of the usual set-up investment to provision environments. Their plan includes the maintenance of web scraping servers and proxy networks, preventing unexpected costs as you scale.

ParseHub is a bit more expensive than its counterparts (it does offer a free version for small and simple projects, though), but it’s a great option for easy projects with high volumes of data.

Content Grabber


Content Grabber is a point-and-click web scraping software developed by Sequentum. It provides a robust and scalable solution to collect data from complex websites and offers the advantage of being deployed on-premises, on Windows servers. We can help you manage your project without relying on a third-party vendor. And by having full control over the infrastructure, we can meet the most restrictive data privacy and security requirements.

Puppeteer


Puppeteer is a Node.js library that drives a headless Chrome or Chromium browser through the DevTools Protocol. We call the browser “headless” because it is designed to be used by machines, not humans. It has no user interface, and its main goal is to allow programs to read and interact with it. Like Content Grabber and ParseHub, Puppeteer is well suited to large projects, but it also offers complete control over the solution for complex websites that use advanced anti-scraping features.

But complex features also mean a more complex design, so Puppeteer is not for neophytes. It requires a high level of expertise, and you’ll need a trained developer to create the project and maintain it. But as we’ve mentioned many times before, it’s a mistake to think of web scraping as a product: it’s a service. So, if your project requires complex extraction, make sure to ask an expert like us to develop and schedule your project, but also to maintain and monitor it.

The Comparison

Here’s a slightly more complete table that compares the three tools described above.

1. Solution Maturity
  • ParseHub: Mature proprietary software. Started in 2015. New version released in Fall 2019.
  • Content Grabber: Mature proprietary software. Released in 2015. Follows Visual Web Ripper, released in early 2000.
  • Puppeteer: Mature open source. Started in 2017. Large community support with 250+ developers.

2. Development Environment
  • ParseHub: Easy. Point-and-click software for Mac, Windows, and Linux.
  • Content Grabber: Easy. Point-and-click software for Windows only.
  • Puppeteer: Complex, using a developer environment on Mac, Windows, and Linux.

3. Execution Platform & Hosting
  • ParseHub: ParseHub Platform (SaaS)
  • Content Grabber: Self-hosted Windows server
  • Puppeteer: Self-hosted Linux server

4. Captcha
  • ParseHub: Basic resolution included, possible to connect to third-party
  • Content Grabber: Connect with third-party
  • Puppeteer: Connect with third-party

5. Proxy
  • ParseHub: Basic proxy included, possible to connect to a third-party provider
  • Content Grabber: Connect with third-party
  • Puppeteer: Connect with third-party

6. Anti Scraping
  • ParseHub: Limited
  • Content Grabber: Advanced
  • Puppeteer: Advanced

7. Custom Scripts
  • ParseHub: JavaScript and regular expression to select content only
  • Content Grabber: Extends the software with C#, regular expression, VB, Python 3
  • Puppeteer: JavaScript environment, but everything is code-based!

8. API and Workflow Integration
  • ParseHub: Yes, to orchestrate (start, stop, and pass parameters) and retrieve data
  • Content Grabber: Yes, to orchestrate (start, stop, and pass parameters). Supports data export in multiple formats
  • Puppeteer: No, you need to orchestrate the script and manage your data export yourself

9. Ease to schedule and monitor
  • ParseHub: Easy. Everything is supported by ParseHub.
  • Content Grabber: Medium. You need to provide the infrastructure, then everything is managed via Content Grabber Agent Control Center.
  • Puppeteer: Not available. You need to provide

10. Pricing
  • ParseHub: Self Service. Prices are available online.
  • Content Grabber: Enterprise Solution. Contact Sequentum for a quote.
  • Puppeteer: Free and Open Source.


And With That

Web scraping can be easy. But to be easy, it necessitates a plan and a vast amount of technical know-how. Even when trying to choose the right tool, you should always seek the advice of a professional. Our goal here at RefinePro is to make sure your web scraping project will help you meet your long-term needs and that it will scale with you.

It’s easy to make a bad choice with web scraping and to send your team on a wild goose chase. Using the wrong tool to gather the wrong kinds of data on the wrong website, or using the wrong techniques and getting kicked out, are costly mistakes.

So instead of chasing a wild goose, give us a call or email us to tell us about your project. And if you’re still not convinced you need help, go read our article about the medium- and long-term implications of web scraping. You’ll learn how the launching of a web scraping project is only the beginning of a long story that can only end well if it involves ongoing and constant scheduling, monitoring, and maintenance.

We specialize in data. And we can help you make the best of it. Do you have any questions? Contact us!
