Martin Tapia · About · Blog

We’re making Web scraping so easy that you’re going to love it

published Mon Aug 28 2017

Introducing our 3 steps theory

For a few years now we’ve been extracting data from the Web for clients. If we take a look back at the scraping code we wrote, a simple pattern emerges. It applies to any scraping job.

The 3 steps of all scraping scripts:

Step 1: Do actions that will get us closer to the data. There are only 6 possible choices: open a page, fill a form, simulate user input (mouse/keyboard event), scroll, set a cookie and trigger a DOM event.
Step 2: Wait for the actions to have an effect. There are only 2 possible things we can do: wait for specific DOM elements to appear or disappear.
Step 3: Extract data from the page. This step became very easy once we understood that running a few lines of jQuery directly inside the DOM works every time.

In some cases, we have to repeat steps 1 and 2 a few times to get to the data. After step 3 we have to decide if the job is done or if we go back to step 1. That’s it. I’ve just described 100% of what Web scraping is.
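To make the pattern concrete, here is a minimal sketch of the 3 steps written as a loop. It is an illustration only: runJob, act, waitForEffect, extract and nextActions are names made up for this example, and a fake two-page “website” stands in for a real browser so the snippet runs on its own in Node.js.

```javascript
// Hypothetical sketch of the 3 steps. None of these names come from a real library.
async function runJob(page, firstActions, extract, nextActions) {
  const results = []
  let actions = firstActions
  while (actions) {
    // Steps 1 and 2 can repeat a few times before the data is reachable.
    for (const { act, waitForEffect } of actions) {
      await act(page)           // Step 1: do an action that gets us closer to the data
      await waitForEffect(page) // Step 2: wait for the action to have an effect
    }
    results.push(...await extract(page)) // Step 3: extract data from the page
    // Decide whether the job is done (null) or whether we go back to step 1.
    actions = nextActions(page)
  }
  return results
}

// A fake two-page "website" standing in for a real browser tab.
const fakePage = { index: -1, pages: [["a", "b"], ["c"]], loaded: false }
const pageLoaded = async (page) => {
  if (!page.loaded) throw new Error("timeout")
}
const openFirstPage = {
  act: async (page) => { page.index = 0; page.loaded = true },
  waitForEffect: pageLoaded,
}
const goToNextPage = {
  act: async (page) => { page.index += 1 },
  waitForEffect: pageLoaded,
}

const job = runJob(
  fakePage,
  [openFirstPage],
  async (page) => page.pages[page.index],
  (page) => (page.index < page.pages.length - 1 ? [goToNextPage] : null)
)
job.then((results) => console.log(results)) // [ 'a', 'b', 'c' ]
```

The only decision the loop makes is in nextActions: return more actions to go back to step 1, or null when the job is done.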

Simple, right?

The problem with Scrapy, PhantomJS/CasperJS, Selenium/WebDriver and plain HTTP requests

It’s surprisingly hard today to execute our 3 scraping steps. None of the tools mentioned above let you get data without headaches.

What’s funny is that running the 3 steps hassle-free only requires 2 things:

Support for all websites, including SPAs and complex JavaScript-only websites.
A set of commands as simple as our 3 steps. Seriously, we don’t need to set the expiration date of our cookies. Less is more.

No library exists today that fulfills both requirements. Some satisfy the first one, but none satisfies both.

Until now.

Here comes NickJS

NickJS is our attempt at making scraping easy. It’s an open source JavaScript library.

First of all, we support all websites (requirement 1) because we use Headless Chrome (but you can also use our PhantomJS driver if you prefer).

As for the simple commands (requirement 2), here’s what we expose:

Do actions that will get us closer to the data. There are 6 possible choices: open(), fill(), sendKeys(), scroll(), setCookie() and evaluate() (the last one can trigger any kind of DOM event). We also added click() because it came up so often.
Wait for the actions to have an effect. There are 2 possible choices: waitUntilVisible("#example1") and waitWhileVisible("p").
Extract the data. Just call evaluate() and execute some jQuery inside the page. If the page doesn’t come with jQuery, simply call inject() with a URL to jQuery’s CDN and you’re all set.

That’s it. Nothing gets in your way. The 10 methods I just mentioned are all that’s needed to run our 3 steps.
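As a rough sketch of how steps 1 and 2 alternate with these commands, consider submitting a search form. The method names are the ones listed above, but the argument shapes used here (a selector plus an object of field values for fill(), a single selector for the others) are assumptions, as are the example.com URL and every selector. A small stub tab that only records calls replaces the real browser so the snippet runs on its own.

```javascript
// Sketch only: method names come from the list above, argument shapes are assumed.
async function search(tab, query) {
  await tab.open("https://example.com/search") // step 1: open a page
  await tab.waitUntilVisible("form#search")    // step 2: wait for the form to appear
  await tab.fill("form#search", { q: query })  // step 1 again: fill the form
  await tab.click("form#search [type=submit]") // step 1 again: click the submit button
  await tab.waitWhileVisible(".loading")       // step 2: wait for the spinner to disappear
  await tab.waitUntilVisible(".results")       // step 2: wait for the results to appear
}

// Hypothetical stand-in for a real tab: every method just records its own name.
const calls = []
const stubTab = {}
for (const name of ["open", "fill", "click", "waitUntilVisible", "waitWhileVisible"]) {
  stubTab[name] = async (...args) => { calls.push(name) }
}

const searchJob = search(stubTab, "nickjs")
searchJob.then(() => console.log(calls.join(", ")))
// open, waitUntilVisible, fill, click, waitWhileVisible, waitUntilVisible
```

Every line is either an action or a wait, which is the whole point of keeping the command set this small.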

NickJS in practice: the Hacker News homepage example

Yes, we know they have an API. It’s just for the example.

The code is pretty much self-explanatory. As you can see, we instantiate a tab and call the “3 steps” methods on the tab itself.
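A minimal sketch of such a script could look like the following. Treat it as an approximation under stated assumptions: we assume evaluate() hands the in-page function an argument and a Node-style callback, that the #hnmain, .athing and .titleline selectors match Hacker News’ current markup (they may change), and that the jQuery CDN URL shown is acceptable to inject(). The fakeTab object is a hypothetical stand-in that returns canned data, so the snippet runs without a browser.

```javascript
// Sketch only: method names are from this post, signatures and selectors are assumed.
async function scrapeHackerNews(tab) {
  await tab.open("https://news.ycombinator.com") // step 1: open the page
  await tab.waitUntilVisible("#hnmain")          // step 2: wait for the main table
  await tab.inject("https://code.jquery.com/jquery-3.2.1.min.js") // jQuery for step 3
  return tab.evaluate((arg, callback) => {       // step 3: extract inside the page
    // This function is meant to run in the page context, where $ is jQuery.
    const data = []
    $(".athing").each((index, element) => {
      const link = $(element).find(".titleline > a")
      data.push({ title: link.text(), url: link.attr("href") })
    })
    callback(null, data)
  })
}

// Hypothetical stand-in for a real tab: records calls and returns canned data.
const fakeTab = {
  calls: [],
  async open(url) { this.calls.push("open") },
  async waitUntilVisible(selector) { this.calls.push("waitUntilVisible") },
  async inject(url) { this.calls.push("inject") },
  async evaluate(inPageFunction) {
    this.calls.push("evaluate") // the in-page function is not executed by this fake
    return [{ title: "Example story", url: "https://example.com" }]
  },
}

const hnJob = scrapeHackerNews(fakeTab)
hnJob.then((links) => console.log(JSON.stringify(links, null, 2)))
```

With the real library, the tab would come from NickJS itself instead of fakeTab (as far as we recall its README, something like `await new Nick().newTab()`; check the documentation for the exact call).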

What’s cool is that you’ve now seen most of NickJS’ methods. You’re now able to extract any data from any website.

We know the HN example is an easy one, but rest assured that the same concepts work on all websites with NickJS.

Caveats

I don’t believe you… I’ve scraped websites in the past, and I know from experience that you’re often forced to use ugly hacks and tricks to get the data.

We won’t contradict you here. Scraping can be a mess. NickJS’ goal is to get out of your way and to simplify your life as much as possible.

NickJS doesn’t stop you from accessing its inner workings. At any moment you can call the underlying driver with tab.driver and do the ugliest hacks you want.

What about CAPTCHAs, IP bans, data storage… Scraping is not so easy!

NickJS is just a cool way to control a headless browser. For all the other problems related to scraping large quantities of data, we created Phantombuster. It’s basically a SaaS for hosting NickJS instances.

Phantombuster has an integrated CAPTCHA solver, proxy pools and cloud file storage (as well as integrated MongoDB instances). Check out the free trial 🙂

Don’t miss our tips & tricks for Headless Chrome.
