For a few years now we’ve been extracting data from the Web for clients. If we take a look back at the scraping code we wrote, a simple pattern emerges. It applies to any scraping job.
In some cases, we have to repeat steps 1 and 2 a few times to get to the data. After step 3 we have to decide if the job is done or if we go back to step 1. That’s it. I’ve just described 100% of what Web scraping is.
Simple, right?
It’s surprisingly hard today to execute our 3 scraping steps. None of the tools mentioned above let you get data without headaches.
What’s funny is that running the 3 steps hassle-free only requires 2 things: support for all websites, and a set of simple commands.
No library exists today that fulfills both requirements. Some arguably satisfy the first, but none satisfies both.
Until now.
NickJS is our attempt at making scraping easy. It’s an open source JavaScript library.
First of all, we support all websites (requirement 1) because we use Headless Chrome (but you can also use our PhantomJS driver if you prefer).
As for the simple commands (requirement 2), here’s what we expose: open(), fill(), sendKeys(), scroll(), setCookie() and evaluate() (the last one can trigger any kind of DOM event). We also added click() because it came back often.
For waiting, there are waitUntilVisible("#example1") and waitWhileVisible("p").
To extract data, call evaluate() and execute some jQuery inside the page. If the page doesn’t come with jQuery, simply call inject() with a URL to jQuery’s CDN and you’re all set.
That’s it. Nothing gets in your way. The 11 methods I just mentioned are what’s needed to run our 3 steps.
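As a rough sketch of how these methods combine, here is a minimal example that pulls story links from the Hacker News front page. It is an illustration rather than official NickJS sample code: the function name scrapeHackerNews, the jQuery CDN address and the CSS selectors are assumptions, and the selectors in particular may need adjusting if the page markup differs.

```javascript
// Sketch only. `tab` is assumed to be a NickJS tab, created with something like:
//   const Nick = require("nickjs")
//   const nick = new Nick()
//   const tab = await nick.newTab()
// The URL, the jQuery CDN address and the CSS selectors below are
// illustrative assumptions, not values taken from the article.
async function scrapeHackerNews(tab) {
  // Open the page.
  await tab.open("https://news.ycombinator.com")

  // Wait until the main table is visible before touching the page.
  await tab.waitUntilVisible("#hnmain")

  // The page is assumed not to ship jQuery, so inject it from a CDN.
  await tab.inject("https://code.jquery.com/jquery-3.2.1.min.js")

  // This callback runs inside the page, where `$` is the injected jQuery.
  const links = await tab.evaluate((arg, done) => {
    const data = []
    $(".athing .titleline > a").each((index, element) => {
      data.push({ title: $(element).text(), url: $(element).attr("href") })
    })
    // Node-style callback: error first, then the extracted result.
    done(null, data)
  })

  return links
}
```

Calling scrapeHackerNews(tab) should resolve to an array of { title, url } objects. Taking the tab as a parameter keeps the scraping logic separate from browser setup, so the same function can be exercised with a stub tab before it is pointed at a real page.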
Yes, we know Hacker News has an API. It’s just for the example.
The code is pretty much self-explanatory. As you can see, we instantiate a tab, and the methods for our 3 steps are called on the tab itself.
What’s cool is that you’ve now seen most of NickJS’ methods. You’re now able to extract any data from any website.
We know the HN example is an easy one, but rest assured that the same concepts work on all websites with NickJS.
I don’t believe you… I’ve scraped websites in the past, and I know from experience that you’re often forced to use ugly hacks and tricks to get the data.
We won’t contradict you here. Scraping can be a mess. NickJS’ goal is to get out of your way and to simplify your life as much as possible.
NickJS doesn’t stop you from accessing its inner workings. At any moment you can call the underlying driver with tab.driver and do the ugliest hacks you want.
What about CAPTCHAs, IP bans, data storage… Scraping is not so easy!
NickJS is just a cool way to control a headless browser. For all the other problems related to scraping large quantities of data, we created Phantombuster. It’s basically a SaaS for hosting NickJS instances.
Phantombuster has an integrated CAPTCHA solver, proxy pools and cloud file storage (as well as integrated MongoDB instances). Check out the free trial 🙂
Don’t miss our tips & tricks for Headless Chrome.