R, Python, and (web) scraping the bottom of my patience

Last year, when I worked for that company, I had to get used to picking tools and technologies not because of their features, or because of our expertise with them, but simply because the manager wanted it that way. Since he had no scientific or technological background, his decisions were often met with fear and distrust by the team, but he'd always find a way to impose them on us.

One of them was making R the only acceptable language in our office. I actually like R for its straightforward approach to graphs and statistics, and, as such, did not raise a single eyebrow when I was hired to develop R programs. However, I wasn't expecting a straight ban on every other language, even for small tasks that would take just a few lines of Javascript or Python to get done. He dreamt of leading the number one research center in R contributions and, despite not writing a single line of code himself, dragged us along with his whims. He used to walk around the office repeating the same mantra to anyone who'd listen: "Ask not if R can handle it, but how R can handle it". We got used to doing everything in R: stats, markup, stylesheets, database bindings, parsing, testing, and, in my case especially, web scraping.

Once I got up to speed, we started making fantastic progress, as expected in any project where there's a new kid on the block. I have some experience building parallel algorithms, and by taking on most of the coding responsibilities I freed my coworkers to dive into modelling and extract more and more information from the same dataset. The honeymoon, however, lasted only a few months. As several of us scraped thousands of pages every day, website managers started imposing restrictions on our IP addresses and throttling our traffic. One of them specifically targeted us with CAPTCHA tests so difficult that even I couldn't solve them. Web designers would constantly change their websites and render our scripts useless. In fact, we ended up dropping several sources because they had become far too slow to produce enough data for our analysis.

Still, our manager remained stubborn. His solution to our complaints was to have each of us operate several computers at once, in an attempt to run more tasks in parallel. It only delayed the inevitable and, as the end of my contract approached, he chose to release the product as it was, without further fine-tuning. It met internal expectations, but I left before any client showed interest in it.

Today, as I was studying scraping for another project, I decided to take another look at our old scripts, long since turned into garbage by clever web designers. I found a few copies in my personal email inbox and ran them as they were. Most of them died with fatal errors; the only one that still produced a decent result, built on RSelenium, rvest and boilerpipeR, managed 12 articles per minute. Not good, but similar to what I used to get at work.
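For context, those scripts followed roughly this pattern (a minimal sketch from memory, with a placeholder URL and CSS selector, not the original code): drive a real browser with RSelenium so that JavaScript-heavy pages actually render, collect the article links with rvest, and pass each page through boilerpipeR to strip away navigation and ads.

library(RSelenium)
library(rvest)
library(boilerpipeR)

# assumes a Selenium server is already running on the default port
remDr <- remoteDriver(browserName = "firefox")
remDr$open()
remDr$navigate("http://example.com/news")  # placeholder front page

# collect the article links from the rendered HTML
page  <- read_html(remDr$getPageSource()[[1]])
links <- html_attr(html_nodes(page, "a.article-link"), "href")  # placeholder selector

# visit each article and keep only the main text
for (url in links) {
  remDr$navigate(url)
  text <- ArticleExtractor(remDr$getPageSource()[[1]])
  writeLines(text, paste0(gsub("[^A-Za-z0-9]", "_", url), ".txt"))
}

remDr$close()

Every site redesign that renamed a CSS class broke a selector like that one, which is exactly why these scripts aged so badly.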

Using Python 2.7.8 and the newspaper module, I managed to surpass 400 articles per minute by running this code:

import codecs
import newspaper

news = newspaper.build('http://estadao.com.br') # one of the hardest to scrape back then
news.download_articles()  # fetch the HTML of every article found on the site
news.parse_articles()     # extract each article's title and body text

for x in news.articles:
    # write each article to a file named after its title (UTF-8 for the accented text)
    f = codecs.open(x.title, 'w', 'utf-8')
    f.write(x.text)
    f.close()

Just shy of ten lines, and it runs more than 30 times faster than the 100+ lines of our R code. Of course, this is not the language's fault, and it is likely that we would have written faster R code by now. Yet, by taking what each language does best, we could have shipped a far better product in a lot less time.

I'm excited to try this module out tomorrow. I have great plans for it and can't wait to share them with you, but they will be the subject of a future post. Stay tuned!

Tags: scraping 

 
