Last year, when I worked for that company, I had to get used to picking tools and technologies not for their features, nor for our expertise with them, but simply because the manager wanted them. Since he had no scientific or technological background, his decisions were often met with fear and distrust by the team, but he would always find a way to impose them on us.
Once I got up to speed, we started making fantastic progress, as expected in any project with a new kid on the block. I have some experience building parallel algorithms, and by taking on most of the coding responsibilities I gave my coworkers time to dive into modelling and extract more and more information from the same dataset. The honeymoon, however, lasted only a few months. As several of us scraped thousands of pages every day, website administrators started restricting our IP addresses and throttling our traffic. One of them specifically targeted us with CAPTCHA tests so difficult that even I couldn't solve them. Web designers would constantly change their sites and render our scripts useless. In the end, we dropped several sources because they had become far too slow to produce enough data for our analysis.
Still, our manager remained stubborn. His solution to our complaints was to force each of us to operate several computers at once, in an attempt to run more tasks in parallel. It only delayed the inevitable and, as the end of my contract approached, he chose to release the product as it was, without further fine-tuning. It met internal expectations, but I left before any client showed interest in it.
Today, while studying scraping for another project, I decided to take another look at the scripts those clever web designers had turned into garbage. I found a few copies in my personal email inbox and ran them as they were. Most failed with fatal errors; the only one that produced a decent result, built on RSelenium, rvest and boilerpipeR, managed 12 articles per minute. Not good, but similar to what I used to get at work. Then I tried rewriting the same task in Python with the newspaper package:
import newspaper

# one of the hardest sources to scrape back then
news = newspaper.build('http://estadao.com.br')
for article in news.articles:
    article.download()  # fetch the raw HTML
    article.parse()     # extract title and body text
    with open(article.title, 'w') as f:
        f.write(article.text)
Just shy of ten lines, and it works over 300 times faster than the 100+ lines of our R code. Of course, this is not the language's fault, and we would likely have faster R code by now. Yet, by taking what each language does best, we could have shipped a far better product in much less time.
This got me excited to give the module a proper try tomorrow. I have great plans for it and can't wait to share them with you, but they will be the subject of a future post. Stay tuned!