Yesterday I talked about the grueling parts of a previous job, and how much easier everything would have been if our manager had been just a bit more flexible. I ended up sleeping on it and woke up excited to try something out again, the Python way this time.
Turns out newspaper scraping wasn't our main bottleneck. In fact, it ended up being our main strength. Since most newspapers in Brazil are somewhat traditional and have been around for decades, scraping just a few of them gave us access to an enormous amount of historical data that no other database in the country could match. What really pissed most of us off (mainly my closest coworker, she really hated it) was Twitter scraping.
Basically, our manager decided we needed broad coverage across the country, which most of my coworkers seemed to agree with (I didn't, but I was outvoted). In other words, we were to scrape news from at least one source in every major region, a challenge that quickly proved too difficult to handle via web scraping. A coworker came up with the idea of scraping the newspapers' Twitter accounts instead, despite them not going so far back in time, since it should be easier to tackle.
It wasn't easy. Twitter loads its data asynchronously (via AJAX) and, as such, could not be scraped with any of our tools. She came up with the idea of using Selenium, a browser automation tool (its standalone server is a Java application) able to operate web browsers and simulate a human being cleverly enough to trick Twitter into loading its tweets in an automated way. However, since we were not allowed to use any language other than R itself, we settled for RSelenium, a package with bindings to Selenium that let us plug it into our workflow.
It was painful. While RSelenium is a great tool, especially when combined with the power of R's statistical features, it just didn't scale well. Scraping just a few hundred thousand tweets would take several weeks and ruin any deadlines set by the manager (which weren't realistic to begin with). On top of that, it was quite prone to human mistakes, and we made plenty of them back then.
I was curious as to how long it would take me to mimic that functionality in Python, and how well it would fare against the R code in terms of performance. Turns out it took me only an hour to assemble these lines:
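import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
base_url = u'https://twitter.com/search?q='
query = '%40JornalOGlobo'  # URL-encoded "@JornalOGlobo"
url = base_url + query
browser.get(url)  # open the search results page
time.sleep(1)  # arbitrary pause so the first batch of tweets can load
body = browser.find_element(By.TAG_NAME, 'body')
for i in range(10):
    body.send_keys(Keys.PAGE_DOWN)  # scrolling makes Twitter load more tweets
    time.sleep(0.5)  # arbitrary pause for the asynchronous load
tweets = browser.find_elements(By.CLASS_NAME, 'tweet-text')
with open('tweets.txt', 'w', encoding='utf-8') as txt:
    for t in tweets:
        txt.write(t.text + '\n')  # one tweet per line
browser.quit()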
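Scrolling a fixed ten times only grabs whatever happens to be loaded by then. A slightly more general variant, sketched here as an assumption and not something I benchmarked, keeps scrolling until a few rounds in a row bring nothing new, deduplicating along the way. The names `collect_until_stable`, `read_visible`, `scroll_once` and `patience` are all hypothetical; with Selenium, `read_visible` would wrap the `find_elements` call and `scroll_once` the `send_keys(Keys.PAGE_DOWN)` call.

```python
def collect_until_stable(read_visible, scroll_once, max_rounds=50, patience=3):
    """Scroll until `patience` consecutive rounds yield no new items.

    read_visible: callable returning the texts currently on the page.
    scroll_once:  callable that scrolls the page one step.
    Both are hypothetical hooks, not part of Selenium.
    """
    seen = set()
    ordered = []  # keeps first-seen order, which a set alone would lose
    idle_rounds = 0
    for _ in range(max_rounds):
        new_items = 0
        for text in read_visible():
            if text not in seen:
                seen.add(text)
                ordered.append(text)
                new_items += 1
        # Reset the idle counter whenever a round found something new.
        idle_rounds = 0 if new_items else idle_rounds + 1
        if idle_rounds >= patience:
            break
        scroll_once()
    return ordered


if __name__ == '__main__':
    # Fake "page" that reveals two more items per scroll, so the loop can be
    # exercised without a browser.
    feed = ['tweet %d' % n for n in range(7)]
    state = {'loaded': 2}

    def read_visible():
        return feed[:state['loaded']]

    def scroll_once():
        state['loaded'] = min(len(feed), state['loaded'] + 2)

    print(len(collect_until_stable(read_visible, scroll_once)))
```

Passing the two callables in keeps the stopping logic separate from the browser, so it can be tried against a fake feed before pointing it at a real page.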
Although we didn't take long to climb the learning curve on RSelenium either, it wasn't nearly as quick. In terms of performance, the gain wasn't as dramatic as yesterday's, but it was still a large improvement, at over 50 tweets per minute. I didn't run a stress test to see how it would fare in terms of stability and memory consumption, but I'm tempted to believe it would do a lot better.
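To put that rate in perspective, here is a back-of-the-envelope projection. The 50 tweets per minute is the figure mentioned above; the 300,000 target is only my stand-in for "a few hundred thousand", and `hours_to_scrape` is a hypothetical helper. It assumes the rate stays constant over a long run, which I haven't verified.

```python
def hours_to_scrape(total_tweets, tweets_per_minute):
    """Projected wall-clock hours at a constant scraping rate."""
    if tweets_per_minute <= 0:
        raise ValueError('tweets_per_minute must be positive')
    return total_tweets / tweets_per_minute / 60.0


# 300000 is an assumed figure; 50 tweets/min is the rate mentioned above.
hours = hours_to_scrape(300000, 50)
print('%.0f hours (about %.1f days)' % (hours, hours / 24))
# prints "100 hours (about 4.2 days)"
```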
I'm psyched to dive even deeper into this. Can I rebuild that system from the ground up? How long would it take? Wouldn't it be nice if that kind of information became available to anyone interested in researching it? All these answers will come, eventually, in a future post, but that one will take some time. Don't wait up!