Twitter has reworked the way its search works — from an architectural standpoint, at least.
Most end users shouldn’t notice any differences just yet, but Twitter’s search should now scale better, index more tweets per second, and use less of Twitter’s system resources. All this newfound scalability and headroom will give Twitter’s developers the ability to build cool new search features in the near future (we’re hoping for an older back catalog of tweets to show up in search results, but nothing like that has been confirmed yet).
So, whatever happened to Summize?
Apparently, this early-stage acquisition from 2008 has all but disappeared; Twitter’s real-time search engine is no longer based on Summize’s technology.
The search architecture is also no longer based on MySQL, the scaling of which, Twitter dev Michael Busch noted, “had become increasingly challenging.”
Around six months ago, Busch (a Lucene committer) and team decided to make the switch to Lucene, a 10-year-old open-source information retrieval library. The team then spent some quality (and quantity) time hacking Lucene to suit Twitter’s unique needs. And of course, since Lucene is open source, the modifications are being added to Lucene, particularly its real-time branch.
“We rewrote big parts of the core in-memory data structures,” said Busch, “especially the posting lists, while still supporting Lucene’s standard APIs.” The team also improved garbage collection and added lock-free data structures and algorithms, along with a few other niceties.
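For readers who haven’t met the term, a posting list is simply the list of documents (here, tweets) that contain a given word; an inverted index maps each word to its posting list. The Python sketch below is our own toy illustration of that idea, not Twitter’s or Lucene’s code. The class name `TinyRealtimeIndex`, its methods, and the newest-first ordering are all assumptions made for the example, and it ignores the garbage-collection and lock-free concurrency work Busch describes.

```python
from collections import defaultdict


class TinyRealtimeIndex:
    """Toy in-memory inverted index (hypothetical; not Lucene's actual design)."""

    def __init__(self):
        # term -> posting list: IDs of the documents containing that term,
        # stored in the order the documents were added
        self.postings = defaultdict(list)
        self.docs = []  # a document's ID is its position in this list

    def add(self, text):
        """Index one document and make it searchable immediately."""
        doc_id = len(self.docs)
        self.docs.append(text)
        # Record each distinct term once per document
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, query, limit=10):
        """Return documents containing every query term, newest first."""
        terms = query.lower().split()
        if not terms:
            return []
        # Intersect the posting lists of all query terms (AND semantics)
        lists = [self.postings.get(term, []) for term in terms]
        matches = set(lists[0]).intersection(*lists[1:])
        # Higher IDs were added later, so sort descending for recency
        newest_first = sorted(matches, reverse=True)[:limit]
        return [self.docs[doc_id] for doc_id in newest_first]


index = TinyRealtimeIndex()
index.add("Lucene powers search")
index.add("search is real time")
index.add("MySQL scaling is hard")
print(index.search("search"))
```

Because every new document only appends to the end of a few lists, writes stay cheap and a tweet becomes searchable the moment it is added. That append-only shape is presumably part of what makes in-memory posting lists attractive for real-time search, though the production version is far more involved.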
We’re impressed that we didn’t notice any odd behavior or downtime in Twitter search while the switch was underway. But we’d like to know more about the problems Twitter’s engineers were having with scaling MySQL. You may not have noticed, but database scaling has been something of a recurring theme around here lately.