As Twitter has become the global platform for public conversation, the company's storage requirements have grown along with it. These days the service handles on the order of 5,000 to 10,000 tweets per second from more than 240 million Twitter accounts, and those tweets carry a multitude of links, hashtags, photos and videos, a volume well beyond the company's standard operating bounds.
Last month, Twitter was unavailable for several hours. It was not the first time the social network had fallen victim to an overload resulting in connection problems, even if that failure was probably not its longest.
Over the last few years, the company found itself in need of a storage system that could serve millions of queries per second with extremely low latency in a real-time environment. Availability and speed became the most important factors: the system not only needed to be fast, it needed to scale across several regions around the world.
This is why the company has drawn up a long-term plan to improve its IT architecture, prioritizing infrastructure investments that enhance the simplicity, scalability and performance of its database portfolio.
The Manhattan database system
Last week, Twitter published a blog post detailing its Manhattan database system, which was built to power a wide variety of applications. The distributed, real-time database was built to serve the many teams and applications within the company whose workloads existing technologies could no longer handle.
The Manhattan database system is built to cope with the roughly 6,000 tweets, retweets and replies that flood into Twitter every second. It is also something of a rebuke to existing open source database technologies, which were not built for the scalability, speed and accuracy the company needs. Twitter currently uses the open source databases MySQL and Cassandra to run its massive online empire.
“We were spending far too much time firefighting production systems to meet the performance expectations of our various products, and standing up new storage capacity for a use case involved too much manual work and process. Our experience developing and operating production storage at Twitter’s scale made it clear that the situation was simply not sustainable,” Peter Schuller, an engineer on Twitter’s Core Storage team, wrote on the company’s blog.
Manhattan, which began development two years ago, is designed to be an all-in-one solution. Twitter’s engineers built the Manhattan storage service to be consumed just like any other cloud storage service. Currently it exposes a key-value interface to its users, but Twitter is looking to extend the database by introducing a graph-based capability.
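To make the key-value model concrete, here is a minimal sketch of what consuming such a storage service might look like from an application's point of view. The ManhattanClient class, its method names and the dataset name are invented for illustration and are not Twitter's actual client API.

```python
# Minimal sketch of consuming a key-value storage service. All names here
# (ManhattanClient, the "user_profiles" dataset, the method signatures)
# are hypothetical stand-ins, not Twitter's real Manhattan API.

class ManhattanClient:
    """Toy in-memory stand-in for a remote key-value storage service."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._store = {}          # a real client would issue RPCs instead

    def put(self, key, value):
        self._store[key] = value  # write a value under a key

    def get(self, key):
        return self._store.get(key)  # read it back, or None if absent


client = ManhattanClient(dataset="user_profiles")
client.put("user:12345", {"name": "Ada", "followers": 42})
print(client.get("user:12345"))
```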
The architecture
The database system consists of three storage engines designed for read-only Hadoop data, write-heavy data and read-heavy data, respectively. The engines are supported by several services, including a strong consistency service, which lets customers get strong consistency for certain sets of operations; a timeseries counters service, which handles high-volume time-series counters inside Manhattan; and an import service for bringing Hadoop data into the system.
The data captured by the social network is stored in three different engines. The first, called seadb, is a read-only file format; the second, sstable, is a log-structured merge tree for write-heavy workloads; and the last, btree, is a heavy-read, light-write engine.
Manhattan automatically matches incoming data to the appropriate engine based on its format. Output from batch workloads is fed into the Hadoop File System, and Manhattan transforms that data into seadb files so it can be imported into the cluster for fast serving from SSDs or memory.
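The batch import path can be pictured as a small pipeline: Hadoop job output is converted into read-only seadb files, which are then loaded onto serving nodes. The sketch below only illustrates that flow under assumed names; the conversion and loading steps are placeholders rather than Twitter's actual tooling.

```python
# Illustrative sketch of a Hadoop-to-seadb import flow. The function names
# and file layout are assumptions made for this example; only the overall
# shape (HDFS output -> seadb conversion -> cluster import) comes from
# Twitter's description.

from pathlib import Path

def convert_to_seadb(hdfs_output_dir: Path, seadb_dir: Path) -> list[Path]:
    """Pretend conversion: each Hadoop output part becomes a seadb file."""
    seadb_dir.mkdir(parents=True, exist_ok=True)
    seadb_files = []
    for part in sorted(hdfs_output_dir.glob("part-*")):
        target = seadb_dir / (part.stem + ".seadb")
        target.write_bytes(part.read_bytes())  # placeholder for real encoding
        seadb_files.append(target)
    return seadb_files

def import_into_cluster(seadb_files: list[Path]) -> None:
    """Placeholder for shipping the read-only files to serving nodes."""
    for f in seadb_files:
        print(f"loading {f.name} for serving from SSD or memory")

if __name__ == "__main__":
    files = convert_to_seadb(Path("hdfs_output"), Path("seadb_staging"))
    import_into_cluster(files)
```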
Developers can select the consistency of data when reading from or writing to Manhattan, allowing them to create new services with varying tradeoffs between availability and consistency. The company has also developed internal APIs that expose this data for cost analysis, allowing developers to determine which use cases cost the business the most and which aren’t used as often.
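A rough sketch of what per-operation consistency selection could look like from a developer's point of view is shown below. The Consistency enum and the client stub are assumptions made for this example and do not reflect Manhattan's real interface.

```python
# Hypothetical illustration of choosing a consistency level per operation.
# The Consistency enum and FakeClient are invented for this sketch, not
# Manhattan's actual API.

from enum import Enum

class Consistency(Enum):
    EVENTUAL = "eventual"  # favor availability and low latency
    STRONG = "strong"      # favor reading the latest acknowledged write

class FakeClient:
    """Stub standing in for a storage client that accepts a consistency level."""

    def __init__(self):
        self._store = {}

    def put(self, key, value, consistency=Consistency.EVENTUAL):
        print(f"write {key} at {consistency.value} consistency")
        self._store[key] = value

    def get(self, key, consistency=Consistency.EVENTUAL):
        print(f"read {key} at {consistency.value} consistency")
        return self._store.get(key)


client = FakeClient()
# A high-volume counter might accept eventual consistency for speed,
# while a settings read might insist on a strongly consistent view.
client.put("counter:impressions", 1, consistency=Consistency.EVENTUAL)
client.get("user:12345:settings", consistency=Consistency.STRONG)
```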
“Engineers can provision what their application needs (storage size, queries per second, etc.) and start using storage in seconds without having to wait for hardware to be installed or for schemas to be set up,” Schuller wrote.
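That self-service model suggests an application simply declares what it needs and receives capacity without manual setup. A provisioning request might look something like the hypothetical snippet below; the field names and the request_storage function are assumptions, not Twitter's provisioning format.

```python
# Hypothetical self-service provisioning request. The request_storage
# function and the field names are invented for illustration only.

def request_storage(spec: dict) -> None:
    """Stand-in for a self-service provisioning call."""
    print(f"provisioning dataset '{spec['dataset']}' "
          f"for {spec['queries_per_second']} qps, {spec['storage_gb']} GB")

request_storage({
    "dataset": "tweet_impressions",
    "storage_gb": 500,            # requested capacity
    "queries_per_second": 20000,  # expected peak read/write rate
    "engine": "sstable",          # write-heavy workload (see engines above)
})
```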
Looking ahead
As for further development, Twitter plans to release a white paper outlining even more technical detail on Manhattan. The company is also working to implement secondary indexes, which will let developers add an additional set of range keys to a dataset’s index. Secondary indexes will further speed up queries and help developers navigate large amounts of data.
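The idea behind such range keys can be illustrated with a small, hypothetical sketch: a secondary index kept alongside a key-value store lets a range query touch only the matching entries instead of scanning the whole dataset. Nothing below reflects Manhattan's actual design.

```python
# Illustrative secondary index over a key-value store: records are keyed
# by tweet id, and a sorted index on (user_id, timestamp) answers range
# queries. All structures here are invented for this example.

import bisect

records = {}   # primary store: tweet_id -> record
index = []     # secondary index: sorted list of (user_id, timestamp, tweet_id)

def put(tweet_id, user_id, timestamp, text):
    records[tweet_id] = {"user": user_id, "ts": timestamp, "text": text}
    bisect.insort(index, (user_id, timestamp, tweet_id))

def tweets_by_user_between(user_id, start_ts, end_ts):
    """Range scan over the secondary index instead of the whole dataset."""
    lo = bisect.bisect_left(index, (user_id, start_ts, ""))
    hi = bisect.bisect_right(index, (user_id, end_ts, chr(0x10FFFF)))
    return [records[tweet_id] for _, _, tweet_id in index[lo:hi]]

put("t1", "ada", 100, "hello")
put("t2", "ada", 200, "world")
put("t3", "bob", 150, "hi")
print(tweets_by_user_between("ada", 50, 150))  # returns only the first tweet
```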
“The challenges are increasing and the number of features being launched internally on Manhattan is growing at rapid pace. Pushing ourselves harder to be better and smarter is what drives us on the Core Storage team,” Twitter says.
Given the company’s enthusiasm for open source, it wouldn’t be surprising if it open sourced Manhattan down the road. Twitter recently contributed code to Facebook’s WebScaleSQL open source project, an effort to build a version of MySQL that can scale to massive proportions.
photo credit: mkhmarketing via photopin cc