Relational SQL databases have a bad rap these days. Go ask all the startups you know what database technology they are using. Really, go ask.
What's wrong with SQL databases?
As awesome as SQL databases are, there are not-so-awesome challenges you face when you start to use them. Let's list the major ones:
- Requires up-front knowledge, especially since you can't store any data until you set up your database schema
- No out-of-the-box indexing beyond primary keys
- Easy to shoot yourself in the foot with issues like N+1 queries or poor indexing
- Even for those who plan ahead, high-availability is not very easy to attain
However, all of these issues can be overcome with a bit of research and experience. It's just not as simple as running a MongoDB server and dumping your arbitrary JSON documents into it.
What do I get with SQL?
SQL setup can be more challenging. But what you get in return is long-term flexibility with reliable performance.
The vast majority of data we store in databases is relational. Product requirements often require locating data based on criteria other than the primary key of the resource. For example, you may need to query for all users between age 18 and 25, which means that unless you have an index on a users' age you will need to scan the entire users table to find the users who meet the criteria. With SQL databases it is trivial to add an index on a column (or even multiple columns). However, NoSQL databases tend to lack flexible indexing support. Even the best NoSQL databases offer a limited number of secondary indexes for each table, and often times are limited in the coverage of data that can be included in the index, like multi-column support.
SQL databases also tend to have backup solutions that encompass the entire set of data, making it quicker and easier to restore functionality after an outage. NoSQL solutions often rely on their high-availability implementation details to allow for limited numbers of nodes to fail without affecting the integrity of the cluster and its data. But when a datacenter-wide power outage occurs it can take a long time to rebuild the cluster of nodes from backups.
SQL sounds pretty awesome, where do I sign?
Check yo self! As cool as SQL databases are, they are not perfect for every work load. NoSQL solutions like Cassandra and DynamoDB do have their place (though I would hesitate to say the same for MongoDB). They are great at storing high volumes of non-relational data. Timestamped metric data and logs are great use cases for horizontally scalable NoSQL solutions. SQL databases have a hard time with huge amounts of non-relational data because the data can't be randomly sharded across a cluster of database nodes without impairing support for relational querying. The only real mechanism for horizontal scalability involves manual sharding based on some kind of over-arching key such as a customer account identifier. But this form of non-random sharding can lead to hot-spots, where some shards end up with a much larger than average amount of data or queries executed.
That said, it is definitely possible to scale SQL databases. One notable "big data" company is built on MySQL to this day, and their architecture hasn't changed much since this article from 2011: New Relic Architecture - Collecting 20+ Billion Metrics A Day.
The next time you start a project and reach for a database, think hard about what kind of database it should be. Most data is highly relational, and the barriers to using SQL databases are more about up-front challenges than they are about long-term challenges (while the opposite is often true for simple NoSQL databases). More often than not, your next database choice should be a SQL database!