I remember some years ago when talking about databases you would often hear the word “Oracle”, which is considered a very robust database management system but so complex at the same time that in order to make a simple backup you would need to have several certifications such as OCA (Oracle certified associate) or OCP (not related to robocop at all).
Also you would hear about oracle’s open source cousin mysql, which we used to create wonderful php based web application, plagued with sql injection and for the .net world god ol’ sql server which required no certifications at all to make backups.
Those were the times!
However, nowadays the panorama has changed completely and new players have joined the game, more flexible and scalable than anything that we have seen before, we hear all over the news all the time about big data and how artificial intelligence algorithms will forever change the way we live our lives, like how to go to the bathroom more efficiently, such a revolution!
Jokes aside, the increase of connected devices in our lives has lead to an explosion of data, which has surpassed the capacity of traditional relational data base management systems and led to the creation of Non Relational Databases, more commonly known as NoSQL which is not technically correct, we will see why a bit further into the article.
Traditional relational databases such as Oracle and SQL Server follow some of Codd’s twelve rules (which are actually thirteen because you know, the first one has index 0) which in short we have tables, with multiples rows and columns that can be related to each other by using a key (primary key and foreign key) which allow us to do joins between tables and enforce data integrity by using constraints.
Nothing really difficult right? The main characteristic is that the tables have a fixed structure(we will call it schema), we define column types before adding the data, this structure cannot be inferred automatically from the data and usually we do not change the type of the column, if we have a integer column we cannot store a string in the same column.
Limits of Traditional SQL
Suppose we have a 11 millions rows table with a full text index, which means that the database will create like a phone directory of the data in order to search it faster. Therefore, we need to add another index in another column or even we have to add a completely new column, depending on the database, usually this would take too long to complete and would block some other operations like adding new records.
Imagine you have to create a search engine with it, adding new criteria to search the table, so every time you add a new index to a column of your table, it will take ages to process, and your queries will have to be fined tuned in order to have an adequate response time.
What we do to solve this? Generally we would try to reduce the size of the table, add more computer power (if it’s slow usually adding more memory fixes it, right?) or plainly restricting our query to only search for the first 1000 rows, but what if we need to do the search with the complete dataset? Imagine having terabytes of data to search, basically we are screwed
Enter No SQL
So for this case we have a simple solution, divide our system in two, one is the Transactional System which will comply with all 13 codd rules to provide consistency and reduce redundancy to our data and another piece of software, that we will use to index our data, like Elasticsearch, which in fact has no tables and uses json as mode of storing the data, having a query language very different to traditional SQL.
What do we gain by using this solution? Elasticsearch is very scalable and you can have very fast response times for searches with millions and millions of records.
What if your database is bigger than THAT GUY’S EGO? You need sharding, you can divide your database into smaller databases and put them into different servers depending of a key, like for example a “Country” field where you would have a server for each of the countries in your database, depending on your data.
So basically, a NoSQL solution is a more specialized database system which is often used for applications in which a traditional database wouldn’t work, or at least would not perform as well, the principal characteristic is that the schema is flexible and may be inferred by your data, you only need to provide the data in json format.
Usually NoSQL databases do not support reliability features that are natively supported by relational databases such as foreign keys and constraints (bye bye joins) and have other ways of querying data than just only SQL like map reduce, json queries and so on, which differs from the traditional way of querying the data using vanilla sql. Furthermore, most traditional queries cannot be done the same way as with sql, something as simple as select distinct * from table is harder to implement.
The market of NoSQL solutions has become very varied, dominated mostly by key/value stores (Redis, BerkeleyDB) , document databases (MongoDB) and column family store (Bigtable, Cassandra), most of the social networks are fueled by these types of databases, like BigTable which is now used by a number of Google applications, such as web indexing.
Most people after reading this article would ask me “I have one of those old and ugly last generation database thingies, should I throw them away in favor of these new sexy technologies?”
My answer is yes and no at the same way, usually for most projects a simple dbms is more than enough, I feel like dying inside a bit everytime I see developers trying to implement joins and referential integrity in MongoDB instead of just using MySQL.
Most of the time you should choose the right tool for the job, depending on the need of your clients, SQL database are far away from dying but NoSQL database provide with the means to develop new applications that wouldn’t be possible otherwise.