As promised in one of my previous articles now we are going to discuss about how to choose a database, as you might be aware, relational databases are still alive and kicking but there are new players that are changing the game completely.
However, with the huge variety of all the alternatives available in the market, how do we choose the correct database for our new and revolutionary [Insert generic project related to IoT, blockchain, machine learning or artificial intelligence] project?
Even though there is not a single expert that has used all the existing rdms and nosql solutions in the world, this shouldn’t be so hard to do. By using some common sense and real world experience we can think of some simple guidelines:
Follow the data
Our choice of technology should not be influenced by the trends in social media or what most people consider as “sexy” (technologically speaking), but mostly on our use cases and what are the business requirements.
In this case we have to analyse our data and ask some simple questions:
- What does my data looks like? Does it resemble a dictionary (Key-value) ? a table or a json document?
- Should I store in it’s original format?
- Do the relationships in my data are important?
Basically, if you data resembles a classical DBMS with tables and relationships where you need to enforce strict schema level validation rules, you should always think about a SQL database likeMySQL.
If your data has a flexible schema, where you have different types of data in each fields, and looks more like a json file than a table, then you might consider a document store such as MongoDB.
Don’t make the mistake of accidentally choosing a “Cache” instead of a database, like the developers of the social network Dispora did, see: Why You Should Never Use MongoDB
If we need to store our data in its original format without transforming or changing it, we are talking about a Datalake, and we might use Hadoop for this purpose.
CAP Theorem and ACID
Another tool that greatly help us to decide what database system to use if the CAP theorem which states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
- Consistency: Every read receives the most recent write or an error
- Availability: Every request receives a (non-error) response – without guarantee that it contains the most recent write
- Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
On the other hand we have ACID (atomicity, consistency, isolation, and durability) is an acronym for learning and remembering the four primary attributes ensured to any transaction by a transaction manager in a database system (which is also called a transaction monitor). These attributes are:
- Atomicity. In a transaction involving two or more discrete pieces of information, either all of the pieces are committed or none are.
- Consistency. A transaction either creates a new and valid state of data, or, if any failure occurs, returns all data to its state before the transaction was started.
- Isolation. A transaction in process and not yet committed must remain isolated from any other transaction.
- Durability. Committed data is saved by the system such that, even in the event of a failure and system restart, the data is available in its correct state.
CAP provides the basic requirements that a distributed system must follow and ACID is a set of rules that a database can choose to follow that guarantees how it handles transactions and keeps data safe, most traditional relational RDBMS comply with ACID whereas most NoSQL solutions do not.
With that in mind we can use the following diagram to help us make our decision:
So when talking about consistency, which guarantee that any transactions started in the future necessarily will see the effects of other transactions committed in the past and that database constraints are not violated, particularly once a transaction is commited, here we are talking about the possibility of doing a rollback of a big transaction if we have an error in the middle of it.
Therefore, imagine if we are inserting a financial transaction, to add money to an account and then we withdraw some money at the same time, if the transacton fails in the middle of the process, our account balance should remain consistent, since this might not be possible to enforce with a NOSQL system, in this case we should use a SQL RDMS such as MySQL.
What about partition tolerance and availability? Availability is about the nodes, even if a node goes down the system still running. Partition tolerance is about the messages, even if random messages got lost the system still running, so if there is a network failure that splits the processing nodes into two groups that cannot talk to each other, then the goal would be to allow processing to continue in both subgroups. An example of a network partition is when two nodes can’t talk to each other, but there are clients able to talk to either one or both of those nodes. If you’ve ever used IRC and experienced a netsplit, this is a great example of that.
This case might be very obvious, there are databases that are best to handle big volumes of data, in MySQL when our tables begin growing beyond 8 or 9 gigabytes we start having troubles and we will have to start thinking about partitioning, HBase and Couchbase provide automatic sharding for example, that will allow us to process higher volumes of data more easier.
You want to do machine learning with big volumes of data? Think about Spark, known as a fast, easy-to-use and general engine for big data processing that has built-in modules for streaming, SQL, Machine Learning (ML) and graph processing. Tons of people are doing data integration and ETL on MapReduce, as well as batch computation, machine learning and batch analytics.
But these things are going to be much faster on Spark. Interactive analytics and BI are possible on Spark, and the same goes for real-time stream processing. So some of the new use cases are just the old use cases, done faster, while some are totally new. There are some things that just couldn’t have been done with acceptable performance on MapReduce.
Finally we need to take into account the specific use case of our system and as what are we going to do with the data? We should take into account success stories in the industry to help us decide.
- You have a lot of unstructured data, including documents and websites? Consider using a Key-value database or other unstructured database such as MongoDB. which use rules defined by programmers to comb through large amounts of unstructured data.
- If you need to deal with massive collections of simple structured data such as log files, you could consider something such as Google’s BigTable, a distributed column oriented data store that is being used in web indexing by Google itself.
- Do you need to add search capabilities to existing data? Consider using Elasticsearch. You want Elasticsearch when you’re doing a lot of text search, where traditional RDBMS databases are not performing really well (poor configuration, acts as a black-box, poor performance). Elasticsearch is highly customizable, extendable through plugins. You can build robust search without much knowledge quite fast.
- Do you require to process large streams of data with extremely low latency? Consider using something like VoltDB that is a high speed database. Another option is Couchbase, which is a more general purpose NoSQL database.
- For regular banking transactions, OLTP and reporting systems, you can’t go wrong with vanilla sql solutions.
Here you have a small quick reference to take a decision, it is not by any means complete and will be continued to be update in the future:
After taking into account all the theoretical aspects of database systems, no other tool is more important than common sense. Do not choose a technology because it seem to be the next big thing or everyone is starting to use it, your technological choices should be the result of an analysis of your business requirements, not the opposite, don’t push technology where it is not needed, if you need to do a few simple inserts and get a report, you don’t need NoSQL.
On the other hand, if you have a SQL system and you are struggling with performance and scalability issues, it might be the time to try something new, don’t be afraid to experiment a bit with these new technologies, but be aware of their limits, how and where to use them.
Don’t reinvent the wheel.