If you’re like me, you’ve been working in the web industry for many years now, primarily using tried-and-true relational database management systems (RDBMS) such as MySQL or PostgreSQL. These databases are well-known, well-supported, and generally considered “the right way” to do things, even when they’re not. In the past few years an alternative database philosophy, generally dubbed “NoSQL”, has emerged: a family of much more flexible, non-relational database systems.
One of these modern database systems is MongoDB. MongoDB is a document-oriented database, not a relational one. In the standard relational databases you are used to, you have tables defined by schemas, and each schema defines its table’s columns. The schema is the DNA of the table, which is used to house some specific type of data. Each column of the table has a designated name and data type, and every row in the table has every column described by the schema.
Documents not rows
In MongoDB the concept of a row is replaced by the concept of a document. A document is flexible, it is amorphous, it can be almost anything. Properties may be set on documents dynamically, so documents of the same type might not all have the same properties available. It is your own responsibility in your application logic to handle properties that may or may not exist on your documents. This is in stark contrast to the standard relational database methodology, where a row’s data description is set in stone and can always be guaranteed to exist in a certain way. Documents can store several types of data (even arrays!) that are referenced by keys. Although there are differences, you can generally think of a MongoDB document as similar to a hashmap such as a JavaScript object or Python dictionary, except that it is persisted.
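To make the "documents are like dictionaries" idea concrete, here is a minimal sketch using plain Python dicts rather than a real MongoDB connection. The field names (`name`, `email`, `tags`) and the `contact_line` helper are made up for illustration; the point is only that two documents of the same kind can carry different properties, and that the application has to cope with that.

```python
# Two "documents" of the same kind, modelled as plain Python dicts.
# The field names are hypothetical, chosen only for this example.
users = [
    {"name": "Ada", "email": "ada@example.com", "tags": ["admin", "staff"]},
    {"name": "Grace"},  # no email, no tags: nothing forces every field to exist
]


def contact_line(user):
    """Build a display string, tolerating missing optional fields."""
    email = user.get("email", "no email on file")
    tags = ", ".join(user.get("tags", [])) or "none"
    return f"{user['name']} <{email}> tags: {tags}"


lines = [contact_line(u) for u in users]
for line in lines:
    print(line)
```

Using `dict.get` with a default is one way to handle absent properties; whether a missing field should get a default or raise an error is a decision for your own application logic.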
Horizontal scaling
Another strong attribute of MongoDB is its almost effortless ability to scale. Most relational database de facto standards were designed and developed many years ago and have their roots before the days of the Internet, before today’s massive amounts of data even seemed possible. As database size and demand increase, engineers have to find ways to scale their database systems to handle the strain. Typically in systems like MySQL it’s quite common (and relatively easy) to start by setting up clusters of databases and replicating your data across them to spread out the load. Beyond that first step, this kind of horizontal scaling can require significant thought and implementation, varying of course by your particular setup.
MongoDB makes this process extremely simple by default. Though you can of course do some tweaking, the default behavior is still pretty good for most cases. The database system itself can handle automatically distributing your documents across nodes (a.k.a. servers) using a concept known as sharding, and even handle balancing the load as queries hit the database.
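As a rough mental model of sharding, the toy sketch below routes documents to shards based on ranges of a shard key. It is not MongoDB's implementation: the shard names, the split points, and the `shard_for` and `distribute` helpers are all assumptions invented for this example, and a real cluster also moves data between shards as it grows.

```python
import bisect

# Hypothetical split points on the shard key (here, a username).
# Keys below "h" go to shard0, keys from "h" up to "p" go to shard1,
# and everything from "p" onward goes to shard2.
SPLIT_POINTS = ["h", "p"]
SHARDS = ["shard0", "shard1", "shard2"]


def shard_for(shard_key):
    """Return the name of the shard whose key range contains shard_key."""
    return SHARDS[bisect.bisect_right(SPLIT_POINTS, shard_key)]


def distribute(documents, key_field):
    """Group documents by the shard that owns their shard-key value."""
    placement = {name: [] for name in SHARDS}
    for doc in documents:
        placement[shard_for(doc[key_field])].append(doc)
    return placement


docs = [{"username": n} for n in ["alice", "henry", "zoe", "mallory", "bob"]]
placement = distribute(docs, "username")
for shard, owned in placement.items():
    print(shard, [d["username"] for d in owned])
```

The appeal of having the database do this for you is that the routing table lives in the cluster, not in your application code.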
MongoDB’s goody bag
MongoDB has several of the functionalities that you’re used to with common database systems such as the ability to index on any attribute of a document for optimizing your queries based on your data. One particularly cool feature is MongoDB’s native support for geospatial queries, which makes storing location information on your documents and then querying based on geospatial proximities (find all places near X, Y) almost completely trivial. In fact, Foursquare uses MongoDB for that very reason among others.
You may be familiar with stored procedures on your common RDBMSes, which allow you to compile functions on the database to assist in common queries you carry out. MongoDB has this functionality too, but the functions are written in JavaScript. Combine this with MongoDB’s native Map/Reduce functionality and serious data processing at the database query level becomes possible, rather than slicing and dicing the data yourself in your own application logic.
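If Map/Reduce is new to you, the general pattern can be sketched in a few lines of plain Python. This is only an illustration of the idea, not MongoDB's API: in MongoDB the map and reduce functions are written in JavaScript and run inside the database, whereas the `map_reduce`, `map_tags`, and `reduce_count` names below are hypothetical helpers running in your own process.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Run map_fn over every document, group emitted values by key, then reduce."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def map_tags(doc):
    """Emit (tag, 1) for each tag on a document; documents without tags emit nothing."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_count(key, values):
    """Collapse all values emitted for one key into a single count."""
    return sum(values)


posts = [
    {"title": "Intro", "tags": ["mongodb", "nosql"]},
    {"title": "Sharding", "tags": ["mongodb"]},
    {"title": "Untagged"},
]
tag_counts = map_reduce(posts, map_tags, reduce_count)
print(tag_counts)
```

The map step decides what to count and the reduce step decides how to combine it, which is why the same skeleton covers sums, counts, and many other aggregations.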
Another pretty cool feature of MongoDB is GridFS, which allows you to store files of any size and their metadata as documents inside your database. While there are limits to the size of a MongoDB document (4MB in older versions, 16MB in v1.7/1.8, higher limits in the future), there is native support for chunking a file into multiple documents and linking them together. As the GridFS wiki page states, in cases with extra large files, such as videos, this chunking also allows for more efficient range operations, which can be very useful for track seeking.
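The chunking idea, and why it helps with range reads, can be sketched in plain Python without a database. The chunk layout below (`files_id`, `n`, `data`) is loosely modelled on how GridFS stores chunks, but the tiny 4-byte chunk size and the `split_into_chunks` and `read_range` helpers are assumptions made up for this demo.

```python
CHUNK_SIZE = 4  # bytes; deliberately tiny so the demo is easy to follow


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Split a bytes payload into numbered chunk documents linked by files_id."""
    return [
        {"files_id": file_id, "n": i // chunk_size, "data": data[i:i + chunk_size]}
        for i in range(0, len(data), chunk_size)
    ]


def read_range(chunks, start, end, chunk_size=CHUNK_SIZE):
    """Read bytes [start, end), touching only the chunks that overlap the range."""
    if end <= start:
        return b""
    first, last = start // chunk_size, (end - 1) // chunk_size
    needed = sorted(
        (c for c in chunks if first <= c["n"] <= last), key=lambda c: c["n"]
    )
    joined = b"".join(c["data"] for c in needed)
    offset = start - first * chunk_size
    return joined[offset:offset + (end - start)]


payload = b"0123456789abcdef!"
chunks = split_into_chunks("video-1", payload)
print(len(chunks), read_range(chunks, 6, 11))
```

Seeking into the middle of a large video only needs the chunks that overlap the requested byte range, which is the efficiency the wiki page is describing.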
Harder, Better, Faster, Stronger
One of the most important considerations in the design of the MongoDB system is performance. It has several optimizations that allow for its blazing performance, such as memory-mapped files, pre-allocated data files (which do tend to consume more space than necessary but allow for consistent I/O performance), and a “memory” which can allow for increased performance for identical queries that are run often. It’s this performance level that makes MongoDB so well-suited for databases which need to handle heavy, rapid queries.
MongoDB Has Become Self-Aware
MongoDB is damned smart. It really is. It is quite possible in a deployment of a MongoDB cluster that you spend very little time actually managing your servers. Should a master server in your cluster go down, MongoDB can automatically negotiate within the cluster to promote a slave server to be a new master in the place of the one that went down. When adding new machines to your cluster to scale, the process can be very automatic, sometimes as simple as telling the cluster about it and letting MongoDB automatically welcome the new server to the gang, balancing the load and distributing your document storage more efficiently.
Tune in next time for…
That’s all for this exciting episode of… well, my blog. Over the coming weeks I’ll be writing more about getting started with MongoDB, mostly from a Python angle, as well as sharing case studies of ways I’m using MongoDB. This discussion of MongoDB should be interesting for anyone else who, like me, knows their way around a web application stack but has just started playing around with the newer, less time-proven technologies such as NoSQL database systems.
