Database musings and mishaps

MySQL vs NoSQL

A popular interview question, and one I’ve asked myself, is “when do you opt for a MySQL vs NoSQL database?” It’s basic, with a simple answer, and it usually leads to discussion of a candidate’s past experiences with and attitudes toward databases, which is what I really care about. A fine answer is “Well it depends on if my data is relational or not”, which leads to the obvious followup of “and how do you determine that?” If you’ve never had to make that decision, I will proffer a heuristic for you. The best way to determine if your data is relational or not depends on your usage patterns for that data. Obviously these can change over time, but as far as you are able to, examine how the data you are collecting is stored and retrieved and let that be a determinant. I’ll give an (oversimplified) example.


Let’s say you have a user. The user has addresses (home, a lake cottage, work address, etc) and family members (spouse, children, etc). One way to relationally structure this data is with a “users” table joined one-to-many to an address table and one-to-many to a “family” table. Another way to relationally structure it is to have a “users” table, but have the user just be another row in the “family” table with a flag indicating who is a user and who are dependents. You could even have a self referring key on the non-user family members you can use to self join family members to a primary user person.

If you were storing this data in, let’s say, MongoDB, it would look a little different. You could just have a user document that had an array of family members, and an array of addresses. Assuming we put sensible limits on the number of family members and addresses a person could have (say, <1000) this works fine.

But which to choose? Well, if your usage is only going to be retrieving and storing all the data at once, perhaps mongo makes sense. If, on the other hand, you use addresses in one place, family members in another, then the actual user information in yet another it might make sense to put it in a relational database. Two good questions to ask yourself are:

  • Am I going to want to JOIN into or out of this data? Conversely, can I get a document of data that makes sense that won’t requre me to join to it?

  • Am I using this chunk of data as a monolith? Conversely, am I using pieces of it in one place an separate pieces in another?

If you don’t JOIN on the data and can sensibly retrieve and write the data as a block, then Mongo makes sense for us. Otherwise, relational is probably the way to go. Regardless, this is a good example of why it pays to start with the data model when architecting your custom apps.


Do not abuse JSON columns in SQL (specifically MySQL)

File this one away as a pet peeve, but I’ve encountered multiple places at various companies where they got themselves into trouble by not understanding how to use the JSON column type in MySQL. The common thread here is that they stored miscellaneous data that should have been stored in a set of extension tables and instead put it in a JSON column. They then inevitably found themselves needing to query records by this data. This is bad in basically every SQL database, but is especially bad in MySQL. Having to use JSON functions instead of the cleaner native SQL to do your queries is awkward, and ORM support for them is spotty at best, if you primarily use an ORM to interact with your DB in code. Also, most administration tools don’t really have a great way to visualize the data in those columns. That means that if you have a defect arising from a data problem, it’s harder to track down. But the real killer was that MySQL, unlike something like SQL Server, does not support indexing on JSON columns - which is a huge problem, especially since the data isn’t structured. One place even decided to copy the JSON column’s contents into a seperate elasticsearch DB, rather than restructure, just so they could query by the information in it. Think of the added complexity, not to mention the cost, of that solution.

In general, therefore, my advice is this - only use JSON columns in MySQL when you feel strongly that you won’t actually ever search or query by the data within. Custom user input, metadata you only ever need to display, not actually use, that sort of thing. And even in other SQL providers who might better support those columns, it’s probably worth considering if an actual NoSQL database might be a better solution.

No one has “big data”

Ok, this heading is a bit clickbait-y. SOMEONE has big data. They have to, otherwise the buzzword probably wouldn’t exist. Amazon has “big data”. Big IOT manufacturers have “big data”. What I mean is that you do not have “big data”. Almost certainly. Oh, people like to talk about how they do. They like to think that they do. But the vast majority do not.

Maybe I’m getting ahead of myself. Let’s define our terms. “Big data” is, broadly speaking, data that is so large, fast or complex that it's difficult or impossible to process using traditional methods. I grabbed that from google, and I think it it’s accurate enough for our purposes. Large refers to top-line scale - data measured in petabytes (1,000,000,000,000,000 bytes) is probably “big data”. It’s also about throughput and growth rate - “big data” usually means that traditional methods are unable to effectively write or even buffer the incoming data streams. It also usually means your data is growing exponentially.

Next let’s talk "traditional methods”. This is pretty easy to concretely define. Most of the providers developers use are traditional methods. SQL in it’s various flavors and NoSQL in most of it’s forms are all “traditional methods”.

Most people don’t have this “big data”, they just have “poorly stored data”. I’ve had conversations with multiple people who say things like “well maybe we need hadoop for this”, and they basically never need hadoop. The issue is almost always that their current storage solution is poorly architected or designed. Sometimes it’s obvious. Perhaps their indexing is covering the wrong things, or maybe it’s a situation of misuse like I talked about in the “Do not abuse MySQL JSON blobs” section. Sometimes it’s slightly more involved, but the point remains the same. These “big data” solutions are expensive in cost, infrastructure, and developer time. You probably don’t need them, you just need to look at whether your current storage solution is appropriate (SeeMySQL vs NoSQL above) and then determine what changes to make to make to your existing databases.



Previous
Previous

What should I do about my web site?

Next
Next

Better Code = Better Results