Relying on the database to validate your data

Matthias Noback

July 29, 2020

One of my pet peeves is using the database schema to validate data.

Several ways in which this normally happens:

Specifying a column as “required”, e.g. email VARCHAR(255) NOT NULL
Adding an index to force column values to be unique (e.g. CREATE UNIQUE INDEX email_idx ON users(email))
Adding an index for foreign key integrity, including cascading deletes, etc.

Yes, I want data integrity too. No, I don’t want to rely on the database for that.

I find it surprising that some years ago we decided that we shouldn’t write application logic in our database (e.g. stored procedures) because:

They are not written in the same language as the rest of the project.
They are not version-controlled (unless you jump through some extra hoops).
They are not testable in isolation; you need an actual database to run them.
They are “magic” because they are triggered implicitly.
The code is vendor-specific.

Well, anyway, it’s clear that we don’t want them.

Yet, many of these concerns apply to validation at the database-level as well. Except, with stored procedures we actually delegate some work to the database. With validation, we usually duplicate the work. We first validate in the code that a value has been provided and show a form error if it hasn’t. Then we use an assertion in our model object to verify that a value is not null. Then we save our object to the database, which again verifies that the value is not null.

Why do we do this? Maybe because we want symmetry? The model property is not nullable, so the column it’s mapped to should also not be nullable. Maybe because we are more confident about the database than about our code? In the code a validation might be skipped somehow, but we’ll always have that extra validation once the data ends up in the database.

Non-nullability

I think we don’t need the symmetry, nor the safe-guard. Instead, we should put more trust in our application code and make sure everything is handled there. No need for the “double bookkeeping” where you try to keep the nullability of your model’s properties in sync with the nullability of the database columns. In my experience often a model property is nullable, but the database column isn’t, or vice versa. This leads to the application blowing up for a nullability discrepancy between the code and the database. We can reduce this risk by stopping the double bookkeeping. Instead of defining non-nullability on database columns, let’s only define it in the code. We always have to deal with non-nullability in the code anyway, since we want validation errors instead of SQL errors. So, let’s just remove NOT NULL everywhere, hooray!

Unique indexes

Another interesting database-level validation technique is ensuring uniqueness, like in the case of the unique email address in the users table. Apparently, we don’t trust the application here either, and leave the verification of an important rule to the database (we’ll talk later about how important it really is). Given that the database supports a uniqueness check by means of adding an index we jump in and add it to every column we feel should be unique. We then realize that we don’t want SQL errors when a user registers with a known email address. We want form validation errors instead. Then we introduce some more double bookkeeping so we can be user-friendly and protect our data integrity.

What if we only validate the uniqueness of the email address in the application code and not in the database? We could end up with two records in the users table that have the same email address. It’s unlikely that this will happen though, because the application always runs the uniqueness validation itself, by querying the table for an existing record with the provided email address:

if ($this->userRepository->containsUserWithEmailAddress($emailAddress)) {
    // show form error in the response
} else {
    // the data is valid
    $user = new User($emailAddress);
    $this->userRepository->save($user);
}

The only way we could end up with duplicate records is when two registration requests providing the same email address are being processed at the same time. In that case, the first containsUserWithEmailAddress() may return false during both of these requests and the call to save() would result in two records with the same email address. Again, it’s not very likely, but it could happen. But what if it happens? We just have to make sure it has no significant impact.

I think the biggest fear when having duplicate email addresses in the database is that someone might be able to log in on behalf of someone else. You’d be able to sign up with that person’s email address, while providing your own password, and log in with it. This isn’t a problem at all if you always use the user ID as the unique identification for a user, instead of their email address. You wouldn’t be able to impersonate the other user; you would impersonate yourself. That said, it would be smart to always verify a user’s email address before they can log in with it. That’s a different story though, and it doesn’t require uniqueness in the database.

Even authentication with duplicate email addresses isn’t a problem. Given user A and user B who have a different password, when user B logs in using their password, they will log in on their own account, not on user A’s account. Well, it could go wrong if the authentication code is really bad.

$users = $this->userRepository->getAllWithEmailAddress($emailAddress);
$hashedPassword = password_hash($password);

$loggedInUser = null;

foreach ($users as $user) {
    if ($user->hashedPassword() === $hashedPassword) {
        /*
         * The password provided matches the current $user, but we always
         * set the first user in the array of users to be the logged in
         * user.
         */
        $loggedInUser = $users[array_key_first()];
    }
}

I think most authentication systems wouldn’t allow for this kind of fuzziness. They will fetch only one user from the repository based on the username (email address) and them compare the password with the one that was just provided. Another case where things might be unpredictable is when authentication consists of a query that matches a user record on both the email address and the password at once:

$loggedInUser = $this->userRepository->findOneWithEmailAndPassword(
    $emailAddress,
    password_hash($password)
);

In that case, user B will never be able to impersonate user A. That is, as long as their passwords aren’t accidentally the same. In that case, what’s the point of trying to impersonate user A; you already have their password. On top of that, this code wouldn’t even work if password hashing doesn’t use a salt string.

By going through a single example like this, we can’t really conclude that you will never need a uniqueness constraint in your database. To me it is reassuring though: even a security-sensitive situation like this doesn’t really need uniqueness at the database-level. However, we should admit that “You never need unique indexes in your database” is really a hypothesis. I’m sure we can refine it, and we should keep looking for cases where it’s impossible to solve the issue in another way.

Unique IDs

One such case where we do need uniqueness is with IDs. These would technically count as unique indexes too, but they are a special type of them. They define how you can retrieve a record from the database without any ambiguity, and here we do need true uniqueness. It would be very bad for our data integrity if we could save a record with a given ID and when providing the same ID we’d get another record back. Anyway, we can’t even remove the unique indexes for ID columns, as the database itself realizes the importance of them.

Foreign key constraints

Something else that some databases can do is define foreign key constraints. You would say this greatly adds to the level of data integrity in the database. I claim that it doesn’t. But maybe I’m too Domain-Driven Design-infested.

I don’t always create aggregates but when I do I stick to the aggregate rules. One of them is: refer to other aggregates by their ID. Another one is: a transaction should only save changes to a single aggregate (type).

We never traverse from aggregate to aggregate, we never save different types of aggregates in the same transaction. So we also never let the database look further than just a single (type of) aggregate. Mapping these concepts to database schemas (and I’m sure you can map them in different ways too): the order table can have a user_id column that refers to the id column of the users table, but it’s not directly tied to it. Just like you can have a UserId value object in your Order aggregate which represents a UserId value object of a User aggregate. But it can never verify that it actually refers to a User aggregate itself. The most common solution is for an application service to establish this link. Something that I usually do is:

$user = $this->userRepository->getById(UserId::fromString($userId));

$order = OrderCreate(
    $user->userId(),
    // ...
);
$this->orderRepository->save($order);

So again, just like with the unique email address validation, the application code can validate things that require the help of a repository. Of course, the repository effectively does a look-up in the database too. But the big advantage here is that we won’t have to catch SQL errors before we can find out that a related aggregate doesn’t exist.

An even bigger advantage is that the changes we persist to the database have no other effect than updating a row in a single table. This itself leads to the “design promise” that changes to an aggregate are always self-contained. It makes the model decoupled from the database in such a way that the type of database becomes less relevant. Although we’re used to using a relational database for everything, we might as well use a document database for storing our aggregates.

From cascading deletes to explicit processes

One thing foreign key constraints are often used for is to set up cascading actions inside the database. E.g. if a user is deleted, all their orders will be deleted too. Well, that example doesn’t make sense. Actually, I think that it never makes sense. Deleting a user is often not what you want to do. It’s more like deactivate their account, remove their personal information, etc. Usually things don’t really have to be deleted. Which is why soft-delete has been invented and I think it’s generally quite useful, in particular in the context of relational databases and using them to store model data in. If yo really want to delete something, then do it, but make every step in the process explicit. This saves you from accidentally deleting more than you wanted because you made a mistake in your schema definition and have no tests verifying its correctness. It also makes every step more explicit and you can read in the code what happens after you delete a user. The recommended way to do this is by modelling these steps as a process using domain events and event subscribers.

The application will be able to implement all the validation logic itself

The biggest advantage of decoupling from the database and its schema is that you are in the position to model everything that’s important for your application’s data integrity in code. This is great, because you can define all the so-called domain invariants of the application in code too, and specify them with isolated tests that don’t require a database at all.

When a value is required you can create a form validation rule for it, and you’d always add some kind of constraint to your model object too (e.g. add an assertion that throws an exception if something isn’t right).
The uniqueness of IDs van be specified with a contract test for repositories. I.e. you can verify that you get the expected object from the repository after saving it. And you verify that saving a new object with an existing ID causes an exception.
Checking that a referenced ID exists can also be done using a repository. Loading the object with the given ID will cause an exception if the ID is unknown.
The uniqueness of other values (like an email address) can be handled with a repository check. Does an object with similar data already exist? Also make sure to verify that uniqueness is really required, and make sure to consult non-programmers too.

This is actually a pretty common problem with validation: we invent validation rules and forget to verify them, or inquire with people in other roles if they have special requirements for the data. At the same time, we should listen to our own developer intuition as well, because we often know what missing values will cause problems later on, or we’ve seen similar situations before and know what and how to validate.