Keep an eye on the churn; finding legacy code monsters

Posted by Matthias Noback

Setting the stage: Code complexity

Code complexity is often measured by calculating the cyclomatic complexity per unit of code: roughly, count every branch in the code (each if, loop, case, and so on) and add one.

Code complexity is an indicator for several things:

  • How hard it is to understand a piece of code; a high number indicates many branches in the code. When reading the code, a programmer has to keep track of all those branches, in order to understand all the different ways in which the code can work.
  • How hard it is to test that piece of code; a high number indicates many branches in the code, and in order to fully test the piece of code, all those branches need to be covered separately.

In both cases, high code complexity is a really bad thing. So, in general, we always strive for low code complexity. Unfortunately, many projects that you'll inherit ("legacy projects") will contain code that has high code complexity, and no tests. A common hypothesis is that high code complexity arises from a lack of tests. At the same time, it's really hard to write tests for code with high complexity, so this is a situation that is really hard to get out of.
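As a minimal illustration (a made-up function, not taken from any particular project), each decision point adds one to the count:

```php
<?php

// Cyclomatic complexity = number of decision points + 1.
// This function has a complexity of 4: the base path, plus one
// `if`, one ternary, and one more `if`. Fully testing it therefore
// requires at least four test cases, one per branch.
function shippingCost(float $weight, bool $express): float
{
    if ($weight <= 0.0) {                // +1
        throw new InvalidArgumentException('Weight must be positive');
    }

    $cost = $weight < 1.0 ? 5.0 : 8.0;   // +1 (a ternary is a branch too)

    if ($express) {                      // +1
        $cost += 10.0;
    }

    return $cost;
}

echo shippingCost(2.0, true); // 18
```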

If you're curious about the cyclomatic complexity of your PHP project, phploc gives you a nice summary, e.g.

Complexity
  Cyclomatic Complexity / LLOC                    0.30
  Cyclomatic Complexity / Number of Methods       3.03

Setting the stage: Churn

Code complexity doesn't always have to be a big problem. If a class has high code complexity, but you never have to touch it, there's no problem at all. Just leave it there. It works, you just don't touch it. Of course it's not ideal; you'd like to feel free to touch any code in your code base. But since you don't have to, there's no real maintainability issue. This bad class doesn't cost you much.

What's really dangerous for a project is when a class with high code complexity often needs to be modified. Every change will be dangerous. There are no tests, so there's no safety net against regressions. Having no tests also means that the behavior of the class hasn't been specified anywhere. This makes it hard to tell whether a piece of weird code is "a bug or a feature". Some client of this code may be relying on the behavior. In short, it will be dangerous to modify things, and really hard to fix things. That means touching this class is costly: it takes you hours to read the code, figure out the right places to make the change, verify that the application didn't break in unexpected ways, and so on.

Michael Feathers introduced the word "churn" for the change rate of files in a project. Churn gets its own number, just like code complexity. Given the assumption that each class is defined in its own file, and that every time a class changes, you create a new commit for it in your version control software, a convenient way to quantify "churn" for a class would be to simply count the number of commits that touch the file containing it.
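With Git as the version control system, this count can be approximated with a small pipeline (the src/ path is just an example):

```shell
# Count how often each file under src/ was touched by a commit,
# most-changed first. This is a rough "churn" number per file.
git log --name-only --pretty=format: -- src/ \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10
```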

Thinking about version control: there is much more valuable information that you could derive from your project's history. Take a look at the book "Your Code as a Crime Scene" by Adam Tornhill for more ideas and suggestions.

Code insight

Combining these two metrics, code complexity and churn, we could get some interesting insights about our code base. We could easily spot the classes/files which often need to be modified, but are also very hard to modify, because they have a high code complexity. Since the metrics themselves are very easy to calculate (there's tooling for both), it should be easy to do this.

Since tooling would have to be specific for the programming language and version control system used, I can't point you to something that always works, but for PHP and Git, there's churn-php. You can install it as a Composer dependency in your project, or run it in a Docker container. It will show you a table of classes with high churn and high complexity, e.g.

+-----------------------+---------------+------------+-------+
| File                  | Times Changed | Complexity | Score |
+-----------------------+---------------+------------+-------+
| src/Core/User.php     | 355           | 875        | 1     |
| src/Sales/Invoice.php | 235           | 274        | 0.234 |
+-----------------------+---------------+------------+-------+

"Times Changed" represents the value calculated for "churn", and "Complexity" is the calculated code complexity for that file. The score is something I came up with; it is the normalized distance from the line which represents "it's okay".

Plotting these classes in a graph, it would look something like this:

Plotting classes against complexity and churn axes

"It's okay"

What does that mean, "it's okay"? Well, we can always deal with a certain level of complexity, given a certain change rate. If it isn't too painful to make a change to a class, even though it needs refactoring, the refactoring effort itself will cost you more than it will save you in terms of future programming effort. You can always improve code, but since you as a programmer are expensive too, you should always ask yourself what the benefits really are.

Beneath the line are simple classes that change often (no problem), and moderately complex classes that rarely change (also no problem). Only when a class is above the line, closer to the top right corner of the graph, should we admit that it's no longer "okay".

Hypotheses, observations, advice

Besides the fact that it's cool to run the tool and find out which classes are dangerous legacy monsters, there are some more profound things about the nature of software development to consider here.

First, let me tell you my hypothesis: if you've known the project for a while, you don't need to run the tool to be able to tell me which classes are the bad ones. (Please let me know if this is/was the case for you!).

Then, I suggest you also read Sandi Metz' article "Breaking up the behemoth" on this topic. Sandi suspects that these "unwanted outliers" are not only way larger than other classes, but also represent core concepts in the domain. This was initially quite surprising to me. If a class represents a core domain concept, I'd assume you wouldn't let it end up with such poor code quality. But it makes sense: as desired behavior becomes clearer, as the team gains deeper insights into the domain, related classes need to be modified as well. But as you know, change over time requires a reconsideration of the larger design, which may not always actually take place. Meanwhile these central classes become more and more depended upon, which makes them harder to change without breaking all their clients.

Talk with your team to figure out how to fight your legacy monsters. Involve management as well, and explain how defeating these monsters will help you work faster in the near future. How to do it? As Sandi puts it in the above-mentioned article:

A 5,000 line class exerts a gravitational pull that makes it hard to imagine creating a set of 10 line helper classes to meet a new requirement. Make new classes anyway. The way to get the outliers back on the green line where they belong is to resist putting more code in objects that are already too large. Make small objects, and over time, the big ones will disappear.


Simple CQRS - reduce coupling, allow the model(s) to evolve

Posted by Matthias Noback

CQRS - not a complicated thing

CQRS has some reputation issues. Mainly, people will feel that it's too complicated to apply in their current projects. It will often be considered over-engineering. I think CQRS is simply misunderstood, which is the reason many people will not choose it as a design technique. One of the common misconceptions is that CQRS always goes together with event sourcing, which is indeed more costly and risky to implement.

CQRS alone simply means that you're making a distinction between a model that is used for changing state, and a model that is used for querying state. In fact, there's often one model that accepts "write" operations (called "write model" or "command model") and multiple models that can be used to "read" information from (called "read models", or "query models").

Most projects out there don't use CQRS, since they combine the write and read operations in one model. Whether you use the data mapper or active record pattern, you'll often have one object ("entity") for each domain concept. This object can be created, it has methods that allow for state modifications, and it has methods that will give you information about the object's state.

All the legacy projects I've encountered so far use this style of storing and retrieving state. It comes with a certain programming style that is not quite beneficial for the maintainability of the application. When writing a new feature for such an application you will start with getting all the ingredients in place. You need some information, and you need some dependencies. If you're unlucky, you still fetch your dependencies from some global static place like good old Zend_Registry or sfContext. Equally bad, you'll fetch your information from the central database. This database contains tables with dozens of columns. It's the single source of truth. "Where is this piece of information? Ah, in table XYZ". So now you use the ORM to fetch a record and automatically turn it into a useful entity for you.

Except... the entity you get isn't useful at all, since it doesn't give you the information you need — it doesn't answer your specific question. It gives you either not enough or way too much information. Sometimes your entity comes with a nifty feature to load more entities (e.g. XYZ->getABCs()), which may help you collect some more information. But that will issue another database query, and again, will load not enough or way more than you need. And so on, and so on.

Reduce the level of coupling

You should realize that by fetching all this information, you're introducing coupling issues to your code. By loading all these classes, by using all these methods, by relying on all these fields, you're increasing the contact surface of your code with the rest of the application. It's really the same issue as fetching dependencies instead of having them injected: you're reaching out, and you start relying on parts of the application you shouldn't even be worrying about. These coupling issues will come back to haunt you in a couple of years, when you want to replace or upgrade dependencies and have to make changes everywhere. Start out with dependency injection and this will be much easier for you.

The same goes for your model. If multiple parts of the application start relying on these entities, it will be more and more difficult to change them, to let the model evolve. This is reminiscent of the "Stable dependencies principle" (one of the "Package design principles"): if a lot of packages depend on a package, this package becomes hard to change, because a change will break all those dependents. If multiple clients depend on a number of entities, these will be very hard to change too, because you will break all the clients.

Being hard to change is not good for your model. The model should be adaptable in the first place, since it represents a core concept of the domain you're working in. And this domain is by nature something that changes, and evolves. You want to be able to improve the quality and usefulness of your domain model whenever necessary. Hence, it's a good idea not to let all those clients depend on one and the same model.

The solution is to introduce multiple models, one for each of those clients. Exactly as CQRS proposes: one write model, which you can use to change the state of your domain objects, and multiple read models, one for each client. That way there will be less coupling, and it will be easy to evolve the model in any direction. You can safely add, remove, or transform any piece of information in any one of those read models.

A recipe for read models

To break with a habit of retrieving all the information from one model, you can follow these steps. You'll end up with a nice set of read models, one for every question a client may have.

  1. Whenever you feel the need to fetch some information, or get an answer to a specific question, and you're tempted to go to the repository and load one or more entities (and related entities), stop and think:
  2. Rephrase the answer you're looking for as an object. For example, let's say the question is: "Which products did the customer order so far? I need the ID and name of each product, and the date on which the customer ordered it." A class for such an answer would look like:

    final class ProductTheCustomerOrdered
    {
        private $productId;
        private $productName;
        private $dateOfOrder;

        public function __construct(int $productId, string $productName, string $dateOfOrder)
        {
            $this->productId = $productId;
            $this->productName = $productName;
            $this->dateOfOrder = $dateOfOrder;
        }
    }
    
  3. Define an interface with a single method, which phrases your question, and shows what type of answer you're looking for:

    interface OrderHistory
    {
        /**
         * @return ProductTheCustomerOrdered[]
         */
        public function productsTheCustomerOrdered(CustomerId $customerId);
    }
    
  4. Rely on this interface in your code (through constructor injection of course).

    final class SomeService
    {
        private $orderHistory;
    
        public function __construct(OrderHistory $orderHistory)
        {
            $this->orderHistory = $orderHistory;
        }
    
        public function someMethod(CustomerId $customerId)
        {
            $products = $this->orderHistory->productsTheCustomerOrdered($customerId);
    
            // do something with $products
        }
    }
    
  5. Provide an implementation for the interface, which gets the right data from the database, then returns those objects.

    final class OrderHistorySql implements OrderHistory
    {
        public function productsTheCustomerOrdered(CustomerId $customerId)
        {
            // perform some smart query, fetching only what you need
            $records = ...;
    
            return array_map(function($record) {
                return new ProductTheCustomerOrdered(...);
            }, $records);
        }
    }
    

It would be smart to provide an integration test for the implementation class, proving that your assumptions about any third-party code involved in fetching the data and creating the objects are correct.
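Such an integration test doesn't need much infrastructure. Below is a hedged, self-contained sketch using an in-memory SQLite database. The schema (orders, order_lines) is an assumption for illustration, a plain string stands in for the CustomerId value object, and in a real project the classes would of course live in their own files and the test would run against your actual database platform:

```php
<?php

// Simplified stand-ins for the classes from the recipe above; the
// schema and column names below are assumptions for illustration.
final class ProductTheCustomerOrdered
{
    public $productId;
    public $productName;
    public $dateOfOrder;

    public function __construct(int $productId, string $productName, string $dateOfOrder)
    {
        $this->productId = $productId;
        $this->productName = $productName;
        $this->dateOfOrder = $dateOfOrder;
    }
}

final class OrderHistorySql
{
    private $connection;

    public function __construct(PDO $connection)
    {
        $this->connection = $connection;
    }

    /** @return ProductTheCustomerOrdered[] */
    public function productsTheCustomerOrdered(string $customerId): array
    {
        $statement = $this->connection->prepare(
            'SELECT l.product_id, l.product_name, o.ordered_at
             FROM orders o
             JOIN order_lines l ON l.order_id = o.id
             WHERE o.customer_id = :customerId'
        );
        $statement->execute(['customerId' => $customerId]);

        return array_map(function (array $record) {
            return new ProductTheCustomerOrdered(
                (int) $record['product_id'],
                $record['product_name'],
                $record['ordered_at']
            );
        }, $statement->fetchAll(PDO::FETCH_ASSOC));
    }
}

// The actual integration test: create the schema, insert a known
// record, and verify that the repository maps it correctly.
$connection = new PDO('sqlite::memory:');
$connection->exec('CREATE TABLE orders (id INTEGER, customer_id TEXT, ordered_at TEXT)');
$connection->exec('CREATE TABLE order_lines (order_id INTEGER, product_id INTEGER, product_name TEXT)');
$connection->exec("INSERT INTO orders VALUES (1, '42', '2017-09-15')");
$connection->exec("INSERT INTO order_lines VALUES (1, 7, 'Widget')");

$products = (new OrderHistorySql($connection))->productsTheCustomerOrdered('42');

assert(count($products) === 1);
assert($products[0]->productName === 'Widget');
```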

The ProductTheCustomerOrdered object itself can become an attractor of behavior; you can add useful methods to it that help you gain more insight into the data, or that protect you from using the data in the wrong way (basic object encapsulation, in other words).
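For example (a hypothetical method, assuming the date is stored as a Y-m-d string):

```php
<?php

final class ProductTheCustomerOrdered
{
    private $productId;
    private $productName;
    private $dateOfOrder;

    public function __construct(int $productId, string $productName, string $dateOfOrder)
    {
        $this->productId = $productId;
        $this->productName = $productName;
        $this->dateOfOrder = $dateOfOrder;
    }

    // A hypothetical behavioral method: instead of exposing the raw
    // date string, answer a question clients actually have.
    public function wasOrderedInYear(int $year): bool
    {
        return (int) substr($this->dateOfOrder, 0, 4) === $year;
    }
}

$product = new ProductTheCustomerOrdered(7, 'Widget', '2017-09-15');
var_dump($product->wasOrderedInYear(2017)); // bool(true)
```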

Dealing with inconsistent data

A common issue with those large legacy databases is that data is simply inconsistent. A particular field in the database may contain a date, or null, or 1970-01-01 00:00:00. And on top of that, the date has a time stamp, which isn't really needed. You can smooth out all of these little inconsistencies, all this craziness, inside the implementation class. As long as, in our example, productsTheCustomerOrdered() returns nice, well-behaved ProductTheCustomerOrdered objects, all is good.
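A hedged sketch of such smoothing, assuming (as in the example above) that the legacy column may hold NULL, the epoch sentinel, or a real timestamp:

```php
<?php

// Normalize a messy legacy date column to either a clean Y-m-d
// string or null. Treating '1970-01-01 00:00:00' as a sentinel is
// an assumption based on the example in the text.
function normalizeOrderDate(?string $rawDate): ?string
{
    if ($rawDate === null || $rawDate === '1970-01-01 00:00:00') {
        // The epoch sentinel means "unknown", just like NULL.
        return null;
    }

    // Strip the time part the read model doesn't need.
    return substr($rawDate, 0, 10);
}

var_dump(normalizeOrderDate('2017-09-15 13:37:00')); // string(10) "2017-09-15"
var_dump(normalizeOrderDate('1970-01-01 00:00:00')); // NULL
```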

The read model and its repository form an Anti-Corruption Layer between the new client code and the old legacy data. The read model protects the client code from having to deal with all the quirks of the legacy data and code. For more information on this DDD concept, take a look at Eric Evans' article "Getting started with DDD when surrounded by legacy systems" (PDF).

Conclusion

Considering how simple the steps described above really are, CQRS (without event sourcing) is certainly within reach - in any project, no matter how crappy or old the legacy is. You might still consider using event sourcing as well, for one or more write models. That would allow you to further optimize the performance of your read models. Instead of assembling the answer using some smart (and complicated, maybe slow) query, you can build up your read model over time and let it already have the answer.


The release of "Microservices for everyone"

Posted by Matthias Noback

Today I'm happy to release my latest book, "Microservices for everyone"! 90% of it was done in July, but you know what happens with almost-finished projects: they remain almost-finished for a long time. I did some useful things in the meantime though, like preparing a print edition of the book.

Foreword

Finally, and I'm very proud of this, I received Vaughn Vernon's foreword for the book. You may know Vaughn as the author of "Implementing Domain-Driven Design". It made a lot of sense to ask him to write the foreword, since many of the ideas behind microservice architecture can be traced back to messaging and integration patterns, and to Domain-Driven Design. Vaughn is an expert in all of these topics, and he delivered an excellent foreword.

[…] Microservices, if used responsibly and properly, will tend to solve these three big problems. Teams can be made smaller and more fluid, allowing each developer to work on multiple microservices. This is also a significant investment in developer knowledge of the overall business. There will be one repository and one database (or database schema) for each microservice, having little to no impact on other microservices and teams. Further, the cloud can be an important move toward the successful on-time deployment and resiliency of microservices, and even better support the specific infrastructural and technical mechanism needs of each microservice. Spin up a new server or cluster in no time, and keep them running with no new administrative burden on your organization.

— Vaughn Vernon

PHP

I've been asked: "Is this book written for PHP developers?" There are two answers:

  1. No. The code samples in this book and in the accompanying source code repositories are written in PHP, but you don't need to know PHP to be able to understand what's going on. In fact, the main text of the book doesn't make any assumptions about the languages and tools you use.
  2. Yes. My secret mission was to make PHP developers more familiar with the concept of asynchronous service integration and related topics like eventual consistency.

So basically, I expect this book to be interesting for everyone, no matter what programming language they use or prefer.

Where to buy

Besides the completed e-book (EPUB, MOBI and PDF), you can now order physical copies on Amazon. You can also order the book in non-US stores, e.g. in the UK and Germany.

To celebrate the release, I offer you a discount on the e-book (use this link). I'd love to do the same for Amazon purchases, but the publishing platform doesn't offer this functionality.

What people say about this book

Release time is not a time to be modest, so here's a number of comments about this book which make me very happy:

"This book showed me what I needed to know to actually start playing with microservices. It had the right blend of why and how without too much focus on implementation details. Highly recommended!" — Beau Simensen

"Read, Learn, Succeed! A comprehensive and really complete guide for creating microservices from scratch! Matthias can abstract the topic complexity in this book that is really, for everyone." — Christophe Willemsen

"As Microservices become more and more popular each day, it's important for professional developers to familiarize themselves with the basic concepts. As with his previous books, Matthias explains these concepts well, in a clear and concise manner. His examples are useful, and the reader is presented with a solid introduction to using Microservices. Highly recommended!" — Mark Badolato
