When I want to get a very rough idea of the quality of both code and structure of a PHP code base, I like to run phploc on it. This is a tool created by Sebastian Bergmann for (I assume) exactly this purpose. It produces the following kind of output:
Directories                                   3
Files                                        10

Size
  Lines of Code (LOC)                      1882
  Comment Lines of Code (CLOC)              255 (13.55%)
  Non-Comment Lines of Code (NCLOC)        1627 (86.45%)
  Logical Lines of Code (LLOC)              377 (20.03%)
    Classes                                 351 (93.10%)
      Average Class Length                   35
        Minimum Class Length                  0
        Maximum Class Length                172
      Average Method Length                   2
        Minimum Method Length                 1
        Maximum Method Length               117
    Functions                                 0 (0.00%)
      Average Function Length                 0
    Not in classes or functions              26 (6.90%)

Cyclomatic Complexity
  Average Complexity per LLOC              0.49
  Average Complexity per Class            19.60
    Minimum Class Complexity               1.00
    Maximum Class Complexity             139.00
  Average Complexity per Method            2.43
    Minimum Method Complexity              1.00
    Maximum Method Complexity             96.00

Dependencies
  Global Accesses                             0
    Global Constants                          0 (0.00%)
    Global Variables                          0 (0.00%)
    Super-Global Variables                    0 (0.00%)
  Attribute Accesses                         85
    Non-Static                               85 (100.00%)
    Static                                    0 (0.00%)
  Method Calls                              280
    Non-Static                              276 (98.57%)
    Static                                    4 (1.43%)

Structure
  Namespaces                                  3
  Interfaces                                  1
  Traits                                      0
  Classes                                     9
    Abstract Classes                          0 (0.00%)
    Concrete Classes                          9 (100.00%)
  Methods                                   130
    Scope
      Non-Static Methods                    130 (100.00%)
      Static Methods                          0 (0.00%)
    Visibility
      Public Methods                        103 (79.23%)
      Non-Public Methods                     27 (20.77%)
  Functions                                   0
    Named Functions                           0 (0.00%)
    Anonymous Functions                       0 (0.00%)
  Constants                                   0
    Global Constants                          0 (0.00%)
    Class Constants                           0 (0.00%)
These numbers are statistics, and as you know, they can be used to tell the biggest lies. Without the context of the actual code, you can interpret them in any way you like, so you need to be careful when drawing conclusions from them. However, in my experience these numbers are often quite good indicators of the overall code quality as well as the structural quality of a project.
For example, lots of static method calls often indicate a design problem, and the same goes for the existence of many abstract classes. In this article I'd like to point out what some of these numbers mean to me, how I use them to get some understanding of the situation, and how to improve it.
  Lines of Code (LOC)                      1882
  Comment Lines of Code (CLOC)              255 (13.55%)
  Non-Comment Lines of Code (NCLOC)        1627 (86.45%)
  Logical Lines of Code (LLOC)              377 (20.03%)
The raw counts (Lines of Code, Comment Lines of Code versus Non-Comment Lines of Code, and Logical Lines of Code) usually aren't that interesting. The total number of lines gives me some idea of the size of the project, compared to other projects I've worked with in the past. However, the length of classes and methods is important. A large value for Maximum Class Length means you have at least one class that's way too big. From the Average Class Length value you can deduce whether there is just one big class, or many of them.
    Classes                                 351 (93.10%)
      Average Class Length                   35
        Minimum Class Length                  0
        Maximum Class Length                172
      Average Method Length                   2
        Minimum Method Length                 1
        Maximum Method Length               117
It turns out that these classes are often related to core domain concepts and regularly need to be modified to reflect new domain insights. At the same time, because they are so big, it'll be hard to change anything about them without breaking things. So this gives you a hint about where to start refactoring (see also: Keep an eye on the churn; finding legacy code monsters).
Roughly speaking, we would expect to see the Maximum Class Length and Maximum Method Length decrease in the long run. At the same time, the Average Class Length and Average Method Length would also decrease, but not as fast.
It's to be expected that breaking down large classes is easier than breaking down large methods. Classes usually become large because developers pile up code in the same place, and it turns out to be relatively easy to "unpile" the code and move parts of it to another class. Moving code out of methods is more work, since you'd usually have to rearrange statements first, which is dangerous.
Important indicators of design problems are the following metrics: the number of lines inside Functions (as opposed to methods), and the number of lines that are Not in classes or functions, that is, inside the global space.
    Functions                                 0 (0.00%)
      Average Function Length                 0
    Not in classes or functions              26 (6.90%)
Code that lives in functions is often used for generic operations that aren't part of the language's standard library, also known as utility code. In these cases it's better to use static methods so you can at least use class auto-loading. It would be even better to wrap the data inside value objects and then transform the original functions into value object methods.
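To make that progression a bit more concrete, here is a minimal sketch. The function format_money() and the classes MoneyFormatter and Money are hypothetical names that I made up for this example; it only handles non-negative amounts.

```php
<?php

// Step 0 (hypothetical legacy utility function, can't be auto-loaded):
// function format_money(int $cents, string $currency): string { ... }

// Step 1: a static method on a utility class can be found by the class auto-loader.
final class MoneyFormatter
{
    public static function format(int $cents, string $currency): string
    {
        return sprintf('%s %d.%02d', $currency, intdiv($cents, 100), $cents % 100);
    }
}

// Step 2: wrap the data in a value object and turn the function into a method.
final class Money
{
    private int $cents;
    private string $currency;

    public function __construct(int $cents, string $currency)
    {
        $this->cents = $cents;
        $this->currency = $currency;
    }

    public function format(): string
    {
        return sprintf('%s %d.%02d', $this->currency, intdiv($this->cents, 100), $this->cents % 100);
    }
}

echo MoneyFormatter::format(1250, 'EUR'), PHP_EOL; // static utility call
echo (new Money(1250, 'EUR'))->format(), PHP_EOL;  // value object method
```

In the second step the data and the behavior that belongs to it travel together, so callers no longer have to pass around loose integers and strings.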
In some legacy code bases, the situation is far worse: functions are used in the same way as a regular service object would be, except that functions don't have state or dependencies, so they import global variables and fetch their dependencies from some global static place. Generally speaking, we wouldn't want code to be in functions, to prevent design issues like this.
The same goes for code that is Not in classes or functions. This is script code, executed from top to bottom. You'll find this code in front controllers (e.g. index.php), in cron scripts, migration scripts, command-line/development tooling scripts, etc. All of this code is practically unmaintainable, since it's not part of a structural element that allows renaming, moving, extracting, etc. So just like the amount of code in functions, the amount of code not in functions or classes should be as small as possible.
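One way to shrink the amount of script code is to move the body of a script into a class and leave only the wiring behind. The following is just a sketch of that idea; CleanupCommand and its --dry-run flag are hypothetical and not taken from a real project.

```php
<?php

// Hypothetical example: the logic of a cleanup cron script, moved into a class.
final class CleanupCommand
{
    /**
     * @param string[] $arguments Command-line arguments, without the script name
     * @return int Exit code, 0 meaning success
     */
    public function run(array $arguments): int
    {
        $dryRun = in_array('--dry-run', $arguments, true);

        echo $dryRun ? "Would remove expired sessions\n" : "Removing expired sessions\n";

        return 0;
    }
}

// What remains of the script itself is only the wiring. In a real script this
// could be: exit((new CleanupCommand())->run(array_slice($argv, 1)));
$exitCode = (new CleanupCommand())->run(['--dry-run']);
```

The class can now be renamed, moved, tested and refactored with regular tooling, while the lines that are still "not in classes or functions" are reduced to one or two.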
You could write very complex code in just a few lines, but roughly speaking, large methods will have a high complexity, and small methods will have a low complexity. So I usually don't worry too much about the Cyclomatic Complexity section in the phploc results. Reducing the size of classes should reduce class complexity. Once you're in a position to rewrite some of the very complex methods, you will see the Maximum Method Complexity drop as well.
  Average Complexity per LLOC              0.49
  Average Complexity per Class            19.60
    Minimum Class Complexity               1.00
    Maximum Class Complexity             139.00
  Average Complexity per Method            2.43
    Minimum Method Complexity              1.00
    Maximum Method Complexity             96.00
I don't know enough about cyclomatic complexity to know if phploc implements the official algorithm, but looking at the code, it performs the same calculation that I manually apply sometimes: the complexity of a method corresponds to how many times it uses control-flow constructs (conditions, loops, boolean operators like and) and ternary expressions.
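To illustrate this kind of counting, here is a rough sketch based on PHP's tokenizer. It is not phploc's implementation: the function approximateComplexity() and its list of decision tokens are my own assumptions, and starting the count at 1 is an assumption as well (it matches the Minimum Method Complexity of 1.00 shown above). It also naively treats every ? token as a ternary, so nullable type declarations like ?int would be miscounted.

```php
<?php

// Rough sketch, NOT phploc's code: count decision points in a piece of PHP code.
function approximateComplexity(string $code): int
{
    // Assumed list of tokens that introduce an extra path through the code.
    $decisionTokens = [
        T_IF, T_ELSEIF, T_FOR, T_FOREACH, T_WHILE, T_CASE, T_CATCH,
        T_BOOLEAN_AND, T_BOOLEAN_OR, T_LOGICAL_AND, T_LOGICAL_OR,
    ];

    $complexity = 1; // code without any branches has a single path

    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Named tokens are arrays: [token id, text, line number]
            if (in_array($token[0], $decisionTokens, true)) {
                $complexity++;
            }
        } elseif ($token === '?') {
            // Single-character tokens are plain strings; treat "?" as a ternary
            $complexity++;
        }
    }

    return $complexity;
}

$snippet = <<<'PHP'
<?php
function shippingCosts(int $weight, bool $express): int
{
    if ($weight > 10 && $express) {
        return 25;
    }

    return $express ? 15 : 5;
}
PHP;

echo approximateComplexity($snippet), PHP_EOL;
```

For the example snippet the count is 4: the starting value of 1, plus one each for the if, the && and the ternary expression.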
The number of Global Accesses is interesting. Access to Global Constants means that code uses a constant defined using a call to define(). These global constants often contain configuration values like database credentials, which shouldn't really be globally defined, but injected as constructor arguments of a service that actually needs them.
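As a small sketch of what that could look like: the constants DB_HOST and DB_NAME and the class ConnectionFactory below are hypothetical, and the class only builds a DSN string instead of opening a real connection.

```php
<?php

// Before (hypothetical legacy code): configuration lives in global constants.
// define('DB_HOST', 'localhost');
// define('DB_NAME', 'shop');

// After: the service that needs the values receives them as constructor arguments.
final class ConnectionFactory
{
    private string $host;
    private string $databaseName;

    public function __construct(string $host, string $databaseName)
    {
        $this->host = $host;
        $this->databaseName = $databaseName;
    }

    public function dsn(): string
    {
        return sprintf('mysql:host=%s;dbname=%s', $this->host, $this->databaseName);
    }
}

// The values are provided once, where the service gets instantiated
// (hard-coded here to keep the example self-contained).
$factory = new ConnectionFactory('localhost', 'shop');
echo $factory->dsn(), PHP_EOL;
```

The class no longer depends on anything having been defined globally, which also makes it trivial to instantiate with different values in a test.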
Access to Global Variables means that code uses the global keyword to import variables from the global scope. This is sometimes used for toggles like the environment in which the application runs (e.g. global $environment), or even worse: to import dependencies that have been defined in the global space (e.g. global $db). A high number is a bad sign. In fact, you should aim for this number to be 0.
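The following sketch contrasts the two styles. The Connection interface, the function count_users_legacy() and the class UserCounter are all hypothetical; they only exist to show the difference between importing a dependency and having it injected.

```php
<?php

// Hypothetical connection abstraction, just enough for this example.
interface Connection
{
    public function fetchOne(string $sql): ?string;
}

// Before: the dependency is imported from the global scope.
function count_users_legacy(): int
{
    global $db; // this use of the global keyword is what gets counted

    return (int) $db->fetchOne('SELECT COUNT(*) FROM users');
}

// After: the dependency is declared explicitly and injected.
final class UserCounter
{
    private Connection $connection;

    public function __construct(Connection $connection)
    {
        $this->connection = $connection;
    }

    public function count(): int
    {
        return (int) $this->connection->fetchOne('SELECT COUNT(*) FROM users');
    }
}

// A stub connection is enough to use the class, no global setup needed.
$stub = new class implements Connection {
    public function fetchOne(string $sql): ?string
    {
        return '3';
    }
};

echo (new UserCounter($stub))->count(), PHP_EOL;
```

The legacy function only works if somebody has prepared a $db variable in the global scope first; the class states what it needs in its constructor signature.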
Access to Super-Global Variables means that code uses data from $_SESSION, etc. directly. This is often a sign that data doesn't get passed properly as method arguments, but instead gets retrieved on the spot, whenever it's needed. When this happens, it often means that domain logic is mixed with infrastructural concerns.
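A minimal sketch of the alternative, where super-globals are only read at the edge and their data is passed on as arguments. The names Greeter and handle_request(), and the array keys user_name and lang, are hypothetical.

```php
<?php

// Domain logic: knows nothing about sessions or requests.
final class Greeter
{
    public function greet(string $userName, string $language): string
    {
        return $language === 'nl' ? "Hallo, $userName" : "Hello, $userName";
    }
}

// Infrastructure edge (think of a controller): receives the raw arrays
// and passes plain values on as method arguments.
function handle_request(array $session, array $query): string
{
    $userName = $session['user_name'] ?? 'guest';
    $language = $query['lang'] ?? 'en';

    return (new Greeter())->greet($userName, $language);
}

// In production this would be called as handle_request($_SESSION, $_GET).
echo handle_request(['user_name' => 'Alice'], ['lang' => 'nl']), PHP_EOL;
```

Because Greeter only works with its arguments, it can be tested without faking a session or a request.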
Considering all of these numbers together gives you a good indication of how tied your code is to the surrounding environment. This itself tells you how portable it is, and thus how easy it is to refactor it. It also shows how easy it will be to test this code in isolation, without preparing the global scope for all the accesses to it.
  Global Accesses                             0
    Global Constants                          0 (0.00%)
    Global Variables                          0 (0.00%)
    Super-Global Variables                    0 (0.00%)
Attribute Accesses gives us some clue about how properties are used, and whether those are defined as static or as instance properties. Instance properties are always the better option in terms of object design. However, in some cases static properties may be the most pragmatic solution.
  Attribute Accesses                         85
    Non-Static                               85 (100.00%)
    Static                                    0 (0.00%)
For method calls the same applies: static method calls lead to a less flexible design than instance method calls.
  Method Calls                              280
    Non-Static                              276 (98.57%)
    Static                                    4 (1.43%)
Static methods are often used for methods on utility classes. In that case, the problem is not that big, although value objects can often attract that same behavior, making the cohesion between the data and its related behaviors stronger (the ability to combine state with behavior is why we are object-oriented programmers, right?).
Static methods and properties are sometimes used for static setter injection (as opposed to constructor injection) and should be avoided for reasons listed everywhere. However, in a legacy code base you may not be in full control of object instantiation, so you may actually have to make these static method calls.
One good reason for the number of static method calls to be high is that you use named constructors, which are static by definition. In this case, I'm happy for you (I talk about them in more detail in my book Object Design Style Guide). In fact, maybe there could be an extra metric for this in phploc, since these methods could be counted separately as "special constructors" (if the signature is something like static function (): self).
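For readers who haven't used named constructors before, here is a minimal sketch. The EmailAddress class and its fromString() method are hypothetical examples, not code from the analyzed project.

```php
<?php

final class EmailAddress
{
    private string $value;

    // Private constructor: instances can only be created via a named constructor.
    private function __construct(string $value)
    {
        $this->value = $value;
    }

    // Named constructor: static, and its return type is the class itself.
    public static function fromString(string $value): self
    {
        if (filter_var($value, FILTER_VALIDATE_EMAIL) === false) {
            throw new InvalidArgumentException('Invalid email address: ' . $value);
        }

        return new self($value);
    }

    public function asString(): string
    {
        return $this->value;
    }
}

// This is a static method call, so it would add to the Static count.
$email = EmailAddress::fromString('info@example.com');
echo $email->asString(), PHP_EOL;
```

A call like this shows up in the same Static bucket as calls to utility classes, even though it's a perfectly healthy use of a static method.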
To be continued...
See part 2.