Thursday, August 21, 2014

My gripe with "The Rule of Three"

In software development, good developers will bend over backwards to make their code as maintainable as possible. We recognize that a given piece of code will be read many more times than it is written. It may take some time to craft code that's easy to read and change, but that effort pays for itself many times over in the long run.

One of the most recognized tricks to writing maintainable code is avoiding duplication of code. If you've got two chunks of code that look almost exactly the same, that's a "code smell"--an indication that something is wrong. This doesn't always mean you need to refactor the code to eliminate the duplication, but you should at least spend a minute thinking about whether there's an appropriate abstraction you can use that would make the code easier to understand. In almost every case where I've seen two or more lines of almost-identical code, I've found that removing the duplication yields code that is easier to understand.

Mark Seemann recently blogged about some cases where removing duplicate code can be harmful. He correctly pointed out that sometimes apparent duplication is actually coincidental--there's not really an appropriate abstraction for it, and you'll just make your code more confusing and less cohesive by forcing an abstraction on it. He also argued that a developer may choose the wrong abstraction because he lacks enough use cases to know which technique is going to best accommodate future needs. So once a future need arises, the developer has to refactor again, increasing the maintenance costs.

He says:
This is, in my experience, the most important reason to follow the Rule of Three: wait, until you have more facts available to you. You don't have to take the rule literally either. You can wait until you have four, five, or six examples of the duplication, if the rate of change is low.
The Rule of Three basically says that it's okay to have duplication in two pieces of code, but once you have three pieces of duplicate code then you need to refactor. I feel that advocates of the Rule of Three are missing some important realities about modern software development, though.

Reality #1: Preventive refactoring is cheap

If I see some lines of code that I'd like to mostly duplicate, I can use the refactoring tools that are either built into my IDE or come with a plugin like Resharper. With a few keyboard shortcuts, I:

  1. Extract the values that are going to need to change into variables (select an expression, then "extract variable").
  2. Move the variable declarations above the lines of code that I want to duplicate. 
  3. "Extract method" on the lines of code I need to duplicate.
  4. Go back to the method call and "inline" all the variables that I created in step 1.
It doesn't typically take more than a minute to move the extracted method into a different class if necessary, either. And all of these changes are practically guaranteed not to change the behavior of the code (as long as step 2 doesn't change the order of operations in an important way). So the cost of avoiding duplication in the first place is actually pretty minimal.
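To make that sequence concrete, here's a before-and-after sketch (the names and the tax rate are invented for illustration):

```csharp
using System;

public static class RefactoringDemo
{
    // After steps 1-4: the duplicated "* 1.08m" logic now lives in one
    // extracted method, and each call site passes in its own subtotal.
    public static decimal TotalWithTax(decimal subtotal)
    {
        return subtotal * 1.08m;
    }

    public static void Main()
    {
        // Before the refactoring, these two lines each repeated the
        // tax calculation inline:
        //   decimal invoiceTotal = invoiceSubtotal * 1.08m;
        //   decimal quoteTotal = quoteSubtotal * 1.08m;
        decimal invoiceTotal = TotalWithTax(100m);
        decimal quoteTotal = TotalWithTax(250m);
        Console.WriteLine(invoiceTotal);
        Console.WriteLine(quoteTotal);
    }
}
```

Each step along the way is behavior-preserving, which is why the whole thing takes under a minute with decent tooling.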

Likewise, if another developer comes along and realizes that I chose the wrong form of abstraction for my method, and needs to refactor it to accommodate a different use case, he can use similar tricks in his refactoring. The worst-case scenario is that he's forced to inline the method call with a quick keyboard shortcut and start over with exactly the same effort he might have made if he'd followed the Rule of Three and copy/pasted the code in the first place.

I say "might" because without the extracted method there's no guarantee that the next developer would have noticed the need to refactor in the first place, which leads me to my next point.

Reality #2: Duplicate code is usually only obvious to the person duplicating it

Let's say Alice writes some code, and then Bob realizes he needs to do something very similar. Following the Rule of Three, he copy/pastes a chunk of Alice's code, and then makes a few tweaks to fit his scenario.

Now Charlie needs to do something similar to what Alice or Bob did. What are the chances that he'll be aware of both places that contain duplicate code? Honestly, pretty slim. So he looks around to find an example of where someone else did what he's looking to do, and stumbles across Bob's example. Thinking that he's following the Rule of Three, he then copies Bob's code and tweaks it to meet his own needs. 

Do you see what happened? Because Bob's and Alice's code didn't call into a common method somewhere, Charlie had no idea that he'd reached three instances of the duplicate code. Even if Charlie ended up having two new places to use this code, and chose to refactor at that point in time, he still would have overlooked Alice's piece of duplicate code. And if all of these developers follow Mark Seemann's advice, they may each be aware of five or six instances of the duplicate code and never feel the need to refactor.

This example uses three developers, but it's perfectly possible for the same thing to happen to a single developer, given time in between touching the duplicate code.

So unless the duplicate code is highly localized (e.g., within a few lines of each other), it is almost impossible to truly follow the Rule of Three.

And it doesn't really matter if the rate of change is low--all it takes is one person needing to make one change at some point in time. In the best-case scenario, they notice all the instances of duplication and take the time to refactor all of them--which will probably take longer than the original refactoring would have (even if Charlie had to refactor it again), because there are now somewhere between three and eighteen chunks of mostly-duplicate code that need to be consolidated. But that best-case scenario is unlikely: there's a good chance they'll fail to notice all the places where the code needs consolidation, and they'll end up changing only some of them. This will lead to inconsistent behavior, which could lead to bad data and other serious problems.


While DRY should not be a dogmatic mantra that we follow blindly, it should be a guiding principle. The best time to eliminate duplicate code is generally the moment it appears. Other people may need to refactor the code again later, but the costs of incremental refactoring are usually dwarfed by the costs of failing to do so. 

The so-called "Rule of Three" is bound to lead to dangerous levels of duplication in all but the simplest of systems. The decision to refactor should not be based on how many instances of duplication there are, but on how much code is duplicated, how far apart those instances of duplication are from one another, and how much cohesion that duplication really has with the overall code base.

Thursday, May 15, 2014

Service Locators and the Interface Segregation Principle

Okay, it has been a long time since I've posted here, and boy have I learned a lot in the years since my last post! A co-worker has been encouraging me to start writing blog posts about programming, and in a conversation on Google Plus, Mark Seemann suggested that I should put some of my thoughts in a blog post, so here goes:

Mark Seemann recently blogged (again) about the Service Locator anti-pattern. He tried to describe why it's an anti-pattern in terms of the Interface Segregation Principle:

From a client's perspective, there's no limit to how many variations of the Create method it can invoke:
var foo = serviceLocator.Create<IFoo>();
var bar = serviceLocator.Create<IBar>();
var baz = serviceLocator.Create<IBaz>();
// etc.
Again, from the client's perspective, that's equivalent to multiple method definitions like:
IFoo CreateFoo();
IBar CreateBar();
IBaz CreateBaz();
// etc.
However, the client can keep coming up with new types to request, so effectively, the number of Create methods is infinite!

This argument goes part-way there, but needs a little more fleshing out. After all, you could say that any method with parameters can be expanded out in this way:

var x = service.Foo(1);
var y = service.Foo(2);
var z = service.Foo(3);
// etc.

corresponds to

var x = service.Foo1();
var y = service.Foo2();
var z = service.Foo3();
// etc.

The key difference is that programmers almost never write code like what's in the example above: what's up with those magic numbers? We generally say `var a = Foo(x)`, where `x` is somehow derived from the method's input. That means you can write a meaningful unit test ensuring that when you provide a given `y` as a parameter, a given `x` gets passed into `Foo()`. This aspect of your method is unlikely to change as you modify implementation details--or if it does, it will probably be the sort of change that requires the test to be updated before it compiles.
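As a quick illustration (all the names here are invented), a hand-rolled fake can pin down the relationship between the method's input and the argument passed to `Foo()`:

```csharp
using System;

public interface IService
{
    void Foo(int x);
}

// A minimal test double that records what it receives.
public class FakeService : IService
{
    public int? Received;
    public void Foo(int x) { Received = x; }
}

public class Doubler
{
    private readonly IService service;
    public Doubler(IService service) { this.service = service; }

    // The argument to Foo() is derived from the input, not hard-coded.
    public void Process(int y) { service.Foo(y * 2); }
}
```

A test can now assert that calling `Process(21)` passes `42` into `Foo()`--and if a refactoring silently changes that relationship, the test catches it.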

On the other hand, if you're using a service locator, you're effectively passing hard-coded "magic classes" into the method all over the place. The values you pass in are rarely based on the input provided to the method, and can easily change as you refactor, without forcing anyone to recognize that your method now makes use of a different "interface" method, which needs to be mocked up before the test will run.

The fact that the intended usage of the interface involves passing in hard-coded values that are not defined by an `enum` indicates that you are, in essence, creating an infinite number of different "ways" to call methods on your interface. If you subscribe to the notion of "role-based" interfaces (which I do), then the usage of the interface should be what informs how the interface is written, and if there are a large number of ways for calling code to invoke a method on your interface, it creates the same problems as having a large number of methods on the interface.
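To sketch the contrast (with invented names--this isn't any particular container's API): the locator version hides an effectively infinite interface behind one method, while the role-based version makes the single dependency explicit:

```csharp
using System;

public interface IServiceLocator
{
    T Create<T>();
}

public interface IEmailSender
{
    void Send(string to, string body);
}

// Locator version: the signature says nothing about what Notify() needs;
// the "magic class" IEmailSender is hard-coded deep inside the method.
public class NotifierWithLocator
{
    private readonly IServiceLocator locator;
    public NotifierWithLocator(IServiceLocator locator) { this.locator = locator; }

    public void Notify(string user)
    {
        locator.Create<IEmailSender>().Send(user, "hello");
    }
}

// Role-based version: the dependency is explicit and finite.
public class Notifier
{
    private readonly IEmailSender sender;
    public Notifier(IEmailSender sender) { this.sender = sender; }

    public void Notify(string user)
    {
        sender.Send(user, "hello");
    }
}

// A recording double, handy for testing either version.
public class RecordingSender : IEmailSender
{
    public string LastTo;
    public void Send(string to, string body) { LastTo = to; }
}
```

Notice that a test for `NotifierWithLocator` can't know it must stub `IEmailSender` without reading the method body; a test for `Notifier` can't even construct the object without providing one.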

What about Generics?

"Okay," you may be saying, "but wouldn't that make this an argument against all generics? After all, you can pass any type you want into your `Maybe.Not<>()` method, and generic parameters are hard-coded by definition: isn't that an anti-pattern?"

The difference is that the behavior of a `List<>` constructor, or `Maybe.Not<>()` is exactly the same no matter what type gets passed in. These are data structures, not service methods. 
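A minimal `Maybe`, sketched from scratch (not any particular library's implementation), shows why the generic parameter is harmless here: `Maybe.Not<T>()` behaves identically no matter what `T` is.

```csharp
using System;

public struct Maybe<T>
{
    private readonly bool hasValue;
    private readonly T value;

    public Maybe(T value)
    {
        this.hasValue = true;
        this.value = value;
    }

    public bool HasValue { get { return hasValue; } }

    public T Value
    {
        get
        {
            if (!hasValue) throw new InvalidOperationException("No value");
            return value;
        }
    }
}

public static class Maybe
{
    // The behavior is the same for every T: an empty Maybe<T>.
    public static Maybe<T> Not<T>() { return new Maybe<T>(); }
    public static Maybe<T> Some<T>(T value) { return new Maybe<T>(value); }
}
```

No matter which type argument a caller hard-codes, the observable behavior is the same, so there's no hidden "interface" growing with each new type.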

On the other hand, if you're talking about something like AutoMapper's `Mapper.Map<,>()` method, its behavior is going to vary widely based on the generic type arguments, and I'd suggest that this is, indeed, an anti-pattern. Your call to AutoMapper will be working fine one day and then fail the next just because you renamed a property somewhere, and no compiler ever gave you so much as a hint that you broke your mapping.

I've got more opinions to offer on AutoMapper, but now's not the time. Hopefully this was enlightening.

Friday, September 28, 2012

NonNull extension method

One of the things that has been driving me nuts lately is null checking; my code is plagued with:
if (foo != null)
    // Business as usual

Say you have the following classes (exaggerated; the property "First" in class "Name" would usually just be a string):
public class Person
{
    public Name Name { get; set; }
}

public class Name
{
    public First First { get; set; }
}

public class First
{
    public string Value { get; set; }
}

And the following collection:
var people = new[]
{
    new Person { Name = new Name { First = new First { Value = "John" } } },
    new Person { Name = new Name { First = new First { Value = "Eric" } } },
    new Person { Name = new Name { First = new First { Value = "Joel" } } },
    new Person { Name = new Name { First = new First { Value = null } } },
    new Person { Name = new Name { First = new First() } },
    new Person { Name = new Name() },
    new Person { Name = null },
    new Person(),
};

On a given query, you might see something like:
private void PrintFirstNames(IEnumerable<Person> people)
{
    if (people != null)
    {
        foreach (var person in people)
        {
            if (person != null && person.Name != null
                    && person.Name.First != null && person.Name.First.Value != null)
            {
                Console.WriteLine(person.Name.First.Value);
            }
        }
    }
}

Which can, of course, be written more concisely in LINQ:
private void PrintFirstNamesLinq(IEnumerable<Person> people)
{
    var names = people.Where(p => p != null).Select(p => p.Name)
                .Where(n => n != null).Select(n => n.First)
                .Where(f => f != null).Select(f => f.Value);
    foreach (var name in names)
    {
        Console.WriteLine(name);
    }
}
However, after writing a handy little extension method called NonNull:
public static IEnumerable<R> NonNull<T, R>(this IEnumerable<T> source, Func<T, R> action)
{
    if (source == null)
        yield break;

    foreach (var val in source.Where(s => s != null).Select(action).Where(val => val != null))
        yield return val;
}

It can be written even more concisely:
private void PrintFirstNamesNonNull(IEnumerable<Person> people)
{
    var names = people.NonNull(p => p).NonNull(p => p.Name).NonNull(p => p.First).NonNull(p => p.Value);
    foreach (var name in names)
    {
        Console.WriteLine(name);
    }
}

And even more concisely with a ForEach extension method (see here for code and the debate surrounding it):
private void PrintFirstNamesNonNullForEach(IEnumerable<Person> people)
{
    people.NonNull(p => p).NonNull(p => p.Name).NonNull(p => p.First).NonNull(p => p.Value).ForEach(Console.WriteLine);
}
Another win for C# extension methods (and Monads, but I suppose that's another discussion).

Friday, February 18, 2011

Load Testing with Google's WebDriver

I have been working with WebDriver/Selenium 2 for some time now; I implemented it for my team at work and have since discovered ways to do load testing.

I code primarily in C#, but if this were to be done in Java you could take advantage of the headless browser driver (HtmlUnitDriver), which would allow you to potentially run hundreds of tests at once. IKVM allows you to convert this to C#, but when I did this I discovered that a 5-second test took 2 minutes or more to run, and I wasn't willing to deal with that.

A viable headless-browser option I found for C# is an open-source project called SimpleBrowser. It doesn't support JavaScript, but it's the foundation for a bigger project called XBrowser that will support not only JavaScript but also HTML5, SVG, and Canvas.
It's not implemented in WebDriver, but I was easily able to run 100 concurrent instances of SimpleBrowser performing a simple test.

Anyway, on to the code:
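Here's the general shape of the harness, as a simplified sketch--`RunTest` is a placeholder for the actual SimpleBrowser/WebDriver scenario:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class LoadTest
{
    // Placeholder for the real test scenario: in practice this would drive
    // a SimpleBrowser (or WebDriver) instance through the pages under test.
    static void RunTest(int userId)
    {
        Console.WriteLine("user " + userId + " finished");
    }

    public static void Run(int concurrentUsers)
    {
        var threads = new List<Thread>();
        for (int i = 0; i < concurrentUsers; i++)
        {
            // Capture a copy of the loop variable so each lambda gets its own id.
            int id = i;
            threads.Add(new Thread(() => RunTest(id)));
        }
        threads.ForEach(t => t.Start());
        // Spin the main thread until all the worker threads finish.
        threads.ForEach(t => t.Join());
    }
}
```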

Some notes:
  1. You will need to use a delegate to pass parameters to the test; in this case I've used a lambda expression.
  2. I've run into problems trying to run this as a unit test:
  • For some reason, a unit test doesn't run the threads all at once; instead it runs them one at a time.
  • You will have to spin the main thread while it waits for the other threads to finish.


Friday, July 10, 2009

Combining Lambda Expressions

I'm working on custom fields for a GridView which can include custom filters. The idea is to persist these filters to the data layer, so that when a filter is applied, the query sent to the database includes all the appropriate "where" clauses.

I wanted to have an abstract parent class that deals with most of the tedium involved in extending the DataControlField class, and then only require the child classes to handle the cases specific to their custom filters. The parent class would know how to create the lambda expression required to get a given property off of the object that is being represented in the GridView, while the child class would know how to filter by that property. I was willing to make the parent class a little more messy, as long as it could easily be extended to make whatever kind of filterable columns we wanted. For example, if the child class was filtering based on string data types, I wanted to make the following syntax possible:
protected override IEnumerable<Expression<Func<string, bool>>> GetFilters()
{
    if (filterTextBox.Text != null && filterTextBox.Text.Length > 0)
        yield return s => s.StartsWith(filterTextBox.Text);
}
But I ran into a wall trying to get the parent class to build the full filter expression based on this partial filter expression. I needed a method that looked like this:
public Expression<Func<TrackedTime, bool>> mergeFilterExpressions<T>(
    Expression<Func<TrackedTime, T>> propGetter,
    Expression<Func<T, bool>> filter)

So given one expression that could get the property from the TrackedTime, and another expression that could get a boolean value from the property, I needed to be able to create an Expression that gets a boolean value from the TrackedTime. In LINQ itself, there is of course the ability to Invoke the functions represented by my expressions, like this:
return tt => filter.Invoke(propGetter.Invoke(tt));
However, LINQ to Entities doesn't support the Invoke method. So although this would compile just fine, it would throw a runtime exception when the time came to build the SQL query to send to the database. It was driving me crazy.

After much searching, it was Joe Albahari, creator of the amazing LINQPad (which I use almost daily, by the way), who had the solution. On his website, he provides a collection of extension methods called LinqKit, which allowed me to do the following:
public Expression<Func<TrackedTime, bool>> mergeFilterExpressions<T>(
    Expression<Func<TrackedTime, T>> propGetter,
    Expression<Func<T, bool>> filter)
{
    ParameterExpression param = propGetter.Parameters[0];
    InvocationExpression getProp = Expression.Invoke(propGetter, param);
    InvocationExpression invoke = Expression.Invoke(filter, getProp.Expand());
    return Expression.Lambda<Func<TrackedTime, bool>>(invoke.Expand(), param);
}
How 'bout that! By Invoking and then Expanding the Expressions, I can now combine Lambda Expressions any way I want!
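For instance, with a hypothetical `Note` property on `TrackedTime`, the two partial expressions combine and compile like so (using plain `Expression.Invoke` here, since `Expand()` only matters once LINQ to Entities has to translate the tree to SQL):

```csharp
using System;
using System.Linq.Expressions;

public class TrackedTime
{
    public string Note { get; set; } // invented property, just for the demo
}

public static class FilterDemo
{
    // Same shape as mergeFilterExpressions, minus LinqKit's Expand() calls,
    // which are only needed when the tree must be translated to SQL.
    public static Expression<Func<TrackedTime, bool>> Merge<T>(
        Expression<Func<TrackedTime, T>> propGetter,
        Expression<Func<T, bool>> filter)
    {
        ParameterExpression param = propGetter.Parameters[0];
        InvocationExpression getProp = Expression.Invoke(propGetter, param);
        InvocationExpression invoke = Expression.Invoke(filter, getProp);
        return Expression.Lambda<Func<TrackedTime, bool>>(invoke, param);
    }

    public static void Main()
    {
        Expression<Func<TrackedTime, bool>> merged =
            Merge<string>(tt => tt.Note, s => s.StartsWith("meeting"));
        Func<TrackedTime, bool> compiled = merged.Compile();
        Console.WriteLine(compiled(new TrackedTime { Note = "meeting notes" })); // True
        Console.WriteLine(compiled(new TrackedTime { Note = "lunch" }));         // False
    }
}
```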

Thursday, June 4, 2009

Would a Wave-based IDE be feasible?

Google recently announced "Google Wave," a project that they've been working on for the past two years, and which is set to revolutionize online communication as we know it.  When it was announced, they mentioned that the whole thing was built using Google's Web Toolkit.  I started looking into the toolkit and noticed that they also have something called the Google App Engine, which is a framework and service that lets you host your applications and data "in the cloud" on Google's servers.

Pondering on these various Internet technologies, I got to wondering:  Would it be possible to host the entire development process online?  If someone could create a Wave-based code infrastructure, we could have an online IDE:
  • Each Java code file could be a Wave, which could be edited collaboratively--by multiple users at the same time, if necessary (hello Extreme Programming!).  
  • A spell-check-like plugin could be created to provide real-time compiler feedback and intellisense.  
  • A bot could be granted access to the code tree in order to compile and deploy changes to a cloud-based service in real-time, provide debugging services, and even run unit tests.
  • Developers could "check out" the waves into their own framework instance, and once a set of changes is ready, they could be merged into the "stable" set of Waves.
  • Waves already have an extremely powerful version control system built in; you can visually and immediately step back through each point in a file's revision history to watch its evolution.
  • The Google folks already showed how useful Waves can be in bug management; bugs and tasks could be handled and passed around within the same Wave framework.  They could probably even be linked to the code changes that were made to fix them (and vice versa), for future reference.
  • A plugin similar to the bug management one could be used to tag a spot in code for colleague review.  A Wave thus tagged would appear in the colleague's inbox, where they could see the changes made in the context of the entire Java file.  They could start a thread inline in the code to ask questions and make suggestions, which would all be immediately visible to the original programmer.  They could even have an entire chat session right there, inline with the code!  Both the original programmer and the colleague could make and see the changes in real-time.  Once the colleague is satisfied, they could use the plugin to sign off on the changes.
  • Documentation (both internal and external) could also be managed by the same system. Code for a particular feature could be linked to that feature's documentation.
And perhaps the best part about the whole thing is that developers don't even need to install anything on their computers.  They can log in from any web-enabled computer, anywhere, and all the same capabilities are at their fingertips.  With Wave, Google has laid the foundation for a new generation of Internet technologies.  I'm excited to see the many ways that this sort of technology will be leveraged in the years to come.

Monday, May 4, 2009

Fastest way to check for existence in a table

In order to improve performance on one of the pages in our Java code, I was making a SQL query which, along with the typical section grade information, also pulls in a field to tell whether that particular grade type is in use.  Between my own brains and some quick Google searching, this was the best query I could come up with:
select sg.*,
(select top 1 1 from section_roster sr where sr.section_grade_id = sg.section_grade_id) as isUsed
from section_grade sg
I assumed that in the absence of any ordering, "select top 1 1" would be converted by SQL Server into a sort of "exists" statement, and would therefore be the fastest query for the job.  But just out of curiosity, I ran a similar LINQ query in LINQPad to see what SQL would be generated.  Based on those results, I created the following query:
select sg.*,
(case when exists(select null from section_roster sr where sr.section_grade_id = sg.section_grade_id) then 1
else null
end) as isUsed
from section_grade sg
Although it's not as simple a query, I was able to drop the execution time from about 27 milliseconds to about 9 milliseconds.