Harmful Abstractions

Abstraction is a key concept in computer science and mathematics. In computer science, abstraction is the art of hiding details. It can be as simple as the C function getc(), which returns a single character from some source (the user's keystrokes, a text file, etc.), or as complicated as the seven layers of the OSI network model. When you use getc(), you ignore where the retrieved character comes from. It also means that you usually have no way of knowing where it comes from. Hence, an abstraction does not only let you ignore details, it also hides them from you.
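The getc() idea can be sketched in a few lines of Python. This is only an illustration: count_vowels is a made-up function, and io.StringIO stands in for "some source". The function reads one character at a time and has no way to tell whether the characters come from memory, a file, or a keyboard.

```python
import io


def count_vowels(stream):
    """Count vowels by reading one character at a time, much like getc()."""
    count = 0
    while True:
        ch = stream.read(1)  # returns '' at end of input
        if ch == "":
            break
        if ch.lower() in "aeiou":
            count += 1
    return count


# The caller picks the source; count_vowels cannot tell which one it got.
print(count_vowels(io.StringIO("abstraction")))  # prints 4
```

Any object with a compatible read() method, such as an open text file or sys.stdin, could be passed in the same way.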

Modern software is built as a pyramid of abstraction layers. When writing an application in C#, for instance, you are using an abstraction layer above the intermediate language (IL). The IL is an abstraction layer above the native environment, which, in turn, is an abstraction layer above the von Neumann computation model. The von Neumann model is an abstraction layer above the electronic gates, which hide the details of currents and electrons. If your software uses the Internet, you are very likely using a framework that abstracts away the whole network stack (technically, your application resides in the seventh layer, the application layer, of the OSI model; nevertheless, you have multiple abstraction layers inside it).

It is safe to claim that, without those abstraction layers, writing modern software would be a form of self-torture.

This article, however, is not about how useful abstractions are. It is about how harmful they can be when used wrongly. Abstractions can be dangerous because they may hide important details, and such details can cause the system to stop behaving as expected. When you ignore details, you are saying: “I don’t care about the details, because my system will work the same regardless of the actual details.” This is in fact one of the S.O.L.I.D. principles, the “Liskov substitution principle”, which states: “Let q(x) be a property provable about objects x of type T. Then q(y) should be true for objects y of type S where S is a subtype of T.”

In this definition, T is an abstraction over S. Assuming S' is another subtype of T, the principle does not say that q(x) <=> q(x') for all instances x of S and x' of S'. It only covers the properties that are provable for T itself. In other words, when defining an abstraction you are defining the properties you want all subtypes to have. Confusing this defined common behavior with merely assumed common behavior leads to serious problems. A funny poster about the LSP states: if it looks like a duck, quacks like a duck, but needs batteries, you probably have the wrong abstraction.
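The duck from the poster can be written down as a small sketch. The class and function names are invented for illustration. The property q here is "quack() always returns a sound", which holds for Duck but silently stops holding for the battery-powered subtype.

```python
class Duck:
    """The abstraction promises one thing: quack() returns a sound."""

    def quack(self) -> str:
        return "Quack"


class ToyDuck(Duck):
    """Looks and quacks like a duck, but only with batteries inserted."""

    def __init__(self, has_batteries: bool = False):
        self.has_batteries = has_batteries

    def quack(self) -> str:
        if not self.has_batteries:
            # Violates the property that every caller of Duck relies on.
            raise RuntimeError("no batteries")
        return "Quack"


def morning_chorus(ducks):
    """Written against Duck; assumes that quack() always succeeds."""
    return [duck.quack() for duck in ducks]
```

morning_chorus works for any list of Duck objects, and for a ToyDuck with batteries, but raises RuntimeError as soon as a ToyDuck without batteries is in the list, although the type hierarchy says it is a Duck.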

In the rest of this post I will show some harmful abstractions that have led, or could lead, to dangerous behavior.

Databases are memory collections

This abstraction has been pushed mainly by the rise of Domain-Driven Design (DDD), which encourages persistence ignorance (PI). The best-known PI pattern is the repository pattern, which hides the database access details behind a slick interface with a few simple methods like GetAll(), GetById(), etc. This pattern is really great for unit testing. Abstracting the data access layer makes it possible to replace the database with an in-memory collection of the entities, which results in a huge performance gain compared to accessing the database and, more importantly, in predictable behavior.
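A minimal sketch of such a repository, written in Python rather than C# and with hypothetical names (Post, PostRepository, InMemoryPostRepository, titles), might look like this. The consuming code depends only on the interface, so a unit test can hand it an in-memory implementation instead of a database-backed one.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Post:
    id: int
    title: str


class PostRepository(Protocol):
    """The slick interface: callers see only these two methods."""

    def get_all(self) -> list[Post]: ...

    def get_by_id(self, post_id: int) -> Optional[Post]: ...


class InMemoryPostRepository:
    """Test double backed by a dict instead of a database."""

    def __init__(self, posts):
        self._posts = {post.id: post for post in posts}

    def get_all(self) -> list[Post]:
        return list(self._posts.values())

    def get_by_id(self, post_id: int) -> Optional[Post]:
        return self._posts.get(post_id)


def titles(repo: PostRepository) -> list[str]:
    """Code under test; it cannot tell what kind of repository it received."""
    return [post.title for post in repo.get_all()]
```

Nothing in this interface says how expensive a call is, which is exactly the detail the following sections are about.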

Performance problems

The problems usually start to appear after you have established the infrastructure of your architecture, which is usually too late. The problem with this abstraction is that databases don't work like memory. Navigating from one object to another in the database is nothing like navigating in memory. Round trips to the database are far more time-consuming than memory accesses, and fetching too many objects from the database is expensive. All serious ORMs provide lazy loading to load the dependencies of an object on demand. To conform to the mainstream, I will demonstrate this behavior using the blog post with comments example. Loading a blog post from the database doesn't mean you want all of its comments. It gets worse if the post has attachments stored in the database: loading them is not only slow but can also exhaust memory. Such sub-objects can be lazy loaded, i.e. loaded on demand. Now assume you want to show all blog posts with their comments. Lazy loading them leads to what is known as the select N+1 problem: the ORM issues one query to retrieve the N blog posts and then N more queries, one per post, to load its comments.
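The select N+1 effect can be reproduced without an ORM. The sketch below uses Python's sqlite3 module with an in-memory database. The table layout, the CountingConnection wrapper, and the two load functions are assumptions made for this illustration: one function mimics lazy loading, the other fetches everything with a single join.

```python
import sqlite3


class CountingConnection:
    """Wraps a sqlite3 connection and counts the SELECTs sent through query()."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(n_posts=3, comments_per_post=2):
    db = CountingConnection()
    db.conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
    db.conn.execute(
        "CREATE TABLE comment (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)"
    )
    for p in range(1, n_posts + 1):
        db.conn.execute("INSERT INTO post VALUES (?, ?)", (p, f"Post {p}"))
        for c in range(comments_per_post):
            db.conn.execute(
                "INSERT INTO comment (post_id, body) VALUES (?, ?)", (p, f"c{c}")
            )
    return db


def load_lazily(db):
    """One query for the posts, then one more query per post: N + 1."""
    posts = db.query("SELECT id FROM post ORDER BY id")
    result = {}
    for (post_id,) in posts:
        rows = db.query(
            "SELECT body FROM comment WHERE post_id = ? ORDER BY id", (post_id,)
        )
        result[post_id] = [body for (body,) in rows]
    return result


def load_eagerly(db):
    """A single join returns the same data in one round trip."""
    rows = db.query(
        "SELECT p.id, c.body FROM post p "
        "LEFT JOIN comment c ON c.post_id = p.id ORDER BY p.id, c.id"
    )
    result = {}
    for post_id, body in rows:
        result.setdefault(post_id, [])
        if body is not None:
            result[post_id].append(body)
    return result
```

With three posts, load_lazily sends four queries and load_eagerly sends one. Against an in-memory collection both variants cost the same, which is why unit tests do not reveal the difference.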

If you ignore this issue, it will bite you later. And it will bite hard! Not wanting to discard your beautiful repository and the nice unit tests with the in-memory collections, you try to solve this dilemma by adding new specialized methods to the repository. For retrieving all posts with their comments you add:

GetAllWithComments();

The same problem happens again and again, and each time you add a new method to the repository. The beautiful, slick interface isn't slick and beautiful any more. It gets bloated with many methods like:

GetAllWithCommentsPaged();

GetAllWithCommentsAndAuthorsPaged();

GetByTag();

GetByTagWithComments();

GetByTagWithCommentsPaged();

You name it.

Context problems

Many ORMs embed the database context inside the persisted objects in order to keep the connection to the database, retrieve sub-objects, and persist updated values. The context also provides further features like object lifecycle management, transactions, etc. As a consequence of this coupling between objects and context, a persisted object depends strongly on its context. It is not enough for a service class to receive a list of blog posts in order to calculate the average number of comments per post. It has to assume that either the comments have all been loaded eagerly or the context has not been disposed. It also means that you cannot keep database objects in memory between requests if you wish to modify them later or navigate the object graph.
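The hidden dependency on the context can be made visible with a small sketch. It is not how any particular ORM is implemented: LazyPost, open_context, and average_comment_count are hypothetical names, and a plain sqlite3 connection plays the role of the context.

```python
import sqlite3


class LazyPost:
    """Keeps a reference to its connection (the context) to load comments later."""

    def __init__(self, conn, post_id):
        self._conn = conn
        self.id = post_id
        self._comments = None

    @property
    def comments(self):
        if self._comments is None:  # first access triggers a database query
            rows = self._conn.execute(
                "SELECT body FROM comment WHERE post_id = ?", (self.id,)
            ).fetchall()
            self._comments = [body for (body,) in rows]
        return self._comments


def open_context():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comment (post_id INTEGER, body TEXT)")
    conn.executemany(
        "INSERT INTO comment VALUES (?, ?)",
        [(1, "a"), (1, "b"), (1, "c"), (2, "d")],
    )
    return conn


def average_comment_count(posts):
    """Sees only a list of posts, but silently needs a live context."""
    return sum(len(post.comments) for post in posts) / len(posts)
```

As long as the connection is open, average_comment_count returns a result. If the connection is closed before the service runs, the same call on the same list of posts raises sqlite3.ProgrammingError, although nothing in the function signature hints at a database.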

Remote calls

Remote calls are calls to a resource in another process, on another computer in the same network, or across the Internet. Such calls suffer from high latency. If you wish to avoid slowing down your application, you can make them asynchronously, i.e. start the call and register a callback to be executed when the call has completed. Problems start when you try to hide the distributed nature of your calls. If all you are doing is calling a web service to get the weather data and show it on your home page, you're fine. You can hide the web service call behind an IWeatherInfo interface and never mind how it works. Of course, the latency of your site will be at least as high as the latency of the called web service, but no problem. It works!
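Starting a call and registering a callback could look roughly like the following sketch. There is no real web service involved: fetch_weather is a stand-in that just sleeps to simulate latency, and all function names are assumptions. Only the concurrent.futures and threading calls are standard library APIs.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_weather(city):
    """Stand-in for the remote web service; the sleep simulates latency."""
    time.sleep(0.2)
    return f"{city}: sunny"


def fetch_weather_async(city, on_done, executor):
    """Start the call and return at once; on_done runs when the result arrives."""
    future = executor.submit(fetch_weather, city)
    future.add_done_callback(lambda f: on_done(f.result()))
    return future


def demo():
    received = []
    finished = threading.Event()

    def on_done(report):
        received.append(report)
        finished.set()

    with ThreadPoolExecutor(max_workers=1) as executor:
        fetch_weather_async("Dortmund", on_done, executor)
        # The caller is free to do other work here instead of waiting.
        finished.wait(timeout=5)
    return received
```

The caller explicitly sees that the answer arrives later. An interface that returned the report directly would hide exactly that.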

Now imagine a more complicated situation with many distributed processes. If a communication partner has to wait for several other partners to answer its calls, the system will get very slow. And that's not all: it can get even worse. Deadlocks could render the system unusable. To demonstrate this, let us consider the Starbucks example, which shows how multiple processes communicate with each other over a longer period to accomplish a business transaction.

A transaction at Starbucks involves three actors: a customer, a barista and a cashier. The customer starts the transaction by ordering a drink from the barista. The barista starts making the drink and notifies the cashier to bill it. The cashier asks the customer to pay for the drink. The customer pays for the drink. After receiving the payment, the cashier notifies the barista, who delivers the drink to the customer.

Executing this transaction within a single process causes no problems. The execution flows from one object to the next as method calls until the customer's initial call to order a drink returns. All that is needed is one thread.

Now trying to hide the distributed nature ~~could~~ will lead to the following deadlock:

The customer orders a drink, and the customer's thread blocks waiting for the call, which looks like an ordinary method call, to return. As described previously, the cashier asks the customer to pay. The customer cannot reply because its thread is blocked waiting for the ordering call to return, which in turn will not return until the payment is done. Everyone is blocked waiting for some call to return. Of course, you could assign a second thread to the customer. But what if the customer orders two drinks simultaneously? This is a fairly simple example. In a real-world distributed application you will have many such services calling each other. Hiding the asynchronous nature of these calls behind an interface would harm your system.
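The deadlock can be simulated inside one Python program. This is a sketch under simplifying assumptions: each participant is a hypothetical Actor with a single worker thread and an inbox, a "remote call" is a message plus a blocking wait for the reply, and the timeouts exist only so that the demonstration terminates instead of hanging forever.

```python
import queue
import threading


class Actor:
    """A participant with exactly one thread that handles one message at a time."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            handler, reply = self.inbox.get()
            reply.put(handler())

    def call(self, handler, timeout):
        """Blocking call: enqueue work on this actor and wait for the reply."""
        reply = queue.Queue(maxsize=1)
        self.inbox.put((handler, reply))
        return reply.get(timeout=timeout)  # raises queue.Empty on timeout


def run_order(timeout=0.2):
    customer, barista, cashier = Actor("customer"), Actor("barista"), Actor("cashier")

    def pay():
        return "paid"

    def bill():
        try:
            # Needs the customer's only thread, which is busy inside order().
            return customer.call(pay, timeout)
        except queue.Empty:
            return "timed out waiting for payment"

    def make_drink():
        return cashier.call(bill, timeout * 2)

    def order():
        # The customer's thread blocks here until the barista answers.
        return barista.call(make_drink, timeout * 3)

    return customer.call(order, timeout * 4)
```

run_order() returns "timed out waiting for payment": the request to pay sits in the customer's inbox while the customer's only thread is still waiting for the drink. Without the timeouts, none of the three calls would ever return.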

Hiding the stateless nature of the HTTP protocol

The HTTP protocol used for web communication is stateless by nature. That is, each HTTP request is independent and carries no information about previous requests. Classic ASP.NET tried to get rid of this limitation with an abstraction layer over the protocol that enables stateful web controls. It preserves the state of a web page by rendering a huge amount of redundant data that gets exchanged between the browser and the server each time.
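The general technique of faking state on top of a stateless protocol can be sketched as follows. This is a simplified illustration, not ASP.NET's actual format: the field name __STATE, the JSON plus Base64 encoding, and the handler are all assumptions. The point is that the complete state travels with every request and every response.

```python
import base64
import json

STATE_FIELD = "__STATE"  # hypothetical name of the hidden form field


def encode_state(state):
    return base64.b64encode(json.dumps(state).encode("utf-8")).decode("ascii")


def decode_state(blob):
    if not blob:
        return {"clicks": 0, "items": []}
    return json.loads(base64.b64decode(blob).decode("utf-8"))


def handle_request(form):
    """Stateless handler: everything it knows must arrive with the request."""
    state = decode_state(form.get(STATE_FIELD, ""))
    state["clicks"] += 1
    state["items"].append(f"item {state['clicks']}")
    # The whole state goes back to the browser as a hidden form field.
    return {STATE_FIELD: encode_state(state)}, state
```

Feeding the returned form back into handle_request simulates a postback. The encoded field grows with every item added to the page state, and all of it is sent back and forth each time.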

ASP.NET also tried to abstract HTML behind an object-oriented, component-based model, which frequently failed. The hidden details about the true nature of HTML could not be ignored without losing control over the look and feel of the rendered pages.

In this article I have tried to demonstrate how abstractions can be harmful if used inappropriately. Consider your abstractions carefully and never assume. And remember: if it looks like a duck, quacks like a duck, but needs batteries, you probably have the wrong abstraction ;)

August 9, 2010 | Tags: Programming

About Me

This blog is kept alive by me, Moukarram Kabbash, a programmer and hobby photographer from Dortmund, Germany.

mouk.github.com
