NoSQL Distilled, book review

Martin Fowler's books are traditionally of very high quality (Pramod Sadalage is a new author for me, but I would expect him to work at the same professional level). In this respect "NoSQL Distilled" is no different. All the most important concepts are accurately defined and described in detail. What is the best way to design aggregates, why transactions have lost their importance, sharding, replication and consistency issues – everything is explained with precise clarity. One gets answers to most possible data design questions in the first, conceptual part of the book.

The second part of the book does not look quite as convincing. Although most concepts from the first part are illustrated in the second, the examples are too short and read like a quick-start guide from the database documentation. In this regard I would personally prefer "Seven Databases in Seven Weeks", which successfully demonstrates the most relevant strengths of each database. One could say that the "Seven Databases…" book perfectly extends "NoSQL Distilled" as a practice guide. (Or the other way around – "NoSQL Distilled" extends "Seven Databases…" by better summarizing the most important concepts.)

All in all, "NoSQL Distilled" is a very important release and a must-have for its target audience.


The Vim Way

Everyone knows Vim, the good old *nix editor. It seems to have been trending more and more in recent years, and not without reason. I got acquainted with it about one year ago, in December 2011, at the Global Day of Coderetreat in Munich.

Personal

The coderetreat per se was terrific. My former colleague, software craftsman and agilist Ilker Cetinkaya has written a very good review, and there is no way to describe the event better. That day, during one of the coding sessions, I got a unique opportunity to watch a Vim professional at work. I was so amazed by the seamless workflow and frictionless development my vis-à-vis demonstrated while coding that I decided the next year, 2012, should be the year of Vim for me.

Apart from that, there was another motivating factor, namely "Practical Vim", an excellent book by Drew Neil. While my partner at the coderetreat demonstrated how effective one can be with Vim, the book helps to reconsider the process of writing text and to treat it completely differently – not just as hitting keys on the keyboard, but as a mental activity.

What have I learned throughout this year? Why have I invested weeks of free time into a new tool, while many would just smile leniently and take it for a kind of game…

Is it just a game?

Dreyfus is all around

No, it is not a game (although it could also be a game). It is also not a challenge (although it could be one). It is all about Dreyfus, the Dreyfus model of skill acquisition. (The application of the Dreyfus model to software development was excellently described by Andy Hunt in "Pragmatic Thinking and Learning".) The Dreyfus model describes how well you can perform as a software developer and helps you understand how you can do it better. Skill acquisition starts at the novice level (rigid, rule-based) and goes all the way through to the expert (self-improving, intuition-based). Aiming for expert skills is extremely important, especially for software developers, especially for coding, since coding is the kind of activity that should be as intuitive as possible. Here is why.

Software developers edit text, trying to translate human ideas into linguistic constructs comprehensible to a computer. There are still no computer chips implanted into the brain that would deliver our thoughts directly to machines; it will take a while before we get such gadgets. Until then, typing the symbols of a computer language is the main means of communication. That is why it is not only important what you code – an object-oriented, domain-driven C# essay, or a high-performance C++ map-reduce novella. It is also essential how you do it. And as far as you are concerned – how can you do it better?

The How question

The how means going all the way from novice to expert (here is, once again, a summary of the Dreyfus levels).

The first step is surely to learn touch typing. The second is to find a tool that helps you. And finally – master the tool. The choice of tool is essential: mastering Notepad is not that difficult, but also not that helpful. The tool should allow you to be intuitive and encourage you to continuously improve your workflow.

This is where Vim comes into play. It seems to be almost endlessly open for mastery: month after month you can find new hidden gems. Designed to be intuitive, it translates thoughts into text, almost like that pluggable computer chip in your brain. This way you do not necessarily perceive Vim as a tool, or perceive it at all – provided an advanced level of competence.

Surely, Vim is not the only option. There are enough good tools; you should just pick one, or combine several of them. Vim in Visual Studio with ReSharper might sound tempting for a C# developer. Diversity is welcome.

What’s next

Now this year is drawing to a close. Another Global Day of Coderetreat has passed, with lots of experience and positive impressions. And I am happy to have a good new friend at my side: Vim, the workhorse.


Seven Databases in Seven Weeks, book review

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement

The book does what it says in the preface – it provides a well-rounded understanding of the modern database landscape. It is written in nice, informal language, with loads of examples and exercises, and "no day would be complete without a little bit of razzle-dazzle". You should expect to get a broad grasp of why there are so many NoSQL databases and which one could be a good fit for your next project.

Do not expect, however, to become proficient or even competent in any of the listed tools. That would be an unfeasible goal for three-day sprints – and that is the amount of time one gets for each database. Each chapter shows just the most essential features of the respective database, something that distinguishes it, leaving a lot of questions open. So be prepared to get acquainted with Amazon's and Google's original papers to really understand Riak and HBase, read about Bloom filters, and investigate Hadoop, ZooKeeper and a handful of other exotic tools, successively turning seven weeks into seven months. It pays off – but be prepared…

Short wrap-ups after each chapter summarize all the strengths and weaknesses, and the comprehensive overview tables at the end can perfectly serve as guidance when looking for an appropriate tool for your project. I find the classification in terms of the CAP theorem especially helpful – on which side of the CAP triangle a given database can operate.

All in all, a very interesting book – many thanks to the authors, Eric Redmond and Jim R. Wilson.


Way too many Factories

I am a big fan of the Java ecosystem. But some design decisions are just perplexing. Like this code snippet, which writes an XML document to a string:

DOMSource domSource = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.transform(domSource, result);
return writer.toString();

Not sure I want to deal with a factory for my simple task, let alone that DOMSource object…

Same pattern here:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(path);

or here:

XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(myXpath);

And here, even more layers of indirection:

SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source schemaSource = new StreamSource(new File(mySchema));
Schema schema = schemaFactory.newSchema(schemaSource);
Validator validator = schema.newValidator();
validator.validate(source);

It seems that the whole JAXP API is built with the same pattern in mind. All the factories and builders are very generic and offer a great deal of flexibility. But I wonder how often that flexibility is actually used in real life, as opposed to copy-pasting the same new/factory/builder snippets out of Google. Loading an XML document, searching it or validating it are simple day-to-day tasks. Shouldn't it be possible to fulfill them more simply?
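Just to illustrate what I mean – a minimal sketch of the kind of facade I would rather call. The Xml class and its method names are my own invention, not part of JAXP; it merely hides the same factory boilerplate behind two short methods:

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// hypothetical convenience facade, not part of JAXP
public final class Xml {
    private Xml() {}

    // Document -> String, hiding the transformer factory dance
    public static String asString(Document doc) throws Exception {
        StringWriter writer = new StringWriter();
        TransformerFactory.newInstance()
                .newTransformer()
                .transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    // file path -> Document, hiding the builder factory
    public static Document parse(String path) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(path);
    }
}

One line per task – the flexibility would still be there for whoever needs it, just not in my face.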

Less is more.


Snapshot Isolation in SQL Server

Snapshot isolation has been a notable feature of SQL Server since version 2005. Still, not everyone knows about it (I didn't – until recently), and it's worth spending a couple of words on the subject.

Some time ago we got an interesting problem – deadlocks in our SQL Server database! Not just some ordinary, boring deadlocks, but cool ones. We'll have a look at what makes them so cool. But first, let us recall what a classic deadlock is: it happens when two processes try to acquire two resources at nearly the same time, in the wrong order.

Classic Deadlock

Is it possible to get deadlocked with just one resource? Probably not…

But who knows – let's first have a look at a simple example (actually a drastically oversimplified version of a real production issue).

Example

Let's create a single 'event log' table, something like the store of the log4net AdoNetAppender. Every record contains a log message, a level (like Info, Warning, or Error) and the time when it was created. We will also assume that event logs are somehow classified, and that CategoryID refers to a particular message classification.

CREATE TABLE [EventLog](
	[ID] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
	[Created] [datetime] NOT NULL,
	[Level] [nvarchar](10) NOT NULL,
	[Message] [nvarchar](max) NOT NULL,
	[CategoryID] [bigint] NOT NULL,
)

If my application experiences substantial problems, it can add some relevant data to the table:

INSERT INTO [EventLog]
(Created, [Level], [Message], CategoryID)
VALUES
(GETDATE(), 'FATAL', 'ProgramException: mysterious program crash', 32)

After some period of time I want to look into the table and see what's going on. Today I'm interested in all the messages of category 32 (whatever this category means):

SELECT Message FROM EventLog WHERE CategoryID = 32

There are already quite a few log records in the database and my query seems to be kind of slow. That's why I am adding a new index:

CREATE NONCLUSTERED INDEX Events_By_Category ON [EventLog]
(
	[CategoryID] ASC
)

OK. Everything works great now. The application logs events at a significant rate. I'm looking at the categorized data from time to time, and my colleagues are doing the same when they have a spare minute.

No! Don’t cheer too soon! The deadlock is already coming!

Transaction was deadlocked on lock resources with another process and has been chosen as the deadlock victim.

As we saw earlier, there should be at least two transactions and at least two resources to produce a deadlock. The two transactions are identified pretty quickly (thanks to SQL Server Profiler): transaction 1 is the INSERT statement and transaction 2 is the SELECT statement from above.

But where on earth are the two resources if we are dealing with one single table? How can two single-statement transactions get deadlocked? Perhaps the problem is elsewhere?

No, everything is correct. The problem is with these two statements. To find the two resources, we should keep in mind how SQL Server stores data, and not forget that the persistence format of a SQL Server table is the persistence format of its clustered index – a B-tree whose nodes are ID ranges and whose leaves contain the actual data. That's why when I say table data, I really mean its clustered index. With the second, nonclustered index – Events_By_Category – another B-tree was created. It consists of CategoryIDs, and its leaf nodes reference the clustered index.

And this is the answer to the question. When the first transaction inserts data into the table, it first adds the data to the clustered index and afterwards (indirectly) updates the nonclustered one. When the second transaction looks for particular events, it first searches the nonclustered index by CategoryID and then looks up the clustered index to get the actual message data. The two transactions thus touch the two indexes in opposite order – and there are our two resources. Isn't it cool?

Deadlock demystified

How can we fix it? Maybe INCLUDE all the data required by transaction 2 in the second index (sketched below)? Or use query hints like NOLOCK? That could work. But there is a better option – now is the right time to go back to snapshot isolation.
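Just for reference, the covering-index route might look roughly like this (a sketch only – which columns to include depends on the actual queries, and including an nvarchar(max) column makes the index considerably larger):

CREATE NONCLUSTERED INDEX Events_By_Category ON [EventLog]
(
	[CategoryID] ASC
)
INCLUDE ([Message])
WITH (DROP_EXISTING = ON)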

Snapshot Isolation

The idea of snapshot isolation is to separate reads from writes. If the feature is switched on, SQL Server keeps versions of modified rows in tempdb (hence the name): while a writer is changing a row, readers are served the last committed version from this version store instead of waiting. No shared locks are taken by the readers – and without those locks, our deadlock is gone! After enabling the READ_COMMITTED_SNAPSHOT database option, all reads in READ COMMITTED transactions will use the versioned data.
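Switching it on is a one-liner (the database name EventsDb is just a placeholder here; note that changing READ_COMMITTED_SNAPSHOT needs exclusive access to the database, so run it in a quiet moment):

-- route READ COMMITTED reads through row versioning
ALTER DATABASE EventsDb SET READ_COMMITTED_SNAPSHOT ON

-- optionally also allow the explicit SNAPSHOT isolation level
ALTER DATABASE EventsDb SET ALLOW_SNAPSHOT_ISOLATION ON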

There is just one drawback (at least, I know of only one): since the row versions are kept in tempdb, the database will need noticeably more disk space for tempdb. But who cares about disk space nowadays? 🙂

Interestingly enough, the snapshot data is de facto derived data: if it is lost, it can be recreated from the source tables. In this respect it can be seen as a certain variation of a Reporting Database, and the whole idea of read/write separation is to some extent related to the CQRS pattern.



Divide and Conquer

A recent memory leak issue was quite tough. The application crashed at different points at different times. The error messages did not contain any mention of leaking handles and just stated something like "Parameter is not valid". It was quite challenging to track down the true cause of the problem. How should we proceed? What is the process for finding the single problematic statement in a 5 MLOC code base?

OK, we had some clues. Of course.

  1. We knew that the problem occurred only when a particular module was active, so we isolated the module into its own process. This led to extra IPC, but in this case it was the lesser of two evils. After this step the main application did not crash any more; the error moved to the module and changed from "Parameter is not valid" to "Out of memory". So the problem was definitely in this module.
  2. This part of the project has a lot of native stuff, so the next thought was that we were disposing something incorrectly. Unfortunately, code analysis did not turn up anything; everything seemed to be OK.
  3. But now we knew that the problem was with system resources, since the new error said "Out of memory". With extra monitoring of memory and handles we could see permanent growth of the handle counters alongside constant memory consumption. That allowed us to identify the problem as a handle leak in the module – quite a good constraint for further analysis (it is funny that the error message says out of memory while the monitoring shows leaking handles). But we still did not know the reason.
  4. By that time we had data showing that the application crashed not only during business hours but also at night. That meant there had to be some activity triggered automatically around the clock, so we now had to look for a background process leaking handles in the module. A few more minutes looking through the code and here it is – a Timer repeatedly polling a native API. Big win!
  5. But even knowing the type of the problem and the particular code causing it, having narrowed our search from 5 MLOC down to just dozens of lines, we still could not identify the true cause. Luckily, with just a few lines of code under suspicion, it was possible to do a detailed line-by-line runtime analysis and finally find it – the evil Control.Invoke!

The entire process may sound obvious. Sure it is. We effectively just split the whole application into 'stable' and 'not stable' pieces, and kept splitting the bad ones, narrowing the search. All good things should be just as simple and obvious, I think.


Blast from the Past

No, I don't mean the movie, but the cold hand of WinForms stretching through the years and catching me unawares by the shoulder.

Did you know that Control.Invoke leaks? It does.
It allocates WaitHandles and does not dispose them. Surely, they will be finalized by the garbage collector at some point. And surely the garbage collector is unaware of the unmanaged resources behind the managed WaitHandles. Even GC.AddMemoryPressure is of little help, since 30K handles (and the same number of system objects) may occupy only a tiny chunk of memory yet still be a huge problem for the current process.

Some folks have faced the same problem with a timer triggering callbacks that – for some reason – must be routed to the GUI thread. The timer fires every N milliseconds, and after some hours or days the process cannot create handles any more. This is pretty much what we had recently in our project – events coming every 300 ms into a module triggering some native API that depends on the message loop of a particular window.

In the end we managed to move event generation to the GUI thread itself, effectively discarding every single call to Control.Invoke. But still, it is kind of embarrassing that Control.Invoke does not care about system resources. Being paranoid seems to be a good strategy, even with regard to something that is claimed to be 'mature' and 'stable'.
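For illustration, here is a minimal sketch of the idea (class and method names are invented): a System.Windows.Forms.Timer raises its Tick event through the message loop of the thread that created it, so if the module lives on the GUI thread, the 300 ms polling can call the window-bound native API directly and Control.Invoke is not needed at all.

using System;
using System.Windows.Forms;

// hypothetical polling module; assumed to be created on the GUI thread
public class PollingModule : IDisposable
{
    // WinForms timer: Tick is delivered via the message loop, i.e. on the GUI thread
    private readonly Timer timer = new Timer();

    public PollingModule()
    {
        timer.Interval = 300;   // the 300 ms cadence from our case
        timer.Tick += OnTick;
        timer.Start();
    }

    private void OnTick(object sender, EventArgs e)
    {
        // already on the GUI thread – no Control.Invoke, hence no leaked WaitHandles
        PollNativeApi();
    }

    private void PollNativeApi()
    {
        // placeholder for the native call that requires the window's message loop
    }

    public void Dispose()
    {
        timer.Dispose();
    }
}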



Image to byte array and vice versa, the better way

It is not a big problem to convert an Image to a byte array. Just save it to a MemoryStream and you are done. Right? Not really – or at least, not always. You can do much better!

The problem with Image.Save is that it is quite slow: it always implies some encoding work. But what if I don't need to apply any image codec and just work with bitmaps? The bytes of my Bitmap are already somewhere in memory, and I just want to get a copy of them. Could that work faster? Yes, most probably!

        public static byte[] BitmapToArray(Bitmap image, PixelFormat pixelFormat) {
            var imageData = image.LockBits(new Rectangle(0, 0, image.Width, image.Height),
                                                ImageLockMode.ReadWrite, pixelFormat);
            var bytes = new byte[imageData.Height * imageData.Stride];
            try {
                Marshal.Copy(imageData.Scan0, bytes, 0, bytes.Length);
            } finally {
                image.UnlockBits(imageData);
            }
            return bytes;
        }

        public static Bitmap ArrayToBitmap(byte[] bytes, int width, int height, PixelFormat pixelFormat) {
            var image = new Bitmap(width, height, pixelFormat);
            BitmapData imageData = image.LockBits(new Rectangle(0, 0, image.Width, image.Height),
                                                ImageLockMode.ReadWrite, pixelFormat);
            try {
                Marshal.Copy(bytes, 0, imageData.Scan0, bytes.Length);
            } finally {
                image.UnlockBits(imageData);
            }
            return image;
        }

It works quite well – in my tests, up to 4 times faster compared to the Image.Save-to-stream solution, and almost 20 times faster (!!) if you are saving the image to the stream as PNG.

Just one thing to keep in mind – working without codecs surely means that we get much bigger byte arrays. That _could_ be a problem, depending on the particular use case.
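A quick round-trip example of how these helpers are meant to be used (the file name is just a placeholder, and this sketch assumes a non-indexed pixel format; width, height and pixel format have to travel along with the raw bytes):

var original = new Bitmap("photo.bmp");   // hypothetical input image
byte[] raw = BitmapToArray(original, original.PixelFormat);
Bitmap restored = ArrayToBitmap(raw, original.Width, original.Height, original.PixelFormat);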


Optimistic Concurrency and NoSQL

Here is the question: how do I implement optimistic concurrency in this or that document database? How can I make sure that a document being saved has not been changed by someone else in the meantime? To start with, here are some examples with two popular document stores.

RavenDB

If you need optimistic concurrency, you should set the session's UseOptimisticConcurrency property. Straightforward. Under the hood it uses ETag values, Raven's variant of a timestamp automatically associated with every document. Every time a document is updated in the store, its ETag gets a new Guid value, and you get an exception if the ETag of the document in the session is not the same as the one in the store.

[Subject("Raven Experiments")]
public class when_updating_same_object_in_separate_sessions : raven_persistence_context
{
    static Exception Exception;

    Establish context = () => StoreSingleObject(new MyClass {Text = "this one"});
    Because of = () => Exception = Catch.Exception(UpdateTwoObjectsInTwoSessions);
    It should_throw_concurrency_exception = () => Exception.ShouldBeOfType<ConcurrencyException>();

    static void UpdateTwoObjectsInTwoSessions()
    {
        using (var session = Store.OpenSession())
        {
            session.Advanced.UseOptimisticConcurrency = true;
            session
                .Query<MyClass>()
                .First()
                .Text = "still one";

            AtTheSameTimeInAnotherSession();

            session.SaveChanges();
        }
    }

    static void AtTheSameTimeInAnotherSession()
    {
        using (var anotherSession = Store.OpenSession())
        {
            anotherSession
                .Query<MyClass>()
                .First()
                .Text = "ha ha";
            anotherSession.SaveChanges();
        }
    }

    public class MyClass
    {
        public string Id { get; set; }
        public string Text;
    }
}

It works equally well for new documents. For example, in event sourcing scenarios, sequences of events with continuously incrementing revisions are persisted in the store. The event sequence cannot be consistent if there are two events with the same revision. UseOptimisticConcurrency helps here as well: just make sure that event IDs are generated properly and that events with the same Revision get the same ID in the store.

[Subject("Raven Experiments")]
public class when_storing_two_objects_with_same_id_in_separate_sessions : raven_persistence_context
{
    static Exception Exception;

    Because of = () => Exception = Catch.Exception(CreateTwoObjectsInTwoSessions);
    It should_throw_concurrency_exception = () => Exception.ShouldBeOfType<ConcurrencyException>();

    static void CreateTwoObjectsInTwoSessions()
    {
        using (var session = Store.OpenSession())
        {
            StoreOnce(session, "first");
        }
        using (var session = Store.OpenSession())
        {
            StoreOnce(session, "another first");
        }
    }

    static void StoreOnce(IDocumentSession session, string text)
    {
        session.Advanced.UseOptimisticConcurrency = true;
        session.Store(new MyEvent {Revision = 1, Payload = text});
        session.SaveChanges();
    }

    public class MyEvent
    {
        public int Id { get { return Revision; } }
        public int Revision { get; set; }
        public string Payload { get; set; }
    }
}

Raven's design in this case is pretty straightforward and tries to encapsulate the details with an out-of-the-box solution. It probably covers 90% of use cases, but if this solution doesn't really fit your problem, you have to invent workarounds.

Take the event sourcing example again. Say you have several sequences of events with ongoing revision numbers. For each sequence there is also some aggregated information, like the most recent revision and the time when the last snapshot was taken. This information is continuously updated: the most recent revision is updated every time a new event comes in (often), and the snapshot time is set when there is a new snapshot (not so often). If I update the last revision, I do care that nobody changes it in the meantime. (Actually I don't – it is not a big problem if the last revision is not exactly correct. But for the sake of the example let's assume that I care.) If I update the snapshot time, however, there is no reason to get a conflict just because the revision has changed. And that could be a problem with the ETag scenario, where any concurrent change to the document counts as a conflict.

This is where MongoDB comes into play.

MongoDB

MongoDB has a couple of nice atomic operations: find-and-modify and update-if-current. They are somewhat similar to Raven's patching API, with one important difference: in Raven's partial document updates you must use the ID to identify the document to be patched and the ETag to ensure that the document has not been changed, whereas in MongoDB the guarding query can check any fields of the document – for example, the current Revision.

(Sticking to the .NET Driver)

[Subject("Mongo")]
public class when_issuing_an_update_on_a_changed_entity : mongo_context
{
    private static long DocumentsAffected;

    private Establish context = () => SetUpDatabase();
    private Because of = () => DocumentsAffected = UpdateSameObjectTwice();
    private It should_not_affect_any_documents = () => Assert.That(DocumentsAffected, Is.EqualTo(0));

    private static long UpdateSameObjectTwice()
    {
        var events = db.GetCollection("events");
        var oldRevision = events.AsQueryable().First().Revision;

        SomeoneChangedDocument();

        var res = events.Update(
            WhereDocHasRevision(oldRevision),
            Update.Set("Revision", new BsonInt32(100)),
            SafeMode.True);
        return res.DocumentsAffected;

/* or with find-and-modify
        var res = events.FindAndModify(
            WhereDocHasRevision(oldRevision),
            SortBy.Null,
            Update.Set("Revision", new BsonInt32(100)));
        return res.ModifiedDocument == null ? 0 : 1;
*/
    }

    private static void SomeoneChangedDocument()
    {
        var events = db.GetCollection("events");
        var evt = events.AsQueryable().First();
        evt.Revision++;
        events.Save(evt);
    }

    private static IMongoQuery WhereDocHasRevision(int oldRevision)
    {
        return Query.And(
            Query.EQ("_id", "1"),
            Query.EQ("Revision", new BsonInt32(oldRevision))
            );
    }
}

Hello world!

This just in:

console.log 'Hello, World!'