A few months ago I wrote about an algorithm to produce a random shuffle of an array. The idea was to use this algorithm for the Secret Santa game, where you have to buy a small gift for another team member. The following is a diagram illustrating how it would work in two steps. First we shuffle the list of names in an array, then we connect each name with the next, with the last name to the first. In this diagram Sue would buy a gift for Phil, Phil for Max and so on until we get to the end and Amy buys for Sue.
A friend of mine, after reading this "not a blog", pointed out that this did not produce all of the possible random combinations. While the idea of shuffling the list of names and then chaining them together in a ring works, it would always produce a random onecycle ring. I have to point out that my friend, Jean Paul (https://degabriele.info/), researcher in cryptography, knows a thing or two about randomisation. He’s right of course. The following example would not be possible in my solution.
Turns out that my Secret Santa problem is actually a reencoding of the random derangement problem in computer science. It’s quite an interesting one. The idea is to shuffle a list of elements in a way that no single item remains in the same position after the shuffling. Think about an array containing a bunch of names. After we shuffle them around, no name should be in the same place. We can then map the starting positions of each name to the finishing ones, showing who's buying a gift for whom. The next diagram shows a derangement that results in creating the two cycles shown in the previous example.
To be more precise a derangement is a permutation of elements in a set, in a way that no element appears in its original position. There is a ton of maths theory around this. For example if we just do a normal random shuffle on a list of items, the probability of having a derangement is 1/e or approximately 0.3679. The proof and theory can be found here:
https://en.wikipedia.org/wiki/Derangement
We can easily implement an algorithm that keeps on shuffling until we have a derangement. Let’s do it three steps. First let’s write something that generates a list of sequential shuffled numbers up to n1 . For example for n = 5, we can generate 4, 1, 3, 0, 2. We can later use this sequence to swap elements in the list we want to shuffle. For example if we have the list with ["Bob", "Phil", "Sue", "Amy", "Max"] , shuffling with the sequence 4, 1, 3, 0, 2 would result with a final list of ["Max", "Phil", "Amy", "Bob", "Sue"]. Here’s the python code to generate this shuffled index. This is simply Knuth’s shuffle algorithm on a list of sequential numbers. Notice how python is one of the few languages where you can swap two elements in an array in a single line! Ah, the little joys in life...
Code Editor
The next step is to check if the generated list is a derangement or not. To do this we can simply iterate over every single item in the sequence and check if the value is the same as its index. If it is for any of the elements, then the list is not a derangement. We can easily do this with a simple ‘for loop’.
Finally we can put the two together. We can keep on generating shuffled sequences while it’s not a derangement and then swap the items in our original list using the derangement.
There are of course many better ways to implement this derangement algorithm. I'm putting a simple implementation here so it can be easily understood. This presentation outlines many other performance improvements:
https://www.cs.upc.edu/~conrado/research/talks/analco08.pdf
In our algorithm we “guess” a derangement sequence and if it’s not a valid one, we throw it away. After implementing the above code, it got me wondering, is there an algorithm that doesn’t keep on trying until it finds a valid derangement? I.e. can we have one with a measurable worst time complexity?
If we have a way to enumerate derangements, we can then generate a random number n and simply choose the n’th derangement. But how do we enumerate derangements? Maybe a good place to start is to try generating permutations. Let’s say we a have a sequence such as 0,1,2,3 … n, how do we write some code that generates all permutations of this sequence? There is a simple way to do this. It’s called the “heap algorithm”. You can find all the details about it here, implementation follows in python: http://ruslanledesma.com/2016/06/17/whydoesheapwork.html
Next we can simply ignore anything that is not a derangement. We can reuse our is_derangement() function for this. If we modify the previous program by calling this function we’ll end up with something that lists all possible derangements.
Now we just need to return the n'th derangement. This is kind of where it gets complicated with the implementation. We need to modify our function to accumulate the number of derangements generated and the n’th derangement itself. In my implementation changing the function so it returns a tuple. I’m pretty sure there is a cleaner way to do this.
Before we modify our main function to choose a random nth derangement, we need to know the upper limit of the number of derangements of the sequence. That is, given n, find the total number of derangements of a set of n elements. Wikipedia tells us that this is the following recursive expression:
countDerangments(n) = (n  1) * (countDerangments(n  1) + countDerangments(n  2)) So we can implement the following function to give us the upper limit (go dynamic programming!):
Finally we can put everything together. First we select a random number between 0 and the total number of derangements. Then we use the previously developed function to get the nth number of derangements.
Right you’re probably thinking… “This is horrible” and you’re right. In fact I think this is my first time ever writing an algorithm with a O(n!) runtime complexity. The algorithm is pretty useless even for small values of n. Trying it with anything bigger than n=10 will have you waiting for a couple of lifetimes. But hey, sometimes we do things just because we can, not because it’s a good idea.
A colleague of mine, Bruno, <link to git profile> came up with a brilliant idea to find a random derangement without throwing away guesses, with a fast runtime complexity. I’m going to try to explain it in a few steps using a sequence of numbers of length 4. As usual we start with a sequence 0,1,2 ... n. The first step is to copy this list, use a normal Knuth shuffle and put the results on a stack:
Next is to pair the initial sequence with the item of the stack. Pop one item from the stack, pair it and move to the next. If the items are equal, you pop another one pair it, and push the previous one back on the stack. In our example this produces the following pairing, meaning that 0 will go to position index 2, 1 goes to 3, 2 to 1 and 3 to zero. But we have a problem when we come to process the number 4. Basically after we pop 4, our stack is empty and we cannot pop another item.
So what do we do? Easy; we always put the first item in our sequence as the bottom item in the stack. This way the last item will never pair with the same value.
But I know what you're thinking... "The last item will always pair with the first item, it's not exactly random!". And you're right. The way to fix this is to do another shuffle on the first sequence or arrayrotate it a random number of times. This way the first item on the list will always be randomly selected to match with a random selection at the end. Here are the 3 steps together.
The above example would have number 3 move to index position 2, 2 would go to 4 and so on. This gives us the derangement of [1,4,3,0,2]. Here is the complete python code of Bruno's random derangement algorithm:
Pretty neat eh? Follow my colleague Bruno on git: https://github.com/brunosousarb And check out Jean Paul's research website, there a loads of great papers on it: https://degabriele.info/
0 Comments
My last nonblog post got me thinking about my past technical interviews. I’ve been on the receiving end of a few of these. I remember questions like "How can you build a system where only your cat passes through the door flap?" or "Can you describe your character in 5 words?". For the record, I hated both of these… I really have no idea what they were trying to evaluate.
What about examples of good ones, the ones I really enjoyed working on? I also had a few of those. In this post I’m going to describe a past employer that asked me to write some pseudocode to implement a queue data structure. However, there was a twist… The queue had a fixed maximum size and I had to use an array as a base data structure.
Let’s start with the basics. A queue is a basic data structure used in many algorithms. It allows you to add an item to the tail of the queue (an operation called enqueue) and remove from the head (called dequeue). The way we usually model a queue is by using a doubly linked list, this gives us a constant runtime complexity for both operations. Here’s a nice diagram to explain this:
Using a doubly linked list, when we need to enqueue, we create a new node and point it to the current tail. Then we get the tail node to point to the newly created node and finally we update the tail pointer to point to the new node. To dequeue an item we do the reverse at the head of the list.
The advantage of using a linked list is that the structure grows dynamically. As long as there is enough memory you can keep on enqueuing items. So why on earth would you use a static structure such as an array to store your queued items? The idea of using an array has advantages. Notice how in the doubly linked list we’re wasting memory. For each item we have to create a container and two memory pointers. For example, if our queued items are 32 bit integers, we would be wasting 4 times the data in just pointers (assuming 64 bit pointers). In addition this constant allocating and freeing of memory space slows things down a little. So how can we model a queue so we don’t have to use all this extra memory pointers? The answer is to use a fixed sized array. The idea is to allocate enough memory for the array at initialization and then use it to store items in a queue like manner. So how would the enqueue and dequeue operations work on an array? How about you ahead and try to write the code for this and then come back to see if you got it right?
Nice to have you back. Did you manage to do it? Which parts were the hardest? I think the hardest part of the problem is caused by the fact that a queue grows from the tail and shrinks from the head. We can keep two array index pointers, one for the head and one for the tail. When we enqueue an item, we increase the tail pointer. When we dequeue, we increase the head pointer. The next diagram shows this.
This is simple eh? We always enqueue at the tail position and then increase it. The slight hiccup with our plan is what we do when the pointers reach the end of our array. The next diagram shows this. The tail pointer is at the end of our array with no more space where to advance. However the front of our array is empty and can be used to contain more items. The solution here is to wrap around (the pacman way) and move any pointer that is at the end of the queue to the front. This applies to both the tail and head pointers. How do we go about implementing this? We can easily implement the "pacman" wrap around using the mod operator. This is also the way we would implement a circular buffer. The next snippet, in python, shows how we can do this for both the tail (enqueue) and the head (dequeue) pointers. More advanced implementations can also throw errors if the array is full or empty. An array is full or empty when an increment in a pointer results in it having the same position as the other pointer. For example if we increment tail pointer resulting in the same position value as the head one we know that the queue is full. If the head pointer reaches tail pointer then the queue is empty. Both of these checks are not included in the next code to keep things simple.
Did you like this article? You can find more of these data structures and algorithms in my various courses and publications on my publications page. Please leave comments and feedback!
Meanwhile, on a Tuesday at 3:00pm: Paul: Hey James, we have an interview scheduled but John is sick and we have nobody to do it as everyone’s busy with this production issue. Would you mind going to interview this candidate for our team? Me: Hmm… I’m a bit busy but yeah sure. When is the interview? Paul: Err… yeah about that… she’s here already, waiting in room 2.08 I hurried towards the interview room with the CV in my hand, speedreading it while trying not to trip on the stairs. When I got there, we both introduced ourselves and spoke for a while about the bad weather recently. I proceeded to talk a bit about what we do in the company, apart from sitting on bean bags (yeah I know, millennials right?) and then asked her if she could talk a little about her experience since I hadn’t managed to read her CV. I then tried to think of a coding puzzle I could ask her. See, we’ve been having some problems with coding interviews. We all have a couple of coding problems and interview problems we like to ask. The problem is that recruiters have picked up on this and we think they are preparing candidates and giving them the solutions. So we’ve all been thinking very hard about new problems. Christmas is close and our dev team decided to do a secret santa gift exchange. This is where you write everyone’s name on pieces of papers, place them in a bag and everyone draws out one name. Then you’re supposed to buy a small gift for that person. I really have no idea what I’m buying… Was considering the xkcd book, but not sure if it will arrive on time... So I thought, hmm… what if we could write a function that given a list of names would output a random pairing? I discussed the idea with the candidate and together we came up with the problem definition and a function signature. Problem: Write a function (in whatever language you want) that accepts an array of at least size 2 containing unique strings. The functions needs to output a random mapping (giver to taker) using the elements in the input such that:
Go ahead and challenge yourself to try to solve it! Then come back and read the solution. Solution: The problem can be broken down into two main areas. There is a randomization part and a pairing part (deciding who’s paired with who, so they only buy and receive a gift once). Let’s start from the easy part which is the pairing. We are 7 people in our team. One way we can decide Santa’s pairing is to walk into a room with a round table and seven chairs and make everyone sit down. Then we can have a rule that says you need to buy a gift for the person on your right. That way everyone buys and receives one gift. Except Robert that is… he has decided to not to take part. He’s now known as stingy Rob. Yes Rob, if you’re reading this, now you know. Anyway, you get the idea. In tech speak we would just need an array of names and pair each entry to the next one. Then the last element in the array is paired to the first one. This can be done functionally in Scala quite easily by zipping a list with its tail plus head: The other part is to randomize the names in the array before the pairing. This is equivalent to assigning random seats to our team on the round table. If only there was an easy algorithm that shuffles the contents of an array for us... Oh  hang on a second what about the FisherYates shuffle? https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle This algorithm, also known as the Knuth shuffle, works by putting all the array entries in a “bag”. Then for each element position in the array it removes a random item from the bag and places it in that array position. It does this “bag selection” by randomly picking a position to swap to within the array. Here’s the algorithm in pseudocode: Thankfully in Scala there is already an implementation for this so we don’t need to implement it. Here is the full implementation. I’m always impressed by how expressive functional programming is. In a couple of lines of code we have implemented the entire thing! Oh yeah you’re probably wondering about the interview. She was very good and solved the problem much faster than I would have! Unfortunately, she won’t be joining my team but a different one. I’m stuck with stingy Rob.
I wanted to learn some Ruby because of an upcoming talk I am giving. I found Ruby quite easy to grasp. Ruby has some great characteristics and I enjoyed working with it. The most difficult part of coding in Ruby is stopping yourself from humming the song by the Kaiser Chiefs’ who coincidentally I’m going to see in a couple of days time. Yes, I know you’ve started humming it too now, I’m sorry… Anyway, this entry is not about the Kaiser Chiefs nor is it about Ruby. It’s about how we learn a new language or technology. If you work in development, you know how cut throat our industry can be if you don’t keep up with new technologies. Sometimes it feels like we’re in a perpetual “learn technology  use technology  learn technology” cycle. To a certain extent most other professions also require you to keep up to date. A doctor for example, needs to keep uptodate with the latest progress in medicine and treatment methods. However, I doubt that the pace of change is quite as quick as in technology. As software developers we are constantly under pressure to learn new things as quickly as possible. So how do we go about it? Different methods work better depending on what type of person you are. You can buy books and go through them page by page working through the examples. Or you might prefer to follow a classroom or online course. I find that using aids such as books and online courses are more useful when the technology is conceptually different from what I’ve used in the past. For example if I’ve never written any functional code, I might need to follow a course if I wanted to learn Scala. Similarly if I’ve never used a relational database I will need additional help to learn how to use MySql. For everything else, I personally like learning by doing. Unlike doctors, if we want to learn something new, we can practise as much as we want without fear of causing any harm. The idea is to pick a problem and try to solve it with the new technology you’re trying to learn. Along the way you’ll learn the new features and tools available in the new language or technology. The main difficulty with this method of learning is choosing the problem or project. If you attempt to pick a random project such as “Let’s develop an online flower shop” it will not work. You’ll quickly get bored of it. When you start hitting difficult problems you start to question why you picked such a useless project in the first place and putting so much effort into it. Soon you’ll drop it and say “eh, I’ll stick with visual basic for the rest of my life”. Pick something that is useful and real. If you are learning javascript as you plan to open an online business, make that online business your learning project. If you need to learn python to start writing deployment scripts at work, learn python while doing that. Of course you’ll need to fail many times before you succeed, so try your pet code in an isolated environment before moving to the real thing. Back to me learning Ruby now. I’m giving a workshop to a coding bootcamp, called “Le Wagon”, https://www.lewagon.com/lisbon in Lisbon in a couple of weeks. The talk’s title is “Beating the coding interview” and is about learning a few tips to solve the puzzles one gets given during an interview. At Le Wagon, students use Ruby as their web development language. I must confess that up until a few weeks ago I knew almost nothing about Ruby. Keeping in mind that I also have a fulltime job, how could I learn the essentials in Ruby in the shortest amount of time? Since the talk is about coding interview puzzles, what better way to learn it than to try to solve some of these puzzles in Ruby? The internet is full of example puzzles. One of the most popular sites offering these coding exercises to candidates is http://www.codility.com/. They have a practise area where you can try solving sample coding puzzles in many different languages, amongst them Ruby. My plan of action was to pick a few these sample questions and solve them, learning about the different bits of language along the way. Here’s how it works in practise. The brackets problem from codility is about writing a function that checks if a string has a properly nested set of brackets "[]{}()". Here are some examples to get the idea: text = "[()()]", result = true text = "[(]()]", result = false text = "[{()}]", result = true text = "[(])", result = false The solution is quite straightforward if you have worked with stacks before. The idea is to process the string, one character at a time, left to right. If you encounter a left bracket you push it on the stack. If you encounter a right bracket you pop the first item on the stack and check that the brackets match (round with round, curly with curly etc…). In the end you make sure you have emptied the stack. Therefore, for the solution, I needed to find out how to use a stack in Ruby and how to do conditional branching (ideally using some sort of case statement). I also needed to loop over every character in the string and check if the stack is empty at the end. Right, a few google searches later I found out: 1. Iterating over every character in a string: Code Editor
2. Using a case statement: 3. Using a stack: 4. Checking if an array is empty: 5. Using an if statement: Now that I had the building blocks all that was left was to put everything together and to try it out (The app on Codility confirmed the next solution is correct): My solutions to codility lessons problems can be found at https://github.com/cutajarj/CodilityInRuby
Happy learning by doing! Using the correct algorithm can make a difference between a happy afternoon coding and a mental breakdown. This applies to other aspects of life… The baker needs to apply the right mix and technique for the bread; unless you want to make your life miserable you need the correct sequence of steps to change the motorbike oil; the list goes on and on. This analogy is especially true for programming problems. I will illustrate this concept by means of a simple example problem: Write a function “evaluate” that accepts an infix expression “expr” as a string and returns the expression evaluated as a double. Function signature:
Examples:
Limitations:
Right… So there are now two kinds of people reading this post. The first kind most likely met this problem before maybe during their university course and know the general idea of the algorithm. The second type is thinking “How hard can this be?!”. I dare the second kind of readers to try it out without googling for a solution or using libraries. Come on, stop reading this post and come back once you’re finished. I know you want to. You’re back? How did it go? How many grey hairs have you grown? Is Trump still President? If you belong to the first kind and you’ve seen the solution before, you’ll be able to solve it in about an hour or two maybe after brushing up on some specifics on the algorithm. For the rest, I will now show you the solution using Scala, because Scala is awesome. First some definitions. Infix refers to the way we normally write expressions. I.e. when the operator is the middle between two operands. Infix: operand1 operator operand2 Examples:
Postfix is another way to represent an expression. In postfix the operator is after the operands. Postfix expressions don’t support (or need) parentheses.Postfix: operand1 operand2 operator Examples:
Although it doesn’t feel like it, it’s way easier to evaluate a postfix expression. When evaluating infix expressions, you need to take care to evaluate it in the right sequence, doing the parenthesis first, then division and so on. Using postfix the natural sequence of the expression dictates the precedence. A postfix is evaluated by traversing the expression left to right and whenever you meet an operator you replace the operator and the two previous operands with the result. For example consider the expression 1 + 2 * 3 + 4 which translates and evaluates to:
Writing the above in Scala in a more functional way leads to something like: Notice how easy it is to implement this using foldLeft and Scala’s pattern matching. The foldLeft is initialized with an empty list representing the stack. Matching is then done on the two top stack elements and operator type. The result is found and pushed on top of the stack. At the end we simply read the top of the stack to get the final result. Ok, so now we have a way to evaluate a postfix expression. All we need now is a way to convert infix to postfix and we are done. The algorithm is also quite simple:
Again, notice the versatility of pattern matching in Scala. In the above algorithm we start the foldleft with an empty postFix list and an empty stack. We then match on the input infix operands and operators and perform the required logic in each case. The “span” function was used here to split the stack in two. This can be described as popping the stack until the given condition is no longer met. For example: Would keep on removing items from the top of the stack until the left parenthesis is found: ( List(*, +), List((, , +) ) All that is left now is to call the two functions in sequence: There… Job done. This problem is a good example that transforming the input into a different encoding and solving turns out to be more simple than trying to evaluate the expression using the original infix notation. For reference the above algorithm is a simplified version of the “Shuntingyard” algorithm by Edsger Dijkstra.
Make sure you run your motorbike for a few minutes before changing the oil. This warms up the oil and helps it flow when draining. It's been a while since I’ve written one of these. Last time I wrote an entry in here I only had one child, worked at another office and lived in a different country. I've been busy and life got in the way... So it's fitting my next post is about timekeeping! How many times, as a developer, have you had to deal with a timezone or daylight savings bug? Myself, I have had to deal with so many that I don't think I can list them all. Probably the most entertaining one was when I was working at a factory site where IT had just installed a new payroll system. The factory worked night shifts and workers were paid by the hour. An employee would tag in and out using a palm reader. Pay would then be calculated based on the number of hours spent on their shift based on the time between when he tagged in and out. You’d think it's easy stuff right? Think again... The system was happily running for almost a year before a number of workers made a fuss about being paid 1 hour less on one of the night shifts. It turned out that on the night when daylight savings ended, the computer turned the clock back one hour. This meant that anyone working that night shift had worked one hour more than what the application had calculated. The HR department noted however that nobody had complained six months previously, when the daylight savings came into effect and everyone got paid for one extra hour that wasn’t real. Bugs in timezones/daylight savings can be more serious. A bank I worked with few years ago moved hosting servers from one country to another. The new servers were mistakenly configured to run under a different timezone. This caused a bunch of stock options to be exercised a few hours late. The bank made big losses of money and reputation. Is keeping time so complicated? It turns out it is... pretty much. Let's try to look at why this is. So what’s a timezone?The idea behind timezones is that wherever you are in the world the sun should rise and set at more or less the same time. Have a look at the picture above. This shows a simplified view of how this should work. The world is first divided into vertical slices (called meridians). An arbitrary one is then picked and called the prime meridian. Going west you reduce the hours with reference to the prime meridian, going east you increase it. You keep on doing this until you meet on the other side of the world. Once there, you take a swim in the Pacific, toss a coin on whether it should be a 12 or a +12 and congratulate yourself on a job well done. One of the problems is that timezones weren't decided by one guy on gap year going around the world. They were decided and modified throughout history by various governments in many countries. To say this caused a lot of complications is the understatement of the year. Why, you might ask? Let me demonstrate with an example. Can you guess what's so special about the group of islands of Samoa shown here in the picture? The government of Samoa didn't like that it was the last country on earth to go to sleep. In December 2011 they decided Samoa would switch timezone from UTC11 to UTC+13! Yes you read correctly. Timezone UTC+13 should not even exist. The government did cite economic reasons for doing this, such as their biggest trade partner being Australia, but really we all know they did it so they can be the first to say “Happy New Year!”. This created all sorts of weirdness. For example anyone living in Samoa had to wipe out an entire day off their calendar, the 30 December 2011. I wonder what people did if they had a flight booked on that day or say a dentist appointment? American Samoa, just 30km east is now 25hrs behind. Crazy stuff eh? The history book is full of examples like these. Nepal has a timezone of +5:45, Mumbai +5:30 and countries like Portugal and Spain have different timezones even though they overlap! It's fair to say that keeping time across countries is a hairy business. The situation gets even more complex when you introduce daylight savings. Daylight savings was introduced during the war (apparently to save fuel). The idea is that in summer when days are longer time is moved forward by one hour. Although apparently there is a place called Lord Howe Island (Australia) that only moves half an hour... What is the problem doc?Let me explain by using an example. Let's say you arranged a meeting at 1:30am on the 25th October (in the UK). Most countries in Europe, on this date, move the hour back to winter time. At 2am the clock is moved back to 1am and everyone gets to sleep an hour more, well unless you have a meeting at 1:30am. Now if you think about it, you will live through 1:30am twice on that night. Will the meeting happen on the first 1:30am, the second or will you just have to hold it twice? This isn't a big problem right? Nobody holds meetings at this hour. Well what if you had a flight departing at that time? Or maybe you have a night shift starting at that hour? Or a nuclear reactor where the rods need to be retracted at exactly 1:30am ONCE? There are loads of other problems such as: 1. Countries change timezones 2. Countries change daylight savings rules 3. Converting between timezone for a specific historical date requires political history 4. Switching from winter to summer time and back causes problems Ok fine, let's look at the tools available so we can make time keeping easier. Standard TimesThese come in various names, GMT, UTC, Zulu time and many others. They have basically the same idea. Have a time defined that isn’t affected by timezone or daylight. Standard times evolved from when British naval captains used to keep one of their watches to the time at home (UK is on the zero meridian). The timezones of the different countries they sailed to used to be defined relative to this time. This standard time is what is known today as GMT (Greenwich mean time). From this evolved UTC (coordinated universal time) which is what is used today. A timezone is usually referred to as UTC+/X. Why is standard time useful for us? Because it gives us an easy way to convert from one timezone to another. If the time is 2pm in NY UTC5 what is the time in Spain UTC+1? Simply add 5 to convert to UTC and add another 1 to convert UTC to Spain. Timezone databaseIANA maintains a historical timezone record for all countries. This lists all the timezones and daylight savings applied to countries over the years. Typically the database goes back up to the 19th century. It also contains future changes to timezones that have been publicly announced by the governments. The database is updated frequently. Here’s an excerpt taken from an update done in March 2015: “Mongolia will start observing DST again this year, from the last Saturday in March at 02:00 to the last Saturday in September at 00:00. (Thanks to Ganbold Tsagaankhuu.) Palestine will start DST on March 28, not March 27. Also, correct the fall 2014 transition from September 26 to October 24. Adjust future predictions accordingly. (Thanks to Steffen Thorsen.)” The database keeps its histories in two sections; one dealing with timezones and the other with daylight savings. Here’s another excerpt taken from the actual database: The history shows that in 1942 Portugal moved the clock forward twice, in March and then in April. In August and then in October it moved the clock backwards again. Apparently this double DST happened in the UK too during WW2. Fascinating, right? In the second section we see that Portugal was on the same timezone as Spain (GMT+1). This lasted from 1992 to 1996. My guess is that after 4 year of eating dinner late they decided that having dinner in the middle of the night like the Spaniards was just too late for them. LibrariesLibraries and tools for the IANA database exist in every programming language. Many languages (such as a JVM language) come bundled with timezone libraries, however these need to be updated from time to time to the latest IANA database. For example in Java you can update to the latest database either by keeping up to date with the last JVM release or simply running a tzupdater tool periodically (just google “tzupdater Java tool” for more details). Ok now what?So we saw the tools available to tackle the complexities of timekeeping. How do we use it? The answer is surprisingly simple. To explain this let’s look at an industry which has solved this problem already. The aviation industry has solved this problem by completely ignoring it. Instead of dealing with each individual country and its timekeeping politics, they decided “screw it, we’re keeping our own time and we shall call it the ZULU time”. The Zulu time happens to be equivalent to UTC, the difference is that it’s written in a different format. For example here’s a weather report used by pilots (pilots use this format as it’s concise, unambiguous and they can read it really quickly): EIDW 241430Z 11007KT 050V150 9999 FEW024 SCT250 20/12 Q1021 NOSIG The bit in bold (1430Z) shows the time this report was issued, 2:30pm UTC. Aviation uses this time format for everything and anywhere on earth. Examples include radio communications, airport time slots, ETAs, forecasts, log reporting etc… This simplifies things and makes everything unambiguous, thus avoiding situations like “Controller: Err… when you say, you’ll be over Italy at 12:35 is that in the departure, arrival or Italian timezone?” The idea here is to use the same technique when writing our applications. Every time you start thinking about a clever algorithm to convert timezones or to negate the effects of DST: It’s a pretty simple solution right?
Any exceptions? Well obviously yes. But it’s only a simple one really. Only convert to the local timezone when you want to display a date to a user. For this you just display the UTC date converted to the timezone that the user is located in. EASY! Come to think of it this is what aviation does too: “Ladies and gentlemen, welcome to Malta Airport where the local time is 9:30am and the temperature is 32 degrees. For your safety and comfort, please remain seated with your seatbelt fastened until the Captain turns off the Fasten Seat Belt sign.” Ok here's the exercise: find me a data structure which can accept random key/value inserts and which I can query faster than O(log_{2}(n)). I might be wrong on this one but I don't think one exists. I know what you're thinking... Hashtables. Well think again, remember CS101? Hashtables have a big O of n. Hashtables, while they're on average really fast have a worst case of N. There is an easy way to remember this in case you're asked in an interview. Think of a chained hashtable with a very bad hash function where every key is added to the same entry. This will result in all the keys in the same chain giving a search performance of O(n). So what else have we got? Arrays are O(1), but they're not suitable for random key/value inserts. Well unless you have infinite memory. Anything involving a list (Linked, Arrays and vector lists) is slow with O(n). So the only thing remaining is tree structures. A binary tree is actually also O(n). Think of inserting a sequence of numbers 1100 in ascending order... you'll also end up with a sort of linked list. Balanced binary tree, on the other hand, is a different animal. This guarantees searches in O(log_{2}(n)). There are various flavours of these, such as AVL and redblack trees all of which make good reading on Wikipedia. Here: http://en.wikipedia.org/wiki/Selfbalancing_binary_search_tree Still, my claim that you can't search faster than O(log_{2}(n)) stands. So what about faster tree structures? Things such as Btrees where nodes can have more than 2 children? Ok. So let's ignore the important mathematical fact that O(log_{x}(n)) = O(log_{y}(n)) as practically speaking log_{100}(n) will be a bit faster than log_{2}(n). However, again practically speaking, a Btree with order x does not have a performance of log_{x}(n), it is still log_{2}(n). Shocker! The height of a Btree might be log_{x}(n), but this does not reflect its search performance. To understand this, you need to consider what happens in each node. In a Btree (or any other higher order tree structure), you need to store a list of keys and pointers to the children. Here's a picture of what Btree nodes look like: As you see the keys can be stored in an ordered fashion at each node. This lets us do a quick binary search on each level hop. And the performance of this binary search is? Yes log_{2}(x). If x is the order of the tree, this gives us: height of tree: log_{x}(n) performance of each key node search: log_{2}(x) total performance: log_{x}(n) * log_{2}(x) simplifying: log(n)/log(x) * log(x)/log(2) log(n)/log(2) log_{2}(n) TADA! What happens when you keep on incrementing an integer past its maximum? Most competent developers would answer 'It would go to something negative'. Some would even be able to explain why and detail the way two's complement works. But what happens when you get a double and keep on dividing by two past its precision limit? Most would answer 'Er... I don't know. I have to try it' or 'Hey, let's play online pong'.
Floating point representation in binary can be a pita to grasp. Not because it's more complex than the simple two's complement but it's because it's something you don't really care about since it just works. . You would rather watch cat clips than spend time reading about it. The problem is that we work with floats all the time, from them representing money to asteroid speeds. I've seen so many simple coding bugs because a lot of developers don't get the basics of floating point representation. So here's a quick refresher which hopefully is more interesting than watching cats on YouTube: Floats and doubles in Java (and most other languages) use the IEEE754 standard which stores things as: C x B^E => example 2.625 x 10^2 where C = 2.625, B=10, E=2. C is called the coefficient, B is the base and E is the exponent. But who cares what they're called, right? Now the IEEE standard is pretty complex and specifies various formats but in summary 'Singles' (java floats) use 1 bit for the sign, 23 bits for the coefficient and 8 bits for the exponent. Doubles use 52 bits for the coefficient and 11 for the exponent. The standard specifies two values you can use as base, 2 and 10. Languages such as C, Java C# and many others use the base 2 representation for the float and double primitive types. So let's pick the example 2.625 again and see how to represent it. This can be expressed as 2.125 x 10^2 with: C = 2.625 B = 10 E = 2 Next step is to convert to base two. Converting the coefficient is done by using something like: ...[8][4][2][1].[0.5][0.25][0.125][0.0625]... 0 0 1 0 . 1 0 1 0 Thus: C = 10.101 B = 2 E = 10 Does this sound like thermo nuclear dynamics? Not really... So you might ask... why do so many of my website users report problems with their balance? Why is this guy’s balance showing as 2000.0000000003176?! Here's a quick test (yes it's scala) that shows the main problem of base 2 floats scala> (0 until 10).foldLeft(0.0)( (x,_) => x+0.2) res13: Double = 1.9999999999999998 What is going on here? It's simply adding 0.2 ten times. Why is it losing precision even on such a simple operation? Does this look familiar to you? Probably. I see this type of problem over and over again and it confuses a lot of programmers. Often this confusion results in 'creative' hacks. Most common thing to do is to round to some precision after each addition or using BigDecimal. BUT THIS IS THE WRONG WAY! 'I know I've hurt you, and I've not been true ... But I know I can turn it around' Ok... Keeping drum and bass artists out of this, let's understand what's happening here. The explanation is pretty simple. How would you represent 1/5 in binary? You can't... You can only estimate it. Try it using the above conversion process. You'll end up with something like: 0.010101010101... However, 1/5 can be represented in decimal accurately as 0.2. What about for example 2/3? In decimal you can only estimate it (0.666667). Basically the base you decide to use dictates what rational numbers you can represent accurately or not. 1/3 can be accurately represented if you're using base 3 or 12 but not in base 2 or 10. We would probably be a more advanced race if we had 12 fingers... (see http://en.wikipedia.org/wiki/Duodecimal#Advocacy_and_.22dozenalism.22) So what do you do about it? NOTHING! So what if humans speak base 10 and computers base 2? We shouldn't force machines to use base 10 (such as using BigDecimal) or hack around and 'round' floats everywhere. We simply need to convert estimated base 2 floats to proper base 10 when presented to users. For example if a user's balance is $150.20 it's fine to store this in memory/DB as 150.1999999999999999 using the full float/double precision without any rounding. But when you're displaying the balance to the user you just show it by rounding it to $150.20. The rule here is to never round and store. The only rounding happens for presentation and it's never stored. Wouldn't my application start leaking cents/pennies? Yes... but you shouldn't lose sleep over it. Let’s look at an example. Say you're using double to sum up every transaction that happens in your shop. Say you sell 500 things a second, 24x7 and every single operation is a double which can't be represented precisely in base 2 (very unlikely). In this scenario you'll end up losing something like 2c a year. While this inaccuracy might be crucial if say you're sending monkeys to the moon, it's unlikely to bankrupt your online business selling cat grooming products. The firm I'm working with was in the process of hiring some new developers. I was asked to come up with some programming tasks they could ask candidates during the interviews. These were of the problemsolving type, ones where you need to write an algorithm and solve a particular problem. The interviewer sits with you during this process and assesses your problemsolving and communication skills. I've been on both sides of these kinds of interviews many times. In general I think they are a good idea. Problemsolving and good communication are the best skills a developer can have. These tests can give you an insight into both. But here's the catch. The problem needs to be right and the interview conducted in the correct manner. Asking the developer to implement a distributed btree in 30 minutes is not really going to prove anything except that you're a big nerd. So we researched hard about what problems to give candidates. We came up with a handful of programming tasks, roughly of equal difficulty that could realistically be completed in 45 minutes. Each interviewee was then given a random problem to solve and was encouraged to ask questions and speak their thoughts (explain their reasoning?) as they were 'searching' for the solution. When talking about this to another developer friend he told me how he had recently had a similar programming interview (with another firm) and failed. I asked him to describe the question he was asked and he told me it was done online with a website called 'codility' and that he had only 30 minutes to complete. He emailed me a summary of the problem and it was something along these lines: Problem 'Count the cycle' You are given an array of integers: int numbers[] Each element in this array represents an index on the same array. Your algorithm needs to follow the path from index 0 and count how many elements are 'stuck' in a cycle. Your function should return this count. This is best explained in an example (array indexes start from 0): numbers = {3,2,4,4,1} Starting from index 0 you read its value(3) and jump to index 3. You then read this value(4) and jump to index 4 and so on. So following this process you'll end up getting: '3,4,1,2,4,1,2,4,1,2,4...' In the above the cycle it would be 1,2,4 and it should return 3. Easy right? No... an extra constraint is that it needs to have a worst time complexity of O(n) and worst space complexity of O(1). The question also specifies that the input array does not count as 'space'. So basically you can modify it without incurring any space complexity hit. Here's my trail of thought on this, the first time I read the problem: Reading question... Reading question... Oh this is way too easy. This is just a linked list in an array, just put stuff in another linked list and count backwards after the first... 3 minutes Crap there is a space limit of O(1)! No way this can work! How can it be possible? Oh hold on... I can modify the array.. aha... Ok think... think... 5 minutes I have to store some 'cycle' state in the array itself. Hmmm... crap 10minutes have passed already... ok focus. We need at least two sorts of flags. One when we are doing the first pass and another when we're stuck in the cycle counting nodes. 12 minutes Ok ok... I think I got it. How about negating numbers on the first pass and stopping when you hit the first negative number? Then a second pass making numbers positive, this time counting until you hit the positive tail. Something like: int index = 0; 18 minutes NO! This won't work if we have a zero! Balls... Ok maybe we can shift everything forward by adding 1... actually hold on we can shift everything forward on the first pass and then everything backwards on the second. Right yes this should work. I just need to add a large number to everything and then subtract it on the second pass. The large number can simply be the array length: int index = 0; 25 minutes Does this work? Let me write some tests... yep looks like it's fine. Is it time O(n)? Well it's really O(2n) but that’s fine as O(n)==O(2n) yep. Is it space O(1)? Yep same argument as for time. 30 minutes Phew... Just made it in time... I wonder if I would have made it under pressure in a real interview? Probably not. I didn't feel great with my solution having solved it with only seconds to spare. So I decided to pass it on to some other developers at work and see how they got on. So one guy solved it in about the same time as me (on a piece of paper) and two others spent almost a full day thinking about it and finally quit and came to ask me for the solution. Does it mean anything? Am I better than my colleagues who couldn't solve it? While it does make me feel better about myself, the reality is that I was lucky to have a light bulb moment while the others didn’t. This doesn't mean that my way of thinking is better. I discussed this with one of the guys that couldn't get it and he told me that he was thinking of ways to solve it with the space constraint but he couldn’t figure out how to encode the 'cycle' information in the array itself. In fact, as soon as I gave him a small hint on shifting numbers he quickly got to a similar solution (he shifted the numbers twice forward instead of one forward and one back). This is why I believe the choice of question is very important. Choosing the wrong question is like asking someone to solve the famous 'Crossing the river with a wolf, a goat and a cabbage' problem. Anyone who comes up with a right answer in a few minutes does so because: 1. He has seen/heard it before 2. He's lucky and had a eureka moment thinking 'hey the goat can cross back' 3. He's a farmer and has goats, wolves and cabbages crossing rivers all the time The moral here is that the 'wolf/goat/cabbage' problem just like the 'Count the cycle' problem does not prove you're any better for the job or that you are more intelligent or that your problemsolving skills are any better. In fact, it doesn't show much at all. So what should the interview programming questions look like then? In my view it's difficult to get it just right. The problem needs to be easy enough to dot in the time allocated but hard enough so not everyone can do it. It should avoid being a puzzle but instead be a problem where the candidate can show his/her problemsolving skills and algorithm design. Problems that can be broken down into multiple smaller ones tend to be better and so are ones which require the use of multiple datastructures types. Problems like: 1. You have an array of random integers. Find the first nonduplicate number in the array in O(NlogN). 2. You have two sorted arrays of size N and M where N <= M. Output the intersection in O(N) 3. How can you implement a linked list in an array (fixed size limit). I know. The above are not exactly rocket science but neither is the 'Count the cycle' problem once you solve the encoding puzzle. And as for the wolf/goat/cabbage problem the cabbage didn't ask to cross bridges anyway. Disclaimer: this is in no way saying codility is a load of rubbish. I've worked on some of their sample problems and I think they're very good for filtering out candidates that have no idea about computer science. It still amazes me that sometimes we get someone with years of experience on his CV who cannot write a simple findMax(array) function. So codility is a great tool for filtering some of the worst. I just think that whoever uses codility or similar services needs to be careful about the problems they choose and try to stay away from bad ones like the above. _A while back I had the pleasure of working with an occasionally grumpy developer. Nothing frustrated Matt as much as badly written code. He used to sit at his desk and from time to time you could hear ‘WTF?!’, ‘What were they thinking?!’, ‘Who wrote this?!!’ and ‘What the hell is this doing?’. I enjoyed pair programming with Matt, not only because he was an excellent developer, but also because he used to come with the funniest terms for how not to do things. Every time he came up with a new phrase for an anti pattern, I opened a text file and wrote it down. Here is some of the stuff he came up with. The Ultimate Unit Test of Doom We have all seen this many times. It’s surprising how often developers manage to fall into this trap. When you come across a unit test starting with an initialisation of a spring context file or even connecting to a database, and then doing 10s of assertions in one test case, you know you’ve come across the Ultimate Unit Test of Doom insert dramatic music here Such tests usually contain only one test case and it might span a few hundred lines of code. It might check if the stored proc exists and then call a service method and expect a table in the database to be populated. It is testing a whole subsystem instead of a unit of behaviour. The next code snippet shows one such example _The evangelists of unit testing preach that unit testing is not about testing at all. Unit testing is all about specifying behaviour. When you’re writing your unit test if you’re asking yourself ‘Is the system working as it should?’ you know you’re going down the route of the ultimate unit test of doom. Instead you should be asking yourself questions such as ‘If I put these inputs in this interface, once I have an implementation for it, it really ought to be returning me these kind of outputs’. More than a testing framework, unit testing is a thought process which makes you think ahead about how to build the small units of your software and how they interact together. Pass the Banana We came across this anti pattern when we were trying to understand how a system property was propagated through the system. The ‘Banana’ was a boolean flag that was passed as a startup main argument. The flag was passed from one method to the other, from class to class until it got read and used at the bottom of the stack. The following code illustrates an example of a ‘Pass the Banana’. _Most developers would see the above and ask ‘Why isn’t startThreadPool a system property and whatever class needs to access it can just read it directly?’. The thing is, such bad practises aren’t usually written by just one person. They usually evolve over time through incremental changes by different developers. The above example most likely started with simple implementation where the main method started the thread pool directly. As the system got more complex and larger, the call to start the thread pool moved further down the layers. Each developer kept passing the banana every time he added a new class or method. Solving these kind of issues and keeping code complexity down is not an easy task especially when you have a large team and tight schedules. Allocating time for refactoring and improving code quality in each sprint is key. A code analyser tool helps, but its not enough. It also comes down to each developer’s attitude and willingness to improve code quality. If each developer took one step in the right direction, instead of a ‘pass the banana’ pattern, it would he the ‘happy hour’ pattern... The one where developers are down the pub instead of debugging horrible code. Martin Fowler wrote a good article on opportunistic refactoring at: http://martinfowler.com/bliki/OpportunisticRefactoring.html The Monolithic method of Death This is quite simple to explain. It’s a method that takes 7 arguments or more, is called ‘process’ or ‘execute’,is about 200 or more lines of code and has horrible multiple nested ifs and loops. You know what I mean... we’ve all come across one of these..._ _In my experience people underestimate the impact on productivity of bad code and tech dept. It’s also very hard to explain to non technical business people that a part of the system needs redeveloping. Why should I pay you to write code that is going to do exactly what the current code already does? It is a very valid argument. In an ideal world it should never get to at point where a system needs to be reimplemented. Coding should be an evolution. Small incremental improvements that make the code more modular, readable and easy to scale. Unfortunately this is hardly ever the case. Schedules are tight, pressure builds up and corners are cut. Again there is no technical fix for this. It all boils down to methodology and individual attitudes and some of the team dynamics. Are the correct development practices being followed? Is there enough time in each sprint for refactoring and reducing tech debt? Is there good team communication?
Poo in a File This is not really an anti pattern. It’s more something Matt did and made some people laugh. One morning I updated my code from subversion and I noticed a new file called 'pooinafile.txt'. When I opened this I found the text 'poo' in it. It turned out that Matt was testing why his bit of code wasn't reading from the file system due to some permission issues. He had created the file and committed by mistake. Since then we kept on using the phrase every time someone committed a file that didn't belong in the repository. Matt, wish you had named your file something a bit more exciting... 
AuthorJames Cutajar is a software developer, with interests in high performance computing, algorithms design and distributed data structures. Archives
May 2019
