Tuesday, January 24, 2012

Spotify Puzzle in Java, Ruby, Prolog and ... Common Sense


I like music and rarely work without a nice track playing in the background. Pandora and MOG are always with me, on my laptop, on my iPhone, on my Kindle. I believe my TV has a Pandora app and I saw one running on my neighbor's fridge (fancy that). Last week I figured I would check out Spotify just to make sure I wasn't missing out on something even better. Facebooking is not something I do so that's how I missed all the buzz I guess. Anyway, long story short, I didn't chose Spotify over MOG but I did find myself a puzzle to spare a few night hours with.

These guys posted a few puzzles for their candidates to try before one would consider applying for an engineering job at Spotify. As of time of this writing there are three, each labeled with a music genre - reggae, funk, and heavy metal. I would almost always prefer heavy metal over reggae and funk given no other choice so I knew which one I would tickle my brain with.

The Bilateral Projects is about finding an optimal solution to what sounds like an easy problem. You have project teams of exactly two individuals each from one of two different offices and you need to invite the smallest amount of people to a company conference to save on expenses while having all projects represented. Same people can work on more than one project hence a potential for savings. Should the problem have more than one optimal solution you would pick one that would invite your friend otherwise you wouldn't really care. Oh, and a very important detail as we will soon learn. The input may ask you to optimize for as many as 10,000 project teams.

Java


First, let's do a careful, defensive-style object oriented brute force solution in Java. We would find all possible solutions, reduce the list to only optimal ones, and finally pick one that ideally has our friend in it. Clearly not a solution for large data sets but a good way to test the waters and see what we've got here.

Note: if you're here for a solution then you should scroll down to the Dominating Set chapter, the Java part won't be of much interest to you. O(2^n) isn't really a solution for anything larger than a handful of teams.

The full code is on github and here's the important pieces. Input data is parsed into a collection of Team objects each referencing two Employee objects (we could clearly work with primitive arrays but that's something I left for Ruby). It's now time to build a list of all possible solutions to the puzzle. There are algorithms to build all combinations and permutations but we need a slightly different thing that can be expressed in a much simpler way. We only need one of the two team members from each team which means 2^n choices that we can visualize as a binary tree. A tree as deep as many teams we have branching out and doubling the nodes on each level to have 2^n on the bottom leaf level. Take a look.

Each line in the dataset on the left represents a team of two employees referenced by their IDs. Full solutions tree would look like the one to the right:


1009 2002
1009 2021
1002 2002
1003 2038
1003 2002
1021 2021
1022 2024
1088 2031


And the code does exactly that. It loops through all teams, accumulates a List<Set<Employee>> (a path from the root down to the end leaf node is a set of nodes you need to walk through, and outer list as a collection of all possible ways down), creates a new solution branch on each step adding employees from the team as leaf "nodes":

private List<Set<Employee>> generateAllSolutionsList() {
    final List<Set<Employee>> accummulatedSolutions = new ArrayList<Set<Employee>>();
    accummulatedSolutions.add(new HashSet<Employee>()); // root of the solutions tree
    
    for (final Team team : projects) {
        final List<Set<Employee>> newBranch = new ArrayList<Set<Employee>>();
                
        for (Set<Employee> solution : accummulatedSolutions){
            final Set<Employee> solutionSibling = new HashSet<Employee>(solution);

            solution.add(team.getLondonEmployee());  // "left" leaf node (existing branch)
            solutionSibling.add(team.getStockholmEmployee()); // "right" leaf node (new branch)
            
            newBranch.add(solutionSibling);
        }
        
        accummulatedSolutions.addAll(newBranch);
    }
    
    return accummulatedSolutions;
}

Having a superset of all solutions we now just need to reduce it a few times to get to the bottom of it. First, let's sort them all and get that minimum size we were looking for:

final List<Set<Employee>> allPossibleSolutions = generateAllSolutionsList();
Collections.sort(allPossibleSolutions, 
        new Comparator<Set<Employee>>() {
            @Override
            public int compare(Set<Employee> set1, Set<Employee> set2) {
                return set1.size() - set2.size();
            }
        });

final int smallest = allPossibleSolutions.get(0).size();

Now reduce it to only optimal solutions:

// reduce to only optimal solutions
final List<Set<Employee>> optimalSolutions = new ArrayList<Set<Employee>>();
for (Set<Employee> s : allPossibleSolutions) {
    if (s.size() > smallest) {
        break; // sorted list so we know all others are even larger in size
    } else {
        optimalSolutions.add(s);
    }
}

Finally, try to find that best solution that has our friend in it:

// SPEC: If possible (subject to the set of people being smallest possible), 
//       the list of invitees should include your friend
final Employee myFriend = new Employee(MY_FRIEND_ID);

// reduce to those having our "friend"
final List<Set<Employee>> bestSolutions = new ArrayList<Set<Employee>>();
for (Set<Employee> s : optimalSolutions) {
    if (s.contains(myFriend)) {
        bestSolutions.add(s);
    }
}

final List<Set<Employee>> solutions = !bestSolutions.isEmpty() ? bestSolutions : optimalSolutions; 
final Set<Employee> solution = solutions.get(0);

Given the data set we used to illustrate the solutions tree the code would find a 5 people solution: 2002, 2021, 2038, 2024, 2031.

Dominating Set


Before we move on to other languages and potentially better ways to solve the heavy metal puzzle let's see what we're dealing with. If we plot employees on a chart and cross-link people who work together we'll end up with a graph (or a set of graphs) with employees as vertexes and projects and edges. Here's one for the dataset we've already looked at:


I used orange coloring for the employee nodes that represent the optimal solution. You see how all other nodes not colored are connected to at least one colored node by an edge? It make sense, right? We want to pick employees to represent all projects so we should have all edges "covered" by the highlighted set. In graph theory a set like this is said to be a dominating set. And there's apparently a dominating number that represents the number of vertices in a smallest dominating set... As wikipedia says, the dominating set problem [..] is a classical NP-complete decision problem [..] therefore it is believed that there is no efficient algorithm that finds a smallest dominating set for a given graph.

Right. Why would otherwise folks over at Spotify have called this one a "heavy metal" puzzle? :)

Ruby


My original idea was to illustrate that one would use much less words to express same thing in Ruby but rewriting a brute force solution in Ruby is no fun so let's optimize it a bit. Instead of building a list of all potential solutions let's just try to get straight to that optimal one.

The idea is to:
  • Try to figure out who is more likely to make it to the final list of attendees
  • Add that likely candidate to the attendees list and reduce the list of projects removing all that the selected candidate works on.
  • As we reduce the projects list we will keep notes of other people who participate in those being removed. This way we will know we don't need them unless they represent other projects
  • For the teams that don't have a clean "winner" we would take either one giving a preference to our friend (if he or she is one of the two) otherwise we'll try to balance between the two offices

Our Solver will keep track of projects, employees, and selected attendees (oh, and full listing is of course available over on github):

class Solver
 attr_accessor :projects, :employees, :attendees

end

When solver initializes itself from the input data it builds a list of projects (a project is a simple two elements array and a list of projects is an array of project arrays), and also calculates how many projects each employee participates in. The latter we would keep in a hash with employee ID as a key and a number of projects as a value:

def initialize(fileName)
    # a project is a team of two. collect all projects into an array
    @projects = File.readlines(fileName).grep(/^\d+\s\d+$/).collect {|line| line.split(/\s/)}

    # employee ID as a key and project participation index as a value, default to 0
    @employees = Hash.new(0)

    # calculate participation index
    @projects.flatten.each {|e| @employees[e] += 1 }
    @attendees = []
end

well, it's time to optimize it:

def optimize()
    # project with the most "active" employees and ideally a project with a clean winnder
    project = @projects.sort_by! {|p| (@employees[p.first] - @employees[p.last]).abs}.last
    
    @attendees <<
        if (@employees[project.first] != @employees[project.last])
            # clean win
            (@employees[project.first] > @employees[project.last]) ? project.first : project.last
        else
            # a tie with a prefernece to "a friend"
            project.include?("1099") ? "1099" : project[rand(2)]
        end
    
    # discard projects that the attendee participates in 
    # and reduce  participation index of the individuals on those teams
    @projects.reject! {|p| p.include?(@attendees.last) && p.each {|e| @employees[e] -= 1}} 
    
    optimize() if @projects.length > 0
end

That's it. A bit of "side effect" programming and we're done. It generates a slightly different though still optimal solution: 2002, 2021, 2024, 1003, 2031.

Note: project list is sorted in place using the ! version of the sort_by to keep subsequent sorting close to a best case O(n). There's still room for improvement though: we could stop recursing and re-sorting once we know the rest of the list are teams that don't have "shared" resources on them. We could also do a more intelligent sorting to push project teams where people are equally shared across more than one projects first when there's no more "unbalanced" teams. I am jumping ahead of myself a little bit but my tests didn't show any noticeable difference from those improvements so I voted for smaller and cleaner code, not the code that pretends to be "smarter" when it's not.


Prolog


I have to admit I spent a few evenings "learning" it before I could express a solution to the puzzle in a declarative way. I thought the experience I had with inference systems in a comfortable environment of imperative languages (JBoss Drolls, for example) will be enough to just "get" it and boy, was I wrong. It's predicates, not functions; it's unification, not assignments; it's terms and atoms, not variables; and on and on it goes. If you never programmed the logical way I dare you to try! Like Bruce Tate wrote about Prolog in his Seven Languages in Seven Weeks: "if this is your first exposure, I guarantee either you will change the way you think or you'll fail". A good prime on predicate calculus might also help.

Prolog's approach to solving any kind of problem you would throw at it is all about permutations. It's optimized to proof search and backtrack the decision tree so we know the puzzle will be solved with the "brute force"'s O(2^n) "complexity". Let's see:

Firs, we define the attendees rule. It starts with a terminate empty goal to make sure the recursion has where to stop. Then we say that either one of the two employees is a good solution for a one team puzzle, and then we recursively solve it for a list of teams. attendees works on a list of tuples and in this form would report the entire solutions tree (just like that one we build in Java). The sort predicate at the end will make sure the solution set is free of duplicate employees (those shared across more than one project teams thus listed more than once).

attendees([], []).
attendees((E1, _), [E1]).
attendees((_, E2), [E2]).
attendees([H|T], Solution) :- 
    attendees(H, AttendeeFromFirstTeam), 
    attendees(T, AttendeesFromOtherTeams), 
    append(AttendeeFromFirstTeam, AttendeesFromOtherTeams, Accumulator), 
    sort(Accumulator, Solution).

It's now time to find that optimal solution that ideally has our best friend in it. Let's collect the list of all solutions, sort it by length (assuming lists are free of duplicates), push lists with our friend to the front among same length lists, and then just pick the first one. We are going to need a custom sorting predicate:

optimize(Result, ListA, ListB) :- 
    length(ListA, X), 
    length(ListB, Y),
    (\=(X,Y) -> compare(Result, X, Y) ; (member(1099,ListA) -> Result = (<) ; Result = (>))).

optimize compares by length when two lists are different in size and, when two lists represent an equally optimal solution, returns the one with our friend in it (if found) or the first one. Here's the solve goal:

solve(AllTeams, OptimalSolution) :- 
    setof(Solution, attendees(AllTeams, Solution), UniqueSolutions),
    predsort(optimize, UniqueSolutions, [OptimalSolution|_]),
    length(OptimalSolution, L),
    print(L).

It collects all permutations as delivered by the attendees goal, sorts them with the optimize, picks the head of that list and prints its length. When solver.pl is loaded you can ask Prolog a question like this:

solve([(1009,2002),(1009,2021),(1002,2002),(1003,2038),(1003,2002),(1021,2021),(1022,2024),(1088,2031)], Solution).

and it will gladly respond:

5
Solution = [2002, 2021, 2024, 2031, 2038].

You can find projects.pl on github. Note: Prolog dialects vary. I used SWI-Prolog and then tried to run it in GNU Prolog and it didn't work. GNU version didn't like the predsort/3 which I believe is SWI-Prolog "extension".

Let's Rock!


There's not much examples on the Spotify website so I had to generate a few data sets to see how well we can cope with the puzzle, at what point each solution breaks, and whether it finds that optimal solution. The generator that I quickly put together is capable of building various setups based on a few parameters: number of teams/projects, how many employees from each office to consider (a density factor basically and a way to do balanced and unbalanced participation), and finally whether to "pad" it with our friend a few extra times. It produces data files with the names like this - dataset_A_B_C_D - where A is how many teams we asked it to generate, B and C stand for how many Stockholm and London employees it randomized project participation across, and D is for up to how many times it tried to squeeze our friend into the project teams. You can find the generator on github as well.

Let's start with a few simple tests. 20_7_4_0 is a slightly unbalanced setup but it's a good test. We know right of the bat that the optimial solution can't require more than four employees. It can be less if rand(4) didn't spit out all the options though it's unlikely for 20 runs. So let's see: Java solution said it's 4 very quickly, so did Ruby version, so did Prolog.

30_10_10_2 is already large enough to keep Java spinning and eventually run out of heap space. I gave it some more and it ate it all up and was still not satisfied. That was to be expected with exponential complexity and all those Set objects. Prolog ran out of local stack recursing into the solutions tree. Poor guy. Ruby version was very fast telling me it needed only 9 employees. I would expect it at max to be 10 so 9 is believable. 100_30_20_0 reports back it needs 20. 100_80_60_0 says 51. (your experience may vary as data sets are essentially randomized). 100_500_500_0 says 85. Again, I would expect the number to be less than 100. So far so good.

500_50_50_0 reports... 53. 5,000_500_500_0 reports ... 563. And the edge case of 40,000_999_999_0 took a few seconds to generate and about 20 seconds to solve. It reported... 1014. Well, now we know we didn't crack it completely... We'll get back to this one in a minute.

To wrap it up on the algorithmic part, Java and Prolog solutions have at least O(2^n) complexity to collect the solution tree plus n*log(n) sorting (they all do variations of merge sort) and some more overhead. The Ruby version in the worst case would loop n times and each time do n*log(n) of sorting and another n iterations to "reject" the processed project(s). That gives us O(n^2*log(n)) if I am not mistaken. No surprise it can get to the bottom of a 40,000 set. Windows calculator in a scientific view reported "overflow" when I tried to look at 2^40,000. And n^2*log(n) for 40,000 is a little shy of 10 billions which isn't really a big deal to iterate through, especially with the tail recursion keeping it all "flat". I have to admit though that I liked the way Prolog solution "sounded".

So what's up with those large sets producing suboptimal results? Well, I wouldn't expect I could find a fast algorithm to what folks with PhD in computer science believe to be a classical NP-complete problem :) Here comes the last optimization.

Common Sense


If the "optimized" answer after 20 seconds of crunching the numbers is larger than the number of employees from one of the two offices participating in all projects then do the following. If you're still up for some savings and can afford having one of the two offices miss the Barbados "thing" and still be there when you come back then just take the smallest office with you. If you don't think your talent will stick around when you dump them like that then I see another option. Understand that people is the most important asset in your business and if you truly can't take everybody to Barbados then take them all to either London or Stockholm and have some real fun together :)

2 comments:

Anonymous said...

You can use Hopcroft–Karp algo (which is in P, so you can do better than your exponential solution)

Unknown said...
This comment has been removed by the author.