Pavel Veller's blog: Java

Showing posts with label Java. Show all posts

Tuesday, January 24, 2012

Spotify Puzzle in Java, Ruby, Prolog and ... Common Sense

I like music and rarely work without a nice track playing in the background. Pandora and MOG are always with me, on my laptop, on my iPhone, on my Kindle. I believe my TV has a Pandora app and I saw one running on my neighbor's fridge (fancy that). Last week I figured I would check out Spotify just to make sure I wasn't missing out on something even better. Facebooking is not something I do so that's how I missed all the buzz I guess. Anyway, long story short, I didn't chose Spotify over MOG but I did find myself a puzzle to spare a few night hours with.

These guys posted a few puzzles for their candidates to try before one would consider applying for an engineering job at Spotify. As of time of this writing there are three, each labeled with a music genre - reggae, funk, and heavy metal. I would almost always prefer heavy metal over reggae and funk given no other choice so I knew which one I would tickle my brain with.

The Bilateral Projects is about finding an optimal solution to what sounds like an easy problem. You have project teams of exactly two individuals each from one of two different offices and you need to invite the smallest amount of people to a company conference to save on expenses while having all projects represented. Same people can work on more than one project hence a potential for savings. Should the problem have more than one optimal solution you would pick one that would invite your friend otherwise you wouldn't really care. Oh, and a very important detail as we will soon learn. The input may ask you to optimize for as many as 10,000 project teams.

Java

First, let's do a careful, defensive-style object oriented brute force solution in Java. We would find all possible solutions, reduce the list to only optimal ones, and finally pick one that ideally has our friend in it. Clearly not a solution for large data sets but a good way to test the waters and see what we've got here.

Note: if you're here for a solution then you should scroll down to the Dominating Set chapter, the Java part won't be of much interest to you. O(2^n) isn't really a solution for anything larger than a handful of teams.

The full code is on github and here's the important pieces. Input data is parsed into a collection of Team objects each referencing two Employee objects (we could clearly work with primitive arrays but that's something I left for Ruby). It's now time to build a list of all possible solutions to the puzzle. There are algorithms to build all combinations and permutations but we need a slightly different thing that can be expressed in a much simpler way. We only need one of the two team members from each team which means 2^n choices that we can visualize as a binary tree. A tree as deep as many teams we have branching out and doubling the nodes on each level to have 2^n on the bottom leaf level. Take a look.

Each line in the dataset on the left represents a team of two employees referenced by their IDs. Full solutions tree would look like the one to the right:

And the code does exactly that. It loops through all teams, accumulates a List<Set<Employee>> (a path from the root down to the end leaf node is a set of nodes you need to walk through, and outer list as a collection of all possible ways down), creates a new solution branch on each step adding employees from the team as leaf "nodes":

private List<Set<Employee>> generateAllSolutionsList() {
    final List<Set<Employee>> accummulatedSolutions = new ArrayList<Set<Employee>>();
    accummulatedSolutions.add(new HashSet<Employee>()); // root of the solutions tree
    
    for (final Team team : projects) {
        final List<Set<Employee>> newBranch = new ArrayList<Set<Employee>>();
                
        for (Set<Employee> solution : accummulatedSolutions){
            final Set<Employee> solutionSibling = new HashSet<Employee>(solution);

            solution.add(team.getLondonEmployee());  // "left" leaf node (existing branch)
            solutionSibling.add(team.getStockholmEmployee()); // "right" leaf node (new branch)
            
            newBranch.add(solutionSibling);
        }
        
        accummulatedSolutions.addAll(newBranch);
    }
    
    return accummulatedSolutions;
}

Having a superset of all solutions we now just need to reduce it a few times to get to the bottom of it. First, let's sort them all and get that minimum size we were looking for:

final List<Set<Employee>> allPossibleSolutions = generateAllSolutionsList();
Collections.sort(allPossibleSolutions, 
        new Comparator<Set<Employee>>() {
            @Override
            public int compare(Set<Employee> set1, Set<Employee> set2) {
                return set1.size() - set2.size();
            }
        });

final int smallest = allPossibleSolutions.get(0).size();

Now reduce it to only optimal solutions:

// reduce to only optimal solutions
final List<Set<Employee>> optimalSolutions = new ArrayList<Set<Employee>>();
for (Set<Employee> s : allPossibleSolutions) {
    if (s.size() > smallest) {
        break; // sorted list so we know all others are even larger in size
    } else {
        optimalSolutions.add(s);
    }
}

Finally, try to find that best solution that has our friend in it:

// SPEC: If possible (subject to the set of people being smallest possible), 
//       the list of invitees should include your friend
final Employee myFriend = new Employee(MY_FRIEND_ID);

// reduce to those having our "friend"
final List<Set<Employee>> bestSolutions = new ArrayList<Set<Employee>>();
for (Set<Employee> s : optimalSolutions) {
    if (s.contains(myFriend)) {
        bestSolutions.add(s);
    }
}

final List<Set<Employee>> solutions = !bestSolutions.isEmpty() ? bestSolutions : optimalSolutions; 
final Set<Employee> solution = solutions.get(0);

Given the data set we used to illustrate the solutions tree the code would find a 5 people solution: 2002, 2021, 2038, 2024, 2031.

Dominating Set

Before we move on to other languages and potentially better ways to solve the heavy metal puzzle let's see what we're dealing with. If we plot employees on a chart and cross-link people who work together we'll end up with a graph (or a set of graphs) with employees as vertexes and projects and edges. Here's one for the dataset we've already looked at:

I used orange coloring for the employee nodes that represent the optimal solution. You see how all other nodes not colored are connected to at least one colored node by an edge? It make sense, right? We want to pick employees to represent all projects so we should have all edges "covered" by the highlighted set. In graph theory a set like this is said to be a dominating set. And there's apparently a dominating number that represents the number of vertices in a smallest dominating set... As wikipedia says, the dominating set problem [..] is a classical NP-complete decision problem [..] therefore it is believed that there is no efficient algorithm that finds a smallest dominating set for a given graph.

Right. Why would otherwise folks over at Spotify have called this one a "heavy metal" puzzle? :)

Ruby

My original idea was to illustrate that one would use much less words to express same thing in Ruby but rewriting a brute force solution in Ruby is no fun so let's optimize it a bit. Instead of building a list of all potential solutions let's just try to get straight to that optimal one.

The idea is to:

Try to figure out who is more likely to make it to the final list of attendees
Add that likely candidate to the attendees list and reduce the list of projects removing all that the selected candidate works on.
As we reduce the projects list we will keep notes of other people who participate in those being removed. This way we will know we don't need them unless they represent other projects
For the teams that don't have a clean "winner" we would take either one giving a preference to our friend (if he or she is one of the two) otherwise we'll try to balance between the two offices

Our Solver will keep track of projects, employees, and selected attendees (oh, and full listing is of course available over on github):

class Solver
 attr_accessor :projects, :employees, :attendees

end

When solver initializes itself from the input data it builds a list of projects (a project is a simple two elements array and a list of projects is an array of project arrays), and also calculates how many projects each employee participates in. The latter we would keep in a hash with employee ID as a key and a number of projects as a value:

def initialize(fileName)
    # a project is a team of two. collect all projects into an array
    @projects = File.readlines(fileName).grep(/^\d+\s\d+$/).collect {|line| line.split(/\s/)}

    # employee ID as a key and project participation index as a value, default to 0
    @employees = Hash.new(0)

    # calculate participation index
    @projects.flatten.each {|e| @employees[e] += 1 }
    @attendees = []
end

well, it's time to optimize it:

def optimize()
    # project with the most "active" employees and ideally a project with a clean winnder
    project = @projects.sort_by! {|p| (@employees[p.first] - @employees[p.last]).abs}.last
    
    @attendees <<
        if (@employees[project.first] != @employees[project.last])
            # clean win
            (@employees[project.first] > @employees[project.last]) ? project.first : project.last
        else
            # a tie with a prefernece to "a friend"
            project.include?("1099") ? "1099" : project[rand(2)]
        end
    
    # discard projects that the attendee participates in 
    # and reduce  participation index of the individuals on those teams
    @projects.reject! {|p| p.include?(@attendees.last) && p.each {|e| @employees[e] -= 1}} 
    
    optimize() if @projects.length > 0
end

That's it. A bit of "side effect" programming and we're done. It generates a slightly different though still optimal solution: 2002, 2021, 2024, 1003, 2031.

Note: project list is sorted in place using the ! version of the sort_by to keep subsequent sorting close to a best case O(n). There's still room for improvement though: we could stop recursing and re-sorting once we know the rest of the list are teams that don't have "shared" resources on them. We could also do a more intelligent sorting to push project teams where people are equally shared across more than one projects first when there's no more "unbalanced" teams. I am jumping ahead of myself a little bit but my tests didn't show any noticeable difference from those improvements so I voted for smaller and cleaner code, not the code that pretends to be "smarter" when it's not.

Prolog

I have to admit I spent a few evenings "learning" it before I could express a solution to the puzzle in a declarative way. I thought the experience I had with inference systems in a comfortable environment of imperative languages (JBoss Drolls, for example) will be enough to just "get" it and boy, was I wrong. It's predicates, not functions; it's unification, not assignments; it's terms and atoms, not variables; and on and on it goes. If you never programmed the logical way I dare you to try! Like Bruce Tate wrote about Prolog in his Seven Languages in Seven Weeks: "if this is your first exposure, I guarantee either you will change the way you think or you'll fail". A good prime on predicate calculus might also help.

Prolog's approach to solving any kind of problem you would throw at it is all about permutations. It's optimized to proof search and backtrack the decision tree so we know the puzzle will be solved with the "brute force"'s O(2^n) "complexity". Let's see:

Firs, we define the attendees rule. It starts with a terminate empty goal to make sure the recursion has where to stop. Then we say that either one of the two employees is a good solution for a one team puzzle, and then we recursively solve it for a list of teams. attendees works on a list of tuples and in this form would report the entire solutions tree (just like that one we build in Java). The sort predicate at the end will make sure the solution set is free of duplicate employees (those shared across more than one project teams thus listed more than once).

attendees([], []).
attendees((E1, _), [E1]).
attendees((_, E2), [E2]).
attendees([H|T], Solution) :- 
    attendees(H, AttendeeFromFirstTeam), 
    attendees(T, AttendeesFromOtherTeams), 
    append(AttendeeFromFirstTeam, AttendeesFromOtherTeams, Accumulator), 
    sort(Accumulator, Solution).

It's now time to find that optimal solution that ideally has our best friend in it. Let's collect the list of all solutions, sort it by length (assuming lists are free of duplicates), push lists with our friend to the front among same length lists, and then just pick the first one. We are going to need a custom sorting predicate:

optimize(Result, ListA, ListB) :- 
    length(ListA, X), 
    length(ListB, Y),
    (\=(X,Y) -> compare(Result, X, Y) ; (member(1099,ListA) -> Result = (<) ; Result = (>))).

optimize compares by length when two lists are different in size and, when two lists represent an equally optimal solution, returns the one with our friend in it (if found) or the first one. Here's the solve goal:

solve(AllTeams, OptimalSolution) :- 
    setof(Solution, attendees(AllTeams, Solution), UniqueSolutions),
    predsort(optimize, UniqueSolutions, [OptimalSolution|_]),
    length(OptimalSolution, L),
    print(L).

It collects all permutations as delivered by the attendees goal, sorts them with the optimize, picks the head of that list and prints its length. When solver.pl is loaded you can ask Prolog a question like this:

solve([(1009,2002),(1009,2021),(1002,2002),(1003,2038),(1003,2002),(1021,2021),(1022,2024),(1088,2031)], Solution).

and it will gladly respond:

5
Solution = [2002, 2021, 2024, 2031, 2038].

You can find projects.pl on github. Note: Prolog dialects vary. I used SWI-Prolog and then tried to run it in GNU Prolog and it didn't work. GNU version didn't like the predsort/3 which I believe is SWI-Prolog "extension".

Let's Rock!

There's not much examples on the Spotify website so I had to generate a few data sets to see how well we can cope with the puzzle, at what point each solution breaks, and whether it finds that optimal solution. The generator that I quickly put together is capable of building various setups based on a few parameters: number of teams/projects, how many employees from each office to consider (a density factor basically and a way to do balanced and unbalanced participation), and finally whether to "pad" it with our friend a few extra times. It produces data files with the names like this - dataset_A_B_C_D - where A is how many teams we asked it to generate, B and C stand for how many Stockholm and London employees it randomized project participation across, and D is for up to how many times it tried to squeeze our friend into the project teams. You can find the generator on github as well.

Let's start with a few simple tests. 20_7_4_0 is a slightly unbalanced setup but it's a good test. We know right of the bat that the optimial solution can't require more than four employees. It can be less if rand(4) didn't spit out all the options though it's unlikely for 20 runs. So let's see: Java solution said it's 4 very quickly, so did Ruby version, so did Prolog.

30_10_10_2 is already large enough to keep Java spinning and eventually run out of heap space. I gave it some more and it ate it all up and was still not satisfied. That was to be expected with exponential complexity and all those Set objects. Prolog ran out of local stack recursing into the solutions tree. Poor guy. Ruby version was very fast telling me it needed only 9 employees. I would expect it at max to be 10 so 9 is believable. 100_30_20_0 reports back it needs 20. 100_80_60_0 says 51. (your experience may vary as data sets are essentially randomized). 100_500_500_0 says 85. Again, I would expect the number to be less than 100. So far so good.

500_50_50_0 reports... 53. 5,000_500_500_0 reports ... 563. And the edge case of 40,000_999_999_0 took a few seconds to generate and about 20 seconds to solve. It reported... 1014. Well, now we know we didn't crack it completely... We'll get back to this one in a minute.

To wrap it up on the algorithmic part, Java and Prolog solutions have at least O(2^n) complexity to collect the solution tree plus n*log(n) sorting (they all do variations of merge sort) and some more overhead. The Ruby version in the worst case would loop n times and each time do n*log(n) of sorting and another n iterations to "reject" the processed project(s). That gives us O(n^2*log(n)) if I am not mistaken. No surprise it can get to the bottom of a 40,000 set. Windows calculator in a scientific view reported "overflow" when I tried to look at 2^40,000. And n^2*log(n) for 40,000 is a little shy of 10 billions which isn't really a big deal to iterate through, especially with the tail recursion keeping it all "flat". I have to admit though that I liked the way Prolog solution "sounded".

So what's up with those large sets producing suboptimal results? Well, I wouldn't expect I could find a fast algorithm to what folks with PhD in computer science believe to be a classical NP-complete problem :) Here comes the last optimization.

Common Sense

If the "optimized" answer after 20 seconds of crunching the numbers is larger than the number of employees from one of the two offices participating in all projects then do the following. If you're still up for some savings and can afford having one of the two offices miss the Barbados "thing" and still be there when you come back then just take the smallest office with you. If you don't think your talent will stick around when you dump them like that then I see another option. Understand that people is the most important asset in your business and if you truly can't take everybody to Barbados then take them all to either London or Stockholm and have some real fun together :)

Tuesday, October 19, 2010

My first question on stackoverflow

Here it is:
http://stackoverflow.com/questions/3965331/weird-behavior-around-same-erasure-compilation-error

Playing with the code on the edge of compile and runtime errors is fascinating. Compilers are software components and may have defects and deficiencies. Language specs may have wholes and ambiguities too. But these are thoroughly tested in labs and then applied daily by hundreds of thousands (if not millions) of programmers so they must be almost perfectly clean. Almost. But still not perfect.

How far from this "still not perfect" is the software you write?

Thursday, September 09, 2010

Type erasure does not erase all of it

Last week I was reading the most recent installment of Spring framework reference documentation and one particular thing surprised me, if not to say more. Here it is with my annotations:

In Java 5 and later, you can use strongly typed collections { sure } [...] If you are using Spring to dependency-inject a strongly-typed Collection into a bean, you can take advantage of Spring's type-conversion support such that the elements of your strongly-typed Collection instances are converted to the appropriate type { something rings a bell but I am not sure... } prior to being added to the Collection.


public class Foo {

  private Map<String, Float> accounts;

  public void setAccounts(Map<String, Float> accounts) {
      this.accounts = accounts;
  }
}
<beans>
  <bean id="foo" class="x.y.Foo">
      <property name="accounts">
          <map>
              <entry key="one" value="9.99"/>
              <entry key="two" value="2.75"/>
              <entry key="six" value="3.99"/>
          </map>
      </property>
  </bean>
</beans>

When the accounts property of the foo bean is prepared for injection, the generics information about the element type of the strongly-typed Map is available by reflection { really? I thought no generic info is available at run time. That's what type erasure is about, isn't it? }. Thus Spring's type conversion infrastructure recognizes the various value elements as being of type Float, and the string values [...] are converted into an actual Float type { something is not right. Guys from springsource are not likely to make false statements of this magnitude, or are they? }

I never really looked into all new stuff that Generics brought along with them into the language. And I missed a small but apparently important portion of what has been added to the Java reflection classes.

Here's a short summary at stackoverflow that basically sums up what you can get via reflection. It's not like you can get anything, the exact type of the object at hand has still been erased and instance of ArrayList is still just an instance of an ArrayList. But something is still there :)

Two more good readings on the subject:

Neel Gafter on reified generics for Java

Bruce Eckel about how Generics Aren't (had to dig it up from the web archive)

Monday, August 30, 2010

Java concurrency is unsafe

Have you ever wondered how stuff in java.util.concurrent works?

I know I am late. It was in early 2000 (if not before that) when Doug Lea had published his Concurrent Programming in Java and had described what later became JSR 166 and then finally part of JDK 1.5 in 2004. Still better late than never.

I actually wanted to find out just a few things. I wondered how they provided the full semantic of happens-before as it works in volatile and synchronized but on the API level. I wondered if I would still find the volatile and synchronized and wait/notify buried down in the implementation details or would there be something else.

I did not look into the original Doug Lea's source code but I what I found in the latest JDK implementation made me smile.

They're using sun.misc.Unsafe to do the atomic changes and suspend enqueued threads (it's called parking). The sun.misc.Unsafe is native all over the place and, as you can tell from its name, is considered unsafe. It's not really Java per se.

Do you still think Java concurrency is safe? :)

Sunday, February 01, 2009

Excel XML Spreadsheet in Java

Microsoft Excel spreadsheet is somewhat unique in a way that custom software applications need to deal with it much more often than with any other 3d party file format. Here I mean specific formats, not containers like XML or CSV. Oh, of course PDF would win if we were to consider formats for sending stuff out, but when it comes to processing inputs, no one beats Excel's popularity.

In a .NET / MS world reading and generating Excel data comes naturally, native. Not so true in the Java world where one needs a mediator, a framework or a utility to understand the Excel binary format. There're tools out there, both open source and commercial that would read the *.xls files and expose an API to the consumer to digest the data. So what's this all about you may wonder? Let's see...

I never liked those APIs. Not because some resemble the VBA object model or are hard to get accustomed with. The actual reason is that I never liked working with the data by writing procedural code that would read it sequentially and do stuff in between getValue() calls. Being a great fan of query languages like SQL and XPath/XQuery, I like seeing data processing problems as transformation or query puzzles. And they often are! Not always, I did not say always did I? In those rare cases I just go with Apache POI HSSF or jxl, but at all other times I want Excel data to come in as XML. That one I can consume in an elegant and decoupled way with XSLT, XPath, SAX events listeners, you name it.

It was in Excel XP (maybe earlier?) when Microsoft introduced the Excel XML Spreadsheet format. Since then, any spreadsheet can be saved as XML. Not just tabular data within tags and brackets as one could imagine considering the CSV analogy, a full featured Excel file format with formatting, multiple worksheets, formulas, etc. It has its limitations of course (charts, data grouping, etc.) but then those API frameworks can't read any Excel file either... So when it comes to consume or generate Excel data in your Java application, Excel XML Spreadsheet may come in as a handy alternative.

I first discovered this in 2003 when I was building a first version of a modeling engine (the one I once mentioned in my earlier post as something I ought to blog about more, and something I might consider going opensource with, company permitting of course). Part of the idea behind that engine was to use MS Excel as an authoring tool for design-time modeling (a very good analogy of what purpose it served can be seen in Decision Tables supported in JBoss Drools which did not exist back then). I basically had to architect a DSL for describing models and rules in Excel and then find a way to digest it by the Java back-end. This whole engine thing is worth a separate blog and one day I'll blog about it, for now let's just isolate the "Excel" problem.

Back then, faced with the challenge, I was thinking along multiple lines simultaneously: write a VBA macro to read the data and build XML files? use Apache library to read the binary file? forget it and build XMLs by hand planning to come back to the automation later? I liked neither but I needed the result. I told everybody I'd do it manually and consider a better way later. 4 hours into typing XML files (I still can remember how boooooring it was) I figured I wanted to reconsider :) That's when I saw the Excel XML Spreadsheet in the Save As menu... to make long story short, in two days I announced that we had automated the Excel <-> (our custom) XML conversion and could move on with the runtime. I didn't have full bi-directional conversion right there, but I had enough pieces to feel safe about it and move on.

If you made it this far into this post, you might feel puzzled what exactly did I do with that Excel XML and why, 5 years later, I decided to blog about it. 2d part is easier to answer so I'll attend to it first. I still see how knowledge about Excel XML spreadsheet format existence surprises engineers and makes their lives easier when suddenly discovered. And then I did not blog back then so I could not do it earlier :) Now to the 1st part and some code examples.

--

I started off very simple. I saved my Excel files manually and built few XSLTs (worth another blog post and I promise to get back to it) to dig out the data I needed and present it in the way I wanted. There was one long HTML page somewhere on MSDN describing the format in a very dry and short form. I skipped that and worked off of the format as it was saved by the Excel application. When it worked, I needed an automation to bulk-process all files in a model. Plus, I thought, no good engine could require engineers to go ahead and save Excel files one by one into some weird XML format.

My next step was again very simple. I built an abstraction that could loop through the given folder's contents, run a filter (an interface that one can implement as needed) to see if a file needs to be converted, and then run a converter (another interface) that would chew one Excel binary file at a time and spit out the Excel XML stream. What was next? Not sure if simple is the right word to describe the next step, brutal should do it better. I took the Java-2-COM bridge (it was jacob) and ran the Excel.Application.SaveAs() command for the given file. Not at all a Java way, agree, but it's not the end of story. Look at the code snippet with jacob and read on please.


excelApp = new ActiveXComponent("Excel.Application");
...
// make all Excel related manipulations in silent mode 
excelApp.setProperty("DisplayAlerts", new Variant(false));
excelApp.setProperty("Visible", new Variant(false));

final Object workbooks = excelApp.getProperty("Workbooks").toDispatch();
final Object workbook =
    Dispatch.invoke(workbooks, "Open", Dispatch.Method,
                    new Object[] {file.getAbsolutePath()}, new int[1]).toDispatch();

final Variant saveAsFormat = new Variant();

saveAsFormat.putInt((MODE_EXCEL_AS_XML == mode) ? FILE_FORMAT_XML_SPREADSHEET
                                                : FILE_FORMAT_WORKBOOK_NORMAL);

// fake Variant to not specify Optional parameters for method call
final Variant optionalP = new Variant();
optionalP.noParam();

...

Dispatch.callSub(workbook, "SaveAs", newName, saveAsFormat, optionalP, optionalP,
                 optionalP, new Variant(false));
logger.info("File [" + file.getName() + "] saved as [" + newName + "]");

Dispatch.call(workbook, "Close", new Variant(false));

It lived like this for 2 months I guess. One of our customers, who we were to build a solution leveraging that engine for, was apparently concerned that they needed a Windows machine with MS Excel installed for the conversion to work. Technically, it was not an issue. All their machines were Wintel, MS Office was part of corporate OS image on each and every machine, conversion did not need to happen on runtime (production boxes were all *nix) and was considered a developer's responsibility. But they had enterprise architecture standards and were not sure the conversion would always remain local to the workstation. Anyway. I had to do something about it and the last thing I wanted to do was to give up the Excel XML and turn to those APIs I never liked.

I needed a way out and I wanted minimum changes to the system. Ideally, stay within all contracts so that nobody could notice a change. The abstraction I briefly described up above suggested that I should have replaced that individual file "save as" processor with another implementation. And I did just that. I took Apache POI HSSF and built a little framework (just a few classes) on top of it. It would take a binary Excel file and return the Excel XML Spreadsheet as a series of SAX events. Basically, I developed the Excel.Application.SaveAs emulator with as much format support as I needed. Let me show you how.


...

final ExcelParser parser = new ExcelParser(file.getAbsolutePath());

...

final Source xmlSource = new SAXSource(parser, new InputSource());
final TransformerFactory factory = new TransformerFactoryImpl();

OutputStream outp = new FileOutputStream(...);

final Transformer transformer = factory.newTransformer();

final StreamResult result = new StreamResult(outp);

transformer.transform(xmlSource, result);

Let's look under the hood of the ExcelParser. It extends org.xml.sax.helpers.XMLFilterImpl and does the following in its parse() implementation:


// InputSource input
 
...

final POIFSFileSystem fs = new POIFSFileSystem(input.getByteStream());
final HSSFWorkbook wb = new HSSFWorkbook(fs);

startDocument();

doProcess(wb);

endDocument();

...

Here's the doProcess():


doHeader();

doStyles(workbook);

final int size = workbook.getNumberOfSheets();
for (int i = 0; i < size; i++) {
    doSheet(workbook, i);
}
doFooter();

I won't transcript all of it, it would take a few pages if I did, let me just show you the header part as apparently MS Excel is sensitive to the namespace prefixes, and then one of the file body parts to give you a better idea of what's inside.

The doHeader() part:


startPrefixMapping("o", Constants.NS_OFFICE);
startPrefixMapping("x", Constants.NS_EXCEL);
startPrefixMapping("html", Constants.NS_HTML);
startPrefixMapping("ss", Constants.NS_SPREADSHEET);

final AttributesImpl attrs = new AttributesImpl();

startElement(Constants.NS_SPREADSHEET, Constants.TAG_WORKBOOK, 
             Constants.TAG_WORKBOOK, attrs);

And here's the code that works on individual cells:


    private void doCell(final HSSFCell cell, final HSSFRow row, final List mergedRegions)
                 throws SAXException {
        final int cellIndex = cell.getCellNum();
        final int rowIndex = row.getRowNum();

        String hyperLink = null;
        String cellValue = "";
        String cellType = Constants.DATATYPE_STRING;

        if (HSSFCell.CELL_TYPE_NUMERIC == cell.getCellType()) {
            cellValue = Constants.DECIMAL_FORMATER.format(cell.getNumericCellValue());
            cellType = Constants.DATATYPE_NUMBER;

        } else if (HSSFCell.CELL_TYPE_FORMULA == cell.getCellType()) {
            hyperLink = ExcelParser.parseHyperLinkFormula(cell.getCellFormula());

        } else if (HSSFCell.CELL_TYPE_BOOLEAN == cell.getCellType()) {
            cellValue = cell.getBooleanCellValue() ? "1" : "0";
            cellType = Constants.DATATYPE_BOOLEAN;

        } else {
            cellValue = cell.getStringCellValue();
            cellType = Constants.DATATYPE_STRING;
        }

        final AttributesImpl attrs = new AttributesImpl();

        if ((hyperLink != null) && (hyperLink.length() > 0)) {
            attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_HREF_LOCAL, 
                               Constants.ATT_HREF_PREFIXED,
                               Constants.DATA_TYPE, hyperLink);
        }

        attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_INDEX_LOCAL, 
                           Constants.ATT_INDEX_PREFIXED,
                           Constants.DATA_TYPE, String.valueOf((int) cellIndex + 1));

        final CellStyle style = StyleConverter.convert(cell.getCellStyle());
        final String styleId = (String) styles.get(style);
        if (null != styleId) {
            attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_STYLEID_LOCAL, 
                               Constants.ATT_STYLEID_PREFIXED,
                               Constants.DATA_TYPE, styleId);
        }

        int mergedAcross = 0;
        int mergedDown = 0;
        if ((null != mergedRegions) && !mergedRegions.isEmpty()) {
            for (Iterator it = mergedRegions.iterator(); it.hasNext();) {
                final Region region = (Region) it.next();
                if ((rowIndex == region.getRowFrom()) && (cellIndex == region.getColumnFrom())) {
                    mergedAcross = region.getColumnTo() - cellIndex;
                    mergedDown = region.getRowTo() - rowIndex;
                    break;
                }
            }
        }
        if (mergedAcross > 0) {
            attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_MERGEACROSS_LOCAL,
                               Constants.ATT_MERGEACROSS_PREFIXED, Constants.DATA_TYPE,
                               String.valueOf((int) mergedAcross));
        }
        if (mergedDown > 0) {
            attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_MERGEDOWN_LOCAL,
                               Constants.ATT_MERGEDOWN_PREFIXED, Constants.DATA_TYPE, 
                               String.valueOf((int) mergedDown));
        }

        startElement(Constants.NS_SPREADSHEET, Constants.TAG_CELL, 
                     Constants.TAG_CELL, attrs);

        attrs.clear();

        attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_TYPE_LOCAL, 
                           Constants.ATT_TYPE_PREFIXED,
                           Constants.DATA_TYPE, cellType);
        startElement(Constants.NS_SPREADSHEET, Constants.TAG_DATA, 
                     Constants.TAG_DATA, attrs);

        characters(cellValue.toCharArray(), 0, cellValue.length());

        endElement(Constants.NS_SPREADSHEET, Constants.TAG_DATA, Constants.TAG_DATA);
        endElement(Constants.NS_SPREADSHEET, Constants.TAG_CELL, Constants.TAG_CELL);
    }

The StyleConverter that was mentioned in the code is really just a small service to read the style data from the cell and put it into a nice little object for further processing (style data lives separate from the cells in the Excel XML format).

That was it. I had Excel XML Spreadsheet on any Java platform. Windows requirement was dropped and the engine has been using the new conversion approach since. Later it was refactored to become a separate Excel format utility that other projects and tools could consume. I hope some still do.

--

Do you think a tool like this would be a valuable contribution to the Apache HSSF POI project?

Pavel Veller's blog

Tuesday, January 24, 2012

Spotify Puzzle in Java, Ruby, Prolog and ... Common Sense

Java

Dominating Set

Ruby

Prolog

Let's Rock!

Common Sense

Tuesday, October 19, 2010

My first question on stackoverflow

Thursday, September 09, 2010

Type erasure does not erase all of it

Monday, August 30, 2010

Java concurrency is unsafe

Sunday, February 01, 2009

Excel XML Spreadsheet in Java

Blog Archive

About Me

Pavel Veller's blog

Tuesday, January 24, 2012

Spotify Puzzle in Java, Ruby, Prolog and ... Common Sense

Java

Dominating Set

Ruby

Prolog

Let's Rock!

Common Sense

Tuesday, October 19, 2010

My first question on stackoverflow

Thursday, September 09, 2010

Type erasure does not erase all of it

Monday, August 30, 2010

Java concurrency is unsafe

Sunday, February 01, 2009

Excel XML Spreadsheet in Java

Subscribe To My Blog

Blog Archive

About Me