Sunday, February 01, 2009

Excel XML Spreadsheet in Java


Microsoft Excel spreadsheet is somewhat unique in a way that custom software applications need to deal with it much more often than with any other 3d party file format. Here I mean specific formats, not containers like XML or CSV. Oh, of course PDF would win if we were to consider formats for sending stuff out, but when it comes to processing inputs, no one beats Excel's popularity.

In a .NET / MS world reading and generating Excel data comes naturally, native. Not so true in the Java world where one needs a mediator, a framework or a utility to understand the Excel binary format. There're tools out there, both open source and commercial that would read the *.xls files and expose an API to the consumer to digest the data. So what's this all about you may wonder? Let's see...

I never liked those APIs. Not because some resemble the VBA object model or are hard to get accustomed with. The actual reason is that I never liked working with the data by writing procedural code that would read it sequentially and do stuff in between getValue() calls. Being a great fan of query languages like SQL and XPath/XQuery, I like seeing data processing problems as transformation or query puzzles. And they often are! Not always, I did not say always did I? In those rare cases I just go with Apache POI HSSF or jxl, but at all other times I want Excel data to come in as XML. That one I can consume in an elegant and decoupled way with XSLT, XPath, SAX events listeners, you name it.

It was in Excel XP (maybe earlier?) when Microsoft introduced the Excel XML Spreadsheet format. Since then, any spreadsheet can be saved as XML. Not just tabular data within tags and brackets as one could imagine considering the CSV analogy, a full featured Excel file format with formatting, multiple worksheets, formulas, etc. It has its limitations of course (charts, data grouping, etc.) but then those API frameworks can't read any Excel file either... So when it comes to consume or generate Excel data in your Java application, Excel XML Spreadsheet may come in as a handy alternative.

I first discovered this in 2003 when I was building a first version of a modeling engine (the one I once mentioned in my earlier post as something I ought to blog about more, and something I might consider going opensource with, company permitting of course). Part of the idea behind that engine was to use MS Excel as an authoring tool for design-time modeling (a very good analogy of what purpose it served can be seen in Decision Tables supported in JBoss Drools which did not exist back then). I basically had to architect a DSL for describing models and rules in Excel and then find a way to digest it by the Java back-end. This whole engine thing is worth a separate blog and one day I'll blog about it, for now let's just isolate the "Excel" problem.

Back then, faced with the challenge, I was thinking along multiple lines simultaneously: write a VBA macro to read the data and build XML files? use Apache library to read the binary file? forget it and build XMLs by hand planning to come back to the automation later? I liked neither but I needed the result. I told everybody I'd do it manually and consider a better way later. 4 hours into typing XML files (I still can remember how boooooring it was) I figured I wanted to reconsider :) That's when I saw the Excel XML Spreadsheet in the Save As menu... to make long story short, in two days I announced that we had automated the Excel <-> (our custom) XML conversion and could move on with the runtime. I didn't have full bi-directional conversion right there, but I had enough pieces to feel safe about it and move on.

If you made it this far into this post, you might feel puzzled what exactly did I do with that Excel XML and why, 5 years later, I decided to blog about it. 2d part is easier to answer so I'll attend to it first. I still see how knowledge about Excel XML spreadsheet format existence surprises engineers and makes their lives easier when suddenly discovered. And then I did not blog back then so I could not do it earlier :) Now to the 1st part and some code examples.

--

I started off very simple. I saved my Excel files manually and built few XSLTs (worth another blog post and I promise to get back to it) to dig out the data I needed and present it in the way I wanted. There was one long HTML page somewhere on MSDN describing the format in a very dry and short form. I skipped that and worked off of the format as it was saved by the Excel application. When it worked, I needed an automation to bulk-process all files in a model. Plus, I thought, no good engine could require engineers to go ahead and save Excel files one by one into some weird XML format.

My next step was again very simple. I built an abstraction that could loop through the given folder's contents, run a filter (an interface that one can implement as needed) to see if a file needs to be converted, and then run a converter (another interface) that would chew one Excel binary file at a time and spit out the Excel XML stream. What was next? Not sure if simple is the right word to describe the next step, brutal should do it better. I took the Java-2-COM bridge (it was jacob) and ran the Excel.Application.SaveAs() command for the given file. Not at all a Java way, agree, but it's not the end of story. Look at the code snippet with jacob and read on please.


excelApp = new ActiveXComponent("Excel.Application");
...
// make all Excel related manipulations in silent mode
excelApp.setProperty("DisplayAlerts", new Variant(false));
excelApp.setProperty("Visible", new Variant(false));

final Object workbooks = excelApp.getProperty("Workbooks").toDispatch();
final Object workbook =
Dispatch.invoke(workbooks, "Open", Dispatch.Method,
new Object[] {file.getAbsolutePath()}, new int[1]).toDispatch();

final Variant saveAsFormat = new Variant();

saveAsFormat.putInt((MODE_EXCEL_AS_XML == mode) ? FILE_FORMAT_XML_SPREADSHEET
: FILE_FORMAT_WORKBOOK_NORMAL);

// fake Variant to not specify Optional parameters for method call
final Variant optionalP = new Variant();
optionalP.noParam();

...

Dispatch.callSub(workbook, "SaveAs", newName, saveAsFormat, optionalP, optionalP,
optionalP, new Variant(false));
logger.info("File [" + file.getName() + "] saved as [" + newName + "]");

Dispatch.call(workbook, "Close", new Variant(false));


It lived like this for 2 months I guess. One of our customers, who we were to build a solution leveraging that engine for, was apparently concerned that they needed a Windows machine with MS Excel installed for the conversion to work. Technically, it was not an issue. All their machines were Wintel, MS Office was part of corporate OS image on each and every machine, conversion did not need to happen on runtime (production boxes were all *nix) and was considered a developer's responsibility. But they had enterprise architecture standards and were not sure the conversion would always remain local to the workstation. Anyway. I had to do something about it and the last thing I wanted to do was to give up the Excel XML and turn to those APIs I never liked.

I needed a way out and I wanted minimum changes to the system. Ideally, stay within all contracts so that nobody could notice a change. The abstraction I briefly described up above suggested that I should have replaced that individual file "save as" processor with another implementation. And I did just that. I took Apache POI HSSF and built a little framework (just a few classes) on top of it. It would take a binary Excel file and return the Excel XML Spreadsheet as a series of SAX events. Basically, I developed the Excel.Application.SaveAs emulator with as much format support as I needed. Let me show you how.


...

final ExcelParser parser = new ExcelParser(file.getAbsolutePath());

...

final Source xmlSource = new SAXSource(parser, new InputSource());
final TransformerFactory factory = new TransformerFactoryImpl();

OutputStream outp = new FileOutputStream(...);

final Transformer transformer = factory.newTransformer();

final StreamResult result = new StreamResult(outp);

transformer.transform(xmlSource, result);


Let's look under the hood of the ExcelParser. It extends org.xml.sax.helpers.XMLFilterImpl and does the following in its parse() implementation:


// InputSource input

...

final POIFSFileSystem fs = new POIFSFileSystem(input.getByteStream());
final HSSFWorkbook wb = new HSSFWorkbook(fs);

startDocument();

doProcess(wb);

endDocument();

...


Here's the doProcess():


doHeader();

doStyles(workbook);

final int size = workbook.getNumberOfSheets();
for (int i = 0; i < size; i++) {
doSheet(workbook, i);
}
doFooter();


I won't transcript all of it, it would take a few pages if I did, let me just show you the header part as apparently MS Excel is sensitive to the namespace prefixes, and then one of the file body parts to give you a better idea of what's inside.

The doHeader() part:


startPrefixMapping("o", Constants.NS_OFFICE);
startPrefixMapping("x", Constants.NS_EXCEL);
startPrefixMapping("html", Constants.NS_HTML);
startPrefixMapping("ss", Constants.NS_SPREADSHEET);

final AttributesImpl attrs = new AttributesImpl();

startElement(Constants.NS_SPREADSHEET, Constants.TAG_WORKBOOK,
Constants.TAG_WORKBOOK, attrs);


And here's the code that works on individual cells:


private void doCell(final HSSFCell cell, final HSSFRow row, final List mergedRegions)
throws SAXException {
final int cellIndex = cell.getCellNum();
final int rowIndex = row.getRowNum();

String hyperLink = null;
String cellValue = "";
String cellType = Constants.DATATYPE_STRING;

if (HSSFCell.CELL_TYPE_NUMERIC == cell.getCellType()) {
cellValue = Constants.DECIMAL_FORMATER.format(cell.getNumericCellValue());
cellType = Constants.DATATYPE_NUMBER;

} else if (HSSFCell.CELL_TYPE_FORMULA == cell.getCellType()) {
hyperLink = ExcelParser.parseHyperLinkFormula(cell.getCellFormula());

} else if (HSSFCell.CELL_TYPE_BOOLEAN == cell.getCellType()) {
cellValue = cell.getBooleanCellValue() ? "1" : "0";
cellType = Constants.DATATYPE_BOOLEAN;

} else {
cellValue = cell.getStringCellValue();
cellType = Constants.DATATYPE_STRING;
}

final AttributesImpl attrs = new AttributesImpl();

if ((hyperLink != null) && (hyperLink.length() > 0)) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_HREF_LOCAL,
Constants.ATT_HREF_PREFIXED,
Constants.DATA_TYPE, hyperLink);
}

attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_INDEX_LOCAL,
Constants.ATT_INDEX_PREFIXED,
Constants.DATA_TYPE, String.valueOf((int) cellIndex + 1));

final CellStyle style = StyleConverter.convert(cell.getCellStyle());
final String styleId = (String) styles.get(style);
if (null != styleId) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_STYLEID_LOCAL,
Constants.ATT_STYLEID_PREFIXED,
Constants.DATA_TYPE, styleId);
}

int mergedAcross = 0;
int mergedDown = 0;
if ((null != mergedRegions) && !mergedRegions.isEmpty()) {
for (Iterator it = mergedRegions.iterator(); it.hasNext();) {
final Region region = (Region) it.next();
if ((rowIndex == region.getRowFrom()) && (cellIndex == region.getColumnFrom())) {
mergedAcross = region.getColumnTo() - cellIndex;
mergedDown = region.getRowTo() - rowIndex;
break;
}
}
}
if (mergedAcross > 0) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_MERGEACROSS_LOCAL,
Constants.ATT_MERGEACROSS_PREFIXED, Constants.DATA_TYPE,
String.valueOf((int) mergedAcross));
}
if (mergedDown > 0) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_MERGEDOWN_LOCAL,
Constants.ATT_MERGEDOWN_PREFIXED, Constants.DATA_TYPE,
String.valueOf((int) mergedDown));
}

startElement(Constants.NS_SPREADSHEET, Constants.TAG_CELL,
Constants.TAG_CELL, attrs);

attrs.clear();

attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_TYPE_LOCAL,
Constants.ATT_TYPE_PREFIXED,
Constants.DATA_TYPE, cellType);
startElement(Constants.NS_SPREADSHEET, Constants.TAG_DATA,
Constants.TAG_DATA, attrs);

characters(cellValue.toCharArray(), 0, cellValue.length());

endElement(Constants.NS_SPREADSHEET, Constants.TAG_DATA, Constants.TAG_DATA);
endElement(Constants.NS_SPREADSHEET, Constants.TAG_CELL, Constants.TAG_CELL);
}


The StyleConverter that was mentioned in the code is really just a small service to read the style data from the cell and put it into a nice little object for further processing (style data lives separate from the cells in the Excel XML format).

That was it. I had Excel XML Spreadsheet on any Java platform. Windows requirement was dropped and the engine has been using the new conversion approach since. Later it was refactored to become a separate Excel format utility that other projects and tools could consume. I hope some still do.

--

Do you think a tool like this would be a valuable contribution to the Apache HSSF POI project?

5 comments:

Anonymous said...

I was impressed by this jacob thing ))
It's cool

Vidhya said...

That was very good piece of information. I was struggling doing this since morning.

Thanks for sharing.
Vidhya

Alex said...

For me exist some new softwares which I frequently use for work with excel files. But one day I ran upon a tool, which stroke me and helped with old issues. I conceive that this tool can many good resources for resolving varied troubles with excel documents - Excel files fix.

Anonymous said...

Thanks for sharing
I wonder if you could share with us the full code, or at least the code of style conversion fonction. Thanks a lot.
Youssef

Astrid said...

There is an API available now that can convert excel file to XML and also can open and read XML file and you don't have to add any XML Mapping file etc it will do all for you, this API is called Aspose.Cells for Java. Try it i hope this API will benefit you.