Monday, September 14, 2009
Writing about my son
Today I was assigned my first true homework since I don't know how long. My son Denis started 2d grade his year and his teacher assigned us, the parents, to write about their children: "in a million words or less, please write about your child":
--
Denis is an extraordinary child!
I believe all kids are extraordinary and every parent will tell so but Denis is truly amazing :) He is as brilliant as he is restless, as well-read as he is vibrant, as bright and open-minded as he is active, as lovely as he is challenging, as full of joy as he is full of energy. Denis never hides and his eyes will tell you exactly what he feels that very moment you make the eye contact with him.
Denis was born in Minsk, Belarus which is in Eastern Europe half the way across the globe from U.S. That makes his native language and the language he speaks at home – Russian. His name was picked to serve two purposes: us parents both liking it :) and it being not tied to that only culture so that he doesn’t struggle should we move someplace else. But as we found out later it did come with a little secret. Back where Denis was born this name is pronounced as “Denis” with a soft “d” and an “i” stressed. Here in the states it would have been a girl’s name so he changed to become Dennis (with a hard “d” and an “e” stressed) but with his own unique single “n” in spelling.
Denis has always had dark brown eyes but he used to have curly hairs that straightened as he grew up. And I now think he was born with that wide and inviting smile that I hope will never change.
Our son moved to the states with us 2 years ago and has been here since then until our long trip back home this summer. 6 month into his American life Denis started picking up some English and 12 months into it he was fluent. He did a great job thanks to his inexhaustible desire to socialize and communicate. He does not seem to have fed it well enough yet so one of the things he loves doing the most is talking, especially when he should not be :)
His active lifestyle keeps his legs and elbows black and blue since he learned to walk. Even though Denis is a little clumsy when it comes to playing sports (especially those with balls and rockets) he loves to do it and enjoys the process until it hurts :) He excelled in riding his scooter and enjoys riding his bike. Last year we were able to keep up with him by running on his side but this year he outperformed us and it seems to be a good time to get our own bikes. Another outdoor activity that Denis admires is skiing. Back where he was born we had all 4 seasons with summers hot as they are here and winters cold as they are in Canada so we knew Denis would love it. Denis accompanied us on our first US-time winter vacation to New Hampshire where we all learned to ski. It since then became our family hobby and it remains so filling up all our winter weekends.
Denis loves company and he fades left alone with no friends around for even a day. He does not seem to appreciate how good it is to have a full day to yourself :) Whenever we can’t get him a company we must plan something out and better make it a good variety to burn enough of his energy to secure an uneventful evening.
Last year, when his little sister Alice was born, Denis became a Big Brother and has been sharing his universe with Alice since. Alice and Denis love each other and while he shows her what it means to be a big boy she easily makes him little baby back again when he plays her little toys.
Denis’s great passions are encyclopedias, dinosaurs, star wars, parks, and rivers. Ask him about what he liked the most on our last year trip to California and he will surely name the Muir Woods - red woods park just north from San Francisco...
Denis grows up fast and this is the one and only upsetting thing we can tell about him.
--
Sunday, February 01, 2009
Excel XML Spreadsheet in Java
Microsoft Excel spreadsheet is somewhat unique in a way that custom software applications need to deal with it much more often than with any other 3d party file format. Here I mean specific formats, not containers like XML or CSV. Oh, of course PDF would win if we were to consider formats for sending stuff out, but when it comes to processing inputs, no one beats Excel's popularity.
In a .NET / MS world reading and generating Excel data comes naturally, native. Not so true in the Java world where one needs a mediator, a framework or a utility to understand the Excel binary format. There're tools out there, both open source and commercial that would read the *.xls files and expose an API to the consumer to digest the data. So what's this all about you may wonder? Let's see...
I never liked those APIs. Not because some resemble the VBA object model or are hard to get accustomed with. The actual reason is that I never liked working with the data by writing procedural code that would read it sequentially and do stuff in between getValue() calls. Being a great fan of query languages like SQL and XPath/XQuery, I like seeing data processing problems as transformation or query puzzles. And they often are! Not always, I did not say always did I? In those rare cases I just go with Apache POI HSSF or jxl, but at all other times I want Excel data to come in as XML. That one I can consume in an elegant and decoupled way with XSLT, XPath, SAX events listeners, you name it.
It was in Excel XP (maybe earlier?) when Microsoft introduced the Excel XML Spreadsheet format. Since then, any spreadsheet can be saved as XML. Not just tabular data within tags and brackets as one could imagine considering the CSV analogy, a full featured Excel file format with formatting, multiple worksheets, formulas, etc. It has its limitations of course (charts, data grouping, etc.) but then those API frameworks can't read any Excel file either... So when it comes to consume or generate Excel data in your Java application, Excel XML Spreadsheet may come in as a handy alternative.
I first discovered this in 2003 when I was building a first version of a modeling engine (the one I once mentioned in my earlier post as something I ought to blog about more, and something I might consider going opensource with, company permitting of course). Part of the idea behind that engine was to use MS Excel as an authoring tool for design-time modeling (a very good analogy of what purpose it served can be seen in Decision Tables supported in JBoss Drools which did not exist back then). I basically had to architect a DSL for describing models and rules in Excel and then find a way to digest it by the Java back-end. This whole engine thing is worth a separate blog and one day I'll blog about it, for now let's just isolate the "Excel" problem.
Back then, faced with the challenge, I was thinking along multiple lines simultaneously: write a VBA macro to read the data and build XML files? use Apache library to read the binary file? forget it and build XMLs by hand planning to come back to the automation later? I liked neither but I needed the result. I told everybody I'd do it manually and consider a better way later. 4 hours into typing XML files (I still can remember how boooooring it was) I figured I wanted to reconsider :) That's when I saw the Excel XML Spreadsheet in the Save As menu... to make long story short, in two days I announced that we had automated the Excel <-> (our custom) XML conversion and could move on with the runtime. I didn't have full bi-directional conversion right there, but I had enough pieces to feel safe about it and move on.
If you made it this far into this post, you might feel puzzled what exactly did I do with that Excel XML and why, 5 years later, I decided to blog about it. 2d part is easier to answer so I'll attend to it first. I still see how knowledge about Excel XML spreadsheet format existence surprises engineers and makes their lives easier when suddenly discovered. And then I did not blog back then so I could not do it earlier :) Now to the 1st part and some code examples.
--
I started off very simple. I saved my Excel files manually and built few XSLTs (worth another blog post and I promise to get back to it) to dig out the data I needed and present it in the way I wanted. There was one long HTML page somewhere on MSDN describing the format in a very dry and short form. I skipped that and worked off of the format as it was saved by the Excel application. When it worked, I needed an automation to bulk-process all files in a model. Plus, I thought, no good engine could require engineers to go ahead and save Excel files one by one into some weird XML format.
My next step was again very simple. I built an abstraction that could loop through the given folder's contents, run a filter (an interface that one can implement as needed) to see if a file needs to be converted, and then run a converter (another interface) that would chew one Excel binary file at a time and spit out the Excel XML stream. What was next? Not sure if simple is the right word to describe the next step, brutal should do it better. I took the Java-2-COM bridge (it was jacob) and ran the Excel.Application.SaveAs() command for the given file. Not at all a Java way, agree, but it's not the end of story. Look at the code snippet with jacob and read on please.
excelApp = new ActiveXComponent("Excel.Application");
...
// make all Excel related manipulations in silent mode
excelApp.setProperty("DisplayAlerts", new Variant(false));
excelApp.setProperty("Visible", new Variant(false));
final Object workbooks = excelApp.getProperty("Workbooks").toDispatch();
final Object workbook =
Dispatch.invoke(workbooks, "Open", Dispatch.Method,
new Object[] {file.getAbsolutePath()}, new int[1]).toDispatch();
final Variant saveAsFormat = new Variant();
saveAsFormat.putInt((MODE_EXCEL_AS_XML == mode) ? FILE_FORMAT_XML_SPREADSHEET
: FILE_FORMAT_WORKBOOK_NORMAL);
// fake Variant to not specify Optional parameters for method call
final Variant optionalP = new Variant();
optionalP.noParam();
...
Dispatch.callSub(workbook, "SaveAs", newName, saveAsFormat, optionalP, optionalP,
optionalP, new Variant(false));
logger.info("File [" + file.getName() + "] saved as [" + newName + "]");
Dispatch.call(workbook, "Close", new Variant(false));
It lived like this for 2 months I guess. One of our customers, who we were to build a solution leveraging that engine for, was apparently concerned that they needed a Windows machine with MS Excel installed for the conversion to work. Technically, it was not an issue. All their machines were Wintel, MS Office was part of corporate OS image on each and every machine, conversion did not need to happen on runtime (production boxes were all *nix) and was considered a developer's responsibility. But they had enterprise architecture standards and were not sure the conversion would always remain local to the workstation. Anyway. I had to do something about it and the last thing I wanted to do was to give up the Excel XML and turn to those APIs I never liked.
I needed a way out and I wanted minimum changes to the system. Ideally, stay within all contracts so that nobody could notice a change. The abstraction I briefly described up above suggested that I should have replaced that individual file "save as" processor with another implementation. And I did just that. I took Apache POI HSSF and built a little framework (just a few classes) on top of it. It would take a binary Excel file and return the Excel XML Spreadsheet as a series of SAX events. Basically, I developed the Excel.Application.SaveAs emulator with as much format support as I needed. Let me show you how.
...
final ExcelParser parser = new ExcelParser(file.getAbsolutePath());
...
final Source xmlSource = new SAXSource(parser, new InputSource());
final TransformerFactory factory = new TransformerFactoryImpl();
OutputStream outp = new FileOutputStream(...);
final Transformer transformer = factory.newTransformer();
final StreamResult result = new StreamResult(outp);
transformer.transform(xmlSource, result);
Let's look under the hood of the ExcelParser. It extends org.xml.sax.helpers.XMLFilterImpl and does the following in its parse() implementation:
// InputSource input
...
final POIFSFileSystem fs = new POIFSFileSystem(input.getByteStream());
final HSSFWorkbook wb = new HSSFWorkbook(fs);
startDocument();
doProcess(wb);
endDocument();
...
Here's the doProcess():
doHeader();
doStyles(workbook);
final int size = workbook.getNumberOfSheets();
for (int i = 0; i < size; i++) {
doSheet(workbook, i);
}
doFooter();
I won't transcript all of it, it would take a few pages if I did, let me just show you the header part as apparently MS Excel is sensitive to the namespace prefixes, and then one of the file body parts to give you a better idea of what's inside.
The doHeader() part:
startPrefixMapping("o", Constants.NS_OFFICE);
startPrefixMapping("x", Constants.NS_EXCEL);
startPrefixMapping("html", Constants.NS_HTML);
startPrefixMapping("ss", Constants.NS_SPREADSHEET);
final AttributesImpl attrs = new AttributesImpl();
startElement(Constants.NS_SPREADSHEET, Constants.TAG_WORKBOOK,
Constants.TAG_WORKBOOK, attrs);
And here's the code that works on individual cells:
private void doCell(final HSSFCell cell, final HSSFRow row, final List mergedRegions)
throws SAXException {
final int cellIndex = cell.getCellNum();
final int rowIndex = row.getRowNum();
String hyperLink = null;
String cellValue = "";
String cellType = Constants.DATATYPE_STRING;
if (HSSFCell.CELL_TYPE_NUMERIC == cell.getCellType()) {
cellValue = Constants.DECIMAL_FORMATER.format(cell.getNumericCellValue());
cellType = Constants.DATATYPE_NUMBER;
} else if (HSSFCell.CELL_TYPE_FORMULA == cell.getCellType()) {
hyperLink = ExcelParser.parseHyperLinkFormula(cell.getCellFormula());
} else if (HSSFCell.CELL_TYPE_BOOLEAN == cell.getCellType()) {
cellValue = cell.getBooleanCellValue() ? "1" : "0";
cellType = Constants.DATATYPE_BOOLEAN;
} else {
cellValue = cell.getStringCellValue();
cellType = Constants.DATATYPE_STRING;
}
final AttributesImpl attrs = new AttributesImpl();
if ((hyperLink != null) && (hyperLink.length() > 0)) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_HREF_LOCAL,
Constants.ATT_HREF_PREFIXED,
Constants.DATA_TYPE, hyperLink);
}
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_INDEX_LOCAL,
Constants.ATT_INDEX_PREFIXED,
Constants.DATA_TYPE, String.valueOf((int) cellIndex + 1));
final CellStyle style = StyleConverter.convert(cell.getCellStyle());
final String styleId = (String) styles.get(style);
if (null != styleId) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_STYLEID_LOCAL,
Constants.ATT_STYLEID_PREFIXED,
Constants.DATA_TYPE, styleId);
}
int mergedAcross = 0;
int mergedDown = 0;
if ((null != mergedRegions) && !mergedRegions.isEmpty()) {
for (Iterator it = mergedRegions.iterator(); it.hasNext();) {
final Region region = (Region) it.next();
if ((rowIndex == region.getRowFrom()) && (cellIndex == region.getColumnFrom())) {
mergedAcross = region.getColumnTo() - cellIndex;
mergedDown = region.getRowTo() - rowIndex;
break;
}
}
}
if (mergedAcross > 0) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_MERGEACROSS_LOCAL,
Constants.ATT_MERGEACROSS_PREFIXED, Constants.DATA_TYPE,
String.valueOf((int) mergedAcross));
}
if (mergedDown > 0) {
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_MERGEDOWN_LOCAL,
Constants.ATT_MERGEDOWN_PREFIXED, Constants.DATA_TYPE,
String.valueOf((int) mergedDown));
}
startElement(Constants.NS_SPREADSHEET, Constants.TAG_CELL,
Constants.TAG_CELL, attrs);
attrs.clear();
attrs.addAttribute(Constants.NS_SPREADSHEET, Constants.ATT_TYPE_LOCAL,
Constants.ATT_TYPE_PREFIXED,
Constants.DATA_TYPE, cellType);
startElement(Constants.NS_SPREADSHEET, Constants.TAG_DATA,
Constants.TAG_DATA, attrs);
characters(cellValue.toCharArray(), 0, cellValue.length());
endElement(Constants.NS_SPREADSHEET, Constants.TAG_DATA, Constants.TAG_DATA);
endElement(Constants.NS_SPREADSHEET, Constants.TAG_CELL, Constants.TAG_CELL);
}
The StyleConverter that was mentioned in the code is really just a small service to read the style data from the cell and put it into a nice little object for further processing (style data lives separate from the cells in the Excel XML format).
That was it. I had Excel XML Spreadsheet on any Java platform. Windows requirement was dropped and the engine has been using the new conversion approach since. Later it was refactored to become a separate Excel format utility that other projects and tools could consume. I hope some still do.
--
Do you think a tool like this would be a valuable contribution to the Apache HSSF POI project?
Subscribe to:
Posts (Atom)