Sunday, October 23, 2011

Reading Excel files with Apache POI(usermodel)

Recently I have stumbled upon need to read spreadsheet data from usual Excel files. After some research I have found that the most suitable solution for me is to use Apache POI. It works fine, simple to use, has a long history(10 years) and is still alive.

The POI API is full of acronyms, the meanings of which is not understandable (e.g. HSSF, XSSF, SXSSF) until you read the documentation. So let's start with some definitions:
  • HSSF(Horrible SpreadSheet Format) is the POI Project's pure Java implementation of the Excel '97(-2007) file format. 
  • XSSF (XML SpreadSheet Format) is the POI Project's pure Java implementation of the Excel 2007 OOXML (.xlsx) file format.
  • SXSSF is an API-compatible streaming extension of XSSF to be used when very large spreadsheets have to be produced, and heap space is limited. It's built on top of XSSF. Available only since 3.8-beta3.
Before I show you some examples of work I want you to be aware of additional complexities in POI. There are different ways to read files. The API provides us with 2 capabilities, they are:
  • UserModel
  • UserEventModel. 
The first one is really straightforward and easy to use, but the second one is much more intricate and is based upon reading files using SAX (e. g. Apache Xerces). And by the way if you need to load big files then you should choose usereventmodel because usermodel consumes a lot of memory and it's very likely you will become the victim of the dreadful Java OutOfMemory exception. And if you need to write to SpreadSheet then you have no choice and will be forced to use usermodel. In this article we will see the usage of usermodel API for reading.
Figure 1. Spreadsheet API Feature Summary

So at first you need to download the library from official site or to configure maven dependencies. I prefer the second option, so here are dependencies (I got them from mvnrepository.com):

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.5-FINAL</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-contrib</artifactId>
    <version>3.5-FINAL</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.5-FINAL</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

There are two ways to read data in usermodel:
  • Read only data and no null value. It usually produces ragged arrays(depends on your data)
  • Read all data (inclusive nulls)
Now let's create the code to read all data:

public List<List<String>> parseSpreadSheet(InputStream inputStream) {
    Workbook workBook = null;
    try {
        workBook = WorkbookFactory.create(inputStream);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    Sheet sheet = workBook.getSheetAt(0);    
    List<List<String>> rowHolder = new ArrayList<List<String>>();
    int cellNum = sheet.getRow(1).getLastCellNum();    

    for (int i = 0; i <= sheet.getLastRowNum(); i++) {
        Row row = sheet.getRow(i);
        List<String> cellHolder = new ArrayList<String>();

        for (int j = 0; j < row.getLastCellNum(); j++) {
            Cell cell = row.getCell(j);           
            String cellValue = parseCellValue(workBook, cell);
            cellHolder.add(cellValue);
        }

        //add empty cells to the end if required
        while (cellHolder.size() < cellNum) {
            cellHolder.add(null);                   
        }
        rowHolder.add(cellHolder);    
    }
    return rowHolder;
}


The return type of the method is Java SpreadSheet representation (selected by me). The argument of the method is InputStream of the SpreadSheet file.

I have decided to use Workbook(represents Excel file), which is the parent class of HSSFWorkbook and XSSFWorkbook. WorkbookFactory frees you from necessity to write conditions for different kinds of files (xls and xlsx), it creates appropriate descendant class from InputStream. The rest of the code is simple(The code of parseCellValue() method I will show later).

We need the method to parse cell values:

private String parseCellValue(Workbook workBook, Cell cell) {       
    FormulaEvaluator evaluator = workBook.getCreationHelper().createFormulaEvaluator();
    String cellValue = null;               
    if (cell != null) {
        switch (cell.getCellType()) {
            case Cell.CELL_TYPE_STRING:
                cellValue = cell.getRichStringCellValue().getString();
                break;
            case Cell.CELL_TYPE_NUMERIC:
                if (DateUtil.isCellDateFormatted(cell)) {
                    cellValue = cell.getDateCellValue().toString();
                } else {
                    cellValue = new Double(cell.getNumericCellValue()).toString();
                }
                break;
            case Cell.CELL_TYPE_BOOLEAN:
                cellValue = new Boolean(cell.getBooleanCellValue()).toString();
                break;
            case Cell.CELL_TYPE_FORMULA:
                cellValue = evaluator.evaluate(cell).formatAsString();
                break;
        }                   
    }
    return cellValue;
}

As you can see we should parse cell value according to its Excel format. Note POI UserAPI is pretty powerful it even allows us to evaluate formula values.

And finally here's the way you can read all data but null values. This type of reading is based upon iterators of Row and Cell:

public List<List<String>> readSpreadSheetWOnull(InputStream inputStream) {
    Workbook workBook = null;
    try {
        workBook = WorkbookFactory.create(inputStream);
        Sheet sheet = workBook.getSheetAt(SHEET_NUMBER);

        Iterator<Row> rowIter = sheet.rowIterator();

        List<List<String>> rowHolder = new ArrayList<List<String>>();
        while (rowIter.hasNext()) {
            Row row = (Row) rowIter.next();
            Iterator<Cell> cellIter = row.cellIterator();

            List<String> cellHolder = new ArrayList<String>();
            while (cellIter.hasNext()) {
                Cell cell = (Cell) cellIter.next();
                String cellValue = parseCellValue(workBook, cell);
                cellHolder.add(cellValue);
            }
            rowHolder.add(cellHolder);
        }
        return rowHolder;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

That's all. Later I will write a post about UserEventModel