概要
Java对Excel的操作一般都是用POI,但是数据量大的话可能会导致频繁的FGC或OOM,分别对于xls和xlsx文件进行大批量数据的导入和导出优化。
一次线上问题
这是一次线上的问题,因为一个大数据量的Excel导出功能,而导致服务器频繁FGC,具体如图所示

可以看出POI的对象以及相关的XML对象占用了绝大部分的内存消耗,频繁FGC说明这些对象一直存活,没有被回收。
原因是由于导出的数据比较大量,大概有10w行 * 50列,由于后台直接用XSSFWorkbook导出,在导出结束前内存有大量的Row,Cell,Style等,以及基于XLSX底层存储的XML对象没有被释放。
Excel的存储格式
下面的优化内容涉及Excel的底层存储格式,所以要先了解一下这些基础知识。
XLS
03版的XLS采用的是一种名为BIFF8(Binary-Interchange-File-Format),基于OLE2规范的二进制文件格式。大概就是一种结构很复杂的二进制文件,已经被淘汰了。
XLSX
07版的XLSX则是采用OOXML(Office Open Xml)的格式存储数据。简单来说就是一堆xml文件用zip打包之后文件。把xlsx文件后缀名改为zip后,再解压出来就可以看到文件结构
打开sheet1.xml,可以看到是描述第一个sheet的内容
导出优化
XLSX
由于xlsx底层使用xml存储,占用内存会比较大,官方也意识到这个问题,在3.8版本之后,提供了SXSSFWorkbook来优化写性能。
官方说明
https://poi.apache.org/components/spreadsheet/how-to.html#sxssf
使用
SXSSFWorkbook使用起来特别的简单,只需要改一行代码就OK了。
原来代码可能是这样的
Workbook workbook = new XSSFWorkbook(inputStream);
那么需要改成下面的样子,就可以用上SXSSFWorkbook了
Workbook workbook = new SXSSFWorkbook(new XSSFWorkbook(inputStream));
其原理是可以定义一个window size(默认100),生成Excel期间只在内存维持window size那么多的行数Row,超时window size时会把之前行Row写到一个临时文件并且remove释放掉,这样就可以达到释放内存的效果。
SXSSFSheet在创建Row时会判断并刷盘、释放超过window size的Row。
@Overridepublic SXSSFRow createRow(int rownum){int maxrow = SpreadsheetVersion.EXCEL2007.getLastRowIndex();if (rownum < 0 || rownum > maxrow) {throw new IllegalArgumentException("Invalid row number (" + rownum+ ") outside allowable range (0.." + maxrow + ")");}// attempt to overwrite a row that is already flushed to diskif(rownum <= _writer.getLastFlushedRow() ) {throw new IllegalArgumentException("Attempting to write a row["+rownum+"] " +"in the range [0," + _writer.getLastFlushedRow() + "] that is already written to disk.");}// attempt to overwrite a existing row in the input templateif(_sh.getPhysicalNumberOfRows() > 0 && rownum <= _sh.getLastRowNum() ) {throw new IllegalArgumentException("Attempting to write a row["+rownum+"] " +"in the range [0," + _sh.getLastRowNum() + "] that is already written to disk.");}SXSSFRow newRow=new SXSSFRow(this);_rows.put(rownum,newRow);allFlushed = false;//如果大于窗口的size,就会flushif(_randomAccessWindowSize>=0&&_rows.size()>_randomAccessWindowSize){try{flushRows(_randomAccessWindowSize);}catch (IOException ioe){throw new RuntimeException(ioe);}}return newRow;}public void flushRows(int remaining) throws IOException{//flush每一个rowwhile(_rows.size() > remaining) {flushOneRow();}if (remaining == 0) {allFlushed = true;}}private void flushOneRow() throws IOException{Integer firstRowNum = _rows.firstKey();if (firstRowNum!=null) {int rowIndex = firstRowNum.intValue();SXSSFRow row = _rows.get(firstRowNum);// Update the best fit column widths for auto-sizing just before the rows are flushed_autoSizeColumnTracker.updateColumnWidths(row);//写盘_writer.writeRow(rowIndex, row);//然后把row remove掉,这里的_rows是一个TreeMap结构_rows.remove(firstRowNum);lastFlushedRowNumber = rowIndex;}}
再看看刷盘的具体操作
SXSSFSheet在创建的时候,都会创建一个SheetDataWriter,刷盘动作正是由这个类完成的
看下SheetDataWriter的初始化
public SheetDataWriter() throws IOException {//创建临时文件_fd = createTempFile();//拿到文件的BufferedWriter_out = createWriter(_fd);}//在本地创建了一个临时文件前缀为poi-sxssf-sheet,后缀为.xmlpublic File createTempFile() throws IOException {return TempFile.createTempFile("poi-sxssf-sheet", ".xml");}public static File createTempFile(String prefix, String suffix) throws IOException {//用一个策略去创建文件return strategy.createTempFile(prefix, suffix);}//这个策略就是在执行路径先创建一个目录(如果不存在的话),然后再在里面创建一个随机唯一命名的文件public File createTempFile(String prefix, String suffix) throws IOException {// Identify and create our temp dir, if neededcreatePOIFilesDirectory();// Generate a unique new filenameFile newFile = File.createTempFile(prefix, suffix, dir);// Set the delete on exit flag, unless explicitly disabledif (System.getProperty(KEEP_FILES) == null) {newFile.deleteOnExit();}// All donereturn newFile;}
POI就是把超过window size的Row刷到临时文件里,然后再把临时文件转为正常的xlsx文件格式输出。
看看刷盘时写了什么,SheetDataWriter的writeRow方法
public void writeRow(int rownum, SXSSFRow row) throws IOException {if (_numberOfFlushedRows == 0)_lowestIndexOfFlushedRows = rownum;_numberLastFlushedRow = Math.max(rownum, _numberLastFlushedRow);_numberOfCellsOfLastFlushedRow = row.getLastCellNum();_numberOfFlushedRows++;beginRow(rownum, row);Iterator<Cell> cells = row.allCellsIterator();int columnIndex = 0;while (cells.hasNext()) {writeCell(columnIndex++, cells.next());}endRow();}void beginRow(int rownum, SXSSFRow row) throws IOException {_out.write("<row");writeAttribute("r", Integer.toString(rownum + 1));if (row.hasCustomHeight()) {writeAttribute("customHeight", "true");writeAttribute("ht", Float.toString(row.getHeightInPoints()));}if (row.getZeroHeight()) {writeAttribute("hidden", "true");}if (row.isFormatted()) {writeAttribute("s", Integer.toString(row.getRowStyleIndex()));writeAttribute("customFormat", "1");}if (row.getOutlineLevel() != 0) {writeAttribute("outlineLevel", Integer.toString(row.getOutlineLevel()));}if(row.getHidden() != null) {writeAttribute("hidden", row.getHidden() ? "1" : "0");}if(row.getCollapsed() != null) {writeAttribute("collapsed", row.getCollapsed() ? "1" : "0");}_out.write(">\n");this._rownum = rownum;}void endRow() throws IOException {_out.write("</row>\n");}public void writeCell(int columnIndex, Cell cell) throws IOException {if (cell == null) {return;}String ref = new CellReference(_rownum, columnIndex).formatAsString();_out.write("<c");writeAttribute("r", ref);CellStyle cellStyle = cell.getCellStyle();if (cellStyle.getIndex() != 0) {// need to convert the short to unsigned short as the indexes can be up to 64k// ideally we would use int for this index, but that would need changes to some more// APIswriteAttribute("s", Integer.toString(cellStyle.getIndex() & 0xffff));}CellType cellType = cell.getCellTypeEnum();switch (cellType) {case BLANK: {_out.write('>');break;}case FORMULA: {_out.write("><f>");outputQuotedString(cell.getCellFormula());_out.write("</f>");switch (cell.getCachedFormulaResultTypeEnum()) {case NUMERIC:double nval = cell.getNumericCellValue();if (!Double.isNaN(nval)) {_out.write("<v>");_out.write(Double.toString(nval));_out.write("</v>");}break;default:break;}break;}case STRING: {if (_sharedStringSource != null) {XSSFRichTextString rt = new XSSFRichTextString(cell.getStringCellValue());int sRef = _sharedStringSource.addEntry(rt.getCTRst());writeAttribute("t", STCellType.S.toString());_out.write("><v>");_out.write(String.valueOf(sRef));_out.write("</v>");} else {writeAttribute("t", "inlineStr");_out.write("><is><t");if (hasLeadingTrailingSpaces(cell.getStringCellValue())) {writeAttribute("xml:space", "preserve");}_out.write(">");outputQuotedString(cell.getStringCellValue());_out.write("</t></is>");}break;}case NUMERIC: {writeAttribute("t", "n");_out.write("><v>");_out.write(Double.toString(cell.getNumericCellValue()));_out.write("</v>");break;}case BOOLEAN: {writeAttribute("t", "b");_out.write("><v>");_out.write(cell.getBooleanCellValue() ? "1" : "0");_out.write("</v>");break;}case ERROR: {FormulaError error = FormulaError.forInt(cell.getErrorCellValue());writeAttribute("t", "e");_out.write("><v>");_out.write(error.getString());_out.write("</v>");break;}default: {throw new IllegalStateException("Invalid cell type: " + cellType);}}_out.write("</c>");}
测试
本地测试使用SXSSFWorkbook导出30w行 * 10列内存使用情况
可以看出内存有被回收的情况,比较平稳。
XLS
POI没有像XLSX那样对XLS的写做出性能的优化,原因是:
- 官方认为XLS的不像XLSX那样占内存
- XLS一个Sheet最多也只能有65535行数据
导入优化
POI对导入分为3种模式,用户模式User Model,事件模式Event Model,还有Event User Model。用户模式
用户模式(User Model)就类似于dom方式的解析,是一种high level api,给人快速、方便开发用的。缺点是一次性将文件读入内存,构建一颗Dom树。并且在POI对Excel的抽象中,每一行,每一个单元格都是一个对象。当文件大,数据量多的时候对内存的占用可想而知。
用户模式就是类似用WorkbookFactory.create(inputStream),poi 会把整个文件一次性解析,生成全部的Sheet,Row,Cell以及对象,如果导入文件数据量大的话,也很可能会导致OOM。
本地测试用户模式读取XLSX文件,数据量10w行 * 50列,内存使用如下
事件模式
事件模式(Event Model)就是SAX解析。Event Model使用的方式是边读取边解析,并且不会将这些数据封装成Row,Cell这样的对象。而都只是普通的数字或者是字符串。并且这些解析出来的对象是不需要一直驻留在内存中,而是解析完使用后就可以回收。所以相比于User Model,Event Model更节省内存,效率也更。
但是作为代价,相比User Model功能更少,门槛也要高一些。需要去学习Excel存储数据的各个Xml中每个标签,标签中的属性的含义,然后对解析代码进行设计。User Event Model
User Event Model也是采用流式解析,但是不同于Event Model,POI基于Event Model封装了一层。这样不用面对Element的事件编程,而是面向StartRow,EndRow,Cell等事件编程。而提供的数据,也不再像之前是原始数据,而是全部格式化好,方便开发者开箱即用。简化了开发效率。XLSX
POI对XLSX支持Event Model和Event User ModelXLSX的Event Model
使用
官网例子:
简单来说就是需要继承DefaultHandler,覆盖其startElement,endElement方法。然后方法里获取想要的数据。
原理
DefaultHandler相信熟悉的人都知道,这是JDK自带的对XML的SAX解析用到处理类,POI在进行SAX解析时,把读取到每个XML的元素时则会回调这两个方法,然后就可以获取到想用的数据了。
回忆一下上面说到的XLSX存储格式中sheet存储数据的格式。
再看看官方例子中的解析过程
@Overridepublic void startElement(String uri, String localName, String name,Attributes attributes) throws SAXException {//c代表是一个单元格cell,判断c这个xml元素里面属性attribute t// c => cellif(name.equals("c")) {// Print the cell referenceSystem.out.print(attributes.getValue("r") + " - ");// Figure out if the value is an index in the SSTString cellType = attributes.getValue("t");nextIsString = cellType != null && cellType.equals("s");inlineStr = cellType != null && cellType.equals("inlineStr");}// Clear contents cachelastContents = "";}@Overridepublic void endElement(String uri, String localName, String name)throws SAXException {// Process the last contents as required.// Do now, as characters() may be called more than onceif(nextIsString) {Integer idx = Integer.valueOf(lastContents);lastContents = lruCache.get(idx);if (lastContents == null && !lruCache.containsKey(idx)) {lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();lruCache.put(idx, lastContents);}nextIsString = false;}//v 元素代表这个cell的内容// v => contents of a cell// Output after we've seen the string contentsif(name.equals("v") || (inlineStr && name.equals("c"))) {System.out.println(lastContents);}}
可以看出需要对XLSX的XML格式清楚,才能获取到对应的内容。
XLSX的Event User Model
使用
官方例子
简单来说就是继承XSSFSheetXMLHandler.SheetContentsHandler,覆盖其startRow,endRow,cell,endSheet 等方法。POI每开始读行,结束读行,读取一个cell,结束读取一个sheet时回调的方法。从方法名上看Event User Model有更好的用户体验。
原理
其实Event User Model也是 Event Model的封装,在XSSFSheetXMLHandler(其实也是一个DefaultHandler来的)中持有一个SheetContentsHandler,在其startElement,endElement方法中会调用SheetContentsHandler的startRow,endRow,cell,endSheet等方法。
看看XSSFSheetXMLHandler的startElement和endElement方法
public void startElement(String uri, String localName, String qName,Attributes attributes) throws SAXException {if (uri != null && ! uri.equals(NS_SPREADSHEETML)) {return;}if (isTextTag(localName)) {vIsOpen = true;// Clear contents cachevalue.setLength(0);} else if ("is".equals(localName)) {// Inline string outer tagisIsOpen = true;} else if ("f".equals(localName)) {// Clear contents cacheformula.setLength(0);// Mark us as being a formula if not alreadyif(nextDataType == xssfDataType.NUMBER) {nextDataType = xssfDataType.FORMULA;}// Decide where to get the formula string fromString type = attributes.getValue("t");if(type != null && type.equals("shared")) {// Is it the one that defines the shared, or uses it?String ref = attributes.getValue("ref");String si = attributes.getValue("si");if(ref != null) {// This one defines it// TODO Save it somewherefIsOpen = true;} else {// This one uses a shared formula// TODO Retrieve the shared formula and tweak it to// match the current cellif(formulasNotResults) {logger.log(POILogger.WARN, "shared formulas not yet supported!");} /*else {// It's a shared formula, so we can't get at the formula string yet// However, they don't care about the formula string, so that's ok!}*/}} else {fIsOpen = true;}}else if("oddHeader".equals(localName) || "evenHeader".equals(localName) ||"firstHeader".equals(localName) || "firstFooter".equals(localName) ||"oddFooter".equals(localName) || "evenFooter".equals(localName)) {hfIsOpen = true;// Clear contents cacheheaderFooter.setLength(0);}else if("row".equals(localName)) {String rowNumStr = attributes.getValue("r");if(rowNumStr != null) {rowNum = Integer.parseInt(rowNumStr) - 1;} else {rowNum = nextRowNum;}//回调了SheetContentsHandler的startRow方法output.startRow(rowNum);}// c => cellelse if ("c".equals(localName)) {// Set up defaults.this.nextDataType = xssfDataType.NUMBER;this.formatIndex = -1;this.formatString = null;cellRef = attributes.getValue("r");String cellType = attributes.getValue("t");String cellStyleStr = attributes.getValue("s");if ("b".equals(cellType))nextDataType = xssfDataType.BOOLEAN;else if ("e".equals(cellType))nextDataType = xssfDataType.ERROR;else if ("inlineStr".equals(cellType))nextDataType = xssfDataType.INLINE_STRING;else if ("s".equals(cellType))nextDataType = xssfDataType.SST_STRING;else if ("str".equals(cellType))nextDataType = xssfDataType.FORMULA;else {// Number, but almost certainly with a special style or formatXSSFCellStyle style = null;if (stylesTable != null) {if (cellStyleStr != null) {int styleIndex = Integer.parseInt(cellStyleStr);style = stylesTable.getStyleAt(styleIndex);} else if (stylesTable.getNumCellStyles() > 0) {style = stylesTable.getStyleAt(0);}}if (style != null) {this.formatIndex = style.getDataFormat();this.formatString = style.getDataFormatString();if (this.formatString == null)this.formatString = BuiltinFormats.getBuiltinFormat(this.formatIndex);}}}}
~
@Overridepublic void endElement(String uri, String localName, String qName)throws SAXException {if (uri != null && ! uri.equals(NS_SPREADSHEETML)) {return;}String thisStr = null;// v => contents of a cellif (isTextTag(localName)) {vIsOpen = false;// Process the value contents as required, now we have it allswitch (nextDataType) {case BOOLEAN:char first = value.charAt(0);thisStr = first == '0' ? "FALSE" : "TRUE";break;case ERROR:thisStr = "ERROR:" + value;break;case FORMULA:if(formulasNotResults) {thisStr = formula.toString();} else {String fv = value.toString();if (this.formatString != null) {try {// Try to use the value as a formattable numberdouble d = Double.parseDouble(fv);thisStr = formatter.formatRawCellContents(d, this.formatIndex, this.formatString);} catch(NumberFormatException e) {// Formula is a String result not a Numeric onethisStr = fv;}} else {// No formatting applied, just do raw value in all casesthisStr = fv;}}break;case INLINE_STRING:// TODO: Can these ever have formatting on them?XSSFRichTextString rtsi = new XSSFRichTextString(value.toString());thisStr = rtsi.toString();break;case SST_STRING:String sstIndex = value.toString();try {int idx = Integer.parseInt(sstIndex);XSSFRichTextString rtss = new XSSFRichTextString(sharedStringsTable.getEntryAt(idx));thisStr = rtss.toString();}catch (NumberFormatException ex) {logger.log(POILogger.ERROR, "Failed to parse SST index '" + sstIndex, ex);}break;case NUMBER:String n = value.toString();if (this.formatString != null && n.length() > 0)thisStr = formatter.formatRawCellContents(Double.parseDouble(n), this.formatIndex, this.formatString);elsethisStr = n;break;default:thisStr = "(TODO: Unexpected type: " + nextDataType + ")";break;}// Do we have a comment for this cell?checkForEmptyCellComments(EmptyCellCommentsCheckType.CELL);XSSFComment comment = commentsTable != null ? commentsTable.findCellComment(new CellAddress(cellRef)) : null;//回调了SheetContentsHandler的cell方法// Outputoutput.cell(cellRef, thisStr, comment);} else if ("f".equals(localName)) {fIsOpen = false;} else if ("is".equals(localName)) {isIsOpen = false;} else if ("row".equals(localName)) {// Handle any "missing" cells which had comments attachedcheckForEmptyCellComments(EmptyCellCommentsCheckType.END_OF_ROW);//回调了SheetContentsHandler的endRow方法// Finish up the rowoutput.endRow(rowNum);// some sheets do not have rowNum set in the XML, Excel can read them so we should try to read them as wellnextRowNum = rowNum + 1;} else if ("sheetData".equals(localName)) {// Handle any "missing" cells which had comments attachedcheckForEmptyCellComments(EmptyCellCommentsCheckType.END_OF_SHEET_DATA);}else if("oddHeader".equals(localName) || "evenHeader".equals(localName) ||"firstHeader".equals(localName)) {hfIsOpen = false;output.headerFooter(headerFooter.toString(), true, localName);}else if("oddFooter".equals(localName) || "evenFooter".equals(localName) ||"firstFooter".equals(localName)) {hfIsOpen = false;output.headerFooter(headerFooter.toString(), false, localName);}}
代码有点多,
- 一是为了展示一下XSSFSheetXMLHandler解析XML的过程,可以粗略看看
- 二是可以看出Event User Model也是Event Model的封装
测试
本地测试使用Event User Model读取XLSX文件,数据量10w行 * 50列
可以看出内存有回收的情况,比User Model好多了。XLS
POI对XLS支持Event Model使用
官方例子
需要继承HSSFListener,覆盖processRecord 方法,POI每读取到一个单元格的数据则会回调次方法。
原理
总结
POI优化了对XLSX的大批量写,以及支持对XLS和XLSX的SAX读,在实际开发时需要根据业务量来选择正确的处理,不然可能会导致OOM。
