Generate XLSX with unlimited rows
17 Aug 2019

In my previous post I talked about Elixir streams. In this post I am going to look at a particular use case: how would one approach the problem of generating an Excel spreadsheet with an unlimited number of rows? The main constraint is that we only have a limited amount of memory available.
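As a quick refresher, streams are evaluated lazily, one element at a time, so memory usage stays constant no matter how large the source is:

1..1_000_000
|> Stream.map(&(&1 * 2))
|> Stream.take(5)
|> Enum.to_list()
# => [2, 4, 6, 8, 10]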
The first part of the problem is to understand the XLSX file format. I won’t get into the details here, but there is a nice article about the anatomy of the XLSX file. In essence, XLSX is just a zip file with a different file extension.
$ unzip -q -d spreadsheet spreadsheet.xlsx
$ tree -C spreadsheet/
spreadsheet/
├── [Content_Types].xml
├── _rels
├── docProps
│   ├── app.xml
│   └── core.xml
└── xl
    ├── _rels
    │   └── workbook.xml.rels
    ├── sharedStrings.xml
    ├── styles.xml
    ├── workbook.xml
    └── worksheets
        └── sheet1.xml
Except for sheet1.xml, the other xml files are made of mostly static content that will be nearly the same size regardless of the number of rows.
Now we have two smaller problems:
1) Can we generate xml using a stream?
2) Can we combine multiple xml streams and generate a zip stream?
Let’s start with the xml problem. In an abstract sense, an xml document is a tree structure. To print an xml document, we have to first print the open tag <tag>, then all its children, and finally the close tag </tag>. Say we do a pre-order-like traversal; at any point in time, we have to keep the list of parent nodes in a stack-like data structure. Essentially, at minimum we would need memory proportional to the depth of the tree. Fortunately, most of the xml documents we deal with in the real world don’t have much depth. In our spreadsheet case, the max depth would be less than 10.
Let’s take the example of generating an html table. Consider the case of generating a table with an unlimited number of rows and an unlimited number of cells within each row.
import XmlStream

rows = Stream.map(1..1_000_000, fn row ->
  Stream.map(1..1000, fn cell ->
    {row, cell}
  end)
end)

element(
  "table",
  Stream.map(rows, fn row ->
    element(
      "tr",
      Stream.map(row, fn {row_index, cell_index} ->
        element("td", content("row #{row_index} cell #{cell_index}"))
      end)
    )
  end)
)
|> XmlStream.stream!()
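The byte stream can then be consumed lazily. As a minimal usage sketch (html_stream here is just a name for the pipeline above), it can be written to a file chunk by chunk without ever holding the whole document in memory:

html_stream
|> Stream.into(File.stream!("table.html"))
|> Stream.run()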
We have a nested stream called rows and an xml document constructed based on that stream using primitives like element and content. element takes the tag name as its first argument. The second argument represents the children; it can be either a stream or text generated using content.
XmlStream.stream! is the one that does the real work. It takes an xml document and returns a byte stream. Essentially, it does a pre-order traversal of the xml nodes and emits the required tags whenever it enters and exits a node.
Even though the approach makes sense, we haven’t explained how to do a pre-order traversal using the Stream module. There is no Stream.traverse function, but it’s quite trivial to implement a traverse function on top of Stream.flat_map:
def traverse(stream, leaf?) do
  Stream.flat_map(stream, fn node ->
    if leaf?.(node) do
      # emit a marker followed by the leaf value itself
      [:leaf, node]
    else
      # wrap the recursive traversal with enter/exit markers
      Stream.concat([
        [:node_start],
        traverse(node, leaf?),
        [:node_end]
      ])
    end
  end)
end
traverse(rows, &is_tuple/1)
|> Stream.each(&IO.inspect/1)
|> Stream.run()
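Running this prints a flat sequence of markers and leaf values, along these lines (truncated):

:node_start
:leaf
{1, 1}
:leaf
{1, 2}
...
:leaf
{1, 1000}
:node_end
:node_start
:leaf
{2, 1}
...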
The important part is the recursive call. The callback function passed to Stream.flat_map can return either a normal list or another stream. In the case of a stream, it will be forced, which will trigger another Stream.flat_map, and so on. At any point in time, we will have at most one active stream per level of depth. A full implementation of XmlStream can be found at activesphere/xml_stream. Even though the code might look complicated, the essential principle is the same.
Now that we have solved the first problem, how could we combine these xml streams and generate a zip file?
If we look at the overview of the zip file format in the specification, we can find two important headers: local file header and central directory header.
[local file header 1]
[encryption header 1]
[file data 1]
[data descriptor 1]
.
.
.
[local file header n]
[encryption header n]
[file data n]
[data descriptor n]
[archive decryption header]
[archive extra data record]
[central directory header 1]
.
.
.
[central directory header n]
[zip64 end of central directory record]
[zip64 end of central directory locator]
[end of central directory record]
local file header signature     4 bytes  (0x04034b50)
version needed to extract       2 bytes
general purpose bit flag        2 bytes
compression method              2 bytes
last mod file time              2 bytes
last mod file date              2 bytes
crc-32                          4 bytes
compressed size                 4 bytes
uncompressed size               4 bytes
file name length                2 bytes
extra field length              2 bytes
file name                       (variable size)
extra field                     (variable size)
central file header signature   4 bytes  (0x02014b50)
version made by                 2 bytes
version needed to extract       2 bytes
general purpose bit flag        2 bytes
compression method              2 bytes
last mod file time              2 bytes
last mod file date              2 bytes
crc-32                          4 bytes
compressed size                 4 bytes
uncompressed size               4 bytes
file name length                2 bytes
extra field length              2 bytes
file comment length             2 bytes
disk number start               2 bytes
internal file attributes        2 bytes
external file attributes        4 bytes
relative offset of local header 4 bytes
file name                       (variable size)
extra field                     (variable size)
file comment                    (variable size)
Both the local file header and the central directory header contain fields like crc-32 and the compressed and uncompressed sizes, which can’t be calculated beforehand in our case. Fortunately, these fields are optional in the local file header: if bit 3 of the general purpose bit flag is set, they can be written as 0, and the actual values are emitted in a data descriptor after the file data. The irony of the zip format is that if we write it in a streaming way, we can’t read it back in a streaming way, since a reader would only learn the sizes after the data.
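To make this concrete, here is a minimal sketch (not Zstream’s actual code) of emitting such a local file header using Elixir’s binary syntax; the layout follows the table above, with the version and time/date fields simplified:

# crc-32 and both sizes are written as 0 because bit 3 of the general
# purpose flag is set; the real values follow the file data in a data
# descriptor.
defp local_file_header(name) do
  header = <<
    0x04034B50::little-32,      # local file header signature
    20::little-16,              # version needed to extract
    0x0008::little-16,          # general purpose bit flag, bit 3 set
    0::little-16,               # compression method (0 = stored)
    0::little-16,               # last mod file time (simplified)
    0::little-16,               # last mod file date (simplified)
    0::little-32,               # crc-32, deferred to data descriptor
    0::little-32,               # compressed size, deferred
    0::little-32,               # uncompressed size, deferred
    byte_size(name)::little-16, # file name length
    0::little-16                # extra field length
  >>

  [header, name]
end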
Zstream.zip([
  Zstream.entry("xl/worksheets/sheet1.xml", sheet1_stream),
  Zstream.entry("xl/workbook.xml", workbook_stream)
])
|> Stream.into(File.stream!("spreadsheet.xlsx"))
|> Stream.run()
Assuming we expose an interface like the one above, we can implement it using a combination of Stream.flat_map and Stream.transform.
def zip(entries) do
  Stream.concat([
    [{:start}],
    Stream.flat_map(entries, fn %{stream: stream, name: name, options: options} ->
      Stream.concat(
        [{:head, %{name: name, options: options}}],
        stream
      )
    end),
    [{:end}]
  ])
  |> Stream.transform(fn -> %State{} end, &construct/2, &free_resource/1)
end
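Stream.transform/4 is worth a closer look, since it carries all the state here: it takes a function that builds the initial accumulator, a reducer that emits elements and updates the accumulator, and a function that runs once the stream is done. A minimal, self-contained example of the same shape (the data is made up for illustration):

["ab", "cde", "f"]
|> Stream.transform(
  fn -> 0 end,                    # build the initial state
  fn chunk, size ->               # emit the chunk, update the byte count
    {[chunk], size + byte_size(chunk)}
  end,
  fn size ->                      # runs once at the end
    IO.puts("total bytes: #{size}")
  end
)
|> Enum.to_list()
# => ["ab", "cde", "f"], after printing "total bytes: 6"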
The entries are flat-mapped and extra annotations are added, so we know when a file stream starts and when it ends. The construct function will then emit the correct zip headers and keep track of the list of files and their sizes in the state.
defp construct({:start}, state) do
defp construct({:end}, state) do
defp construct({:head, header}, state) do
defp construct(chunk, state) do
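As an illustration, here is a hedged sketch of what the two middle clauses might look like; the state fields (offset, crc, size) and the local_file_header/1 helper from the earlier sketch are assumptions for the example, not Zstream’s actual internals:

# {:head, ...}: emit a local file header and reset the per-entry counters
defp construct({:head, header}, state) do
  data = local_file_header(header.name)
  state = %{state | crc: 0, size: 0, offset: state.offset + IO.iodata_length(data)}
  {[data], state}
end

# data chunk: pass it through, updating the running crc-32 and size
defp construct(chunk, state) do
  state = %{
    state
    | crc: :erlang.crc32(state.crc, chunk),
      size: state.size + IO.iodata_length(chunk)
  }
  {[chunk], state}
end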
When all the streams are emitted and it hits the :end annotation, the central directory headers will be emitted based on the data in the state.
The only place we need unbounded memory is the state where we keep track of the central directory headers. But as we use a very small amount of memory per file, this should not be an issue in the real world. The full source code of the library can be found at ananthakumaran/zstream.