xlripper

Version: v0.0.0
Published: Feb 1, 2019 License: MIT Imports: 21 Imported by: 0

README

xlripper

Quickly Extract Data from XLSX Sheets in Go

Background

We have twice encountered libraries (once in Node.js and once in Go) that extract data from XLSX files far too slowly, or with too much RAM, for our use case. That use case involves extracting a spreadsheet with about 65 columns and about 380,000 rows. The compressed file is around 80 MB.

By writing a C++ Node.js plugin that uses xlsxio (https://github.com/brechtsanders/xlsxio), we achieved a benchmark of approximately 30 seconds using about 4 GB of RAM (though it was hard to tell how the plugin would have performed without the JavaScript engine). We later ported our main application to Go and attempted to use https://github.com/tealeg/xlsx. It is not clear how long this library would take to finish, but it needs more than 5 minutes and at least 7 GB of RAM. We believe there are two bottlenecks: 1) primarily, Go's built-in XML parsing is too slow, and 2) secondarily, tealeg's library provides features that we do not need, which may slow it down slightly. Profiling reveals that the largest bottleneck is Go's very sad XML parser.

Solution

The xlripper library does one thing only. Its purpose is to take in an xlsx file and to return arrays of strings representing the data found in the xlsx file's sheets.

Priorities

  1. Native Go implementation, no cgo.
  2. Go as fast as possible.
  3. Optimize for lower memory overhead if it can be done without making it too slow.

Installation

This is a Go library; there is no main function. To use this library in your own application:

go get -u github.com/bitflip-software/xlsx

Documentation

Types

type Column

type Column struct {
	Cells []*string
}

Column represents a column of values in an xlsx spreadsheet. Cell values are represented as strings. Type and formatting information from the spreadsheet is discarded; only a string representation of the value remains.

The strings are held as pointers for the sake of memory optimization. Do not mutate them: other columns or cells may point to the same string, so a change made through one pointer would show up everywhere that string is shared. The data structure is intended to be used as a read-only data structure.
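The read-only advice above can be illustrated with a minimal sketch. The Column type below is a local copy of the documented struct so that the example compiles on its own; cellText and copyCells are hypothetical helpers, not part of xlripper, and treating a nil pointer as an empty cell is an assumption that the documentation does not confirm.

```go
package main

import "fmt"

// Column mirrors the documented xlripper.Column so this sketch is
// self-contained; real code would use the library's own type.
type Column struct {
	Cells []*string
}

// cellText is a hypothetical helper. It returns the string a cell points to,
// or "" when the row is out of range or the pointer is nil (assumed here to
// mean an empty cell).
func cellText(c Column, row int) string {
	if row < 0 || row >= len(c.Cells) || c.Cells[row] == nil {
		return ""
	}
	return *c.Cells[row]
}

// copyCells is a hypothetical helper. It returns an independent []string
// that can be modified without touching any string shared between cells.
func copyCells(c Column) []string {
	out := make([]string, len(c.Cells))
	for i := range c.Cells {
		out[i] = cellText(c, i)
	}
	return out
}

func main() {
	shared := "yes"
	a := Column{Cells: []*string{&shared, nil}}
	b := Column{Cells: []*string{&shared}} // points at the same string as a

	vals := copyCells(a)
	vals[0] = "no" // changes only the copy, not the shared string

	fmt.Println(cellText(a, 0), cellText(b, 0), vals[0])
}
```

Copying costs extra memory, so it is probably only worth doing for the cells you actually need to change.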

func NewColumn

func NewColumn() Column

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

func NewParser

func NewParser(filename string) (Parser, error)

func NewParserFromBytes

func NewParserFromBytes(b []byte) (Parser, error)

func (Parser) NumSheets

func (p Parser) NumSheets() int

func (Parser) Parse

func (p Parser) Parse() ([]Sheet, error)

func (Parser) ParseOne

func (p Parser) ParseOne(sheetIndex int) (Sheet, error)

func (Parser) SheetNames

func (p Parser) SheetNames() []string

type Sheet

type Sheet struct {
	Name    string
	Index   int
	Columns []Column
	// contains filtered or unexported fields
}

func NewSheet

func NewSheet() Sheet
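Because a Sheet stores its data by column, row-oriented code needs a transpose. The sketch below shows one way to do that under stated assumptions: Column and Sheet are local copies of the documented structs (exported fields only), rows is a hypothetical helper that is not part of xlripper, and both nil pointers and columns shorter than the tallest column are assumed to represent empty cells.

```go
package main

import "fmt"

// Local mirrors of the documented types, for a self-contained sketch.
type Column struct{ Cells []*string }

type Sheet struct {
	Name    string
	Index   int
	Columns []Column
}

// rows is a hypothetical helper that converts column-major sheet data into
// row-major [][]string. Missing or nil cells become "".
func rows(s Sheet) [][]string {
	// The tallest column determines the number of rows.
	height := 0
	for _, c := range s.Columns {
		if len(c.Cells) > height {
			height = len(c.Cells)
		}
	}

	out := make([][]string, height)
	for r := range out {
		out[r] = make([]string, len(s.Columns))
		for ci, c := range s.Columns {
			if r < len(c.Cells) && c.Cells[r] != nil {
				out[r][ci] = *c.Cells[r]
			}
		}
	}
	return out
}

func main() {
	a, b, c := "a", "b", "c"
	s := Sheet{
		Name: "Data",
		Columns: []Column{
			{Cells: []*string{&a, &b}},
			{Cells: []*string{&c}}, // shorter column: second row is empty
		},
	}
	for _, row := range rows(s) {
		fmt.Printf("%q\n", row)
	}
}
```

The transpose copies every string header into a new slice, so for a sheet of the size described in the Background it may be preferable to iterate over Columns directly.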

Directories

Path Synopsis
Package privxml exists so that structs can be publicly exported for Go's built-in XML parser.
