snappy

package module
v0.1.0
Published: Aug 2, 2024 License: Apache-2.0 Imports: 6 Imported by: 0

README

Hadoop Snappy Reader


Small library that provides a reader for reading Hadoop Snappy encoded data. See the Go Package documentation for more information and examples of how to use the reader.

There are currently no plans to implement a writer, as the main utility of this library is to read data already produced by the Hadoop ecosystem. However, we are open to extending this library to support a writer or other use cases if there is interest.

Developing

Prerequisites
  1. Install Go
Run Tests
go test ./...
Creating Test Data
  1. Install snzip
  2. Add the uncompressed file to testdata/
  3. Create the compressed file with snzip -t hadoop-snappy -k testdata/{uncompressed file}
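If snzip is not available, a small fixture can also be built by hand from the framing rules described below. This is a stdlib-only sketch (the function name is illustrative, not part of this library): it emits a single frame containing a single literal-only snappy block, so it only works for inputs of 1 to 60 bytes and performs no actual compression.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeLiteral builds a hadoop-snappy stream from data as one frame holding
// one literal-only snappy block. Valid only for 1-60 byte inputs; it stores
// the bytes uncompressed inside a well-formed stream.
func encodeLiteral(data []byte) []byte {
	if len(data) == 0 || len(data) > 60 {
		panic("sketch only handles 1-60 byte inputs")
	}
	// Snappy block: varint uncompressed length, literal tag, raw bytes.
	// For lengths 1-60 the literal length is stored in the tag byte itself.
	block := []byte{byte(len(data)), byte((len(data) - 1) << 2)}
	block = append(block, data...)

	out := make([]byte, 8, 8+len(block))
	binary.BigEndian.PutUint32(out[0:4], uint32(len(data)))  // frame header: decompressed length
	binary.BigEndian.PutUint32(out[4:8], uint32(len(block))) // block header: compressed length
	return append(out, block...)
}

func main() {
	stream := encodeLiteral([]byte("Hello, world!"))
	fmt.Printf("% X\n", stream)
}
```

Writing the resulting bytes to a file under testdata/ yields a fixture the reader can decode.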

Release

Be sure to understand how Go Module publishing works, especially semantic versioning.

To release, create a new semantically versioned tag and push it.

# Create a new semantic versioned tag with release notes
git tag -a v1.0.0 -m "release notes"

# Push the tag to the remote repository
git push origin v1.0.0

Hadoop-Snappy Stream Encoding Format

The Hadoop format of snappy is similar to regular snappy block encoding, except that instead of compressing the input into one big block, Hadoop creates a stream of frames, where each frame contains blocks that can each be independently decoded. A frame can contain one or more blocks, and a stream can contain one or more frames.

Each FRAME begins with a 4 byte header, which represents the total length of the frame after being DECOMPRESSED (i.e. once we're done decompressing the frame, this is how long the decompressed frame will be). This 4 byte header is a big endian encoded uint32. The header is not included in the total length of the frame.

Each BLOCK in the frame also begins with a 4 byte header that is the COMPRESSED length of the block (i.e. how many bytes we need to read from the stream to get the entire block before we can decompress it). This header is also a big endian encoded uint32. The header is not included in the total length of the block.

The stream structure is as follows:

'['   == start of stream
']'   == end of stream
'|'   == component separator (symbolic only as the actual data has no padding or separators)
'...' == abbreviated

[ frame 1 header | block 1 header | block 1 | block 2 header | block 2 | ... | frame 2 header | block 1 header | block 1 | ... ]

The format of each individual snappy block is specified in the upstream Snappy format description (format_description.txt in the google/snappy repository).
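To make the framing above concrete, here is a stdlib-only sketch (not this package's API) that walks a hadoop-snappy stream, reads the big-endian frame and block headers, and decodes the block payload. It assumes each block holds a single literal element; real snappy blocks may also contain copy elements, which this sketch does not handle.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeStream decodes a hadoop-snappy stream whose blocks each contain a
// single literal element. General snappy blocks also use copy elements,
// which are deliberately out of scope here.
func decodeStream(stream []byte) []byte {
	var out []byte
	for len(stream) > 0 {
		frameLen := binary.BigEndian.Uint32(stream[0:4]) // decompressed frame length
		stream = stream[4:]
		var got uint32
		for got < frameLen {
			blockLen := binary.BigEndian.Uint32(stream[0:4]) // compressed block length
			block := stream[4 : 4+blockLen]
			stream = stream[4+blockLen:]

			uncompressed, n := binary.Uvarint(block) // snappy: varint uncompressed length
			tag := block[n]
			if tag&0x03 != 0x00 {
				panic("sketch only handles literal elements")
			}
			litLen := int(tag>>2) + 1 // literal lengths 1-60 live in the tag byte
			out = append(out, block[n+1:n+1+litLen]...)
			got += uint32(uncompressed)
		}
	}
	return out
}

func main() {
	// "Hello, world!" as a single-frame, single-block hadoop-snappy stream.
	stream := []byte{
		0x00, 0x00, 0x00, 0x0D, // frame header: 13 decompressed bytes
		0x00, 0x00, 0x00, 0x0F, // block header: 15 compressed bytes
		0x0D, 0x30, // snappy block: length varint (13), literal tag
		'H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!',
	}
	fmt.Printf("%s\n", decodeStream(stream))
}
```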

Documentation

Overview

Package snappy implements decompression for the Hadoop format of snappy, a compression scheme internal to the Hadoop ecosystem and HDFS.

Example
package main

import (
	"bytes"
	"fmt"
	"io"

	snappy "github.com/qualtrics/hadoop-snappy"
)

func main() {
	// encodedData is the string "Hello, world!" encoded using the hadoop-snappy compression format.
	encodedData := []byte{0x00, 0x00, 0x00, 0x0D, 0x00, 0x00, 0x00, 0x0F, 0x0D, 0x30, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x77, 0x6F, 0x72, 0x6C, 0x64, 0x21}

	r := snappy.NewReader(bytes.NewReader(encodedData))

	output, err := io.ReadAll(r)
	if err != nil {
		panic(err)
	}

	fmt.Printf("%s\n", output)
}
Output:

Hello, world!


Types

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader wraps a hadoop-snappy compressed data stream and decompresses the stream as it is read by the caller.

func NewReader

func NewReader(in io.Reader) *Reader

NewReader returns a Reader that reads the hadoop-snappy compressed data stream provided by the input reader.

Reading from an input stream that is not hadoop-snappy compressed will result in undefined behavior. Because there is no data signature to detect the compression format, the reader can only try to read the stream and will likely return an error, but it may return garbage data instead.

func (*Reader) Read

func (r *Reader) Read(out []byte) (int, error)

Read implements the io.Reader interface. Read will return the decompressed data from the compressed input data stream. Read returns io.EOF when all data has been decompressed and read.
