snappy

package module
v0.1.0
Published: Aug 2, 2024 License: Apache-2.0 Imports: 6 Imported by: 0

README

Hadoop Snappy Reader


Small library that provides a reader for reading Hadoop Snappy encoded data. See the Go Package documentation for more information and examples of how to use the reader.

There are currently no plans to implement a writer, as the main utility of this library is to read data already produced by the Hadoop ecosystem. However, we are open to extending this library to support a writer or other use cases if there is interest.

Developing

Prerequisites
  1. Install Go
Run Tests
go test ./...
Creating Test Data
  1. Install snzip
  2. Add the uncompressed file to testdata/
  3. Create the compressed file with snzip -t hadoop-snappy -k testdata/{uncompressed file}
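If snzip is not available, a small fixture can also be built by hand from the framing rules described below. This is a stdlib-only sketch (the function name is illustrative, not part of this library): it emits a single frame containing a single literal-only snappy block, so it only works for inputs of 1 to 60 bytes and performs no actual compression.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeLiteral builds a hadoop-snappy stream from data as one frame holding
// one literal-only snappy block. Valid only for 1-60 byte inputs; it stores
// the bytes uncompressed inside a well-formed stream.
func encodeLiteral(data []byte) []byte {
	if len(data) == 0 || len(data) > 60 {
		panic("sketch only handles 1-60 byte inputs")
	}
	// Snappy block: varint uncompressed length, literal tag, raw bytes.
	// For lengths 1-60 the literal length is stored in the tag byte itself.
	block := []byte{byte(len(data)), byte((len(data) - 1) << 2)}
	block = append(block, data...)

	out := make([]byte, 8, 8+len(block))
	binary.BigEndian.PutUint32(out[0:4], uint32(len(data)))  // frame header: decompressed length
	binary.BigEndian.PutUint32(out[4:8], uint32(len(block))) // block header: compressed length
	return append(out, block...)
}

func main() {
	stream := encodeLiteral([]byte("Hello, world!"))
	fmt.Printf("% X\n", stream)
}
```

Writing the resulting bytes to a file under testdata/ yields a fixture the reader can decode.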

Release

Be sure to understand how Go Module publishing works, especially semantic versioning.

To release, create a new semantically versioned tag and push it.

# Create a new semantic versioned tag with release notes
git tag -a v1.0.0 -m "release notes"

# Push the tag to the remote repository
git push origin v1.0.0

Hadoop-Snappy Stream Encoding Format

The Hadoop format of snappy is similar to regular snappy block encoding, except that instead of compressing the input into one big block, Hadoop creates a stream of frames, where each frame contains blocks that can each be independently decoded. A frame can contain one or more blocks, and a stream can contain one or more frames.

Each FRAME begins with a 4 byte header, which represents the total length of the frame after being DECOMPRESSED (i.e. once we're done decompressing the frame, this is how long the decompressed frame will be). This 4 byte header is a big endian encoded uint32. The header is not included in the total length of the frame.

Each BLOCK in the frame also begins with a 4 byte header that is the COMPRESSED length of the block (i.e. how many bytes we need to read from the stream to get the entire block before we can decompress it). This header is also a big endian encoded uint32. The header is not included in the total length of the block.

The stream structure is as follows:

'['   == start of stream
']'   == end of stream
'|'   == component separator (symbolic only as the actual data has no padding or separators)
'...' == abbreviated

[ frame 1 header | block 1 header | block 1 | block 2 header | block 2 | ... | frame 2 header | block 1 header | block 1 | ... ]

The format of each individual snappy block is specified in the upstream Snappy format description (format_description.txt in the google/snappy repository).
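To make the framing above concrete, here is a stdlib-only sketch (not this package's API) that walks a hadoop-snappy stream, reads the big-endian frame and block headers, and decodes the block payload. It assumes each block holds a single literal element; real snappy blocks may also contain copy elements, which this sketch does not handle.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeStream decodes a hadoop-snappy stream whose blocks each contain a
// single literal element. General snappy blocks also use copy elements,
// which are deliberately out of scope here.
func decodeStream(stream []byte) []byte {
	var out []byte
	for len(stream) > 0 {
		frameLen := binary.BigEndian.Uint32(stream[0:4]) // decompressed frame length
		stream = stream[4:]
		var got uint32
		for got < frameLen {
			blockLen := binary.BigEndian.Uint32(stream[0:4]) // compressed block length
			block := stream[4 : 4+blockLen]
			stream = stream[4+blockLen:]

			uncompressed, n := binary.Uvarint(block) // snappy: varint uncompressed length
			tag := block[n]
			if tag&0x03 != 0x00 {
				panic("sketch only handles literal elements")
			}
			litLen := int(tag>>2) + 1 // literal lengths 1-60 live in the tag byte
			out = append(out, block[n+1:n+1+litLen]...)
			got += uint32(uncompressed)
		}
	}
	return out
}

func main() {
	// "Hello, world!" as a single-frame, single-block hadoop-snappy stream.
	stream := []byte{
		0x00, 0x00, 0x00, 0x0D, // frame header: 13 decompressed bytes
		0x00, 0x00, 0x00, 0x0F, // block header: 15 compressed bytes
		0x0D, 0x30, // snappy block: length varint (13), literal tag
		'H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!',
	}
	fmt.Printf("%s\n", decodeStream(stream))
}
```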

Documentation

Overview

Package snappy implements decompression for the Hadoop format of snappy, a compression scheme internal to the Hadoop ecosystem and HDFS.

Example
package main

import (
	"bytes"
	"fmt"
	"io"

	snappy "github.com/qualtrics/hadoop-snappy"
)

func main() {
	// encodedData is the string "Hello, world!" encoded using the hadoop-snappy compression format.
	encodedData := []byte{0x00, 0x00, 0x00, 0x0D, 0x00, 0x00, 0x00, 0x0F, 0x0D, 0x30, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x77, 0x6F, 0x72, 0x6C, 0x64, 0x21}

	r := snappy.NewReader(bytes.NewReader(encodedData))

	output, err := io.ReadAll(r)
	if err != nil {
		panic(err)
	}

	fmt.Printf("%s\n", output)
}
Output:

Hello, world!


Types

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader wraps a hadoop-snappy compressed data stream and decompresses the stream as it is read by the caller.

func NewReader

func NewReader(in io.Reader) *Reader

NewReader returns a Reader that reads the hadoop-snappy compressed data stream provided by the input reader.

Reading from an input stream that is not hadoop-snappy compressed will result in undefined behavior. Because there is no data signature to detect the compression format, the reader can only try to read the stream and will likely return an error, but it may return garbage data instead.

func (*Reader) Read

func (r *Reader) Read(out []byte) (int, error)

Read implements the io.Reader interface. Read will return the decompressed data from the compressed input data stream. Read returns io.EOF when all data has been decompressed and read.
