float8

package module
v0.0.3
Published: Aug 1, 2024 License: MIT Imports: 1 Imported by: 1

README

Float8

minifloat in Golang


In computing, minifloats are floating-point values represented with very few bits. The library implements float8, an 8-bit floating-point type stored in a uint8. It is ideal for applications where memory and storage efficiency are crucial but lossy precision is acceptable (e.g. computer graphics, machine learning).

Features

  • IEEE 754 and FP8 E4M3 compatible format.
  • Fast conversion from/to float32.
  • Fast algebraic operations (+, -, *, /).

Getting Started

The latest version of the module is available on the main branch. All development, including new features and bug fixes, takes place on the main branch using forking and pull requests as described in the contribution guidelines. The stable version is available via Go modules.

Use go get to retrieve the library and add it as a dependency to your application.

go get -u github.com/kshard/float8

Quick example

package main

import (
  "fmt"
  "math"

  "github.com/kshard/float8"
)

func main() {
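  // Encode φ (the golden ratio) into 8 bits, then decode it back to float32.
  phi08 := float8.ToFloat8(math.Phi)
  phi32 := float8.ToFloat32(phi08)
  // Print the round-tripped value and its bit pattern next to the float32 original.
  fmt.Printf("φ(08) %f : %08b \n", phi32, phi08)
  fmt.Printf("φ(32) %f : %032b \n", float32(math.Phi), math.Float32bits(math.Phi))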
}

The conversion is lossy and the supported range of values is limited.
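
To get a feel for how much precision is lost, the following minimal sketch rounds a value to 3 explicit mantissa bits, the width assumed for an E4M3 layout. The helper roundToMantissaBits is hypothetical and not part of the float8 API. It also ignores the limited exponent range, and the library's actual rounding mode may differ.

```go
package main

import (
	"fmt"
	"math"
)

// roundToMantissaBits is a hypothetical helper, not part of the float8 API.
// It rounds x to the nearest value with `bits` explicit mantissa bits
// (ties to even), ignoring the limited exponent range of a real 8-bit format.
func roundToMantissaBits(x float64, bits uint) float64 {
	if x == 0 || math.IsInf(x, 0) || math.IsNaN(x) {
		return x
	}
	frac, exp := math.Frexp(x)                // x = frac × 2^exp with 0.5 ≤ |frac| < 1
	scale := float64(uint64(1) << (bits + 1)) // implicit leading bit + explicit bits
	return math.Ldexp(math.RoundToEven(frac*scale)/scale, exp)
}

func main() {
	for _, x := range []float64{math.Phi, math.Pi} {
		fmt.Printf("%v -> %v\n", x, roundToMantissaBits(x, 3))
	}
}
```

Under these assumptions math.Phi rounds to 1.625 and math.Pi rounds to 3.25, so only about one decimal digit survives the conversion.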

Benchmark

The library implements its public API using code books so that its pure Go code is efficient:

go test -run=^$ -bench=. -benchtime=1s -cpu 1
BenchmarkToFloat8     546344320         1.9410 ns/op
BenchmarkToFloat32   1000000000         0.5158 ns/op
BenchmarkAdd         1000000000         0.5804 ns/op
BenchmarkMul         1000000000         0.6461 ns/op
BenchmarkToSlice8       3481468         348.70 ns/op

The internal package math8 implements floating-point algebra with a focus on correctness and is used to build the code books.
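
The source does not show how the code books are laid out, so the following is only a sketch of the general idea: a slow but straightforward reference decoder fills a 256-entry table once, and every later conversion is a single array index. The names decodeE4M3, codeBook and lookupFloat32 are hypothetical. The layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits) is an assumption, and special values such as infinity are not modelled.

```go
package main

import (
	"fmt"
	"math"
)

// decodeE4M3 is a hypothetical slow reference decoder, not the library's math8 code.
// Assumed layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
// Special values (infinity, NaN) are deliberately not modelled.
func decodeE4M3(b uint8) float32 {
	exp := int(b>>3) & 0x0f
	man := float64(b&0x07) / 8
	var v float64
	if exp == 0 {
		v = math.Ldexp(man, 1-7) // subnormal: no implicit leading 1
	} else {
		v = math.Ldexp(1+man, exp-7) // normal: implicit leading 1
	}
	if b&0x80 != 0 {
		v = -v
	}
	return float32(v)
}

// codeBook holds the decoded value for every possible 8-bit pattern.
var codeBook = func() (t [256]float32) {
	for i := range t {
		t[i] = decodeE4M3(uint8(i))
	}
	return t
}()

// lookupFloat32 converts with a single array index instead of bit arithmetic.
func lookupFloat32(b uint8) float32 { return codeBook[b] }

func main() {
	fmt.Println(lookupFloat32(0x38), lookupFloat32(0x3d), lookupFloat32(0xbd))
}
```

Because an 8-bit type has only 256 possible values, a unary table costs 1 KiB and a full two-operand table of 8-bit results costs 64 KiB, which is what makes this trade of memory for speed practical.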

How To Contribute

The library is MIT licensed and accepts contributions via GitHub pull requests:

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

The build and testing process requires Go version 1.21 or later.

commit message

The commit message helps us write good release notes and speeds up the review process. The message should address two questions: what changed and why. The project follows the template defined in the chapter Contributing to a Project of the Git book.

bugs

If you experience any issues with the library, please let us know via GitHub issues. We appreciate detailed and accurate reports that help us identify and replicate the issue.

License

See LICENSE

References

  1. https://en.wikipedia.org/wiki/Minifloat
  2. https://en.wikipedia.org/wiki/Exponent_bias
  3. https://en.wikipedia.org/wiki/Floating-point_arithmetic
  4. https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html
  5. https://github.com/x448/float16

Documentation

Overview

DO NOT EDIT! Use cmd to regenerate it.

Package float8 implements minifloat (https://en.wikipedia.org/wiki/Minifloat) compatible with IEEE 754 and FP8 E4M3. The number is defined as ±mantissa × 2^exponent.

Index

Constants

const (
	Infinity = 0x7f | mantissaMask
)

Variables

This section is empty.

Functions

func ToFloat32

func ToFloat32(f8 Float8) float32

Convert float8 to float32

Types

type Float8

type Float8 = uint8

Float8 data type

func Add

func Add(a, b Float8) Float8

Add float8(s)

func Div

func Div(a, b Float8) Float8

Divide float8(s)

func Mul

func Mul(a, b Float8) Float8

Multiply float8(s)

func Sub

func Sub(a, b Float8) Float8

Subtract float8(s)

func ToFloat8

func ToFloat8(f32 float32) Float8

Convert float32 to float8

func ToSlice8 added in v0.0.3

func ToSlice8(f32s []float32) (f8s []Float8)

Convert []float32 to []float8. Note: the function is faster than a standard range loop over []float32.

Directories

Path            Synopsis
internal/math8  Package math8 implements canonical operations using float8 type.