huggingface

package
v0.14.0
Published: Oct 24, 2024 License: Apache-2.0 Imports: 21 Imported by: 0

Documentation ¶

Overview ¶

Package huggingface 🤗 provides functionality to download HuggingFace (HF) models and extract tensors stored in the ".safetensors" format.

Example: Download (only the first time) and enumerate all the tensors from Google's Gemma v2 model:

import (
	"flag"
	"fmt"
	"path"

	"github.com/gomlx/gomlx/ml/data"
	hfd "github.com/gomlx/gomlx/ml/data/huggingface"
	"github.com/janpfeifer/must"
)

var (
	hfModelId = "google/gemma-2-2b-it"
	hfToken = "..."  // Create a read-only token for yourself on the HuggingFace site.
	flagDataDir = flag.String("data", "~/work/gemma", "Directory to cache downloaded and generated dataset files.")
)

func HuggingFaceDir() string {
	dataDir := data.ReplaceTildeInDir(*flagDataDir)
	return path.Join(dataDir, "huggingface")
}

func main() {
	flag.Parse()
	hfm := must.M1(hfd.New(hfModelId, hfToken, HuggingFaceDir()))
	for e, err := range hfm.EnumerateTensors() {
		must.M(err)
		fmt.Printf("\t%s -> %s\n", e.Name, e.Tensor.Shape())
	}
}

Constants ¶

const InfoFile = "_info_.json"

InfoFile is the name of the file that holds the information about the model. The info about a model is fetched once and cached in this file, to avoid going to the network again.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type FileInfo ¶

type FileInfo struct {
	Name string `json:"rfilename"`
}

FileInfo represents one of the model's files, as listed in the Info structure.

type FileNameAndPath ¶

type FileNameAndPath struct {
	Name, Path string
}

FileNameAndPath holds the name and local path of one of a model's files. Name is the name stored in the Info "Siblings" field, and Path is the file's path in the local storage.

type Info ¶

type Info struct {
	ID          string          `json:"id"`
	ModelID     string          `json:"model_id"`
	Author      string          `json:"author"`
	SHA         string          `json:"sha"`
	Tags        []string        `json:"tags"`
	Siblings    []*FileInfo     `json:"siblings"`
	SafeTensors SafeTensorsInfo `json:"safetensors"`
}

Info holds information about a HuggingFace model. It is the JSON served at the URL https://huggingface.co/api/models/<model_id>
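As a rough illustration of how such JSON maps onto the structures on this page, the sketch below decodes a small hand-written payload with encoding/json. The struct definitions are copied from this page. The payload values (SHA, tags, file names, parameter counts) and the lowercase "total" and "parameters" keys are assumptions made up for the example, not real API output, and parseInfo is a hypothetical helper that is not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Types copied from this page.
type FileInfo struct {
	Name string `json:"rfilename"`
}

// SafeTensorsInfo has no JSON tags, so encoding/json matches its keys
// against the field names case-insensitively.
type SafeTensorsInfo struct {
	Total      int
	Parameters map[string]int
}

type Info struct {
	ID          string          `json:"id"`
	ModelID     string          `json:"model_id"`
	Author      string          `json:"author"`
	SHA         string          `json:"sha"`
	Tags        []string        `json:"tags"`
	Siblings    []*FileInfo     `json:"siblings"`
	SafeTensors SafeTensorsInfo `json:"safetensors"`
}

// sampleJSON is a hand-written payload for illustration only; all values are made up.
const sampleJSON = `{
  "id": "google/gemma-2-2b-it",
  "author": "google",
  "sha": "0000000",
  "tags": ["safetensors"],
  "siblings": [
    {"rfilename": "config.json"},
    {"rfilename": "model-00001-of-00002.safetensors"}
  ],
  "safetensors": {"parameters": {"BF16": 1000}, "total": 1000}
}`

// parseInfo is a hypothetical helper: it decodes the JSON bytes into an Info.
func parseInfo(data []byte) (*Info, error) {
	info := &Info{}
	if err := json.Unmarshal(data, info); err != nil {
		return nil, err
	}
	return info, nil
}

func main() {
	info, err := parseInfo([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	fmt.Println(info.ID, len(info.Siblings), info.SafeTensors.Total)
	for _, f := range info.Siblings {
		fmt.Println("\t", f.Name)
	}
}
```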

type Model ¶

type Model struct {
	// ID may include owner/model. E.g.: google/gemma-2-2b-it
	ID string

	// AuthToken is the HuggingFace authentication token to be used when downloading the files.
	AuthToken string

	// BaseDir is where the local copy of the model is stored.
	BaseDir string

	// Verbosity: 0 for quiet operation; 1 for information about progress; 2 and higher for debugging.
	Verbosity int

	// MaxParallelDownload indicates how many files to download at the same time. Default is 20.
	// If set to <= 0 it will download all files in parallel.
	// Set to 1 to make downloads sequential.
	MaxParallelDownload int

	// Info downloaded from model.
	// It is only available after DownloadInfo is called.
	Info *Info
}
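
The MaxParallelDownload semantics described in the struct comments (a positive limit, or no limit when the value is <= 0) can be pictured with a buffered channel used as a semaphore. This is only a sketch of the general pattern, not the package's actual download code; runLimited and peakConcurrency are hypothetical names, and the tasks here just sleep instead of downloading anything.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited runs numTasks tasks with at most maxParallel running at once.
// A maxParallel <= 0 means "no limit": all tasks may run in parallel.
func runLimited(numTasks, maxParallel int, task func(i int)) {
	if numTasks <= 0 {
		return
	}
	if maxParallel <= 0 || maxParallel > numTasks {
		maxParallel = numTasks
	}
	sem := make(chan struct{}, maxParallel) // One slot per allowed concurrent task.
	var wg sync.WaitGroup
	for i := 0; i < numTasks; i++ {
		sem <- struct{}{} // Blocks while maxParallel tasks are already running.
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // Free the slot when the task finishes.
			task(i)
		}(i)
	}
	wg.Wait()
}

// peakConcurrency reports the highest number of tasks observed running at once.
func peakConcurrency(numTasks, maxParallel int) int {
	var mu sync.Mutex
	running, peak := 0, 0
	runLimited(numTasks, maxParallel, func(int) {
		mu.Lock()
		running++
		if running > peak {
			peak = running
		}
		mu.Unlock()
		time.Sleep(time.Millisecond) // Stand-in for a download.
		mu.Lock()
		running--
		mu.Unlock()
	})
	return peak
}

func main() {
	for _, limit := range []int{1, 3, 0} {
		fmt.Printf("limit=%d peak=%d\n", limit, peakConcurrency(8, limit))
	}
}
```

With a limit of 1 the tasks run strictly one after another, which matches the "set to 1 to make downloads sequential" note above.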

func New ¶

func New(id string, authToken, baseDir string) (*Model, error)

New creates a reference to a HuggingFace model given its id.

The id typically includes owner/model. E.g.: "google/gemma-2-2b-it"

The authToken can be created on the HuggingFace site, in the profile settings page. A "read-only" token will do for most models. Leave it empty if not using one (but some models can't be downloaded without it).

The baseDir is suffixed with the model's id (after converting "/" to "_"). So the same baseDir can be used to hold different models.
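
To make the directory rule concrete, here is a minimal sketch of the mapping described above. modelDir is a hypothetical helper written for this example, and joining with path.Join is an assumption; the package may build the path differently.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// modelDir is a hypothetical helper: it appends the model id to baseDir,
// after converting "/" to "_", so each model gets its own subdirectory.
func modelDir(baseDir, id string) string {
	return path.Join(baseDir, strings.ReplaceAll(id, "/", "_"))
}

func main() {
	// Two models sharing one baseDir end up in separate subdirectories.
	fmt.Println(modelDir("/data/huggingface", "google/gemma-2-2b-it"))
	fmt.Println(modelDir("/data/huggingface", "example-owner/other-model"))
}
```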

func (*Model) Download ¶

func (hfm *Model) Download() error

Download first downloads the info about the model, which includes the list of files associated with the model.

It then downloads any files that are not yet available locally. Each file is downloaded to a path with a ".downloading" suffix and moved to its final destination once the download finishes.
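
The ".downloading" suffix follows a common write-then-rename pattern: a partially written file never appears under its final name. The sketch below shows that pattern with the standard library only. saveViaTemp and demo are hypothetical names, the data comes from an in-memory reader rather than the network, and the package's real implementation may differ.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// saveViaTemp writes src to dst+".downloading" and renames it to dst only
// after the whole content was written and the file was closed.
func saveViaTemp(dst string, src io.Reader) error {
	tmp := dst + ".downloading"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, src); err != nil {
		f.Close()
		os.Remove(tmp) // Don't leave a partial file behind.
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	return os.Rename(tmp, dst)
}

// demo saves a small file in a temporary directory and reports its content
// and whether the ".downloading" file is still around afterwards.
func demo() (string, bool, error) {
	dir, err := os.MkdirTemp("", "hf-download-sketch")
	if err != nil {
		return "", false, err
	}
	defer os.RemoveAll(dir)
	dst := filepath.Join(dir, "config.json")
	if err := saveViaTemp(dst, strings.NewReader("{}")); err != nil {
		return "", false, err
	}
	data, err := os.ReadFile(dst)
	if err != nil {
		return "", false, err
	}
	_, statErr := os.Stat(dst + ".downloading")
	return string(data), statErr == nil, nil
}

func main() {
	content, tmpLeft, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("content=%q temporary file left=%v\n", content, tmpLeft)
}
```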

func (*Model) DownloadInfo ¶

func (hfm *Model) DownloadInfo() error

DownloadInfo downloads the Info structure about the model, or reads it from disk if it is already cached locally. It sets Model.Info with the resulting information if successful.
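
The "fetch once, then read from disk" behavior can be sketched as follows. This is an illustration of the caching idea only: loadCached and demo are hypothetical names, the fetch function is a stand-in that returns made-up JSON instead of hitting the network, and only the "_info_.json" file name comes from this page (see InfoFile).

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// loadCached returns the content of cachePath if the file exists.
// Otherwise it calls fetch, stores the result in cachePath and returns it.
func loadCached(cachePath string, fetch func() ([]byte, error)) ([]byte, error) {
	data, err := os.ReadFile(cachePath)
	if err == nil {
		return data, nil // Cache hit: no fetch needed.
	}
	if !errors.Is(err, os.ErrNotExist) {
		return nil, err
	}
	data, err = fetch()
	if err != nil {
		return nil, err
	}
	if err := os.WriteFile(cachePath, data, 0o644); err != nil {
		return nil, err
	}
	return data, nil
}

// demo loads the same info twice and reports how often fetch was called.
func demo() (int, string, error) {
	dir, err := os.MkdirTemp("", "hf-info-sketch")
	if err != nil {
		return 0, "", err
	}
	defer os.RemoveAll(dir)
	cachePath := filepath.Join(dir, "_info_.json")
	fetches := 0
	fetch := func() ([]byte, error) {
		fetches++
		return []byte(`{"id":"example/model"}`), nil // Made-up payload.
	}
	var data []byte
	for i := 0; i < 2; i++ {
		if data, err = loadCached(cachePath, fetch); err != nil {
			return fetches, "", err
		}
	}
	return fetches, string(data), nil
}

func main() {
	fetches, content, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(fetches, content)
}
```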

func (*Model) EnumerateFileNames ¶

func (hfm *Model) EnumerateFileNames() iter.Seq2[FileNameAndPath, error]

EnumerateFileNames loads the model info and lists the file names stored for the model. It doesn't download the files; it only lists their relative names and their local storage paths.

See Model.Download to actually download the files.

func (*Model) EnumerateTensors ¶

func (hfm *Model) EnumerateTensors() iter.Seq2[*NamedTensor, error]

EnumerateTensors returns an iterator over all the tensors stored in ".safetensors" files, already converted to GoMLX *tensors.Tensor, with their associated names.

It calls Download first, to make sure the files are already there.
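
For background on what is being read, the publicly documented ".safetensors" layout starts with an 8-byte little-endian header length, followed by a JSON header that maps tensor names to their dtype, shape and data offsets, followed by the raw tensor bytes. The sketch below parses only that header from an in-memory example. It is not how this package is implemented; readHeader, tensorEntry, exampleFile and parseExample are hypothetical names, and the example file content is made up.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// tensorEntry describes one tensor in the JSON header.
type tensorEntry struct {
	DType       string   `json:"dtype"`
	Shape       []int    `json:"shape"`
	DataOffsets [2]int64 `json:"data_offsets"`
}

// readHeader parses the length-prefixed JSON header of a safetensors stream.
func readHeader(r io.Reader) (map[string]tensorEntry, error) {
	var size uint64
	if err := binary.Read(r, binary.LittleEndian, &size); err != nil {
		return nil, err
	}
	if size > 100<<20 { // Arbitrary sanity limit chosen for this sketch.
		return nil, errors.New("header too large")
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(buf, &raw); err != nil {
		return nil, err
	}
	entries := make(map[string]tensorEntry, len(raw))
	for name, msg := range raw {
		if name == "__metadata__" { // Optional free-form metadata, not a tensor.
			continue
		}
		var e tensorEntry
		if err := json.Unmarshal(msg, &e); err != nil {
			return nil, err
		}
		entries[name] = e
	}
	return entries, nil
}

// exampleFile builds a tiny made-up file with one 2x2 F32 tensor of zeros.
func exampleFile() []byte {
	header := []byte(`{"__metadata__":{"format":"pt"},"w":{"dtype":"F32","shape":[2,2],"data_offsets":[0,16]}}`)
	var buf bytes.Buffer
	_ = binary.Write(&buf, binary.LittleEndian, uint64(len(header)))
	buf.Write(header)
	buf.Write(make([]byte, 16)) // 4 float32 values.
	return buf.Bytes()
}

func parseExample() (map[string]tensorEntry, error) {
	return readHeader(bytes.NewReader(exampleFile()))
}

func main() {
	entries, err := parseExample()
	if err != nil {
		panic(err)
	}
	for name, e := range entries {
		fmt.Printf("%s: dtype=%s shape=%v offsets=%v\n", name, e.DType, e.Shape, e.DataOffsets)
	}
}
```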

type NamedTensor ¶

type NamedTensor struct {
	Name   string
	Tensor *tensors.Tensor
}

NamedTensor represents a tensor and its name in a ".safetensors" file.

type SafeTensorsInfo ¶

type SafeTensorsInfo struct {
	Total int

	// Parameters: maps dtype name to int.
	Parameters map[string]int
}

SafeTensorsInfo holds the total number of parameters of the model and the number of parameters per dtype.
