baremetal

package
v0.1.6
Published: Mar 3, 2026 License: BSD-3-Clause Imports: 30 Imported by: 0

README

Bare Metal

Bare metal is a simple compute job manager for bare metal machines. The manager runs on a local "client" (e.g., a laptop) and uses the goal language ssh facilities to connect to the servers and perform all management through that connection, so that nothing needs to be installed on the server.

Jobs are submitted through an RPC connection (e.g., from simrun), typically by a program that is also running on the local client. The job itself consists of a gzipped tar file ("job blob") containing an executable script (chmod +x) that is run on a server, along with relevant metadata.

There is no attempt to prioritize jobs: it is just FIFO. The main function is just to manage a queue of jobs so that the compute resources are not overloaded, along with basic job monitoring for completion, canceling, etc.

Each job consumes one GPU, as a key simplification to minimize resource management complexity.
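
The FIFO policy with one GPU per job can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the job type and the startPending function are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a queued job (not the package's Job type).
type job struct {
	id  int
	gpu int // assigned GPU index, or -1 while pending
}

// startPending assigns free GPUs to pending jobs strictly in submission
// order (FIFO), one GPU per job. It returns the jobs that were started
// and the jobs that are still waiting.
func startPending(pending []*job, freeGPUs []int) (started, waiting []*job) {
	n := len(pending)
	if len(freeGPUs) < n {
		n = len(freeGPUs)
	}
	for i := 0; i < n; i++ {
		pending[i].gpu = freeGPUs[i]
		started = append(started, pending[i])
	}
	return started, pending[n:]
}

func main() {
	pending := []*job{{id: 1, gpu: -1}, {id: 2, gpu: -1}, {id: 3, gpu: -1}}
	started, waiting := startPending(pending, []int{0, 1})
	fmt.Println(len(started), "started,", len(waiting), "waiting")
}
```

With three pending jobs and two free GPUs, the first two jobs start and the third waits until a GPU is freed.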

State and config files

The baremetal program itself uses a config.toml file for configuration, and saves a record of all jobs and active state in a state.json file, both of which are saved in the "app data" location for the app (e.g., ~/Library/BareMetal on mac). The state file allows the baremetal program to be fully restartable.
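
A restartable state file can be sketched as a JSON round trip. The state type below is a hypothetical, reduced stand-in and does not reflect the actual layout of state.json; it only illustrates that fields tagged json:"-" (as BareMetal.Servers is in the type documentation) are left out of the saved state.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// state is a hypothetical, reduced stand-in for the saved manager state.
type state struct {
	Servers []string `json:"-"` // not saved, mirroring the json:"-" tag on BareMetal.Servers
	NextID  int
	Active  []int // IDs of pending or running jobs
	Done    []int // IDs of finished jobs
}

// saveState encodes the state as indented JSON.
func saveState(st *state) ([]byte, error) {
	return json.MarshalIndent(st, "", "\t")
}

// loadState decodes a previously saved state.
func loadState(data []byte) (*state, error) {
	st := &state{}
	if err := json.Unmarshal(data, st); err != nil {
		return nil, err
	}
	return st, nil
}

func main() {
	st := &state{Servers: []string{"gpu1"}, NextID: 4, Active: []int{3}, Done: []int{1, 2}}
	data, err := saveState(st)
	if err != nil {
		fmt.Println("save failed:", err)
		return
	}
	restored, err := loadState(data)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Println(restored.NextID, restored.Active, restored.Done, len(restored.Servers))
}
```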

Environment variables for running job

  • BARE_GPU = the allocated GPU number (0..N-1)
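
A program launched by the job script could read the assigned GPU along these lines. This is a sketch, assuming the value is a plain non-negative decimal integer; the gpuFromEnv helper is a hypothetical name, not part of the package.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// gpuFromEnv parses a BARE_GPU value and returns the GPU index.
// It returns an error if the value is empty, non-numeric, or negative.
func gpuFromEnv(val string) (int, error) {
	if val == "" {
		return -1, fmt.Errorf("BARE_GPU is not set")
	}
	gpu, err := strconv.Atoi(val)
	if err != nil || gpu < 0 {
		return -1, fmt.Errorf("invalid BARE_GPU value %q", val)
	}
	return gpu, nil
}

func main() {
	gpu, err := gpuFromEnv(os.Getenv("BARE_GPU"))
	if err != nil {
		fmt.Println("no GPU assigned:", err)
		return
	}
	fmt.Println("using GPU", gpu)
}
```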

job.* files

  • job.out contains all the output from running the job script.
  • job.pid has the process ID (pid) of the job.
  • job.files.tar.gz has the files submitted for the job.
  • job.results.tar.gz has the results from the job.

Configuring a new "bare metal" linux compute server

sudo apt install gcc libgl1-mesa-dev libegl1-mesa-dev mesa-vulkan-drivers xorg-dev vulkan-tools nvidia-driver-550-server nvidia-utils-550-server

Note: the nvidia-* packages are critical for vulkaninfo to see the GPUs. No DISPLAY variable or other additional setup should be necessary. 550 is the stable branch (recommended) while 565 is a more advanced "new features" branch.

You typically have to install Go manually from https://go.dev/doc/install to get an up-to-date version.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func AllFiles

func AllFiles(dir string, exclude ...string) ([]string, error)

AllFiles returns all file names within the given directory, including subdirectories, excluding those matching the given glob expressions. File names are relative to dir, and do not include the full path.
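
The behavior described for AllFiles can be approximated with the standard library. The sketch below is not the package's code: allFiles and demo are hypothetical names, and it assumes that each glob is matched against the base name of a file, which the documentation does not specify.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// allFiles walks dir recursively and returns file names relative to dir,
// skipping files whose base name matches any of the exclude globs.
func allFiles(dir string, exclude ...string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		for _, pat := range exclude {
			match, err := filepath.Match(pat, d.Name())
			if err != nil {
				return err
			}
			if match {
				return nil
			}
		}
		rel, err := filepath.Rel(dir, path)
		if err != nil {
			return err
		}
		files = append(files, filepath.ToSlash(rel))
		return nil
	})
	return files, err
}

// demo builds a small temporary tree and lists it, excluding *.log files.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "allfiles")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	if err := os.MkdirAll(filepath.Join(dir, "sub"), 0o755); err != nil {
		return nil, err
	}
	for _, name := range []string{"a.txt", "sub/b.txt", "sub/c.log"} {
		path := filepath.Join(dir, filepath.FromSlash(name))
		if err := os.WriteFile(path, []byte("x"), 0o644); err != nil {
			return nil, err
		}
	}
	return allFiles(dir, "*.log")
}

func main() {
	files, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(files) // prints [a.txt sub/b.txt]
}
```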

func JobIDsFromPB

func JobIDsFromPB(ids []int64) []int

JobIDsFromPB returns job id numbers from int64 in pb.JobIDList

func JobIDsToPB

func JobIDsToPB(ids []int) []int64

JobIDsToPB returns job id numbers as int64 for pb.JobIDList
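
Protocol buffers have no platform-sized int type, so job IDs are carried as int64 and converted at the boundary. A minimal sketch of such a conversion pair, using the hypothetical names idsToPB and idsFromPB rather than the package's own functions:

```go
package main

import "fmt"

// idsToPB widens job IDs to int64 for a protobuf message.
func idsToPB(ids []int) []int64 {
	out := make([]int64, len(ids))
	for i, id := range ids {
		out[i] = int64(id)
	}
	return out
}

// idsFromPB converts int64 IDs from a protobuf message back to int.
func idsFromPB(ids []int64) []int {
	out := make([]int, len(ids))
	for i, id := range ids {
		out[i] = int(id)
	}
	return out
}

func main() {
	pb := idsToPB([]int{1, 2, 3})
	fmt.Println(pb, idsFromPB(pb))
}
```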

func JobToPB

func JobToPB(job *Job) *pb.Job

JobToPB returns the protobuf version of given Job.

func JobsToPB

func JobsToPB(jobs []*Job) *pb.JobList

JobsToPB returns the protobuf version of given Jobs list.

func TarFiles

func TarFiles(w io.Writer, dir string, gz bool, files ...string) error

TarFiles writes a tar file to given writer, from given source directory. Tar file names are as listed here so it will unpack directly to those files. If gz is true, then tar is gzipped.
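
Writing such an archive with the standard library might look like the following. This is a sketch under stated assumptions, not the package's TarFiles: tarFiles and demo are hypothetical names, only regular files are handled, and the archive is read back in demo purely to show that entry names are stored as listed.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// tarFiles writes the named files (relative to dir) to w as a tar archive,
// gzipped if gz is true.
func tarFiles(w io.Writer, dir string, gz bool, files ...string) error {
	var gw *gzip.Writer
	if gz {
		gw = gzip.NewWriter(w)
		w = gw
	}
	tw := tar.NewWriter(w)
	for _, name := range files {
		path := filepath.Join(dir, filepath.FromSlash(name))
		info, err := os.Stat(path)
		if err != nil {
			return err
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		hdr := &tar.Header{
			Typeflag: tar.TypeReg,
			Name:     filepath.ToSlash(name), // stored as listed, relative to dir
			Mode:     int64(info.Mode().Perm()),
			Size:     int64(len(data)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(data); err != nil {
			return err
		}
	}
	// Close the tar writer before the gzip writer so all data is flushed.
	if err := tw.Close(); err != nil {
		return err
	}
	if gw != nil {
		return gw.Close()
	}
	return nil
}

// demo tars two temporary files (gzipped) and returns the entry names read back.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "tarfiles")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	names := []string{"job.sh", "data.txt"}
	for _, name := range names {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("# "+name+"\n"), 0o755); err != nil {
			return nil, err
		}
	}
	var buf bytes.Buffer
	if err := tarFiles(&buf, dir, true, names...); err != nil {
		return nil, err
	}
	gr, err := gzip.NewReader(&buf)
	if err != nil {
		return nil, err
	}
	defer gr.Close()
	tr := tar.NewReader(gr)
	var got []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return got, nil
		}
		if err != nil {
			return nil, err
		}
		got = append(got, hdr.Name)
	}
}

func main() {
	names, err := demo()
	fmt.Println(names, err)
}
```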

func Untar

func Untar(r io.Reader, dir string, gz bool) error

Untar extracts a tar file from the given reader into the given destination directory. If gz is true, then the tar is gzipped.
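
Extraction can be sketched in the same spirit. The untar function below is a hypothetical stand-in, not the package's Untar: it handles regular files only and rejects entry names that would escape the target directory, a precaution the documentation does not mention. The demo helper builds a small gzipped archive in memory and extracts it.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// untar extracts regular files from the tar stream r into dir.
// If gz is true, the stream is gunzipped first.
func untar(r io.Reader, dir string, gz bool) error {
	if gz {
		gr, err := gzip.NewReader(r)
		if err != nil {
			return err
		}
		defer gr.Close()
		r = gr
	}
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // this sketch only handles regular files
		}
		name := filepath.FromSlash(hdr.Name)
		if !filepath.IsLocal(name) {
			return fmt.Errorf("unsafe path in archive: %q", hdr.Name)
		}
		path := filepath.Join(dir, name)
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return err
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return err
		}
		if err := os.WriteFile(path, data, os.FileMode(hdr.Mode).Perm()); err != nil {
			return err
		}
	}
}

// demo builds a one-file gzipped archive in memory, extracts it into a
// temporary directory, and returns the extracted file's contents.
func demo() (string, error) {
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gw)
	body := []byte("epoch\tloss\n")
	hdr := &tar.Header{Typeflag: tar.TypeReg, Name: "results/out.tsv", Mode: 0o644, Size: int64(len(body))}
	if err := tw.WriteHeader(hdr); err != nil {
		return "", err
	}
	if _, err := tw.Write(body); err != nil {
		return "", err
	}
	if err := tw.Close(); err != nil {
		return "", err
	}
	if err := gw.Close(); err != nil {
		return "", err
	}
	dir, err := os.MkdirTemp("", "untar")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := untar(&buf, dir, true); err != nil {
		return "", err
	}
	got, err := os.ReadFile(filepath.Join(dir, "results", "out.tsv"))
	return string(got), err
}

func main() {
	got, err := demo()
	fmt.Printf("%q %v\n", got, err)
}
```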

Types

type BareMetal

type BareMetal struct {

	// Servers is the ordered list of server machines.
	Servers Servers `json:"-"`

	// NextID is the next job ID to assign.
	NextID int `edit:"-"`

	// Active has all the active (pending, running) jobs being managed,
	// in the order submitted.
	// The unique key is the bare metal job ID (int).
	Active Jobs

	// Done has all the completed jobs that have been run.
	// This list can be purged by time as needed.
	// The unique key is the bare metal job ID (int).
	Done Jobs

	// Lock for responding to inputs.
	// everything below top-level input processing is assumed to be locked.
	sync.Mutex `json:"-" toml:"-"`
	// contains filtered or unexported fields
}

BareMetal is the overall bare metal job manager.

func NewBareMetal

func NewBareMetal() *BareMetal

func (*BareMetal) CancelJobs

func (bm *BareMetal) CancelJobs(ids ...int) error

CancelJobs cancels list of job IDs. Returns error for jobs not found.

func (*BareMetal) FetchResults

func (bm *BareMetal) FetchResults(resultsGlob string, ids ...int) ([]*Job, error)

FetchResults gets job results back from server for given job id(s). Results are available as job.Results as a compressed tar file.

func (*BareMetal) Init

func (bm *BareMetal) Init()

Init does the full initialization of the baremetal server.

func (*BareMetal) Interactive

func (bm *BareMetal) Interactive()

Interactive runs the interpreter in interactive mode.

func (*BareMetal) JobStatus

func (bm *BareMetal) JobStatus(ids ...int) []*Job

JobStatus gets current job data for given job id(s). An empty list returns all of the currently Active jobs.

func (*BareMetal) OpenLog

func (bm *BareMetal) OpenLog(filename string) error

OpenLog opens a log file for recording actions.

func (*BareMetal) RecoverJob added in v0.1.3

func (bm *BareMetal) RecoverJob(job *Job) (*Job, error)

RecoverJob reinstates job information so files can be recovered etc.

func (*BareMetal) Server

func (bm *BareMetal) Server(name string) (*Server, error)

Server provides error-wrapped access to Servers by name.

func (*BareMetal) StartBGUpdates

func (bm *BareMetal) StartBGUpdates()

StartBGUpdates starts a ticker to update job status periodically.
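
A periodic background update of this kind is commonly built on time.Ticker. The sketch below shows the general pattern only; startBGUpdates here is a hypothetical free function with a context for shutdown and a callback, which are assumptions and not the signature of the method above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startBGUpdates calls update every interval in a background goroutine
// until ctx is canceled. The returned channel is closed when the
// goroutine has exited.
func startBGUpdates(ctx context.Context, interval time.Duration, update func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				update()
			}
		}
	}()
	return done
}

// demo waits for n updates at a 1ms interval, then stops the ticker
// and returns how many updates it observed.
func demo(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan struct{})
	done := startBGUpdates(ctx, time.Millisecond, func() {
		select {
		case ticks <- struct{}{}:
		case <-ctx.Done():
		}
	})
	count := 0
	for count < n {
		<-ticks
		count++
	}
	cancel()
	<-done
	return count
}

func main() {
	fmt.Println("updates observed:", demo(3))
}
```

In a real manager the update callback would take the manager's lock before touching shared job state.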

func (*BareMetal) Submit

func (bm *BareMetal) Submit(src, path, script, results string, files []byte) *Job

Submit adds a new Active job with given parameters.

func (*BareMetal) UpdateJobs

func (bm *BareMetal) UpdateJobs() (nrun, nfinished int, err error)

UpdateJobs runs any pending jobs if there are available GPUs to run on. Returns the number of jobs started and, per the signature, the number finished, along with any errors incurred in starting jobs.

type Client

type Client struct {
	// The server address including port number.
	Host string `default:"localhost:8585"`

	Timeout time.Duration
	// contains filtered or unexported fields
}

func NewClient

func NewClient() *Client

func (*Client) CancelJobs

func (cl *Client) CancelJobs(ids ...int) error

CancelJobs cancels list of job IDs. Returns error for jobs not found.

func (*Client) Connect

func (cl *Client) Connect() error

Connect connects to the server

func (*Client) FetchResults

func (cl *Client) FetchResults(resultsGlob string, ids ...int) ([]*Job, error)

FetchResults gets job results back from server for given job id(s). Results are available as job.Results as a compressed tar file.

func (*Client) JobStatus

func (cl *Client) JobStatus(ids ...int) ([]*Job, error)

JobStatus gets current job data back from server for given job id(s).

func (*Client) RecoverJob added in v0.1.3

func (cl *Client) RecoverJob(job *Job) (*Job, error)

RecoverJob recovers a job which has been lost somehow. It just adds the given job to the job table.

func (*Client) Submit

func (cl *Client) Submit(source, path, script, resultsGlob string, files []byte) (*Job, error)

Submit adds a new Active job with given parameters.

func (*Client) UpdateJobs

func (cl *Client) UpdateJobs()

UpdateJobs pings the server to run its updates. This happens automatically every 10 seconds, but this is for the impatient.

type Job

type Job struct {
	// ID is the overall baremetal unique ID number.
	ID int

	// Status is the current status of the job.
	Status Status

	// Source is info about the source of the job, e.g., simrun sim project.
	Source string

	// Path is the path from the SSH home directory to launch the job in.
	// This path will be created on the server when the job is run.
	Path string

	// Script is name of the job script to run, which must be at the top level
	// within the given tar file.
	Script string

	// Files is the gzipped tar file of the job files set at submission.
	Files []byte `display:"-"`

	// ResultsGlob is a glob expression for the result files to get back
	// from the server (e.g., *.tsv). job.out is automatically included as well,
	// which has the job stdout, stderr output.
	ResultsGlob string `display:"-"`

	// Results is the gzipped tar file of the job result files, gathered
	// at completion or when queried for results.
	Results []byte `display:"-"`

	// Submit is the time submitted.
	Submit time.Time

	// Start is the time actually started.
	Start time.Time

	// End is the time stopped running.
	End time.Time

	// ServerName is the name of the server it is running / ran on. Empty for pending.
	ServerName string

	// ServerGPU is the logical index of the GPU assigned to this job (0..N-1).
	ServerGPU int

	// PID is the process ID of the job script.
	PID int
}

Job is one bare metal job.

func JobFromPB

func JobFromPB(job *pb.Job) *Job

JobFromPB returns a Job based on the protobuf version.

func JobsFromPB

func JobsFromPB(pjs *pb.JobList) []*Job

JobsFromPB returns Jobs from the protobuf version of given Jobs list.

type Jobs

type Jobs = keylist.List[int, *Job]

Jobs is the ordered list of jobs, in order submitted.

type Server

type Server struct {
	// Name is the alias used for gossh.
	Name string

	// SSH is string to gossh to.
	SSH string

	// NGPUs is the number of GPUs on this server.
	NGPUs int

	// Used is a map of GPUs currently being used.
	Used map[int]bool `edit:"-" toml:"-"`
}

Server specifies a bare metal Server.

func (*Server) Avail

func (sv *Server) Avail() int

Avail returns the number of GPUs available on this server.

func (*Server) FreeGPU

func (sv *Server) FreeGPU(gpu int)

FreeGPU makes the given GPU number available.

func (*Server) ID

func (sv *Server) ID() string

ID returns the server SSH ID string: @Name

func (*Server) NextGPU

func (sv *Server) NextGPU() int

NextGPU returns the next GPU index available, and adds it to the Used list. Returns -1 if none available.
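
The GPU bookkeeping described for NextGPU and FreeGPU can be sketched with a map of used indexes. The server type below is a hypothetical stand-in that borrows the NGPUs and Used field names; the lowest-free-index policy and the avail helper that counts free GPUs are assumptions made for illustration.

```go
package main

import "fmt"

// server is a hypothetical stand-in for the package's Server type.
type server struct {
	NGPUs int
	Used  map[int]bool
}

// avail returns the number of GPUs not currently marked as used.
func (sv *server) avail() int {
	n := 0
	for i := 0; i < sv.NGPUs; i++ {
		if !sv.Used[i] {
			n++
		}
	}
	return n
}

// nextGPU returns the lowest free GPU index and marks it used,
// or -1 if every GPU is in use.
func (sv *server) nextGPU() int {
	for i := 0; i < sv.NGPUs; i++ {
		if !sv.Used[i] {
			if sv.Used == nil {
				sv.Used = make(map[int]bool)
			}
			sv.Used[i] = true
			return i
		}
	}
	return -1
}

// freeGPU makes the given GPU index available again.
func (sv *server) freeGPU(gpu int) {
	delete(sv.Used, gpu)
}

func main() {
	sv := &server{NGPUs: 2}
	fmt.Println(sv.nextGPU(), sv.nextGPU(), sv.nextGPU()) // prints 0 1 -1
	sv.freeGPU(0)
	fmt.Println(sv.avail(), sv.nextGPU()) // prints 1 0
}
```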

func (*Server) OpenSSH

func (sv *Server) OpenSSH()

OpenSSH opens the SSH connection for this server.

func (*Server) Use

func (sv *Server) Use()

Use makes this the active server.

type ServerAvail

type ServerAvail struct {
	Name  string
	Avail int
}

ServerAvail is used to report the number of available gpus per server.

type Servers

type Servers = keylist.List[string, *Server]

Servers is the ordered list of servers, in order of use preference.

type Status

type Status int32 //enums:enum

Status are the job status values.

const (
	// NoStatus is the unknown status state.
	NoStatus Status = iota

	// Pending means the job has been submitted, but not yet run.
	Pending

	// Running means the job is running.
	Running

	// Completed means the job finished on its own, with no error status.
	Completed

	// Canceled means the job was canceled by the user.
	Canceled

	// Errored means the job quit with an error.
	Errored
)
const StatusN Status = 6

StatusN is the highest valid value for type Status, plus one.
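
The relationship between the Status constants, StatusN, and the string conversion methods can be illustrated with a hand-written enum. This is a sketch only: the real methods are generated (note the enums:enum directive), and the assumption here is that each value's string form is simply its constant name.

```go
package main

import "fmt"

// status is a hypothetical hand-written analogue of the Status enum.
type status int32

const (
	noStatus status = iota
	pending
	running
	completed
	canceled
	errored
	statusN // one past the highest valid value
)

// statusNames assumes the string form of each value is its constant name.
var statusNames = [statusN]string{"NoStatus", "Pending", "Running", "Completed", "Canceled", "Errored"}

// String returns the name of the status, or a numeric fallback if out of range.
func (s status) String() string {
	if s < 0 || s >= statusN {
		return fmt.Sprintf("status(%d)", int32(s))
	}
	return statusNames[s]
}

// SetString sets the status from its name and returns an error if the
// name is not recognized.
func (s *status) SetString(str string) error {
	for i, name := range statusNames {
		if name == str {
			*s = status(i)
			return nil
		}
	}
	return fmt.Errorf("%q is not a valid status", str)
}

func main() {
	var s status
	if err := s.SetString("Running"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(s, int32(s), int32(statusN))
}
```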

func StatusValues

func StatusValues() []Status

StatusValues returns all possible values for the type Status.

func (Status) Desc

func (i Status) Desc() string

Desc returns the description of the Status value.

func (Status) Int64

func (i Status) Int64() int64

Int64 returns the Status value as an int64.

func (Status) MarshalText

func (i Status) MarshalText() ([]byte, error)

MarshalText implements the encoding.TextMarshaler interface.

func (*Status) SetInt64

func (i *Status) SetInt64(in int64)

SetInt64 sets the Status value from an int64.

func (*Status) SetString

func (i *Status) SetString(s string) error

SetString sets the Status value from its string representation, and returns an error if the string is invalid.

func (Status) String

func (i Status) String() string

String returns the string representation of this Status value.

func (*Status) UnmarshalText

func (i *Status) UnmarshalText(text []byte) error

UnmarshalText implements the encoding.TextUnmarshaler interface.

func (Status) Values

func (i Status) Values() []enums.Enum

Values returns all possible values for the type Status.

Directories

Path Synopsis
cmd/baremetal (command)
