jpegcc

v0.0.0-...-edcf836 Latest Latest
Published: Jan 24, 2020 License: MIT Imports: 15 Imported by: 0

README


JPEG Most Prevalent Colors Counter Package and Command Line Tool.

Requirements

  • There is a list of links, each leading to an image.
  • The solution should create a CSV file with the 3 most prevalent colors of each image.
  • The solution should be able to handle input files with more than a billion URLs.
  • The solution should work under limited resources (e.g. 1 CPU, 512MB RAM).
  • There is no limit on the execution time.
  • The provided resources should be utilized as fully as possible at any time during program execution.

Result

Research and Insights
  • Input file: There are duplicated URLs => avoid downloading and processing duplicates. Reuse the result calculated before.
  • Input file: There are broken URLs => avoid repeated download attempts for duplicates of broken URLs.
  • Input file: A link with a .jpeg suffix may in fact refer to a .png file => mark this URL as broken.
  • Input file: All URLs refer to a limited number of hosts (2-3) => avoid being blocked as a DDoS source for opening many simultaneous HTTP connections. Limit HTTP connections per host.
  • Input file: Too short for benchmarking => a bigger file with links is needed. There is a repository (18+ content): https://github.com/EBazarov/nsfw_data_source_urls
  • Limited RAM => reduce garbage generation with zero copy, object pools and escape analysis.
  • Network utilization => support simultaneous HTTP connections.
  • Storage utilization => buffered reading from the input file and buffered result writing.
  • 1 CPU => localize the part of the program that loads the CPU the most.
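
The first two insights (reuse results for duplicates, do not retry broken URLs) can be sketched as a small cache keyed by URL. This is only an illustration, not the package's implementation: `resultCache`, `outcome`, `errBroken` and the `work` callback are hypothetical names, and an in-memory map like this would not hold a billion URLs within 512MB, so treat it as a statement of the idea rather than a sizing recommendation.

```go
package main

import (
	"errors"
	"fmt"
)

// errBroken marks a URL that failed once and should not be retried.
// Hypothetical name, not part of jpegcc.
var errBroken = errors.New("broken url")

// outcome stores what happened the first time a URL was handled.
type outcome struct {
	colors string
	err    error
}

// resultCache remembers outcomes by URL. It is not safe for concurrent use.
type resultCache struct {
	seen map[string]outcome
}

func newResultCache() *resultCache {
	return &resultCache{seen: make(map[string]outcome)}
}

// resolve returns the cached outcome for url and calls work only the
// first time the URL is seen, whether that attempt succeeded or failed.
func (c *resultCache) resolve(url string, work func(string) (string, error)) (string, error) {
	if o, ok := c.seen[url]; ok {
		return o.colors, o.err
	}
	colors, err := work(url)
	c.seen[url] = outcome{colors: colors, err: err}
	return colors, err
}

func main() {
	calls := 0
	// work is a stand-in for "download and count colors".
	work := func(url string) (string, error) {
		calls++
		if url == "http://example.com/broken.jpg" {
			return "", errBroken
		}
		return "#FFFFFF,#000000,#FF0000", nil
	}

	cache := newResultCache()
	urls := []string{
		"http://example.com/a.jpg",
		"http://example.com/broken.jpg",
		"http://example.com/a.jpg",
		"http://example.com/broken.jpg",
	}
	for _, u := range urls {
		colors, err := cache.resolve(u, work)
		fmt.Println(u, colors, err)
	}
	fmt.Println("work calls:", calls) // prints "work calls: 2"
}
```
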
Pipeline Concept

Input >> [Reading from input file] >> [1] >> Nx[Image Downloaders] >> [1] >> Mx[Image Processors] >> [1] >> [Buffered Result Writer] >> File

[1] - channel length

M, N - simultaneous goroutines.

  • Reading from the file uses Scanner, which reads through an internal buffer.
  • N downloaders must be launched. The downloader is based on the fasthttp library, which provides a per-host limit on HTTP connections, pooled buffers for reading the HTTP body and zero allocations. Several downloads run in parallel so that the channel of downloaded images read by the processing stage always has something in it, because the performance of a single image download is not guaranteed.
  • The most CPU-intensive part is processing. With 1 CPU, there is no point in running more than one processing goroutine at a time: two would compete for the CPU cache and reduce overall performance. Still, a command line option allows M>1 processing goroutines.
  • The buffered result writer keeps the image processing goroutine from blocking on file I/O for every single result.
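
The pipeline described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the package's code: `runPipeline`, `fetch` and `process` are hypothetical names, images are plain byte slices, and the final stage collects results in a slice instead of writing a file. Channels have length 1, as in the diagram.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"sync"
)

// runPipeline wires reader -> n downloaders -> m processors -> collector.
// fetch and process are stand-ins for downloading and color counting.
func runPipeline(input string, n, m int, fetch func(string) []byte, process func([]byte) string) []string {
	urls := make(chan string, 1)
	images := make(chan []byte, 1)
	results := make(chan string, 1)

	// Reader: bufio.Scanner buffers the input internally.
	go func() {
		defer close(urls)
		sc := bufio.NewScanner(strings.NewReader(input))
		for sc.Scan() {
			if line := strings.TrimSpace(sc.Text()); line != "" {
				urls <- line
			}
		}
	}()

	// N downloaders share the urls channel.
	var dl sync.WaitGroup
	for i := 0; i < n; i++ {
		dl.Add(1)
		go func() {
			defer dl.Done()
			for u := range urls {
				images <- fetch(u)
			}
		}()
	}
	go func() { dl.Wait(); close(images) }()

	// M processors share the images channel.
	var pr sync.WaitGroup
	for i := 0; i < m; i++ {
		pr.Add(1)
		go func() {
			defer pr.Done()
			for img := range images {
				results <- process(img)
			}
		}()
	}
	go func() { pr.Wait(); close(results) }()

	// Collector: stands in for the buffered result writer.
	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fetch := func(u string) []byte { return []byte(u) }
	process := func(b []byte) string { return fmt.Sprintf("%s,%d", b, len(b)) }
	// Output order is not deterministic when n > 1.
	for _, line := range runPipeline("a\nbb\nccc\n", 8, 1, fetch, process) {
		fmt.Println(line)
	}
}
```

Each stage closes its output channel only after all of its workers have finished, which is what lets the next stage's `range` loop terminate cleanly.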
Installation
go get github.com/regorov/jpegcc
cd ${GOPATH}/src/github.com/regorov/jpegcc/cmd/jpegcc
go build
Usage
jpegcc help
jpegcc help start
export GOMAXPROCS=1

jpegcc start -i ./input.txt --pw 1 --dw 8 -o ./result.csv
Profiling Results
jpegcc --pl ":6001" start -i ./input.txt --pw 1 --dw 8 -o ./result.csv
curl -s http://127.0.0.1:6001/debug/pprof/heap > heap.out
curl -s http://127.0.0.1:6001/debug/pprof/profile > cpu.out
go tool pprof -http=":8086" ./heap.out
  • The Go runtime does not return freed memory to the OS immediately.
Build Your Own Image Processing Tool

There are several interfaces that can be implemented to change the input source, download approach, image processing logic and output destination.


// Resulter is the interface that wraps Result and Header methods.
//
// Result returns string representation of processing result.
//
// Header returns the header if the output format expects one (e.g. CSV file format).
// If the output format does not require a header, the method implementation can
// return an empty string.
type Resulter interface {
	Result() string
	Header() string
}

// Outputer is the interface that wraps the Save and Close methods.
//
// Save receives Resulter to be written to the output.
//
// Close flushes output buffer and closes output.
type Outputer interface {
	Save(Resulter) error
	Close() error
}

// Counter is the interface that wraps the basic Count method.
//
// Count receives an Imager, processes it according to the implementation and
// returns a Resulter, or an error if processing failed.
type Counter interface {
	Count(Imager) (Resulter, error)
}

// Inputer is the interface that wraps the basic Next method.
//
// Next returns a channel of URLs read from the input. The channel closes
// when the input EOF is reached.
type Inputer interface {
	Next() <-chan string
}

// Downloader is the interface that groups methods Download and Next.
//
// Download downloads image addressed by url and returns it wrapped into Imager.
//
// Next returns channel of Imager downloaded and ready to be processed. Channel
// closes when nothing to download.
type Downloader interface {
	Download(ctx context.Context, url string) (Imager, error)
	Next() <-chan Imager
}

// Imager is the interface that groups methods to deal with
// downloaded image.
//
// Bytes returns downloaded image as []byte.
//
// Reset releases []byte of HTTP Body. Do not call Bytes() after
// calling Reset.
//
// URL returns the URL of downloaded image.
type Imager interface {
	Bytes() []byte
	Reset()
	URL() string
}
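
As an illustration of swapping the input source, the sketch below implements the Inputer contract with an in-memory slice instead of a file. `SliceInput` and `NewSliceInput` are hypothetical names that do not exist in the package, and the interface is redeclared locally only so that the example compiles on its own.

```go
package main

import "fmt"

// Inputer is redeclared here with the same method set as shown above,
// so the sketch does not need to import the package.
type Inputer interface {
	Next() <-chan string
}

// SliceInput is a hypothetical Inputer that serves URLs from memory.
type SliceInput struct {
	ch chan string
}

// NewSliceInput starts a goroutine that feeds urls into the channel
// and closes it when the slice is exhausted (the "EOF" of this input).
func NewSliceInput(urls []string) *SliceInput {
	in := &SliceInput{ch: make(chan string, 1)}
	go func() {
		defer close(in.ch)
		for _, u := range urls {
			in.ch <- u
		}
	}()
	return in
}

// Next returns the channel of URLs.
func (in *SliceInput) Next() <-chan string { return in.ch }

func main() {
	var in Inputer = NewSliceInput([]string{
		"http://example.com/a.jpg",
		"http://example.com/b.jpg",
	})
	for u := range in.Next() {
		fmt.Println(u)
	}
}
```

Any type with a matching `Next` method could presumably be handed to a constructor that accepts an Inputer, since Go interfaces are satisfied implicitly.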

Further research

  • Create RAM disk
  • Use RAM disk as shared storage
  • Split the current jpegcc application into an image downloader daemon and image processing apps (similar to FaaS). Because the downloader part of the application does not consume memory and does not generate garbage, it can stay running as a daemon.

Prague 2020

Documentation

Overview

Package jpegcc provides functionality for batch jpeg file processing.

Index

Constants

const (
	// DefaultMaxConnsPerHost defines default value of maximum parallel http connections
	// to the host. To prevent DDoS.
	DefaultMaxConnsPerHost = 32

	// DefaultReadTimeout defines maximum duration for full response reading (including body).
	DefaultReadTimeout = 8 * time.Second
)
const DefaultBufferLen = 10

DefaultBufferLen defines default output buffer length.

Variables

var ErrMediaIsEmpty = errors.New("url referes to the empty file")

ErrMediaIsEmpty is returned when size of downloaded file is equal to zero.

Functions

This section is empty.

Types

type BufferedCSV

type BufferedCSV struct {
	// contains filtered or unexported fields
}

BufferedCSV implements Outputer interface. CSV file with write buffer.

func NewBufferedCSV

func NewBufferedCSV(size int) *BufferedCSV

NewBufferedCSV returns new BufferedCSV instance. If size < 2, DefaultBufferLen (10) will be assigned.

func (*BufferedCSV) Close

func (out *BufferedCSV) Close() error

Close flushes the unsaved buffer to the output file and closes the file.

func (*BufferedCSV) Open

func (out *BufferedCSV) Open(fname string) error

Open creates the file, or opens it for appending if it already exists. The CSV header is written only into an empty file.

func (*BufferedCSV) Save

func (out *BufferedCSV) Save(res Resulter) error

Save adds the Resulter to the buffer and flushes the buffer to the file once the buffer length reaches the limit.
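
The buffer-then-flush behaviour described for Save and Close can be sketched as follows. This is a minimal stand-in under assumptions, not BufferedCSV itself: `lineBuffer`, `newLineBuffer` and `countingWriter` are hypothetical names, and lines are plain strings rather than Resulter values. The fallback to 10 for sizes below 2 mirrors the documented behaviour of NewBufferedCSV.

```go
package main

import (
	"fmt"
	"io"
)

// countingWriter records how many Write calls reach the output.
type countingWriter struct {
	writes int
	data   []byte
}

func (w *countingWriter) Write(p []byte) (int, error) {
	w.writes++
	w.data = append(w.data, p...)
	return len(p), nil
}

// lineBuffer collects lines and writes them out in batches of limit.
type lineBuffer struct {
	w     io.Writer
	lines []string
	limit int
}

func newLineBuffer(w io.Writer, limit int) *lineBuffer {
	if limit < 2 {
		limit = 10 // same fallback as documented for DefaultBufferLen
	}
	return &lineBuffer{w: w, lines: make([]string, 0, limit), limit: limit}
}

// Save appends a line and flushes once the buffer is full.
func (b *lineBuffer) Save(line string) error {
	b.lines = append(b.lines, line)
	if len(b.lines) >= b.limit {
		return b.flush()
	}
	return nil
}

// Close flushes whatever is still buffered.
func (b *lineBuffer) Close() error { return b.flush() }

// flush writes all buffered lines with a single Write call.
func (b *lineBuffer) flush() error {
	if len(b.lines) == 0 {
		return nil
	}
	var batch []byte
	for _, l := range b.lines {
		batch = append(batch, l...)
		batch = append(batch, '\n')
	}
	b.lines = b.lines[:0]
	_, err := b.w.Write(batch)
	return err
}

func main() {
	out := &countingWriter{}
	buf := newLineBuffer(out, 3)
	for i := 1; i <= 7; i++ {
		if err := buf.Save(fmt.Sprintf("line %d", i)); err != nil {
			panic(err)
		}
	}
	if err := buf.Close(); err != nil {
		panic(err)
	}
	fmt.Println("writes:", out.writes) // prints "writes: 3"
}
```

Seven saved lines with a limit of 3 reach the output in three writes (3 + 3 + 1), so the caller pays for I/O once per batch instead of once per result.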

type Counter

type Counter interface {
	Count(Imager) (Resulter, error)
}

Counter is the interface that wraps the basic Count method.

Count receives an Imager, processes it according to the implementation and returns a Resulter, or an error if processing failed.

type CounterDefault

type CounterDefault struct {
}

CounterDefault counts 3 most prevalent colors, getting RGBA color of each pixel one by one.

func NewCounterDefault

func NewCounterDefault() *CounterDefault

NewCounterDefault creates and returns a CounterDefault instance.

func (*CounterDefault) Count

func (cc *CounterDefault) Count(pic Imager) (Resulter, error)

Count implements interface Counter.

type CounterPix

type CounterPix struct {
}

CounterPix counts the 3 most prevalent colors by walking through the array of pixels. It works twice as fast as CounterDefault but consumes more memory.

func NewCounterPix

func NewCounterPix() *CounterPix

NewCounterPix returns CounterPix instance.

func (*CounterPix) Count

func (cc *CounterPix) Count(pic Imager) (Resulter, error)

Count implements interface Counter.
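
To make "most prevalent colors" concrete, here is a minimal pixel-by-pixel sketch using only the standard library. It is not the code of either counter: `topThree` and `sampleImage` are hypothetical names, the histogram is a plain map, and ties are broken by the numerically smaller color, which is an arbitrary choice made for this example.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
	"sort"
)

// topThree returns up to three most frequent colors packed as 0x00RRGGBB.
func topThree(img image.Image) []uint32 {
	counts := make(map[uint32]int)
	b := img.Bounds()
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			// RGBA returns 16-bit channels; keep the high 8 bits of each.
			r, g, bl, _ := img.At(x, y).RGBA()
			counts[(r>>8)<<16|(g>>8)<<8|bl>>8]++
		}
	}
	keys := make([]uint32, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	// Most frequent first; ties go to the smaller color value.
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > 3 {
		keys = keys[:3]
	}
	return keys
}

// sampleImage builds a 4x4 opaque image: 8 red, 5 green, 2 blue, 1 white.
func sampleImage() image.Image {
	img := image.NewRGBA(image.Rect(0, 0, 4, 4))
	for i := 0; i < 16; i++ {
		var c color.RGBA
		switch {
		case i < 8:
			c = color.RGBA{R: 255, A: 255}
		case i < 13:
			c = color.RGBA{G: 255, A: 255}
		case i < 15:
			c = color.RGBA{B: 255, A: 255}
		default:
			c = color.RGBA{R: 255, G: 255, B: 255, A: 255}
		}
		img.Set(i%4, i/4, c)
	}
	return img
}

func main() {
	for _, c := range topThree(sampleImage()) {
		fmt.Printf("#%06X\n", c) // prints #FF0000, #00FF00, #0000FF
	}
}
```

Calling `At` for every pixel goes through an interface and is typically slower than indexing a decoded pixel array directly, which is in line with the speed and memory trade-off stated for the two counters above.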

type Downloader

type Downloader interface {
	Download(ctx context.Context, url string) (Imager, error)
	Next() <-chan Imager
}

Downloader is the interface that groups the Download and Next methods.

Download downloads the image addressed by url and returns it wrapped into an Imager.

Next returns a channel of Imagers that are downloaded and ready to be processed. The channel closes when there is nothing left to download.

type ImageProcessor

type ImageProcessor struct {
	// contains filtered or unexported fields
}

ImageProcessor implements the core orchestration logic. It reads Imagers from the Downloader, invokes the Counter for processing and uses the Outputer to save results. It is suggested to limit the number of parallel processing goroutines to the number of cores.

func NewImageProcessor

func NewImageProcessor(l zerolog.Logger, d Downloader, o Outputer, c Counter) *ImageProcessor

NewImageProcessor returns new instance of ImageProcessor.

func (*ImageProcessor) Start

func (ip *ImageProcessor) Start(ctx context.Context, n int)

Start launches n parallel processing goroutines and waits for their completion.

type Imager

type Imager interface {
	Bytes() []byte
	Reset()
	URL() string
}

Imager is the interface that groups methods to deal with downloaded image.

Bytes returns downloaded image as []byte.

Reset releases []byte of HTTP Body. Do not call Bytes() after calling Reset.

URL returns the URL of downloaded image.

type Inputer

type Inputer interface {
	Next() <-chan string
}

Inputer is the interface that wraps the basic Next method.

Next returns a channel of URLs read from the input. The channel closes when the input EOF is reached.

type Media

type Media struct {
	// contains filtered or unexported fields
}

Media implements the Imager interface and represents a downloaded image stored in memory.

func (*Media) Bytes

func (i *Media) Bytes() []byte

Bytes returns image as []byte.

func (*Media) Reset

func (i *Media) Reset()

Reset implements interface Imager. Releases HTTP Body buffer.

func (*Media) URL

func (i *Media) URL() string

URL returns URL of downloaded image.

type MediaDownloader

type MediaDownloader struct {
	// contains filtered or unexported fields
}

MediaDownloader implements the Downloader interface. It supports limiting connections per host and a configurable number of workers, and uses fasthttp.Client to reduce garbage generation.

func NewMediaDownloader

func NewMediaDownloader(l zerolog.Logger, in Inputer) *MediaDownloader

NewMediaDownloader returns new instance of MediaDownloader, with default read timeout and MaxConnsPerHost (32) parameters.

func (*MediaDownloader) Download

func (id *MediaDownloader) Download(ctx context.Context, url string) (Imager, error)

Download retrieves an image by URL.

func (*MediaDownloader) Next

func (id *MediaDownloader) Next() <-chan Imager

Next returns chan with downloaded Imagers ready to process.

func (*MediaDownloader) SetMaxConnsPerHost

func (id *MediaDownloader) SetMaxConnsPerHost(n int)

SetMaxConnsPerHost sets the maximum number of parallel HTTP connections to the host.

func (*MediaDownloader) SetReadTimeout

func (id *MediaDownloader) SetReadTimeout(d time.Duration)

SetReadTimeout sets the maximum duration for reading the full response (including body).

func (*MediaDownloader) Start

func (id *MediaDownloader) Start(ctx context.Context, n int)

Start launches n parallel image download goroutines.

type Outputer

type Outputer interface {
	Save(Resulter) error
	Close() error
}

Outputer is the interface that wraps the Save and Close methods.

Save receives Resulter to be written to the output.

Close flushes output buffer and closes output.

type PlainTextFileInput

type PlainTextFileInput struct {
	// contains filtered or unexported fields
}

PlainTextFileInput implements interface Inputer and provides jpeg urls from plain text file.

func NewPlainTextFileInput

func NewPlainTextFileInput(l zerolog.Logger) *PlainTextFileInput

NewPlainTextFileInput returns new instance of PlainTextFileInput.

func (*PlainTextFileInput) Next

func (inp *PlainTextFileInput) Next() <-chan string

Next returns chan with urls read from file.

func (*PlainTextFileInput) Start

func (inp *PlainTextFileInput) Start(ctx context.Context, fname string) error

Start opens the input file in read-only mode and starts a runner (a separate goroutine) that reads it line by line into a chan string. It returns an error if the file could not be opened.

type RGB

type RGB uint32

RGB defines structure of Color. Memory representation is 0x00RRGGBB.

func ToRGB

func ToRGB(r, g, b uint32) RGB

ToRGB converts separate R, G, B colors into RGB type.

func (RGB) String

func (rgb RGB) String() string

String returns string representation of color like #RRGGBB.
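
The 0x00RRGGBB layout and the #RRGGBB string form can be illustrated with a few shifts. This sketch makes assumptions the documentation does not confirm: inputs are treated as 8-bit channel values (0 to 255) and the hex digits are printed in upper case. `packedRGB` and `pack` are hypothetical names, not the package's RGB and ToRGB.

```go
package main

import "fmt"

// packedRGB stores a color as 0x00RRGGBB.
type packedRGB uint32

// pack combines three 8-bit channel values into one packedRGB.
// Only the low 8 bits of each argument are used (an assumption of this sketch).
func pack(r, g, b uint32) packedRGB {
	return packedRGB((r&0xFF)<<16 | (g&0xFF)<<8 | b&0xFF)
}

// channels splits the packed value back into its three channels.
func (c packedRGB) channels() (r, g, b uint32) {
	return uint32(c) >> 16 & 0xFF, uint32(c) >> 8 & 0xFF, uint32(c) & 0xFF
}

// String formats the color as #RRGGBB.
func (c packedRGB) String() string {
	return fmt.Sprintf("#%06X", uint32(c))
}

func main() {
	c := pack(255, 128, 0)
	fmt.Println(c) // prints "#FF8000"
	fmt.Println(c.channels())
}
```

Packing a color into a single integer keeps it usable as a compact map key or array index when building a color histogram.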

type Result

type Result struct {
	URL    string
	Colors [3]RGB
}

Result implements interface Resulter. Defines a structure of desired output line.

func (*Result) Header

func (r *Result) Header() string

Header returns header in CSV format.

func (*Result) Result

func (r *Result) Result() string

Result returns a string in CSV format: "URL","color1","color2","color3".
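
A line in the "URL","color1","color2","color3" shape could be produced as sketched below. `row`, `line` and `quote` are hypothetical names, and doubling embedded quotes (the usual CSV convention) is an assumption of this example rather than documented behaviour of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// row is a stand-in for one output record: a URL and three packed colors.
type row struct {
	URL    string
	Colors [3]uint32 // each color stored as 0x00RRGGBB
}

// quote wraps s in double quotes and doubles any embedded quote.
func quote(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// line renders the row as "URL","#RRGGBB","#RRGGBB","#RRGGBB".
func (r row) line() string {
	var sb strings.Builder
	sb.WriteString(quote(r.URL))
	for _, c := range r.Colors {
		sb.WriteByte(',')
		sb.WriteString(quote(fmt.Sprintf("#%06X", c)))
	}
	return sb.String()
}

func main() {
	r := row{
		URL:    "http://example.com/a.jpg",
		Colors: [3]uint32{0xFF0000, 0x00FF00, 0x0000FF},
	}
	fmt.Println(r.line())
	// prints "http://example.com/a.jpg","#FF0000","#00FF00","#0000FF"
}
```
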

type Resulter

type Resulter interface {
	Result() string
	Header() string
}

Resulter is the interface that wraps Result and Header methods.

Result returns string representation of processing result.

Header returns the header if the output format expects one (e.g. CSV file format). If the output format does not require a header, the method implementation can return an empty string.

Directories

Path        Synopsis
cmd/jpegcc  Management Console
