gpandas

v0.1.0
Published: Jan 5, 2026 License: Apache-2.0 Imports: 16 Imported by: 1

README

GPandas

GPandas is a high-performance data manipulation and analysis library written in Go, drawing inspiration from Python's popular pandas library. It provides efficient and easy-to-use data structures, primarily the DataFrame, to handle structured data in Go applications.

Project Structure

  • benchmark/: Contains benchmark scripts for performance evaluation against Python's pandas.
  • dataframe/: Houses the core DataFrame implementation.
  • gpandas.go: Serves as the primary entry point for the GPandas library. It provides high-level API functions for DataFrame creation and data loading.
  • gpandas_sql.go: Extends GPandas to interact with SQL databases and Google BigQuery.
  • tests/: Contains unit tests to ensure the correctness and robustness of GPandas. It mirrors the project's directory structure for easy navigation.
  • utils/collection/: Contains generic collection utilities:
    • set.go: Implements a generic Set data structure in Go, providing common set operations like Add, Has, Union, Intersect, Difference, and Compare. This Set is used internally within GPandas for efficient data handling.
    • series.go: Implements a concurrency-safe Series type that enforces homogeneous data types within columns. Each Series maintains a dtype and provides methods like At(), Set(), Append(), and Len() for efficient columnar data access.

Code Functionality

GPandas is designed to provide a familiar and efficient way to work with tabular data in Go. Key functionalities include:

Core DataFrame Operations
  • DataFrame Creation: Construct columnar DataFrames from in-memory data using GoPandas.DataFrame(), or load from external sources like CSV files using GoPandas.Read_csv(). Each DataFrame uses a map[string]*Series structure for efficient columnar access.
  • Column Manipulation:
    • Renaming: Easily rename columns using DataFrame.Rename() while preserving column order.
  • Data Merging: Combine DataFrames based on common columns with DataFrame.Merge(), supporting:
    • Inner Join (InnerMerge): Keep only matching rows from both DataFrames.
    • Left Join (LeftMerge): Keep all rows from the left DataFrame, and matching rows from the right.
    • Right Join (RightMerge): Keep all rows from the right DataFrame, and matching rows from the left.
    • Full Outer Join (FullMerge): Keep all rows from both DataFrames, filling in missing values with nil.
  • Data Export:
    • CSV Export: Export DataFrames to CSV format using DataFrame.ToCSV(), with options for:
      • Custom separators.
      • Writing to a file path or returning a CSV string.
  • Data Display:
    • Pretty Printing: Generate formatted, human-readable table representations of DataFrames using DataFrame.String().
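The four join types listed above differ only in which unmatched rows they keep. The following is a minimal, standard-library sketch of the inner and left cases, not the GPandas implementation: the Row type, mergeRows, and copyRow are hypothetical names, and rows are modeled as maps rather than as columnar Series. Key values are assumed to be hashable (for example ints or strings).

```go
package main

import "fmt"

// Row is a hypothetical row representation used only for this sketch.
type Row map[string]any

// mergeRows joins left and right on the column named key.
// how is "inner" or "left"; for "left", unmatched left rows are kept
// and the right-only columns are filled with nil.
func mergeRows(left, right []Row, key string, rightCols []string, how string) []Row {
	// Index the right rows by key value for constant-time lookup.
	index := make(map[any][]Row)
	for _, r := range right {
		index[r[key]] = append(index[r[key]], r)
	}

	var out []Row
	for _, l := range left {
		matches := index[l[key]]
		if len(matches) == 0 {
			if how == "left" {
				row := copyRow(l)
				for _, c := range rightCols {
					row[c] = nil
				}
				out = append(out, row)
			}
			continue
		}
		for _, m := range matches {
			row := copyRow(l)
			for _, c := range rightCols {
				row[c] = m[c]
			}
			out = append(out, row)
		}
	}
	return out
}

// copyRow returns a shallow copy so the inputs are never mutated.
func copyRow(r Row) Row {
	c := make(Row, len(r))
	for k, v := range r {
		c[k] = v
	}
	return c
}

func main() {
	left := []Row{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
	right := []Row{{"id": 1, "city": "Boston"}}

	fmt.Println(len(mergeRows(left, right, "id", []string{"city"}, "inner"))) // 1
	fmt.Println(len(mergeRows(left, right, "id", []string{"city"}, "left")))  // 2
}
```

A right join is the same procedure with the two inputs swapped, and a full outer join additionally appends the right rows that were never matched.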

Indexing and Selection

GPandas provides pandas-like indexing capabilities for intuitive data access:

  • Column Selection
  • Label-based Indexing (Loc)
  • Position-based Indexing (iLoc)
  • Index Management

Data Loading from External Sources
  • CSV Reading: Efficiently read CSV files into DataFrames with GoPandas.Read_csv(), leveraging concurrent processing for performance.
  • SQL Database Integration:
    • Read_sql(): Query and load data from SQL databases (SQL Server, PostgreSQL, and others supported by Go's database/sql package) into DataFrames.
  • Google BigQuery Support:
    • From_gbq(): Query and load data from Google BigQuery tables into DataFrames, enabling analysis of large datasets stored in BigQuery.

Data Types

GPandas provides strong type support through its columnar architecture:

  • Series: The fundamental column type that enforces homogeneous data types within each column. Each Series maintains a dtype and provides type-safe access methods.
  • FloatCol: For float64 columns (legacy type for DataFrame construction).
  • StringCol: For string columns (legacy type for DataFrame construction).
  • IntCol: For int64 columns (legacy type for DataFrame construction).
  • BoolCol: For bool columns (legacy type for DataFrame construction).
  • Column: Generic column type to hold any type values when specific type constraints are not needed.
  • TypeColumn[T comparable]: Generic column type for columns of any comparable type T.

GPandas ensures type safety through Series-level dtype enforcement, preventing type mismatches and ensuring data integrity across all operations.

Performance Features

GPandas is built with performance in mind, incorporating several features for efficiency:

  • Columnar Storage: Uses a columnar DataFrame structure (map[string]*Series) for efficient column-wise operations and memory layout, similar to modern analytical databases.
  • Concurrent CSV Reading: Utilizes worker pools and buffered channels for parallel CSV parsing, significantly speeding up CSV loading, especially for large files.
  • Efficient Data Structures: Uses Go's native data structures and generics to minimize overhead and maximize performance.
  • Series-level Thread Safety: Provides thread-safe operations at the Series level using RWMutex, ensuring data consistency in concurrent environments while allowing concurrent reads.
  • Optimized Memory Management: Designed for efficient memory usage with columnar storage to handle large datasets effectively.
  • Buffered Channels: Employs buffered channels for data processing pipelines to improve throughput and reduce blocking.
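The worker-pool and buffered-channel pattern mentioned above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the library's CSV reader: parseLines, job, and result are hypothetical names, and lines are split naively on a separator without handling quoted fields.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// job carries one raw line together with its position in the input.
type job struct {
	idx  int
	line string
}

// result carries the parsed fields for the line at idx.
type result struct {
	idx    int
	fields []string
}

// parseLines splits each line on sep using a fixed pool of workers.
// Results are stored by index, so the original row order is preserved.
func parseLines(lines []string, sep string, workers int) [][]string {
	// Both channels are buffered to the input size, so neither the
	// producer nor the workers block while waiting for a reader.
	jobs := make(chan job, len(lines))
	results := make(chan result, len(lines))

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- result{idx: j.idx, fields: strings.Split(j.line, sep)}
			}
		}()
	}

	for i, l := range lines {
		jobs <- job{idx: i, line: l}
	}
	close(jobs)
	wg.Wait()
	close(results)

	out := make([][]string, len(lines))
	for r := range results {
		out[r.idx] = r.fields
	}
	return out
}

func main() {
	rows := parseLines([]string{"a,1", "b,2", "c,3"}, ",", 2)
	fmt.Println(rows) // [[a 1] [b 2] [c 3]]
}
```

Carrying the row index through the pipeline is what lets the workers finish in any order without scrambling the output.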

Getting Started

Prerequisites

GPandas requires Go version 1.18 or above due to its use of generics.

Installation

Install GPandas using go get:

go get github.com/apoplexi24/gpandas

Core Components

DataFrame

The central data structure in GPandas, the DataFrame, is designed for handling two-dimensional, labeled data using a columnar architecture. It consists of a map[string]*Series for column storage and a ColumnOrder []string for maintaining column sequence. The type provides methods for data manipulation, analysis, and I/O operations, similar to pandas DataFrames in Python, with a design aimed at better performance.
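Because Go maps have no defined iteration order, a separate order slice is what keeps columns in a stable sequence. The sketch below shows that idea and an order-preserving rename. It assumes a simplified Frame type with plain []any columns; Frame and its Rename method are hypothetical stand-ins, not the library's DataFrame.

```go
package main

import (
	"errors"
	"fmt"
)

// Frame is a simplified, hypothetical stand-in for a columnar DataFrame:
// a map for column storage plus a slice that fixes the column sequence.
type Frame struct {
	Columns     map[string][]any
	ColumnOrder []string
}

// Rename changes a column's name in place and keeps its position.
func (f *Frame) Rename(oldName, newName string) error {
	col, ok := f.Columns[oldName]
	if !ok {
		return errors.New("column not found: " + oldName)
	}
	if _, exists := f.Columns[newName]; exists {
		return errors.New("column already exists: " + newName)
	}
	delete(f.Columns, oldName)
	f.Columns[newName] = col
	for i, name := range f.ColumnOrder {
		if name == oldName {
			f.ColumnOrder[i] = newName
		}
	}
	return nil
}

func main() {
	f := &Frame{
		Columns:     map[string][]any{"id": {1, 2}, "nm": {"a", "b"}},
		ColumnOrder: []string{"id", "nm"},
	}
	if err := f.Rename("nm", "name"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(f.ColumnOrder) // [id name]
}
```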

Series

The file utils/collection/series.go provides a concurrency-safe Series type that serves as the fundamental building block for DataFrame columns. Each Series enforces homogeneous data types and provides efficient access methods like At(), Set(), Append(), and Len().
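To make the combination of dtype enforcement and read/write locking concrete, here is a minimal sketch of a column type with the four method names mentioned above. The method signatures, the NewSeries constructor, and the use of reflect.Type for the dtype are assumptions made for illustration; the library's actual Series may differ.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// Series is a simplified, hypothetical stand-in for a typed column.
type Series struct {
	mu    sync.RWMutex
	dtype reflect.Type
	data  []any
}

// NewSeries fixes the dtype from a sample value, e.g. NewSeries(int64(0)).
func NewSeries(sample any) *Series {
	return &Series{dtype: reflect.TypeOf(sample)}
}

// check rejects values of a different dynamic type; nil is allowed as a null.
func (s *Series) check(v any) error {
	if v != nil && reflect.TypeOf(v) != s.dtype {
		return fmt.Errorf("dtype mismatch: want %v, got %T", s.dtype, v)
	}
	return nil
}

// Append adds a value at the end under the write lock.
func (s *Series) Append(v any) error {
	if err := s.check(v); err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data = append(s.data, v)
	return nil
}

// Set overwrites the value at index i under the write lock.
func (s *Series) Set(i int, v any) error {
	if err := s.check(v); err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if i < 0 || i >= len(s.data) {
		return fmt.Errorf("index %d out of range", i)
	}
	s.data[i] = v
	return nil
}

// At reads the value at index i; many readers may hold the read lock at once.
func (s *Series) At(i int) (any, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if i < 0 || i >= len(s.data) {
		return nil, fmt.Errorf("index %d out of range", i)
	}
	return s.data[i], nil
}

// Len returns the number of stored values.
func (s *Series) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.data)
}

func main() {
	s := NewSeries(int64(0))
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int64) {
			defer wg.Done()
			_ = s.Append(n)
		}(int64(i))
	}
	wg.Wait()
	fmt.Println(s.Len())                 // 100
	fmt.Println(s.Append("oops") != nil) // true
}
```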

Set

The file utils/collection/set.go provides a generic Set implementation, useful for various set operations. While not directly exposed as a primary user-facing component, it's an important utility within GPandas for efficient data management and algorithm implementations.
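For readers unfamiliar with generic sets in Go, the following sketch shows one common way to build such a type on top of a map. It reuses the operation names Add, Has, Union, Intersect, and Difference from the project description, but the signatures and the NewSet constructor are assumptions, not the library's API.

```go
package main

import "fmt"

// Set is a hypothetical generic set backed by a map with empty-struct values.
type Set[T comparable] map[T]struct{}

// NewSet builds a set from the given items.
func NewSet[T comparable](items ...T) Set[T] {
	s := make(Set[T], len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

// Add inserts v into the set.
func (s Set[T]) Add(v T) { s[v] = struct{}{} }

// Has reports whether v is in the set.
func (s Set[T]) Has(v T) bool {
	_, ok := s[v]
	return ok
}

// Union returns the elements present in s or o.
func (s Set[T]) Union(o Set[T]) Set[T] {
	out := NewSet[T]()
	for v := range s {
		out.Add(v)
	}
	for v := range o {
		out.Add(v)
	}
	return out
}

// Intersect returns the elements present in both s and o.
func (s Set[T]) Intersect(o Set[T]) Set[T] {
	out := NewSet[T]()
	for v := range s {
		if o.Has(v) {
			out.Add(v)
		}
	}
	return out
}

// Difference returns the elements of s that are not in o.
func (s Set[T]) Difference(o Set[T]) Set[T] {
	out := NewSet[T]()
	for v := range s {
		if !o.Has(v) {
			out.Add(v)
		}
	}
	return out
}

func main() {
	a := NewSet("id", "name", "age")
	b := NewSet("id", "city")
	fmt.Println(len(a.Union(b)), len(a.Intersect(b)), len(a.Difference(b))) // 4 1 2
}
```

Set intersection of column names is, for instance, how the common columns of two tables can be found before an inner join.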

Performance

GPandas is engineered for performance through:

  • Columnar Architecture: The map[string]*Series structure enables efficient column-wise operations and better memory locality, similar to modern analytical databases.
  • Generics: Leveraging Go generics to avoid runtime type assertions and interface overhead, leading to faster execution.
  • Efficient Memory Usage: Designed to minimize memory allocations and copies with columnar storage for better performance when dealing with large datasets.
  • Concurrency: Utilizing Go's concurrency features, such as goroutines and channels, to parallelize operations like CSV reading and potentially other data processing tasks in the future.
  • Series-level Optimization: Each Series maintains its own type information and provides optimized access patterns for columnar data.
  • Zero-copy Operations: Aiming for zero-copy operations wherever feasible to reduce overhead and improve speed.

Development Setup
  1. Clone the repository:
    git clone https://github.com/apoplexi24/gpandas.git
    cd gpandas
    
  2. Install dependencies:
    go mod download
    

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Acknowledgments

  • Inspired by Python's pandas library, aiming to bring similar data manipulation capabilities to the Go ecosystem.
  • Built using Go's powerful generic system for type safety and performance.
  • Thanks to the Go community for valuable feedback and contributions.

Status

GPandas is under active development and is suitable for production use. However, it's still evolving, with ongoing efforts to add more features, enhance performance, and improve API ergonomics. Expect continued updates and improvements.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Concat

func Concat(objs []*dataframe.DataFrame, opts ...ConcatOptions) (*dataframe.DataFrame, error)

Concat concatenates DataFrames along a particular axis.

This function mirrors the behavior of pandas.concat for ease of switching from Python to Go for data scientists.

Parameters:

  • objs: A slice of DataFrame pointers to concatenate. Nil DataFrames are skipped.
  • opts: Optional ConcatOptions. If not provided, defaults are used.

Returns:

  • A new DataFrame containing the concatenated data, or an error if the operation fails.

Example:

df1 := &dataframe.DataFrame{Columns: map[string]Series{"A": ...}, ColumnOrder: []string{"A"}}
df2 := &dataframe.DataFrame{Columns: map[string]Series{"A": ...}, ColumnOrder: []string{"A"}}
result, err := gpandas.Concat([]*dataframe.DataFrame{df1, df2})
// result contains all rows from df1 followed by rows from df2

// With options (plain assignment, because result and err are already declared above):
result, err = gpandas.Concat([]*dataframe.DataFrame{df1, df2}, gpandas.ConcatOptions{Axis: gpandas.AxisColumns, Join: gpandas.JoinInner})
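The difference between the outer and inner join modes for row-wise concatenation can be sketched without the library. The code below assumes a simplified Frame type with []any columns and unique column names per frame; Frame and concatRows are hypothetical names, and the logic is an illustration of the documented semantics rather than the actual Concat implementation.

```go
package main

import "fmt"

// Frame is a simplified, hypothetical stand-in for a columnar DataFrame.
type Frame struct {
	Columns map[string][]any
	Order   []string
}

// rows returns the row count, taken from the first column.
func (f *Frame) rows() int {
	if len(f.Order) == 0 {
		return 0
	}
	return len(f.Columns[f.Order[0]])
}

// concatRows stacks frames vertically. With inner set, only columns present
// in every frame are kept; otherwise all columns are kept and gaps become nil.
// Nil frames are skipped.
func concatRows(frames []*Frame, inner bool) *Frame {
	count := map[string]int{} // how many frames contain each column
	var order []string        // columns in first-seen order
	n := 0                    // number of non-nil frames
	for _, f := range frames {
		if f == nil {
			continue
		}
		n++
		for _, c := range f.Order {
			if count[c] == 0 {
				order = append(order, c)
			}
			count[c]++
		}
	}

	out := &Frame{Columns: map[string][]any{}}
	for _, c := range order {
		if inner && count[c] != n {
			continue
		}
		out.Order = append(out.Order, c)
		out.Columns[c] = []any{}
	}

	for _, f := range frames {
		if f == nil {
			continue
		}
		for _, c := range out.Order {
			if col, ok := f.Columns[c]; ok {
				out.Columns[c] = append(out.Columns[c], col...)
			} else {
				// Column missing in this frame: pad with nil values.
				out.Columns[c] = append(out.Columns[c], make([]any, f.rows())...)
			}
		}
	}
	return out
}

func main() {
	a := &Frame{Columns: map[string][]any{"A": {1, 2}, "B": {"x", "y"}}, Order: []string{"A", "B"}}
	b := &Frame{Columns: map[string][]any{"A": {3}}, Order: []string{"A"}}

	fmt.Println(concatRows([]*Frame{a, b}, false).Columns["B"]) // [x y <nil>]
	fmt.Println(concatRows([]*Frame{a, b}, true).Order)         // [A]
}
```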

func FloatColumn

func FloatColumn(col []any) ([]float64, error)
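FloatColumn is undocumented, but its signature suggests a conversion from a loosely typed column to float64 values. The sketch below shows what a conversion of that shape might look like. The function name toFloat64s and the set of accepted input types are assumptions; the library's actual behavior may differ.

```go
package main

import "fmt"

// toFloat64s is a hypothetical conversion with the same shape as FloatColumn:
// it accepts common numeric types and fails on anything else.
func toFloat64s(col []any) ([]float64, error) {
	out := make([]float64, len(col))
	for i, v := range col {
		switch x := v.(type) {
		case float64:
			out[i] = x
		case float32:
			out[i] = float64(x)
		case int:
			out[i] = float64(x)
		case int64:
			out[i] = float64(x)
		default:
			return nil, fmt.Errorf("row %d: cannot convert %T to float64", i, v)
		}
	}
	return out, nil
}

func main() {
	vals, err := toFloat64s([]any{1.5, 2, int64(3)})
	fmt.Println(vals, err) // [1.5 2 3] <nil>

	_, err = toFloat64s([]any{"not a number"})
	fmt.Println(err != nil) // true
}
```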

func NewDataFrameFromSeries

func NewDataFrameFromSeries(columns map[string]collection.Series, columnOrder []string) (*dataframe.DataFrame, error)

NewDataFrameFromSeries creates a DataFrame from a map of Series.

Parameters:

columns: A map of column names to Series
columnOrder: Optional slice specifying column order (uses map order if nil)

Returns:

A pointer to a DataFrame, or an error if validation fails

func NewEmptyDataFrame

func NewEmptyDataFrame(columns []string, columnTypes map[string]reflect.Type) *dataframe.DataFrame

NewEmptyDataFrame creates an empty DataFrame with specified column names and types.

Parameters:

columns: A slice of column names
columnTypes: A map of column names to their types (uses reflect.Type)

Returns:

A pointer to an empty DataFrame with the specified structure

Types

type BoolCol

type BoolCol []bool

BoolCol represents a slice of bool values.

type Column

type Column []any

Column represents a slice of any type.

type ConcatAxis

type ConcatAxis int

ConcatAxis specifies the axis along which to concatenate.

const (
	// AxisIndex (0) concatenates along rows (stacking DataFrames vertically).
	AxisIndex ConcatAxis = 0
	// AxisColumns (1) concatenates along columns (joining DataFrames horizontally).
	AxisColumns ConcatAxis = 1
)

type ConcatJoin

type ConcatJoin string

ConcatJoin specifies how to handle indexes on the non-concatenation axis.

const (
	// JoinOuter takes the union of indexes (all columns/rows, with nulls for missing).
	JoinOuter ConcatJoin = "outer"
	// JoinInner takes the intersection of indexes (only common columns/rows).
	JoinInner ConcatJoin = "inner"
)

type ConcatOptions

type ConcatOptions struct {
	// Axis is the axis to concatenate along. Default: AxisIndex (0).
	Axis ConcatAxis

	// Join determines how to handle indexes on other axis. Default: JoinOuter.
	Join ConcatJoin

	// IgnoreIndex if true, do not use the index values along the concatenation axis.
	// The resulting axis will be labeled 0, 1, ..., n-1. Default: false.
	IgnoreIndex bool

	// VerifyIntegrity if true, check whether the new concatenated axis contains duplicates.
	// This can be expensive. Default: false.
	VerifyIntegrity bool

	// Sort if true, sort non-concatenation axis if it is not already aligned. Default: false.
	Sort bool
}

ConcatOptions configures the behavior of the Concat function.

func DefaultConcatOptions

func DefaultConcatOptions() ConcatOptions

DefaultConcatOptions returns the default options for Concat.

type DbConfig

type DbConfig struct {
	Database_server string
	Server          string
	Port            string
	Database        string
	Username        string
	Password        string
}

DbConfig stores the database connection settings.

NOTE: Prefer using env vars instead of hardcoding values
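One way to follow the note above is to populate the struct from environment variables at startup. The sketch below redeclares DbConfig locally so that it compiles on its own; the GPANDAS_DB_* variable names and the configFrom helper are hypothetical and not part of the library.

```go
package main

import (
	"fmt"
	"os"
)

// DbConfig mirrors the fields of the documented struct for this sketch.
type DbConfig struct {
	Database_server string
	Server          string
	Port            string
	Database        string
	Username        string
	Password        string
}

// configFrom fills a DbConfig using lookup, which has the same shape as
// os.LookupEnv. It fails on the first missing or empty variable.
func configFrom(lookup func(string) (string, bool)) (DbConfig, error) {
	var cfg DbConfig
	fields := []struct {
		key string
		dst *string
	}{
		{"GPANDAS_DB_TYPE", &cfg.Database_server},
		{"GPANDAS_DB_HOST", &cfg.Server},
		{"GPANDAS_DB_PORT", &cfg.Port},
		{"GPANDAS_DB_NAME", &cfg.Database},
		{"GPANDAS_DB_USER", &cfg.Username},
		{"GPANDAS_DB_PASSWORD", &cfg.Password},
	}
	for _, f := range fields {
		v, ok := lookup(f.key)
		if !ok || v == "" {
			return DbConfig{}, fmt.Errorf("missing environment variable %s", f.key)
		}
		*f.dst = v
	}
	return cfg, nil
}

func main() {
	cfg, err := configFrom(os.LookupEnv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	// Never print the password; log only non-secret fields.
	fmt.Println("connecting to", cfg.Server, "database", cfg.Database)
}
```

Passing the lookup function as a parameter keeps the helper testable without touching the real process environment.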

type FloatCol

type FloatCol []float64

FloatCol represents a slice of float64 values.

type GoPandas

type GoPandas struct{}

func (GoPandas) DataFrame

func (GoPandas) DataFrame(columns []string, data []Column, columns_types map[string]any) (*dataframe.DataFrame, error)

DataFrame creates a new DataFrame from the provided columns, data, and column types.

It validates the input parameters to ensure data consistency and proper type definitions.

The function performs several validation checks:

  • Ensures the columns_types map is provided
  • Verifies at least one column name is present
  • Checks that data is not empty
  • Confirms the number of columns matches the data columns
  • Validates all columns have the same length
  • Ensures type definitions exist for all columns

The data is then converted to the internal DataFrame format, creating typed Series based on the specified column types (FloatCol, IntCol, StringCol, BoolCol). Null values (nil) are properly tracked using the boolean mask approach.

Parameters:

columns: A slice of strings representing column names
data: A slice of Columns containing the actual data
columns_types: A map defining the expected type for each column

Returns:

A pointer to a DataFrame containing the processed data, or an error if validation fails
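The "boolean mask approach" to nulls mentioned above usually means storing typed values in one slice and a parallel slice of flags marking which rows are missing. The sketch below illustrates that layout for an int64 column. Int64Column and buildInt64Column are hypothetical names, and the library's internal representation may differ.

```go
package main

import "fmt"

// Int64Column is a hypothetical typed column with a parallel null mask.
type Int64Column struct {
	Values []int64
	Null   []bool // Null[i] is true when row i has no value
}

// buildInt64Column converts loosely typed input into a typed column,
// recording nil entries in the mask instead of storing a sentinel value.
func buildInt64Column(raw []any) (*Int64Column, error) {
	col := &Int64Column{
		Values: make([]int64, len(raw)),
		Null:   make([]bool, len(raw)),
	}
	for i, v := range raw {
		switch x := v.(type) {
		case nil:
			col.Null[i] = true
		case int64:
			col.Values[i] = x
		case int:
			col.Values[i] = int64(x)
		default:
			return nil, fmt.Errorf("row %d: expected int64, got %T", i, v)
		}
	}
	return col, nil
}

// Sum adds up the non-null values.
func (c *Int64Column) Sum() int64 {
	var total int64
	for i, v := range c.Values {
		if !c.Null[i] {
			total += v
		}
	}
	return total
}

func main() {
	col, err := buildInt64Column([]any{int64(1), nil, 3})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(col.Null)  // [false true false]
	fmt.Println(col.Sum()) // 4
}
```

Keeping the mask separate lets the values slice stay a plain []int64, so a legitimate zero is never confused with a missing value.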

func (GoPandas) From_gbq

func (GoPandas) From_gbq(query string, projectID string) (*dataframe.DataFrame, error)

From_gbq executes a BigQuery SQL query and returns the results as a DataFrame.

Parameters:

query: The BigQuery SQL query string to execute.
projectID: The Google Cloud Project ID where the BigQuery dataset resides.

Returns:

  • A pointer to a DataFrame containing the query results.
  • An error if the query execution fails or if there are issues with the BigQuery client.

The DataFrame's structure will match the query results:

  • Columns will be named according to the SELECT statement
  • Data types will be converted from BigQuery types to Go types
  • NULL values are properly tracked using the boolean mask approach

Examples:

gp := gpandas.GoPandas{}
query := `SELECT name, age, city
          FROM dataset.users
          WHERE age > 25`
df, err := gp.From_gbq(query, "my-project-id")
// Result DataFrame:
// name    | age | city
// Alice   | 30  | New York
// Bob     | 35  | Chicago
// Charlie | 28  | Boston

Note: Requires appropriate Google Cloud credentials to be configured in the environment.

func (GoPandas) Read_csv

func (GoPandas) Read_csv(filepath string) (*dataframe.DataFrame, error)

Read_csv reads a CSV file from the specified filepath and converts it into a DataFrame.

It opens the CSV file, reads the header to determine the column names, and then reads all the records.

The function checks for errors during file operations and ensures that the CSV file is not empty.

It initializes data columns based on the number of headers and populates them with the corresponding values from the records.

If the number of columns in any row is inconsistent with the header, that row is skipped.

All values are stored as strings in StringSeries with proper null handling.

Parameters:

filepath: A string representing the path to the CSV file to be read.

Returns:

A pointer to a DataFrame containing the data from the CSV file, or an error if the operation fails.
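The documented behavior (read the header, reject empty input, skip rows whose field count differs from the header, keep everything as strings) can be reproduced with encoding/csv. The sketch below is an illustration of those steps, not the library's code: readColumns is a hypothetical helper that takes the CSV text directly instead of a file path and returns plain string slices instead of a DataFrame.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strings"
)

// readColumns parses CSV text into string columns keyed by header name.
// Rows whose field count differs from the header are skipped.
func readColumns(data string) ([]string, map[string][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	// A negative value disables the reader's own field-count check,
	// so ragged rows reach the loop below instead of causing an error.
	r.FieldsPerRecord = -1

	records, err := r.ReadAll()
	if err != nil {
		return nil, nil, err
	}
	if len(records) == 0 {
		return nil, nil, errors.New("CSV input is empty")
	}

	header := records[0]
	cols := make(map[string][]string, len(header))
	for _, h := range header {
		cols[h] = []string{}
	}
	for _, rec := range records[1:] {
		if len(rec) != len(header) {
			continue // inconsistent row: skip it
		}
		for i, h := range header {
			cols[h] = append(cols[h], rec[i])
		}
	}
	return header, cols, nil
}

func main() {
	header, cols, err := readColumns("name,age\nAlice,30\nBob\nCarol,41\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, h := range header {
		fmt.Println(h, cols[h])
	}
	// name [Alice Carol]
	// age [30 41]
}
```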

func (GoPandas) Read_csv_typed

func (gp GoPandas) Read_csv_typed(filepath string, columnTypes map[string]any) (*dataframe.DataFrame, error)

Read_csv_typed reads a CSV file and creates typed Series based on the provided column types.

This is similar to Read_csv but allows specifying column types for automatic type conversion. Empty strings in the CSV are treated as null values for non-string types.

Parameters:

filepath: A string representing the path to the CSV file to be read.
columnTypes: A map defining the expected type for each column (FloatCol, IntCol, StringCol, BoolCol)

Returns:

A pointer to a DataFrame containing the typed data from the CSV file, or an error if the operation fails.
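The rule that empty strings become nulls for non-string types can be sketched for a float64 column as follows. parseFloatColumn is a hypothetical helper that returns the values together with a null mask; it shows the conversion rule only and is not the library's implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseFloatColumn converts raw CSV cells to float64 values.
// An empty cell is recorded as null; any other unparsable cell is an error.
func parseFloatColumn(raw []string) ([]float64, []bool, error) {
	values := make([]float64, len(raw))
	null := make([]bool, len(raw))
	for i, s := range raw {
		if s == "" {
			null[i] = true
			continue
		}
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, nil, fmt.Errorf("row %d: %w", i, err)
		}
		values[i] = v
	}
	return values, null, nil
}

func main() {
	values, null, err := parseFloatColumn([]string{"1.5", "", "3"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(values) // [1.5 0 3]
	fmt.Println(null)   // [false true false]
}
```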

func (GoPandas) Read_sql

func (GoPandas) Read_sql(query string, db_config DbConfig) (*dataframe.DataFrame, error)

Read_sql executes a SQL query against a database and returns the results as a DataFrame.

Parameters:

query: The SQL query string to execute.
db_config: A DbConfig struct containing database connection parameters:
  - Database_server: Type of database ("sqlserver" or other)
  - Server: Database server hostname or IP
  - Port: Database server port
  - Database: Database name
  - Username: Database user
  - Password: Database password

Returns:

  • A pointer to a DataFrame containing the query results.
  • An error if the database connection, query execution, or data processing fails.

The DataFrame's structure will match the query results:

  • Columns will be named according to the SELECT statement
  • Data types will be inferred from the database types
  • NULL values are properly tracked using the boolean mask approach

Examples:

gp := gpandas.GoPandas{}
// Field names are exported and must match the DbConfig struct definition.
config := gpandas.DbConfig{
    Database_server: "sqlserver",
    Server:          "localhost",
    Port:            "1433",
    Database:        "mydb",
    Username:        "user",
    Password:        "pass",
}
query := `SELECT employee_id, name, department
          FROM employees
          WHERE department = 'Sales'`
df, err := gp.Read_sql(query, config)
// Result DataFrame:
// employee_id | name  | department
// 1          | John  | Sales
// 2          | Alice | Sales
// 3          | Bob   | Sales

type IntCol

type IntCol []int64

IntCol represents a slice of int64 values.

type StringCol

type StringCol []string

StringCol represents a slice of string values.

type TypeColumn

type TypeColumn[T comparable] []T

TypeColumn represents a slice of a comparable type T.

Directories

  • examples/
    • basic (command)
    • concat (command)
    • groupby (command)
    • indexing (command)
    • merge (command)
  • utils/
