DCGM Go Testing Samples
This directory contains test versions of all the DCGM samples, reimplemented using the Go testing framework. These tests demonstrate the functionality of the NVIDIA Data Center GPU Manager (DCGM) Go bindings while being suitable for automated testing and CI/CD pipelines.
Test Files Overview
Core Device Management
- deviceinfo_test.go - Tests device information retrieval functionality
  - Equivalent to samples/deviceInfo/main.go
  - Tests GPU device properties, identification, and topology information
  - Includes tests for both embedded and standalone hostengine connections
- dmon_test.go - Tests device monitoring capabilities
  - Equivalent to samples/dmon/main.go
  - Monitors GPU utilization, temperature, power, and clock speeds
  - Includes time-limited monitoring tests and sample consistency checks
- device_status_test.go - Tests device status querying (part of dmon functionality)
  - Tests single and multiple GPU status queries
  - Validates utilization metrics and system health indicators
Diagnostics and Health
System Management
Process and Topology
REST API
- restapi_test.go - Tests REST API endpoint functionality
  - Equivalent to samples/restApi/ (complete implementation)
  - Uses httptest for testing HTTP endpoints without starting a real server
  - Tests JSON response formats and error handling
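The httptest approach mentioned above can be sketched as follows. This is a minimal illustration, not the sample's actual code: the handler, the /status route, and the JSON fields are hypothetical stand-ins, and only the net/http and net/http/httptest calls come from the standard library.

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
)

// deviceStatus is a hypothetical response payload used only for illustration.
type deviceStatus struct {
	ID   uint   `json:"id"`
	Name string `json:"name"`
}

// statusHandler is a hypothetical stand-in for a REST API handler.
// It rejects non-GET requests and otherwise returns a fixed JSON document.
func statusHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(deviceStatus{ID: 0, Name: "example-gpu"})
}

// callHandler invokes the handler in-process (no listening socket) and
// returns the recorded status code and response body.
func callHandler(method, target string) (int, string) {
	req := httptest.NewRequest(method, target, nil)
	rec := httptest.NewRecorder()
	statusHandler(rec, req)
	return rec.Code, rec.Body.String()
}

// decodeStatus parses a response body back into a deviceStatus so a test
// can assert on individual fields rather than on raw strings.
func decodeStatus(body string) (deviceStatus, error) {
	var s deviceStatus
	err := json.Unmarshal([]byte(body), &s)
	return s, err
}
```

A test would typically call callHandler once for the success path and once for an error path (here, an unsupported method), then check the status code and the decoded fields.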
Running the Tests
Run All Tests
go test ./tests/... -v
Run Specific Test Files
# Run device information tests
go test ./tests/deviceinfo_test.go -v
# Run monitoring tests
go test ./tests/dmon_test.go -v
# Run diagnostic tests
go test ./tests/diag_test.go -v
Run Tests with Different Modes
# Run only quick tests (skip long-running tests)
go test ./tests/... -v -short
# Run tests with timeout
go test ./tests/... -v -timeout 5m
Run Specific Test Functions
# Run specific test function
go test ./tests/deviceinfo_test.go -v -run TestDeviceInfo
# Run all tests matching a pattern
go test ./tests/... -v -run "TestDevice.*"
Test Features
Adaptive Testing
- Tests automatically skip when no GPUs are available
- Different behavior for single vs. multi-GPU systems
- Graceful handling of permission-restricted operations
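One way the hardware-dependent skipping described above might look is sketched below. The helper names are hypothetical, and the GPU count is passed in as a parameter so the logic can be shown without hardware; in a real test it would come from a device-count query in the DCGM bindings.

```go
package main

import (
	"fmt"
	"testing"
)

// skipReason returns a non-empty message when a test that needs `required`
// GPUs should be skipped on a system that has `count` GPUs.
func skipReason(count, required uint) string {
	switch {
	case count == 0:
		return "no GPUs available"
	case count < required:
		return fmt.Sprintf("need %d GPUs, found %d", required, count)
	default:
		return ""
	}
}

// requireGPUs skips the calling test when the GPU requirement is not met.
// t.Helper makes the skip message point at the caller's line.
func requireGPUs(t *testing.T, count, required uint) {
	t.Helper()
	if reason := skipReason(count, required); reason != "" {
		t.Skip(reason)
	}
}
```

Keeping the decision in a pure function (skipReason) separate from the t.Skip call makes the skip rules themselves testable on machines without GPUs.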
Time-Limited Execution
- Long-running samples (like monitoring) are time-limited in tests
- Configurable test durations for CI/CD environments
- Background operations are properly cancelled
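A time-limited monitoring loop along these lines could be sketched with context.WithTimeout and a ticker. The read callback is a hypothetical stand-in for a DCGM query, and the durations are arbitrary values chosen for illustration.

```go
package main

import (
	"context"
	"time"
)

// collectSamples calls read once per interval until the limit elapses or
// the parent context is cancelled, then returns everything collected.
func collectSamples(ctx context.Context, limit, interval time.Duration, read func() int) []int {
	ctx, cancel := context.WithTimeout(ctx, limit)
	defer cancel() // release the timer even if the parent is cancelled first

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var samples []int
	for {
		select {
		case <-ctx.Done():
			return samples
		case <-ticker.C:
			samples = append(samples, read())
		}
	}
}

// demoCollect runs the loop for 100ms at a 10ms interval, using a simple
// counter in place of a real GPU reading.
func demoCollect() []int {
	n := 0
	next := func() int { n++; return n }
	return collectSamples(context.Background(), 100*time.Millisecond, 10*time.Millisecond, next)
}
```

Because the deadline lives in the context, the same function also stops early if the caller cancels the parent context, which covers the cancellation of background operations.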
Comprehensive Coverage
- Each test covers the core functionality of its corresponding sample
- Additional test scenarios for error conditions and edge cases
- Validation of return values and data consistency
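As a rough illustration of validating return values and data consistency, the sketch below checks a series of samples for implausible readings. The gpuSample type, its fields, and the chosen bounds are assumptions made for this example, not types from the DCGM bindings.

```go
package main

import "fmt"

// gpuSample is a hypothetical snapshot of monitored values.
type gpuSample struct {
	UtilPercent uint
	PowerWatts  float64
}

// validateSample returns an error describing the first implausible value
// in s, or nil when every field is within its expected range.
func validateSample(s gpuSample) error {
	if s.UtilPercent > 100 {
		return fmt.Errorf("utilization %d%% exceeds 100%%", s.UtilPercent)
	}
	if s.PowerWatts < 0 {
		return fmt.Errorf("negative power reading %.1f W", s.PowerWatts)
	}
	return nil
}

// validateSeries checks every sample in order and reports the index of
// the first one that fails validation.
func validateSeries(samples []gpuSample) error {
	for i, s := range samples {
		if err := validateSample(s); err != nil {
			return fmt.Errorf("sample %d: %w", i, err)
		}
	}
	return nil
}
```

In a test, a non-nil result from validateSeries would be reported with t.Errorf so that the failing sample index appears in the log.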
CI/CD Friendly
- Tests use the Go testing framework's standard patterns
- Proper test isolation and cleanup
- Structured logging for debugging
Prerequisites
System Requirements
- NVIDIA GPU(s) with DCGM support
- NVIDIA drivers installed
- DCGM libraries available
- Go 1.19+ for testing framework features
Dependencies
The tests require the same dependencies as the original samples:
- github.com/NVIDIA/go-dcgm/pkg/dcgm
- github.com/gorilla/mux (for REST API tests only)
Permissions
Some tests may require elevated privileges:
- Process monitoring tests work best when run as root
- Certain policy violation tests require administrative access
- Diagnostic tests may need elevated permissions for hardware access
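Tests that need elevated privileges can guard themselves with a small check. The sketch below assumes a Unix-like system, where an effective user ID of 0 means root; the helper names are hypothetical.

```go
package main

import (
	"os"
	"testing"
)

// isRoot reports whether the given effective user ID is the superuser.
func isRoot(euid int) bool {
	return euid == 0
}

// requireRoot skips the calling test unless it runs with root privileges.
// On platforms where os.Geteuid returns -1 the test is always skipped.
func requireRoot(t *testing.T) {
	t.Helper()
	if !isRoot(os.Geteuid()) {
		t.Skip("requires root privileges")
	}
}
```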
Test Structure
Each test file follows a consistent pattern:
- Basic Functionality Test - Core sample functionality
- Extended Tests - Additional scenarios and edge cases
- Error Handling Tests - Validation of error conditions
- Performance/Consistency Tests - Multi-sample validation
Example Test Pattern
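package tests // package name assumed from the directory name

import (
	"testing"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

func TestSampleFunctionality(t *testing.T) {
	// Initialize DCGM in embedded mode (no separate hostengine process)
	cleanup, err := dcgm.Init(dcgm.Embedded)
	if err != nil {
		t.Fatalf("Failed to initialize DCGM: %v", err)
	}
	// Shut DCGM down when the test returns
	defer cleanup()

	// Test core functionality
	// ... test implementation

	// Validate results
	// ... assertions and checks
}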
Integration with CI/CD
These tests are designed to integrate well with continuous integration systems:
- Use standard Go testing patterns
- Provide detailed logging for troubleshooting
- Support timeout and cancellation
- Can run with or without actual GPU hardware (with appropriate skipping)
Example GitHub Actions Integration
- name: Run DCGM Tests
  run: |
    go test ./tests/... -v -timeout 10m
  continue-on-error: true # Optional: allow failure if no GPU available
Troubleshooting
Common Issues
- No GPUs Found - Tests will skip automatically
- Permission Denied - Some tests require root privileges
- DCGM Not Available - Ensure DCGM libraries are installed
- Timeout Issues - Increase test timeout for slow systems
All tests provide verbose logging when run with the -v flag:
go test ./tests/deviceinfo_test.go -v
Environment Variables
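Tests may read the following environment variables:
- GO_TEST_TIMEOUT_SCALE - Scale test timeouts (a convention from the Go toolchain's own test suite; go test does not apply it automatically, so tests must read it themselves)
- DCGM_TESTING_MODE - Custom testing configurations (if implemented)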
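A test could apply a timeout scale along the following lines. This sketch assumes the variable holds a positive integer multiplier and that the tests read it themselves; the helper names are hypothetical.

```go
package main

import (
	"os"
	"strconv"
	"time"
)

// scaleFactor parses raw as a positive integer multiplier.
// Empty, non-numeric, or non-positive values fall back to 1 (no scaling).
func scaleFactor(raw string) int {
	factor, err := strconv.Atoi(raw)
	if err != nil || factor < 1 {
		return 1
	}
	return factor
}

// timeoutFromEnv multiplies base by the factor found in the
// GO_TEST_TIMEOUT_SCALE environment variable, if any.
func timeoutFromEnv(base time.Duration) time.Duration {
	return base * time.Duration(scaleFactor(os.Getenv("GO_TEST_TIMEOUT_SCALE")))
}
```

With this helper, running the tests with GO_TEST_TIMEOUT_SCALE=3 would turn a 10 second base timeout into 30 seconds, while an unset or malformed value leaves it unchanged.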
Contributing
When adding new tests:
- Follow the existing naming pattern (*_test.go)
- Include comprehensive documentation
- Add appropriate test skipping for missing hardware
- Include both positive and negative test cases
- Update this README with new test descriptions