Dataset Management

Business Value: Centralize your ML datasets with automatic versioning and direct access from notebooks and jobs, reducing data pipeline complexity and making training runs reproducible.

How It Works

The ML Platform provides unified dataset management for ingesting, storing, and accessing training data. When you register a dataset:

The platform ingests data from your specified source
Files are stored in high-performance workspace storage
Metadata (schema, size, version) is cataloged
Datasets auto-mount into notebooks and jobs at /data/{dataset-name}
Version history tracks changes over time
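
As a rough illustration of the auto-mount step above, a notebook or job could locate a dataset's files under /data/{dataset-name} along these lines. The helper name `dataset_files` and its `root` parameter are hypothetical and not part of the platform's API; only the mount path pattern comes from the description above.

```python
from pathlib import Path


def dataset_files(dataset_name: str, root: str = "/data") -> list[Path]:
    """List the files of a mounted dataset, sorted by path.

    Hypothetical helper: only the /data/{dataset-name} layout is taken
    from the platform description; everything else is an assumption.
    """
    mount = Path(root) / dataset_name
    if not mount.is_dir():
        raise FileNotFoundError(f"dataset not mounted at {mount}")
    # rglob walks subdirectories too, in case the dataset is nested.
    return sorted(p for p in mount.rglob("*") if p.is_file())
```

Calling `dataset_files("my-dataset")` inside a notebook would then list everything under /data/my-dataset, assuming a dataset with that name has been registered and mounted.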

Technical Highlights

Automatic versioning — every update creates a new version
Deduplication — identical files share storage
Lazy loading — stream large datasets without full download
Format detection — automatic schema inference for structured data
Parallel ingestion for fast import of large datasets
Workspace-level access control
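
The deduplication highlight above can be pictured with a small content-addressed store. This is a minimal sketch, not the platform's implementation: the source does not say how deduplication is done, so hashing file contents with SHA-256 is an assumption, and the `ContentStore` class is hypothetical.

```python
import hashlib


class ContentStore:
    """Toy in-memory store where identical file contents share one blob."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # content digest -> stored bytes
        self.files: dict[str, str] = {}    # file path -> content digest

    def put(self, path: str, content: bytes) -> str:
        """Register a file and return the digest of its content."""
        digest = hashlib.sha256(content).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, content)
        self.files[path] = digest
        return digest

    def get(self, path: str) -> bytes:
        """Return the content registered under the given path."""
        return self.blobs[self.files[path]]
```

With this layout, a new dataset version that leaves most files unchanged adds only path entries for those files, and new bytes are stored only for files whose content actually differs.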

Data Sources

Source	Description
File Upload	CSV, JSON, JSONL, Parquet, Arrow (up to 10 GB)
HuggingFace Hub	Public and private datasets with split selection
Kaggle	Competition and community datasets
S3 / Object Storage	AWS S3, MinIO, GCS, Azure Blob
Direct URL	HTTP/HTTPS with automatic archive extraction
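
The File Upload constraints in the table above could be checked on the client side before sending a file. The sketch below is illustrative only: `check_upload` is a hypothetical helper, the file extensions are assumed from the format names, and the 10 GB limit is read as 10 * 10**9 bytes, which may differ from how the platform counts it.

```python
from pathlib import Path

# Extensions assumed for the formats listed in the File Upload row.
UPLOAD_EXTENSIONS = {".csv", ".json", ".jsonl", ".parquet", ".arrow"}
# The 10 GB limit is read here as decimal gigabytes (an assumption).
MAX_UPLOAD_BYTES = 10 * 10**9


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in UPLOAD_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file exceeds {MAX_UPLOAD_BYTES} bytes")
    return problems
```

Files that fail the size check would need one of the other sources in the table, such as S3 / Object Storage or a Direct URL.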
