Dataset Management

Business Value: Centralize your ML datasets with automatic versioning and seamless access from notebooks and jobs — eliminating data pipeline complexity and ensuring reproducibility.

How It Works

The ML Platform provides unified dataset management for ingesting, storing, and accessing training data. When you register a dataset:

  • The platform ingests data from your specified source
  • Files are stored in high-performance workspace storage
  • Metadata (schema, size, version) is cataloged
  • Datasets auto-mount into notebooks and jobs at /data/{dataset-name}
  • Version history tracks changes over time
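
As a rough illustration of the mount convention above, the sketch below resolves a dataset's directory from its name and lists the files under it. It uses only the Python standard library; the helper names `dataset_dir` and `list_dataset_files` and the `root` parameter are hypothetical and not part of any platform SDK.

```python
from pathlib import Path


def dataset_dir(name: str, root: str = "/data") -> Path:
    """Return the mount directory for a dataset, following /data/{dataset-name}."""
    return Path(root) / name


def list_dataset_files(name: str, root: str = "/data") -> list[str]:
    """List every file under the dataset mount, relative to it, in sorted order."""
    base = dataset_dir(name, root)
    if not base.is_dir():
        # Hypothetical error handling: the dataset is not mounted in this session.
        raise FileNotFoundError(f"dataset {name!r} is not mounted at {base}")
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )
```

Inside a notebook or job, calling `list_dataset_files("my-dataset")` would list whatever is mounted at /data/my-dataset. The `root` parameter exists only so the helper can be tried against a local directory.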

Technical Highlights

  • Automatic versioning — every update creates a new version
  • Deduplication — identical files share storage
  • Lazy loading — stream large datasets without full download
  • Format detection — automatic schema inference for structured data
  • Parallel ingestion for fast import of large datasets
  • Workspace-level access control
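
To make the versioning and deduplication behavior concrete, here is a minimal in-memory sketch of one common way to get both: content-addressed storage, where each version is a manifest of file names pointing at content hashes. This is an assumption for illustration, not a description of the platform's internals, and the `DedupStore` class is hypothetical.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical file contents are kept once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}          # SHA-256 digest -> content
        self.versions: list[dict[str, str]] = []   # per version: file name -> digest

    def commit(self, files: dict[str, bytes]) -> int:
        """Record a new version and return its 1-based version number."""
        manifest: dict[str, str] = {}
        for name, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            # setdefault stores the content only if this digest is new.
            self.blobs.setdefault(digest, content)
            manifest[name] = digest
        self.versions.append(manifest)
        return len(self.versions)

    def read(self, version: int, name: str) -> bytes:
        """Read a file as it existed in the given version."""
        return self.blobs[self.versions[version - 1][name]]
```

In this sketch, committing two versions that differ in a single file stores only the changed file's content a second time, while both versions remain readable in full.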

Data Sources

Source               Description
File Upload          CSV, JSON, JSONL, Parquet, Arrow (up to 10 GB)
HuggingFace Hub      Public and private datasets with split selection
Kaggle               Competition and community datasets
S3 / Object Storage  AWS S3, MinIO, GCS, Azure Blob
Direct URL           HTTP/HTTPS with automatic archive extraction
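
For file uploads, a client-side pre-check against the listed formats and size limit might look like the sketch below. The file extensions, the reading of "10GB" as 10 GiB, and the `check_upload` helper are all assumptions for illustration; the platform may detect formats by content rather than by extension.

```python
from pathlib import Path

# Assumed extensions for the formats listed under File Upload.
ALLOWED_SUFFIXES = {".csv", ".json", ".jsonl", ".parquet", ".arrow"}
# Assumption: the 10 GB limit is interpreted here as 10 GiB.
MAX_UPLOAD_BYTES = 10 * 1024**3


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems: list[str] = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 10 GB upload limit")
    return problems
```

Files that fail the size check would need to come in through one of the other sources, such as object storage or a direct URL.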