Streaming NOAA GEFS Data for cGAN Inference

Jul 23, 2025

GEFS Data Streaming for cGAN Inference

The NOAA Global Ensemble Forecast System (GEFS) produces ensemble weather forecasts that are publicly hosted on AWS S3 as open data. Accessing multiple forecast variables efficiently is critical for machine learning applications such as conditional Generative Adversarial Network (cGAN) precipitation downscaling.

The Challenge: 11 Variables for cGAN

cGAN precipitation downscaling requires 11 different atmospheric variables from GEFS as input features (temperature, humidity, wind components, geopotential heights, etc.). Traditional approaches would require downloading entire GRIB2 files for each variable and ensemble member, potentially hundreds of gigabytes per forecast cycle. This imposes significant bandwidth, storage, and time costs on operational workflows.

Grib-Index-Kerchunk Method

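With the Grib-Index-Kerchunk (GIK) method, GEFS GRIB2 files are exposed as a virtual Zarr store without downloading entire files: the GRIB index gives the byte offset of each message, and Kerchunk maps those byte ranges to Zarr chunks. This enables: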

  • Selective variable extraction - Stream only required variables for cGAN inference
  • Efficient cloud access - Byte-range requests to AWS S3
  • Zarr-to-NetCDF conversion - Transform streamed data into NetCDF for model input
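
To make the byte-range idea concrete, here is a minimal sketch of how a GRIB index can be turned into HTTP Range headers. It is not the actual GIK code. It assumes the colon-separated wgrib2-style index layout (message number, byte offset, date, variable, level, forecast, ...), integer message numbers, and distinct offsets per message. The names GribMessage, parse_idx, select, and range_header are hypothetical, and the offsets in SAMPLE_IDX are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Illustrative index text; the byte offsets are made up, not real GEFS values.
SAMPLE_IDX = """\
1:0:d=2025072300:TMP:2 m above ground:3 hour fcst:ENS=+1
2:51234:d=2025072300:RH:2 m above ground:3 hour fcst:ENS=+1
3:101876:d=2025072300:UGRD:10 m above ground:3 hour fcst:ENS=+1
4:158210:d=2025072300:APCP:surface:0-3 hour acc fcst:ENS=+1
"""


@dataclass
class GribMessage:
    number: int
    variable: str
    level: str
    start: int            # first byte of the message in the GRIB2 file
    end: Optional[int]    # last byte (inclusive); None means "to end of file"


def parse_idx(idx_text: str) -> List[GribMessage]:
    """Parse index lines and derive each message's byte range."""
    rows = []
    for line in idx_text.strip().splitlines():
        fields = line.split(":")
        if len(fields) < 5:
            continue  # skip malformed lines
        rows.append((int(fields[0]), int(fields[1]), fields[3], fields[4]))

    messages = []
    for i, (number, start, variable, level) in enumerate(rows):
        # A message ends one byte before the next message starts.
        end = rows[i + 1][1] - 1 if i + 1 < len(rows) else None
        messages.append(GribMessage(number, variable, level, start, end))
    return messages


def select(messages: List[GribMessage],
           wanted: Set[Tuple[str, str]]) -> List[GribMessage]:
    """Keep only messages whose (variable, level) pair is requested."""
    return [m for m in messages if (m.variable, m.level) in wanted]


def range_header(msg: GribMessage) -> str:
    """Build the value of an HTTP Range header for one message."""
    end = "" if msg.end is None else str(msg.end)
    return f"bytes={msg.start}-{end}"


if __name__ == "__main__":
    wanted = {("TMP", "2 m above ground"), ("APCP", "surface")}
    for msg in select(parse_idx(SAMPLE_IDX), wanted):
        print(msg.variable, msg.level, range_header(msg))
```

Each header value would be sent with a GET request against the matching GRIB2 object, so only the selected messages are transferred. Kerchunk records the same offset and length pairs as references, which lets the file be read lazily as Zarr.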

Workflow Components

Script                       Purpose
run_gefs_inference_raw.py    Self-contained cGAN inference pipeline
zarr_to_raw_netcdf.py        Convert GIK Zarr to NetCDF for cGAN input
tfrecords_generator.py       Generate TFRecords for model training

The workflow streams the required GEFS variables, converts them to normalized NetCDF, and feeds them into the cGAN model to produce downscaled ensemble precipitation forecasts.

Resources