GeoSFM Model Input Preparation with Cloud-Native Tools

GeoSFM Hydrological Model

GeoSFM (Geospatial Stream Flow Model) is a distributed hydrological model used for flood forecasting and water resource assessment across East Africa. The model requires gridded precipitation and evapotranspiration inputs aggregated to hydrological response units (zones), making efficient data processing essential for operational workflows.

Cloud-Native Data Pipeline

The input preparation workflow uses three cloud-native tools to process large gridded datasets: Icechunk with Zarr for array storage, Flox with Dask for zonal aggregation, and Parquet for tabular output.

Icechunk and Zarr Storage

Icechunk provides versioned, cloud-optimized storage for Zarr datasets. The workflow stores both raw and regridded data as Zarr arrays in Icechunk, which provides:

Efficient chunk-based access for large arrays
Version control for reproducible workflows
Seamless integration with Xarray and Dask
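
To illustrate why chunking matters for access cost, the sketch below computes which chunks a read window touches in a chunked array. It is plain Python arithmetic, not the Icechunk or Zarr API. The function name `chunks_for_window`, the (time, lat, lon) layout, and the chunk shape are assumptions made for the example.

```python
from itertools import product


def chunks_for_window(window, chunk_shape):
    """Return the chunk grid indices that a read window touches.

    window: one (start, stop) pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(window, chunk_shape):
        if stop <= start:
            return []  # an empty window touches no chunks
        first = start // size
        last = (stop - 1) // size
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


# Hypothetical layout: a (time, lat, lon) array chunked as (1, 500, 500).
chunk_shape = (1, 500, 500)

# Read one time step over a 300 x 400 cell window.
window = ((10, 11), (450, 750), (900, 1300))
touched = chunks_for_window(window, chunk_shape)
print(touched)
```

Only the chunks in `touched` need to be fetched from storage, so a chunk shape that matches the typical read pattern (here, one time step at a time) keeps reads small.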

Flox-Dask for Zonal Aggregation

Flox accelerates grouped reductions on labeled arrays, replacing slow Xarray groupby operations with algorithms optimized for that purpose. Combined with Dask, it computes zonal statistics in parallel across thousands of hydrological zones.
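
As a reference for what a zonal mean computes, the following dependency-free sketch averages a small grid by zone id. It is not Flox's API and is far slower than Flox on real data; it may still be useful for checking aggregated output on a tiny test grid. The function name `zonal_mean`, the use of zone id 0 for cells outside every zone, and the choice to skip NaN cells are assumptions made for the example.

```python
import math
from collections import defaultdict


def zonal_mean(values, zones, nodata_zone=0):
    """Mean of `values` per zone id, skipping NaN cells and the nodata zone.

    values, zones: 2-D grids (lists of rows) with identical shape.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone == nodata_zone or math.isnan(value):
                continue
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sorted(counts)}


# Tiny illustrative grids; zone 0 marks cells outside every zone (assumed).
zones = [
    [1, 1, 2],
    [1, 2, 2],
    [0, 3, 3],
]
rain_mm = [
    [2.0, 4.0, 1.0],
    [6.0, 3.0, float("nan")],
    [9.0, 5.0, 7.0],
]

means = zonal_mean(rain_mm, zones)
print(means)
```

The zone raster plays the role of the group labels: every grid cell contributes to exactly one zone, so the whole aggregation is a single pass over the data.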

Parquet for Tabular Output

The aggregated zone-wise data is exported to Parquet, a columnar format that stores compactly and supports fast analytical queries, which simplifies integration with downstream analysis tools.

Workflow Architecture

The three-stage pipeline processes data from multiple sources:

Download & Regrid (01-get-regrid.py): Fetches CHIRPS-GEFS forecasts, IMERG observations, and PET data, then regrids them to a common 0.02° resolution
Flox Groupby (02-flox-groupby.py): Converts shapefiles to raster zones; performs zone-based aggregation using Flox
Zone Text Generation (03-zone-txt.py): Produces rain.txt and evap.txt files for each GeoSFM zone
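
The regridding in the first stage can be sketched as follows. The interpolation method used by 01-get-regrid.py is not stated here, so this example assumes nearest-neighbour resampling. It also assumes coordinates that increase with index and a 0.1° source grid. All function names and the sample extent are hypothetical.

```python
def cell_centers(start, stop, resolution):
    """Cell-center coordinates covering [start, stop) at the given resolution."""
    n_cells = round((stop - start) / resolution)
    return [start + (i + 0.5) * resolution for i in range(n_cells)]


def nearest_index(coord, src_start, src_resolution, src_size):
    """Index of the source cell containing `coord`, clamped to the grid."""
    index = int((coord - src_start) // src_resolution)
    return min(max(index, 0), src_size - 1)


def regrid_nearest(src, src_lat0, src_lon0, src_res, lats, lons):
    """Nearest-neighbour regrid of a 2-D source grid onto target cell centers.

    src_lat0, src_lon0: lower edges of the first source row and column.
    """
    n_rows, n_cols = len(src), len(src[0])
    col_index = [nearest_index(lon, src_lon0, src_res, n_cols) for lon in lons]
    out = []
    for lat in lats:
        row = src[nearest_index(lat, src_lat0, src_res, n_rows)]
        out.append([row[c] for c in col_index])
    return out


# Hypothetical 2 x 2 source grid at 0.1 degrees, lower-left corner at (0 N, 30 E).
src = [
    [1.0, 2.0],
    [3.0, 4.0],
]

# Target grid at the pipeline's 0.02 degree resolution over the same extent.
lats = cell_centers(0.0, 0.2, 0.02)
lons = cell_centers(30.0, 30.2, 0.02)
fine = regrid_nearest(src, 0.0, 30.0, 0.1, lats, lons)

print(len(fine), len(fine[0]))
```

Putting every source on the same 0.02° grid means a single zone raster can be reused for all inputs in the aggregation stage.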