TACO DATASET DOCUMENTATION

Cloud 3D - Himawari Finetuning Dataset

Version: v0.1.0 | ID: cloud3d-finetune-himawari | License: CC-BY-4.0

Description

This is the Himawari fine-tuning subset of the Global 3D Cloud Reconstruction Dataset. It contains colocated pairs of Himawari-8/9 AHI geostationary imagery and CloudSat radar profiles for supervised 3D cloud structure reconstruction. Each sample includes multispectral Himawari imagery (16 spectral channels plus satellite and solar angles), CloudSat vertical profiles as ground truth, and a colocation mask marking valid CloudSat footprint pixels. Samples are 256x256 pixel patches in Cloud-Optimized GeoTIFF format.
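
The colocation mask matters for supervised training because CloudSat observes only a narrow track through each patch, so an error metric would typically be evaluated on the masked pixels alone. The following is a minimal NumPy sketch under stated assumptions: the names `pred`, `target`, and `mask` and the toy 4x4 shapes are illustrative and not part of the dataset's API, and the target is kept 2D for brevity.

```python
import numpy as np

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error over pixels where mask is True.

    pred, target: float arrays of identical shape (..., H, W)
    mask: boolean array of shape (H, W), True on the CloudSat footprint
    """
    # Repeat the 2D mask over any leading dimensions of pred.
    valid = np.broadcast_to(mask, pred.shape)
    if not valid.any():
        raise ValueError("mask selects no pixels")
    diff = pred[valid] - target[valid]
    return float(np.mean(diff ** 2))

# Toy 4x4 patch with a diagonal "track" of valid pixels (illustrative only).
mask = np.eye(4, dtype=bool)
target = np.zeros((4, 4))
pred = np.ones((4, 4))
pred[~mask] = 100.0  # large errors off the track must not affect the result
print(masked_mse(pred, target, mask))  # 1.0
```

Pixels outside the mask carry no ground truth, so they are excluded rather than treated as zero.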

Dataset Overview

Partitions: 103 | Temporal coverage: 2015 - 2020

Keywords

cloud microphysics, 3D reconstruction, geostationary satellites, Himawari-8, Himawari-9, CloudSat, remote sensing, tropical cyclones, deep learning

ML Tasks

regression, foundation-model

TACO Structure (Root-Sibling Uniform Tree)

Hierarchical structure showing representative samples across levels. The "..." notation indicates additional samples following the same pattern. All samples at the same level share identical structure (RSUT constraint).

Hierarchy Details

Level | Types | Total Samples | Sample IDs (preview)
Level 0 | All FOLDER | 57,373 | Root level samples
Level 1 | FILE + FILE | 114,746 | geo_patch, cloudsat_aligned
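
The totals are consistent with every Level 0 folder holding exactly two files: 57,373 × 2 = 114,746. The sketch below shows one way to think about the RSUT constraint, using a hypothetical in-memory mapping from folder IDs to child IDs. The folder names are invented, only `geo_patch` and `cloudsat_aligned` come from the table, and treating "identical structure" as "same set of child IDs" is a simplification.

```python
def is_uniform(tree: dict[str, list[str]]) -> bool:
    """Return True if every folder has the same set of child IDs."""
    children = [sorted(c) for c in tree.values()]
    return all(c == children[0] for c in children[1:])

# Hypothetical folder IDs; the child IDs are the two Level 1 samples.
tree = {
    "sample_000": ["geo_patch", "cloudsat_aligned"],
    "sample_001": ["geo_patch", "cloudsat_aligned"],
}
assert is_uniform(tree)

# Each of the 57,373 folders contributes two files.
level0_count = 57_373
files_per_folder = len(tree["sample_000"])
print(level0_count * files_per_folder)  # 114746
```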

Metadata Fields by Level

These fields are available for querying with SQL when using TacoReader.

LEVEL0 (22 fields)
Field Name | Type | Description
id | string | Unique sample identifier within parent scope. Must be unique among siblings.
type | string | Sample type discriminator (FILE or FOLDER).
stac:crs | string | Coordinate reference system (WKT2, EPSG, or PROJ)
stac:tensor_shape | list<item: int64> | Raster dimensions [bands, height, width]
stac:geotransform | list<item: double> | GDAL affine transform
stac:time_start | timestamp[us] | Start timestamp (μs since Unix epoch, UTC)
stac:centroid | binary | Center point in EPSG:4326 (WKB)
stac:time_end | timestamp[us] | End timestamp (μs since Unix epoch, UTC)
stac:time_middle | timestamp[us] | Middle timestamp (μs since Unix epoch, UTC)
split | string | Dataset partition identifier (train, test, or validation)
cloud3d:cyclone | bool | Whether this sample is from a tropical cyclone observation
cloud3d:satellite | string | Geostationary satellite source (GOES, Himawari, MSG)
cloud3d:geostationary_id | string | Original geostationary satellite file identifier
cloud3d:cloudsat_id | string | CloudSat granule/profile identifier
cloud3d:has_flxhr | bool | Whether 2B-FLXHR radiative flux/heating rate data is available
majortom:code | string | MajorTOM spherical grid cell identifier (e.g., 0100km_0003U_0005R) with ~dist_km spacing
geoenrich:elevation | float | Mean elevation in meters (GLO-30 DEM)
geoenrich:precipitation | float | Mean annual precipitation in mm estimated from GPM data
geoenrich:temperature | float | Mean annual temperature in °C estimated from MODIS LST data
geoenrich:admin_countries | string | Country name at centroid location
internal:current_id | int64 | Current sample position at this level (0-indexed). Enables O(1) random access and relational JOINs (ZIP, FOLDER, TACOCAT).
internal:parent_id | int64 | Foreign key referencing parent sample position in previous level (ZIP, FOLDER, TACOCAT).
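
Field names that contain a colon, such as `cloud3d:cyclone`, have to be double-quoted as identifiers in standard SQL. The sketch below illustrates this with Python's built-in `sqlite3` on a toy table with invented rows. It is not TacoReader's query interface, which may expose the table differently, and SQLite stores the boolean as an integer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE level0 (id TEXT, split TEXT, '
    '"cloud3d:cyclone" INTEGER, "geoenrich:elevation" REAL)'
)
# Invented rows for illustration; real values come from the dataset metadata.
con.executemany(
    "INSERT INTO level0 VALUES (?, ?, ?, ?)",
    [
        ("a", "train", 1, 0.0),
        ("b", "train", 0, 120.5),
        ("c", "test", 1, 3.2),
    ],
)

# Select training samples taken from tropical cyclone observations.
rows = con.execute(
    """
    SELECT id FROM level0
    WHERE split = 'train' AND "cloud3d:cyclone" = 1
    ORDER BY id
    """
).fetchall()
print(rows)  # [('a',)]
```
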
LEVEL1 (7 fields)
Field Name | Type | Description
id | string | Unique sample identifier within parent scope. Must be unique among siblings.
type | string | Sample type discriminator (FILE or FOLDER).
geotiff:stats | list<item: list<item: float>> | Per-band statistics (List[List[Float32]]): categorical mode returns class probabilities, continuous mode returns [min, max, mean, std, valid%, p25, p50, p75, p95]
taco:header | binary | Binary TACOTIFF header (35 bytes + tile counts) for fast reading without IFD parsing
internal:current_id | int64 | Current sample position at this level (0-indexed). Enables O(1) random access and relational JOINs (ZIP, FOLDER, TACOCAT).
internal:parent_id | int64 | Foreign key referencing parent sample position in previous level (ZIP, FOLDER, TACOCAT).
internal:relative_path | string | Relative path from DATA/ directory. Format: {parent_path}/{id} or {id} for level0 (ZIP, FOLDER, TACOCAT).
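
The `internal:parent_id` foreign key lets Level 1 files be joined back to their Level 0 folder. The sketch below again uses `sqlite3` with invented rows, and assumes that `internal:parent_id` at Level 1 matches `internal:current_id` at Level 0, as the field descriptions suggest. The table names `level0` and `level1` are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE level0 ("internal:current_id" INTEGER, id TEXT, split TEXT)'
)
con.execute(
    'CREATE TABLE level1 ("internal:parent_id" INTEGER, id TEXT, '
    '"internal:relative_path" TEXT)'
)
# Invented rows; paths follow the {parent_path}/{id} format.
con.executemany(
    "INSERT INTO level0 VALUES (?, ?, ?)",
    [(0, "s0", "train"), (1, "s1", "test")],
)
con.executemany(
    "INSERT INTO level1 VALUES (?, ?, ?)",
    [
        (0, "geo_patch", "s0/geo_patch"),
        (0, "cloudsat_aligned", "s0/cloudsat_aligned"),
        (1, "geo_patch", "s1/geo_patch"),
        (1, "cloudsat_aligned", "s1/cloudsat_aligned"),
    ],
)

# List the files that belong to training folders.
pairs = con.execute(
    """
    SELECT p.id, c."internal:relative_path"
    FROM level1 AS c
    JOIN level0 AS p
      ON c."internal:parent_id" = p."internal:current_id"
    WHERE p.split = 'train'
    ORDER BY c."internal:relative_path"
    """
).fetchall()
print(pairs)  # [('s0', 's0/cloudsat_aligned'), ('s0', 's0/geo_patch')]
```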

Loading the Dataset

# pip install tacoreader
import tacoreader

# Load dataset
ds = tacoreader.load("https://data.source.coop/taco/3dclouds/finetune/himawari/")

# Basic info
print(f"ID: {ds.id}")
print(f"Version: {ds.version}")
print(f"Samples: {len(ds.data)}")
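
Once the metadata is loaded, the `stac:geotransform` field can be used to map pixel positions to coordinates in the sample's CRS. The sketch below applies the standard six-element GDAL affine convention. The geotransform values are invented for illustration, and `pixel_to_coords` is a hypothetical helper, not a tacoreader function.

```python
from typing import Sequence, Tuple

def pixel_to_coords(gt: Sequence[float], row: float, col: float) -> Tuple[float, float]:
    """Map a (row, col) pixel position to (x, y) in the raster's CRS.

    gt follows GDAL order:
    (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height)
    """
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

# Invented north-up geotransform with 2000 m pixels (illustrative only).
gt = (100000.0, 2000.0, 0.0, 200000.0, 0.0, -2000.0)

# Integer indices address pixel corners; add 0.5 for the pixel center.
print(pixel_to_coords(gt, 0.5, 0.5))  # (101000.0, 199000.0)
```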

Providers & Curators

Data Providers

JMA (producer): https://www.jma.go.jp
European Space Agency (ESA) (licensor): https://www.esa.int
source.coop (host): https://source.coop

Dataset Curators

Name | Organization | Email
Cesar Aybar | Universitat de València | cesar.aybar@uv.es
Shirin Ermis | University of Oxford |
Lilli Freischem | University of Oxford |
Stella Girtsou | National Observatory of Athens |
Kyriaki-Margarita Bintsi | Harvard Medical School |
Emiliano Diaz Salas-Porras | Universitat de València |
Michael Eisinger | European Space Agency |
William Jones | University of Oxford |
Anna Jungbluth | European Space Agency |
Benoit Tremblay | Environment and Climate Change Canada |

Publications & Citations

How to Cite This Dataset

If you use this dataset in your research, please cite:

Ermis, S., Aybar, C., Freischem, L., Girtsou, S., Bintsi, K.-M., Diaz Salas-Porras, E., Eisinger, M., Jones, W., Jungbluth, A., & Tremblay, B. (2025). Global 3D Reconstruction of Clouds & Tropical Cyclones. Tackling Climate Change with Machine Learning Workshop at NeurIPS 2025.

BibTeX

@dataset{cloud3d-finetune-himawari0,
  title = {Cloud 3D - Himawari Finetuning Dataset},
  author = {Cesar Aybar and Shirin Ermis and Lilli Freischem and Stella Girtsou and Kyriaki-Margarita Bintsi and Emiliano Diaz Salas-Porras and Michael Eisinger and William Jones and Anna Jungbluth and Benoit Tremblay},
  year = {2025},
  version = {0.1.0},
  publisher = {Universitat de València}
}