SpectraDiff:

Enhancing the Fidelity of Infrared Image Translation with Object-Aware Diffusion



Seamless Trans-X Lab, Yonsei University

Abstract

Autonomous systems commonly rely on RGB cameras, which are susceptible to failure in low-light and adverse conditions. Infrared (IR) imaging provides a viable alternative by capturing thermal signatures independent of visible illumination. However, its high cost and integration complexities limit widespread adoption. To address these challenges, we introduce SpectraDiff, a diffusion-based framework that synthesizes realistic IR images by fusing RGB inputs with refined semantic segmentation. Through our RGB-Seg Object-Aware (RSOA) module, SpectraDiff learns object-specific IR intensities by leveraging object-aware features. The SpectraDiff architecture, featuring a novel Spectral Attention Block, enforces self-attention among semantically similar pixels while leveraging cross-attention with the original RGB to preserve high-frequency details. Extensive evaluations on FLIR, FMB, MFNet, IDD-AW, and RANUS demonstrate SpectraDiff's superior performance over existing methods, as measured by both perceptual (FID, LPIPS, DISTS) and fidelity (SSIM, SAM) metrics.


Method

SpectraDiff is a diffusion-based framework that synthesizes realistic IR images by fusing RGB inputs with refined semantic segmentation. The RGB-Seg Object-Aware (RSOA) module conditions the diffusion process on object-aware features, allowing the model to learn object-specific IR intensities. At the core of the architecture is the Spectral Attention Block, which enforces self-attention among semantically similar pixels while applying cross-attention with the original RGB image to preserve high-frequency details. Trained in this object-aware manner, the model produces IR images that remain faithful to both the thermal characteristics of each object class and the fine structure of the input scene.
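As a minimal NumPy sketch of the attention mechanism described in the abstract, the snippet below restricts self-attention to pixels that share a semantic label and then cross-attends to RGB-derived features. The function name, flattened feature shapes, and residual fusion are illustrative assumptions; the paper's actual block operates inside a diffusion network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries get zero weight.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spectral_attention(feats, rgb_feats, seg_labels):
    """Illustrative sketch of a segmentation-masked attention block.

    feats:      (N, d) per-pixel features being denoised
    rgb_feats:  (N, d) features from the original RGB image
    seg_labels: (N,)   semantic class id per pixel
    """
    n, d = feats.shape
    scale = 1.0 / np.sqrt(d)

    # Self-attention restricted to semantically similar pixels:
    # scores between pixels of different classes are masked out.
    scores = feats @ feats.T * scale
    same_class = seg_labels[:, None] == seg_labels[None, :]
    scores = np.where(same_class, scores, -np.inf)
    out = softmax(scores, axis=-1) @ feats

    # Cross-attention with RGB features to recover high-frequency detail.
    cross = softmax(out @ rgb_feats.T * scale, axis=-1) @ rgb_feats
    return out + cross  # residual fusion (assumption)
```

With this masking, a pixel's self-attention output is a convex combination of features from its own semantic class only, which is one way to realize "self-attention among semantically similar pixels".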


Qualitative Results

Ours (FLIR)
GT (FLIR)
Ours (FMB)
GT (FMB)
Ours (MFNet)
GT (MFNet)
Ours (MFNet)
GT (MFNet)


Qualitative comparisons of different models on the FLIR dataset

FLIR Dataset results
The FLIR results illustrate the effectiveness of our SpectraDiff model in capturing the thermal properties of objects, including people, across both day and night conditions, producing clearer and more detailed images than other methods.


Qualitative comparisons with different models on the FMB dataset (top two rows) and MFNet dataset (bottom two rows) for LWIR translation.

FMB and MFNet Dataset results
SpectraDiff is effective at capturing IR intensities and produces clearer images. In particular, our model captures the thermal emissive properties of diverse objects and scenes in the long-wave infrared (LWIR) domain. Results for the NIR datasets (IDD-AW, RANUS) are available in the supplementary material.


Quantitative Results

Quantitative comparison of the proposed model's performance across various infrared (IR) range datasets. The results demonstrate performance variations across different IR ranges and highlight where our model outperforms other methods based on SAM, FID, LPIPS, and DISTS metrics. The best results are shaded in green and the second-best results are shaded in yellow.

BibTeX

@article{park2026spectradiff,
title = {SpectraDiff: Enhancing the Fidelity of Infrared Image Translation with Object-Aware Diffusion},
journal = {Computer Vision and Image Understanding},
pages = {104709},
year = {2026},
issn = {1077-3142},
doi = {10.1016/j.cviu.2026.104709},
url = {https://www.sciencedirect.com/science/article/pii/S1077314226000767},
author = {Incheol Park and Youngwan Jin and Nalcakan Yagiz and Hyeongjin Ju and Sanghyeop Yeo and Shiho Kim},
keywords = {Image-to-image translation, Data augmentation, Infrared imaging, Diffusion models},
}