Federated Heterogeneous Contrastive Distillation for Molecular Representation Learning

Published in the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024

Abstract: With the increasing application of deep learning to scientific problems in biochemistry, molecular federated learning has become popular because it offers distributed, privacy-preserving solutions. However, most existing molecular federated learning methods rely on joint training with public datasets, which are difficult to obtain in practice. These methods also fail to leverage multi-modal molecular representations effectively. To address these issues, we propose a novel framework, Federated Heterogeneous Contrastive Distillation (FedHCD), which enables joint training of global models across clients with heterogeneous data modalities, learning tasks, and molecular models. To aggregate data representations of different modalities in a data-free manner, we design a global multi-modal contrastive strategy that aligns client representations without a public dataset. By exploiting the intrinsic characteristics of molecular data in different modalities, we tackle the exacerbated local model drift and data non-IIDness caused by multi-modal clients. We introduce multi-view contrastive knowledge transfer to extract features at the atom, substructure, and molecule levels, which resolves the information distillation failure caused by dimensional biases across data modalities. Our evaluations on eight real-world molecular datasets, together with ablation experiments, show that FedHCD outperforms other state-of-the-art FL methods, regardless of whether they use public datasets.
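The abstract does not give the form of the contrastive objective. As a rough illustration of what aligning representations across modalities can look like, the sketch below computes a generic symmetric InfoNCE loss between paired embeddings of the same molecules from two views. This is not FedHCD's actual loss: the function `info_nce`, the names `graph_view` and `smiles_view`, the temperature value, and the synthetic embeddings are all assumptions made for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE loss (generic, hypothetical stand-in for the paper's loss).

    Row i of view_a and row i of view_b are assumed to embed the same molecule;
    every other row in the batch is treated as a negative.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    n = len(a)
    # logits[i][j] = cosine similarity of a[i] and b[j], scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diag_cross_entropy(rows):
        # Cross-entropy where the correct class for row i is column i,
        # computed with a numerically stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(transposed))


# Synthetic embeddings (assumed data): two views of the same 8 molecules.
rng = random.Random(0)
graph_view = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
smiles_view = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in graph_view]
# Shift rows by one so that no molecule is paired with its own embedding.
misaligned = smiles_view[1:] + smiles_view[:1]

aligned_loss = info_nce(graph_view, smiles_view)
misaligned_loss = info_nce(graph_view, misaligned)
print(aligned_loss < misaligned_loss)
```

The loss is low when matching rows agree across the two views and high when the pairing is broken, which is the property a contrastive alignment objective relies on.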