MICER: a pre-trained encoder-decoder architecture for molecular image captioning

Automatic recognition of chemical structures from molecular images provides an important avenue for the rediscovery of chemicals. Traditional rule-based approaches rely on expert knowledge and struggle with the diversity of drawing styles. We propose MICER, an encoder-decoder architecture that pairs an encoder pre-trained on large-scale molecular images, which learns robust visual representations, with an attention-based decoder that translates a molecular image into a SMILES string. Fine-tuning the pre-trained model substantially improves performance on molecular image captioning benchmarks and achieves state-of-the-art accuracy in optical chemical structure recognition.
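
To make the role of the attention-based decoder concrete, the following is a minimal sketch of a single attention step, not the MICER implementation. It assumes plain dot-product attention, which the text above does not specify, and the names `attend`, `decoder_state`, and `encoder_features` are hypothetical. At each decoding step, the decoder state is scored against the feature vector of every image region, the scores are normalized with a softmax, and the resulting weights produce a context vector that would inform the next SMILES token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_features):
    """Return (weights, context) for one decoding step.

    decoder_state: list of d floats (hypothetical decoder hidden state).
    encoder_features: list of n region vectors, each of d floats.
    Assumption: dot-product scoring; the actual model may score differently.
    """
    # Score each image region against the current decoder state.
    scores = [
        sum(h * f for h, f in zip(decoder_state, region))
        for region in encoder_features
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region features.
    dim = len(decoder_state)
    context = [
        sum(w * region[i] for w, region in zip(weights, encoder_features))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three image regions with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(state, features)
```

In this toy input the first and third regions align equally well with the decoder state, so they receive equal weight, while the second region receives the least. In a trained model the features would come from the pre-trained image encoder and the state from the recurrent or Transformer decoder.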