Can you explain more on multimodal images?
I guess there can be some deep learning approach to this. The one I suggested previously is one. But you see the mode has to be backed with strong mathematical operations. Old school convolution and FC will just create a backbone.
Using something like gram matrix to merge the images in neural style transfer is that mathematical touch.
In your case what if some kind of clever feature encoding will help.