We propose a Transformer-based architecture that directly predicts posterior marginal distributions for missing data without fitting an explicit generative model. Inspired by BERT's masked language modeling, our approach trains on artificially masked observed values to learn dependencies in the data. Unlike generative approaches that model joint distributions, we focus exclusively on the posterior marginals needed for downstream decisions, enabling more direct optimization and avoiding the complexity of full generative modeling.
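Concretely, the training step can be sketched as follows, assuming categorical attributes, a batch of integer-coded values, and a model that returns one set of logits per attribute; all function and variable names here are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def masked_marginal_loss(model, values, observed, mask_frac=0.15):
    """BERT-style objective: hide a random subset of *observed* attributes
    and train the model to predict their values from the remaining ones.

    values:   (batch, num_attrs) integer-coded attribute values
    observed: (batch, num_attrs) bool, True where the value is actually known
    """
    # Choose artificial masks only among genuinely observed entries.
    artificial = observed & (torch.rand_like(values, dtype=torch.float) < mask_frac)
    visible = observed & ~artificial

    # The model sees only the still-visible values and returns, for every
    # attribute, logits over its value vocabulary: (batch, num_attrs, vocab).
    logits = model(values, visible)

    # Cross-entropy on the artificially masked positions; the softmax of these
    # logits is the predicted posterior marginal for each hidden attribute.
    return F.cross_entropy(
        logits[artificial],   # (num_masked, vocab)
        values[artificial],   # (num_masked,)
    )
```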
Figure: The design of the experiments presented in this paper. Each domain, parametrized by its attributes, acts as a data source from which Missing-at-Random (MAR) instances are generated. The Marformer is trained with masked auto-encoding to obtain pθ, which approximates the posterior marginals. We compare against baselines parametrized by a qφ learned on the observed examples.
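For concreteness, the sketch below shows one simple way to realize such a MAR mechanism: the probability of dropping an attribute depends only on an always-observed "driver" attribute, never on the hidden values. The specific dependence and parameter values are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def make_mar_instance(row, rng, driver_col=0, base_p=0.1, extra_p=0.4):
    """Mask entries of a fully observed record under a simple MAR mechanism."""
    row = np.asarray(row)
    # Missingness probability is driven by an observed value only.
    p = base_p + extra_p * (row[driver_col] % 2)
    observed = rng.random(row.shape[0]) >= p
    observed[driver_col] = True          # keep the driver observed so the mechanism stays MAR
    masked = np.where(observed, row, -1) # -1 marks a missing entry
    return masked, observed

rng = np.random.default_rng(0)
masked, observed = make_mar_instance([3, 1, 4, 1, 5], rng)
```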
The Marformer is a Transformer encoder that processes structured attributes through compositional embeddings. Each attribute's initial representation combines missingness indicators, type embeddings, observed values, and learned positional embeddings. Through multiple layers of cross-attention, the model iteratively refines its predictions by attending to the other attributes' predicted distributions, in the spirit of message-passing algorithms but with a learned refinement procedure.
Figure: Architecture of the Marformer, a Transformer encoder. The input at layer 0 is a set of unordered tokens, one per attribute. Each token is transformed through L layers of cross-attention to yield refined predictions at the top layer.
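A compact sketch of this architecture is given below, under the simplifying assumptions that all attributes share one categorical vocabulary and that the per-layer refinement is realized with standard multi-head attention across the attribute tokens, as a stand-in for the cross-attention described above; module names, dimensions, and the layer count are illustrative:

```python
import torch
import torch.nn as nn

class MarformerSketch(nn.Module):
    def __init__(self, num_attrs, vocab_size, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        # Compositional per-attribute embedding: value + type + missingness + position.
        self.value_emb = nn.Embedding(vocab_size, d_model)
        self.type_emb = nn.Embedding(num_attrs, d_model)   # one "type" per attribute here
        self.miss_emb = nn.Embedding(2, d_model)            # observed vs. missing indicator
        self.pos_emb = nn.Parameter(torch.randn(num_attrs, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)          # per-token marginal logits

    def forward(self, values, observed):
        # values: (batch, num_attrs) ints; observed: (batch, num_attrs) bool
        batch, num_attrs = values.shape
        idx = torch.arange(num_attrs, device=values.device)
        safe_values = torch.where(observed, values, torch.zeros_like(values))
        tok = (self.value_emb(safe_values) * observed.unsqueeze(-1).float()  # zero out missing values
               + self.type_emb(idx)
               + self.miss_emb(observed.long())
               + self.pos_emb)
        h = self.encoder(tok)         # L layers of attention across attribute tokens
        return self.head(h)           # (batch, num_attrs, vocab) marginal logits
```

The forward signature matches the training sketch given earlier, so the same `masked_marginal_loss` routine can drive this model.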