Efficient Multiple Imputation for Diverse Data in Python and R: MIDASpy and rMIDAS

Authors: Ranjit Lall and Thomas Robinson

We provide a single replication script (Code/code.R) that substantively reproduces all results presented in the paper (including a full imputation of the CCES data using MIDASpy).

This script takes approximately one hour to run. Generated figures will be saved in the subdirectory Figures/Replication. To facilitate comparison, the Figures directory itself contains the figures as presented in the paper.

Note: due to the complex nature of our full tests, this is only a substantive replication (in line with JSS guidelines). For a complete replication, please run Code/full_code.R. This file has a runtime of 1.4 days, most of which is spent on the hyperparameter test.

All file paths in scripts are relative to the main replication folder.
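Because the paths are relative, the scripts need to be launched from the replication folder itself. The following is a minimal sketch of a pre-flight check, not part of the replication materials: the function name in_replication_root is hypothetical, and it assumes that the presence of Code/code.R and the Data directory is enough to identify the folder.

```shell
# Hypothetical helper: succeeds only if the current directory
# looks like the main replication folder.
# Assumption: Code/code.R and Data/ are sufficient markers.
in_replication_root() {
  [ -f "Code/code.R" ] && [ -d "Data" ]
}

if in_replication_root; then
  echo "Replication folder detected; relative paths will resolve."
else
  echo "Not in the replication folder; change directory first." >&2
fi
```

If the check passes, one way to start the substantive replication is Rscript Code/code.R, assuming R is on your PATH; running the script from an R session whose working directory is the replication folder should work equally well.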

IMPORTANT: Setting up the replication environment

To aid replication, we include a YAML file (in Data) that initializes a conda environment with the correct Python package dependencies.

Manual conda setup

Please ensure you have conda installed on your machine. Next, in a terminal window, navigate to this replication folder. Then, run the following at the command line:

conda env create -f Data/midas-env.yml

NOTE: Setup for Apple Silicon (i.e., Macs with M1 or M2 chips)

rMIDAS and MIDASpy are compatible with Apple’s new ARM64 architecture. However, we recommend using the miniforge installer rather than anaconda or miniconda, as it offers better support for the ARM64 architecture.

Once you have installed miniforge, Apple Silicon users should navigate to this replication folder and run the following at the command line:

conda env create -f Data/midas-env-arm64.yml
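The choice between the two environment files can be made by inspecting the machine architecture. The sketch below is illustrative only: the function name choose_env_file is hypothetical, and it assumes that Apple Silicon machines report Darwin from uname -s and arm64 from uname -m. It prints the command rather than running it, so you can review it first.

```shell
# Hypothetical helper: pick the environment file for a given OS and architecture.
# Assumption: Apple Silicon is identified by OS "Darwin" and architecture "arm64".
choose_env_file() {
  os="$1"
  arch="$2"
  if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
    echo "Data/midas-env-arm64.yml"
  else
    echo "Data/midas-env.yml"
  fi
}

# Detect the current machine and show the matching setup command.
env_file="$(choose_env_file "$(uname -s)" "$(uname -m)")"
echo "conda env create -f $env_file"
```

Note that a terminal running under Rosetta may report x86_64 rather than arm64, in which case this sketch would select the standard file; check the printed command before running it.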

Dependency details

Replication script

We ran the replication script with available memory limited to 8GB. The script was also tested on a MacBook Pro with an Apple M1 Max chip using miniforge, and on an Ubuntu 22.04 Linux system.

Full results

The results in the paper, generated by Code/full_code.R, were produced on an Amazon AWS EC2 server using a c6a.8xlarge instance with 64GB RAM running the Ubuntu 22.04 Server operating system.

Runtimes

Replication script discrepancies

As noted above, we provide a substantive replication script (code.R) due to the lengthy runtime of the full replication script (full_code.R). The two scripts differ in the following ways:

Section 5.1

Section 5.2

Section 6.2

Section 6.3