I am an ETH Postdoctoral Fellow at ETH Zurich, working with Professor Torsten Hoefler in the Scalable Parallel Computing Laboratory. I previously completed my PhD in computer science at the University of Illinois at Urbana-Champaign, advised by Professor Marc Snir. I collaborate closely with members of the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory and with many other collaborators.

My research focuses on the intersection of high-performance computing and machine learning. I am particularly interested in scalable training of deep neural networks and applying neural networks to scientific and computational simulation datasets. I also work on parallel algorithms and runtimes, graph analytics, and communication and performance optimization.

Current students

  • Lukas Ernst (masters)
  • Siméone de Fremond de la Merveillere (masters)
  • Simon Jacob (bachelors)
  • George Mtui (masters)
  • Julien Schenkel (bachelors)
  • Stefan Scholbe (bachelors)
  • Bernhard Walser (masters)

Former students

  • Christoph Amevor (bachelors)
  • Roman Böhringer (bachelors)
  • Jinfan Chen (bachelors)
  • Tobia Claglüna (bachelors)
  • Maximilian Fries (bachelors)
  • Ali Nasser (masters at KAUST)
  • Anton Schäfer (bachelors)
  • Peter Tatkowski (masters)
  • Neville Walo (bachelors)
  • Andreas Zingg (masters)

Selected projects


NoPFS, the Near-optimal PreFetching System, is a deep learning I/O middleware that uses clairvoyant prefetching and distributed caching to fully utilize the storage hierarchy of a large cluster and mitigate training I/O overhead.

Deep Weather

I lead the Deep Weather project to apply deep learning to weather forecasting and to post-processing numerical weather system ensembles.


Substation is an overarching project to develop high-performance transformer implementations. It currently provides the fastest public implementation of the BERT-large model.


Aluminum is a generic communication framework enabling high-performance asynchronous point-to-point and collective operations, especially on GPUs. It provides more GPU-friendly semantics than MPI and a suite of latency- and bandwidth-optimized algorithms, drawn from both existing libraries and custom implementations. Aluminum has been integrated into both the LBANN deep learning toolkit and the Hydrogen distributed linear algebra library.


LBANN (Livermore Big Artificial Neural Network Toolkit) is a research toolkit for scaling the training of deep neural networks on HPC systems. My work includes optimized communication algorithms and patterns, communication sparsification and quantization, more general distributed-memory convolution algorithms, and more scalable data-parallel training algorithms.


PPL is an experimental C++11 runtime system for exploring different implementation tradeoffs, especially in the context of future exascale systems. See the paper on it below. If you’re interested in further details, contact me.

The name (probably) inventively stands for “Parallel Programming Library”, and is certainly not meant to be confused with the nice people right next door at the other PPL (Parallel Programming Laboratory).


PGDB is a parallel debugger for large-scale MPI applications, written primarily in Python with some C/C++. I haven’t found time to work on it in quite a while, but I continually find situations where it would be useful.


Xenos was a web-based RPG for which I did back-end PHP (the horror!) and database work between 2005 and 2008, primarily in collaboration with Alistair Lynn, Nick Farley, Taylor Vaughan, and Alec Ingulsrud. At its height, the game had several hundred players.