I am a research scientist in the Informatics Group of the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. Previously, I was an ETH Postdoctoral Fellow at ETH Zurich, working with Professor Torsten Hoefler in the Scalable Parallel Computing Laboratory. I completed my PhD in computer science at the University of Illinois at Urbana-Champaign, advised by Professor Marc Snir. During my PhD and postdoc, I worked heavily with members of the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory and many other collaborators.

My research focuses on the intersection of high-performance computing and machine learning. I am particularly interested in scalable training of deep neural networks and applying neural networks to scientific and computational simulation datasets. I also work on parallel algorithms and runtimes, graph analytics, and communication and performance optimization.

Current students

  • Akanksha Baranwal (masters)
  • Julia Bazinska (masters)

Former students

  • Christoph Amevor (bachelors)
  • Roman Böhringer (bachelors)
  • Jinfan Chen (bachelors)
  • Tobia Claglüna (bachelors)
  • Lukas Ernst (masters)
  • Siméone de Fremond de la Merveillere (masters)
  • Maximilian Fries (bachelors)
  • Cliff Hodel (masters)
  • Simon Jacob (bachelors)
  • George Mtui (masters)
  • Ali Nasser (masters at KAUST)
  • Anton Schäfer (bachelors)
  • Julien Schenkel (bachelors)
  • Stefan Scholbe (bachelors)
  • Peter Tatkowski (masters)
  • Neville Walo (bachelors)
  • Bernhard Walser (masters)
  • Andreas Zingg (masters)

Selected projects


NoPFS

NoPFS, the Near-optimal PreFetching System, is a deep learning I/O middleware that uses clairvoyant prefetching and distributed caching to exploit the full storage hierarchy of a large cluster, reducing I/O overhead during training.
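The key observation behind clairvoyant prefetching is that a training run's shuffled sample access order is fully determined by the random seed, so it is known before the epoch starts and future samples can be fetched into a local cache ahead of need. A minimal single-process sketch of that idea (the names `epoch_order`, `run_epoch`, and `fetch` are illustrative, not NoPFS's API):

```python
import random

def epoch_order(num_samples, seed, epoch):
    """The shuffled access order of an epoch is fully determined by (seed, epoch)."""
    rng = random.Random(hash((seed, epoch)))
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

def run_epoch(order, fetch, depth=4):
    """Yield samples in `order`, fetching up to `depth` positions ahead.

    `fetch(i)` stands in for a (slow) read from shared storage; fetched
    samples sit in a local cache until consumed. A real system would issue
    these fetches asynchronously and spread the cache across nodes.
    """
    cache = {}
    for pos, idx in enumerate(order):
        # Because the access order is known in advance, upcoming samples
        # can be requested before the training loop asks for them.
        for upcoming in order[pos:pos + depth]:
            if upcoming not in cache:
                cache[upcoming] = fetch(upcoming)
        yield cache.pop(idx)
```

Determinism of `epoch_order` is what makes the prefetching "clairvoyant": every process can compute the same order independently and fetch ahead without coordination.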

Deep Weather

I lead the Deep Weather project, which applies deep learning to weather forecasting and to post-processing ensembles from numerical weather prediction systems.


Substation

Substation is an overarching project to develop high-performance transformer implementations. It currently provides the fastest publicly available implementation of BERT-large.


Aluminum

Aluminum is a generic communication framework providing high-performance asynchronous point-to-point and collective operations, especially on GPUs. It offers semantics that are more GPU-friendly than MPI's, along with a suite of latency- and bandwidth-optimized algorithms drawn from both existing libraries and custom implementations. Aluminum has been integrated into both the LBANN deep learning toolkit and the Hydrogen distributed linear algebra library.
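The style of interface this enables, non-blocking collectives that overlap communication with computation, can be illustrated abstractly. This is a generic sketch of the semantics (start an operation, get a request handle, wait only when the result is needed), not Aluminum's actual API, and the in-process "allreduce" is a stand-in for real inter-rank communication:

```python
from concurrent.futures import ThreadPoolExecutor

def allreduce_sum(buffers):
    """Stand-in for a real allreduce: sums corresponding entries across ranks."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

class NonBlockingComm:
    """Illustrates non-blocking collective semantics."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def iallreduce(self, buffers):
        # Returns immediately with a request handle (a future here),
        # so the caller can compute while communication proceeds.
        return self._pool.submit(allreduce_sum, buffers)

comm = NonBlockingComm()
req = comm.iallreduce([[1, 2], [3, 4]])   # communication runs in the background
overlap = sum(range(100))                 # ...while we do other work
result = req.result()                     # wait on the handle only when needed
```

The payoff of this pattern is that communication latency hides behind useful computation, which matters most for the GPU-resident buffers Aluminum targets.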


LBANN

LBANN (the Livermore Big Artificial Neural Network Toolkit) is a research toolkit for scaling the training of deep neural networks on HPC systems. My work includes optimized communication algorithms and patterns, communication sparsification and quantization, generalized distributed-memory convolution algorithms, and more scalable data-parallel training algorithms.


PPL

PPL is an experimental C++11 runtime system for exploring implementation tradeoffs, especially in the context of future exascale systems. See the paper on it below. If you’re interested in further details, contact me.

The name (not very inventively) stands for “Parallel Programming Library”, and is certainly not meant to be confused with the nice people right next door at the other PPL (the Parallel Programming Laboratory).


PGDB

PGDB is a parallel debugger for large-scale MPI applications, written primarily in Python with some C/C++. I haven’t found time to work on it in quite a while, but I continually find situations where it would be useful.


Xenos

Xenos was a web-based RPG for which I did back-end PHP (the horror!) and database work between 2005 and 2008, primarily in collaboration with Alistair Lynn, Nick Farley, Taylor Vaughan, and Alec Ingulsrud. At its height, we had several hundred players.