The techniques behind AQUA@home

Hi everybody, linked to below are two articles describing some of the techniques used in the quantum Monte Carlo code running at AQUA@home. This project started out with the objective of reproducing Peter Young’s results on extracting minimum energy gaps in fairly large systems, and would never have gotten off the ground without his help and advice (and access to his source code!) (ASIDE — also this re-introduced me to the wonderful world of Fortran, the first time I’ve coded in Fortran in about 20 years🙂 ).  Big thanks  also to Helmut Katzgraber and David Gosset who have been invaluable in the evolution of the code to its current point.

The objective of this project is to compute the adiabatic time and minimum energy gap by simulating a specific quantum system (one of our processor circuits) attempting to solve a specific problem instance. Basically the idea is to compute a lower bound on how fast you can run the processor by doing a full quantum simulation of the system. We can very precisely measure all of the terms in our circuits’ Hamiltonians so I have some confidence that these simulations will be predictive.

These simulations are extremely computationally intensive. Currently our volunteers at AQUA@home are donating about 10,000 cores, and as described in the publications linked to below, the application is very highly optimized. Still it can take months to get statistically significant results for the larger problem sizes, for a single setting of parameters. Changing the actual chip being simulated, or trying a different annealing path, requires a different simulation.

We’re currently writing up the first results obtained using the technique, on a type of random spin glass… stay tuned.

Here are the two papers recently posted to arxiv:

High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context

This paper presents two conceptually simple methods for parallelizing a Parallel Tempering Monte Carlo simulation in a distributed volunteer computing context, where computers belonging to the general public are used. The first method uses conventional multi-threading. The second method uses CUDA, a graphics card computing system. Parallel Tempering is described, and challenges such as parallel random number generation and mapping of Monte Carlo chains to different threads are explained. While conventional multi-threading on CPUs is well-established, GPGPU programming techniques and technologies are still developing and present several challenges, such as the effective use of a relatively large number of threads. Having multiple chains in Parallel Tempering allows parallelization in a manner that is similar to the serial algorithm. Volunteer computing introduces important constraints to high performance computing, and we show that both versions of the application are able to adapt themselves to the varying and unpredictable computing resources of volunteers’ computers, while leaving the machines responsive enough to use. We present experiments to show the scalable performance of these two approaches, and indicate that the efficiency of the methods increases with bigger problem sizes.

Importance of Explicit Vectorization for CPU and GPU Software Performance

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to speedup from multi-threading. This is 2x faster than the fully-optimized GPU version.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s