A couple weeks ago, the professor for my Statistical Optimal Transport class said that there would be a conference on sampling happening at Yale over the fall break. The timing was fortuitous; I’ve been engrossed in reading my professor’s books and papers, but his perspective on the field is naturally limited. Attending this conference would give me a broader view of the field as a whole.

Day 1

On Day 1, the conference took place on the 14th floor of Klein Tower, Yale’s largest academic building. The 14th floor is the faculty lounge, but it doubles as a nice venue for conferences, department holiday parties, and expos. It’s gorgeous. Due to the intimate nature of the conference (I would estimate around 40 to 50 attendees), the entire event was split between two rooms and their connecting hallways: one large, spacious room for the presentations, and another for food and beverage. The presentation room is particularly striking, with ceiling-high windows surrounding the perimeter, offering an unparalleled view of Science Hill, the lush tree cover of Hillhouse Avenue to the south, and the pseudo-Victorian red and tan buildings of Prospect Street to the southwest.

Courtesy of being a Yale graduate student, my registration fee for the conference was waived (including the evening banquet—though I did have to help set up for Day 2’s poster session in order to get that freebie thrown in). My primary goal for the conference was to expose myself to the field of sampling. What sorts of questions is the field currently interested in? What fields of study—statistics, math, CS, physics—have a perspective on sampling?

The two faculty hosts of the conference were Sinho Chewi and Andre Wibisono, two Yale professors from the Statistics and Computer Science departments, respectively. Sinho is the professor for my Statistical Optimal Transport class and is the reason I knew about the conference in the first place. I had never met Andre before attending the conference (I did briefly introduce myself) though his work had been cited multiple times in class. I trust their judgement in terms of whose work in sampling is worth taking a closer look at.

I had the secondary goal of socializing and “networking” at the conference. But this is always a pipe dream, I’m afraid. I always go into conferences with the goal of being a Fake Extrovert before inevitably burning out and finding a nice dark corner to listen to music. (Spoiler: It happened this time too.)

The topics covered on Day 1 felt broad—and I intend that as a compliment. There seemed to be a conscientious effort to group the presentations based on similarity of subject matter. The morning was focused on Markov Chain Monte Carlo (MCMC). MCMC is a technique for sampling from a distribution where you construct a Markov chain (a sequence where the probability of the next location only depends on the current location) whose stationary distribution is the target distribution.
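To make that concrete, here’s a minimal sketch of the simplest MCMC algorithm I know, random-walk Metropolis, in Python. The target density, step size, and chain length are placeholder choices of mine, not anything from the talks.

```python
import numpy as np

def log_target(x):
    # Placeholder target: a standard 1D Gaussian, log-density up to a constant.
    return -0.5 * x**2

def random_walk_metropolis(n_steps=10_000, step_size=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0                          # current state of the Markov chain
    samples = np.empty(n_steps)
    for i in range(n_steps):
        proposal = x + step_size * rng.normal()        # propose a local move
        log_accept = log_target(proposal) - log_target(x)
        if np.log(rng.random()) < log_accept:          # Metropolis accept/reject
            x = proposal
        samples[i] = x                                 # either way, record the state
    return samples

samples = random_walk_metropolis()
print(samples.mean(), samples.std())  # should come out near 0 and 1
```

The accept/reject step is what guarantees the chain’s stationary distribution is the target; almost everything fancier I saw at the conference is, at some level, a smarter way of proposing the moves.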

The first speaker was Gareth Roberts, who has been doing work in MCMC at least since the 90s (if not earlier). Sinho even credited Roberts with being one of the people who originally got him into the field. While the meat of Roberts’s talk didn’t quite make sense to me, I did get exposure to some cool ideas in MCMC. One was that of non-reversible Markov chains. In more traditional MCMC algorithms, the Markov chain is reversible: for any sequence of points visited by the chain, the reversed sequence is equally probable. (This property is also known as detailed balance.) But this type of behavior is inherently “diffusive,” which can be a bit too slow depending on the context. So under certain conditions, non-reversible algorithms can outperform reversible ones because the chain explores the state space more quickly.
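In symbols (my notation, not necessarily Roberts’s): a chain with transition kernel $P$ and stationary distribution $\pi$ satisfies detailed balance if

$$\pi(x)\,P(x, y) = \pi(y)\,P(y, x) \quad \text{for all } x, y.$$

Non-reversible chains drop this condition and keep only the weaker requirement that $\pi$ is stationary, i.e. $\int \pi(x)\,P(x, y)\,dx = \pi(y)$.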

Roberts also provided a clear explanation of simulated tempering: a way to handle sampling when you have a multi-modal distribution. Naive algorithms struggle with probability landscapes that have multiple local maxima, so there are many research programs dedicated to different ways of tackling this fundamental challenge of sampling. I’ve come across simulated tempering before in the context of machine learning, but I had never seen the mathematical formalism before, and to my pleasant surprise, it was pretty approachable. I didn’t have the same luck with parallel tempering, unfortunately.
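My (possibly oversimplified) mental model of simulated tempering is the following sketch: augment the state with a temperature index, take ordinary Metropolis moves at the current temperature, and occasionally propose hopping to a neighboring temperature. The bimodal target, temperature ladder, and uniform pseudo-prior weights below are all placeholder choices of mine.

```python
import numpy as np

def log_target(x):
    # Placeholder bimodal target: mixture of two well-separated Gaussians.
    return np.logaddexp(-0.5 * (x - 4.0)**2, -0.5 * (x + 4.0)**2)

def simulated_tempering(n_steps=50_000, betas=(1.0, 0.5, 0.2, 0.05),
                        step_size=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, k = 0.0, 0                  # state = (position, temperature index)
    cold_samples = []
    for _ in range(n_steps):
        # (1) Metropolis move in x at the current inverse temperature beta_k.
        prop = x + step_size * rng.normal()
        if np.log(rng.random()) < betas[k] * (log_target(prop) - log_target(x)):
            x = prop
        # (2) Propose hopping to a neighboring temperature.
        k_new = k + rng.choice([-1, 1])
        if 0 <= k_new < len(betas):
            # Uniform pseudo-prior weights here; real implementations tune these
            # so the chain spends comparable time at each temperature.
            if np.log(rng.random()) < (betas[k_new] - betas[k]) * log_target(x):
                k = k_new
        if k == 0:                 # only keep samples drawn at beta = 1
            cold_samples.append(x)
    return np.array(cold_samples)

samples = simulated_tempering()
print((samples > 0).mean())  # the chain should visit both modes
```

The point is that at high temperature (small beta) the landscape is flattened, so the chain can wander between modes, and the temperature hops carry that mixing back down to the cold chain you actually care about.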

I also really enjoyed the talk by Dan Lacker, a professor at Columbia. Among the presenters I saw on Day 1, his work was the most closely aligned with what I find interesting. His work involved entropic optimal transport (optimal transport with an extra entropic regularization term) as well as variational inference (VI). With VI, rather than optimizing over the space of all probability distributions, you limit yourself to some subset, e.g. the Bures-Wasserstein space (the space of Gaussians).
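The only hands-on exposure I’ve had to entropic OT is the discrete case, where the regularized problem can be solved by Sinkhorn iterations. Here’s a minimal sketch; the cost matrix, marginals, and regularization strength are toy choices of mine and have nothing to do with Lacker’s talk.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.1, n_iters=500):
    """Entropic OT between discrete distributions a and b: minimize
    <P, cost> - eps * H(P) over couplings P with marginals a and b."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # alternately rescale to match marginals
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # the (approximately) optimal coupling

# Toy example: two small discrete distributions on a line, squared-distance cost.
x = np.linspace(0.0, 1.0, 5)
cost = (x[:, None] - x[None, :]) ** 2
a = np.full(5, 0.2)
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
P = sinkhorn(a, b, cost)
print(P.sum(axis=1), P.sum(axis=0))      # marginals should be close to a and b
```

As eps shrinks, the coupling concentrates toward the unregularized optimal transport plan; the entropy term is what makes the problem smooth enough for these simple alternating updates to work.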

Lacker referenced Schrödinger’s bridge, which I had heard mentioned in class before, but it didn’t conceptually click with me until his presentation. My recollection is that Schrödinger proposed the following thought experiment: if you have a gas in one state and then some time later the gas is in another state (by diffusing, let’s say), what are the likely intermediate states of the gas? Nowadays, we can recognize the connection with optimal transport and stochastic differential equations. But Schrödinger proposed his thought experiment without all of that mathematical machinery.

The early afternoon talks had to do with graphs. That’s about all I got out of them. The last couple of talks were on diffusion models. I understand the basic conceptual idea of diffusion models, but I’m really interested in learning how they work on a more technical level. One of the papers that got me to start looking in the direction of this field—“Renormalizing Diffusion Models” by Jordan Cotler and Semon Rezchikov—draws an analogy between inverting renormalization group transformations (running them from low energy back up to high energy) and what a diffusion model does when it learns to reverse the noising process. Again, there is a long way for me to go before I can fully engage on a technical level, but this is the type of work that I would like to do in the future.

One thing that stood out to me was just how useful my physics background was for quickly understanding what was “really” going on with many of the different sampling techniques. Whether it’s Hamiltonian Monte Carlo or the Donsker-Varadhan variational representation, being able to translate the language of sampling and statistics back into physics never stopped being useful.

What’s nice is how interdisciplinary sampling is as a field. I already mentioned how helpful physics is for interpreting the various sampling algorithms. But the various professors I talked to were from many different departments. There were the statisticians, obviously. But also a lot of CS people, applied mathematicians, and even the odd engineer.

Day 2

This was a tougher day for me. The previous day of conferencing had taken a lot out of me. The mental strain of following along with the highly technical talks, the fake extroverting, and the schmoozing with other academics at the previous night’s banquet while trying not to talk with my mouth full of cheese-filled baked potato—it all left me feeling a little bit drained.

The second day of the conference wasn’t on the 14th floor, but the 13th floor. Still a nice space, but not the same expansive, luxurious venue. I stumbled in around 10:30 AM, halfway through the keynote speaker’s talk. I didn’t want to interrupt, so I watched the talk in the dining room where it was being livestreamed on a huge flat-screen TV. I didn’t seem to have missed much; the talk was too technical for me to follow along anyway. I did recognize standard notions like the Dirichlet energy, the log-Sobolev inequality, etc., that I had been exposed to in my Statistical Optimal Transport class.

The afternoon talks seemed to be focused on different modifications of Hamiltonian Monte Carlo. “Normal” MCMC is all about creating a Markov chain whose stationary distribution is the target distribution. Hamiltonian Monte Carlo keeps a “position” space, which represents the original space, and adds a momentum space on top of it. The difference is that in an algorithm like the Metropolis-Adjusted Langevin Algorithm (MALA), the gradient of the potential updates the position directly, whereas in HMC the gradient updates the momentum and the momentum updates the position. It replaces the first-order dynamics of normal MCMC with second-order dynamics. The idea is that if you integrate out the momentum degrees of freedom, the marginal distribution over positions is your target distribution. I’m quite new to Hamiltonian Monte Carlo (like, “I learned about it last week” levels of new), so I don’t understand in detail what it has to offer over normal MCMC methods.
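Here’s roughly how I understand the contrast, written out as code. The target, step sizes, and number of leapfrog steps are placeholders of mine; this is a bare-bones sketch, not how any of the speakers would implement it.

```python
import numpy as np

def log_pi(x):
    # Placeholder target: standard Gaussian log-density (up to a constant).
    return -0.5 * np.dot(x, x)

def grad_log_pi(x):
    return -x

def mala_step(x, h, rng):
    # MALA: the gradient nudges the position directly, plus Gaussian noise,
    # followed by a Metropolis-Hastings correction.
    prop = x + h * grad_log_pi(x) + np.sqrt(2 * h) * rng.normal(size=x.shape)
    def log_q(b, a):  # log density (up to a constant) of proposing b from a
        return -np.sum((b - a - h * grad_log_pi(a)) ** 2) / (4 * h)
    log_alpha = (log_pi(prop) + log_q(x, prop)) - (log_pi(x) + log_q(prop, x))
    return prop if np.log(rng.random()) < log_alpha else x

def hmc_step(x, step_size, n_leapfrog, rng):
    # HMC: the gradient kicks the momentum, and the momentum moves the position.
    p = rng.normal(size=x.shape)                               # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new = p_new + 0.5 * step_size * grad_log_pi(x_new)       # initial half kick
    for _ in range(n_leapfrog):
        x_new = x_new + step_size * p_new                      # drift
        p_new = p_new + step_size * grad_log_pi(x_new)         # full kick
    p_new = p_new - 0.5 * step_size * grad_log_pi(x_new)       # trim last kick to a half
    # Accept/reject on the change in H(x, p) = -log pi(x) + |p|^2 / 2.
    log_alpha = (log_pi(x_new) - 0.5 * np.dot(p_new, p_new)) - \
                (log_pi(x) - 0.5 * np.dot(p, p))
    return x_new if np.log(rng.random()) < log_alpha else x

rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):
    x = hmc_step(x, step_size=0.2, n_leapfrog=10, rng=rng)
```

Even in this toy form you can see the structural difference: MALA takes one noisy gradient step per proposal, while HMC strings many deterministic gradient kicks together before a single accept/reject, which is what lets it propose moves far from the current point.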

A lot of the talks seemed centered around a modification of Hamiltonian Monte Carlo called NUTS, the No-U-Turn Sampler. There is an explanation here, though I can’t pretend to understand the core insight behind why it’s better than normal HMC. There was a talk given by Bob Carpenter on GIST, which, from what I gathered, is a general framework for understanding HMC with variable step size, trajectory length, etc., and which has NUTS as a special case. This is important because working in the GIST framework lets you understand various theoretical guarantees better than working with NUTS directly (that’s my hazy recollection of the talk; errors are mine). This was super neat, as my first introduction to academic statistics was Andrew Gelman’s blog, where Bob Carpenter is a frequent guest blogger. It was cool to finally see him in the flesh.

One strange thing was seeing so many experts roughly the same age as me. The conference skewed fairly young, not only in terms of the composition of graduate students versus professors, but also among the professors themselves. A lot of the professors, both those I talked to and those giving presentations, were assistant professors in the first couple of years of their positions. There was something sort of eerie about people in their late twenties and early thirties going up to the front of the room and commanding it like that. Maybe that will be me someday!