Introduction
Gaussian processes (GPs) provide a powerful Bayesian framework for regression with principled uncertainty quantification. Despite their theoretical appeal, they face a critical computational bottleneck: inverting the $ n \times n $ kernel matrix requires $O(n^3)$ operations, rendering standard GPs impractical for large datasets.
While various approximation methods have extended GPs to larger datasets, they typically fail to account for the uncertainty introduced by their own approximations. This oversight can lead to overconfident predictions and potentially flawed decisions in critical applications.
Iterative Gaussian processes, particularly the IterGP framework by Wenger et al., address this gap: they make the error introduced by limited computation part of the model itself, treating it as an additional, explicitly quantified source of uncertainty.
In this article, we’ll explore how IterGP reframes GP inference as an iterative process, treating representer weights as objects of Bayesian inference and updating beliefs through carefully chosen projections. We’ll examine how different action policies influence convergence behavior and uncertainty distribution, with a focus on applications to computationally demanding problems like physics simulations.
Scaling up Gaussian Distributions
To understand Gaussian processes, it is instructive to begin with fundamental probabilistic structures and gradually progress toward more complex formulations. Let us commence with the univariate Gaussian distribution, formally denoted as $X \sim \mathcal{N}(0,1)$, which represents a random variable characterized by a mean of zero and unit variance. When extending this construct to two dimensions, we encounter the bivariate Gaussian distribution, which necessitates not only a mean vector $\boldsymbol{\mu}$, but also a covariance matrix $\boldsymbol{\Sigma}$ that quantifies both the variance of individual dimensions and their interdependence. This distribution is formally expressed as $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
The visualization of higher-dimensional Gaussian distributions presents significant challenges. For conceptual clarity, consider three independent univariate Gaussian distributions, each governing the position along a distinct spatial dimension. This configuration can be conceptualized as a three-dimensional Gaussian distribution exhibiting statistical independence between dimensions. In this scenario, the covariance matrix $\boldsymbol{\Sigma}$ adopts a diagonal structure, where each diagonal element represents the variance of its corresponding dimension, while all off-diagonal elements—representing covariances between dimensions—are precisely zero, reflecting the absence of interdimensional relationships.
As we extend our multivariate Gaussian distribution by incorporating additional dimensions, each dimension manifests as a new vertical Gaussian distribution along our horizontal axis. The spacing between these vertical distributions progressively decreases as the number of dimensions increases. When extrapolated to the limit of infinite dimensions, these distributions form a continuum that spans the entire horizontal domain. This mathematical progression yields a single infinite-dimensional distribution defined over a function space. We no longer sample a single data point but an entire function. In practical visualizations, this complexity is typically represented through summary statistics—primarily the mean function and associated standard deviation bounds. It is important to note that computational implementations necessarily discretize this theoretical construct, as infinite-dimensional calculations exceed the capabilities of contemporary computing systems.
For now this model is not very interesting to work with. The independence of dimensions implies that modifying the mean parameter of a specific dimension—for instance, shifting from $0$ to $1$—results in a corresponding upward translation of only that particular vertical Gaussian distribution. Concurrently, all other vertical Gaussian distributions maintain their original positions, exhibiting no response to this parametric adjustment due to the absence of cross-dimensional dependencies.
The substantive transformation emerges when we introduce interdimensional correlations. The objective is to establish relational dependencies between dimensions such that perturbations in one dimension induce corresponding shifts in adjacent dimensions. This interdependence can be formalized through the implementation of a covariance function that quantifies the interaction between neighboring dimensions. Such a covariance function, commonly referred to as a kernel, establishes the mathematical basis for these relationships. The desired smoothness properties can be elegantly captured using a Radial Basis Function (RBF) kernel, mathematically expressed as: $$ k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2l^2}\right) $$ It is important to note that the kernel function operates without explicit knowledge of dimensional adjacency or other conceptual structures imposed for interpretive purposes. Rather, it quantifies the distance between input locations ($\mathbf{x}$ and $\mathbf{x}'$), which in our visualization corresponds to the spatial separation between vertical distributions. The correlation strength diminishes as the distance between distributions increases. The lengthscale parameter $l$ regulates the rate at which correlation decays with distance—larger values of $l$ yield smoother functions as the interdimensional influence extends across greater distances. In essence, an increased lengthscale results in more extensive propagation of effects across neighboring functions. The parameter $\sigma$ modulates the amplitude of the interdimensional relationship; higher values of $\sigma$ intensify the coupling between neighboring dimensions, resulting in stronger correlations.
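To make this concrete, here is a minimal sketch in Python/NumPy of the RBF kernel and of sampling "entire functions" from the resulting discretized Gaussian. The grid size, hyperparameters, and the small jitter added for numerical stability are illustrative choices, not prescribed by the text.

```python
# Minimal sketch: RBF kernel on a 1-D grid and function samples from the (discretized) GP prior.
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, lengthscale=1.0):
    """RBF kernel k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 * l^2)) for 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return sigma**2 * np.exp(-sq_dist / (2 * lengthscale**2))

# Discretize the "infinite-dimensional" Gaussian on a grid of input locations.
x = np.linspace(-5, 5, 200)
K = rbf_kernel(x, x)

# Each sample from this multivariate Gaussian is an entire function evaluated on the grid.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros_like(x),
                                  cov=K + 1e-8 * np.eye(len(x)),  # jitter for numerical stability
                                  size=3)
```

Plotting the rows of `samples` against `x` shows smooth random functions; increasing the lengthscale makes them smoother, exactly as described above.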
Let’s briefly recap what we have learned so far:
- A Gaussian process is just an infinite-dimensional Gaussian distribution, and samples from it are whole functions rather than single points
- The kernel function changes the correlation between dimensions by adjusting the covariance matrix
While this forms a nice framework, the most important aspect is still missing. How can we use the kernel function to actually move our distributions around? How can we introduce fixed points?
Conditioning to actually get something
Having established the conceptual foundation of Gaussian processes as infinite-dimensional Gaussian distributions governed by kernel functions, we now turn to the crucial mechanism that transforms this theoretical construct into a practical predictive tool: conditioning.
Conditioning represents the mathematical process by which we incorporate observed data points into our model, effectively constraining the distribution of possible functions to pass through (or near) these points. This fundamental operation distinguishes Gaussian processes as powerful Bayesian non-parametric models capable of adapting to observed evidence.
In the context of our dimensional visualization, conditioning can be understood as fixing the position of specific vertical Gaussian distributions at predetermined values. When we condition our Gaussian process on an observation—for instance, specifying that at input location $x_1$, the function value equals $y_1$—we are effectively imposing a constraint that all plausible functions must honor this relationship.
The mathematical formulation of conditioning follows directly from the properties of multivariate Gaussian distributions. Given a joint Gaussian distribution over the function values $\mathbf{y}$ at observed points $\mathbf{X}$ and $\mathbf{y}_*$ at unobserved points $\mathbf{X}_*$:
$$ \begin{bmatrix} \mathbf{y} \\ \mathbf{y}_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} K(\mathbf{X}, \mathbf{X}) & K(\mathbf{X}, \mathbf{X}_*) \\ K(\mathbf{X}_*, \mathbf{X}) & K(\mathbf{X}_*, \mathbf{X}_*) \end{bmatrix} \right) $$
The conditional distribution $p(\mathbf{y}_* | \mathbf{y})$ is also Gaussian with parameters:
$$ \boldsymbol{\mu}_* \mid \mathbf{y} = K(\mathbf{X}_*, \mathbf{X})\, K(\mathbf{X}, \mathbf{X})^{-1}\, \mathbf{y} $$
$$ \Sigma_* \mid \mathbf{y} = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})\, K(\mathbf{X}, \mathbf{X})^{-1}\, K(\mathbf{X}, \mathbf{X}_*) $$
The implications of these equations are profound. The conditional mean $\boldsymbol{\mu}_* \mid \mathbf{y}$ represents our best prediction for function values at unobserved locations, while the conditional covariance $\Sigma_* \mid \mathbf{y}$ quantifies our uncertainty about these predictions.
The effect of conditioning manifests in two significant ways:
Mean Function Adjustment: The conditional mean function passes through (or near) all observed data points, with the surrounding regions influenced according to the kernel-defined correlation structure. Locations closer to observed points are influenced more strongly than distant ones.
Uncertainty Reduction: The conditional variance reaches its minimum at observed data points (zero in the noiseless case) and progressively increases as we move away from these points. This creates the characteristic “pinching” effect visible in Gaussian process visualizations, where the confidence bands narrow at observations and widen elsewhere.
In our visualization of dimensions as vertical Gaussian distributions, conditioning effectively transforms selected distributions from probabilistic entities into deterministic points with fixed values. The interdimensional correlations established by our kernel function then propagate the influence of these fixed points to neighboring dimensions, adjusting their means and reducing their variances proportionally to their proximity to the conditioned points.
This conditioning mechanism is what enables Gaussian processes to function as flexible regression models. By observing function values at specific input locations, we constrain the posterior distribution to favor functions that explain our observations while maintaining the smoothness properties encoded in our kernel function. The unobserved regions between and beyond our data points are then predicted according to these constraints, with accompanying uncertainty estimates that reflect the distance from observed evidence.
The mathematical elegance of Gaussian process conditioning lies in its closed-form solution, which directly translates our prior beliefs (encoded in the kernel) and observed evidence into a posterior distribution over functions. This Bayesian approach provides not just predictions, but a full probabilistic characterization of all functions consistent with our data and prior assumptions.
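The closed-form solution translates directly into a few lines of code. The sketch below reuses the `rbf_kernel` from the earlier snippet with made-up training data and a small jitter term on the diagonal; it is a minimal noiseless example, not a production implementation.

```python
# Minimal sketch of closed-form GP conditioning (noiseless, with jitter for stability).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

X_train = np.array([-3.0, -1.0, 2.0])   # observed input locations (illustrative)
y_train = np.array([0.5, -0.7, 1.2])    # observed function values (illustrative)
X_test = np.linspace(-5, 5, 200)

K = rbf_kernel(X_train, X_train) + 1e-6 * np.eye(len(X_train))  # K(X, X) + jitter
K_s = rbf_kernel(X_test, X_train)                               # K(X_*, X)
K_ss = rbf_kernel(X_test, X_test)                               # K(X_*, X_*)

# Posterior mean and covariance via a Cholesky factorization of K(X, X).
K_chol = cho_factor(K, lower=True)
mean_post = K_s @ cho_solve(K_chol, y_train)        # K(X_*, X) K(X, X)^{-1} y
cov_post = K_ss - K_s @ cho_solve(K_chol, K_s.T)    # K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)
std_post = np.sqrt(np.diag(cov_post))               # the "pinching" confidence bands
```

The posterior standard deviation drops to (numerically) zero at the three training inputs and grows with distance from them, which is exactly the pinching effect described above.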
Prefer small but continuous improvements over one big leap
The following sections describe IterGP by Wenger et al.
The $O(n^3)$ problem
While Gaussian processes provide an elegant framework for non-parametric Bayesian regression, they face a significant computational challenge that limits their application to large datasets. This limitation arises directly from the mathematical formulation we’ve just explored.
The computational bottleneck occurs when computing the conditional distribution, which requires inverting the kernel matrix $K(\mathbf{X}, \mathbf{X})$. For a dataset with $n$ observations, this matrix has dimensions $n \times n$, and its inversion traditionally requires $O(n^3)$ operations using standard techniques such as Cholesky decomposition. Additionally, storing this matrix requires $O(n^2)$ memory.
To put this in perspective:
- With 1,000 data points, the computation remains feasible on modern hardware
- With 10,000 data points, the computation becomes significantly more expensive
- With 100,000 data points, a size common in modern datasets, direct computation becomes practically impossible, both in compute time and in memory (a quick back-of-the-envelope sketch follows below).
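The following rough calculation makes the scaling tangible; the constants are approximate (a Cholesky factorization costs about $n^3/3$ floating-point operations, and each matrix entry takes 8 bytes in double precision).

```python
# Approximate storage and factorization cost of an n x n kernel matrix in double precision.
for n in (1_000, 10_000, 100_000):
    memory_gb = n * n * 8 / 1e9      # O(n^2) storage
    chol_flops = n**3 / 3            # O(n^3) work for a Cholesky factorization
    print(f"n = {n:>7}: ~{memory_gb:10.3f} GB, ~{chol_flops:.1e} flops")
```

At $n = 100{,}000$ the kernel matrix alone occupies roughly 80 GB of memory, before a single factorization step has been performed.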
$v^*$ - the representer weights
Having identified the computational challenge of standard Gaussian process regression, we now direct our attention to a key mathematical structure that will become the cornerstone of our iterative approach: the representer weights. The representer weights, denoted as $ v^* $, emerge when we examine the conditional mean function more carefully. Recall that our posterior mean at test points $\mathbf{X}_*$ is given by:
$$ \boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X})\, K(\mathbf{X}, \mathbf{X})^{-1}\, \mathbf{y} = K(\mathbf{X}_*, \mathbf{X})\, v^*, \qquad \text{where} \quad v^* := K(\mathbf{X}, \mathbf{X})^{-1}\, \mathbf{y}. $$
This reformulation reveals something profound about Gaussian processes. The posterior mean at any point can be expressed as a weighted sum of kernel functions centered at our training points, where the weights are precisely the components of $ v^* $. This is the essence of the representer theorem in action—our optimal prediction is represented as a linear combination of kernel functions.
The representer weights encapsulate all the information from our observations that we need to make predictions. If we knew $ v^* $ exactly, we could predict the mean at any test point without needing to explicitly compute or store the inverse kernel matrix. This insight is crucial because it suggests an alternative approach to GP regression: rather than directly computing the full inverse of the kernel matrix, we could focus on obtaining a good approximation of $ v^* $.
The calculation of $ v^* $ requires solving the linear system:
$$ K(\mathbf{X}, \mathbf{X})\, v^* = \mathbf{y}, $$
which is exactly where the $O(n^3)$ cost of standard GP regression hides.
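As a small illustration (reusing the hypothetical `rbf_kernel`, `X_train`, and `y_train` from the earlier conditioning sketch), the representer form lets us predict at arbitrary inputs from a single linear solve:

```python
# Sketch: compute the representer weights once, then predict via weighted sums of kernel functions.
import numpy as np

K = rbf_kernel(X_train, X_train) + 1e-6 * np.eye(len(X_train))  # kernel matrix (+ jitter)

# Solve K(X, X) v* = y directly; no explicit inverse is ever formed or stored.
v_star = np.linalg.solve(K, y_train)

def posterior_mean(x_new):
    """Posterior mean as a weighted sum of kernel functions centered at the training points."""
    return rbf_kernel(np.atleast_1d(x_new), X_train) @ v_star
```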
Moreover, the posterior covariance can be reformulated in a closely related way. If we define a matrix $ C_i $ that acts as an approximate inverse of $ K(\mathbf{X}, \mathbf{X}) $ (which we’ll explore more in subsequent sections), we get:
$$ \Sigma_* \approx K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})\, C_i\, K(\mathbf{X}, \mathbf{X}_*). $$
This perspective on Gaussian processes—focusing on representer weights rather than direct matrix inversion—forms the theoretical foundation for iterative GP methods. Instead of trying to compute the full inverse matrix, we’ll progressively refine our estimate of $ v^* $ through carefully chosen computations, yielding significant efficiency gains while maintaining principled uncertainty quantification.
In the next section, we’ll explore how we can iteratively update our belief about these representer weights, transitioning from the standard GP formulation to a computation-aware perspective that acknowledges and quantifies the uncertainty arising from our computational approximations.
Moving Beyond Direct Computation
Having established the central role of the representer weights $ v^* $ in Gaussian process prediction, we now turn our attention to the iterative approach for approximating these weights. Rather than computing the exact $ v^* $ in one computationally expensive step, iterative methods progressively refine an estimate through a sequence of more manageable calculations.
The fundamental insight underpinning iterative Gaussian processes is to treat the representer weights themselves as objects of Bayesian inference. Instead of viewing $ v^* $ as a fixed, unknown vector to be computed exactly, we conceptualize it as a random variable with an associated probability distribution that reflects our uncertainty about its true value.
We begin with a prior belief about $ v^* $, typically centered at zero with a covariance structure informed by the kernel matrix. Formally, this prior is expressed as:
$$ p(v^*) = \mathcal{N}\!\left(v^*;\, v_0, \Sigma_0\right), \qquad v_0 = \mathbf{0}, \quad \Sigma_0 = K(\mathbf{X}, \mathbf{X})^{-1}. $$
Starting from this prior, we refine our belief through a sequence of carefully chosen “observations”.
These “observations” in the context of iterative GPs are not additional data points, but rather projections of the current residual vector. The residual at iteration $ i-1 $ is defined as:
$$ r_{i-1} = \mathbf{y} - K(\mathbf{X}, \mathbf{X})\, v_{i-1}, $$
and the quantity we observe in iteration $ i $ is its projection onto the chosen action $ s_i $:
$$ \alpha_i = s_i^\top r_{i-1}. $$
After computing $ \alpha_i $, we can perform an exact Bayesian update on our belief about $ v^* $. The updated belief $ p(v^*) = \mathcal{N}(v^*; v_i, \Sigma_i) $ has parameters:
$$ v_i = v_{i-1} + \frac{\alpha_i}{\eta_i}\, d_i, \qquad \Sigma_i = \Sigma_{i-1} - \frac{1}{\eta_i}\, d_i d_i^\top, $$
where $ d_i = \Sigma_{i-1} K(\mathbf{X}, \mathbf{X})\, s_i $ is the search direction and $ \eta_i = s_i^\top K(\mathbf{X}, \mathbf{X})\, d_i $ is a normalization constant. In practice we never form $ \Sigma_0 = K(\mathbf{X}, \mathbf{X})^{-1} $; instead we track the low-rank matrix $ C_i = \Sigma_0 - \Sigma_i $, which is built up incrementally as $ C_i = C_{i-1} + \frac{1}{\eta_i} d_i d_i^\top $ with $ d_i = (I - C_{i-1} K(\mathbf{X}, \mathbf{X}))\, s_i $.
With each iteration, our estimate $ v_i $ approaches the true $ v^* $, while our uncertainty about this estimate, encoded in $ \Sigma_i $, progressively contracts. After $ n $ iterations (where $ n $ is the number of data points), assuming linearly independent actions, we recover the exact $ v^* $ with zero uncertainty.
This iterative process can be interpreted geometrically as projecting the representer weights onto an expanding subspace spanned by our chosen actions $ \{s_1, \ldots, s_i\} $. Each iteration adds a new dimension to this subspace, improving our approximation of $ v^* $.
The choice of action vectors $s_i $ defines different variants of iterative GPs, each with distinct convergence properties and computational characteristics. Common choices include standard unit vectors (yielding a method equivalent to Cholesky factorization), conjugate gradient directions, or kernel functions centered at inducing points. This flexibility allows the iterative approach to be adapted to the specific structure and requirements of different applications, including the physics simulations that motivate our exploration.
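To make the update concrete, here is a minimal NumPy sketch of the iteration described above. It assumes a (noiseless, jittered) kernel matrix `K` and targets `y`; the function name, policy strings, and early-stopping tolerance are illustrative choices and not the reference IterGP implementation.

```python
# Minimal sketch of the iterative belief update over the representer weights.
import numpy as np

def itergp_sketch(K, y, num_iters, policy="cg"):
    n = len(y)
    v = np.zeros(n)          # current estimate v_i of the representer weights
    C = np.zeros((n, n))     # C_i: low-rank approximation of K^{-1} built from the actions
    for i in range(num_iters):
        r = y - K @ v                        # residual r_{i-1} = y - K v_{i-1}
        if policy == "cg":                   # residual actions (conjugate-gradient-like)
            s = r
        elif policy == "chol":               # unit vector actions (Cholesky-like)
            s = np.eye(n)[:, i]
        else:                                # randomized actions
            s = np.random.default_rng(i).standard_normal(n)
        alpha = s @ r                        # observation alpha_i = s_i^T r_{i-1}
        d = s - C @ (K @ s)                  # search direction d_i = (I - C_{i-1} K) s_i
        eta = s @ (K @ d)                    # normalization eta_i = s_i^T K d_i
        if eta <= 1e-12:                     # residual (numerically) explained: stop early
            break
        v = v + (alpha / eta) * d            # updated mean of the belief over v*
        C = C + np.outer(d, d) / eta         # C_i = C_{i-1} + d_i d_i^T / eta_i
    return v, C
```

Note that the code tracks $C_i$ rather than $\Sigma_i$, since $\Sigma_0 = K(\mathbf{X}, \mathbf{X})^{-1}$ is precisely the object we cannot afford to form.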
In the next section, we will examine how this iterative approximation of representer weights affects our predictions and, critically, how we can quantify the additional uncertainty introduced by the computational approximation itself.
Quantifying Uncertainty
Having established the iterative approach to approximating the representer weights, we now confront a profound question: How does this approximation affect our uncertainty about the underlying function? In the standard Gaussian process formulation, the posterior variance captures our uncertainty due to limited observations. However, in the iterative approach, an additional source of uncertainty emerges—the uncertainty stemming from our limited computational budget.
The key insight of the iterative Gaussian process framework is that this computational uncertainty can and should be quantified in a principled Bayesian manner. When we have performed only $ i < n $ iterations, our estimate $ v_i $ of the representer weights is imperfect, and the distribution $ p(v^*) = \mathcal{N}(v^*; v_i, \Sigma_i) $ explicitly captures our remaining uncertainty about these weights.
To account for this computational uncertainty, we treat the representer weights as a latent variable in our predictive model. Instead of conditioning solely on the data, we now condition on both the data and the computations we have performed. The resulting predictive distribution is obtained by marginalizing over all possible values of $ v^* $ consistent with our computations:
$$ p(f_* \mid \mathbf{y}, s_1, \ldots, s_i) = \int p(f_* \mid v^*)\, p(v^* \mid \mathbf{y}, s_1, \ldots, s_i)\, \mathrm{d}v^*. $$
This yields a Gaussian with mean $ K(\mathbf{X}_*, \mathbf{X})\, v_i $ and a covariance that splits into two interpretable parts:
$$ \underbrace{K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})\, K(\mathbf{X}, \mathbf{X})^{-1} K(\mathbf{X}, \mathbf{X}_*)}_{\text{mathematical uncertainty (limited data)}} \;+\; \underbrace{K(\mathbf{X}_*, \mathbf{X}) \left( K(\mathbf{X}, \mathbf{X})^{-1} - C_i \right) K(\mathbf{X}, \mathbf{X}_*)}_{\text{computational uncertainty (limited computation)}} \;=\; K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})\, C_i\, K(\mathbf{X}, \mathbf{X}_*). $$
This decomposition is not merely a theoretical curiosity but has profound practical implications. The combined uncertainty provides a principled, tight worst-case bound on the error between our approximate posterior mean and the latent function. Formally, for any input $ x $:
$$ |\mu_i(x) - f(x)| \;\le\; \sigma_i(x)\, \lVert f \rVert_{\mathcal{H}_k}, $$
where $ \mu_i $ is the approximate posterior mean after $ i $ iterations, $ \sigma_i(x) $ is the combined standard deviation, and $ \lVert f \rVert_{\mathcal{H}_k} $ is the norm of the latent function in the reproducing kernel Hilbert space of the kernel.
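The sketch below illustrates the decomposition numerically, reusing the hypothetical `rbf_kernel`, `itergp_sketch`, and training data from the earlier examples. The exact inverse is formed here only to make the two components visible on a toy problem; IterGP itself never needs it.

```python
# Decompose the predictive variance into mathematical and computational parts (toy illustration).
import numpy as np

K = rbf_kernel(X_train, X_train) + 1e-6 * np.eye(len(X_train))   # kernel matrix (+ jitter)
K_s = rbf_kernel(X_test, X_train)                                 # K(X_*, X)
K_ss = rbf_kernel(X_test, X_test)                                 # K(X_*, X_*)

# Run fewer than n iterations on purpose, so some computational uncertainty remains.
v_i, C_i = itergp_sketch(K, y_train, num_iters=2, policy="cg")
K_inv = np.linalg.inv(K)  # formed ONLY to visualize the decomposition

math_var = np.diag(K_ss - K_s @ K_inv @ K_s.T)        # uncertainty from limited data
comp_var = np.diag(K_s @ (K_inv - C_i) @ K_s.T)       # uncertainty from limited computation
combined_var = np.diag(K_ss - K_s @ C_i @ K_s.T)      # what the iterative GP actually reports
assert np.allclose(combined_var, math_var + comp_var)  # the decomposition holds exactly
```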
The spatial distribution of computational uncertainty depends on the chosen action vectors. For instance, when using unit vectors as actions (corresponding to the Cholesky approach), computational uncertainty is reduced primarily around the selected data points. In contrast, conjugate gradient actions tend to reduce computational uncertainty more globally across the input space. Inducing point methods reduce uncertainty in regions surrounding the inducing points.
This framework transforms the computational approximation from a mere implementation detail into an integral part of the Bayesian model itself. By acknowledging and quantifying the uncertainty arising from limited computation, iterative Gaussian processes provide a more honest and complete assessment of predictive uncertainty, which is particularly valuable in high-stakes applications like physics simulations where uncertainty quantification is crucial for decision-making.
Action Policies and their consequences
Having established the general framework of iterative Gaussian processes and the concept of computational uncertainty, we now turn our attention to the central mechanism that drives the iterative process: the selection of action vectors. These actions, denoted as $ s_i $, determine which projections of the residual we observe in each iteration and consequently how our belief about the representer weights evolves.
The choice of action policy is not merely a technical detail but a strategic decision that profoundly influences the convergence behavior, computational efficiency, and spatial distribution of uncertainty in our model. Different action policies lead to different instances of iterative GPs, many of which correspond to classical approximation methods.
Requirements for valid actions
For an action policy to be effective within the iterative GP framework, the selected actions must satisfy certain properties:
- Linear independence: To ensure convergence in at most $ n $ iterations, the actions $ \{s_1, \ldots, s_n\} $ must be linearly independent. This guarantees that each new action expands the subspace in which we approximate $ v^* $.
- Accessibility: The actions must be computable without requiring operations that scale cubically with the dataset size, as this would negate the computational advantages of the iterative approach.
- Effectiveness: Ideally, actions should rapidly reduce the error in our estimate of $ v^* $, targeting the directions where our approximation is weakest.
Beyond these basic requirements, different action policies offer distinct trade-offs between computational efficiency, convergence speed, and the spatial distribution of computational uncertainty.
Canonical action policies
Several canonical action policies have emerged, each with its own characteristics:
Unit Vector Actions (IterGP-Chol): Using standard unit vectors $ e_i $ as actions yields an approach equivalent to the Cholesky decomposition. Specifically, at iteration $ i $, we select $ s_i = e_i $, focusing on a single data point. This approach reduces computational uncertainty primarily around the selected datapoints, resulting in a localized uncertainty reduction pattern. The convergence speed depends heavily on the ordering of the data points.
Conjugate Gradient Actions (IterGP-CG): By choosing actions based on the residual or preconditioned residual, we recover the method of conjugate gradients. These actions are designed to maximally reduce the error in each iteration, leading to faster global convergence. Specifically, when using $ s_i = r_{i-1} $ or $ s_i = \hat{P}^{-1}r_{i-1} $ (where $ \hat{P} $ is a preconditioner), the iterative GP recovers the same posterior mean as conjugate gradients. The convergence rate depends on the condition number of the kernel matrix and can be substantially improved through effective preconditioning.
Randomized Actions: Actions can also be selected randomly, leading to methods analogous to the randomized Kaczmarz algorithm. This approach offers probabilistic convergence guarantees without requiring deterministic patterns.
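To connect the preconditioned variant mentioned above to the earlier iteration sketch: a preconditioned-residual action simply applies $\hat{P}^{-1}$ to the current residual. The helper below is hypothetical, and the Jacobi (diagonal) preconditioner built from the kernel matrix `K` of the earlier sketches is used purely as an illustration.

```python
# Sketch: preconditioned-residual actions s_i = P_hat^{-1} r_{i-1} for the iteration sketched earlier.
import numpy as np

def preconditioned_action(residual, apply_preconditioner_inverse):
    """Return the action s_i = P_hat^{-1} r_{i-1}, given a routine that applies P_hat^{-1}."""
    return apply_preconditioner_inverse(residual)

# Illustration: a simple Jacobi (diagonal) preconditioner derived from the kernel matrix K.
jacobi_inverse = lambda r: r / np.diag(K)
# Inside the loop of itergp_sketch, one would set s = preconditioned_action(r, jacobi_inverse).
```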
Convergence and error bounds
The convergence behavior of iterative GPs depends critically on the chosen action policy. For any linearly independent set of actions, we are guaranteed convergence in at most $ n $ iterations. However, the rate of convergence can vary substantially.
For unit vector actions, the convergence rate depends on the ordering strategy and the structure of the kernel matrix. For conjugate gradient actions, the convergence rate is governed by the standard conjugate gradient bound in terms of the condition number $ \kappa $ of the kernel matrix:
$$ \lVert v^* - v_i \rVert_{K(\mathbf{X}, \mathbf{X})} \;\le\; 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{i} \lVert v^* - v_0 \rVert_{K(\mathbf{X}, \mathbf{X})}. $$
The practical implication is that well-designed action policies can achieve excellent approximations with far fewer than $ n $ iterations, making iterative GPs applicable to very large datasets.
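As a rough worked example of the bound above (the tolerance and condition numbers are illustrative), the per-iteration contraction factor and the number of iterations it implies can be computed directly:

```python
# Worked example of the CG contraction factor (sqrt(kappa) - 1) / (sqrt(kappa) + 1).
import numpy as np

for kappa in (10, 1_000, 100_000):
    rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    # Iterations until the worst-case bound 2 * rho^i drops below 1e-6.
    iters_needed = int(np.ceil(np.log(1e-6 / 2) / np.log(rho)))
    print(f"kappa = {kappa:>7}: contraction ~{rho:.4f} per iteration, ~{iters_needed} iterations")
```

In practice convergence is often much faster than this worst-case bound suggests, particularly when the eigenvalues of the kernel matrix are clustered or a good preconditioner is available.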
Spatial distribution of computational uncertainty
Perhaps the most distinctive characteristic of different action policies is how they distribute computational uncertainty across the input space. This distribution has direct consequences for predictive performance in different regions:
- Unit vector actions reduce uncertainty in localized regions around selected data points, creating a “patchy” uncertainty landscape where some regions have low computational uncertainty while others remain highly uncertain.
- Conjugate gradient actions reduce uncertainty more globally, targeting directions that maximize error reduction across the entire domain. This results in a more uniform decrease in computational uncertainty.
- Kernel-based actions reduce uncertainty in the vicinity of the selected inducing points, allowing for targeted uncertainty reduction in regions of interest.
This spatial variation in computational uncertainty is not merely a theoretical curiosity but has practical implications for model performance. In applications where prediction accuracy is more important in specific regions, action policies can be designed to prioritize uncertainty reduction in those areas.
The flexibility to choose different action policies allows iterative GPs to be adapted to the specific requirements and constraints of different applications. Whether the priority is computational efficiency, fast convergence, or targeted uncertainty reduction, an appropriate action policy can be selected or designed.
In the context of physics simulations, which motivate our exploration, the ability to target computational resources toward regions of physical interest or high dynamics using carefully designed action policies represents a powerful tool for efficient and accurate simulation.