<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[pali pi sona pi ilo nanpa]]></title><description><![CDATA[pali pi sona pi ilo nanpa]]></description><link>https://pbhnblog.ballif.eu</link><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 03:07:48 GMT</lastBuildDate><atom:link href="https://pbhnblog.ballif.eu/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Part 5: The Volume of Fluid method]]></title><description><![CDATA[In this post, I'll walk through my implementation of the Volume of Fluid method. The overall motivation is to simulate 3D waves and droplets of water, like this:


In practice, this post will mostly d]]></description><link>https://pbhnblog.ballif.eu/part-5-the-volume-of-fluid-method</link><guid isPermaLink="true">https://pbhnblog.ballif.eu/part-5-the-volume-of-fluid-method</guid><dc:creator><![CDATA[Pierre Ballif]]></dc:creator><pubDate>Fri, 03 Apr 2026 17:29:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6872696ce659b1dad2c6a371/162f7682-da1e-4b12-9860-fd90d000cbac.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I'll walk through my implementation of the Volume of Fluid method. The overall motivation is to simulate 3D waves and droplets of water, like this:</p>
<img src="https://c.pxhere.com/photos/16/33/water_drops_liquid_abstracts_macro_falling_droplets_splashes-1237531.jpg!d" alt="" style="display:block;margin:0 auto" />

<p>In practice, this post will mostly describe "why are the incompressible Euler equations useful and how can we simulate them?"</p>
<h2>Models</h2>
<p>There is one reality, and there are different models of it. Some models include:</p>
<ul>
<li><p>the Navier-Stokes equations (our most complete understanding of fluid dynamics, except for more advanced models that involve atomic theory; very difficult to solve)</p>
</li>
<li><p>the Euler equations (a simplified model where we neglect friction and heat transfer)</p>
</li>
<li><p>the incompressible Euler equations (a simplified model that neglects fluid compression)</p>
</li>
<li><p>a computer simulation of the incompressible Euler equations (an inaccurate model, because it can only represent discrete timesteps and machine-precision <code>double</code> values instead of real numbers)</p>
</li>
<li><p>my implementation of the simulation of the incompressible Euler equations (which might introduce some bugs)</p>
</li>
</ul>
<p>Another model of reality is of course "intuition", i.e. whatever you're doing when you're looking at a picture of a wave and thinking "that's how the wave will move in the next second".</p>
<p>For all of those models to be useful, they have to agree, i.e. predict the same things (at least to some extent).</p>
<h2>The dam break</h2>
<p>Imagine a wall of water: one meter of water to the right, and nothing to the left. (You can imagine a dam that held the water back and suddenly disappeared at t=0.) What do the different models tell us will happen?</p>
<p>Your mental model of the world should imagine the water evolving like this:</p>
<p><a class="embed-card" href="https://www.youtube.com/watch?v=i5bpA-f8FW0">https://www.youtube.com/watch?v=i5bpA-f8FW0</a></p>

<p>The incompressible Euler equations tell us a similar, but more detailed, story. (The math assumes that you are familiar with partial derivatives and the notation \(\text{div }\mathbf{u} =: \nabla \cdot \mathbf{u}\).)</p>
<ul>
<li><p>The wall of water starts at rest, i.e. \(u = 0\), and evolves according to:</p>
<p>$$\begin{aligned} \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} &amp;= \mathbf{f} - \frac{\nabla p}{\rho} \\ \nabla \cdot \mathbf{u} &amp;= 0 \end{aligned}$$</p>
<p>where</p>
<ul>
<li><p>\(\mathbf{u} \in \mathbb{R}^3\) is the fluid velocity</p>
</li>
<li><p>\(p \in \mathbb{R}\) is the pressure</p>
</li>
<li><p>\(\rho \in \mathbb{R}\) is the density</p>
</li>
<li><p>\(\mathbf{f} \in \mathbb{R}^3\) is the external force per unit mass (it enters the equation as an acceleration)</p>
</li>
</ul>
</li>
<li><p>Nonzero gravity (\(\mathbf{f} = -9.81 \vec{z}\)), and zero convective transport in the absence of velocity (\(\mathbf{u} \cdot \nabla \mathbf{u} = 0\)), would cause a downwards velocity everywhere if there weren't the pressure term to balance them out.</p>
</li>
<li><p>The pressure corresponds to hydrostatic pressure, i.e. high pressure at the bottom and low pressure at the top. This pressure exactly compensates the effect of gravity, so that the downwards velocity is zero everywhere.</p>
</li>
<li><p>But pressure is non-directed - so the high pressure at the bottom, or rather the difference between this high pressure and the low air pressure next to it, will push the fluid sideways into the empty region at the bottom, where the pressure difference is largest. Near the surface, the difference is small, so the fluid there hardly moves sideways at first.</p>
</li>
<li><p>Now that fluid is leaking at the bottom, the fluid at the top will also move downwards due to gravity.</p>
</li>
<li><p>The whole wall of water will spectacularly break down.</p>
</li>
</ul>
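<p>The hydrostatic argument above is easy to put into numbers. The following sketch (with an assumed, illustrative cell width <code>dx</code>) computes the gauge pressure in the standing column and the horizontal acceleration that the bottom pressure difference would cause:</p>

```python
# Hydrostatic pressure in the standing water column: p(z) = rho * g * (h - z).
# At the bottom, the water-side pressure far exceeds the air pressure next to
# it, so the horizontal pressure gradient accelerates the fluid sideways.
rho = 1000.0   # density of water [kg/m^3]
g = 9.81       # gravitational acceleration [m/s^2]
h = 1.0        # height of the water wall [m]

def hydrostatic_pressure(z):
    """Gauge pressure (relative to air) at height z inside the column."""
    return rho * g * (h - z)

p_bottom = hydrostatic_pressure(0.0)   # about 9810 Pa above air pressure
p_top = hydrostatic_pressure(h)        # 0: same as the surrounding air

# Horizontal acceleration right at the bottom edge, if the pressure drops to
# air pressure over one (hypothetical) grid cell of width dx = 0.05 m:
dx = 0.05
a_horizontal = (p_bottom - 0.0) / (rho * dx)  # equals g * h / dx
print(p_bottom, a_horizontal)
```

<p>So the bottom of the wall gets a sideways kick of roughly twenty times gravity over a single cell, which is why the collapse is so violent.</p>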
<p>Here, we have an agreement between intuition, the above video, and the incompressible Euler equations. So all of these models are useful!</p>
<h2>Solving the incompressible Euler equations</h2>
<p>The incompressible Euler equations are already a simplified model, but they are still differential equations involving continuous quantities over continuous space evolving in continuous time. To simulate them, we need to discretize them both in time and in space (casting them from a differential equation to a difference equation) and then to solve them.</p>
<p>The incompressible Euler equations are as follows (from <a href="https://en.wikipedia.org/wiki/Euler_equations_(fluid_dynamics)#Incompressible_Euler_equations">Wikipedia</a>):</p>
<p>$$\begin{align} \frac{\partial \rho}{\partial t} + \mathbf{u} \cdot \nabla\rho &amp;= 0 &amp; \text{(a)} \ \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} &amp;= - \frac{\nabla p}{\rho} + \mathbf{f} &amp; \text{(b)} \ \nabla \cdot \mathbf{u} &amp;= 0 &amp; \text{(c)} \end{align}$$</p>
<p>In our case, we also have a boundary condition at the walls: \( \vec{n} \cdot \mathbf{u} |_{\partial \Omega} = 0\). (Since only the normal velocity component is constrained, this is a no-penetration, or free-slip, condition rather than a no-slip one.)</p>
<p>(a) represents mass conservation, (b) momentum conservation, and (c) incompressibility. (a) is handled by VoF, because VoF is what we use to model mass and density. We will (somewhat arbitrarily) first solve (b) and (c) for the velocity \(\mathbf{u}\), then solve (a) based on that velocity.</p>
<p>(Reference for the previous paragraph: <a href="https://web.stanford.edu/class/me469b/handouts/incompressible.pdf">https://web.stanford.edu/class/me469b/handouts/incompressible.pdf</a>)</p>
<h3>The projection method</h3>
<p>When trying to solve (b), we encounter the problem that \(p\) has no explicit expression and no evolution equation of its own. The solution is to use a projection method: the idea is to solve (b) and (c) at the same time, for both \(\mathbf{u}\) and \(p\).</p>
<p>This section is based on <a href="https://orbi.uliege.be/bitstream/2268/2649/1/BBEC9106.pdf">https://orbi.uliege.be/bitstream/2268/2649/1/BBEC9106.pdf</a>.</p>
<p>The idea behind the projection method was described above: we consider "what would the velocity do if there were no pressure?", i.e. we first compute a "transport velocity" \(\mathbf{u}_\text{trans}\) that solves (b) without the pressure term. Then we compute what the pressure must be to fulfill both (b) and (c).</p>
<p>We partly discretize the equations with an explicit scheme into</p>
<p>$$\begin{align} \frac{\mathbf{u}^{(i+1)} - \mathbf{u}^{(i)}}{\Delta t} + \mathbf{u}^{(i)} \cdot \nabla \mathbf{u}^{(i)} &amp;= - \frac{\nabla p}{\rho} + \mathbf{f} \\ \nabla \cdot \mathbf{u}^{(i+1)} &amp;= 0 \end{align}$$</p>
<p>$$\implies \begin{align} \mathbf{u}^{(i+1)} &amp;= \mathbf{u}^{(i)} - \Delta t ( \mathbf{u}^{(i)} \cdot \nabla \mathbf{u}^{(i)} ) - \Delta t \frac{\nabla p}{\rho} + \Delta t \mathbf{f} \\ \nabla \cdot \mathbf{u}^{(i+1)} &amp;= 0 \end{align}$$</p>
<p>Define \(\mathbf{u}_\text{trans} := -\mathbf{u}^{(i)} \cdot \nabla \mathbf{u}^{(i)} + \mathbf{f}\)</p>
<p>$$\begin{align} \mathbf{u}^{(i+1)} &amp;= \mathbf{u}^{(i)} + \Delta t \mathbf{u}_\text{trans} - \Delta t \frac{\nabla p}{\rho} \\ \nabla \cdot \mathbf{u}^{(i+1)} &amp;= \nabla \cdot \mathbf{u}^{(i)} = 0 \\ \implies \nabla \cdot \left(\mathbf{u}_\text{trans} - \frac{\nabla p}{\rho}\right) &amp;= 0 \\ \implies \nabla \cdot \frac{\nabla p}{\rho} &amp;= \nabla \cdot \mathbf{u}_\text{trans} \end{align}$$</p>
<p>where we can compute the right-hand side with finite differences. Then we can solve this differential equation and get \(p\) (up to a constant; we can arbitrarily set the mean pressure to 0).</p>
<p>You might wonder why we solve for \(p\) while we're only interested in \(\frac{\nabla p}{\rho}\). But the equation \(\nabla \cdot (\mathbf{u}_\text{trans} - \frac{\nabla p}{\rho}) = 0\) does not have a unique solution for \(\nabla p\); rather, it must be combined with the knowledge that \(\nabla p\) is a gradient (and therefore curl-free) to get a unique solution. This is what solving the Poisson-like equation for \(p\) achieves.</p>
<h4>Algorithm:</h4>
<p>$$\begin{aligned} \mathbf{u}_\text{trans} &amp;\leftarrow -( \mathbf{u}^{(i)} \cdot \nabla \mathbf{u}^{(i)} ) + \mathbf{f} \\ p &amp;\leftarrow \operatorname{solve}\left(\nabla \cdot \frac{\nabla p}{\rho} = \nabla \cdot \mathbf{u}_\text{trans}\right) \\ \mathbf{u}^{(i+1)} &amp;\leftarrow \mathbf{u}^{(i)} + \Delta t \mathbf{u}_\text{trans} - \Delta t \frac{\nabla p}{\rho} \end{aligned}$$</p>
<p><strong>To compute \(\mathbf{u}_\text{trans}\)</strong>: Use the fact that (using index notation):</p>
<p>$$\mathbf{u} \cdot \nabla \mathbf{u} = u_i \frac{\partial u_j}{\partial x_i} = \frac{\partial u_i}{\partial x_i} u_j + u_i \frac{\partial u_j}{\partial x_i} = \frac{\partial u_i u_j}{\partial x_i} = \nabla \cdot (\mathbf{u} \otimes \mathbf{u})$$</p>
<p>because \(\frac{\partial u_i}{\partial x_i} = \nabla \cdot \mathbf{u} = 0\). So we can compute \(\mathbf{u} \cdot \nabla \mathbf{u}\) by computing \(\mathbf{u} \otimes \mathbf{u}\) for all points and taking derivatives (or finite differences).</p>
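<p>This identity is easy to check numerically. The following NumPy sketch (a stand-in for the C++ finite differences; the field and grid are made up for illustration) compares the \(x\)-component of \(\mathbf{u} \cdot \nabla \mathbf{u}\) against the \(x\)-component of \(\nabla \cdot (\mathbf{u} \otimes \mathbf{u})\) for a divergence-free field:</p>

```python
import numpy as np

# Numerical check of the identity u . grad(u) = div(u outer u) for a
# divergence-free field, using central differences on a uniform grid.
n = 128
x = np.linspace(0.0, 2.0 * np.pi, n)
X, Y = np.meshgrid(x, x, indexing="ij")
dx = x[1] - x[0]

# A divergence-free velocity: div u = cos(x)cos(y) - cos(x)cos(y) = 0
u = np.sin(X) * np.cos(Y)
v = -np.cos(X) * np.sin(Y)

# Left-hand side: (u . grad) applied to the first velocity component
du_dx = np.gradient(u, dx, axis=0)
du_dy = np.gradient(u, dx, axis=1)
lhs = u * du_dx + v * du_dy

# Right-hand side: x-component of div(u outer u) = d(uu)/dx + d(uv)/dy
rhs = np.gradient(u * u, dx, axis=0) + np.gradient(u * v, dx, axis=1)

# The two agree up to discretization error (away from the boundary, where
# np.gradient falls back to one-sided differences)
interior = (slice(1, -1), slice(1, -1))
print(np.max(np.abs(lhs[interior] - rhs[interior])))
```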
<p><strong>To compute \(p\)</strong>: We approximate the differential equations with finite differences. If the density was constant, this would be a Poisson equation; but the density is not constant, so we need a slightly different stencil: (here \(p_i\) denotes "\(p\) at cell \(i\)".)</p>
<p>$$\nabla \cdot \frac{\nabla p}{\rho} \approx \frac{1}{\Delta x^2} (1/\rho_{i+1/2} (p_{i+1} - p_{i}) - 1/\rho_{i-1/2} (p_{i} - p_{i-1}))$$</p>
<p>This corresponds to the stencil</p>
<p>$$\frac{1}{\Delta x^2} \begin{bmatrix} \frac{1}{\rho_{i+1/2}} &amp; (-\frac{1}{\rho_{i+1/2}} - \frac{1}{\rho_{i-1/2}}) &amp; \frac{1}{\rho_{i-1/2}} \end{bmatrix}$$</p>
<p>We then solve this system of finite difference equations with <a href="https://libeigen.gitlab.io/eigen/docs-nightly/classEigen_1_1ConjugateGradient.html">Eigen's Conjugate Gradient solver</a>, which is recommended for the 3D Poisson equation.</p>
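<p>To make the stencil and the solve concrete, here is a 1D sketch in Python. SciPy's conjugate-gradient solver stands in for Eigen's, and (as a simplifying assumption, not the post's setup) the boundary condition is \(p = 0\) at both ends instead of a zero-mean pressure, so that the system is symmetric positive definite:</p>

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# Assemble the 1D variable-density pressure system with the stencil above:
# (1/dx^2) * [ a_{i-1/2} p_{i-1} - (a_{i-1/2} + a_{i+1/2}) p_i + a_{i+1/2} p_{i+1} ]
# where a = 1/rho is averaged onto the faces.
n = 64
dx = 1.0 / n

# A density jump, e.g. water on the left half and air on the right half
rho = np.concatenate([np.full(n // 2, 1000.0), np.full(n - n // 2, 1.0)])
inv_rho = 1.0 / rho
a_face = 0.5 * (inv_rho[:-1] + inv_rho[1:])       # 1/rho_{i+1/2} by averaging

# Boundary faces (ghost pressure 0 outside the domain)
a_minus = np.concatenate([[inv_rho[0]], a_face])   # 1/rho_{i-1/2} per cell
a_plus = np.concatenate([a_face, [inv_rho[-1]]])   # 1/rho_{i+1/2} per cell

main_diag = -(a_minus + a_plus)
A = diags([a_face, main_diag, a_face], offsets=[-1, 0, 1]) / dx**2

# A smooth right-hand side standing in for div(u_trans)
xc = (np.arange(n) + 0.5) * dx
rhs = np.sin(2.0 * np.pi * xc)

# A is negative definite, so hand -A and -rhs to conjugate gradients
p, info = cg((-1.0) * A, (-1.0) * rhs, maxiter=5000)
print(info)  # 0 means CG converged
```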
<p><strong>Discretization in space</strong>: I discretize</p>
<ul>
<li><p>the pressure on the cell centers; therefore the pressure gradient on the cell faces can be approximated trivially by the pressure difference</p>
</li>
<li><p>\(\rho\) is also given on the cell centers, therefore the \(\rho_{i \pm 1/2}\) above require averaging</p>
</li>
<li><p>\(\mathbf{u}\) on the cell faces; therefore \(\nabla \cdot \mathbf{u}\) can be approximated trivially on the cell centers</p>
</li>
<li><p>\(\mathbf{u} \otimes \mathbf{u}\) on the cell centers by averaging the velocities from the neighboring faces</p>
</li>
<li><p>\(\mathbf{u}_\text{trans}\) on the cell centers; it involves \(\mathbf{f}\), given on the cell centers, and \(\mathbf{u} \otimes \mathbf{u}\)</p>
</li>
</ul>
<p>I think this is the discretization that requires the least effort.</p>
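<p>With this staggered (MAC-style) layout, the "trivial" approximations really are one-liners. A small NumPy sketch with made-up grid sizes and random data:</p>

```python
import numpy as np

# On the staggered grid described above, u lives on the cell faces, so the
# divergence at a cell center is just a difference of the adjacent face values.
nx, ny, dx = 8, 8, 0.1
u = np.random.default_rng(0).normal(size=(nx + 1, ny))  # x-face velocities
v = np.random.default_rng(1).normal(size=(nx, ny + 1))  # y-face velocities

div = np.diff(u, axis=0) / dx + np.diff(v, axis=1) / dx  # shape (nx, ny)

# The pressure gradient across an x-face is equally trivial: the difference
# of the two adjacent cell-center pressures
p = np.random.default_rng(2).normal(size=(nx, ny))
dp_dx_on_faces = np.diff(p, axis=0) / dx  # interior x-faces only
print(div.shape, dp_dx_on_faces.shape)
```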
<h2>Volume of Fluid and solving the density equation</h2>
<p>This section is adapted from <a href="https://link.springer.com/article/10.1007/s41745-024-00424-w">Volume of Fluid: A Brief Review</a>.</p>
<p>Volume of Fluid (VoF) tracks the distribution of two fluids in a system. Since the fluids have different, but constant, densities, tracking fluid distributions is equivalent to enforcing the mass conservation from the Navier-Stokes equations:</p>
<p>$$\frac{\partial \rho}{\partial t} + \mathbf{u} \cdot \nabla\rho= 0$$</p>
<p>We specifically use the geometric VoF approach, as opposed to the algebraic VoF approach; geometric VoF is more standard. In geometric VoF, we reconstruct the fluid surface to decide how the fluid evolves.</p>
<p>To reconstruct the fluid interface, we use the piecewise linear interface calculation (PLIC) approach. We look at a 3x3x3 cube of cells and their volume fractions, find the normal vector that best matches the volume fractions in those cells, and set the center cell's normal vector as that vector.</p>
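<p>As a concrete stand-in for that fit (not the post's actual least-squares procedure), one classic PLIC variant, Youngs' method, estimates the normal as the negative gradient of the volume-fraction field over the 3×3×3 neighborhood:</p>

```python
import numpy as np

# Youngs' method: the interface normal is the negative central-difference
# gradient of the volume fraction. (A simpler stand-in for the post's fit.)
def youngs_normal(frac3):
    """frac3: 3x3x3 array of volume fractions around a cell; returns a unit
    normal pointing from the fluid towards the empty cells."""
    g = np.array([
        frac3[2, 1, 1] - frac3[0, 1, 1],
        frac3[1, 2, 1] - frac3[1, 0, 1],
        frac3[1, 1, 2] - frac3[1, 1, 0],
    ]) / 2.0
    n = -g
    return n / np.linalg.norm(n)

# A flat interface with fluid below: fractions 1 in the bottom z-layer,
# 0.5 in the middle, 0 on top. The normal points straight up (+z).
frac = np.zeros((3, 3, 3))
frac[:, :, 0] = 1.0
frac[:, :, 1] = 0.5
print(youngs_normal(frac))  # [0. 0. 1.]
```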
<p>Once we have the normal vector and the volume fraction in a cell, we can compute the "cell wet wall sizes", i.e. the fraction of each cell wall that is occupied by the fluid. The following image shows in blue the cell wet wall sizes in a given configuration:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6872696ce659b1dad2c6a371/a580ddbc-f6af-49e9-b1ba-5bd6e1d8c63d.png" alt="A graphic with sliders and a unit cube. The unit cube contains a red arrow and its walls are partially blue." style="display:block;margin:0 auto" />

<p>Then, we compute the flux between cells by multiplying the velocity with the cell wet wall size. We should also take into account the fact that these walls have a slope. In 2D, the following figure explains the slope (figure adapted from <a href="https://link.springer.com/article/10.1007/s41745-024-00424-w">Volume of Fluid Method: a Brief Review</a>):</p>
<img src="https://cdn.hashnode.com/uploads/covers/6872696ce659b1dad2c6a371/b29674d4-f492-4096-8b0e-01155d13b27b.png" alt="" style="display:block;margin:0 auto" />

<p>The wet wall size is the blue line on the right; the advected volume is not just the rectangle given by \(\text{wallsize} \cdot u_{i+\frac{1}{2},j} \Delta t\), but rather the trapezoid highlighted in blue. A special case happens if the time step is large enough that the blue trapezoid meets the top line. In 3D, this is even more complicated. So I just ignored the slope and considered only the rectangle.</p>
<p>Then, we advect the volumes and update the volume fraction of each cell. Note that there is no guarantee that the volume fractions after the advection are still in the interval \([0, 1]\). If we were using an advanced VoF scheme, we'd have such guarantees. But instead we just clamp them to \([0, 1]\) and do not worry too much about the volume loss.</p>
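<p>A minimal 1D sketch of this advect-then-clamp step, with made-up data and a plain donor-cell flux standing in for "velocity times wet wall size" (in 1D there is no interface geometry to reconstruct):</p>

```python
import numpy as np

# Donor-cell fluxes through the faces, followed by the clamp to [0, 1].
def advect_volume_fraction(f, u_face, dt, dx):
    """f: volume fraction per cell; u_face: velocity on the n+1 faces
    (boundary faces are walls, so their flux stays zero)."""
    n = f.shape[0]
    flux = np.zeros(n + 1)  # signed volume crossing each face, per unit area
    for i in range(1, n):
        donor = i - 1 if np.sign(u_face[i]) == 1 else i  # upwind cell
        flux[i] = u_face[i] * dt * f[donor]
    f_new = f - np.diff(flux) / dx
    # Nothing guarantees f_new stays in [0, 1] with this simple scheme,
    # so clamp it, accepting a small volume error
    return np.clip(f_new, 0.0, 1.0)

f = np.array([1.0, 1.0, 0.5, 0.0, 0.0])
u = np.full(6, 0.2)  # uniform rightward face velocity
print(advect_volume_fraction(f, u, dt=0.1, dx=0.1))
```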
<h1>Various fun implementation details</h1>
<ul>
<li><p>I added Python bindings for the wet-wall-size formula, so that I can unit-test it from Python. This also allows me to visualize wet wall sizes, which is how I generated the plot above.</p>
</li>
<li><p>I added Numpy loading of matrices, instead of hardcoding the initial conditions. For this, I used the <a href="https://github.com/rogersce/cnpy">cnpy</a> library.</p>
</li>
</ul>
<h1>Results</h1>
<p>The simulation results for a dambreak-like scenario look like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6872696ce659b1dad2c6a371/4bf4b412-b334-4c18-9b22-70ccd552fef9.gif" alt="" style="display:block;margin:0 auto" />

<p>The overall shape of the dambreak is as it should be, but there is still some non-physical behavior, with non-fluid bubbles forming behind the wave. There is another, possibly related, error where the mesh becomes completely chaotic at the end (the program then usually terminates with a failed assertion because some values are <code>NaN</code>).</p>
<p>The main computational bottleneck is the Eigen matrix solver used to get the pressure. As far as I know, this is a fundamental limitation of the incompressible Euler equations: in the incompressible model, the pressure is globally coupled, and any grid point can influence any other grid point. (This is different from the compressible Euler equations, which consist only of local conservation laws and have a finite signal speed.)</p>
<h1>Conclusion and future steps</h1>
<p>This post outlined the implementation of a volume-of-fluid (VoF) scheme to simulate the incompressible Euler equations. The behavior of the simulation code mostly follows physics, but still has a few issues. The remaining steps are:</p>
<ul>
<li><p>Fix any remaining VoF bugs</p>
</li>
<li><p>Use a more advanced VoF method with volume conservation guarantees</p>
</li>
<li><p>Implement the compressible Euler equations</p>
</li>
<li><p>Port the solver to CUDA</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Part 4: 3D meshes and Marching Cubes]]></title><description><![CDATA[Motivation
So far, we’ve simulated featureless, undulating orange grids, which I find boring. What’s really cool would be an animation of a droplet of water falling into a water surface, including all]]></description><link>https://pbhnblog.ballif.eu/3d-meshes-and-marching-cubes</link><guid isPermaLink="true">https://pbhnblog.ballif.eu/3d-meshes-and-marching-cubes</guid><category><![CDATA[computer graphics]]></category><category><![CDATA[3d]]></category><category><![CDATA[C++]]></category><dc:creator><![CDATA[Pierre Ballif]]></dc:creator><pubDate>Sun, 23 Nov 2025 20:38:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763929983654/5e66e0ad-2fc6-40e1-a056-ab7e0239b1ee.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Motivation</h1>
<p>So far, we’ve simulated featureless, undulating orange grids, which I find boring. What’s really cool would be an animation of a droplet of water falling into a water surface, including all the splashing effects, like this:</p>
<img alt="A water droplet. Licensed under Creative Commons by Lóránt Szabó" style="display:block;margin:0 auto" />

<p>A quick search tells me that this is typically done with the <a href="https://en.wikipedia.org/wiki/Volume_of_fluid_method">Volume of Fluids method</a> — roughly, modeling voxels that are filled with fluids to different degrees. My first concern was “doesn’t that completely ignore surface tension?”, but apparently there are approaches to compensate for this (see e.g. <a href="http://www.sciencedirect.com/science/article/pii/S0301932213000190">here</a>).</p>
<p>To get to this result, the first thing we need is to have a 3D grid, instead of a 2D one; or even better, have a template class that can represent any number of dimensions. This is done in commit <a href="https://github.com/Warggr/waves-on-cuda/commit/0c5a35adc23121a9b2789d08e705024a4520ed82">0c5a35a</a>. Then we’ll have to rewrite the GLFW code to display not a 2D grid, but a 3D grid. This is very far from trivial. This post is about how we can do it with the Marching Cubes algorithm.</p>
<h1>Marching Cubes</h1>
<p>Marching Cubes [^2] is an algorithm from computer graphics that, given a scalar field on a rectangular grid, computes a mesh of triangles that approximates an iso-surface. Without loss of generality, we assume that we want to compute the isosurface of all points where the scalar field is zero.</p>
<p>The idea is to look at the spaces between the points of the grid; these spaces form cubes, with 8 mesh points forming the vertices. If the vertices do not all have the same sign, then the iso-surface passes inside the cube (because the iso-surface must separate the positive points from the negative ones). Depending on <em>which</em> vertices have which sign, we select a different pattern of triangles.</p>
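<p>One natural way to encode a cube's sign pattern (hypothetical helper, not necessarily the exact encoding used in my tables) is an 8-bit index, with bit \(i\) set when vertex \(i\) is on the positive side:</p>

```python
# Encode a cube configuration as an 8-bit index: bit i is set when vertex i
# lies on the positive side of the isosurface. Lookup tables can then be
# keyed on this index.
def cube_index(corner_values):
    """corner_values: the scalar field sampled at the 8 cube vertices."""
    index = 0
    for i, value in enumerate(corner_values):
        if value == abs(value) and value != 0:  # value is strictly positive
            index += 2**i
    return index

# Example: vertices 0 and 7 positive, all others negative
print(cube_index([1.0, -1, -1, -1, -1, -1, -1, 1.0]))  # 129 = 2^0 + 2^7
```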
<h2>Algorithm</h2>
<p>The original marching cubes [^2] had 14 cases. However, it did not take into account ambiguities, which could cause it to generate holes in the surface [^1]. The updated MC algorithm that we are using goes as follows:</p>
<ol>
<li><p>For a given combination of “on” and “off” vertices, find which of the 14 cases applies.</p>
</li>
<li><p>The case may have multiple sub-cases; we need to run some face tests and/or the interior test to decide which sub-case applies.</p>
</li>
</ol>
<h3>Cases</h3>
<p>There are 2^8 = 256 combinations of “on” and “off” vertices; we reduce them to 14 non-trivial cases (plus the trivial all-“on”/all-“off” case) by exploiting rotations (a cube has 23 possible rotations plus the “identity” rotation), and the fact that the “on” and “off” vertices can be swapped (the surface then still separates “on” from “off” vertices - the surface normal has to be inverted though). We do <em>not</em> exploit mirror symmetry; the only case where that would play a role (in other words, the only case that is chiral) is case 11, which we are happy to duplicate into its chiral opposite, case 14.</p>
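<p>This reduction can be verified by brute force. The sketch below (independent of my C++ tables; vertex numbering is arbitrary) generates the 24 cube rotations as vertex permutations and counts the equivalence classes of all 256 patterns under rotation plus on/off swap:</p>

```python
from itertools import product

# Generate the 24 rotations of the cube as permutations of its 8 vertices,
# then count equivalence classes of sign patterns under rotation + complement.
verts = list(product((0, 1), repeat=3))  # vertex i has coordinates verts[i]

def rot_x(p):  # 90-degree rotation about the x axis
    x, y, z = p
    return (x, 1 - z, y)

def rot_z(p):  # 90-degree rotation about the z axis
    x, y, z = p
    return (1 - y, x, z)

def as_perm(rot):
    return tuple(verts.index(rot(v)) for v in verts)

def compose(p, q):  # the permutation "apply q, then p"
    return tuple(p[q[i]] for i in range(8))

# Close the two generators into the full rotation group
rotations = {as_perm(rot_x), as_perm(rot_z), tuple(range(8))}
while True:
    new = {compose(p, q) for p in rotations for q in rotations} - rotations
    if not new:
        break
    rotations |= new
assert len(rotations) == 24

def orbit(mask):
    bits = [(mask // 2**i) % 2 for i in range(8)]
    images = set()
    for p in rotations:
        m = sum(bits[i] * 2**p[i] for i in range(8))
        images.add(m)
        images.add(m ^ 255)  # swapping the "on" and "off" vertices
    return frozenset(images)

classes = {orbit(m) for m in range(256)}
print(len(classes))  # 15: the trivial all-same-sign class plus 14 cases
```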
<h3>Tests</h3>
<p>When a face has two diagonally opposed corners to one side of the isosurface, and the two other corners to the other side, the face test tells you whether the positive corners or the negative corners are connected, like this:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763748392578/7b00fc99-9ec1-4a14-9628-210bd6e0354c.png" alt="" style="display:block;margin:0 auto" />

<p>Similarly, when two diagonally opposed vertices of the cube have the same sign, and there are some vertices with a different sign in between, the center test tells you whether the two vertices are connected or not:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763748475554/cf4a38d1-3540-4eaf-9dc9-b01085651107.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763748497081/ee2f3987-8ca5-43ab-a94f-ee814a2511fa.png" alt="" style="display:block;margin:0 auto" />

<p>Both tests operate under the assumption that the function varies trilinearly inside the cube.</p>
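<p>Under the bilinear assumption, the face test is commonly formulated as Nielson and Hamann's "asymptotic decider": the connectivity is decided by the sign of the field at the saddle point of the bilinear interpolant. A sketch (my helper name and corner convention, not necessarily the paper's):</p>

```python
# Asymptotic-decider face test: corners a, b, c, d in cyclic order around the
# face, with a and c the positive (diagonally opposite) pair and b, d negative.
def face_test(a, b, c, d):
    """Returns True when the two positive corners are connected across
    the face."""
    # The bilinear interpolant has a saddle point inside the face with value
    # (a*c - b*d) / (a + c - b - d); the denominator is positive here, so
    # only the sign of the numerator matters.
    saddle_numerator = a * c - b * d
    return saddle_numerator == abs(saddle_numerator) and saddle_numerator != 0

print(face_test(1.0, -0.1, 1.0, -0.1))  # True: strong positive diagonal
print(face_test(0.1, -1.0, 0.1, -1.0))  # False: the negatives dominate
```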
<p>Based on the test results, we then select a sub-case. For example in the two plots above, the positive-negative pattern (0 and 7 positive, the rest negative) tells us that we are in case 4; depending on the result of the center test, we have to select either subcase 4.1 (points not connected) or case 4.2 (points connected).</p>
<h2>Symmetry</h2>
<p>The cases are given up to symmetry, i.e. the combination (vertices 1 and 6 positive, the rest negative) should be turned into (0 and 7 positive, the rest negative), which we recognize as corresponding to case 4. Additionally, the <em>sub-cases</em> are also given up to symmetry.</p>
<p>We’ll distinguish between “original” coordinates, which is how the cube actually is, and the “canonical” coordinates, which is how the current case or subcase is given in the literature. In the example above, (1,6) are the original coordinates of the positive vertices; then we turn the cube to map them into (0,7), which are the canonical positive vertices of case 4. Let’s take the most complicated case, case 13.</p>
<p>We first have to recognize the pattern of positive edges, and turn the cube so that it corresponds to the canonical case 13. Then, we perform 6 face tests and one interior test. If there are more negative face tests than positive ones, we negate the value of each vertex (this also negates the values of the face tests) and rotate the cube to be in the canonical case 13 again, but now with more positive than negative tests.</p>
<p>If none of the tests are positive, we are in case 13.1. If one of them is positive, case 13.2 — but it has to be rotated to match which face is positive. For two positives, we assume that the two positives are on adjacent faces and rotate case 13.3 to fit. (Why are we allowed to assume that the positive faces are adjacent? Because the assumption of tri-linearity means that if there were two positives on opposite faces, then <em>all</em> faces would be positive, not only these two.)</p>
<h3>What to precompute</h3>
<p>We have to decide on a trade-off: what do we precompute and what do we compute at runtime? In other words, where do we use lookup tables and where do we use <code>if</code>-statements? One extreme would be to build a lookup table of 256 × 2^7 (the number of possible test results) entries, where each cell is a list of triangles. The other extreme would be, whenever we encounter a bit-pattern, to try all rotations of it and see whether it matches any canonical case.</p>
<p>My compromise is the following:</p>
<ul>
<li><p>There is a lookup table that maps positive/negative vertex patterns to a rotation + a case (1-14).</p>
</li>
<li><p>We already compute the points where the isosurface intersects the edges (there are 12 edges, but not all of them are crossed by the isosurface). At this point, we have all the geometric information about the cube available as an array. This means that rotations can be implemented as permutations (of both the vertices and the edges).</p>
</li>
<li><p>The CPU performs the rotation, then looks up the number of tests for the case and performs all these tests (on the rotated cube).</p>
</li>
<li><p>There is a per-case lookup table that maps test results to a rotation + a subcase.</p>
</li>
<li><p>The CPU performs the rotation again, then draws triangles between the points computed in step 2. We computed the 3D points back when the cube was in “original” coordinates, but the array containing them has been permuted to correspond to the “canonical” list of 3D points for this case. So we can just construct triangles from those points in the order in which they are given, without having to permute things back.</p>
</li>
</ul>
<h1>How to generate cases</h1>
<p>To describe the triangle patterns for each case, I used Python. The data is translated to C by introspecting the Python data structures and writing a C header with the declarations, plus a <code>.c</code> file with the data itself. The files are then compiled as usual by the compiler.</p>
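<p>The core of such a translation is tiny. An illustrative sketch (names and array shapes are made up, not my actual generator):</p>

```python
# Turn a nested Python list into a C array initializer string; writing the
# result to a .c file is then a one-liner.
def to_c_initializer(data):
    if isinstance(data, (list, tuple)):
        return "{" + ", ".join(to_c_initializer(x) for x in data) + "}"
    return str(data)

# e.g. one sub-case with two triangles, each a list of edge indices
subcase = [[0, 1, 2], [2, 1, 3]]
line = "static const int subcase_4_2[2][3] = " + to_c_initializer(subcase) + ";"
print(line)  # static const int subcase_4_2[2][3] = {{0, 1, 2}, {2, 1, 3}};
```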
<p>A nice thing is that I now also have the data structures available in Python. Translating the algorithm from C++ to Python is trivial, which allows me to plot Marching Cubes in Python, and in particular to get an interactive visualization of different cube patterns.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762555339832/36e9796b-3e03-4e5a-a748-82528fb3df9e.png" alt="" style="display:block;margin:0 auto" />

<p>To showcase this, I’ve uploaded the visualization publicly: <a href="https://warggr.github.io/waves-on-cuda/scripts/marching_cubes/html">https://warggr.github.io/waves-on-cuda/scripts/marching_cubes/html</a>/. Thanks to <a href="https://pyodide.org/en/stable/">Pyodide</a>, it is possible to run Python (including Matplotlib and a lot of other third-party modules) in the browser as a static webpage, without needing a server.</p>
<h1>Other implementations</h1>
<p>How does my implementation compare to others?</p>
<p>I added two other Marching Cubes implementations into my CMake project, writing a tiny wrapper so that they are drop-in replacements for my <code>marching_cubes</code> function.</p>
<ul>
<li><p>I could not find the source code for the original implementation of Marching Cubes 33 by Lewiner et al. [^3] (they provide a <a href="http://www.acm.org/jgt/papers/LewinerEtAl03">link</a> that is now dead.)</p>
</li>
<li><p>The first implementation is from Custodio et al., based on their 2013 paper [^4]. They fix three issues in the MC33 algorithm.</p>
</li>
<li><p>The second implementation is from Vega et al., the authors of [^1], which was published in 2019. They state that the main contribution of their paper is the correct execution of the interior test, and the efficiency of their implementation.</p>
</li>
</ul>
<h2>Comparison</h2>
<p>Both implementations are <em>much</em> more efficient than mine:</p>
<table>
<thead>
<tr>
<th>Implementation</th>
<th>Runtime [ms]</th>
</tr>
</thead>
<tbody><tr>
<td>Mine</td>
<td>173</td>
</tr>
<tr>
<td>Custodio et al.</td>
<td>59</td>
</tr>
<tr>
<td>Vega et al.</td>
<td>54</td>
</tr>
</tbody></table>
<p>I noticed that both implementations return early when all vertices have the same sign — this is the most common case, so this makes a big difference. After adding an early-return to my code, the runtime decreased to 100ms.</p>
<p>A likely cause for the slowness of my implementation is that I tried to separate the algorithm from the data; each case is processed the same, but based on the case-specific data in the lookup table. The code from the literature (e.g. L. Custodio’s code <a href="https://github.com/liscustodio/modified_mc33/blob/master/source/MarchingCubes.cpp">here</a>) is longer and more branch-heavy. My code was easier (and less boring) to write, but this has a cost in terms of speed.</p>
<h1>Future steps</h1>
<p>First, there are still some bugs in my implementation. For example, I always compute the face test between the bottom-left and top-right vertex of a face, no matter how the cube is rotated; this is of course incorrect.</p>
<p>Second, it might be possible to implement Marching Cubes on the GPU, using a <a href="https://learnopengl.com/Guest-Articles/2022/Compute-Shaders/Introduction">compute shader</a> and/or a <a href="https://learnopengl.com/Advanced-OpenGL/Geometry-Shader">geometry shader</a>. This might make the program faster.</p>
<p>Finally, there are many ways in which the Marching Cubes implementation could still be improved. For example, we currently save the different sub-cases as lists of triangles, where each triangle is saved as a list of vertices. But the vertices of the triangles are the same for every sub-case of the same case. Maybe the vertices could be saved once per case, and each sub-case could only contain indices indicating how these vertices are connected.</p>
<h1>Conclusion</h1>
<p>The code is now able to visualize 3D grids thanks to a Marching Cubes implementation. Despite the apparent simplicity of Marching Cubes, there are many details to pay attention to. Open-source implementations are much faster and so should be preferred.</p>
<p>Cover image: a sphere plotted with Marching Cubes, more precisely a visualization of the field</p>
<p>$$(x - 5)^2 + (y-5)^2 + (z-5)^2$$</p>
<p>I don’t know why three sides are cut off; I probably made an off-by-one error somewhere.</p>
<h1>References</h1>
<p>[^1]: Vega, D., Abache, J., &amp; Coll, D. (2019). A fast and memory-saving marching cubes 33 implementation with the correct interior test. In <em>Journal of Computer Graphics Techniques</em> 8.3.</p>
<p>[^2]: Lorensen, W. E., &amp; Cline, H. E. (1998). Marching cubes: A high resolution 3D surface construction algorithm. In <em>Seminal graphics: pioneering efforts that shaped the field</em> (pp. 347-353).</p>
<p>[^3]: Lewiner, T., Lopes, H., Vieira, A. W., &amp; Tavares, G. (2003). Efficient Implementation of Marching Cubes’ Cases with Topological Guarantees. <em>Journal of Graphics Tools</em>, <em>8</em>(2), 1–15. <a href="https://doi.org/10.1080/10867651.2003.10487582">https://doi.org/10.1080/10867651.2003.10487582</a></p>
<p>[^4]: Custodio, L., Etiene, T., Pesco, S., &amp; Silva, C. (2013). Practical considerations on Marching Cubes 33 topological correctness. <em>Comput. Graph.</em> 37, 7 (November 2013), 840–850. <a href="https://doi.org/10.1016/j.cag.2013.04.004">https://doi.org/10.1016/j.cag.2013.04.004</a></p>
]]></content:encoded></item><item><title><![CDATA[CUDA for Wave Simulation - Part 3: Efficient CUDA]]></title><description><![CDATA[Non-trivial CUDA
The current snippet of code calling CUDA is:
cuda_step<<< 1, 1 >>>

This uses only one CUDA thread, and is probably extremely inefficient. To verify this, let’s add some way to time the program. I could use NVIDIA’s nsys, or just the...]]></description><link>https://pbhnblog.ballif.eu/efficient-cuda</link><guid isPermaLink="true">https://pbhnblog.ballif.eu/efficient-cuda</guid><category><![CDATA[cuda]]></category><category><![CDATA[C++]]></category><category><![CDATA[benchmarking]]></category><dc:creator><![CDATA[Pierre Ballif]]></dc:creator><pubDate>Sun, 31 Aug 2025 14:22:19 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-non-trivial-cuda">Non-trivial CUDA</h1>
<p>The current snippet of code calling CUDA is:</p>
<pre><code class="lang-plaintext">cuda_step&lt;&lt;&lt; 1, 1 &gt;&gt;&gt;
</code></pre>
<p>This uses only one CUDA thread, and is probably extremely inefficient. To verify this, let’s add some way to time the program. I could use NVIDIA’s <code>nsys</code>, or just the Linux <code>time</code> command (or some kind of time counters in the program). In any case, I’ll need a new “run configuration”: run the program for a fixed number of cycles to check its efficiency. I don’t think this justifies a new executable, so let’s make this a program flag. For this, I’ll need to integrate the Boost Program Options library (I could also write program options checking by hand, but Boost is easier, better tested and probably more extensible).</p>
<p>This pulls in an additional dependency, but Boost is already pretty widely installed, and requiring it is rather trivial in CMake. (A future TODO: fall back gracefully in case the user does not have Boost installed.)</p>
<p>The new run configuration has the following differences with the original:</p>
<ul>
<li><p>There is no visualization, as it could be a bottleneck (I’m not confident in the current communication between UI and simulation) and I don’t want a window to pop up when I run tests.</p>
</li>
<li><p>The program doesn’t call <code>cudaDeviceSynchronize</code> after each step, as that is no longer necessary and can become a bottleneck. (In fact, it is; <code>./src/waves --perf 1000</code> takes 3ms without <code>cudaDeviceSynchronize</code> and 2080ms with it).</p>
</li>
</ul>
<h2 id="heading-nsys-analysis">NSys analysis</h2>
<p>First, let's run the program with <code>nsys</code>:</p>
<pre><code class="lang-shell">$ nsys profile src/waves --perf 100
$ nsys stats report1.nsys-rep
** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ----------------------
     98.0      128,491,979          2  64,245,989.5  64,245,989.5    72,052  128,419,927  90,755,652.8  cudaMallocManaged     
      1.2        1,553,190        100      15,531.9       5,443.5     5,074      962,247      95,655.4  cudaLaunchKernel
[...more lines left out...]
 ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                Name                                              
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ------------------------------------------------------------------------------------------------
    100.0        1,923,059        100  19,230.6  16,096.0    16,063   330,014     31,392.3  cuda_step(const double *, double *, double, double, unsigned long, unsigned long, unsigned long)
</code></pre>
<p>Okay, so allocation still dwarfs the actual work: we spend 128 ms (128,491,979 ns) allocating memory, 1.55 ms launching kernels, and 1.92 ms doing the kernel work. Notes:</p>
<ul>
<li><p>This is particularly bad because as of <code>265a7e2</code>, the <code>--perf</code> configuration doesn't measure the initialization time, incl. memory allocation — so our results are going to be wildly different from the actual runtime if the allocation runtime is significant. (I should probably fix this.)</p>
</li>
<li><p>I don't know if the 1.55ms and 1.92ms should be added or subtracted from each other (i.e. whether kernel launch time counts as part of kernel execution time), but it doesn't really matter for this example.</p>
</li>
<li><p>Sanity check: <code>nsys</code> confirms that we did 2 allocations and launched 100 kernels, as expected. (We allocate one "front" and one "back" matrix buffer and swap them back and forth.)</p>
</li>
</ul>
<p>Let's increase N. I won't copy the output of <code>nsys</code> again; instead, here's a summary. All times are in ms.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>N</td><td>100 (previous)</td><td>10000</td></tr>
</thead>
<tbody>
<tr>
<td><code>cudaMallocManaged</code> total time</td><td>128.49</td><td>126.56</td></tr>
<tr>
<td><code>cudaMallocManaged</code> calls</td><td>2</td><td>2</td></tr>
<tr>
<td><code>cudaLaunchKernel</code> total time</td><td>1.55</td><td>127.88</td></tr>
<tr>
<td><code>cudaLaunchKernel</code> number of calls</td><td>100</td><td>10000</td></tr>
<tr>
<td>Kernel run time</td><td>1.92</td><td>161.19</td></tr>
</tbody>
</table>
</div><h2 id="heading-runtime-analysis">Runtime analysis</h2>
<p>Now we can do a few tests with the <code>--perf</code> configuration. All times are in ms.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>N</td><td>10</td><td>100</td><td>1000</td><td>10000</td></tr>
</thead>
<tbody>
<tr>
<td>Without CUDA</td><td>0.07</td><td>0.50</td><td>5.04</td><td>50.08</td></tr>
<tr>
<td>With CUDA</td><td>0.41</td><td>0.27</td><td>125.95</td><td>11314.5</td></tr>
</tbody>
</table>
</div><p>Again, results for \(N \le 1000\) are not really significant in CUDA, because the total runtime is dominated by allocation (<code>cudaMallocManaged</code>), outside the region tracked by <code>--perf</code>. The no-CUDA version scales linearly with N, as expected.</p>
<p>The CUDA version decreases in runtime with higher N, then scales very badly with large N. A possibly related bug is that I forgot to sync the CUDA state after all steps were done. Could this be the explanation?</p>
<p>A <a target="_blank" href="https://medium.com/@snshyam/cuda-deep-dive-what-happens-when-you-launch-a-kernel-034e23624932">nice post on CUDA kernel launching</a> explains the process of launching a kernel, including this interesting detail:</p>
<blockquote>
<p>The driver doesn’t immediately send your kernel to the GPU. Instead, it [...] allocates space in a command buffer — think of this as a todo list for the GPU</p>
</blockquote>
<p>How large is this command buffer? Unfortunately, this is difficult to find out. So for all I know, the following explanation is reasonable: we don't sync the CUDA state and launch a lot of kernels without waiting, so the first 100 or so land in the command buffer. We don't sync, so we measure only the time for launching the kernel, which is tiny. When we add more kernels, at some point the command buffer becomes full and we have to wait for work to be processed. So for higher N's, we effectively add a synchronization that was not present for lower N's, which makes the runtime much higher.</p>
<p>In any case, the fix is trivial: call <code>cudaDeviceSynchronize</code> at the end of each experiment. (See commit <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/cae55ce"><code>cae55ce</code></a>.) With this, we have the following results:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>N</td><td>10</td><td>100</td><td>1000</td><td>10000</td></tr>
</thead>
<tbody>
<tr>
<td>Without CUDA</td><td>0.07</td><td>0.50</td><td>5.04</td><td>50.08</td></tr>
<tr>
<td>With CUDA</td><td>15.20</td><td>140.90</td><td>1128.67</td><td>11285.2</td></tr>
</tbody>
</table>
</div><p>This is more sensible, although the performance is terrible. So now that we've ironed out all the bugs, we can do what we wanted to do in the first place: add more CUDA threads to actually take advantage of the GPU.</p>
<h2 id="heading-increasing-threads-and-grid-size">Increasing threads and grid size</h2>
<p>Using more threads can be done by simply calling the kernel as</p>
<pre><code class="lang-plaintext">cuda_step&lt;&lt;&lt; 1, NUMBER_OF_THREADS &gt;&gt;&gt;
</code></pre>
<p>Then the kernel will be called <code>NUMBER_OF_THREADS</code> times, with a different value of the built-in <code>threadIdx</code> variable giving each thread an identity. We need to decide what each thread does. Fortunately, in our simple problem, each row of the grid is independent, so we can just give one or more rows to each thread. This is done with the following code:</p>
<pre><code class="lang-diff"> void cuda_step(
     const double* in, PlainCGrid out,
     double t, double c,
<span class="hljs-addition">+    std::size_t block_size,</span>
     std::size_t grid_width, std::size_t grid_height
 ) {
<span class="hljs-addition">+    int start = threadIdx.x * block_size;</span>
<span class="hljs-addition">+    int end = (threadIdx.x + 1) * block_size;</span>
<span class="hljs-deletion">-    for(int i = 0; i &lt; grid_height; i++) {</span>
<span class="hljs-addition">+    for(int i = start; i &lt; end; i++) {</span>
</code></pre>
<p>where the caller has to specify the block size and to ensure that <code>NUMBER_OF_THREADS * block_size == grid_height</code>. The documentation seems to allow an arbitrary number of threads per block, e.g. <code>NUMBER_OF_THREADS = grid_height</code> and <code>block_size = 1</code>. In fact, this appears to work on my laptop even for <code>NUMBER_OF_THREADS = 10000</code>, even though the <a target="_blank" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy">Programming Guide</a> states that</p>
<blockquote>
<p>On current GPUs, a thread block may contain up to 1024 threads.</p>
</blockquote>
<p>This limit can also be confirmed on my GPU with <a target="_blank" href="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#recommended-post"><code>deviceQuery</code></a>. I initially assumed that the additional threads were simply run sequentially on the available "physical" threads. However, since the code never checks the error code returned by a kernel launch, a silently failed launch is at least as plausible an explanation (see the bug noted under Future work).</p>
<p>A few experiments with the grid size. I use different values of \(S\), the number of elements per side; so there are \(S^2\) elements on the full grid. The time step \(dt\) is adapted to fulfill the CFL condition, but the number of time steps stays constant, namely N=100000 timesteps. The number of time steps has to be chosen sufficiently large, otherwise <code>cudaDeviceSynchronize</code> takes up most of the runtime (since it now has to sync \(O(S^2)\) memory, which can be very large).</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>S</td><td>10</td><td>50</td><td>100</td><td>500</td><td>1000</td></tr>
</thead>
<tbody>
<tr>
<td>Without CUDA</td><td>7.67</td><td>112.70</td><td>495.12</td><td>154523</td><td></td></tr>
<tr>
<td>With CUDA, 1 thread</td><td>1697.91</td><td>29064.3</td><td>111808</td><td>3902360</td><td></td></tr>
<tr>
<td>With CUDA, S threads</td><td>372.36</td><td>1125.84</td><td>2002.39</td><td>41444</td><td>210764</td></tr>
</tbody>
</table>
</div><p>In theory, the work scales with the number of elements, i.e. with \(S^2\). We observe the following:</p>
<ul>
<li><p>The non-CUDA version scales worse than \(O(S^2)\). This could be due to cache effects (for S = 500, each grid takes <code>500*500*sizeof(double)</code> i.e. 2MB; on my machine, this fits into L3 cache but not into L2.)</p>
</li>
<li><p>The CUDA 1-thread has worse performance than the non-CUDA version in the beginning, but scales like \(S^2\), i.e. slightly better. This might be due to different cache effects.</p>
</li>
<li><p>The CUDA S-thread version scales like \(S^2\) for high values of \(S\). I would have expected it to scale like \(S\): each thread takes one row (\(S\) elements), and there should be enough hardware threads so that no rows have to be processed sequentially.</p>
</li>
</ul>
<h2 id="heading-merging-kernels">Merging kernels</h2>
<p>As seen previously, launching a kernel is a somewhat complex process. Can we put the for-loop over N inside the kernel, so that we only launch the kernel once? Each thread's work is independent, so there should be no synchronization issues. This is implemented in <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/6bd0ee7"><code>6bd0ee7</code></a>. The runtime is as follows:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>S</td><td>10</td><td>50</td><td>100</td><td>500</td><td>1000</td></tr>
</thead>
<tbody>
<tr>
<td>Without CUDA</td><td>7.67</td><td>112.70</td><td>495.12</td><td>154523</td><td></td></tr>
<tr>
<td>With CUDA, 1 thread</td><td>1697.91</td><td>29064.3</td><td>111808</td><td>3902360</td><td></td></tr>
<tr>
<td>With CUDA, S threads</td><td>372.36</td><td>1125.84</td><td>2002.39</td><td>41444</td><td>210764</td></tr>
<tr>
<td><strong>With CUDA, S threads, one kernel</strong></td><td>178.77</td><td>661.35</td><td>1848.84</td><td>41206</td><td>210291</td></tr>
</tbody>
</table>
</div><p>So this is an improvement, but it becomes less relevant for high \(S\), where actual kernel execution and memory transfers dominate the kernel-launch overhead.</p>
<h1 id="heading-future-work">Future work</h1>
<h2 id="heading-bugs">Bugs</h2>
<ul>
<li><p>Even though the <code>--perf</code> version runs fine, there's now a SEGFAULT when running without <code>--perf</code>, i.e. with the GUI.</p>
</li>
<li><p>CUDA with \(S = 10^4\) threads and \(N = 1000\) reports 0.64 ms runtime, where \(S=10^3\) took 2162 ms. Most likely the kernel launch fails immediately (e.g. because \(10^4\) threads exceed the 1024-threads-per-block limit) and the error goes unnoticed, since we never check the launch error code.</p>
</li>
</ul>
<h2 id="heading-features">Features</h2>
<ul>
<li><p>I never actually checked whether the program transformations were correct, i.e. whether the result is the same for every commit so far.</p>
</li>
<li><p>Use multiple thread blocks, i.e. pass a number greater than 1 as the first argument in</p>
</li>
</ul>
<pre><code class="lang-plaintext">cuda_step&lt;&lt;&lt; 1, NUMBER_OF_THREADS &gt;&gt;&gt;
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Part 2: Basic CUDA]]></title><description><![CDATA[There’s a good first CUDA tutorial here: https://developer.nvidia.com/blog/even-easier-introduction-cuda/. Then there are the docs here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#.
The docs describe three levels of the thread hierarchy —...]]></description><link>https://pbhnblog.ballif.eu/basic-cuda</link><guid isPermaLink="true">https://pbhnblog.ballif.eu/basic-cuda</guid><category><![CDATA[cuda]]></category><category><![CDATA[simulation]]></category><category><![CDATA[Tutorial]]></category><category><![CDATA[C++]]></category><dc:creator><![CDATA[Pierre Ballif]]></dc:creator><pubDate>Sun, 17 Aug 2025 19:14:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763930516165/4e2210cf-3576-43a6-bbfb-a28ff49c5b16.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There’s a good first CUDA tutorial here: <a target="_blank" href="https://developer.nvidia.com/blog/even-easier-introduction-cuda/">https://developer.nvidia.com/blog/even-easier-introduction-cuda/</a>. Then there are the docs here: <a target="_blank" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/#">https://docs.nvidia.com/cuda/cuda-c-programming-guide/#</a>.</p>
<p>The docs describe three levels of the thread hierarchy — thread blocks, thread block clusters, and grids. We will use only one thread block, because it seems easiest — in particular, sharing memory outside thread blocks seems difficult. This will limit the level of parallelism we’ll be able to achieve; a future version could try to use multiple thread blocks.</p>
<p>The current simulation code (<code>World::step</code>) looks like this:</p>
<pre><code class="lang-c++"><span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; GRID_HEIGHT; i++) {
    (*other_grid)[i][<span class="hljs-number">0</span>] = <span class="hljs-number">1.0</span>;
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> j = <span class="hljs-number">1</span>; j&lt;GRID_WIDTH; j++) {
        (*other_grid)[i][j] = (*current_grid)[i][j<span class="hljs-number">-1</span>];
    }
}
</code></pre>
<p>This can be easily converted to a CUDA kernel:</p>
<pre><code class="lang-c++"><span class="hljs-keyword">using</span> PlainCGrid = <span class="hljs-keyword">double</span>*;

<span class="hljs-function">__global__
<span class="hljs-keyword">void</span> <span class="hljs-title">cuda_step</span><span class="hljs-params">(PlainCGrid in, PlainCGrid out)</span> </span>{
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; GRID_HEIGHT; i++) {
        out[i*GRID_WIDTH] = <span class="hljs-number">1.0</span>;
        <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> j = <span class="hljs-number">1</span>; j&lt;GRID_WIDTH; j++) {
            out[i*GRID_WIDTH + j] = in[i*GRID_WIDTH + j<span class="hljs-number">-1</span>];
        }
    }
}

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">World::step</span><span class="hljs-params">()</span> </span>{
    cuda_step&lt;&lt;&lt;<span class="hljs-number">1</span>, <span class="hljs-number">1</span>&gt;&gt;&gt;(
        <span class="hljs-keyword">reinterpret_cast</span>&lt;<span class="hljs-keyword">double</span>*&gt;(current_grid-&gt;_data.data()),
        <span class="hljs-keyword">reinterpret_cast</span>&lt;<span class="hljs-keyword">double</span>*&gt;(other_grid-&gt;_data.data())
    );
    <span class="hljs-built_in">std</span>::swap(other_grid, current_grid);
}
</code></pre>
<p>Now we’ll have to compile it, ideally not manually but by integrating it into the CMake project. A CUDA/CMake tutorial can be found here: <a target="_blank" href="https://developer.nvidia.com/blog/building-cuda-applications-cmake/">https://developer.nvidia.com/blog/building-cuda-applications-cmake/</a></p>
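<p>Since CMake has first-class CUDA language support, the build integration can be quite small. This is a sketch under assumed file names, not the project’s actual <code>CMakeLists.txt</code>:</p>

```cmake
cmake_minimum_required(VERSION 3.18)
project(waves LANGUAGES CXX CUDA)   # enables nvcc for .cu files

# File names are hypothetical; mix .cpp and .cu sources in one target.
add_executable(waves src/main.cpp src/cuda_step.cu)
set_target_properties(waves PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
```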
<p>This is where I notice it doesn’t work; I just get some garbled triangles. The error is not with the C-style rewrite, as removing the <code>__global__</code> and the <code>&lt;&lt;&lt;1,1&gt;&gt;&gt;</code> works perfectly (see commit <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/1b6d3b9">1b6d3b9d</a>). Probably I am doing something wrong with CUDA memory management.</p>
<p>One problem is that we were using <code>std::array</code>s, but now we need CUDA-managed memory, not “regular” memory. CUDA can’t interact with the “regular” memory stored in the <code>std::array</code>. None of the following works:</p>
<ul>
<li><p>Most C++ containers take as a template parameter an “allocator”, which can allocate memory in non-standard ways. However, <code>std::array</code> doesn’t (probably because it doesn’t actually allocate memory on the heap, but is just a wrapper around stack memory)</p>
</li>
<li><p>A <code>std::array</code> also cannot be constructed by providing the memory (probably for the same reason).</p>
</li>
<li><p>I could use an <code>std::vector</code>, but I want the size of the grid not to change after initialization, so using a resizable vector seems inelegant to me.</p>
</li>
<li><p>Since C++20, there is <code>std::span</code>, but I’m still targeting C++17</p>
</li>
<li><p>I could write some custom class that is an array with managed memory, but I already have <code>Grid</code> that provides most of that functionality. So I’ll adapt the <code>Grid</code> class.</p>
</li>
</ul>
<p>After adapting <code>Grid</code>, there’s another problem. The program flow (in <code>main</code>) looks like this: some memory is dynamically allocated by Grid, GLFW accesses that memory, then Grid is deleted before GLFW, the memory is freed, and we get a segfault. The short-term solution is to initialize Grid before GLFW. The long-term solution would be to use a smart pointer, or to use Rust, where that kind of thing can’t happen ;)</p>
<p>By <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/c3173dde8b05d6e6eee7fcf26fb599d0e15a5085">c3173dd</a>, after another trivial bug fix, this works and there’s again a unit-height wave coming from the right side. So everything works as it did before, but with CUDA. (Something that was helpful for the trivial bug fix: I can set the <code>GRID_WIDTH</code> and <code>GRID_HEIGHT</code> constants to 10 each and then print the grid to see what happens.)</p>
<p>Now I would like to have some animation in the visualization. The first step is to replace the <code>for</code>-loop in main (that only does 20 timesteps) with a <code>while</code> loop that runs forever. However, that breaks the program flow. The previous program flow was:</p>
<ol>
<li><p>20 simulation iterations are performed and visualized at the same time.</p>
</li>
<li><p><code>main</code> exits and waits for the destructor of <code>MyGLFW</code>, i.e. the visualization, which waits for the UI thread to exit.</p>
</li>
<li><p>When the user presses Exit or closes the window, the UI thread exits and the program stops.</p>
</li>
</ol>
<p>Now we need a more complex flow, which can handle both the user closing the UI thread, the main thread getting an interrupt signal, or the main thread exiting for any other reason. My solution looks like this:</p>
<ul>
<li><p>The UI (<code>class MyGLFW</code>) has a state that can be <code>RUNNING</code> or <code>KILLED</code> (i.e. the user wants to close the window but it hasn’t been closed yet). The window is started in the constructor and closed in the destructor, so there’s no other state (if the <code>MyGLFW</code> exists, then it has a GLFW window somewhere).</p>
</li>
<li><p>The state can be accessed by both threads, and is protected by a mutex. When the UI thread wants to exit, it sets the state to <code>KILLED</code> . When the main thread calls the destructor, the state is also set to <code>KILLED</code>.</p>
</li>
<li><p>On the other side, the UI thread regularly checks the state. If it’s killed, it exits the render loop. Whenever the main loop tries to render something and the UI has been killed, this raises an exception (which is caught but breaks the main loop).</p>
</li>
</ul>
<p>This is done by commit <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/bde64e4">bde64e4</a>. There still seems to be a bug with some huge triangles in the rendering; I’ll debug that later.</p>
<p>I notice that the <code>MyGLFW</code> class does a bit too much and has essentially two separate responsibilities:</p>
<ul>
<li><p>Wrap the GLFW state and ensure the window is properly closed in the destructor</p>
</li>
<li><p>Provide synchronization between the main loop and the UI window</p>
</li>
</ul>
<p>To have a clearer (and thereby probably more robust) program, I’ve split the class into two classes (<code>MyGLFW</code> and <code>renderer</code>). This is done in <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/a8a27e3">a8a27e3</a>.</p>
<p>Now that the UI and simulation can run in parallel, let’s add some more complicated boundary conditions, such as a sine. Some considerations to take into account:</p>
<ul>
<li><p>We can vary the wave speed \(v\) and the sine period. To view an interesting sine, I need a high sine period, a high grid size, and a low wave speed. Probably I’ll need to take into account the <a target="_blank" href="https://en.wikipedia.org/wiki/Courant%E2%80%93Friedrichs%E2%80%93Lewy_condition">CFL condition</a> later.</p>
</li>
<li><p>For debugging, I can set a small grid size, replace the while loop back with a for loop, and print the grid for the first few iterations. Prior to printing the grid (which is in GPU memory and must first be migrated back to the CPU), the program needs to call <code>cudaDeviceSynchronize</code>. (I found that out through this <a target="_blank" href="https://www.irisa.fr/alf/downloads/collange/cours/hpca2020_gpu_2.pdf">doc</a>, which wisely states “We all make this mistake once”.)</p>
</li>
<li><p>I can also add an option to not use CUDA, so I can distinguish between CUDA usage errors and logic errors (see commit <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/2cb856f24ad020c0336382cdd09318a47b4c9252">2cb856f</a>)</p>
</li>
</ul>
<p>With <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/49f7293">49f7293</a>, the grid size also becomes variable — this goes against what I said previously about having “the container implicitly contain its size”, but it’s a pretty convenient feature. (Since that reason no longer applies, could we use a standard container after all? No — there’s no 2D vector in the standard library. There’s a vector of vectors, but its storage wouldn’t be contiguous. So our solution is still the most elegant.)</p>
<p>With <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/f356508">f356508</a>, the boundary condition becomes a sine wave, so we can see the simulation actually doing something.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755457917917/4dc45729-15a9-4dee-b99a-7d35e9e6cda9.webp" alt="A mesh of orange triangles over a black background, forming a wavy sheet." class="image--center mx-auto" /></p>
<p>This brings us to the remaining tasks / directions to continue the program:</p>
<ul>
<li><p>Fix the OpenGL bugs (rectangle getting outside the screen all the time, too large triangles, etc.)</p>
</li>
<li><p>Flesh out the simulation: use different schemes, different equations, different grids, and different boundary conditions</p>
</li>
<li><p>Learn more about CUDA by adding more “difficult” parallelization, e.g. the grids and thread block clusters mentioned in the intro. Right now we only use one CUDA thread (AFAIK). This is sufficient, but the goal of this project is to learn about CUDA, so we should use something more complex than that. We can artificially create the need for more complex structures by using more complex grids (e.g. splitting the grid into different domains).</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Part 1: OpenGL]]></title><description><![CDATA[Motivation
In this series of posts, I'll try to simulate waves on a GPU using CUDA.
First, I'll describe the exact scope of the project. The goal is to implement some computational fluid dynamics method to solve some fluid dynamics problem - for now,...]]></description><link>https://pbhnblog.ballif.eu/opengl</link><guid isPermaLink="true">https://pbhnblog.ballif.eu/opengl</guid><category><![CDATA[C++]]></category><category><![CDATA[openGL]]></category><category><![CDATA[simulation]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Pierre Ballif]]></dc:creator><pubDate>Sun, 17 Aug 2025 18:56:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763930426168/cf7eb581-74c0-4830-9523-1a7271742e78.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-motivation">Motivation</h1>
<p>In this series of posts, I'll try to simulate waves on a GPU using CUDA.</p>
<p>First, I'll describe the exact scope of the project. The goal is to implement some computational fluid dynamics method to solve some fluid dynamics problem - for now, let's take the upstream finite-difference scheme for the advection equation, because it's one of the easiest setups I can think of. The method will be implemented on the GPU using CUDA. There should also be some sort of visualization to ensure that the results are sensible, but it can be rather basic.</p>
<p>One thing that is not a goal is rendering photo-realistic waves. I know that this is possible on a GPU, and I might look into it more once this project is finished, but for now this is outside my scope.</p>
<p>As a first guess, the steps of the implementation will be as follows:</p>
<ul>
<li><p>basic project setup with a C++ main function, CMake, defining the data structures, and a very basic simulation loop.</p>
</li>
<li><p>getting some visualization of the results</p>
</li>
<li><p>find some numerical method and how to implement it in CUDA</p>
</li>
<li><p>couple the numerical solver in CUDA with the rest of the project</p>
</li>
<li><p>implement more difficult methods, equations, and boundary conditions (possibly using config files)</p>
</li>
</ul>
<p>Once I actually started the project, the first step turned out to be very straightforward, while the second step turned out to be much more difficult than I thought. However, it was also very instructive - I effectively learned the basics of OpenGL. So this first post is going to focus on the first two steps and is essentially an introduction to OpenGL. Once this is done, I will focus on the actual "waves on CUDA" part and write the results in a second post.</p>
<p>The code is version-controlled (of course) and publicly available on my GitHub: <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/">https://github.com/Warggr/waves-on-cuda/</a> . A lot of content is copied and commented here, but often (more often in the later parts) I will just write here a summary of what I did and why. If you want to learn more, I will point you to my repository and some tutorials that I (mostly) followed.</p>
<h1 id="heading-basic-infrastructure-without-cuda">Basic infrastructure without CUDA</h1>
<p>I define the minimum viable product as follows:</p>
<ul>
<li><p>C++ code to define a 2D surface on which waves can travel</p>
</li>
<li><p>some wave movement</p>
</li>
<li><p>wave visualization</p>
</li>
</ul>
<p>The first two should be pretty straightforward. The third will require a front-end library. I'll use <a target="_blank" href="https://www.glfw.org/">GLFW</a>, as I've seen it used in a similar project in a lecture; there are probably some better choices. GLFW is based on OpenGL, which is a cross-manufacturer graphics rendering interface. You might ask "why use OpenGL if this project is going to be specific to CUDA/NVIDIA?" I could actually use (some CUDA-specific GL) for graphics rendering as well, but for now I'll use CUDA only for the fluid dynamics part.</p>
<h2 id="heading-basic-c-code">Basic C++ code</h2>
<p>Let's jump into the C++ code. Here's a simple grid with 100x100 <code>double</code> cells, all initialized to 0 (put this into a file called <code>src/grid.hpp</code>):</p>
<pre><code class="lang-c++"><span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> once</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;array&gt;</span></span>

<span class="hljs-keyword">constexpr</span> <span class="hljs-keyword">int</span> GRID_WIDTH = <span class="hljs-number">100</span>,
    GRID_HEIGHT = <span class="hljs-number">100</span>;

<span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Grid</span> {</span>
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">array</span>&lt;<span class="hljs-built_in">std</span>::<span class="hljs-built_in">array</span>&lt;<span class="hljs-keyword">double</span>, GRID_WIDTH&gt;, GRID_HEIGHT&gt; _data;
    Grid() {
        <span class="hljs-keyword">for</span>(<span class="hljs-keyword">auto</span>&amp; row: _data) {
            <span class="hljs-keyword">for</span>(<span class="hljs-keyword">double</span>&amp; cell: row) {
                cell = <span class="hljs-number">0</span>;
            }
        }
    }
};
</code></pre>
<p>During simulation, we're going to compute the state at time step n+1 based on the state at time step n - so we need to store both. For now, I implement a super simple time scheme where the fluid travels exactly one cell per time step, and the boundary condition is always 1 (i.e. we'll have a wave of 1 going from left to right, with the grid being 0 outside the wave).</p>
<pre><code class="lang-c++"><span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> once</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;array&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;utility&gt;</span></span>

<span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Grid</span> {</span>
  ...
};

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">World</span> {</span>
    Grid grid1, grid2;
    Grid* current_grid, * other_grid;
<span class="hljs-keyword">public</span>:
    World() {
        current_grid = &amp;grid1;
        other_grid = &amp;grid2;
    }
    <span class="hljs-keyword">const</span> Grid&amp; grid() <span class="hljs-keyword">const</span> { <span class="hljs-keyword">return</span> *current_grid; } <span class="hljs-comment">// accessor used by main()</span>
    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">step</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; GRID_HEIGHT; i++) {
            (*other_grid)[i][<span class="hljs-number">0</span>] = <span class="hljs-number">1.0</span>;
            <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> j = <span class="hljs-number">1</span>; j&lt;GRID_WIDTH; j++) {
                (*other_grid)[i][j] = (*current_grid)[i][j<span class="hljs-number">-1</span>];
            }
        }
        <span class="hljs-built_in">std</span>::swap(other_grid, current_grid);
    }
};
</code></pre>
<p>To iterate over <code>Grid</code> objects as the code does, we'll need to implement a few methods on the <code>Grid</code> (the <code>const</code> versions of the iterators are going to be useful later):</p>
<pre><code class="lang-c++"><span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Grid</span> {</span>
    <span class="hljs-keyword">using</span> GridArray = <span class="hljs-built_in">std</span>::<span class="hljs-built_in">array</span>&lt;<span class="hljs-built_in">std</span>::<span class="hljs-built_in">array</span>&lt;<span class="hljs-keyword">double</span>, GRID_WIDTH&gt;, GRID_HEIGHT&gt;;
    GridArray _data;
    Grid() {
        <span class="hljs-keyword">for</span>(<span class="hljs-keyword">auto</span>&amp; row: _data) {
            <span class="hljs-keyword">for</span>(<span class="hljs-keyword">double</span>&amp; cell: row) {
                cell = <span class="hljs-number">0</span>;
            }
        }
    }
    <span class="hljs-function">GridArray::iterator <span class="hljs-title">begin</span><span class="hljs-params">()</span> </span>{ <span class="hljs-keyword">return</span> _data.begin(); }
    <span class="hljs-function">GridArray::iterator <span class="hljs-title">end</span><span class="hljs-params">()</span> </span>{ <span class="hljs-keyword">return</span> _data.end(); }
    <span class="hljs-function">GridArray::const_iterator <span class="hljs-title">begin</span><span class="hljs-params">()</span> <span class="hljs-keyword">const</span> </span>{ <span class="hljs-keyword">return</span> _data.begin(); }
    <span class="hljs-function">GridArray::const_iterator <span class="hljs-title">end</span><span class="hljs-params">()</span> <span class="hljs-keyword">const</span> </span>{ <span class="hljs-keyword">return</span> _data.end(); }
    <span class="hljs-function"><span class="hljs-built_in">std</span>::<span class="hljs-keyword">size_t</span> <span class="hljs-title">size</span><span class="hljs-params">()</span> </span>{ <span class="hljs-keyword">return</span> _data.size(); }
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">array</span>&lt;<span class="hljs-keyword">double</span>, GRID_WIDTH&gt;&amp; <span class="hljs-keyword">operator</span>[] (<span class="hljs-keyword">int</span> i) { <span class="hljs-keyword">return</span> _data[i]; }
};
</code></pre>
<p>At first I forgot to return a reference (<code>&amp;</code>) from <code>operator[]</code>, so every write went to a copy of the row and the grid itself never changed - make sure that doesn't happen to you. :) Now let's create <code>src/main.cpp</code>:</p>
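<p>To see concretely why the reference matters, here is a minimal, self-contained illustration (the <code>BadGrid</code>/<code>GoodGrid</code> names are just for this sketch, not from the project):</p>
<pre><code class="lang-c++">#include &lt;array&gt;
#include &lt;cassert&gt;

struct BadGrid {
    std::array&lt;std::array&lt;double, 2&gt;, 2&gt; _data{};
    // BUG: returns a copy of the row, so writes through it are lost
    std::array&lt;double, 2&gt; operator[](int i) { return _data[i]; }
};

struct GoodGrid {
    std::array&lt;std::array&lt;double, 2&gt;, 2&gt; _data{};
    // returns a reference, so writes reach the underlying storage
    std::array&lt;double, 2&gt;&amp; operator[](int i) { return _data[i]; }
};

int main() {
    BadGrid bad;
    bad[0][0] = 1.0;                // modifies a temporary copy
    assert(bad._data[0][0] == 0.0); // the grid itself is unchanged

    GoodGrid good;
    good[0][0] = 1.0;               // modifies the grid itself
    assert(good._data[0][0] == 1.0);
}
</code></pre>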
<pre><code class="lang-c++"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"grid.hpp"</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;iostream&gt;</span></span>

<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">main</span><span class="hljs-params">()</span> </span>{
    World world;
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> t = <span class="hljs-number">0</span>; t&lt;<span class="hljs-number">50</span>; t++) {
        world.step();
    }
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">const</span> <span class="hljs-keyword">auto</span>&amp; row: world.grid()) {
        <span class="hljs-keyword">for</span>(<span class="hljs-keyword">const</span> <span class="hljs-keyword">auto</span>&amp; cell: row) {
            <span class="hljs-keyword">if</span>(cell &gt; <span class="hljs-number">0</span>) <span class="hljs-built_in">std</span>::<span class="hljs-built_in">cout</span> &lt;&lt; cell;
            <span class="hljs-keyword">else</span> <span class="hljs-built_in">std</span>::<span class="hljs-built_in">cout</span> &lt;&lt; <span class="hljs-string">"  "</span>;
        }
        <span class="hljs-built_in">std</span>::<span class="hljs-built_in">cout</span> &lt;&lt; <span class="hljs-built_in">std</span>::<span class="hljs-built_in">endl</span>;
    }
}
</code></pre>
<p>I then compile the project using CMake (I'll leave out the steps because all of it is boilerplate) and get, as expected, a front of 1's advancing from the left edge of the grid. The program is now good enough to be committed (<a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/91a068a9497b379857a6b63f30011db20733498e">commit</a>).</p>
<h2 id="heading-integrating-glfw">Integrating GLFW</h2>
<p>I have GLFW installed, so I'll use the local version. A later TODO would be to fetch it if not already installed.</p>
<p><code>CMakeLists.txt</code>:</p>
<pre><code class="lang-plaintext">cmake_minimum_required(VERSION 3.0.0)
project(waves_on_cuda VERSION 0.1.0 LANGUAGES CXX)

find_package(glfw3 3.4 REQUIRED)

add_subdirectory(src)
</code></pre>
<p><code>src/CMakeLists.txt</code>:</p>
<pre><code class="lang-plaintext">add_executable(waves main.cpp)
target_link_libraries(waves glfw)
</code></pre>
<p>GLFW expects to be in full control of the program - you need to write your main loop as <code>while(!glfwWindowShouldClose(window))</code> and call <code>glfwPollEvents()</code> regularly. However, I would find it cleaner to have GLFW <em>not</em> be in control of the program - the main loop should be the wave simulation, and then we pass the results to the rendering after each time step.</p>
<p>I think the cleanest solution is to run GLFW in a separate thread, the UI thread. We'll have a one-way channel of communication, a queue where the main thread posts events whenever a new simulation timestep becomes available, and a final event if the interrupt signal has been received. In the future, we might make the communication more complex to e.g. skip simulating time steps if the renderer is too slow. Having this setup has multiple advantages:</p>
<ul>
<li><p>we decouple GLFW from the main program logic; if we later want to switch to another renderer (or choose the renderer at runtime), this will be easy to refactor</p>
</li>
<li><p>we make no assumptions about whether rendering or simulation is faster, and we can make both as fast as possible (either the renderer will render the same simulation timestep multiple times, or the simulation will produce timesteps that never get rendered).</p>
</li>
</ul>
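<p>The channel between the two threads can be sketched as a small mutex-and-condition-variable queue. This is a generic sketch (the <code>Channel</code> name and the int-as-timestep stand-in are mine, not the project's API); <code>close()</code> plays the role of the final shutdown event:</p>
<pre><code class="lang-c++">#include &lt;cassert&gt;
#include &lt;condition_variable&gt;
#include &lt;mutex&gt;
#include &lt;optional&gt;
#include &lt;queue&gt;
#include &lt;thread&gt;

// One-way channel: the simulation thread pushes, the UI thread pops.
template&lt;typename T&gt;
class Channel {
    std::queue&lt;T&gt; queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(T value) {
        { std::lock_guard&lt;std::mutex&gt; lock(mutex_); queue_.push(std::move(value)); }
        cv_.notify_one();
    }
    void close() { // the "final event": no more timesteps will arrive
        { std::lock_guard&lt;std::mutex&gt; lock(mutex_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until a value is available; returns nullopt once closed and drained.
    std::optional&lt;T&gt; pop() {
        std::unique_lock&lt;std::mutex&gt; lock(mutex_);
        cv_.wait(lock, [&amp;]{ return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }
};

int main() {
    Channel&lt;int&gt; channel; // stand-in for a channel of simulation timesteps
    int received = 0;
    std::thread ui([&amp;]{ while (auto step = channel.pop()) received += *step; });
    for (int t = 1; t &lt;= 3; t++) channel.push(t);
    channel.close();
    ui.join();
    assert(received == 6);
}
</code></pre>
<p>In the real program the queue would carry simulation states (or pointers to them) instead of ints, and might drop entries when the renderer falls behind.</p>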
<p>Making GLFW work actually took a couple of hours, as there were some bugs to fix. It turns out that GLFW keeps some thread-internal state, and you can't just initialize it in one thread and make it do something in another thread. Furthermore, you typically need an extension loader, such as Glad or GLEW (I used GLEW), which is also not quite straightforward to integrate. Finally, I wanted my shaders to be compiled ahead of time, instead of pasting GLSL source as strings into the C++ program. Some more details on this in the following section.</p>
<p>Some links that are generally useful as introductions to OpenGL: <a target="_blank" href="https://antongerdelan.net/opengl/hellotriangle.html">Anton's OpenGL4 tutorials</a> and <a target="_blank" href="https://www.glfw.org/docs/latest/quick.html">the GLFW quickstart guide</a>.</p>
<h2 id="heading-compiled-spir-v-shaders">Compiled (SPIR-V) shaders</h2>
<p>OpenGL uses so-called shaders for basically everything (perspective projection, transformation, coloring). They are essentially small functions that are loaded into the GPU and executed there. Shaders are written in a C-like language called GLSL. The easiest way of loading a shader looks like this:</p>
<pre><code class="lang-c++"><span class="hljs-keyword">const</span> <span class="hljs-keyword">char</span>* fragment_shader =
<span class="hljs-string">"#version 410 core\n"</span>
<span class="hljs-string">"out vec4 frag_colour;"</span>
<span class="hljs-string">"void main() {"</span>
<span class="hljs-string">"  frag_colour = vec4( 0.5, 0.0, 0.5, 1.0 );"</span>
<span class="hljs-string">"}"</span>;
GLuint fs = glCreateShader( GL_FRAGMENT_SHADER );
glShaderSource( fs, <span class="hljs-number">1</span>, &amp;fragment_shader, <span class="hljs-literal">NULL</span> );
glCompileShader( fs );
</code></pre>
<p>I find this incredibly ugly for two reasons. First, you're pasting code as a string in a C program, so you miss all the benefits of e.g. GLSL syntax highlighting. This is rather easy to fix: you could load the string from a file. Pretty much all tutorials do that after they've taught you how to use strings.</p>
<p>Second, the GLSL code is compiled whenever you run the program. This means that if you have an error in it, you will only know at runtime of the C++ program. It would be much more convenient to compile it ahead-of-time.</p>
<p>It turns out that compiling shaders all the way to GPU machine code ahead of time is not possible, but they can at least be compiled ahead of time into an intermediate format called SPIR-V. More details and examples can be found <a target="_blank" href="https://www.geeks3d.com/20200211/how-to-load-spir-v-shaders-in-opengl/">here</a> and <a target="_blank" href="https://www.khronos.org/opengl/wiki/SPIR-V#Example">here</a>.</p>
<p>So I followed those examples and used SPIR-V shaders instead of GLSL ones. I therefore have another compilation step in the CMake build. In Make, adding a new type of target would be straightforward - Make is agnostic to what language is used and how things are compiled - but in CMake, adding a non-C++ target needs the <code>add_custom_command</code> command:</p>
<pre><code class="lang-plaintext">function(compile_spirv in_file out_file)
    add_custom_command(
        OUTPUT ${out_file}
        COMMAND glslc ${in_file} -o ${out_file}
        DEPENDS ${in_file}
        VERBATIM # enables escaping; generally a good practice
    )
endfunction()

# see https://jeremimucha.com/2021/05/cmake-managing-resources/
compile_spirv(${CMAKE_CURRENT_SOURCE_DIR}/shader.frag fragment_shader.spv)
compile_spirv(${CMAKE_CURRENT_SOURCE_DIR}/shader.vert vertex_shader.spv)

add_custom_target(resources ALL DEPENDS vertex_shader.spv fragment_shader.spv)
</code></pre>
<p>By commit <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/c7c3849">c7c3849</a>, everything works and I have a GLFW window. So far the window displays precisely nothing, but it's already a good first step.</p>
<h2 id="heading-displaying-things-with-opengl">Displaying things with OpenGL</h2>
<p>We'll follow Anton's triangle tutorials. As the name says, this is used to display a triangle; however, at the end of the tutorial, he mentions that displaying a square can be done easily by replacing 3 with 4 in a few places. I tried to display a square, but it didn't work, so I decided to display 2 triangles per cell instead.</p>
<p>When displaying the grid, the following question comes up: are we using a finite volume method (i.e. each point in the <code>Grid</code> is a cell) or a finite difference or finite element method (i.e. each point is a node)? For now I will pretend the points are nodes; if we end up using a finite volume method, just pretend we're rendering the dual grid, where cell centers become nodes.</p>
<p>There are one fewer row and one fewer column of grid cells than there are nodes (we have 100x100 nodes and therefore a 99x99 grid of cells delimited by these nodes). Therefore, the list of triangles to display is an array</p>
<pre><code class="lang-c++">GLfloat triangles[grid_to_render-&gt;rows()<span class="hljs-number">-1</span>][grid_to_render-&gt;cols()<span class="hljs-number">-1</span>][<span class="hljs-number">2</span>][<span class="hljs-number">3</span>][<span class="hljs-number">3</span>];
</code></pre>
<p>i.e. (rows - 1) x (columns - 1) cells, each cell having two triangles, each triangle consisting of 3 vertices, each vertex consisting of 3 coordinates. (At first I forgot the -1 and had some weird values displayed in the OpenGL window.)</p>
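<p>This vertex-count arithmetic can be sanity-checked without OpenGL at all; here is a small standalone program (using plain <code>float</code> in place of <code>GLfloat</code>, and hard-coded 100x100 dimensions instead of the grid accessors):</p>
<pre><code class="lang-c++">#include &lt;cassert&gt;
#include &lt;cstddef&gt;

int main() {
    constexpr int rows = 100, cols = 100; // nodes
    // Same layout as the rendering code: cells x 2 triangles x 3 vertices x 3 coords
    static float triangles[rows - 1][cols - 1][2][3][3];

    std::size_t n_vertices  = sizeof(triangles) / sizeof(triangles[0][0][0][0]);
    std::size_t n_triangles = sizeof(triangles) / sizeof(triangles[0][0][0]);

    assert(n_vertices == (rows - 1) * (cols - 1) * 2 * 3); // what glDrawArrays needs
    assert(n_vertices == 3 * n_triangles); // passing n_triangles draws only a third
}
</code></pre>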
<p>We then pass the array to OpenGL and render them all using</p>
<pre><code class="lang-c++">glDrawArrays(GL_TRIANGLES, <span class="hljs-number">0</span>, <span class="hljs-keyword">sizeof</span>(triangles) / <span class="hljs-keyword">sizeof</span>(triangles[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>][<span class="hljs-number">0</span>][<span class="hljs-number">0</span>])); <span class="hljs-comment">//should be [0][0][0][0]</span>
</code></pre>
<p>You will notice that <code>sizeof(triangles[0][0][0][0])</code> is the size of one vertex (i.e. 3 coordinates), so the number passed to OpenGL is the number of vertices, not the number of triangles. At first I passed the number of triangles (<code>sizeof(triangles) / sizeof(triangles[0][0][0])</code>) and was surprised that only about a third of the columns were rendered, instead of the full grid I was expecting.</p>
<p>We can add optimizations later - for example, right now each interior node is a vertex of 6 different triangles and is therefore passed 6 times to OpenGL. The usual fix is indexed drawing (<code>glDrawElements</code> with an element buffer), or strip-based primitives such as <code>GL_TRIANGLE_STRIP</code>, but I haven't found an easy way to fit that in, so for now I stuck with the unoptimized version.</p>
<p>One improvement that is necessary, however, is perspective. Right now I can't even see which vertices are at what height. Perspective is typically done in the vertex shader.</p>
<p>Tutorials such as <a target="_blank" href="https://learnopengl.com/Getting-Started/Coordinate-Systems">https://learnopengl.com/Getting-Started/Coordinate-Systems</a> usually do not hard-code the transformation in the shader; they pass it as a parameter. In GLSL, that looks like this:</p>
<pre><code class="lang-plaintext">#version 410 core
layout (location = 0) in vec3 aPos;

uniform mat4 view; // parameters
uniform mat4 projection;

void main()
{
   gl_Position = projection * view * vec4(aPos, 1.0);
}
</code></pre>
<p>It turns out that loose (non-block) uniforms are not allowed when compiling for Vulkan - so I needed to change it a bit, see <a target="_blank" href="https://stackoverflow.com/questions/73756580/why-does-vulkan-forbid-uniforms-that-are-not-uniform-blocks">here</a> for more details and rationale. I also needed to bump the <code>#version</code> to 420 because otherwise <code>binding</code> wasn't supported.</p>
<pre><code class="lang-plaintext">#version 420 core
layout (location = 0) in vec3 aPos;

layout(binding = 0) uniform Projection {
    mat4 view;
    mat4 projection;
} projection;

void main()
{
   gl_Position = projection.projection * projection.view * vec4(aPos, 1.0);
}
</code></pre>
<p>We first need to create these parameters (which are both 4x4 transformation matrices) on the CPU, then upload them to the GPU. To create the matrices, the easiest way is to use <code>glm</code>, a header-only math library designed for working with OpenGL. The library and instructions on how to integrate it with CMake can be found <a target="_blank" href="https://github.com/g-truc/glm">here</a>.</p>
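<p>For intuition about what <code>glm::perspective</code> actually puts into that matrix, here is the standard OpenGL perspective formula hand-rolled - a sketch for illustration only (the <code>Mat4</code> type is mine); in the real code, glm is the sensible choice:</p>
<pre><code class="lang-c++">#include &lt;cassert&gt;
#include &lt;cmath&gt;

// Column-major 4x4 matrix, as OpenGL and glm store them: m[col][row].
struct Mat4 { float m[4][4] = {}; };

// Standard OpenGL perspective projection (same convention as glm::perspective).
Mat4 perspective(float fovy_radians, float aspect, float z_near, float z_far) {
    const float f = 1.0f / std::tan(fovy_radians / 2.0f);
    Mat4 p;
    p.m[0][0] = f / aspect;
    p.m[1][1] = f;
    p.m[2][2] = (z_far + z_near) / (z_near - z_far);
    p.m[2][3] = -1.0f; // puts -z into w for the perspective divide
    p.m[3][2] = (2.0f * z_far * z_near) / (z_near - z_far);
    return p;
}

int main() {
    const float pi = 3.14159265f;
    Mat4 p = perspective(pi / 2.0f, 1.0f, 0.1f, 100.0f); // 90 degree fov, square
    assert(std::fabs(p.m[1][1] - 1.0f) &lt; 1e-5f); // f = 1/tan(45 deg) = 1
    assert(p.m[2][3] == -1.0f);
}
</code></pre>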
<p>It turns out that the changes we made to the vertex shader also affect how we pass data to it. Typically, tutorials upload individual parameters using <code>glUniformMatrix4fv</code>; however, we can't use that here, because we have a uniform block, and therefore have to use another API called a uniform buffer object (UBO). The UBO-related code can be found <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/blob/6701233898f8a06482e9808da57fb1aa16623580/src/main.cpp#L274">here</a>.</p>
<p>Finally, adding a moveable camera (following the <a target="_blank" href="https://learnopengl.com/Getting-started/Camera">learnopengl.com tutorial</a>) was faster and worked better than just trying to add a perspective on my own. Two notable changes between my code and the tutorial:</p>
<ul>
<li><p>The tutorial is not object-oriented, while my code tries to do everything with the <code>MyGLFW</code> object. This is a problem for callbacks: GLFW callbacks must be plain functions, while I need them to be methods bound to the <code>MyGLFW</code> object so they can access (and update) its attributes. In C++, we can't just take a bound method and use it as a function pointer (contrary to Python, where I could just write <code>setCallback(self.callback)</code>, since a bound method can access all attributes of <code>self</code>). The solution was to use <code>glfwSetWindowUserPointer</code> (credits: <a target="_blank" href="https://stackoverflow.com/a/59633789">https://stackoverflow.com/a/59633789</a>) to associate the <code>MyGLFW</code> object with the window.</p>
</li>
<li><p>I found the mouse-look directions suggested in the tutorial unnatural, so I inverted them:</p>
</li>
</ul>
<pre><code class="lang-c++">  yaw -= xoffset;
  pitch -= yoffset;
</code></pre>
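<p>The user-pointer trick generalizes to any C-style callback API. Here is a GLFW-free sketch of the pattern, with a hypothetical <code>FakeWindow</code> standing in for <code>GLFWwindow</code> so it can run anywhere:</p>
<pre><code class="lang-c++">#include &lt;cassert&gt;

// Stand-in for GLFWwindow: a C API that can only store one opaque pointer.
struct FakeWindow { void* user_pointer = nullptr; };
void set_user_pointer(FakeWindow* w, void* p) { w-&gt;user_pointer = p; }
void* get_user_pointer(FakeWindow* w) { return w-&gt;user_pointer; }

class MyGLFW {
    FakeWindow* window;
public:
    int scroll_count = 0;
    explicit MyGLFW(FakeWindow* w) : window(w) {
        // same idea as glfwSetWindowUserPointer(window, this)
        set_user_pointer(window, this);
    }
    void on_scroll(double /*xoffset*/, double /*yoffset*/) { scroll_count++; }

    // Plain function usable as a C callback; recovers the object and dispatches.
    static void scroll_callback(FakeWindow* w, double x, double y) {
        auto* self = static_cast&lt;MyGLFW*&gt;(get_user_pointer(w));
        self-&gt;on_scroll(x, y);
    }
};

int main() {
    FakeWindow window;
    MyGLFW app(&amp;window);
    // GLFW would invoke the registered callback like this:
    MyGLFW::scroll_callback(&amp;window, 0.0, 1.0);
    assert(app.scroll_count == 1);
}
</code></pre>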
<p>With this (commit <a target="_blank" href="https://github.com/Warggr/waves-on-cuda/commit/6701233898f8a06482e9808da57fb1aa16623580">6701233</a>), I have a good enough visualization that I can recognize what's happening:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755456713031/b061073c-b8a0-4d73-9e4e-fbe300dff1ca.png" alt="Visualization of the wave with OpenGL" class="image--center mx-auto" /></p>
<p>(remember, we have a field that is 0 everywhere at first, and then a wave of height 1 enters from the right and moves leftwards for 20 timesteps - so this is exactly the visualization we were expecting). Now that the rendering is finished, I will be able to actually do some simulation / scientific computing. That will be described in the next part.</p>
]]></content:encoded></item></channel></rss>