I recently released BubbleByte on Steam, my second commercial game built with my fork of SFML.
It’s an incremental/clicker/idle game where – eventually – the player will see thousands upon thousands of particles on screen simultaneously.
Even with a basic AoS (Array of Structures) layout, the game’s performance is great thanks to the draw batching system. However, I began wondering how much performance I might be leaving on the table by not adopting a SoA (Structure of Arrays) layout. Let’s figure that out in this article!
try the benchmark!
The benchmark simulates a large number of 2D particles that continuously change position, scale, opacity, and rotation. Through an ImGui-based UI¹, you can choose the number of particles, toggle multithreading, and switch between AoS and SoA on the fly.
A demo is worth a thousand words, and since my fork of SFML supports Emscripten, you can try the benchmark directly in your browser. Play around with all the options – I’m curious to hear what results you get!
Note that the drawing step is not optimized at all – each particle is turned into an sf::Sprite instance on the fly. This approach is only viable thanks to batching.
The source code for the benchmark is available here.
particle layout
In the AoS (Array of Structures) approach, each particle is encapsulated in a single structure:
struct ParticleAoS
{
    sf::Vector2f position, velocity, acceleration;

    float scale, scaleGrowth;
    float opacity, opacityGrowth;
    float rotation, torque;
};

std::vector<ParticleAoS> particlesAoS;
Every particle’s complete set of properties is stored contiguously. While this layout is intuitive, it can be less cache-friendly when processing specific properties across all particles.
In contrast, the SoA (Structure of Arrays) layout stores each property in its own contiguous array. Using a custom template (SoAFor)², the particle data is organized as follows:
using ParticleSoA = SoAFor<sf::Vector2f, // position
                           sf::Vector2f, // velocity
                           sf::Vector2f, // acceleration
                           float,        // scale
                           float,        // scaleGrowth
                           float,        // opacity
                           float,        // opacityGrowth
                           float,        // rotation
                           float>;       // torque

ParticleSoA particlesSoA;
This columnar layout ensures that, when updating a specific field (e.g., adding acceleration to velocity), the memory accesses are more sequential and cache-friendly. The performance benefits become particularly evident when processing millions of particles in tight loops.
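The actual SoAFor implementation is linked in the footnotes; as a rough illustration of the underlying idea, a heavily simplified sketch could look like the one below (MiniSoA, pushBack, and the member names are illustrative, not the real API):

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

// Minimal illustration of a SoAFor-like wrapper: one std::vector per field,
// with `with<Is...>` visiting a chosen subset of columns in lockstep.
template <typename... Ts>
struct MiniSoA
{
    std::tuple<std::vector<Ts>...> columns;

    // Append one element by pushing each value onto its own column.
    void pushBack(const Ts&... values)
    {
        std::apply([&](auto&... cols) { (cols.push_back(values), ...); },
                   columns);
    }

    std::size_t size() const
    {
        return std::get<0>(columns).size();
    }

    // Invoke `f` with references to fields `Is...` of every element.
    template <std::size_t... Is, typename F>
    void with(F&& f)
    {
        const std::size_t n = size();

        for (std::size_t i = 0; i < n; ++i)
            f(std::get<Is>(columns)[i]...);
    }
};
```

With such a wrapper, `soa.with<3, 4>([](float& scale, const float growth) { scale += growth; })` walks only the two float columns involved, leaving every other field's memory untouched.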
particle update
Every frame, the system processes each particle – applying acceleration, updating velocity and position, modifying scale and opacity, and adjusting rotation. The update strategy differs significantly across the three approaches: “AoS”, “SoA”, and “SoA Unified”.
In the AoS approach, the update loop simply iterates through a contiguous vector of ParticleAoS objects, modifying each field:
for (ParticleAoS& p : particlesAoS)
{
    p.velocity += p.acceleration;
    p.position += p.velocity;
    p.scale    += p.scaleGrowth;
    p.opacity  += p.opacityGrowth;
    p.rotation += p.torque;
}
While straightforward, this approach may suffer from scattered memory accesses since it loads all properties for each particle even if only a subset is being updated at a time.
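To put numbers on that: assuming sf::Vector2f is two floats (as in SFML), each particle occupies 48 contiguous bytes, so a pass that only touches one float field wastes most of every cache line:

```cpp
#include <cstddef>

// Stand-in for sf::Vector2f (two floats), kept local so the example
// compiles without SFML.
struct Vec2f
{
    float x, y;
};

struct ParticleAoS
{
    Vec2f position, velocity, acceleration;
    float scale, scaleGrowth;
    float opacity, opacityGrowth;
    float rotation, torque;
};

// 3 * sizeof(Vec2f) + 6 * sizeof(float) == 48 bytes, with no padding
// since every member is 4-byte aligned.
static_assert(sizeof(Vec2f) == 8);
static_assert(sizeof(ParticleAoS) == 48);

// A pass over `scale` alone therefore uses only 4 of every 48 bytes
// loaded, so a 64-byte cache line yields at most two useful values.
```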
With SoA, each property is stored in its own contiguous array. The system updates one field across all particles before moving on to the next:
particlesSoA.with<1, 2>(
    [](sf::Vector2f& vel, const sf::Vector2f& acc) { vel += acc; });

particlesSoA.with<0, 1>(
    [](sf::Vector2f& pos, sf::Vector2f& vel) { pos += vel; });

particlesSoA.with<3, 4>(
    [](float& scale, const float growth) { scale += growth; });

particlesSoA.with<5, 6>(
    [](float& opacity, const float growth) { opacity += growth; });

particlesSoA.with<7, 8>(
    [](float& rotation, const float torque) { rotation += torque; });
This method minimizes cache misses and opens up opportunities for SIMD optimizations. However, it still requires multiple passes over the data.
The “SoA Unified” approach fuses all updates into a single loop:
particlesSoA.withAll(
    [](sf::Vector2f& pos, sf::Vector2f& vel, const sf::Vector2f acc,
       float& scale, const float scaleGrowth,
       float& opacity, const float opacityGrowth,
       float& rotation, const float torque)
{
    vel += acc;
    pos += vel;
    scale += scaleGrowth;
    opacity += opacityGrowth;
    rotation += torque;
});
By reducing the iteration count, this approach minimizes loop overhead. However, accessing multiple attributes of a single particle in one pass may limit memory prefetching benefits and could inhibit SIMD optimizations.
repopulation
As particles fade (i.e., when opacity falls below a threshold), they are removed and new particles are spawned to maintain a constant count. The repopulation is handled by resizing the vectors every frame (as needed):
const auto populateParticlesAoS = [&](const std::size_t n)
{
    if (n < particlesAoS.size())
    {
        particlesAoS.resize(n);
        return;
    }

    particlesAoS.reserve(n);

    for (std::size_t i = particlesAoS.size(); i < n; ++i)
        pushParticle([&](auto&&... xs) { particlesAoS.emplace_back(xs...); });
};
// ...equivalent version for SoA...
multithreading
With millions of particles in play, a single-threaded update loop can become a bottleneck. To address this, the simulation leverages a thread pool³ to parallelize the update work. A helper lambda distributes particle processing across available CPU cores:
const auto doInBatches = [&](const std::size_t totalParticles, auto&& task)
{
    const std::size_t particlesPerBatch = totalParticles / nWorkers;

    std::latch latch(static_cast<std::ptrdiff_t>(nWorkers));

    for (std::size_t i = 0; i < nWorkers; ++i)
    {
        pool.post([&, i]
        {
            const std::size_t batchStart = i * particlesPerBatch;
            const std::size_t batchEnd   = (i == nWorkers - 1)
                                               ? totalParticles
                                               : (i + 1) * particlesPerBatch;

            task(i, batchStart, batchEnd);
            latch.count_down();
        });
    }

    latch.wait();
};
This lambda divides the particle array into non-overlapping chunks that are processed concurrently. When multithreading is enabled, the repopulation step becomes the bottleneck – I’m sure there’s a clever way to parallelize that step too (for example, by processing in chunks and compressing at the end), but that’s an exercise for the reader :)
benchmark results
(Hardware used: Intel Core i9-13900K, NVIDIA RTX 4090.)
The results confirm that SoA consistently outperforms AoS, especially as the number of particles increases. The “SoA Unified” update method yields mixed results – it sometimes provides further gains by reducing iteration overhead, but not consistently enough to be included in later benchmarks.
Incorporating a repopulation routine adds extra overhead because particles that reach zero opacity are removed and new ones are spawned. This extra work increases update times in both single-threaded and multi-threaded modes. Even so, when drawing is included – “Multi-Threaded + Repopulation + Draw” – the benefit of using SoA over AoS remains significant, despite the additional bottleneck from rendering calls.
Lower update time (ms) is better. Higher FPS is better.
shameless self-promotion
I offer training, mentoring, and consulting services. If you are interested, check out romeo.training; alternatively, you can reach out at mail (at) vittorioromeo (dot) com or on Twitter.

Check out my newly-released game on Steam: BubbleByte – it’s only $3.99, and it’s one of those games that you can either play actively as a timewaster or more passively to keep you company in the background while you do something else.
My book “Embracing Modern C++ Safely” is available from all major resellers.
- For more information, read this interview: “Why 4 Bloomberg engineers wrote another C++ book”
If you enjoy fast-paced open-source arcade games with user-created content, check out Open Hexagon, my VRSFML-powered game available on Steam and on itch.io.
- Open Hexagon is a community-driven spiritual successor to Terry Cavanagh’s critically acclaimed Super Hexagon.
1. My fork of SFML supports ImGui out of the box! ↩︎
2. The SoAFor<Ts...> utility is a wrapper over a collection of std::vector<Ts>... that I wrote, providing a (somewhat) nice API to iterate over arbitrary subsets of “fields” by specifying their indices as template parameters. Code here. ↩︎
3. I made it myself. It’s pretty basic, but gets the job done. Code here. ↩︎