# Writing efficient shaders

When it comes to optimizing shaders for a wide range of devices, there is no
perfect strategy. Drivers written by different vendors for different hardware
vary in behavior, so any attempt at optimizing against a specific driver will
likely result in a performance loss for some other drivers that end users will
run Flutter apps against.

That being said, newer graphics devices have architectures that allow for both
simpler shader compilation and better handling of traditionally slow shader
code. In fact, ostensibly "unoptimized" shader code filled with branches may
significantly outperform the equivalent branchless "optimized" shader code when
targeting newer GPU architectures. (See the "Don't flatten simple varying
branches" recommendation for an explanation of this with respect to different
architectures.)

Flutter actively supports mobile devices that are more than a decade old, which
requires us to write shaders that perform well across multiple generations of
GPU architectures featuring radically different behavior. Most optimization
choices are direct tradeoffs between these GPU architectures, so having an
accurate mental model of how these common architectures maximize parallelism is
essential for making good decisions while authoring shaders.

For these reasons, it's also important to profile shaders against some of the
older devices that Flutter can target (such as the iPhone 6s) when making
changes intended to improve shader performance.

Even though branching behavior is largely architecture dependent and should
remain the same across graphics APIs, it's still a good idea to test changes
against the different backends supported by Impeller (Metal and GLES). Early
stage shader compilation (as well as the high level shader code generated by
ImpellerC) may vary quite a bit between APIs.

## GPU architecture primer

GPUs are designed to have functional units running single instructions over many
elements (the "data path") each clock cycle. This is the fundamental aspect of
GPUs that makes them work well for massively parallel compute work; they're
essentially specialized SIMD engines.

GPU parallelism generally comes in two broad architectural flavors:
**Instruction-level parallelism** and **Thread-level parallelism** -- these
architecture designs handle shader branching very differently and are covered
in the sections below. In general, older GPU architectures (on some products
released before ~2015) leverage instruction-level parallelism, while most if not
all newer GPUs leverage thread-level parallelism.

Some of the earliest GPU architectures had no runtime control flow primitives at
all (i.e. jump instructions), and compilers for these architectures needed to
handle branches ahead of time by unrolling loops, compiling a different program
for every possible branch combination, and then executing all of them. However,
virtually all GPU architectures in use today have instruction-level support for
dynamic branching, and it's quite unlikely that we'll come across a mobile
device capable of running Flutter that doesn't. For example, the old devices we
test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic
runtime branching. For these reasons, the optimization advice in this document
isn't aimed at branchless architectures.

### Instruction-level parallelism

Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
SIMD vector or array instructions to maximize the number of computations
performed per clock cycle on each functional unit. This means that the shader
compiler must figure out which parts of the program are safe to parallelize
ahead of time and emit appropriate instructions. This presents a problem for
certain kinds of branches: If the compiler doesn't know that the same decision
will always be taken by all of the data lanes at runtime (meaning the branch is
_varying_), it can't safely emit SIMD instructions while compiling the branch.
The result is that instructions within varying (non-uniform) branches can't be
parallelized, so they run at roughly `1/[data width]` of the throughput of
non-branched instructions.

VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile time reasoning
disadvantage that SIMD does.

### Thread-level parallelism

Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and
parallelize instructions at runtime by running the same instruction over many
threads in groups often referred to as "warps" (Nvidia terminology) or
"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
warp/wavefront. This design is also commonly referred to as SIMT ("Single
Instruction Multiple Thread").

To handle branching, SIMT programs use special instructions to write a thread
mask that determines which threads are activated/deactivated in the warp; only
the warp's activated threads will actually execute instructions. Given this
setup, the program can first deactivate threads that failed the branch
condition, run the positive path, invert the mask, run the negative path, and
finally restore the mask to its original state prior to the branch. The compiler
may also insert mask checks to skip over branches when all of the threads have
been deactivated.
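
To make the masking sequence concrete, the sketch below annotates an ordinary
GLSL branch with the steps a SIMT GPU conceptually performs. The shader is a
hypothetical example (the `#version`, the `color` input, and the `0.5`
threshold are all assumptions), and the comments describe the model from this
section rather than any particular vendor's instruction set.

```glsl
#version 450

layout(location = 0) in vec4 color;
layout(location = 0) out vec4 frag_color;

void main() {
  // Step 1: every active thread in the warp evaluates the condition.
  // Step 2: the current thread mask is saved, and threads for which the
  //         condition is false are deactivated.
  if (color.a > 0.5) {
    // Step 3: only threads that passed the condition execute this path.
    //         If no thread passed, a mask check can skip the path entirely.
    frag_color = vec4(color.rgb, 1.0);
  } else {
    // Step 4: the mask is inverted, so only threads that failed the
    //         condition execute this path (again skippable if none did).
    frag_color = vec4(0.0);
  }
  // Step 5: the saved mask is restored and the whole warp continues together.
}
```

If every thread in a warp agrees on the condition, only one of the two paths
runs for that warp. If they disagree, both paths run back-to-back, each with
part of the warp idle.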

Therefore, the best case scenario for a SIMT branch is that it only incurs the
cost of the conditional. The worst case scenario is that some of the warp's
threads fail the conditional and the rest succeed, requiring the program to
execute both paths of the branch back-to-back in the warp. Note that this
compares very favorably with the SIMD scenario for non-uniform/varying branches,
as SIMT is able to retain significant parallelism in all cases, whereas SIMD
cannot.

## Recommendations

### Don't flatten uniform or constant branches

Uniforms are pipeline variables accessible within a shader which are guaranteed
to not vary during a GPU program's invocation.

Example of a uniform branch in action:

```glsl
uniform struct FrameInfo {
  mat4 mvp;
  bool invert_y;
} frame_info;

in vec2 position;

void main() {
  gl_Position = frame_info.mvp * vec4(position, 0, 1);

  if (frame_info.invert_y) {
    gl_Position *= vec4(1, -1, 1, 1);
  }
}
```

While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary to achieve good runtime performance for
uniform branches on widely used mobile architectures:

* On SIMT architectures, branching on a uniform means that every thread in every
  warp will resolve to the same path, so only one path in the branch will ever
  execute.
* On VLIW/SIMD architectures, the compiler can be certain that all of the
  elements in the data path for every functional unit will resolve to the same
  path, and so it can safely emit fully parallelized instructions for the
  contents of the branch!
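
For contrast, here is a sketch of what "flattening" that uniform branch might
look like, which is the kind of rewrite this recommendation advises against. It
reuses the `FrameInfo` uniform from the example above; the `#version` line and
the `y_scale` variable are assumptions made for illustration.

```glsl
#version 450

uniform struct FrameInfo {
  mat4 mvp;
  bool invert_y;
} frame_info;

in vec2 position;

void main() {
  gl_Position = frame_info.mvp * vec4(position, 0.0, 1.0);

  // Flattened form: the branch is replaced with arithmetic that always runs.
  // float(bool) is 0.0 or 1.0, so mix() selects a scale of 1.0 or -1.0.
  float y_scale = mix(1.0, -1.0, float(frame_info.invert_y));
  gl_Position.y *= y_scale;
}
```

This version is harder to read and, per the reasoning above, is not expected to
buy anything on the architectures discussed here, since the branch it removes
was already uniform.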

### Don't flatten simple varying branches

Widely used mobile GPU architectures generally don't benefit from flattening
simple varying branches. While it's true that compilers for VLIW/SIMD-based
architectures can't emit efficient instructions for these branches, the
detrimental effects of this are minimal with small branches. For modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branch solutions. Also, some shader compilers can collapse
small branches automatically.

Instead of this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}
```

...just do this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}
```

It's easier to understand, doesn't prevent compiler optimizations, runs
measurably faster on SIMT devices, and works out to be at most marginally slower
on older VLIW devices.

### Avoid complex varying branches

Consider the following fragment shader:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  frag_color = result;
}
```

Note that `color` is _varying_. Specifically, it's an interpolated output from a
vertex shader -- so the value may change from fragment to fragment (as opposed
to a _uniform_ or _constant_, which will remain the same for the whole draw
call).

On SIMT architectures, this branch incurs very little overhead because
`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
the threads in a given warp.
However, architectures that use instruction-level parallelism (VLIW or SIMD)
can't handle this branch efficiently because the compiler can't safely emit
parallelized instructions on either side of the branch.

To achieve maximum parallelism across all of these architectures, one possible
solution is to unbranch the more complex path:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  frag_color = DoExtremelyExpensiveThing(color);

  if (color.a == 0) {
    frag_color = vec4(0);
  }
}
```

However, this may be a big tradeoff depending on how this shader is used -- this
solution will perform worse on SIMT devices in cases where `color.a == 0` across
all threads in a given warp, since `DoExtremelyExpensiveThing` will no longer be
skipped with this solution! So if the cheap branch path covers a large solid
portion of a draw call's coverage area, alternative designs may be favorable.

### Beware of return branching

Consider the following GLSL function:

```glsl
vec4 FrobnicateColor(vec4 color) {
  if (color.a == 0) {
    return vec4(0);
  }

  return DoExtremelyExpensiveThing(color);
}
```

At first glance, this may appear cheap due to its simple contents, but this
branch has two exclusive paths in practice, and the generated shader assembly
will reflect the same behavior as this code:

```glsl
vec4 FrobnicateColor(vec4 color) {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  return result;
}
```

The same concerns and advice apply to this branch as the scenario under "Avoid
complex varying branches".
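
As a sketch of how that advice might carry over, the function below moves the
expensive call out of the branch and keeps only a small varying branch after
it. The body of `DoExtremelyExpensiveThing` is a hypothetical stand-in (this
document never defines it), and the `#version` line and the `main` wrapper are
likewise assumptions added so that the snippet is self-contained.

```glsl
#version 450

layout(location = 0) in vec4 color;
layout(location = 0) out vec4 frag_color;

// Hypothetical placeholder for some costly per-fragment computation.
vec4 DoExtremelyExpensiveThing(vec4 c) {
  return vec4(pow(c.rgb, vec3(2.2)), c.a);
}

vec4 FrobnicateColor(vec4 c) {
  // The expensive path always runs, so no varying branch surrounds it.
  vec4 result = DoExtremelyExpensiveThing(c);

  // Only this small assignment remains inside a varying branch.
  if (c.a == 0.0) {
    result = vec4(0.0);
  }

  return result;
}

void main() {
  frag_color = FrobnicateColor(color);
}
```

The same tradeoff applies here as in the previous section: on SIMT devices the
expensive call is no longer skipped for warps in which every fragment has
`c.a == 0.0`.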

### Use lower precision whenever possible

Most desktop GPUs don't support 16-bit (mediump) or 8-bit (lowp) floating point
operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
according to the
[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
using lower precision floating point operations is more efficient on these
devices.
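
A minimal sketch of how precision qualifiers are typically written in a GLSL ES
fragment shader follows. The names `tex`, `v_uv`, and `v_tint` are
hypothetical, and `#version 300 es` is an assumption. Which values can safely
drop to `mediump` depends on the shader and should be checked on real devices.

```glsl
#version 300 es

// Default precision for every float-based type in this shader.
precision mediump float;

uniform sampler2D tex;

// Texture coordinates often need more range and precision than colors,
// so this input explicitly opts back in to highp.
in highp vec2 v_uv;

// Color values in the [0, 1] range usually tolerate mediump.
in vec4 v_tint;

out vec4 frag_color;

void main() {
  vec4 texel = texture(tex, v_uv);
  frag_color = texel * v_tint;
}
```

Setting a low default and raising precision only where it is needed keeps the
high precision cases visible at the point of declaration.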