Frequently Asked Questions

How can I see the code generated by Devito?

After you build an op=Operator(...) implementing one or more equations, you can use print(op) to see the generated low level code. The example below builds an operator that takes a 1/2 cell forward shifted derivative of the Function f and puts the result in the Function g.

import numpy as np
import devito
from devito import Grid, Function, Eq, Operator
grid = Grid(shape=(21,), extent=(1.0,), origin=(0.0,), dtype=np.float32)
x = grid.dimensions[0]
f = Function(name='f', grid=grid, space_order=8)
g = Function(name='g', grid=grid, space_order=8)
eq_dfdx = Eq(g, getattr(f, 'dx')(x+0.5*x.spacing))
op = Operator(eq_dfdx)
print(op)

And the output:

#define _POSIX_C_SOURCE 200809L
#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "xmmintrin.h"
#include "pmmintrin.h"

struct dataobj
{
  void *restrict data;
  int * size;
  int * npsize;
  int * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
} ;

struct profiler
{
  double section0;
} ;


int Kernel(struct dataobj *restrict f_vec, struct dataobj *restrict g_vec, const float h_x, struct profiler * timers, const int x_M, const int x_m)
{
  float (*restrict f) __attribute__ ((aligned (64))) = (float (*)) f_vec->data;
  float (*restrict g) __attribute__ ((aligned (64))) = (float (*)) g_vec->data;

  /* Flush denormal numbers to zero in hardware */
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

  START(section0)
  for (int x = x_m; x <= x_M; x += 1)
  {
    g[x + 8] = (3.57142857e-3F*(f[x + 4] - f[x + 12]) + 3.80952381e-2F*(-f[x + 5] + f[x + 11]) + 2.0e-1F*(f[x + 6] - f[x + 10]) + 8.0e-1F*(-f[x + 7] + f[x + 9]))/h_x;
  }
  STOP(section0,timers)

  return 0;
}

How can I see the compilation command with which Devito compiles the generated code

Set the environment variable DEVITO_LOGGING=DEBUG. When an Operator gets compiled, the used compilation command will be emitted to stdout.

If nothing seems to change, it is possible that no compilation is happening under-the-hood as all kernels have already been compiled in a previous run. You will then have to clear up the Devito kernel cache. From the Devito root directory, run:

python scripts/clear_devito_cache.py

Where does the generated code go and how do I look at it

Devito stores the generated code as well as the jit-compiled libraries in a temporary directory. By setting the environment variable DEVITO_LOGGING=DEBUG, Devito will show, amongst other things, the absolute path to the generated code.

Can I change the directory where Devito stashes the generated code

Yes, just set the environment variable TMPDIR to your favorite location.

I create an Operator, look at the generated code, and the equations appear in a different order than I expected.

The Devito compiler computes a topological ordering of the input equations based on data dependency analysis. Heuristically, some equations might be moved around to improve performance (e.g., data locality). Therefore, the order of the equations in the generated code might be different than that used as input to the Operator.

Do Devito Operators release the GIL when executing C code?

Yes. Devito uses ctypes.CDLL to call the JIT C code which releases the GIL.

What performance optimizations does Devito apply

Take a look here and in particular at this notebook.

Does Devito optimize complex expressions

Devito applies several performance optimizations to improve the number of operations (“operation count”) in complex expressions. These optimizations are designed to do a really good job but also be reasonably fast. One such pass attempts to factorize as many common terms as possible in expressions in order to reduce the operation count. We will construct a demonstrative example below that has a common term that is not factored out by the Devito optimization. The difference in floating-point operations per output point for the factoring of that term is about 10 percent, and the generated C is different, but numerical outputs of running the two different operators are indistinguishable to machine precision. In terms of actual performance, the (few) missed factorization opportunities may not necessarily be a relevant issue: as long as the code is not heavily compute-bound, the runtimes may only be slightly higher than in the optimally-factorized version.

Operator 1:

ux_update = t.spacing**2 * b * \
    ((c33 * u_x.dx(x0=x+x.spacing/2)).dx(x0=x-x.spacing/2) +
     (c55 * u_x.dz(x0=z+z.spacing/2)).dz(x0=z-z.spacing/2) +
     (c13 * u_z.dz(x0=z+z.spacing/2)).dx(x0=x-x.spacing/2) +
     (c55 * u_z.dx(x0=x+x.spacing/2)).dz(x0=z-z.spacing/2)) + \
    (2 - t.spacing * wOverQ) * u_x + (t.spacing * wOverQ - 1) * u_x.backward 
stencil_x = Eq(u_x.forward, ux_update)
print("\n", stencil_x)
op = Operator([stencil_x])

Operator 2:

ux_update = \
    t.spacing**2 * b * (c33 * u_x.dx(x0=x+x.spacing/2)).dx(x0=x-x.spacing/2) + \
    t.spacing**2 * b * (c55 * u_x.dz(x0=z+z.spacing/2)).dz(x0=z-z.spacing/2) + \
    t.spacing**2 * b * (c13 * u_z.dz(x0=z+z.spacing/2)).dx(x0=x-x.spacing/2) + \
    t.spacing**2 * b * (c55 * u_z.dx(x0=x+x.spacing/2)).dz(x0=z-z.spacing/2) + \
    (2 - t.spacing * wOverQ) * u_x + (t.spacing * wOverQ - 1) * u_x.backward 
stencil_x = Eq(u_x.forward, ux_update)
print("\n", stencil_x)
op = Operator([stencil_x])

Output 1:

Eq(u_x(t + dt, x, z), dt**2*(Derivative(c13(x, z)*Derivative(u_z(t, x, z), z), x) + Derivative(c33(x, z)*Derivative(u_x(t, x, z), x), x) + Derivative(c55(x, z)*Derivative(u_x(t, x, z), z), z) + Derivative(c55(x, z)*Derivative(u_z(t, x, z), x), z))*b(x, z) + (-dt*wOverQ(x, z) + 2)*u_x(t, x, z) + (dt*wOverQ(x, z) - 1)*u_x(t - dt, x, z))
Operator `Kernel` generated in 1.26 s
  * lowering.Expressions: 0.61 s (48.7 %)
  * lowering.Clusters: 0.52 s (41.5 %)
     * specializing.Clusters: 0.31 s (24.8 %)
Flops reduction after symbolic optimization: [1160 --> 136]

Output 2:

Eq(u_x(t + dt, x, z), dt**2*b(x, z)*Derivative(c13(x, z)*Derivative(u_z(t, x, z), z), x) + dt**2*b(x, z)*Derivative(c33(x, z)*Derivative(u_x(t, x, z), x), x) + dt**2*b(x, z)*Derivative(c55(x, z)*Derivative(u_x(t, x, z), z), z) + dt**2*b(x, z)*Derivative(c55(x, z)*Derivative(u_z(t, x, z), x), z) + (-dt*wOverQ(x, z) + 2)*u_x(t, x, z) + (dt*wOverQ(x, z) - 1)*u_x(t - dt, x, z))
Operator `Kernel` generated in 1.12 s
  * lowering.Expressions: 0.59 s (53.0 %)
  * lowering.Clusters: 0.40 s (35.9 %)
     * specializing.Clusters: 0.31 s (27.8 %)
Flops reduction after symbolic optimization: [1169 --> 149]

Other optimizations include common sub-expressions elimination, hoisting of loop-invariant code, and detection of cross-iteration redundancies (e.g., due to high order derivatives). Below we show the cumulative impact of all these optimizations in a number of seismic operators.

Flops reduction

For more info, take a look at this notebook.

How are abstractions used in the seismic examples

Many Devito examples are provided that demonstrate application for specific problems, including e.g. fluid mechanics and seismic modeling. We focus in this question on seismic modeling examples that provide convenience wrappers to build differential equations and create Devito Operators for various types of modeling physics including isotropic and anisotropic, acoustic and elastic.

These examples (link) use abstractions to remove details from the methods that actually build the operators. The idea is that at the time you build a Devito operator, you don’t need specific material parameter arrays (e.g. velocity or density or Thomsen parameter), and you don’t need specific locations of sources and receiver instruments. All you need to build the operator is a placeholder that can provide the dimensionality and (for example) the spatial order of finite difference approximation you wish to employ. In this way you can build and return functioning highly optimized operators to which you can provide the specific implementation details at runtime via command line arguments.

An example of this abstraction (or placeholder design pattern) in operation is the call to the isotropic acoustic AcousticWaveSolver.forward method that returns a Devito operator via the ForwardOperator method defined in operators.py.

You will note that this method uses placeholders for the material parameter arrays and the source and receiver locations, and then at runtime uses arguments provided to the returned Operator to provide state to the placeholders. You can see this happen on lines 112-113 in wavesolver.py.

What environment variables control how Devito works

How to get the list of Devito environment variables

You can get the list of environment variables and all their possible values with the following python code:

from devito import print_defaults
print_defaults()

How to set Devito environment variables

These environment variables can either be set from the shell or programmatically. Note that when setting these variables programmatically, you need to use lower case, and the leading DEVITO is omitted. Values are case sensitive, meaning openmp is accepted and OPENMP will throw an error. Below are examples of setting these variables in the shell (before running python) and from python (before executing devito code).

Method Example
bourne shell DEVITO_LANGUAGE=“openmp”
c shell setenv DEVITO_LANGUAGE “openmp”
programmatically configuration[‘language’] = ‘openmp’

Description of Devito environment variables

DEVITO_ARCH

Used to select a specific “backend compiler”. The backend compiler takes as input the code generated by Devito and produces a shared object. Supported backend compilers are gcc, icc, pgcc, clang. For each of these compilers, Devito uses some preset compilation flags (e.g., -O3, -march=native, -fast-math). If this environment variable is left unset, Devito will attempt auto-detection of the most suitable backend compiler available on the underlying system.

DEVITO_PLATFORM

This environment variable is mostly needed when running on GPUs, to ask Devito to generate code for a particular device (see for example this tutorial). Can be also used to specify CPU architectures such as Intel’s – Haswell, Broadwell, SKL and KNL – ARM, AMD, and Power. Often one can ignore this variable because Devito typically does a decent job at auto-detecting the underlying platform.

DEVITO_LANGUAGE

Specify the generated code language. The default is C, which means sequential C. Use openmp to emit C+OpenMP or openacc for C+OpenACC.

DEVITO_PROFILING

Choose the performance profiling level. This is also automatically increased with DEVITO_LOGGING=PERF or DEVITO_LOGGING=DEBUG, in which case this environment variable can be ignored.

DEVITO_DEVELOP

Mostly useful for developers when chasing segfaults.

DEVITO_OPT

Choose the performance optimization level. By default set to the maximum level, advanced.

DEVITO_MPI

Controls MPI in Devito. Use 1 to enable MPI. The most powerful MPI mode is called “full”, and is activated setting DEVITO_MPI=full. The “full” mode implements a number of optimizations including computation/communication overlap.

DEVITO_AUTOTUNING

Search across a set of block shapes to maximize the effectiveness of loop tiling (aka cache blocking). You can choose between off (default), basic, aggressive, max. A more aggressive autotuning should eventually result in better runtime performance, though the search phase will take longer.

DEVITO_LOGGING

Run with DEVITO_LOGGING=DEBUG to find out the specific performance optimizations applied by an Operator, how auto-tuning is getting along, to emit the command used to compile the generated code, to emit more performance metrics, and much more.

DEVITO_FIRST_TOUCH

Use DEVITO_FIRST_TOUCH=1 in combination with DEVITO_LANGUAGE=openmp to use an OpenMP parallel Operator for initialization of Function data. This should ensure NUMA locality for data access when running “flatten” OpenMP across multiple sockets on the same node (as opposed to using MPI – one MPI process per socket – plus OpenMP, which is the recommended way).

DEVITO_JIT_BACKDOOR

You can set DEVITO_JIT_BACKDOOR=1 to test custom modifications to the generated code. For more info, take a look at this FAQ.

DEVITO_IGNORE_UNKNOWN_PARAMS

Set DEVITO_IGNORE_UNKNOWN_PARAMS=1 to avoid Devito raising an exception if one attempts to pass an unknown argument to op.apply().

What are the accepted combinations of PLATFORM, ARCH, and LANGUAGE

LANGUAGE={C,openmp}

These two languages can be used with virtually any PLATFORM and ARCH.

With a device PLATFORM (e.g., nvidiaX, amdgpuX, or intelgpuX), the compiler will generate OpenMP code for device offloading.

When using OpenMP offloading, it is recommended to stick to the corresponding vendor compiler, so ARCH=amdclang for AMD, ARCH={icc,icx,intel} for Intel, and ARCH=nvc for NVidia.

LANGUAGE=openacc

Requires: PLATFORM=nvidiaX and ARCH=nvc.

The legacy PGI compiler is also supported via ARCH=pgcc.

LANGUAGE=cuda

DevitoPRO only.

Requires: PLATFORM=nvidiaX and ARCH=cuda.

LANGUAGE=hip

DevitoPRO only.

Requires: PLATFORM=amdgpuX and ARCH=hip.

LANGUAGE=sycl

DevitoPRO only.

Requires: PLATFORM=intelgpuX or PLATFORM=intel64, and ARCH=sycl.

How do you run the unit tests from the command line

In addition to the tutorials, the unit tests provide an excellent way to see how the Devito API works with small self-contained examples. You can exercise individual unit tests with the following python code:

pytest <test.py>
pytest -vs <test.py>  [more detailed log]

What is the difference between f() and f[] notation

Devito offers a functional language to express finite difference operators. This is introduced here and systematically used throughout our examples and tutorials. The language relies on what in jargon we call the “f() notation”.

>>> from devito import Grid, Function
>>> grid = Grid(shape=(5, 6))
>>> f = Function(name='f', grid=grid, space_order=2)
>>> f.dx
Derivative(f(x, y), x)
>>> f.dx.evaluate
-f(x, y)/h_x + f(x + h_x, y)/h_x

Sometimes, one wishes to escape the constraints of the language. Instead of taking derivatives, other special operations are required. Or perhaps, a specific grid point needs to be accessed. In such a case, one could use the “f[] notation” or “indexed notation”. Following on from the example above:

>>> x, y = grid.dimensions
>>> f[x + 1000, y]
f[x + 1000, y]

The indexed object can be used at will to construct Eqs, and they can be mixed up with objects stemming from the “f() notation”.

>>> f.dx + f[x + 1000, y]
Derivative(f(x, y), x) + f[x + 1000, y]

However, while the “f() notation” is substantially safe – the language is designed such that only legal stencil expressions are built – the “f[] notation” is not, and one can easily end up creating operators performing out-of-bounds accesses. So use it judiciously!

What is the indexed notation

The indexed notation, or “f[] notation”, is discussed here

Is there a flow chart

needs links 1. Equation Lowering - Indexification - Substitutions - Domain alignment - Eq -> LoweredEq 2. Local Analysis 3. Clustering 4. Symbolic Optimization via Devito Symbolic Engine (DSE) 5. IET (Iteration/Expression Tree) Construction 6. IET Analysis 7. IET Optimization (DLE/YLE) - Loop optimization via Devito Loop Engine (DLE) - Loop Tiling / Cache Blocking - SIMD - OpenMP - MPI 8. Synthetic 9. JIT (Just In Time) Compilation

What’s up with object.data

The .data property which is associated with objects such as Constant, Function and SparseFunction (along with their derivatives) represents the ‘numerical’ value of the ‘data’ associated with that particular object. For example, a Constant will have a single numerical value associated with it as shown in the following snippet

from devito import Constant

c = Constant(name='c')
c.data = 2.7

print(c.data)
2.7

Then, a Function defined on a Grid will have a data value associated with each of the grid points (as shown in the snippet below) and so forth.

import numpy as np
from devito import Grid, Function

grid = Grid(shape=(4, 4), extent=(1, 1))
f = Function(name='f', grid=grid)
f.data[:] = np.arange(16).reshape(grid.shape)

print(f.data)
[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]
 [12. 13. 14. 15.]]

How do I create and N-dimensional grid

Grids are often created via, e.g.,

grid = Grid(shape=(5, 5))

where printing the grid object then returns:

Grid[extent=(1.0, 1.0), shape=(5, 5), dimensions=(x, y)]

Here we see the grid has been created with the ‘default’ dimensions x and y. If a grid is created and passed a shape of (5, 5, 5) we’ll see that in addition it has a z dimension. However, what if we want to create a grid with, say, a shape of (5, 5, 5, 5)? For this case, we’ve now run out of the dimensions defined by default and hence need to create our own dimensions to achieve this. This can be done via, e.g.,

a = SpaceDimension('a')
b = SpaceDimension('b')
c = SpaceDimension('c')
d = SpaceDimension('d')
grid = Grid(shape=(5, 5, 5, 5), dimensions=(a, b, c, d))

where now, printng grid we get

Grid[extent=(1.0, 1.0, 1.0, 1.0), shape=(5, 5, 5, 5), dimensions=(a, b, c, d)]

and grid.shape returns

(5, 5, 5, 5)

What are the keys to fast code

The code generated by devito is designed to run fast on CPU, GPU and clusters thereof. Broadly outlined, some of the mechanics for generating fast code are: ### CPU * Loop tiling (or “cache blocking”) * Loop fusion (to maximize data locality) Loop fission (to maximize parallelism) * Flop-reducing transformations (CSE, cross-stencil redundancies, factorization, hoisting) * OpenMP threading * OpenMP-based SIMD * Alignment to promote SIMD vectorization ### GPU * Longer pipelines, less travel to host (do more work on the GPU before communicating data between host and GPU) ### Clusters of CPUs/GPUs * Computation/communication overlap with prodding of the asynchronous progress engine * Avoidance of unnecessary halo exchanges * Reshuffling of halo exchanges * Threaded data packing/unpacking

As time increases in the finite difference evolution, are wavefield arrays “swapped” as you might see in c/c++ code

In c/c++ code using two wavefield arrays for second order acoustics, you might see code like the following to “swap” the wavefield arrays at each time step:

    float *p_tmp = p_old;
    p_old = p_cur;
    p_cur = p_tmp;

Instead of swapping arrays, devito uses the modulus of a time index to map increasing temporal indices [0, 1, 2, 3, 4, 5, …] into cyclic indices [0, 1, 2, 0, 1, 2, …].

What units are typically used in Devito examples

  • Sampling rates: msec
  • Frequency: KHz
  • Velocity: km/sec

Can I subclass basic types such as TimeFunction

Yes, just like we did for our seismic examples, for example, the PointSource class. A few caveats are necessary, though.

First, classes such as Function or SparseTimeFunction are inherently complex. In fact, SparseTimeFunction itself is a subclass of Function. The whole class hierarchy is modular and well-structured, but at the same time, it’s depth and offers several hooks to allow specialization by subclasses – for example, all methods starting with __ such as __init_finalize__ or __shape_setup__. It will take some time to digest it. Luckily, you will only need to learn a little of this, at least for simple subclasses.

Second, you must know that these objects are subjected to so-called reconstruction during compilation. Objects are immutable inside Devito; therefore, even a straightforward symbolic transformation such as f[x] -> f[y] boils down to performing a reconstruction, that is, creating a whole new object. Since f carries around several attributes (e.g., shape, grid, dimensions), each time Devito performs a reconstruction, we only want to specify which attributes are changing – not all of them, as it would make the code ugly and incredibly complicated. The solution to this problem is that all the base symbolic types inherit from a common base class called Reconstructable; a Reconstructable object has two special class attributes, called __rargs__ and __rkwargs__. If a subclass adds a new positional or keyword argument to its __init_finalize__, it must also be added to __rargs__ or __rkwargs__, respectively. This will provide Devito with enough information to perform a reconstruction when it’s needed during compilation. The following example should clarify:

class Foo(Reconstructable):
    __rargs__ = ('a', 'b')
    __rkwargs__ = ('c',)

    def __init__(self, a, b, c=4):
           self.a = a
           self.b = b
           self.c = c

    def __repr__(self):
        return "x(%d, %d, %d)" % (self.a, self.b, self.c)

class Bar(Foo):
    __rkwargs__ = Foo.__rkwargs__ + ('d',)

    def __init__(self, *args, d=6, **kwargs)
            super().__init__(*args, **kwargs)
            self.d = d

You are unlikely to care about how reconstruction works in practice, but here are a few examples for a = Foo(3, 5) to give you more context.

a._rebuild() -> "x(3, 5, 4)" (i.e., copy of `a`).
a._rebuild(4) -> "x(4, 5, 4)"
a._rebuild(4, 7) -> "x(4, 7, 4)"
a._rebuild(c=5) -> "x(3, 5, 5)"
a._rebuild(1, c=7) -> "x(1, 5, 7)"

How can I change the compilation flags (for example, I want to change the optimization level from -O3 to -O0)

There is currently no API to achieve this straightforwardly. However, there are three workarounds:

  • hacky way: change the flags explicitly in the Devito source code. In Devito v4.0, you can do that here
  • via env vars: use a CustomCompiler – just leave the DEVITO_ARCH environment variable unset or set it to 'custom'. Then, export CFLAGS="..." to tell Devito to use the exported flags in place of the default ones.
  • programmatically: subclass one of the compiler classes and set self.cflags to whatever you need. Do not forget to add the subclass to the compiler registry. For example, you could do
from devito import configuration, compiler_registry
from devito.compiler import GNUCompiler

class MyOwnCompiler(GNUCompiler):
    def __init__(self, *args, **kwargs):
        super(MyOwnCompiler, self).__init__(*args, **kwargs)
        # <manipulate self.cflags here >


### Make sure Devito is aware of this new Compiler class
compiler_registry['mycompiler'] = MyOwnCompiler
configuration.add("compiler", "custom", list(compiler_registry), callback=lambda i: compiler_registry[i]())


### Then, what remains to be done is asking Devito to use MyOwnCompiler

configuration['compiler'] = 'mycompiler'

Is the jitted code IEEE compliant?

By default, Devito compiles the generated code using flags that maximize the runtime performance. Some of these flags, such as GCC’s -ffast-math, break IEEE compliance by, for example, reordering operations based on associativity. This may alter the numerical output, including even making some codes produce incorrect results. If you think your code is affected by -ffast-math or similar flags that break IEEE compliance, you can disable them by setting the environment variable DEVITO_SAFE_MATH=1. You can achieve the same effect in Python by setting configuration['safe-math'] = 1.

<This FAQ was co-authored by @FabioLuporini and @ofmla>

Can I control the MPI domain decomposition?

By default, domain decomposition starts along the slowest axis. This means that for an n-dimensional grid, the outermost dimension is decomposed first. For example, in a three-dimensional grid (x, y, z), the x dimension is split into chunks first, followed by the y dimension, and finally the z dimension. Then the process restarts from the outermost dimension again, ensuring an n-dimensional decomposition, until as many partitions as MPI ranks are created.

Customization

The Grid class accepts an optional topology argument, allowing users to specify custom domain decompositions as an n-tuple, where n is the number of distributed dimensions. For instance, for a two-dimensional grid, the topology (4, 1) will decompose the slowest axis into four partitions (one per MPI rank), while the fastest axis will be replicated across all MPI ranks. So, for example, given a Grid(shape=(16, 16), topology=(4, 1)), the x dimension would be split into 4 chunks of 4, resulting in partitions of shape (4, 16) for each MPI rank.

Wildcard-based Decomposition

Consider a domain with three distributed dimensions: x, y, and z, and an MPI communicator with N processes. Here are some examples of specifying a custom topology:

  • With N known (e.g., N=4):

    • (1, 1, 4): Decomposes the z dimension into 4 chunks.
    • (2, 1, 2): Decomposes the x dimension into 2 chunks and the z dimension into 2 chunks.
  • With N unknown:

    • (1, '*', 1): The wildcard '*' indicates that the runtime should decompose the y dimension into N chunks.
    • ('*', '*', 1): The runtime decomposes both x and y dimensions into factors of N, prioritizing the outermost dimension.

    Assuming the number of ranks N cannot be evenly decomposed into the requested stars, decomposition is as even as possible, prioritizing the outermost dimension:

    • For N=3:
      • ('*', '*', 1) results in (3, 1, 1).
      • ('*', 1, '*') results in (3, 1, 1).
      • (1, '*', '*') results in (1, 3, 1).
    • For N=6:
      • ('*', '*', 1) results in (3, 2, 1).
      • ('*', 1, '*') results in (3, 1, 2).
      • (1, '*', '*') results in (1, 3, 2).
    • For N=8:
      • ('*', '*', '*') results in (2, 2, 2).
      • ('*', '*', 1) results in (4, 2, 1).
      • ('*', 1, '*') results in (4, 1, 2).
      • (1, '*', '*') results in (1, 4, 2).

The DEVITO_TOPOLOGY Environment Variable

As of Devito v4.8.11, the domain decomposition topology can also be specified globally using the environment variable DEVITO_TOPOLOGY. Accepted values are:

  • x: Corresponds to the topology ('*', 1, 1), decomposing the x dimension.
  • y: Corresponds to the topology (1, '*', 1), decomposing the y dimension.
  • z: Corresponds to the topology (1, 1, '*'), decomposing the z dimension.
  • xy: Corresponds to the topology ('*', '*', 1), decomposing both x and y dimensions.

How should I use MPI on multi-socket machines

In general you should use one MPI rank per NUMA node on a multi-socket machine. You can find the number of numa nodes with the lscpu command. For example, here is the relevant part of the output from the lscpu command on an AMD 7502 2 socket machine with 2 NUMA nodes:

Architecture:          x86_64
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    1
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
Model name:            AMD EPYC 7502 32-Core Processor
NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     32-63

How do I make sure my code is “MPI safe”

There are a few things you may want to check

  • To refer to the actual (“global”) shape of the domain, you should always use grid.shape (or analogously through a Function, f.grid.shape). And unless you know well what you’re doing, you should never use the function shape, namely f.shape or f.data.shape, as that will return the “local” domain shape, that is the data shape after domain decomposition, which might differ across the various MPI ranks.

Why does my Operator kernel die suddenly

This is likely due to an out-of-bounds (OOB) array access while running the generated code. This shouldn’t really ever happen! The compiler and runtime should catch any DSL misuses, ultimately leading to such OOB accesses, well before jumping to C-land via op.apply(...) to run the generated code. However, there are currently two open issues when this might unfortunately happen:

  • When misusing SubDimensions or SubDomains. See this issue
  • When misusing ConditionalDimensions. See this issue

Can I manually modify the C code generated by Devito and test these modifications

Yes, as of Devito v3.5 it is possible to modify the generated C code and run it inside Devito. First you need to get the C file generated for a given Operator. Run your code in DEBUG mode:

DEVITO_LOGGING=DEBUG python your_code.py

The generated code path will be shown as in the excerpt below:

CustomCompiler: compiled `/tmp/devito-jitcache-uid1000/ed41e9373af1bc129471b7ae45e1c3740b60a856.c` [0.29 s]

You can now open the C file, do the modifications you like, and save them. Finally, rerun the same program but this time with the Devito JIT backdoor enabled:

DEVITO_JIT_BACKDOOR=1 python your_code.py

This will force Devito to recompile and link the modified C code.

If you have a large codebase with many Operators, here’s a trick to speed up your hacking with the JIT backdoor.

How do I find the source of my bug quickly and get support

Assuming this is actually an issue in Devito, we distinguish between three types of bugs:

  1. a Python-level bug, e.g. some exceptions raised during compilation
  2. a C-level bug causing abrupt termination, e.g. a segmentation fault (ouch)
  3. a C-level bug, e.g. the numerical output “looks wrong”

Typically such a bug occurs in a moderately big code, so how should we proceed?

If you are in the cases 1 or 2 above, the first thing to do, regardless of who fixes it (either you directly if feeling brave, or most likely someone from the Devito team), is to create an MFE – a Minimal Failing Example. An interesting read about MFEs in general is available here. The typical workflow in Devito is as follows:

  • If the failure is from a Jupyter Notebook, first of all, convert it into a Python file. It’s trivial, you can do it directly from within the Jupyter Notebook itself, under the “File” tab there’s an option to convert the notebook into a Python file.
  • Then it’s time to trim down the code. The idea is to incrementally remove parts of the code until the bug disappears, i.e., Devito compilation and execution reach completion without failures. In doing so, you don’t care about things such as breaking the physics, making the method unstable, and so on. So …
  • If your Operator has many Eqs, start with removing some of them. Is the bug still there? Great, keeping on removing the unnecessary. Some Functions may now become unnecessary because they were only appearing in the removed Eqs. Then, drop them too from the Python file.
  • Here’s another trick. You have many SubDomains – try with removing some of them, both their definition and where they’re used.
  • The MFE must have a super quick turnaround time. Avoid solve at all costs. Run for only a few timesteps, possibly just one; this will also help with avoiding NaN or Inf since your MFE is most likely unstable.
  • If you have just one Eq in your Operator, but it’s a big one, perhaps with lots of operations over many Functions, then you should trim it down. Try to mitigate the arithmetic complexity, for example by dropping derivatives or entire sub-expressions.
  • There are other tricks for the creation of an MFE. If your code by default uses MPI, or OpenMP, or OpenACC, or combinations thereof, but the bug appears even when running sequentially, then explicitly disable parallelism. Also try with disabling the performance optimizations applied by an Operator – Operator(..., opt='noop').
  • There are a few other tricks and things you may wanna play with to create an MFE…

At the end of the day, an MFE should really be small, ideally not more than 10 lines of code, and definitely not more than 20~30. If you managed to create an actual MFE, then we’re in business. File an issue on GitHub or talk to us on Slack, attach the MFE, and someone will quickly look into it.

Now, stepping back: you may unfortunately be in the case 3 above – the numerical output “looks wrong”, despite the code really looks sane. This is in our experience an extremely rare circumstance. You may try to look at the generated C, to see if you spot anything odd; or you may just file an issue on GitHub with instructions to reproduce the problem, or talk to us directly or Slack.

Is there a way to get the performance of an Operator

Yes, any logging level below or equal to PERF will do the trick. For example, if you run:

DEVITO_LOGGING=PERF python your_code.py

you will see that Devito emits lots of useful information concerning the performance of an Operator. The following is reported:

  • the code generation, compilation, and execution times;
  • for each section in the generated code, its execution time, operational intensity (OI), GFlops/s and GPts/s performance;
  • global GFlops/s and GPts/s performance of the Operator (i.e., cumulative across all sections);
  • in the case of an MPI run, per-rank GFlops/s and GPts/s performance.

About the GPts/s metric, that is number of gigapoints per seconds. The “points” we refer to here are the actual grid points – so if the grid is an N**3 cube, the number of timesteps is T, and the Operator runs for S secs, then we have N**3*T/S GPts/s. This is the typical metric used for finite difference codes.

An excerpt of the performance profile emitted by Devito upon running an Operator is provided below. In this case, the Operator has two sections, section0 and section1, and section1 consists of two consecutive 6D iteration spaces whose size is given between angle brackets.

Global performance: [OI=0.16, 8.00 GFlops/s, 0.04 GPts/s]
Local performance:
  * section0<136,136,136> run in 0.10 s [OI=0.16, 0.14 GFlops/s]
  * section1<<341,16,16,14,14,136>,<341,16,16,8,8,130>> run in 35.91 s [OI=5.36, 8.01 GFlops/s, 0.05 GPts/s]

Note that section0 doesn’t show the GPts/s. This is because no TimeFunction is written in this section.

How does Devito compute the performance of an Operator

The section execution time is computed directly in the generated code through cheap timers. The cumulative Operator execution time is computed through Python-level timers and includes the overhead inherent in the processing of the arguments supplied to op.apply(...).

The floating-point operations are counted once all of the symbolic flop-reducing transformations have been performed during compilation. Devito uses an in-house estimate of cost, rather than SymPy’s estimate, to take care of some low-level intricacies. For example, Devito’s estimate ignores the cost of integer arithmetic used for offset indexing into multi-dimensional arrays. Examples of how the Devito estimator works are available here.

To calculate the GFlops/s performance, Devito multiplies the floating-point operations calculated at compile time by the size of the iteration space, and it does that at the granularity of individual expressions. For example, consider the following snippet:

<section0 start>
for x = x_m to x_M
  for y = y_m to y_M
    u[x, y] = 2*v[x, y] + 1
    for z = z_m to z_M
       w[x, y, z] += 4*h[x, y, z] + 5 - f[x, y, z]
<section 0 end>

At compile time, Devito produces the following estimate: 2*(x_M-x_m+1)*(y_M-y_m+1) + 4*(x_M-x_m+1)*(y_M-y_m+1)*(z_M-z_m+1) = 2*(x_M-x_m+1)*(y_M-y_m+1)*(1 + 2*(z_M-z_m+1)). Upon op.apply(...), the values of x_M, x_m, ... become known and are instantied. This gives the total number of operations performed in section0, which is eventually divided by the section execution time to give the GFlops/s performance.

The produced GFlops/s has been checked against that reported by Intel Advisor in a number of problems and the results were extremely close, which gives confidence about the soundness of the Devito estimate. Clearly, when it gets to articles or presentations, we encourage the use of profilers such as Intel Advisor to back these numbers up. The metrics emitted by Devito are only intended to give an initial yet fairly realistic indication of the Operator performance.

Compiler ninjas may wonder about the counting of loop-invariant sub-expressions, which might produce an over-estimate of the actual performance. Thanks to aggressive code motion, the amount of innermost-loop-invariant sub-expressions in a Devito Operator is typically negligible, so the Devito estimate doesn’t basically suffer from this issue, or at least not in a tangible way to the best of our knowledge.

How do I fairly compare the performance to that of my in-house code

It’s challenging! Here’s a potentially non-exhaustive list of things to check:

  • Check the physics — e.g., there are many TTI variations out there, are the two codes using the same one? You wanna ensure the PDEs are the same.
  • Check the discretization — do the two codes use the same spatial order? most importantly, is it the same spatial order in all problem dimensions? carefully optimized codes often use lower order stencils in some dimensions. This significantly impacts performance
  • Did you try tuning the performance of the Devito Operator? E.g., on GPUs it is worth giving the par-tile option a go.
  • Snapshotting: compression, serialization, asynchronous streaming if running on device… Many legacy codes have these features enabled out-of-the-box. Are you using them?
  • Time sub-sampling for snapshotting – yes or no?
  • Expanding box – yes or no.
  • Mixed precision for model parameters – yes or no?
  • Other minor potential devito optimizations: use of Buffer(2) to reduce working set size; Trigonometric identities to avoid some temporaries (e.g., cos2 = 1 - sin2), etc…

How do I cite Devito

Please see https://www.devitoproject.org/citing

Where did the name Devito come from?

The precursor project that led to Devito was named by @ggorman using a backronym generator. He put in some keywords like “open source performant seismic codes”, and chose the (weird) name “Opesci”. No one knew how to pronounce this, so a common conversation was “How do you pronounce this?” “Opesci, like Joe Pesci”. So for the next version we chose to go with a famous Joe Pesci character - Tommy Devito from Goodfellas. The name came up in a discussion between @navjotk and @mlange05 (mostly the latter) and we stuck with it.

Back to top