Frequently Asked Questions

What Devito is
What Devito is not
How can I see the code generated by Devito?
How can I see the compilation command with which Devito compiles the generated code?
Where does the generated code go and how do I look at it?
Can I change the directory where Devito stashes the generated code?
I create an Operator, look at the generated code, and the equations appear in a different order than I expected.
Do Devito Operators release the GIL when executing C code?
What performance optimizations does Devito apply?
Does Devito optimize complex expressions
How do I get an estimate of the memory consumption for an Operator?
How are abstractions used in the seismic examples?
What environment variables control how Devito works?
What are the accepted combinations of PLATFORM, ARCH, and LANGUAGE?
How do you run the unit tests from the command line?
What is the difference between f() and f[] notation?
What is the indexed notation?
Is there a flow chart?
What’s up with object.data?
How do I create an N-dimensional grid?
What are the keys to fast code?
As time increases in the finite difference evolution, are wavefield arrays “swapped” as you might see in c/c++ code?
What units are typically used in Devito examples?
Can I subclass basic types such as TimeFunction?
How can I change the compilation flags (for example, I want to change the optimization level from -O3 to -O0)?
Is the jitted code IEEE-compliant?
Can I control the MPI domain decomposition?
How should I use MPI on multi-socket machines?
How do I make sure my code is “MPI safe”?
My code is generating different results when running with MPI?
Why does my Operator kernel die suddenly?
Can I manually modify the C code generated by Devito and test these modifications?
How do I find the source of my bug quickly and get support?
Is there a way to get the performance of an Operator?
How does Devito compute the performance of an Operator?
How do I fairly compare the performance to that of my in-house code?
Is there is list of refereed papers related to the Devito project?
How do I cite Devito?
Where did the name Devito come from?

What Devito is

Devito is a Python-based domain-specific language (DSL) and code generation framework for solving any system of partial differential equations (PDEs) through symbolic computation. Instead of bundling fixed “black-box” solvers, Devito empowers you to build your own, tailored to your physics and hardware.

Here’s what Devito brings to the table:

Symbolic PDE definitions in pure Python
Express your governing equations exactly as you would on paper, using a concise, SymPy-powered syntax — no boilerplate, no glue code.
Automatic generation of fast, low-level code
Devito turns your high-level symbolic equations into highly optimized C/C++ code via a sophisticated compiler stack, automatically applying an array of performance optimizations during the lowering process.
Portable performance across architectures
From laptops to supercomputers: Devito targets multi-core CPUs (with OpenMP and MPI) and accelerators (CUDA, HIP, SYCL, OpenACC, and MPI) from the same source. GPU support for CUDA, HIP, and SYCL is included in DevitoPRO.
Custom finite-difference and stencil kernel design
Design bespoke numerical schemes for acoustics, elasticity, electromagnetics, fluid dynamics, or whatever PDE-driven physics your research demands.

By providing high-level abstractions for numerical scheme specification, Devito combines productivity, portability, and performance, avoiding the need to hand-implement and optimize finite-difference solvers, or rely on pre-existing implementations. This enables domain scientists to rapidly specify the solvers they require, whilst remaining focused on their areas of expertise - typically the overarching problem.

Focus on modeling and math — Devito handles the performance.

What Devito is not

As is hopefully implied by the preceeding section, Devito is not a seismic modelling or imaging framework, nor is it a set of pre-baked solvers. Whilst Devito is often associated with seismic imaging, this is just one application domain. Our RTM and FWI examples use Devito under the hood, but the engine itself is physics-agnostic. There’s no built-in assumption about waves, seismics, or domains — you provide the physics, and Devito delivers efficient, architecture-aware implementations.

Furthermore, the examples provided with Devito are often conflated with the core DSL and compiler. However, the purpose of this collection of tutorials, code samples, and helper functions is fourfold:

Elucidate the usage of the Devito DSL and compiler
Demonstrate example implementations of various PDE-driven problems
Demonstrate integration of Devito into workflows of interest
Provide helper objects and functions for learning the DSL

Whilst components of the provided examples may be well-suited to user applications, it is worth bearing in mind that they represent a higher level of abstraction atop of Devito. We do, however, encourage the extension of example objects, and their use as templates for implementing your own higher-level functionalities.

How can I see the code generated by Devito?

After you build an op=Operator(...) implementing one or more equations, you can use print(op) to see the generated low level code. The example below builds an operator that takes a 1/2 cell forward shifted derivative of the Function f and puts the result in the Function g.

import numpy as np
import devito
from devito import Grid, Function, Eq, Operator
grid = Grid(shape=(21,), extent=(1.0,), origin=(0.0,), dtype=np.float32)
x = grid.dimensions[0]
f = Function(name='f', grid=grid, space_order=8)
g = Function(name='g', grid=grid, space_order=8)
eq_dfdx = Eq(g, getattr(f, 'dx')(x+0.5*x.spacing))
op = Operator(eq_dfdx)
print(op)

And the output:

#define _POSIX_C_SOURCE 200809L
#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "xmmintrin.h"
#include "pmmintrin.h"

struct dataobj
{
  void *restrict data;
  int * size;
  int * npsize;
  int * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
} ;

struct profiler
{
  double section0;
} ;


int Kernel(struct dataobj *restrict f_vec, struct dataobj *restrict g_vec, const float h_x, struct profiler * timers, const int x_M, const int x_m)
{
  float (*restrict f) __attribute__ ((aligned (64))) = (float (*)) f_vec->data;
  float (*restrict g) __attribute__ ((aligned (64))) = (float (*)) g_vec->data;

  /* Flush denormal numbers to zero in hardware */
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

  START(section0)
  for (int x = x_m; x <= x_M; x += 1)
  {
    g[x + 8] = (3.57142857e-3F*(f[x + 4] - f[x + 12]) + 3.80952381e-2F*(-f[x + 5] + f[x + 11]) + 2.0e-1F*(f[x + 6] - f[x + 10]) + 8.0e-1F*(-f[x + 7] + f[x + 9]))/h_x;
  }
  STOP(section0,timers)

  return 0;
}

How can I see the compilation command with which Devito compiles the generated code?

Set the environment variable DEVITO_LOGGING=DEBUG. When an Operator gets compiled, the used compilation command will be emitted to stdout.

If nothing seems to change, it is possible that no compilation is happening under-the-hood as all kernels have already been compiled in a previous run. You will then have to clear up the Devito kernel cache. From the Devito root directory, run:

python scripts/clear_devito_cache.py

Where does the generated code go and how do I look at it?

Devito stores the generated code as well as the jit-compiled libraries in a temporary directory. By setting the environment variable DEVITO_LOGGING=DEBUG, Devito will show, amongst other things, the absolute path to the generated code.

Can I change the directory where Devito stashes the generated code?

Yes, just set the environment variable TMPDIR to your favorite location.

I create an Operator, look at the generated code, and the equations appear in a different order than I expected.

The Devito compiler computes a topological ordering of the input equations based on data dependency analysis. Heuristically, some equations might be moved around to improve performance (e.g., data locality). Therefore, the order of the equations in the generated code might be different than that used as input to the Operator.

Do Devito Operators release the GIL when executing C code?

Yes. Devito uses ctypes.CDLL to call the JIT C code which releases the GIL.

What performance optimizations does Devito apply?

Take a look here and in particular at this notebook.

Does Devito optimize complex expressions?

Devito applies several performance optimizations to improve the number of operations (“operation count”) in complex expressions. These optimizations are designed to do a really good job but also be reasonably fast. One such pass attempts to factorize as many common terms as possible in expressions in order to reduce the operation count. We will construct a demonstrative example below that has a common term that is not factored out by the Devito optimization. The difference in floating-point operations per output point for the factoring of that term is about 10 percent, and the generated C is different, but numerical outputs of running the two different operators are indistinguishable to machine precision. In terms of actual performance, the (few) missed factorization opportunities may not necessarily be a relevant issue: as long as the code is not heavily compute-bound, the runtimes may only be slightly higher than in the optimally-factorized version.

Operator 1:

ux_update = t.spacing**2 * b * \
    ((c33 * u_x.dx(x0=x+x.spacing/2)).dx(x0=x-x.spacing/2) +
     (c55 * u_x.dz(x0=z+z.spacing/2)).dz(x0=z-z.spacing/2) +
     (c13 * u_z.dz(x0=z+z.spacing/2)).dx(x0=x-x.spacing/2) +
     (c55 * u_z.dx(x0=x+x.spacing/2)).dz(x0=z-z.spacing/2)) + \
    (2 - t.spacing * wOverQ) * u_x + (t.spacing * wOverQ - 1) * u_x.backward 
stencil_x = Eq(u_x.forward, ux_update)
print("\n", stencil_x)
op = Operator([stencil_x])

Operator 2:

ux_update = \
    t.spacing**2 * b * (c33 * u_x.dx(x0=x+x.spacing/2)).dx(x0=x-x.spacing/2) + \
    t.spacing**2 * b * (c55 * u_x.dz(x0=z+z.spacing/2)).dz(x0=z-z.spacing/2) + \
    t.spacing**2 * b * (c13 * u_z.dz(x0=z+z.spacing/2)).dx(x0=x-x.spacing/2) + \
    t.spacing**2 * b * (c55 * u_z.dx(x0=x+x.spacing/2)).dz(x0=z-z.spacing/2) + \
    (2 - t.spacing * wOverQ) * u_x + (t.spacing * wOverQ - 1) * u_x.backward 
stencil_x = Eq(u_x.forward, ux_update)
print("\n", stencil_x)
op = Operator([stencil_x])

Output 1:

Eq(u_x(t + dt, x, z), dt**2*(Derivative(c13(x, z)*Derivative(u_z(t, x, z), z), x) + Derivative(c33(x, z)*Derivative(u_x(t, x, z), x), x) + Derivative(c55(x, z)*Derivative(u_x(t, x, z), z), z) + Derivative(c55(x, z)*Derivative(u_z(t, x, z), x), z))*b(x, z) + (-dt*wOverQ(x, z) + 2)*u_x(t, x, z) + (dt*wOverQ(x, z) - 1)*u_x(t - dt, x, z))
Operator `Kernel` generated in 1.26 s
  * lowering.Expressions: 0.61 s (48.7 %)
  * lowering.Clusters: 0.52 s (41.5 %)
     * specializing.Clusters: 0.31 s (24.8 %)
Flops reduction after symbolic optimization: [1160 --> 136]

Output 2:

Eq(u_x(t + dt, x, z), dt**2*b(x, z)*Derivative(c13(x, z)*Derivative(u_z(t, x, z), z), x) + dt**2*b(x, z)*Derivative(c33(x, z)*Derivative(u_x(t, x, z), x), x) + dt**2*b(x, z)*Derivative(c55(x, z)*Derivative(u_x(t, x, z), z), z) + dt**2*b(x, z)*Derivative(c55(x, z)*Derivative(u_z(t, x, z), x), z) + (-dt*wOverQ(x, z) + 2)*u_x(t, x, z) + (dt*wOverQ(x, z) - 1)*u_x(t - dt, x, z))
Operator `Kernel` generated in 1.12 s
  * lowering.Expressions: 0.59 s (53.0 %)
  * lowering.Clusters: 0.40 s (35.9 %)
     * specializing.Clusters: 0.31 s (27.8 %)
Flops reduction after symbolic optimization: [1169 --> 149]

Other optimizations include common sub-expressions elimination, hoisting of loop-invariant code, and detection of cross-iteration redundancies (e.g., due to high order derivatives). Below we show the cumulative impact of all these optimizations in a number of seismic operators.

For more info, take a look at this notebook.

How do I get an estimate of the memory consumption for an Operator?

The memory consumption of an Operator can be estimated using the Operator.estimate_memory() utility. This returns an estimate of consumption of both host and device memory for the Operator, including that associated with any array temporaries introduced by optimizations applied by the compiler. A deeper overview can be found toward the end of this notebook.

How are abstractions used in the seismic examples?

Many Devito examples are provided that demonstrate application for specific problems, including e.g. fluid mechanics and seismic modeling. We focus in this question on seismic modeling examples that provide convenience wrappers to build differential equations and create Devito Operators for various types of modeling physics including isotropic and anisotropic, acoustic and elastic.

These examples (link) use abstractions to remove details from the methods that actually build the operators. The idea is that at the time you build a Devito operator, you don’t need specific material parameter arrays (e.g. velocity or density or Thomsen parameter), and you don’t need specific locations of sources and receiver instruments. All you need to build the operator is a placeholder that can provide the dimensionality and (for example) the spatial order of finite difference approximation you wish to employ. In this way you can build and return functioning highly optimized operators to which you can provide the specific implementation details at runtime via command line arguments.

An example of this abstraction (or placeholder design pattern) in operation is the call to the isotropic acoustic AcousticWaveSolver.forward method that returns a Devito operator via the ForwardOperator method defined in operators.py.

You will note that this method uses placeholders for the material parameter arrays and the source and receiver locations, and then at runtime uses arguments provided to the returned Operator to provide state to the placeholders. You can see this happen on lines 112-113 in wavesolver.py.

What environment variables control how Devito works?

How to get the list of Devito environment variables

You can get the list of environment variables and all their possible values with the following python code:

from devito import print_defaults
print_defaults()

How to set Devito environment variables

These environment variables can either be set from the shell or programmatically. Note that when setting these variables programmatically, you need to use lower case, and the leading DEVITO is omitted. Values are case sensitive, meaning openmp is accepted and OPENMP will throw an error. Below are examples of setting these variables in the shell (before running python) and from python (before executing devito code).

Method	Example
bourne shell	DEVITO_LANGUAGE=“openmp”
c shell	setenv DEVITO_LANGUAGE “openmp”
programmatically	configuration[‘language’] = ‘openmp’

Description of Devito environment variables

DEVITO_ARCH

Used to select a specific “backend compiler”. The backend compiler takes as input the code generated by Devito and produces a shared object. Supported backend compilers are gcc, icc, pgcc, clang. For each of these compilers, Devito uses some preset compilation flags (e.g., -O3, -march=native, -fast-math). If this environment variable is left unset, Devito will attempt auto-detection of the most suitable backend compiler available on the underlying system.

DEVITO_PLATFORM

This environment variable is mostly needed when running on GPUs, to ask Devito to generate code for a particular device (see for example this tutorial). Can be also used to specify CPU architectures such as Intel’s – Haswell, Broadwell, SKL and KNL – ARM, AMD, and Power. Often one can ignore this variable because Devito typically does a decent job at auto-detecting the underlying platform.

DEVITO_DEVICEID

Specify the default device to use when running on a GPU. This is useful when multiple GPUs are available on the system. The default is 0, which means the first device. Use 1 for the second device, and so on.

DEVITO_LANGUAGE

Specify the generated code language. The default is C, which means sequential C. Use openmp to emit C+OpenMP or openacc for C+OpenACC.

DEVITO_PROFILING

Choose the performance profiling level. This is also automatically increased with DEVITO_LOGGING=PERF or DEVITO_LOGGING=DEBUG, in which case this environment variable can be ignored.

DEVITO_DEVELOP

Mostly useful for developers when chasing segfaults.

DEVITO_OPT

Choose the performance optimization level. By default set to the maximum level, advanced.

DEVITO_MPI

Controls MPI in Devito. Use 1 to enable MPI. The most powerful MPI mode is called “full”, and is activated setting DEVITO_MPI=full. The “full” mode implements a number of optimizations including computation/communication overlap.

DEVITO_AUTOTUNING

Search across a set of block shapes to maximize the effectiveness of loop tiling (aka cache blocking). You can choose between off (default), basic, aggressive, max. A more aggressive autotuning should eventually result in better runtime performance, though the search phase will take longer.

DEVITO_LOGGING

Run with DEVITO_LOGGING=DEBUG to find out the specific performance optimizations applied by an Operator, how auto-tuning is getting along, to emit the command used to compile the generated code, to emit more performance metrics, and much more.

DEVITO_FIRST_TOUCH

Use DEVITO_FIRST_TOUCH=1 in combination with DEVITO_LANGUAGE=openmp to use an OpenMP parallel Operator for initialization of Function data. This should ensure NUMA locality for data access when running “flatten” OpenMP across multiple sockets on the same node (as opposed to using MPI – one MPI process per socket – plus OpenMP, which is the recommended way).

DEVITO_JIT_BACKDOOR

You can set DEVITO_JIT_BACKDOOR=1 to test custom modifications to the generated code. For more info, take a look at this FAQ.

DEVITO_IGNORE_UNKNOWN_PARAMS

Set DEVITO_IGNORE_UNKNOWN_PARAMS=1 to avoid Devito raising an exception if one attempts to pass an unknown argument to op.apply().

What are the accepted combinations of PLATFORM, ARCH, and LANGUAGE?

LANGUAGE={C,openmp}

These two languages can be used with virtually any PLATFORM and ARCH.

With a device PLATFORM (e.g., nvidiaX, amdgpuX, or intelgpuX), the compiler will generate OpenMP code for device offloading.

When using OpenMP offloading, it is recommended to stick to the corresponding vendor compiler, so ARCH=amdclang for AMD, ARCH={icc,icx,intel} for Intel, and ARCH=nvc for NVidia.

LANGUAGE=openacc

Requires: PLATFORM=nvidiaX and ARCH=nvc.

The legacy PGI compiler is also supported via ARCH=pgcc.

LANGUAGE=cuda

DevitoPRO only.

Requires: PLATFORM=nvidiaX and ARCH=cuda.

LANGUAGE=hip

DevitoPRO only.

Requires: PLATFORM=amdgpuX and ARCH=hip.

LANGUAGE=sycl

DevitoPRO only.

Requires: PLATFORM=intelgpuX or PLATFORM=intel64, and ARCH=sycl.

How do you run the unit tests from the command line?

In addition to the tutorials, the unit tests provide an excellent way to see how the Devito API works with small self-contained examples. You can exercise individual unit tests with the following python code:

pytest <test.py>
pytest -vs <test.py>  [more detailed log]

What is the difference between f() and f[] notation?

Devito offers a functional language to express finite difference operators. This is introduced here and systematically used throughout our examples and tutorials. The language relies on what in jargon we call the “f() notation”.

>>> from devito import Grid, Function
>>> grid = Grid(shape=(5, 6))
>>> f = Function(name='f', grid=grid, space_order=2)
>>> f.dx
Derivative(f(x, y), x)
>>> f.dx.evaluate
-f(x, y)/h_x + f(x + h_x, y)/h_x

Sometimes, one wishes to escape the constraints of the language. Instead of taking derivatives, other special operations are required. Or perhaps, a specific grid point needs to be accessed. In such a case, one could use the “f[] notation” or “indexed notation”. Following on from the example above:

>>> x, y = grid.dimensions
>>> f[x + 1000, y]
f[x + 1000, y]

The indexed object can be used at will to construct Eqs, and they can be mixed up with objects stemming from the “f() notation”.

>>> f.dx + f[x + 1000, y]
Derivative(f(x, y), x) + f[x + 1000, y]

However, while the “f() notation” is substantially safe – the language is designed such that only legal stencil expressions are built – the “f[] notation” is not, and one can easily end up creating operators performing out-of-bounds accesses. So use it judiciously!

What is the indexed notation?

The indexed notation, or “f[] notation”, is discussed here

Is there a flow chart?

needs links 1. Equation Lowering - Indexification - Substitutions - Domain alignment - Eq -> LoweredEq 2. Local Analysis 3. Clustering 4. Symbolic Optimization via Devito Symbolic Engine (DSE) 5. IET (Iteration/Expression Tree) Construction 6. IET Analysis 7. IET Optimization (DLE/YLE) - Loop optimization via Devito Loop Engine (DLE) - Loop Tiling / Cache Blocking - SIMD - OpenMP - MPI 8. Synthetic 9. JIT (Just In Time) Compilation

What’s up with object.data?

The .data property which is associated with objects such as Constant, Function and SparseFunction (along with their derivatives) represents the ‘numerical’ value of the ‘data’ associated with that particular object. For example, a Constant will have a single numerical value associated with it as shown in the following snippet

from devito import Constant

c = Constant(name='c')
c.data = 2.7

print(c.data)

2.7

Then, a Function defined on a Grid will have a data value associated with each of the grid points (as shown in the snippet below) and so forth.

import numpy as np
from devito import Grid, Function

grid = Grid(shape=(4, 4), extent=(1, 1))
f = Function(name='f', grid=grid)
f.data[:] = np.arange(16).reshape(grid.shape)

print(f.data)

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]
 [12. 13. 14. 15.]]

How do I create and N-dimensional grid?

Grids are often created via, e.g.,

grid = Grid(shape=(5, 5))

where printing the grid object then returns:

Grid[extent=(1.0, 1.0), shape=(5, 5), dimensions=(x, y)]

Here we see the grid has been created with the ‘default’ dimensions x and y. If a grid is created and passed a shape of (5, 5, 5) we’ll see that in addition it has a z dimension. However, what if we want to create a grid with, say, a shape of (5, 5, 5, 5)? For this case, we’ve now run out of the dimensions defined by default and hence need to create our own dimensions to achieve this. This can be done via, e.g.,

a = SpaceDimension('a')
b = SpaceDimension('b')
c = SpaceDimension('c')
d = SpaceDimension('d')
grid = Grid(shape=(5, 5, 5, 5), dimensions=(a, b, c, d))

where now, printng grid we get

Grid[extent=(1.0, 1.0, 1.0, 1.0), shape=(5, 5, 5, 5), dimensions=(a, b, c, d)]

and grid.shape returns

(5, 5, 5, 5)

What are the keys to fast code?

The code generated by devito is designed to run fast on CPU, GPU and clusters thereof. Broadly outlined, some of the mechanics for generating fast code are: ### CPU * Loop tiling (or “cache blocking”) * Loop fusion (to maximize data locality) Loop fission (to maximize parallelism) * Flop-reducing transformations (CSE, cross-stencil redundancies, factorization, hoisting) * OpenMP threading * OpenMP-based SIMD * Alignment to promote SIMD vectorization ### GPU * Longer pipelines, less travel to host (do more work on the GPU before communicating data between host and GPU) ### Clusters of CPUs/GPUs * Computation/communication overlap with prodding of the asynchronous progress engine * Avoidance of unnecessary halo exchanges * Reshuffling of halo exchanges * Threaded data packing/unpacking

As time increases in the finite difference evolution, are wavefield arrays “swapped” as you might see in c/c++ code?

In c/c++ code using two wavefield arrays for second order acoustics, you might see code like the following to “swap” the wavefield arrays at each time step:

    float *p_tmp = p_old;
    p_old = p_cur;
    p_cur = p_tmp;

Instead of swapping arrays, devito uses the modulus of a time index to map increasing temporal indices [0, 1, 2, 3, 4, 5, …] into cyclic indices [0, 1, 2, 0, 1, 2, …].

What units are typically used in Devito examples?

Sampling rates: msec
Frequency: KHz
Velocity: km/sec

Can I subclass basic types such as TimeFunction?

Yes, just like we did for our seismic examples, for example, the PointSource class. A few caveats are necessary, though.

First, classes such as Function or SparseTimeFunction are inherently complex. In fact, SparseTimeFunction itself is a subclass of Function. The whole class hierarchy is modular and well-structured, but at the same time, it’s depth and offers several hooks to allow specialization by subclasses – for example, all methods starting with __ such as __init_finalize__ or __shape_setup__. It will take some time to digest it. Luckily, you will only need to learn a little of this, at least for simple subclasses.

Second, you must know that these objects are subjected to so-called reconstruction during compilation. Objects are immutable inside Devito; therefore, even a straightforward symbolic transformation such as f[x] -> f[y] boils down to performing a reconstruction, that is, creating a whole new object. Since f carries around several attributes (e.g., shape, grid, dimensions), each time Devito performs a reconstruction, we only want to specify which attributes are changing – not all of them, as it would make the code ugly and incredibly complicated. The solution to this problem is that all the base symbolic types inherit from a common base class called Reconstructable; a Reconstructable object has two special class attributes, called __rargs__ and __rkwargs__. If a subclass adds a new positional or keyword argument to its __init_finalize__, it must also be added to __rargs__ or __rkwargs__, respectively. This will provide Devito with enough information to perform a reconstruction when it’s needed during compilation. The following example should clarify:

class Foo(Reconstructable):
    __rargs__ = ('a', 'b')
    __rkwargs__ = ('c',)

    def __init__(self, a, b, c=4):
           self.a = a
           self.b = b
           self.c = c

    def __repr__(self):
        return "x(%d, %d, %d)" % (self.a, self.b, self.c)

class Bar(Foo):
    __rkwargs__ = Foo.__rkwargs__ + ('d',)

    def __init__(self, *args, d=6, **kwargs)
            super().__init__(*args, **kwargs)
            self.d = d

You are unlikely to care about how reconstruction works in practice, but here are a few examples for a = Foo(3, 5) to give you more context.

a._rebuild() -> "x(3, 5, 4)" (i.e., copy of `a`).
a._rebuild(4) -> "x(4, 5, 4)"
a._rebuild(4, 7) -> "x(4, 7, 4)"
a._rebuild(c=5) -> "x(3, 5, 5)"
a._rebuild(1, c=7) -> "x(1, 5, 7)"

How can I change the compilation flags (for example, I want to change the optimization level from -O3 to -O0)?

There is currently no API to achieve this straightforwardly. However, there are three workarounds:

hacky way: change the flags explicitly in the Devito source code. In Devito v4.0, you can do that here
via env vars: use a CustomCompiler – just leave the DEVITO_ARCH environment variable unset or set it to 'custom'. Then, export CFLAGS="..." to tell Devito to use the exported flags in place of the default ones.
programmatically: subclass one of the compiler classes and set self.cflags to whatever you need. Do not forget to add the subclass to the compiler registry. For example, you could do

from devito import configuration, compiler_registry
from devito.compiler import GNUCompiler

class MyOwnCompiler(GNUCompiler):
    def __init__(self, *args, **kwargs):
        super(MyOwnCompiler, self).__init__(*args, **kwargs)
        # <manipulate self.cflags here >


### Make sure Devito is aware of this new Compiler class
compiler_registry['mycompiler'] = MyOwnCompiler
configuration.add("compiler", "custom", list(compiler_registry), callback=lambda i: compiler_registry[i]())


### Then, what remains to be done is asking Devito to use MyOwnCompiler

configuration['compiler'] = 'mycompiler'

Is the jitted code IEEE compliant?

By default, Devito compiles the generated code using flags that maximize the runtime performance. Some of these flags, such as GCC’s -ffast-math, break IEEE compliance by, for example, reordering operations based on associativity. This may alter the numerical output, including even making some codes produce incorrect results. If you think your code is affected by -ffast-math or similar flags that break IEEE compliance, you can disable them by setting the environment variable DEVITO_SAFE_MATH=1. You can achieve the same effect in Python by setting configuration['safe-math'] = 1.

Can I control the MPI domain decomposition?

By default, domain decomposition starts along the slowest axis. This means that for an n-dimensional grid, the outermost dimension is decomposed first. For example, in a three-dimensional grid (x, y, z), the x dimension is split into chunks first, followed by the y dimension, and finally the z dimension. Then the process restarts from the outermost dimension again, ensuring an n-dimensional decomposition, until as many partitions as MPI ranks are created.

Customization

The Grid class accepts an optional topology argument, allowing users to specify custom domain decompositions as an n-tuple, where n is the number of distributed dimensions. For instance, for a two-dimensional grid, the topology (4, 1) will decompose the slowest axis into four partitions (one per MPI rank), while the fastest axis will be replicated across all MPI ranks. So, for example, given a Grid(shape=(16, 16), topology=(4, 1)), the x dimension would be split into 4 chunks of 4, resulting in partitions of shape (4, 16) for each MPI rank.

Wildcard-based Decomposition

Consider a domain with three distributed dimensions: x, y, and z, and an MPI communicator with N processes. Here are some examples of specifying a custom topology:

With N known (e.g., N=4):
- (1, 1, 4): Decomposes the z dimension into 4 chunks.
- (2, 1, 2): Decomposes the x dimension into 2 chunks and the z dimension into 2 chunks.
With N unknown:
- (1, '*', 1): The wildcard '*' indicates that the runtime should decompose the y dimension into N chunks.
- ('*', '*', 1): The runtime decomposes both x and y dimensions into factors of N, prioritizing the outermost dimension.
Assuming the number of ranks N cannot be evenly decomposed into the requested stars, decomposition is as even as possible, prioritizing the outermost dimension:
- For N=3:
  - ('*', '*', 1) results in (3, 1, 1).
  - ('*', 1, '*') results in (3, 1, 1).
  - (1, '*', '*') results in (1, 3, 1).
- For N=6:
  - ('*', '*', 1) results in (3, 2, 1).
  - ('*', 1, '*') results in (3, 1, 2).
  - (1, '*', '*') results in (1, 3, 2).
- For N=8:
  - ('*', '*', '*') results in (2, 2, 2).
  - ('*', '*', 1) results in (4, 2, 1).
  - ('*', 1, '*') results in (4, 1, 2).
  - (1, '*', '*') results in (1, 4, 2).

The `DEVITO_TOPOLOGY` Environment Variable

As of Devito v4.8.11, the domain decomposition topology can also be specified globally using the environment variable DEVITO_TOPOLOGY. Accepted values are:

x: Corresponds to the topology ('*', 1, 1), decomposing the x dimension.
y: Corresponds to the topology (1, '*', 1), decomposing the y dimension.
z: Corresponds to the topology (1, 1, '*'), decomposing the z dimension.
xy: Corresponds to the topology ('*', '*', 1), decomposing both x and y dimensions.

How should I use MPI on multi-socket machines?

In general you should use one MPI rank per NUMA node on a multi-socket machine. You can find the number of numa nodes with the lscpu command. For example, here is the relevant part of the output from the lscpu command on an AMD 7502 2 socket machine with 2 NUMA nodes:

Architecture:          x86_64
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    1
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
Model name:            AMD EPYC 7502 32-Core Processor
NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     32-63

How do I make sure my code is “MPI safe”?

There are a few things you may want to check

To refer to the actual (“global”) shape of the domain, you should always use grid.shape (or analogously through a Function, f.grid.shape). And unless you know well what you’re doing, you should never use the function shape, namely f.shape or f.data.shape, as that will return the “local” domain shape, that is the data shape after domain decomposition, which might differ across the various MPI ranks.

My code is generating different results when running with MPI?

There are a few things you may want to check

Users are expected to execute the MPI launch command, specifying the desired number of ranks and other possible arguments: mpirun, mpiexec, srun, e.g.: mpirun -np <num processes> [options] <executable> [arguments], and get exactly the same results. However some pre- and post-processing may be rank-specific (e.g., plotting on a given MPI rank only).
In case you used hardcoded boundary conditions to model your problem, consider switching to SubDomains. Devito’s support for distributed NumPy arrays enables domain decomposition without requiring changes to user code. The data is physically distributed, but from the user’s perspective, it remains a logically centralized entity. Users can interact with data using familiar indexing schemes (e.g., slicing) without concern about the underlying layout. You can find related tutorials here:.

For example, instead of

t = grid.stepping_dim
x, y = grid.dimensions
op = Operator([Eq(u.forward, u + 1),
               Eq(u[t+1, x, 0], 0),
               Eq(u[t+1, x, 2], 0),
               Eq(u[t+1, 0, y], 0),
               Eq(u[t+1, 2, y], 0)])

one should use a semantically equivalent computation can be expressed exploiting SubDomains.

u.data[:] = 0
op = Operator(Eq(u.forward, u + 1, subdomain=grid.interior))

Why does my Operator kernel die suddenly?

This is likely due to an out-of-bounds (OOB) array access while running the generated code. This shouldn’t really ever happen! The compiler and runtime should catch any DSL misuses, ultimately leading to such OOB accesses, well before jumping to C-land via op.apply(...) to run the generated code. However, there are currently two open issues when this might unfortunately happen:

When misusing SubDimensions or SubDomains. See this issue
When misusing ConditionalDimensions. See this issue

Can I manually modify the C code generated by Devito and test these modifications?

Yes, as of Devito v3.5 it is possible to modify the generated C code and run it inside Devito. First you need to get the C file generated for a given Operator. Run your code in DEBUG mode:

DEVITO_LOGGING=DEBUG python your_code.py

The generated code path will be shown as in the excerpt below:

CustomCompiler: compiled `/tmp/devito-jitcache-uid1000/ed41e9373af1bc129471b7ae45e1c3740b60a856.c` [0.29 s]

You can now open the C file, do the modifications you like, and save them. Finally, rerun the same program but this time with the Devito JIT backdoor enabled:

DEVITO_JIT_BACKDOOR=1 python your_code.py

This will force Devito to recompile and link the modified C code.

If you have a large codebase with many Operators, here’s a trick to speed up your hacking with the JIT backdoor.

How do I find the source of my bug quickly and get support?

Assuming this is actually an issue in Devito, we distinguish between three types of bugs:

a Python-level bug, e.g. some exceptions raised during compilation
a C-level bug causing abrupt termination, e.g. a segmentation fault (ouch)
a C-level bug, e.g. the numerical output “looks wrong”

Typically such a bug occurs in a moderately big code, so how should we proceed?

If you are in the cases 1 or 2 above, the first thing to do, regardless of who fixes it (either you directly if feeling brave, or most likely someone from the Devito team), is to create an MFE – a Minimal Failing Example. An interesting read about MFEs in general is available here. The typical workflow in Devito is as follows:

If the failure is from a Jupyter Notebook, first of all, convert it into a Python file. It’s trivial, you can do it directly from within the Jupyter Notebook itself, under the “File” tab there’s an option to convert the notebook into a Python file.
Then it’s time to trim down the code. The idea is to incrementally remove parts of the code until the bug disappears, i.e., Devito compilation and execution reach completion without failures. In doing so, you don’t care about things such as breaking the physics, making the method unstable, and so on. So …
If your Operator has many Eqs, start with removing some of them. Is the bug still there? Great, keeping on removing the unnecessary. Some Functions may now become unnecessary because they were only appearing in the removed Eqs. Then, drop them too from the Python file.
Here’s another trick. You have many SubDomains – try with removing some of them, both their definition and where they’re used.
The MFE must have a super quick turnaround time. Avoid solve at all costs. Run for only a few timesteps, possibly just one; this will also help with avoiding NaN or Inf since your MFE is most likely unstable.
If you have just one Eq in your Operator, but it’s a big one, perhaps with lots of operations over many Functions, then you should trim it down. Try to mitigate the arithmetic complexity, for example by dropping derivatives or entire sub-expressions.
There are other tricks for the creation of an MFE. If your code by default uses MPI, or OpenMP, or OpenACC, or combinations thereof, but the bug appears even when running sequentially, then explicitly disable parallelism. Also try with disabling the performance optimizations applied by an Operator – Operator(..., opt='noop').
There are a few other tricks and things you may wanna play with to create an MFE…

At the end of the day, an MFE should really be small, ideally not more than 10 lines of code, and definitely not more than 20~30. If you managed to create an actual MFE, then we’re in business. File an issue on GitHub or talk to us on Slack, attach the MFE, and someone will quickly look into it.

Now, stepping back: you may unfortunately be in the case 3 above – the numerical output “looks wrong”, despite the code really looks sane. This is in our experience an extremely rare circumstance. You may try to look at the generated C, to see if you spot anything odd; or you may just file an issue on GitHub with instructions to reproduce the problem, or talk to us directly or Slack.

Is there a way to get the performance of an Operator?

Yes — any logging level at or below PERF will display performance-related information. For example, running:

DEVITO_LOGGING=PERF python your_code.py

will cause Devito to emit detailed performance metrics about the Operator. The output includes:

Time taken for code generation, compilation, and execution;
For each section in the generated code: execution time, operational intensity (OI), GFlops/s, and GPts/s;
Overall (global) GFlops/s and GPts/s performance of the Operator (i.e., aggregated across all sections);
In MPI runs, per-rank GFlops/s and GPts/s performance.

About the GPts/s Metric

GPts/s stands for “gigapoints per second.” Here, “points” refers to grid points. For a grid shaped like an N**3 cube, run for T timesteps and completed in S seconds, the performance in GPts/s is N**3 * T / S. This is a common metric for finite-difference workloads.

Example Output

Below is an example of the performance profile Devito emits when an Operator is executed. In this case, the Operator has four sections, with the first two dominating execution time:

Operator `MyOperator` ran in 0.68 s
Global performance: [OI=0.01, 50.59 GFlops/s, 1.16 GPts/s]
Global performance <w/o setup>: [0.24 s, 3.27 GPts/s]
Local performance:
  * section0 ran in 0.07 s [OI=0.06, 2063.30 GFlops/s, 11.28 GPts/s]
  * section1 ran in 0.17 s [OI=0.01, 4984.02 GFlops/s, 4.60 GPts/s]
  * section2 ran in 0.01 s
  * section3 ran in 0.01 s

Notes: * The first line shows the total runtime of the Operator, including Python overhead, transition into executing the JIT-compiled binary, and any CPU-GPU data transfers.

The two “Global performance” lines differ as follows: the version excludes overhead (e.g., CPU-GPU data transfers) and better reflects actual compute performance, since such overheads are easily amortized way in production-grade runs when a simulation is executed over hundreds/thousands of timesteps.
The “Local performance” section provides a breakdown by code section. These sections are determined by the Devito compiler based on internal heuristics. Typically, loops related to finite-difference evaluation are grouped together, while operations such as source injection or receiver interpolation are placed in separate sections.

How does Devito compute the performance of an Operator?

The section execution time is computed directly in the generated code through cheap timers. The cumulative Operator execution time is computed through Python-level timers and includes the overhead inherent in the processing of the arguments supplied to op.apply(...).

The floating-point operations are counted once all of the symbolic flop-reducing transformations have been performed during compilation. Devito uses an in-house estimate of cost, rather than SymPy’s estimate, to take care of some low-level intricacies. For example, Devito’s estimate ignores the cost of integer arithmetic used for offset indexing into multi-dimensional arrays. Examples of how the Devito estimator works are available here.

To calculate the GFlops/s performance, Devito multiplies the floating-point operations calculated at compile time by the size of the iteration space, and it does that at the granularity of individual expressions. For example, consider the following snippet:

<section0 start>
for x = x_m to x_M
  for y = y_m to y_M
    u[x, y] = 2*v[x, y] + 1
    for z = z_m to z_M
       w[x, y, z] += 4*h[x, y, z] + 5 - f[x, y, z]
<section 0 end>

At compile time, Devito produces the following estimate: 2*(x_M-x_m+1)*(y_M-y_m+1) + 4*(x_M-x_m+1)*(y_M-y_m+1)*(z_M-z_m+1) = 2*(x_M-x_m+1)*(y_M-y_m+1)*(1 + 2*(z_M-z_m+1)). Upon op.apply(...), the values of x_M, x_m, ... become known and are instantied. This gives the total number of operations performed in section0, which is eventually divided by the section execution time to give the GFlops/s performance.

The produced GFlops/s has been checked against that reported by Intel Advisor in a number of problems and the results were extremely close, which gives confidence about the soundness of the Devito estimate. Clearly, when it gets to articles or presentations, we encourage the use of profilers such as Intel Advisor to back these numbers up. The metrics emitted by Devito are only intended to give an initial yet fairly realistic indication of the Operator performance.

Compiler ninjas may wonder about the counting of loop-invariant sub-expressions, which might produce an over-estimate of the actual performance. Thanks to aggressive code motion, the amount of innermost-loop-invariant sub-expressions in a Devito Operator is typically negligible, so the Devito estimate doesn’t basically suffer from this issue, or at least not in a tangible way to the best of our knowledge.

How do I fairly compare the performance to that of my in-house code?

It’s challenging! Here’s a potentially non-exhaustive list of things to check:

Check the physics — e.g., there are many TTI variations out there, are the two codes using the same one? You wanna ensure the PDEs are the same.
Check the discretization — do the two codes use the same spatial order? most importantly, is it the same spatial order in all problem dimensions? carefully optimized codes often use lower order stencils in some dimensions. This significantly impacts performance
Did you try tuning the performance of the Devito Operator? E.g., on GPUs it is worth giving the par-tile option a go.
Snapshotting: compression, serialization, asynchronous streaming if running on device… Many legacy codes have these features enabled out-of-the-box. Are you using them?
Time sub-sampling for snapshotting – yes or no?
Expanding box – yes or no.
Mixed precision for model parameters – yes or no?
Other minor potential devito optimizations: use of Buffer(2) to reduce working set size; Trigonometric identities to avoid some temporaries (e.g., cos2 = 1 - sin2), etc…

How do I cite Devito?

Please see https://www.devitoproject.org/citing

Where did the name Devito come from?

The precursor project that led to Devito was named by @ggorman using a backronym generator. He put in some keywords like “open source performant seismic codes”, and chose the (weird) name “Opesci”. No one knew how to pronounce this, so a common conversation was “How do you pronounce this?” “Opesci, like Joe Pesci”. So for the next version we chose to go with a famous Joe Pesci character - Tommy Devito from Goodfellas. The name came up in a discussion between @navjotk and @mlange05 (mostly the latter) and we stuck with it.

Frequently Asked Questions

What Devito is

What Devito is not

How can I see the code generated by Devito?

How can I see the compilation command with which Devito compiles the generated code?

Where does the generated code go and how do I look at it?

Can I change the directory where Devito stashes the generated code?

I create an Operator, look at the generated code, and the equations appear in a different order than I expected.

Do Devito Operators release the GIL when executing C code?

What performance optimizations does Devito apply?

Does Devito optimize complex expressions?

Operator 1:

Operator 2:

Output 1:

Output 2:

How do I get an estimate of the memory consumption for an Operator?

How are abstractions used in the seismic examples?

What environment variables control how Devito works?

How to get the list of Devito environment variables

How to set Devito environment variables

Description of Devito environment variables

DEVITO_ARCH

DEVITO_PLATFORM

DEVITO_DEVICEID

DEVITO_LANGUAGE

DEVITO_PROFILING

DEVITO_DEVELOP

DEVITO_OPT

DEVITO_MPI

DEVITO_AUTOTUNING

DEVITO_LOGGING

DEVITO_FIRST_TOUCH

DEVITO_JIT_BACKDOOR

DEVITO_IGNORE_UNKNOWN_PARAMS

What are the accepted combinations of PLATFORM, ARCH, and LANGUAGE?

LANGUAGE={C,openmp}

LANGUAGE=openacc

LANGUAGE=cuda

LANGUAGE=hip

LANGUAGE=sycl

How do you run the unit tests from the command line?

What is the difference between f() and f[] notation?

What is the indexed notation?

Is there a flow chart?

What’s up with object.data?

How do I create and N-dimensional grid?

What are the keys to fast code?

As time increases in the finite difference evolution, are wavefield arrays “swapped” as you might see in c/c++ code?

What units are typically used in Devito examples?

Can I subclass basic types such as TimeFunction?

How can I change the compilation flags (for example, I want to change the optimization level from -O3 to -O0)?

Is the jitted code IEEE compliant?

Can I control the MPI domain decomposition?

Customization

Wildcard-based Decomposition

The DEVITO_TOPOLOGY Environment Variable

How should I use MPI on multi-socket machines?

How do I make sure my code is “MPI safe”?

My code is generating different results when running with MPI?

Why does my Operator kernel die suddenly?

Can I manually modify the C code generated by Devito and test these modifications?

How do I find the source of my bug quickly and get support?

Is there a way to get the performance of an Operator?

About the GPts/s Metric

Example Output

How does Devito compute the performance of an Operator?

How do I fairly compare the performance to that of my in-house code?

Is there a list of refereed papers related to the Devito project?

How do I cite Devito?

Where did the name Devito come from?

The `DEVITO_TOPOLOGY` Environment Variable