Making Julia as Fast as C++ · FLOW Lab

F flow.byu.edu ↗

▲ 86 points • 62 comments • by d_tr • 3w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

11 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 8 of 8

SEGMENTS · AI 0 of 8

WORD COUNT 1,565

PEAK AI % 28% · §3

Analyzed

May 9

backend: pangram/v3.3

Segments scanned

8 windows

avg 196 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,565 words · 8 segments analyzed

Human AI-generated

§1 Human · 1%

18 Apr 2019, Eduardo Alvarez IntroductionThe rumor says that Julia can achieve the same computing performance as any other compiled language like C++ and FORTRAN. After coding in Julia for the past two years I have definitely fell in love with its pythonic syntax, multiple dispatch, and MATLAB-like handiness in linear algebra, while being able to use compilation features like explicit type declaration for bug-preventive programming. In summary, Julia’s philosophy brings all the flexibility of an interpreted language, meanwhile its Just-In-Time (JIT) compilation makes it a defacto compiled language.Julia’s high level syntax makes the language easygoing for programmers from any background, however, achieving high performance is sort of an art. In this post I summarize some of the things I’ve learned while crafting my Julia codes for high-performance computing. I will attempt to show the process of code optimization through a real-world computing application from aerodynamics: the vortex particle method\(^{[1,\,2]}\).Problem DefinitionIn the vortex particle method we are interested in calculating the velocity \(\mathbf{u}\) and velocity Jacobian \(\mathbf{J}\) that a field of \(N\) vortex particles induces at an arbitrary position \(\mathbf{x}\). This is calculated as \[{\bf u}\left( {\bf x} \right) = \sum\limits_p^N g_\sigma\left( {\bf x}-{\bf x}_p \right) {\bf K}\left( {\bf x}-{\bf x}_p \right) \times \boldsymbol\Gamma_p\] \[\frac{\partial {\bf u}}{\partial x_j}\left( {\bf x} \right) = \sum\limits_p^N \left[ \left( \frac{1}{\sigma } \frac{\Delta x_j}{\Vert \Delta \mathbf{x} \Vert} \frac{\partial g}{\partial r} \left( \frac{\Vert \Delta\mathbf{x} \Vert}{\sigma} \right) - 3 g_\sigma\left( \Delta{\bf x} \right) \frac{\Delta x_j}{\Vert \Delta\mathbf{x} \Vert^2}

§2 Human · 13%

\right) {\bf K}\left( \Delta\mathbf{x} \right) \times \boldsymbol\Gamma_p - \frac{g_\sigma\left( \Delta{\bf x} \right) }{4\pi \Vert \Delta{\bf x} \Vert^3} \delta_{ij} \times \boldsymbol\Gamma_p \right] ,\] with \({\bf K}\) the singular Newtonian kernel \({\bf K}\left( {\bf x}\right)=-\frac{ {\bf x} }{4\pi \Vert{\bf x}\Vert^3}\), \(g_\sigma\) a regularizing function of smoothing radius \(\sigma\), and \(\mathbf{x}_p\) and \(\boldsymbol{\Gamma}_p\) the position and vectorial strength of the \(p\)-th particle, respectively. Furthermore, the governing equations of the method require evaluating the velocity \(\mathbf{u}\) and Jacobian \(\mathbf{J}\) that the collection of particles induces on itself, leading to the well-known \(N\)-body problem of computational complexity \(\mathcal{O}(N^2)\).Choosing Winckelmans’ regularizing kernel\(^{[1]}\) \[g(r) = r^3 \frac{r^2 + 5/2}{\left( r^2 + 1 \right)^{5/2}}\] \[\frac{\partial g}{\partial r} (r) = \frac{15}{2} \frac{r^2}{\left( r^2 + 1 \right)^{7/2}} ,\] the above equations are implemented in C++ as follows:// Particle-to-particle interactions void P2P(Particle * P, const int numParticles) { real_t r, ros, aux, g_sgm, dgdr; vec3 dX; for (int i=0; i<numParticles; i++) { for (int j=0; j<numParticles; j++) { dX[0] = P[i].X[0] - P[j].X[0]; dX[1] = P[i].X[1] - P[j].X[1]; dX[2] =

§3 Human · 28%

P[i].X[2] - P[j].X[2]; r = sqrt(dX[0]*dX[0] + dX[1]*dX[1] + dX[2]*dX[2]); if (r!=0){ ros = r/P[j].sigma; // Evaluate g_σ and ∂gσ∂r aux = pow(ros*ros + 1.0, 2.5); g_sgm = ros*ros*ros * (ros*ros + 2.5) / aux; dgdr = 7.5 * ros*ros / ( aux * (ros*ros + 1.0) ); // u(x) = ∑g_σ(x−xp) K(x−xp) × Γp aux = (- const4 / (r*r*r)) * g_sgm; P[i].U[0] += ( dX[1]*P[j].Gamma[2] - dX[2]*P[j].Gamma[1] ) * aux; P[i].U[1] += ( dX[2]*P[j].Gamma[0] - dX[0]*P[j].Gamma[2] ) * aux; P[i].U[2] += ( dX[0]*P[j].Gamma[1] - dX[1]*P[j].Gamma[0] ) * aux; // ∂u∂xj(x) = ∑[ ∂gσ∂xj(x−xp) * K(x−xp)×Γp + gσ(x−xp) * ∂K∂xj(x−xp)×Γp] aux = (- const4 / (r*r*r)) * (dgdr/(P[j].sigma*r) - 3.0*g_sgm/(r*r)); for (int k=0; k<3; k++){ P[i].J[3*k + 0] += ( dX[1]*P[j].Gamma[2] - dX[2]*P[j].Gamma[1] )*aux*dX[k]; P[i].J[3*k + 1] += ( dX[2]*P[j].Gamma[0] -

§4 Human · 13%

dX[0]*P[j].Gamma[2] )*aux*dX[k]; P[i].J[3*k + 2] += ( dX[0]*P[j].Gamma[1] - dX[1]*P[j].Gamma[0] )*aux*dX[k]; } // Adds the Kronecker delta term aux = - const4 * g_sgm / (r*r*r); // k=1 P[i].J[3*0 + 1] -= aux * P[j].Gamma[2]; P[i].J[3*0 + 2] += aux * P[j].Gamma[1]; // k=2 P[i].J[3*1 + 0] += aux * P[j].Gamma[2]; P[i].J[3*1 + 2] -= aux * P[j].Gamma[0]; // k=3 P[i].J[3*2 + 0] -= aux * P[j].Gamma[1]; P[i].J[3*2 + 1] += aux * P[j].Gamma[0]; } } } } The figure at the top of the page is a simulation of a 6x6x6 box of particles with vorticity initially concentrated at its center, diffusing as the simulation progresses due to viscous effects: C++ BenchmarkHere I have coded a benchmark of the C++ code shown above, evaluating P2P on a box of 6x6x6=216 particles, and made it callable in this notebook through the CxxWrap Julia package:In [1]:#= This cell loads auxiliary benchmarking functions available here: https://github.com/EdoAlvarezR/post-juliavscpp =# using LinearAlgebra # Load C++ code wrapped as module CxxVortexTest include("code/vortextest.jl") # Load benchmarking tools include("code/benchmarking.jl");In [2]:ntests = 1000 # Tests to run n = 6 # Particles per side lambda = 1.5 # Core overlap verbose = true # Run

§5 Human · 8%

C++ benchmark cpptime = CxxVortexTest.benchmarkP2P_wrap(ntests, n, Float32(lambda), verbose) # Store benchmark result benchtime["C++"] = cpptime;Samples: 1000 min time: 3.99555 ms ave time: 4.77778 ms max time: 6.73414 ms This was ran in my Dell Latitude 5580 laptop (Intel® Core™ i7-7820HQ CPU @ 2.90GHz × 8 ) in only one process, and we see that the C++ kernel, best-case scenario, is evaluated in ~4 ms. Let’s move on and code this up in Julia.NOTE: The C++ code was compiled with the -O3 flag for code optimization.Optimizing JuliaJulia Baseline: Pythonic ProgrammingTempted by the Python-like syntax available in Julia, our first inclination is to make the code as general, simple, and easy to understand as possible. Here is the most general implementation where no types are specified:In [3]:""" This is a particle struct made up of ambiguous (unspecified) types """ struct ParticleAmbiguous # User inputs X # Position Gamma # Vectorial circulation sigma # Smoothing radius # Properties U # Velocity at particle J # Jacobian at particle J[i,j]=dUi/dxj ParticleAmbiguous(X, Gamma, sigma; U=zeros(3), J=zeros(3,3) ) = new(X, Gamma, sigma, U, J ) end # Empty initializer Base.zero(::Type{<:ParticleAmbiguous}) = ParticleAmbiguous(zeros(3), zeros(3), 0.0) # Winckelmans regularizing kernel g_wnk(r) = r^3 * (r^2 + 2.5) / (r^2 + 1)^2.5 dgdr_wnk(r) = 7.5 * r^2 / (r^2 + 1)^3.5 const const4 = 1/(4*pi) """ Pythonic programming approach """ function P2P_pythonic(particles, g, dgdr) for Pi in particles for Pj in particles dX = Pi.

§6 Human · 15%

X - Pj.X r = norm(dX) if r != 0 # g_σ and ∂gσ∂r gsgm = g(r / Pj.sigma) dgsgmdr = dgdr(r / Pj.sigma) # K × Γp crss = cross(-const4 * (dX/r^3), Pj.Gamma) # U = ∑g_σ(x-xp) * K(x-xp) × Γp Pi.U[:] += gsgm * crss # ∂u∂xj(x) = ∑[ ∂gσ∂xj(x−xp) * K(x−xp)×Γp + gσ(x−xp) * ∂K∂xj(x−xp)×Γp ] for j in 1:3 Pi.J[:, j] += ( dX[j] / (Pj.sigma*r) * (dgsgmdr*crss) - gsgm * (3*dX[j]/r^2) * crss - gsgm * (const4/r^3) * cross([i==j for i in 1:3], Pj.Gamma) ) end end end end endIn [4]:# Create a 6x6x6 box of ambiguous particles particles_amb = generate_particles(ParticleAmbiguous, n, lambda) # Run benchmark args = (particles_amb, g_wnk, dgdr_wnk) compare(P2P_pythonic, "C++", args; reverse=false) @benchmark P2P_pythonic(args...);C++ is 58.48 times faster than P2P_pythonic (3.996ms vs 233.679ms) BenchmarkTools.Trial: memory estimate: 217.57 MiB allocs estimate: 4087152 -------------- minimum time: 233.679 ms (5.04% GC) median time: 238.482 ms (4.99% GC) mean time: 251.540 ms (9.28% GC) maximum time:

§7 Human · 1%

295.517 ms (19.89% GC) -------------- samples: 4 evals/sample: 1 Here we see that in our pythonic attempt we’ve got code that is pretty neat and concise, but ~58x slower than the C++ implementation.Fix #1: Always work with concrete typesThe problem with the pythonic approach is that variable types are never declared, and without foreknowledge of the types to be handled, Julia can’t optimize the function during compilation. In order to help us catch ambiguous (abstract) types in our code, the Julia Base package provides the macro @code_warntype, which prints the lowered and type-inferred AST used during compilation highlighting any abstract types encountered.The output is pretty lengthy, so here I have copied and pasted only a snippet:@code_warntype P2P_pythonic(args...) Body::Nothing │╻╷╷ iterate34 1 ── %1 = (Base.arraylen)(particles)::Int64 ││╻╷ iterate │ %2 = (Base.sle_int)(0, %1)::Bool │││╻ < │ %3 = (Base.bitcast)(UInt64, %1)::UInt64 ││││╻ < │ %4 = (Base.ult_int)(0x0000000000000000, %3)::Bool ││││╻ & │ %5 = (Base.and_int)(%2, %4)::Bool . . . │ 11 ┄ %33 = φ (#10 => %28, #34 => %149)::ParticleAmbiguous │ │ %34 = φ (#10 => %29, #34 => %150)::Int64 │╻ getproperty37 │ %35 = (Base.getfield)(%16, :X)::Any ││ │ %36 = (Base.getfield)(%33, :X)::Any │ │ %37 = (%35 - %36)::Any │ 38 │ %38 = (Main.norm)(%37)::Any . . .

§8 Human · 1%

Understanding this lowered AST syntax is sort of an art, but you’ll soon learn that @code_warntype is your best friend when optimizing code. As we scroll down the AST we see that code encounters types Any in the properties of our ParticleAmbiguous type, which immediately should raise a red flag to us (Any is an abstract type). In fact, when running @code_warntype the output automatically highlights those ::Any asserts in red.We can take care of those abstract types by defining the properties of the struct parametrically:In [5]:""" This is a particle struct with property types explicitely/parametrically defined. """ struct Particle{T} # User inputs X::Array{T, 1} # Position Gamma::Array{T, 1} # Vectorial circulation sigma::T # Smoothing radius # Properties U::Array{T, 1} # Velocity at particle J::Array{T, 2} # Jacobian at particle J[i,j]=dUi/dxj end # Another initializer alias Particle{T}(X, Gamma, sigma) where {T} = Particle(X, Gamma, sigma, zeros(T,3), zeros(T, 3, 3)) # Empty initializer Base.zero(::Type{<:Particle{T}}) where {T} = Particle(zeros(T, 3), zeros(T, 3), zero(T), zeros(T, 3), zeros(T, 3, 3))No modifications further are needed in our P2P_pythonic function since Julia’s multiple dispatch and JIT automatically compiles a version of the function specialized for our new Particle{T} type on the fly. Still, we will define an alias to help us compare benchmarks:In [6]:P2P_concretetypes(args...) = P2P_pythonic(args...)In [7]:# Create a 6x6x6 box of concrete Float64 particles particles = generate_particles(Particle{Float64}, n, lambda)