docs/optimizations.txt

Performance Notes
0) Consider file writing and other bottlenecks.
PRODUCTION==0 vs. 1 changes performance by about 20% with debugfail==2 set.
E.g.: The type of problem can change performance. The same 64x32 model with R0=0, Rout=40 gets 70K ZCPS (zone cycles per second), while R0=-8.3, Rout=1E10 gets 62K ZCPS. Both use grid sectioning. The difference occurs because one has more failures than the other; setting PRODUCTION 0 disables file writing of those failures and gets that 62K ZCPS run back to 70K.
DODIAGS==0 or 1 can change things a lot due to per-substep diagnostics being enabled.
E.g.: File writing in general should be stretched out in period to avoid slowing down the code; too-frequent writing can slow things severely. Consider ROMIO and Jon's non-blocking file writing.
E.g.: The same model described above w.r.t. PRODUCTION=0/1 goes to 80K ZCPS with DODIAGS=0. Ensure that ENERDUMPTYPE (and image/etc. creation) does not have too small a period.
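As a minimal sketch of the period-stretching idea (DUMP_PERIOD and dump_diags() are hypothetical names, not the actual HARM variables):

  /* Gate expensive file output on a step period so I/O cost is
     amortized over many steps. */
  #define DUMP_PERIOD 100               /* steps between dumps; tune until I/O time is negligible */

  extern void dump_diags(long nstep);   /* hypothetical heavy file-writing routine */

  void maybe_dump(long nstep)
  {
      if (nstep % DUMP_PERIOD == 0)
          dump_diags(nstep);
  }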
1) Consider cache use.
For example, on a 4-core system (Intel Core2), a 512x32 model runs at 62K ZCPS while an otherwise similar 64x32 model runs at 64K ZCPS. As stated below, about 100 zones of data fit into cache, and having N1=64 (i.e. N1M=72) allows an entire line of data to fit into cache, reducing cache misses.
On the other hand, if doing grid sectioning with a large number of radial zones (e.g. 1024), it is optimal to have the entire N1M line on one node. This necessarily causes additional cache misses, but otherwise nodes go unused and the performance drop is a factor of 2X or more instead of the ~5% above.
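As a rough check, using the ~25K/cell figure quoted further below: an N1M=72 line costs about 72 x 25KB ~ 1.8MB, which fits in a 4MB L2 cache, while a 512-zone line (~13MB) does not.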
2) Consider each system's cache.
e.g. TACC Ranger uses 4 sockets of Barcelona CPU:
http://images.anandtech.com/reviews/cpu/amd/phenom2/barcelona-block-diagram.jpg
Each core has a 64KB private L1 cache (32K L1I and 32K L1D; I=instruction, D=data).
Each core has a *private* 512KB L2 cache. This is nice compared to the shared L2 cache of my ki-rh42, even if that cache is 2X larger.
Each socket of 4 cores has a *shared* 2MB L3 cache.
Each node has 4 sockets, each with 4 cores = 16 cores/node.
Each node has 32GB of memory for its 16 cores.
The memory bus runs at 533MHz with 2 channels total for *all* 16 cores!
So staying in cache is *critical*, since otherwise all 16 cores compete for 2 channels of memory.
If using OpenMP, one avoids the excessive boundary cells that MPI otherwise requires, which is good. [So OpenMP really saves on memory!!! Very important!]
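As an illustration with MAXBND=4 (hypothetical decomposition): a 64x32 patch run as a single OpenMP domain stores (64+8)x(32+8) = 2880 cells, while the same patch split into 4 MPI tasks of 32x16 each stores 4 x (32+8)x(16+8) = 3840 cells, i.e. about 33% more memory spent on boundary cells alone.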
Can't fit the entire problem in cache. In reality, one needs to know how much memory is used over longer periods of time. Perhaps think per line: then one only needs to ensure a single line fits. Then one can do up to about N1=160 total on those 16 cores before cache misses occur.
TACC Lonestar has a 4MB L2 cache with the 2.66GHz Intel Xeon 5150. The TACC page calls it a "smart" L2 cache.
It appears to have a private 2MB per core.
http://www-dr.cps.intel.com/products/processor/xeon5000/specifications.htm?iid=products_xeon5000+tab_specs
http://services.tacc.utexas.edu/index.php/lonestar-user-guide
http://processorfinder.intel.com/List.aspx?ProcFam=528&sSpec=&OrdCode=
http://processorfinder.intel.com/details.aspx?sSpec=SL9RU
http://www.linuxdevices.com/files/misc/intel_5100.jpg
http://www.hardwarezone.com/img/data/articles/2006/2002/Bensley-block-diagram-2.jpg
Teragrid LONI QueenBee (Queen Bee):
http://www.loni.org/systems/
http://www.loni.org/teragrid/users_guide.php
1) Connect to (e.g.) abe: ssh jmckinne@abe.ncsa.uiuc.edu
2) Setup proxy: myproxy-logon -l jmckinne -s myproxy.teragrid.org
3) Use NCSA Portal Password
4) Connect to QB: gsissh login1-qb.loni-lsu.teragrid.org
QUEENBEE: Must set GETTIMEOFDAYPROBLEM 1
Teragrid NCSA Abe:
http://www.teragrid.org/userinfo/hardware/resources.php?select=single&id=50&PHPSESSID=0
http://www.dell.com/content/products/productdetails.aspx/pedge_1955?c=us&l=en&s=corp
Quad-core Xeon 5300 series (Xeon 5000 family).
2*4MB L2 cache (shared, like ki-rh42).
tgusage -u jmckinne
tgusage -a TG-AST080026N
tgusage -a TG-AST080025N
Intel Core2 appears to have the same(?) layout as Barcelona but with no L3 cache.
JCM's system is: Intel(R) Core(TM)2 Extreme CPU Q6800 @ 2.93GHz, stepping 0b.
It appears to have 2x4MB = 8MB of (shared) L2 cache and 32K/32K D/I L1 caches.
This 8MB of L2 cache is, however, shared such that 2 cores share one 4MB half and the other 2 cores share the other 4MB. So unless one tunes which core gets which process, each use of a new core cuts the effective L2 cache bandwidth roughly in half. This may be consistent with the performance results below.
So 2X the cache compared to Ranger.
http://www.intel.com/design/processor/datashts/316852.htm
Intel Core i7:
http://en.wikipedia.org/wiki/Intel_Core_3
Note it has a 32KB L1 D cache per core,
a 256KB L2 cache (I+D) per core,
and an 8MB L3 cache (I+D) shared between *all* cores.
Triple-channel memory.
Testing memory contention, and whether runs sit entirely within L2 cache, on JCM's system (currently without dual-channel memory access, so this is a strong test of whether a run fits in L2 cache):
Multiple runs (performance for each core):
#Cores:                      1     2        3        4
MAXBND==2 DONOR    64x32:   60K   27K-30K
MAXBND==4 PARALINE 64x32:   50K   45K      24K-27K
MAXBND==2 DONOR    4x2:     30K   30K      30K      30K    [so finally fits entirely within L2 cache]
MAXBND==4 PARALINE 4x2:     18K   18K      18K      18K    [so finally fits entirely within L2 cache]
MAXBND==4 PARALINE 4x8:     30K   29.5K    29.3K    29K    [so finally fits entirely within L2 cache]
MAXBND==4 PARALINE 16x8:    50K   48K      48K      48K    [so kinda hits memory]
MAXBND==4 PARALINE 32x16:   56K   55K      36K      31K    [so 32x16 x (2 cores) fits in cache, but not more (i.e. not 32x16 x 3 or 32x16 x 4)]
This suggests an optimal OpenMP choice of N1*N2*N3 = 32*32 per *node* to avoid memory contention and stay within L2 cache. For elongated cells this is 512x2 in 2D or 256x2x2 in 3D. But small cells lead to inefficiency from the extra computations near boundary cells. So one probably wants 256x4 in 2D and 64x4x4 in 3D. On Ranger one has to take half of this, since L2+L3 is half the size: so on Ranger use 128x4 in 2D and 32x4x4 in 3D to fit into L2+L3 cache.
For the above, the good 4-core performance probably comes from staying mostly in *L1* cache, not L2. This is consistent with the results below.
The hit then occurs because ki-rh42's 4 cores feed off 2 L2 caches, so the bandwidth is halved when 3-4 cores are running.
New code with even more optimizations (esp. fluxct, no eomfunc[NPR] data, and now also reduction of symmetric matrices):
Multiple runs (performance for each core):
#Cores: 1 2 3 4
MAXBND==4 PARALINE 16x8: 51K 51K 51K 50K [4-core perf not helped by the symm. matrix reductions]
MAXBND==4 PARALINE 32x16: 60K 60K 54K 50K [so cache misses are not as bad with the new code. Reducing symmetric matrices increased 4-core performance by 25%, so good.]
MAXBND==4 PARALINE 64x16: 60K 59K 48K 36K [the symm. matrix fix increased 4-core perf by 25%]
OpenMP (i.e. USEOPENMP==1 in makehead.inc) tests:
#Cores: 1 2 3 4
MAXBND==4 PARALINE 16x8: 37K 48K 58K 66K [horrendous 4-core performance -- 10% improvement with a minchunk of 10 or static schedule compared to guided. Didn't help with the 1-core overhead. Reducing memory overhead didn't help, suggesting it's all OpenMP overhead!]
MAXBND==4 PARALINE 64x16: 48K 78K 88K 99K [Unsure what's memory vs. OpenMP overhead, but a big hit compared to the above runs! Reducing memory overhead didn't help at all for 4 cores, suggesting it's all OpenMP overhead! Removed as much private() [essentially all] and copyin() [as much as I could] as possible, with little change! Even ALLOWKAZEOS is off! -- Must be the EOS ptr functions? Just stupid addresses!]
MAXBND==4 PARALINE 32x16: 45K 69K 83K 92K
Even with memory contention, 4 cores here should reach about 4*40K=160K or 4*30K=120K. There appears to be 20%-25% overhead per core no matter how many cores are used (even 1).
There also appears to be extreme overhead whenever doing fewer than ~200 or so iterations. This requires the per-node tile to be (say) 64x16 or larger, while the L2 cache requires 64*16*2 or smaller. So 64x16 - 64x32 (or 32x32) seems to be the sweet spot that minimizes both OpenMP overhead and L2 cache misses.
To test OpenMP overhead, I commented out all #pragma's but left -openmp during compilation and checked 1 core:
#Cores: 1 2 3 4
MAXBND==4 PARALINE 64x16: 61K
So it is not -openmp itself; the #pragma's cause the problem.
Next I tried commenting out only the inner-loop pragmas:
#Cores: 1 2 3 4
MAXBND==4 PARALINE 64x16: 46K
So the problem is not -openmp or "#pragma omp for"; "#pragma omp parallel" is the problem.
Next I tried commenting out the loop pragmas and defining OPENMPGLOBALPRIVATE (etc.) to be empty (possible since globals are preserved on the master thread).
Also commented out "#pragma omp threadprivate...".
And commented out the copyin: "parallel copyin(" -> "//copyin(".
#Cores: 1 2 3 4
MAXBND==4 PARALINE 64x16: 61K
So the problem appears to be copyin(). With the same 64x16 test:
TEST1CORE: without copyin(EOS ptrs): reach 52K for 1 core -- still overhead?
TEST1CORE: without copyin(EOS ptrs + all loop stuff except ploop): reach 52K -- still overhead!
TEST1CORE: without copyin(EOS ptrs + all loop stuff): reach 57K -- finally little overhead!
NEWCODE1: with all required things threadprivate and copyin() [had too many things for loops]: 51K
NEWCODE2: with EOS ptrs totally fixed (set outside parallel regions and no longer threadprivate): 54K-55K for 1 core [good!]; 102K or up to 112K for 4 cores [still bad]
NEWCODE3: with no {ijk}curr, whichuconcalc, etc., which required extensive code adjustments to the EOS-related stuff to force modularity: 57K for 1 core and 104K for 4 cores. So quite little OpenMP overhead now for 1 core, but still there for 4 cores.
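A minimal sketch of the fix that removed most of the 1-core overhead (hypothetical names; the real code's EOS pointers differ): set function pointers once outside any parallel region and leave them as plain shared, read-only globals, instead of marking them threadprivate and copyin()'ing them at every parallel-region entry:

  #include <omp.h>

  /* hypothetical stand-ins for the EOS pointer globals */
  static double ideal_gas_pressure(double rho, double u) { (void)rho; return 0.4 * u; }
  static double (*eos_pressure)(double, double);   /* plain shared global */

  /* OLD, slow pattern (sketch):
       #pragma omp threadprivate(eos_pressure)
       ... #pragma omp parallel copyin(eos_pressure)
     -- the copyin() runs on every region entry and was the overhead. */

  void setup_eos(void)                  /* call once, before any parallel region */
  {
      eos_pressure = &ideal_gas_pressure;
  }

  void pressure_loop(const double *rho, const double *u, double *p, int n)
  {
  #pragma omp parallel for schedule(static)   /* no copyin(): pointer is shared and read-only */
      for (int i = 0; i < n; i++)
          p[i] = eos_pressure(rho[i], u[i]);
  }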
So the 1-core overhead is now quite small. Unsure why 4 cores are bad, since the runs are not memory limited: changing the number of cores doesn't change the problem size.
Installed all memory into ki-rh42 (8GB total) so it uses dual-channel memory access:
1) OpenMP 4 cores went from 86K -> 107K for the 512x32 testreality problem.
2) MPI 4 cores (no grid sectioning, so all cores used) went from 90K -> 110K.
So despite all Jon's hard work with OpenMP, MPI is still faster? There can't be large OpenMP overhead at 512x32.
Overall, performance is nearly as expected for a simulation that doesn't fit entirely into L2 cache, which would be roughly 31K/core * 4 = 120K.
#Cores: 1 2 3 4
MPI MAXBND==4 PARALINE 64x16: 72K (64x16 total with 32x8 per core on 4 cores) -- each process only runs at 40% of a core!
OMP MAXBND==4 PARALINE 64x16: 130K (only a bit over 2X the expected performance from 57K/core from NEWCODE3. Not sure what's limiting, since it should all fit in L2 cache!)
Clearly for small problem sizes per core, OpenMP is much more efficient than MPI. For large problem sizes per core (e.g. 256x16 per core for 512x32 total), MPI is just slightly more efficient. Thus, in general one benefits from using OpenMP. Unclear how things scale with many more processors or cores.
Still, the problem is that efficiency from 1 core to 4 cores is 50% in either case, and again not because of L2 cache. [Well, ki-rh42's L2 cache is effectively shared, so that may explain it.]
This suggests that, for Ranger at least, one should try to fit into the private 512KB L2 cache. Can't use the above tests to conclude what size that is for ki-rh42. Can only run 2 cores and see when adding the 2nd core slows things down (assuming each core uses essentially its own nearby cache). Once the 2nd core slows things down significantly, the run must have gone to memory.
2-CORE ki-rh42 test (latest NEWCODE3) for fitting into L2 cache:
Multiple runs (performance for each core, OpenMP disabled):
#Cores: 1 2 3 4
MAXBND==4 PARALINE 64x16: 63K 63K
MAXBND==4 PARALINE 64x32: 61K 60K
MAXBND==4 PARALINE 64x64: 56K 58K
MAXBND==4 PARALINE 128x128: 61K 57K [Starting to hit memory]
MAXBND==4 PARALINE 256x256: 60K 56K 50K 40K [Starting to hit memory]
OpenMP:
#Cores: 1 2 3 4
MAXBND==4 PARALINE 256x256: 59K 95K 114K 116K [Consistent with (roughly) 2 cores sharing 4MB of L2 cache, so with 3-4 cores each core goes much slower due to cache contention.]
MAXBND==4 PARALINE 64x16: 59K 99K 122K 128K ["" -- roughly]
MAXBND==4 PARALINE 32x16: 59K 90K 115K 119K [Linux probably doesn't schedule processes based on their L2-cache association with cores, so processes can flip around and it's almost as bad as all 4 cores sharing one L2 cache.]
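To test whether explicit pinning helps, one can set CPU affinity by hand; a minimal Linux sketch (which core numbers share an L2 is machine-specific, so the choice of core is an assumption):

  #define _GNU_SOURCE
  #include <sched.h>

  /* Pin the calling process (or thread) to one core so the scheduler
     cannot migrate it away from the L2 cache it has warmed up. */
  int pin_to_core(int core)
  {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(core, &mask);
      return sched_setaffinity(0, sizeof(mask), &mask);   /* pid 0 = this process */
  }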
Perf with STAG+DISS+DISSVSR+LUMVSR = 30K -- so about 2X slower.
Perf with STAG+DISS+DISSVSR+LUMVSR+3D = 11K -- so about 6X slower.
TODO:
1) If limiting interpolation (e.g. for stag, or rescale in stag), pass that fact along rather than using the global npr2interp, and ensure all interior loops use the passed data rather than globals.
2) For KAZ stuff, probably fine.
Performance on Ranger (all MAXBND==4 PARALINE) for 1000 steps with DODIAGS=0, PRODUCTION 1, TIMEORDER=2, FLUXB=FLUXCTTOTH, and DODISS=DODISSVSR=DOLUMVSR=0. (Eff = PERF / [#OPENMP/task * #MPItasks * the 35K single-core PERF].)
N1xN2 #NODES #OPENMP/task #MPItasks ncpux? PERF Eff
64x64: 1 1 1 1x1x1 35K 100%
64x64: 1 4 4 2x2x1 403K 72%
64x64: 1 1 16 4x4x1 407K 73%
64x64: 1 16 1 1x1x1 71K 13%
64x64: 1 8 2 2x1x1 215K 39%
64x64: 2 1 32 8x4x1 940K 84%
64x64: 2 4 8 4x2x1 950K 85%
64x64: 2 1 64 8x8x1 1793K 80%
64x64: 2 4 16 4x4x1 1862K 83%
64x64: 2 1 256 16x16x1 7098K 80%
64x64: 2 4 64 8x8x1 6766K 76%
64x16: 1 1 1 1x1x1 35K
64x16: 1 4 4 2x2x1 406K 72%
64x16: 1 1 16 4x4x1 471K 84%
64x16: 1 16 1 1x1x1 55K 10%
64x16: 1 8 2 2x1x1 191K 34%
64x16: 2 1 32 8x4x1 907K 81%
64x16: 2 4 8 4x2x1 777K 70% [repeated run got same result!?]
64x16: 2 1 64 8x8x1 1730K 77%
64x16: 2 4 16 4x4x1 1488K 66%
So each Ranger core is almost 2X slower than a ki-rh42 core. That's AMD vs. Intel for you!
So it is clearly bad to cross sockets to memory over the bus, as OpenMP must when using more than 4 threads with 1 thread per core.
MPI seems to be doing fine at 64^2 on one node.
Unclear why OpenMP is actually slower here even for 64x16, which worked 2X better on ki-rh42!
Performance on Ranger (all MAXBND==4 PARALINE) for 1000 steps with DODIAGS=0, PRODUCTION 1, TIMEORDER=2, FLUXB=FLUXCTSTAG, and DODISS=DODISSVSR=DOLUMVSR=1:
Default: module unload mvapich2 pgi ; module load mvapich2 intel mkl
N1xN2xN3 #NODES #OPENMP/task #MPItasks ncpux? PERF Eff
64x32x8: 1 1 1 1x1x1 6K
64x32x32: 1 1 1 1x1x1 8.7K [to be used as reference for efficiency for 64x32x32 per MPI task runs]
64x32x8: 4 4 4 2x2x1 41K 43%
64x32x8: 16 1 16 4x4x1 84K 88%
64x32x8: 64 4 64 8x8x1 617K 40%
64x32x8: 64 4 64 8x8x1 624K 41% [used purely static schedule, no user chunking]
128x64x8: 64 4 64 8x8x1 713K 46% [used purely static schedule, no user chunking]
64x32x32: 64 4 64 8x8x1 884K 40%-57% [used purely static schedule, no user chunking] [smaller % is when using correct reference point]
64x32x32: 64 4 64 8x8x1 896K 40%-58% [Changed to mvapich/1.0.1 during compile and in batch script]
64x32x32: 64 4 64 8x8x1 888K 39%-58% [Changed to openmpi/1.3 during compile and in batch script]
64x32x8: 256 1 256 8x8x4 1278K 83%
On ki-rh42 with new code:
N1xN2xN3 #NODES #OPENMP/task #MPItasks ncpux? PERF
64x32x8: 1 1 1 1x1x1 17K
From 11K -> 17K -- not that big a difference after a day of pain.
On Ranger: 4x4x4 per CPU with PARALINE+STAG+DISS+LUM (but R0=0 and Rout=10) on 2048 processors with ncpux? = 16x16x8: late-time performance of 54K ZCPS with an average fractional diagnostics cost of 10%.
FULLSTAG = UNFUDDLE, STAG, RK2, DODISS, currents, entropy evolution, etc. etc.
HALFSTAG = UNFUDDLE, STAG, RK2, no features
HALFTOTH = UNFUDDLE, Toth, RK2, no features
ORIG = original very lean (and crashy-unstable) 2D HARM, which is on unfuddle as "origcode"
system tile_size cores zcps eff CODE-MODE
----------------------------------------------------------------------
3D:
lonestar 34x32x8 1 13K 1 FULLSTAG
lonestar 34x32x8 1 14K 1 HALFSTAG
lonestar 34x32x8 1 28K 1 HALFTOTH
lonestar 34x32x8 1024 9.5M 71% FULLSTAG
lonestar 34x8x8 1 ? 1 FULLSTAG
lonestar 34x8x8 512 3.3M >50%? FULLSTAG
ki-rh42 32x16x8 1 14K 1 FULLSTAG
ki-rh42 32x16x8 1 15K 1 HALFSTAG
ki-rh42 32x16x8 1 28K 1 HALFTOTH
ki-rh42 16x32x32 1 30K 1 HALFTOTH (Noble setup, who got 36K)
2D:
ki-rh42 64x64 1 57K 1 HALFTOTH
ki-rh42 256x256 1 61K 1 HALFTOTH
ki-rh42 64x64 1 80K 1 ORIG
ki-rh42 256x256 1 84K 1 ORIG
http://www.intel.com/design/core2XE/documentation.htm
1) Best if arrays are powers of 2 in size, so the compiler bit-shifts to index the array.
2) Try to keep all memory access local. Cache lines are grabbed along the fastest-varying memory direction, so avoid loops that skip over indices. It is also best to compact multiple arrays into a single per-point array or structure if those arrays are accessed together, since that avoids the cache miss incurred when grabbing the other array, which would be displaced due to its full size.
3) Try to reduce the memory footprint (total arrays, global or not) (and also reduce N1,N2,N3) so code+data fit into L2 cache (~4MB+).
4) Try to ensure code+data at the function level fit into L1 cache (32K).
5) Especially for multi-core apps, want to fit MUCH of the data into L2 cache.
This corresponds to (currently) using 25K/cell including boundary zones (BZones).
So (e.g.) N1M=N2M=12 (i.e. N1=N2=4) would fit completely into an L2 cache of 4MB.
In reality, one just needs to fit code+data along several cache lines to avoid memory being accessed. The code currently runs through memory along N3, then N2, then N1, so it is optimal to have N3>>N2>>N1.
Should allow N1=phi, N2=theta, N3=r in case one wants many r cells per core.
Instead of changing [N1M][N2M][N3M], should change from [i][j][k] -> [k][j][i], for example -- so macrofy this (see the sketch below).
Similar to, but more involved than, what Sasha does with his inits.
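A minimal sketch of the macrofication (macro and dimension values are hypothetical): every access goes through one macro, so only the macro and the declaration change when the storage order is swapped, not the loops:

  #define N1M 72
  #define N2M 40
  #define N3M 1

  #ifdef STORAGE_KJI                      /* swapped layout: i varies fastest in memory */
    static double a[N3M][N2M][N1M];
  # define MAC(i,j,k) a[k][j][i]
  #else                                   /* current layout: k varies fastest in memory */
    static double a[N1M][N2M][N3M];
  # define MAC(i,j,k) a[i][j][k]
  #endif

  /* Loops are written once with MAC(i,j,k) and never need editing:
     for (i...) for (j...) for (k...) MAC(i,j,k) = ...; */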
http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm
When arrays of structures are involved, the compiler performs a multiply by the structure size to do the array indexing. If the structure size is a power of 2, an expensive multiply operation is replaced by an inexpensive shift. Thus keeping structure sizes aligned to a power of 2 improves array-indexing performance.
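A hedged example combining points 1) and 2) above (field names are made up, not HARM's actual per-zone layout): arrays that are accessed together live in one per-zone structure, padded to a power-of-2 size:

  /* Primitives and geometry that are used together sit side by side,
     so one cache-line fetch serves both. */
  struct zone {
      double prim[8];   /* 64 bytes: hypothetical primitive variables */
      double geom[7];   /* 56 bytes: hypothetical per-zone geometry */
      double pad;       /*  8 bytes: pads 120 up to 128 = 2^7 */
  };

  /* Power-of-2 size => zone[i] indexing compiles to a shift, not a multiply. */
  _Static_assert(sizeof(struct zone) == 128, "struct zone is not a power-of-2 size");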
3) Note that for very small N1,N2,N3, ZCPS will appear to drop even if internal performance is the same. This is because (e.g.) if N1=32 and N2=2, then in the 2-direction it's really like doing N2=4 because of the extra 2 zones computed for the surface flux. In reality not all parts of the code do this, so the affected performance lies within a 1-2X factor.
Use pfmon (perfmon2) to look at cache and memory problems.
Note that perfmon2 by default only looks at a single thread, but it shows OpenMP usage for that thread, so it is still useful. It also shows pow() and other library calls that gprof does not.
Note that perfmon2's HALT cycles may be repeatably larger for faster code (not just other processes eating time). This doesn't mean the run would be slower if timed without perfmon. For example, I found that OpenMP'ing poledeath() led to more HALT cycles but clearly shorter wall time.
See installperfstuff.txt and compilekernel.txt
Use gprof:
1) Compile with -g -pg
2) Run test code
3) Run: gprof ./grmhd > prof.txt
4) Run: gprof -l ./grmhd > profbyline.txt
5) Look at the top and bottom portions of prof.txt to get an idea of the bottleneck.
6) Use profbyline.txt to see whether the bottleneck comes down to single lines that can be improved.
In any case (pfmon or gprof), compile with -fno-inline (and remove other inline commands) so inlined functions are expanded; then one can tell which interior functions take time within a function that was previously inlined. Use the __inline keyword in front of those functions one wants to see not inlined.
Note that while gprof seems to miss some functions (e.g. pow() in the icc library), pfmon catches them as pow.L.
One can use gprof-helper.c to have gprof measure multi-thread performance and see the overhead.
See inside gprof-helper.c. To use it:
1) gcc -shared -fPIC gprof-helper.c -o gprof-helper.so -lpthread -ldl
2) (e.g.) LD_PRELOAD=./gprof-helper.so grmhd 4 1 1 1
Note that gprof, pfmon, or any performance timer will slow down the code. Also remove -pg from the makefile to avoid that slowdown.
If not using SIMD (single instruction, multiple data) or SSE-type operations (which most compilers attempt automatically when optimizing), then roughly:

CPU cycles | operation or procedure
---------------------------------------------
         1 | addition, subtraction, comparison
         2 | fabs
         3 | abs
         4 | multiplication
        10 | division, modulus
        20 | sqrt
        50 | exp
        60 | sin, cos, tan
        80 | asin, acos, atan
       100 | pow
        10 | L1 cache miss
        50 | L2 cache miss if L3 cache present
       150 | branch misprediction
       200 | L3 cache miss, or L2 cache miss if no L3 cache present
 200-1000+ | page fault
Also, note that there is more than just math operations to worry about. Another rule of thumb is:
page fault >>>>>> cache miss >> branch misprediction >> dependency chain >> non-SSE sqrt, division, and modulus > SSE sqrt etc. > everything else
with ">>" meaning much greater than (as in 3-20 times).
To get size of code's array elements:
Use program "nm" to list objects and symbols, or use objdump -axfhp
Use "nm -S --size-sort" to list by memory size
To check global symbol sizes use:
nm -S --size-sort grmhd
or: objdump -axfhp grmhd
e.g. to check total size used do:
1) nm -S --size-sort grmhd | awk '{print $2}' > list.grmhd.sizes.txt
2) total=0
3) for fil in `cat list.grmhd.sizes.txt` ; do total=$(($total+0x$fil)) ; done
4) echo $total
Now take that number and divide by N1M*N2M*N3M; that gives the approximate bytes used per zone (the N?M dimensions include boundary cells).
Seems to be about 1250 elements/zone.
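(At 8 bytes per double, ~1250 elements is roughly 10KB of global array data per zone; the ~25K/cell figure quoted earlier presumably also counts temporaries and code.)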
Checking FLOPS with papi
http://www.cisl.ucar.edu/css/staff/rory/papi/flops.html
with perfmon2 (pfmon):
pfmon --show-time -e FP_COMP_OPS_EXE ./grmhd.makedir.testpfmon_simple 1 1 1
Note that X87_OPS_RETIRED:any doesn't seem to make sense.
Then take the resulting count and divide by the number of seconds the code takes WITHOUT pfmon. That's the FLOPS.
I get about 580 MFLOPS per core ~ 0.6 GFLOPS per core.
The theoretical maximum is about 5.4 GFLOPS, so I'm about 10% efficient with the FPU.
This shows that I'm memory limited.
When using the new code with macrofied arrays and TIMEORDER==2 (RK2) but still with PARALINE, I get 0.7 GFLOPS, since the code is more focused on inversion than on memory accesses.
