Optimizing for HPC
- When running on one of the SGI computers, such as uv, you should use dplace to ensure that the process does not migrate between nodes. This is especially important for multi-threaded code: communication across nodes is expensive on a NUMA machine, and without dplace the scheduler will usually spread threads over different nodes. Usage is simple: prefix the command that launches your program with dplace. For instance, to run a program called foo: dplace foo
- numactl is similar to dplace but exists on all of our Linux machines, including kraken and uv, and can be used to specify both CPU and memory bindings to nodes. Even if you use dplace, numactl has a very useful feature for showing the NUMA hardware info: numactl --hardware. This reports the number of nodes, how much memory each node has, how much of that memory is free, and the relative cost of accessing memory on another node.
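For example (node 0 is an arbitrary choice; both flags are standard numactl options):

```shell
# Print NUMA topology: node count, per-node total/free memory,
# and the inter-node access-cost (distance) matrix.
numactl --hardware

# Run ./foo with its CPUs and its memory allocations both bound to node 0:
numactl --cpunodebind=0 --membind=0 ./foo
```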
- When using the default options with dplace or numactl, a thread that allocates memory is given local memory on its own node. However, if there is not enough free memory on the local node, it will silently fall back to memory on a slower-to-access remote node. This means that running the same program twice, even with dplace, can give noticeably different performance: the first run may find enough free memory on the local node while the second does not. This is especially excruciating after large file IO, since the file-system page cache can consume an entire node's memory; any cores on that node are then stuck making the more expensive remote accesses. One workaround is to force the kernel to evict the page cache by allocating all of the memory on that node, e.g. memhog 64g membind n, where n is the node to be freed. Touching that many pages can take a few minutes; use nodeinfo to verify the progress. Do not run this if other users' processes are running on that node! The recommended way to free the page cache is bcfree, whose man page discusses its usage. The catch is that on our systems it currently requires root permission to use.
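A sketch of the memhog workaround described above (node 3 and the 64g size are examples; adjust both to the node you need freed and its memory size):

```shell
# WARNING: only do this when no other users' jobs are running on the node.
# Allocate 64 GiB bound to node 3, forcing the kernel to evict the
# file-system page cache held there:
memhog 64g membind 3

# In another shell, watch per-node free memory recover while memhog runs:
nodeinfo
```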
- The SGI tuning guide discusses how to tune programs for use on uv: http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=linux&db=bks&cmd=toc&pth=/SGI_Developer/LX_86_AppTune. It covers dplace and many other tools. Chapter 4 lists lots of interesting tools for seeing what is going on in the system; for instance, nodeinfo gives NUMA statistics, something like a top for NUMA. Run it alongside your program to see whether it is making expensive memory accesses to other nodes. The SGI Altix UV white paper http://www.sgi.com/pdfs/4192.pdf should be read if you want to fully optimize your program for the uv hardware. This SGI document describes how to use MPI on a UV: http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=linux&db=bks&cmd=toc&pth=/SGI_Developer/MPT_UG. The short summary is that you compile normally using gcc (not mpicc) but add the -lmpi flag (additionally add -lmpi++ if using g++), then run the result with mpirun. If using MPI_THREAD_MULTIPLE, replace -lmpi with -lmpi_mt. omplace or dplace should be used to pin threads; man omplace describes this.
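The compile-and-run steps above can be sketched as follows (foo.c, the rank count, and the omplace placement are placeholders; consult man omplace for the placement options on your system):

```shell
# Compile normally with gcc, linking against SGI MPT's MPI library:
gcc -o foo foo.c -lmpi        # add -lmpi++ when compiling C++ with g++

# If the code requests MPI_THREAD_MULTIPLE, link the thread-safe library instead:
gcc -o foo foo.c -lmpi_mt

# Launch with SGI's mpirun, using omplace to pin each rank's threads:
mpirun -np 4 omplace ./foo
```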