If you would like to help us with this effort, you can download the
code and send us your results.
We plan to publish all the results. The following are some sample results we have collected.
The sequential results were obtained with the following command:
./Apex-Map-Seq -n1000 -m67108864 -i1024
It reports how performance, measured in cycles per data access, changes as temporal
locality (a) varies from 0.001 to 1 and spatial locality (L) varies from 1 to 65536,
with a memory size (M) of 512 MB.
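The link between the -m67108864 argument and the 512 MB memory size can be checked with a little arithmetic. The sketch below assumes that -m counts data words and that each word is 8 bytes; neither is stated explicitly here, but together they reproduce the quoted figure. The function name is hypothetical.

```python
# Assumption: -m gives the array size in words, and one word is 8 bytes.
WORD_BYTES = 8


def memory_size_mib(num_words, word_bytes=WORD_BYTES):
    """Return the memory footprint in MiB for an array of num_words words."""
    return num_words * word_bytes / (1024 * 1024)


# 67108864 words is 2**26; at 8 bytes each that is 2**29 bytes.
print(memory_size_mib(67108864))  # prints 512.0
```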
The following images show the results on a superscalar platform and on a vector platform.
The performance of the vector platform clearly depends strongly on
the spatial locality (L): the higher the spatial locality, the better the performance.
There is an almost perfectly linear relationship between L and the performance
when L <= 256. Temporal locality, however, has almost no effect.
On the superscalar platform, by contrast, both temporal locality and spatial locality affect the
performance significantly.
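To make the two locality parameters more concrete, here is a toy model of an access stream. It is only an illustrative sketch and not the benchmark's code: it assumes that each access touches L contiguous words, and that block start positions are drawn from a power-law distribution r**(1/a) with r uniform in [0, 1), so that a = 1 gives uniformly random starts and small a concentrates accesses near the beginning of memory. The function and parameter names are hypothetical.

```python
import random


def access_stream(mem_words, block_len, alpha, num_blocks, seed=0):
    """Generate num_blocks index ranges, each block_len words long.

    Assumed model (not taken from the benchmark source):
    start = (mem_words - block_len) * r ** (1 / alpha), r uniform in [0, 1).
    """
    rng = random.Random(seed)
    last_start = mem_words - block_len
    stream = []
    for _ in range(num_blocks):
        r = rng.random()
        start = int(last_start * r ** (1.0 / alpha))
        # Spatial locality: block_len contiguous words starting at start.
        stream.append(range(start, start + block_len))
    return stream


# Low temporal locality (a = 1): starts are spread over the whole memory.
spread = access_stream(1024, 4, 1.0, 100)
# High temporal locality (a = 0.001): starts cluster at the low addresses.
clustered = access_stream(1024, 4, 0.001, 100)
print(len({b.start for b in spread}), len({b.start for b in clustered}))
```

In this model, a larger L means more consecutive words per random jump, which is the kind of pattern a vector memory system is expected to favor, while a smaller a means the same few addresses are revisited, which is what a cache can exploit.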
The following two pictures show the parallel results for 256 MPI processes, obtained on
the same two platforms as in the sequential case with the following command:
mpirun -np 256 ./Apex-Map-Par -n1000 -m67108864 -i1024 -t
Both show how the aggregate bandwidth achieved (MB/s) changes as temporal locality (a)
varies from 0.001 to 1 and spatial locality (L) varies from 1 to 65536 words,
with a local memory size of 512 MB. Compared with the sequential case, the shapes of the
two figures are much more similar. For higher spatial locality, the vector platform
delivers over 40 times higher aggregate bandwidth.
In particular, we are also interested in how performance compares across all possible
numbers of processes for random access (a = 1.0) when L = 1 and L = 4096. The commands are:
mpirun -np P ./Apex-Map-Par -n1000 -m67108864 -i1024 -a1.0 -l1 -t
mpirun -np P ./Apex-Map-Par -n1000 -m67108864 -i1024 -a1.0 -l4096 -t
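The two commands above differ only in the process count P and the -l value, so a sweep can be scripted. The sketch below only builds the command strings from the template shown above; the helper name and the example process counts are hypothetical, and how the commands are actually launched depends on the local MPI and batch setup.

```python
def apex_map_commands(num_procs, block_lengths=(1, 4096)):
    """Build the random-access (a = 1.0) command lines for one process count."""
    template = (
        "mpirun -np {p} ./Apex-Map-Par "
        "-n1000 -m67108864 -i1024 -a1.0 -l{l} -t"
    )
    return [template.format(p=num_procs, l=l) for l in block_lengths]


# Hypothetical process counts; substitute the values available on your system.
for p in (16, 64, 256):
    for cmd in apex_map_commands(p):
        print(cmd)
```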
First, we notice the SMP effect (16 CPUs per node) on the superscalar platform.
Second, the vector platform scales much better than the superscalar platform.
The aggregate bandwidth using 4096 processors on the superscalar platform is
only slightly higher than the aggregate bandwidth using 16 processors on the vector platform.