Week 2
===========

.. toctree::
    :maxdepth: 2

In the second week we implemented microbenchmarks that measure execution throughput, in floating-point operations per second, for the following instructions:

* `FMADD (scalar)`, FP32 variant.
* `FMLA (vector)` with arrangement specifier `4S`.
* `FMLA (vector)` with arrangement specifier `2S`.

Furthermore, we implemented a kernel that performs a permutation on a tensor `abc` of the form `abc -> cba`. The `a` and `b` dimensions were fixed to `8` and `4` respectively, while the `c` dimension was allowed to vary.

Execution Throughput
-------------------------

..
    1. Benchmark implementation
        1. General overview
        2. Optimizations
    2. Results on the raspberry pis
    3. Explanation of results

For each of the given instructions, we execute the same instruction repeatedly and measure the total time taken. If ``i`` denotes the total number of instructions executed, ``t`` the measured execution time in seconds, and ``f`` the number of floating-point operations performed per instruction, then the number of floating-point operations per second ``z`` achieved by the benchmark is:

``z = (i * f) / t``

The functions ``fmadd_kernel``, ``fmla_4s_kernel`` and ``fmla_2s_kernel`` execute their respective instruction repeatedly as described. However, these functions do not issue their instructions at the maximum possible rate, because each instruction depends, through its operands, on the result of the preceding instruction. An instruction therefore cannot start before its predecessor has finished, which prevents the CPU from fully utilizing its instruction pipelines. The functions ``fmadd_kernel_v2``, ``fmla_4s_kernel_v2`` and ``fmla_2s_kernel_v2`` avoid this issue.

We obtained the following results when running the benchmarks on the provided Raspberry Pi machines.

* ``fmadd_kernel``: 1.12 GFLOPS
* ``fmla_4s_kernel``: 9.59 GFLOPS
* ``fmla_2s_kernel``: 4.80 GFLOPS
* ``fmadd_kernel_v2``: 9.59 GFLOPS
* ``fmla_4s_kernel_v2``: 38.37 GFLOPS
* ``fmla_2s_kernel_v2``: 19.17 GFLOPS

Permutation
-------------------------

.. literalinclude:: ../../../neon/src/permutation_kernel.s
    :language: armasm

* ``permutation_kernel c=4``: 31.5954 GiB/s
* ``permutation_kernel c=8``: 32.2068 GiB/s
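
To make the index mapping of `abc -> cba` explicit, here is a minimal scalar sketch in C. It is not the assembly kernel included above, only a reference that such a kernel could be checked against. The function name ``permute_abc_to_cba_ref``, the FP32 element type and the row-major layout (last index contiguous in memory) are assumptions made for illustration.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Fixed extents of the a and b dimensions, as stated in the text. */
    enum { DIM_A = 8, DIM_B = 4 };

    /* Hypothetical scalar reference for abc -> cba.
     * Assumption: both tensors are row-major, so the input has shape
     * (a, b, c) with c contiguous, and the output has shape (c, b, a)
     * with a contiguous. The buffers must not overlap. */
    void permute_abc_to_cba_ref(const float *in, float *out, size_t dim_c)
    {
        assert(in != NULL && out != NULL && in != out);
        for (size_t a = 0; a < DIM_A; ++a) {
            for (size_t b = 0; b < DIM_B; ++b) {
                for (size_t c = 0; c < dim_c; ++c) {
                    out[(c * DIM_B + b) * DIM_A + a] =
                        in[(a * DIM_B + b) * dim_c + c];
                }
            }
        }
    }

Under these assumptions, a vectorized kernel can be validated by filling the input with distinct values, running both versions for the same ``dim_c``, and comparing the output buffers element by element.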
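
The throughput formula ``z = (i * f) / t`` from the Execution Throughput section can be sketched as a small C helper. The function name ``gflops`` and the enum constants are hypothetical and do not come from our benchmark driver. The values of ``f`` follow from counting one multiply and one add per FP32 lane: 2 for scalar ``FMADD``, 4 for ``FMLA`` with ``2S`` and 8 for ``FMLA`` with ``4S``.

.. code-block:: c

    #include <assert.h>

    /* Floating-point operations per instruction (f in the formula):
     * one multiply plus one add for each FP32 lane. */
    enum { FLOPS_FMADD = 2, FLOPS_FMLA_2S = 4, FLOPS_FMLA_4S = 8 };

    /* Hypothetical helper: evaluates z = (i * f) / t and scales to GFLOPS.
     * instructions           = i, total instructions executed
     * flops_per_instruction  = f
     * seconds                = t, measured wall time in seconds */
    double gflops(double instructions, double flops_per_instruction,
                  double seconds)
    {
        assert(seconds > 0.0);
        return instructions * flops_per_instruction / seconds / 1e9;
    }

For example, ``gflops(1e9, FLOPS_FMLA_4S, 1.0)`` evaluates to 8 GFLOPS. The measured ``_v2`` results (9.59, 19.17 and 38.37 GFLOPS) stand in a ratio of roughly 2 : 4 : 8, which matches these values of ``f`` for an equal instruction rate.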