Function multi-versioning

CPU architectures often gain interesting new instructions as they evolve, but application developers find it difficult to take advantage of those instructions. Reluctance to lose backward compatibility is one of the main roadblocks that keeps developers from using the advancements in newer computing architectures. Function multi-versioning (FMV), which first appeared in GCC 4.8, is a way to have multiple implementations of a function, each using a different architecture's specialized instruction-set extensions. GCC 6 introduces changes to FMV that make it even easier to bring architecture-based optimizations to application code.

In this tutorial, we use FMV on general code and on FFT library code (FFTW). Upon completing the tutorial, you will be able to use this technology on your own code and on the libraries you depend on to deploy architecture-based optimizations to your application code.

Install and configure a Clear Linux OS host on bare metal

First, follow our guide to install Clear Linux* OS from the live desktop. Once the bare metal installation and initial configuration are complete, add the desktop-dev bundle to the system. The desktop-dev bundle contains the necessary development tools, such as GCC and Perl*.

To install the bundle, run the following command in the $HOME directory:

sudo swupd bundle-add desktop-dev

Detect loop vectorization candidates

Now, we need to detect the loop vectorization candidates to be cloned for multiple platforms with FMV. As an example, we will use the following simple C code:

 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/time.h>
 #define MAX 1000000

 int a[256], b[256], c[256];

 void foo(){
     int i,x;
     for (x=0; x<MAX; x++){
         for (i=0; i<256; i++){
             a[i] = b[i] + c[i];
         }
     }
 }


 int main(){
     foo();
     return 0;
 }

Save the example code as example.c in the current directory and build with the following flags:

gcc -O3  -fopt-info-vec  example.c -o example

The build generates the following output:

example.c:11:9: note: loop vectorized
example.c:11:9: note: loop vectorized

The output shows that line 11 is a good candidate for vectorization:

for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
}

Generate the FMV patch

To generate the FMV patch with the make-fmv-patch project, we must clone the project and generate a log file with the loop vectorized information:

git clone https://github.com/clearlinux/make-fmv-patch.git
gcc -O3  -fopt-info-vec  example.c -o example &> log

To generate the patch files, execute:

perl ./make-fmv-patch/make-fmv-patch.pl log .

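In general, the make-fmv-patch.pl script takes two arguments: <buildlog>, the log file with the vectorization notes, and <sourcecode>, the directory that contains the source code. For other projects, replace <buildlog> and <sourcecode> with the proper values and execute: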

perl make-fmv-patch.pl <buildlog> <sourcecode>

For example.c, the script generates the following example.c.patch file:

--- ./example.c 2017-09-27 16:05:42.279505430 +0000
+++ ./example.c~    2017-09-27 16:19:11.691544026 +0000
@@ -5,6 +5,7 @@

 int a[256], b[256], c[256];

+__attribute__((target_clones("avx2","arch=atom","default")))
 void foo(){
     int i,x;
     for (x=0; x<MAX; x++){

We recommend using the make-fmv-patch script to add the attribute that generates the target clones for the function foo. After applying the patch, the code looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define MAX 1000000

int a[256], b[256], c[256];

__attribute__((target_clones("avx2","arch=atom","default")))
void foo(){
    int i,x;
    for (x=0; x<MAX; x++){
        for (i=0; i<256; i++){
            a[i] = b[i] + c[i];
        }
    }
}


int main(){
    foo();
    return 0;
}

To change the target clones, either edit the attribute when applying the patches or change the value of the $avx2 variable in the make-fmv-patch.pl script:

my $avx2 = '__attribute__((target_clones("avx2","arch=atom","default")))'."\n";

Compile the patched code again, adding the -g option so that objdump can show the source alongside the disassembly:

gcc -O3 example.c -o example -g
objdump -S example | less

You can see the multiple clones of the foo function:

foo
foo.avx2.0
foo.arch_atom.1

The foo.avx2.0 clone uses AVX2 registers (ymm) and vectorized instructions. To verify, look for instructions like the following in its disassembly:

vpaddd (%r8,%rax,1),%ymm0,%ymm0
vmovdqu %ymm0,(%rcx,%rax,1)

FFT project example using FFTW

To follow the same approach with a package like FFTW, we must build it with the -fopt-info-vec flag to get a build log file, and then run the script on that log and the source tree. The script reports output similar to:

~/make-fmv-patch/make-fmv-patch.pl results/build.log fftw-3.3.6-pl2/

patching fftw-3.3.6-pl2/libbench2/verify-lib.c @ lines (36 114 151 162 173 195 215 284)
patching fftw-3.3.6-pl2/tools/fftw-wisdom.c @ lines (150)
patching fftw-3.3.6-pl2/libbench2/speed.c @ lines (26)
patching fftw-3.3.6-pl2/tests/bench.c @ lines (27)
patching fftw-3.3.6-pl2/libbench2/util.c @ lines (181)
patching fftw-3.3.6-pl2/libbench2/problem.c @ lines (229)
patching fftw-3.3.6-pl2/tests/fftw-bench.c @ lines (101 147 162 249)
patching fftw-3.3.6-pl2/libbench2/mp.c @ lines (79 190 215)
patching fftw-3.3.6-pl2/libbench2/caset.c @ lines (5)
patching fftw-3.3.6-pl2/libbench2/verify-r2r.c @ lines (44 187 197 207 316 333 723)

For example, the patch file generated for fftw-3.3.6-pl2/libbench2/verify-lib.c contains the following changes:

    --- fftw-3.3.6-pl2/libbench2/verify-lib.c   2017-01-27 21:08:13.000000000 +0000
    +++ fftw-3.3.6-pl2/libbench2/verify-lib.c~  2017-09-27 17:49:21.913802006 +0000
    @@ -33,6 +33,7 @@

     double dmax(double x, double y) { return (x > y) ? x : y; }

    +__attribute__((target_clones("avx2","arch=atom","default")))
     static double aerror(C *a, C *b, int n)
     {
         if (n > 0) {
    @@ -111,6 +112,7 @@
    }

    /* make array hermitian */
    +__attribute__((target_clones("avx2","arch=atom","default")))
    void mkhermitian(C *A, int rank, const bench_iodim *dim, int stride)
    {
         if (rank == 0)
    @@ -148,6 +150,7 @@
    }

    /* C = A + B */
    +__attribute__((target_clones("avx2","arch=atom","default")))
    void aadd(C *c, C *a, C *b, int n)
    {
         int i;
    @@ -159,6 +162,7 @@
    }

    /* C = A - B */
    +__attribute__((target_clones("avx2","arch=atom","default")))
    void asub(C *c, C *a, C *b, int n)
    {
         int i;
    @@ -170,6 +174,7 @@
    }

    /* B = rotate left A (complex) */
    +__attribute__((target_clones("avx2","arch=atom","default")))
    void arol(C *b, C *a, int n, int nb, int na)
    {
         int i, ib, ia;
    @@ -192,6 +197,7 @@
         }
    }

With these patches, we can select where to apply FMV, which makes it even easier to bring architecture-based optimizations to application code.

Congratulations!

You have successfully installed an FMV development environment on Clear Linux OS. Furthermore, you used cutting-edge compiler technology to improve the performance of your application on Intel architecture, guided by the compiler's vectorization report for your code.