2021年12月10日 星期五

RISC-V NaN Generation and Propagation

1. The RISC-V spec says NaN-producing operations should generate the canonical NaN

Except when otherwise stated, if the result of a floating-point operation is NaN, it is the canonical NaN. The canonical NaN has a positive sign and all significand bits clear except the MSB, a.k.a. the quiet bit. For single-precision floating-point, this corresponds to the pattern 0x7fc00000.

2. The IEEE 754 standard (2019 revision) recommends that operations on NaNs propagate an input NaN's payload, but this is not required.

3. LLVM's NaN propagation follows the IEEE recommendation, which does not match the RISC-V spec.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/InstructionSimplify.cpp#L4966-L4968


The simplest test is adding two floating-point constants where one is a (non-canonical) NaN: the -O0 and -O3 results differ in the NaN's fraction bits, because constant folding propagates the input NaN while the hardware produces the canonical NaN.

So software should use C++ isnan() to check for NaN before comparing. This is also why we cannot use memcmp to compare two floating-point values: beyond the usual accuracy issue, it is non-portable, since the same semantic value can have different bit patterns.

BTW, I don't like a spec that only gives a recommended rule...

2019年10月12日 星期六

LLVM Machine Instruction: Convergent attribute

ref: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089241.html



1. The convergent attribute is useful for the SIMT/SPMD programming model.

2. The intended interpretation is that a convergent operation cannot be moved either into or out of a conditionally executed region.

3. If you have a convergent instruction A, it is legal to duplicate it to an instruction B if (assuming B is after A in program flow) A dominates B and B post-dominates A.

case:

 r1 = texture2D(..., r0, ...)
 if (...) {
   // r0 used as temporary here
   r0 = ...
   r2 = r0 + ...
 } else {
   // only use of r1
   r2 = r1 + ...
}


In this example, various optimizations might try to sink the texture2D operation
into the else block, like so:

if (...) {
  r0 = ...
  r2 = r0 + ...
} else {
  r1 = texture2D(..., r0, ...)
  r2 = r1 + ...
}


In most SPMD/SIMT implementations, the fallout of this race is exposed via
the predicated execution of acyclic control flow:

pred0 <- cmp ...
if (pred0)  r0 = ...                            // thread 1
if (pred0)  r2 = r0 + ...                       // thread 1
if (!pred0) r1 = texture2D(..., r0, ...)        // thread 0 
if (!pred0) r2 = r1 + ...                       // thread 0


If thread 0 takes the else path and performs the texture2D operation, but
its neighbor thread 1 takes the then branch, then the texture2D will fail
because thread 1 has already overwritten its value of r0 before thread 0 has
a chance to read it.


2019年9月15日 星期日

Stage Mix




Stage mixes are almost always edited together from performances by those large Korean idol groups.
Pulling one off requires:
1. Industrialized, uniform camera work and equipment
2. Military-grade standardized choreography
3. Careful editing

Industrialized, uniform shot composition is a must, because in a big group every member needs properly allocated screen time.
Military-grade standardized choreography is also a must, because a dance mistake ruins the carefully designed on-camera framing.


What I'm thinking is...
maybe it has to be done this way
so that people still want to see the live shows,
because only at a live show
can you keep your eyes locked on your idol without looking away
and see the moments you normally never get to see...

The concert videos released this way are rather boring, though,
since they're really just costume and scenery changes~(?)


2019年5月22日 星期三

ARMv7 NEON VQRDMULH instruction implementation

VQRDMULH :

Vector Saturating Rounding Doubling Multiply Returning High Half. VQRDMULH multiplies corresponding elements in two vectors, doubles the results, and places the most significant half of the final results in the destination vector.

implement reference code
https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L329

<code>
// This function implements the same computation as the ARMv7 NEON VQRDMULH
// instruction.
template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}
</code>



1. Why is ab_x2_high32 computed by dividing by 1<<31 rather than 1<<32?

Ans: because the product of two fixed-point values has two sign bits, the most significant half starts at the second MSB.

2. If ab_64 >= 0, why is the rounding term 1<<30 rather than 1<<31?

Ans: same reason as above. The full product occupies bits [63:0], but its significant bits are only [62:0], so the high half is bits [62:31], and the rounding constant one bit below that half is 1<<30.


Note: VQRDMULH is similar to RISC-V RVV's VSMUL


When multiplying two N-bit signed numbers, the largest magnitude is obtained for -2^(N-1) * -2^(N-1), producing a result +2^(2N-2), which has a single (zero) sign bit when held in 2N bits. All other products have two sign bits in 2N bits. To retain greater precision in N result bits, the product is shifted right by one bit less than N, saturating the largest-magnitude result but increasing result precision by one bit for all other products.
This is why the final result needs to be shifted right by (SEW-1).

# Signed saturating and rounding fractional multiply
vsmul.vv vd, vs2, vs1, vm  # vd[i] = clip((vs2[i]*vs1[i]+round)>>(SEW-1))
vsmul.vx vd, vs2, rs1, vm  # vd[i] = clip((vs2[i]*x[rs1]+round)>>(SEW-1))

2017年1月21日 星期六

Can we use machine learning to make compiler optimization algorithms better?



ML is usually used to help make predictions and decisions:
given a large amount of training data (input: features, output: a predicted value or classification),
it makes predictions and decisions on features it has never seen before.

I only have about half a year of ML experience;
here are some of my thoughts.

1.
For a specific benchmark,
ML can help find good heuristic values (e.g. --param, --regalloc=[basic|fast|greedy|pbqp], etc.).
Googling turns up quite a lot, e.g.
Automatic Feature Generation for Machine Learning Based Optimizing Compilation.
But when I got to "MILEPOST GCC", the figures looked familiar...
it's the interactive compiler!


2.
If you want to build the ML training itself into the compiler,
it looks like that only works in a JIT.
A quick Google shows people have also done it in VMs:
Using machines to learn method-specific compilation strategies (IBM)
Method-Specific Dynamic Compilation using Logistic Regression
Some even use it for CPU/GPU decisions XD:
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection

In an AOT compiler it would feel just like PGO, which doesn't seem to need ML.

3.
If you want ML to learn the best instruction sequence for a given benchmark/function,
reinforcement learning seems like a good fit,
though an interactive compiler that just tunes heuristics might already do very well...
Google also turns up:
Basic-block Instruction Scheduling Using Reinforcement Learning and Rollouts


I think ML's strength is on noisy inputs or unseen patterns,
where its decisions beat hand-written algorithms.
But compiler optimizations are mostly just pattern matching,
although register allocation and instruction scheduling may be different:
those are more like using ML to help speed up NP problems (e.g. graph coloring for register allocation).

That is doable: the game of Go is an intractably hard search problem,
yet machine learning (deep learning) can already find approximate solutions.

The Speed Game: Automated Trading Systems in C++

https://meetingcpp.com/index.php/tv16/items/18.html



C++ low latency coding techniques:

● General considerations

C++11 Move semantics
Static assert
Data member layout, padding and alignment (prefer aligned accesses)
● False sharing
http://shuyufu.blogspot.tw/2013/01/false-sharing.html
● Cache locality

● Compile-time dispatch

std::sort(array, array + N, [](int a, int b) { return b < a; });
// 71us, std deviation 1.5us

int comparer(const void* a, const void* b) { return *(int*)a - *(int*)b; }
qsort(arr, N, sizeof(int), comparer);
// 223us, std deviation 7us

● Constexpr
C++14 feature
● Variadic templates
● Loop unrolling
Generally, don’t bother, the compiler will figure it out
● Expression short-circuiting

Rewrite:
if (expensiveCheck() && inexpensiveCheck()) {}
As:
if (inexpensiveCheck() && expensiveCheck()) {}

● Signed vs unsigned comparisons
Just use signed for loop iterators (see the earlier post)

● Mixing float and doubles

Default type of a floating point literal in C++ is double, not float
Side note: If you are brave, consider -ffast-math

● Branch prediction/reduction
Compile-time branch prediction hints are a topic of much discussion (particularly within SG14)
○ I.e. gcc’s __builtin_expect

Avoid this:
 if (checkForErrorA()) handleErrorA();
 else if (checkForErrorB()) handleErrorB();
 else if (checkForErrorC()) handleErrorC();
 else executeHotpath();

Aim for this:
 uint32_t errorFlags; ...
 if (errorFlags) HandleError(errorFlags)
 else { ... hotpath }

● Exceptions
Don't use them.

● Slowpath removal
Code is data - keep it minimal, and keep the slowpath code away from the hotpath

● Avoiding allocations
Use the infamous placement new
Tip: don’t use swap. Memory is cheap. Buy more!

● Fast containers

● Lambda functions

If you only know the target at runtime, you have no choice but to use std::function.

If you know at compile time which target is to be run, then prefer lambdas:

template <class T>
void SendMessage(T&& target) {
  // populate and send message
  target(mBuffer);
  send(mBuffer);
}

SendMessage([&](auto& message) {
  message.field1 = x;
  ...
  message.field3 = z;
});


Surprises and side notes:
Older versions of gcc’s implementation had copy-on-write semantics (newer versions are faster)

#include <string>
#include <iostream>

int main() {
  std::string a(50, 'c');
  std::string b = a;
  *const_cast<char*>(a.c_str()) = 'A';
  std::cout << "a: " << a << "\nb: " << b << std::endl;
}

GCC < 5:
$ g++ -std=c++11 main.cpp && ./a.out
a: Accccccccccccccccccccccccccccccccccccccccccccccccc
b: Accccccccccccccccccccccccccccccccccccccccccccccccc
Aha! Copy-on-write in action.
GCC >= 5:
$ g++-5 -std=c++11 main.cpp && ./a.out
a: Accccccccccccccccccccccccccccccccccccccccccccccccc
b: cccccccccccccccccccccccccccccccccccccccccccccccccc
clang:

$ clang++ -std=c++11 -stdlib=libc++ main.cpp && ./a.out
a: Accccccccccccccccccccccccccccccccccccccccccccccccc
b: cccccccccccccccccccccccccccccccccccccccccccccccccc

reference: http://shaharmike.com/cpp/std-string/

2016年10月20日 星期四

CppCon 2016: Michael Spencer “My Little Optimizer: Undefined Behavior is Magic” Note



Compilers assume UB can't happen, and optimize accordingly.

// kernel bug
void foo(bool *ok) {
  bool k = *ok;   // <== dereferencing a NULL pointer is UB, so ok is assumed non-NULL
  if (!ok)
    return;
  blah(k);
}

into

void foo(bool *ok) {
  bool k = *ok;
  blah(k);
}




how is UB represented?

Explicitly:
 unreachable
 undef

implicitly:
 the optimizer knows some things just can't happen



int unreachable(int *out) {
  *out = 42;
  return *((int*)0);
}

==> load from null is UB, so replace it with unreachable


undef: a value which can have any bit pattern at any point in the program

int undef(int *p){
  int a;
  return *p + a;
}



signed math can't overflow or underflow

bool signed_underflow(int a, int b){
  return a - b > -1;
}

since a - b can't underflow, a - b > -1 holds exactly when a >= b:
bool signed_underflow(int a, int b){
  return a>= b;
}



Type aliasing: pointers to objects of different types can't alias (most of the time).
void check(int *h, long *k) {
  *h = 5;
  *k = 6;
  printf("%d\n", *h);
}
int main(void) {
  long k;
  check((int*)&k, &k);
}

so optimizer will generate

void @check(i32* h, i32* k) {
  store i32 5, i32* %h, !tbaa !{!"int"}
  store i32 6, i32* %k, !tbaa !{!"long"}
  call @printf("%d\n", i32 5)
  ret void
}

C++ defines side-effect-free infinite loops as UB, so the compiler may assume every loop terminates.
N1528: Why undefined behavior for infinite loops?
The reason is optimization, e.g. it allows fusing these two loops:

for (p = q; p != 0; p = p -> next) {
    ++count;
}
for (p = q; p != 0; p = p -> next) {
    ++count2;
}
to

for (p = q; p != 0; p = p -> next) {
    ++count;
    ++count2;
}

Infinite loop without side-effects
#include <iostream>

int fermat() {
  const int MAX = 1000;
  int a = 1, b = 1, c = 1;
  // Endless loop with no side effects is UB
  while (1) {
    if ((a*a*a) == (b*b*b) + (c*c*c)) return 1;
    a++;
    if (a > MAX) { a = 1; b++; }
    if (b > MAX) { b = 1; c++; }
    if (c > MAX) { c = 1; }
  }
  return 0;
}

int main() {
  if (fermat())
    std::cout << "Fermat's Last Theorem has been disproved.\n";
  else
    std::cout << "Fermat's Last Theorem has not been disproved.\n";
}
Possible output:
Fermat's Last Theorem has been disproved.



 Use of an object, or a pointer to an object, outside of its lifetime frequently results in undefined behavior.


the optimizer will just replace the original value with undef



references to null value
ex. 

bool foo(int &a) {
  return &a == nullptr;
}
int main() {
  int *blah = nullptr;
  return foo(*blah);
}

optimizer generate

bool foo(int &a) {
  return false;
}
int main() {
  int *blah = nullptr;
  return foo(*blah);
}

then generate

bool foo(int &a) {
  return false;
}
int main() {
  return 0;
}



"this" will not be null, because "calling any nonstatic method of any class through a null pointer leads to undefined behavior"

Still Comparing "this" Pointer to Null?

ex. v8 bug
struct FieldType{
  FieldType *none(){
     return (FieldType*) (0);
  }
  bool IsNone() {
     return this == none();
  }
};

bool foo(FieldType *F) {
  return F->IsNone();
}

finally, the optimizer inlines twice and generates

bool foo (FieldType *F){
   return false;
}



signed {under,over}flow

Because the compiler assumes overflow cannot happen, it can perform these optimizations:

1.
%1 = mul nsw i32 %x, 20
%A = sdiv i32 %1, 2  
==>
%A = mul nsw i32 %x, 10

2.
%2 = mul nsw i32 %x, 21
%B = icmp sgt i32 %2, 0
==>
%B = icmp sgt i32 %x, 0

3.
%4 = sub nsw i32 0, %x
%5 = sub nsw i32 0, %y
%C = icmp sgt i32 %4, %5
==>
%C = icmp sgt i32 %y, %x


signed {under,over}flow for promotion

int f (int *a, int x, int y){
 return a[x+y];
}

The signed version needs one extra instruction for promotion. The video claims it is the unsigned version that needs an extra instruction to take the lower part because it can overflow,
but with gcc 6.0/clang 3.7 my own results are the opposite of what he said:

signed version
addl  %edx, %esi
movslq  %esi, %rax
movl  (%rdi,%rax,4), %eax

unsigned version
addl  %edx, %esi
movl  (%rdi,%rsi,4), %eax



If the loop induction variable is signed, the optimizer can optimize,
because n + 1 > n is guaranteed.

template <class T>
T f(T N) {
  T i = 0;
  for (; i < N; i += 2);
  return i;
}


Signed lets the optimizer reason about the induction variable (and remove the loop entirely);
unsigned does not.


int _signed(int N){
  int i =0;
  for (;i<N;i+=2);
  return i;
}

==> the for loop can be removed

unsigned int _unsigned(unsigned int N){
  unsigned int i =0;
  for (;i<N;i+=2);
  return i;
}

==> the for loop cannot be removed



realloc sample

To see the undefined behavior, you need to read the fine print in the C99 standard, which says:
The realloc function deallocates the old object pointed to by ptr and returns a pointer to a new object that has the size specified by size.
#include <stdio.h>
#include <stdlib.h>
 
int main() {
  int *p = (int*)malloc(sizeof(int));
  int *q = (int*)realloc(p, sizeof(int));
  *p = 1;
  *q = 2;
  if (p == q)
    printf("%d %d\n", *p, *q);
}

At -O0 it prints 2 2.
At -O2 it prints 1 2.



Atomics optimization

int x;
atomic<int> y;
void foo() {
  x = 0;
  y.store(0, seq_cst);
  if (y.load(seq_cst) == 0)
    x = 1;
}

no thread can observe the difference ==>
int x;
atomic<int> y;
void foo(){
  x= 0;
  y.store(0, seq_cst);
  if (0 == 0)
  x= 1;
}
==> simplify CFG
int x;
atomic<int> y;
void foo(){
  x= 0;
  y.store(0, seq_cst);
  x= 1;
}

==>

int x;
atomic<int> y;
void foo(){
  y.store(0, seq_cst);
  x= 1;
}

DATA RACE detail: compilers can optimize non-atomic memory accesses before and after atomic accesses.





What can you do about it?
  1. -Wall -Wextra -Werror
  2. sanitizers (undefined, address, thread, memory)
  3. static analysis (clang, Coverity, PVS-Studio)