Saturday, October 12, 2019

LLVM Machine Instruction: Convergent attribute

ref: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089241.html



1. The convergent attribute is useful for the SIMT/SPMD programming model.

2. The intended interpretation is that a convergent operation cannot be moved either into or out of a conditionally executed region.

3. If you have a convergent instruction A, it is legal to duplicate it to instruction B if (assuming B is after A in program flow) A dominates B and B post-dominates A.

Example:

r1 = texture2D(..., r0, ...)
if (...) {
  // r0 used as temporary here
  r0 = ...
  r2 = r0 + ...
} else {
  // only use of r1
  r2 = r1 + ...
}


In this example, various optimizations might try to sink the texture2D operation
into the else block, like so:

if (...) {
  r0 = ...
  r2 = r0 + ...
} else {
  r1 = texture2D(..., r0, ...)
  r2 = r1 + ...
}


In most SPMD/SIMT implementations, the fallout of this race is exposed via
the predicated expression of acyclic control flow:

pred0 <- cmp ...
if (pred0)  r0 = ...                            // thread 1
if (pred0)  r2 = r0 + ...                       // thread 1
if (!pred0) r1 = texture2D(..., r0, ...)        // thread 0 
if (!pred0) r2 = r1 + ...                       // thread 0


If thread 0 takes the else path and performs the texture2D operation, but
its neighbor thread 1 takes the then branch, then the texture2D will fail
because thread 1 has already overwritten its value of r0 before thread 0 has
a chance to read it.
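
The race described above can be reproduced with a tiny two-lane model in plain C++. This is only a sketch under assumptions: `fake_texture2d`, `original_order`, and `sunk_order` are hypothetical names, the "texture" result is simply the difference between the neighbor's r0 and the lane's own r0 (standing in for an operation that reads a neighboring thread's register), and `100.0f` stands in for the elided `r0 = ...` assignment.

```cpp
#include <array>
#include <cassert>

// Hypothetical model: one r0 register per lane, two lanes in lockstep.
using Regs = std::array<float, 2>;
using Preds = std::array<bool, 2>;

// Hypothetical stand-in for texture2D: the result for `lane` also depends
// on the neighbor lane's r0, which is what makes the operation convergent.
inline float fake_texture2d(const Regs& r0, int lane) {
    const int neighbor = lane ^ 1;
    return r0[neighbor] - r0[lane];
}

// Original placement: texture2D executes before the branch, while every
// lane still holds its original r0. Returns lane 0's r1.
inline float original_order(Regs r0, const Preds& pred0) {
    const float r1_lane0 = fake_texture2d(r0, 0);
    for (int lane = 0; lane < 2; ++lane) {
        if (pred0[lane]) r0[lane] = 100.0f;  // "r0 = ..." in the then-branch
    }
    return r1_lane0;
}

// Sunk placement: the predicated then-branch runs first, so a neighbor that
// took it has already clobbered its r0 when lane 0 samples the texture.
inline float sunk_order(Regs r0, const Preds& pred0) {
    for (int lane = 0; lane < 2; ++lane) {
        if (pred0[lane]) r0[lane] = 100.0f;  // "r0 = ..." in the then-branch
    }
    float r1_lane0 = 0.0f;
    if (!pred0[0]) r1_lane0 = fake_texture2d(r0, 0);
    return r1_lane0;
}
```

With r0 = {1, 2} and only lane 1 taking the then-branch, `original_order` sees the neighbor's original value while `sunk_order` sees the overwritten one, so the two placements give different results for lane 0.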


Sunday, September 15, 2019

Stage Mix




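Stage mixes are almost always edited from the performances of those large Korean idol groups.
They only work when three conditions are met:
1. Industrially consistent camera work and camera equipment
2. Military-precise choreography
3. Careful editing

Industrially consistent shot composition is a must, because in a large group every member has to be given a fair share of screen time.
Military-precise choreography is also a must, because a wrong step would spoil the carefully designed shots.


I was wondering...
maybe it has to be done this way
so that people want to go see the live show,
because only at the venue
can you keep your eyes glued to your idol the whole time
and see what you normally never get to see...

In that case, the released video of any single concert is actually pretty boring,
since it is just a change of outfits and sets, right~ (?)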


Wednesday, May 22, 2019

ARMv7 NEON VQRDMULH instruction implementation

VQRDMULH :

Vector Saturating Rounding Doubling Multiply Returning High Half. VQRDMULH multiplies corresponding elements in two vectors, doubles the results, and places the most significant half of the final results in the destination vector.

Reference implementation:
https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L329

<code>
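// This function implements the same computation as the ARMv7 NEON VQRDMULH
// instruction.
// Note: this is an explicit specialization; the primary template is declared
// in gemmlowp's fixedpoint.h, which also pulls in <cstdint> and <limits>.

template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  // The only overflowing case is INT32_MIN * INT32_MIN, which saturates.
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  // Rounding constant: half of the divisor (1 << 31), signed like the product.
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}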
</code>



1. Why is ab_x2_high32 computed by dividing by "1<<31" rather than "1<<32"?

Ans: The product of two signed fixed-point values carries two sign bits, so the most significant half of the doubled result starts at the second MSB. Doubling and then taking the high 32 bits is the same as taking bits [62:31] of ab_64, which is what dividing by 1<<31 does.

2. If ab_64 >= 0, why is the rounding constant 1<<30 rather than 1<<31?

Ans: For the same reason. Although the product occupies bits [63:0], the significant bits are [62:0] and the result keeps bits [62:31]. Rounding to nearest adds half of the lowest kept bit, which is bit 30, so the constant is 1<<30.
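
The two answers above can be checked with concrete Q31 values. The following is a minimal sketch: `sqrdmulh_s32` is a hypothetical standalone (non-template) copy of the routine quoted above, and `naive_high32` is a hypothetical comparison function that divides by 1<<32 instead, with no rounding.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical standalone copy of the quoted gemmlowp routine.
inline std::int32_t sqrdmulh_s32(std::int32_t a, std::int32_t b) {
    const bool overflow =
        a == b && a == std::numeric_limits<std::int32_t>::min();
    const std::int64_t ab_64 = static_cast<std::int64_t>(a) * b;
    const std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
    const std::int32_t high =
        static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
    return overflow ? std::numeric_limits<std::int32_t>::max() : high;
}

// For comparison only: takes bits [63:32] of the product, i.e. it keeps the
// redundant sign bit and loses one bit of precision.
inline std::int32_t naive_high32(std::int32_t a, std::int32_t b) {
    const std::int64_t ab_64 = static_cast<std::int64_t>(a) * b;
    return static_cast<std::int32_t>(ab_64 / (1ll << 32));
}
```

In Q31, `1 << 30` represents 0.5. `sqrdmulh_s32(1 << 30, 1 << 30)` returns `1 << 29`, which is 0.25 as expected, while `naive_high32` returns `1 << 28` (0.125), half the correct value. For rounding, `sqrdmulh_s32(1, 1 << 30)` has an exact quotient of 0.5 and returns 1, whereas `sqrdmulh_s32(1, (1 << 30) - 1)` falls just below 0.5 and returns 0.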


Note: VQRDMULH is similar to VSMUL in the RISC-V vector extension (RVV).


When multiplying two N-bit signed numbers, the largest magnitude is obtained for -2^(N-1) * -2^(N-1), producing a result +2^(2N-2), which has a single (zero) sign bit when held in 2N bits. All other products have two sign bits in 2N bits. To retain greater precision in N result bits, the product is shifted right by one bit less than N, saturating the largest magnitude result but increasing result precision by one bit for all other products.
This is why the final result needs to be shifted right by (SEW-1).

# Signed saturating and rounding fractional multiply
vsmul.vv vd, vs2, vs1, vm  # vd[i] = clip((vs2[i]*vs1[i]+round)>>(SEW-1))
vsmul.vx vd, vs2, rs1, vm  # vd[i] = clip((vs2[i]*x[rs1]+round)>>(SEW-1))
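
The per-element formula in the comments above can be modeled in scalar C++. This is a rough sketch rather than a definition of the instruction: `vsmul_model_s16` is a hypothetical name, SEW is fixed at 16, and `round` is assumed to be `1 << (SEW-2)` (round to nearest, ties up); the actual rounding depends on the configured fixed-point rounding mode. It also assumes `>>` on a negative value is an arithmetic shift, which is guaranteed only from C++20 but holds on common compilers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of one vsmul element for SEW = 16.
inline std::int16_t vsmul_model_s16(std::int16_t a, std::int16_t b) {
    constexpr int sew = 16;
    const std::int32_t product = static_cast<std::int32_t>(a) * b;
    // Assumption: round-to-nearest-up, i.e. add half of the lowest kept bit.
    const std::int32_t round = 1 << (sew - 2);
    // Assumption: arithmetic right shift for negative values.
    std::int32_t shifted = (product + round) >> (sew - 1);
    // clip(): saturate to the signed 16-bit range.
    if (shifted > INT16_MAX) shifted = INT16_MAX;
    if (shifted < INT16_MIN) shifted = INT16_MIN;
    return static_cast<std::int16_t>(shifted);
}
```

In Q15, 16384 represents 0.5, so `vsmul_model_s16(16384, 16384)` gives 8192 (0.25). The only saturating case is `vsmul_model_s16(-32768, -32768)`, whose unclipped value 32768 is clipped to 32767.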