ARMv7 NEON VQRDMULH instruction implementation

May 22, 2019

VQRDMULH:

Vector Saturating Rounding Doubling Multiply Returning High Half. VQRDMULH multiplies corresponding elements in two vectors, doubles the results, and places the most significant half of the final results in the destination vector.

Reference implementation (from gemmlowp):
https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L329

<code>

// This function implements the same computation as the ARMv7 NEON VQRDMULH
// instruction.

template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

</code>

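1. Why is ab_x2_high32 computed by dividing by 1 << 31 rather than 1 << 32?

Ans: Multiplying two signed fixed-point values gives a product with two sign bits (the one exception is the overflow case handled separately). The most significant half of the doubled product therefore starts at the second MSB: it is bits [62:31] of ab_64, not bits [63:32]. Dividing by 1 << 31 performs the doubling and the high-half extraction in one step.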
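
To make the two-sign-bit argument concrete, here is a minimal sketch in Q0.31 format, where an int32 value x stands for x / 2^31. The helper names MulHighDiv31 and MulHighDiv32 are made up for this illustration and are not part of gemmlowp; rounding and saturation are left out on purpose so that only the divisor differs.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; they are not part of gemmlowp.
// A Q0.31 value x represents the real number x / 2^31.

// Divide the 64-bit product by 2^31. For non-negative products this keeps
// bits [62:31]. No rounding or saturation, to isolate the divisor's effect.
constexpr std::int32_t MulHighDiv31(std::int32_t a, std::int32_t b) {
  return static_cast<std::int32_t>(
      (static_cast<std::int64_t>(a) * b) / (std::int64_t{1} << 31));
}

// Divide the 64-bit product by 2^32. For non-negative products this keeps
// bits [63:32], i.e. the plain "high half" without doubling.
constexpr std::int32_t MulHighDiv32(std::int32_t a, std::int32_t b) {
  return static_cast<std::int32_t>(
      (static_cast<std::int64_t>(a) * b) / (std::int64_t{1} << 32));
}

constexpr std::int32_t kHalf = 1 << 30;     // 0.5 in Q0.31
constexpr std::int32_t kQuarter = 1 << 29;  // 0.25 in Q0.31

// 0.5 * 0.5 = 0.25 only when the product is divided by 2^31.
static_assert(MulHighDiv31(kHalf, kHalf) == kQuarter, "expected 0.25");
// Dividing by 2^32 gives 0.125: the duplicated sign bit costs one bit.
static_assert(MulHighDiv32(kHalf, kHalf) == (1 << 28), "expected 0.125");
```

The product of 1 << 30 with itself is 1 << 60, so only the division by 1 << 31 lands on 1 << 29, the Q0.31 encoding of 0.25.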

2. When ab_64 >= 0, why is the rounding nudge 1 << 30 rather than 1 << 31?

Ans: For the same reason as above. The product is held in bits [63:0], but its significant bits are [62:0]. The kept half is bits [62:31] and the discarded bits are [30:0], so half of the last kept bit is 1 << 30.
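
For negative products the nudge is 1 - (1 << 30) rather than 1 << 30 because C++ integer division truncates toward zero instead of rounding down. As a rough sketch, the two functions below compute the same result in two ways: the division form follows the gemmlowp code quoted above, and the shift form always adds 1 << 30 and then shifts right by 31. Both function names are hypothetical, and the shift form assumes that >> on a negative signed value is an arithmetic shift (guaranteed since C++20, and the usual behaviour of mainstream compilers before that).

```cpp
#include <cstdint>
#include <limits>

// Division-based form, following the gemmlowp code quoted above.
inline std::int32_t HighMulByDivision(std::int32_t a, std::int32_t b) {
  if (a == b && a == std::numeric_limits<std::int32_t>::min()) {
    return std::numeric_limits<std::int32_t>::max();  // saturate -1.0 * -1.0
  }
  const std::int64_t ab = static_cast<std::int64_t>(a) * b;
  // Division truncates toward zero, so negative products need a nudge that
  // pushes them down by just under one half before truncation.
  const std::int64_t nudge = ab >= 0 ? (1 << 30) : (1 - (1 << 30));
  return static_cast<std::int32_t>((ab + nudge) / (std::int64_t{1} << 31));
}

// Shift-based form (hypothetical helper): add half of 2^31, then floor.
// Assumes >> on a negative value is an arithmetic shift.
inline std::int32_t HighMulByShift(std::int32_t a, std::int32_t b) {
  if (a == b && a == std::numeric_limits<std::int32_t>::min()) {
    return std::numeric_limits<std::int32_t>::max();  // saturate -1.0 * -1.0
  }
  const std::int64_t ab = static_cast<std::int64_t>(a) * b;
  return static_cast<std::int32_t>((ab + (std::int64_t{1} << 30)) >> 31);
}
```

With a = -(1 << 15) and b = 1 << 15 the product is exactly -(1 << 30), a tie at -0.5 of the last kept bit. Both forms return 0 for it, so ties round toward positive infinity in either version.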

Note: VQRDMULH is similar to vsmul in the RISC-V Vector extension (RVV). The spec describes it as follows:

https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#133-vector-single-width-fractional-multiply-with-rounding-and-saturation

When multiplying two N-bit signed numbers, the largest magnitude is obtained for -2^(N-1) * -2^(N-1), producing a result +2^(2N-2), which has a single (zero) sign bit when held in 2N bits. All other products have two sign bits in 2N bits. To retain greater precision in N result bits, the product is shifted right by one bit less than N, saturating the largest magnitude result but increasing result precision by one bit for all other products.

This is why the final result is shifted right by SEW-1 rather than SEW:

# Signed saturating and rounding fractional multiply
vsmul.vv vd, vs2, vs1, vm  # vd[i] = clip((vs2[i]*vs1[i]+round)>>(SEW-1))
vsmul.vx vd, vs2, rs1, vm  # vd[i] = clip((vs2[i]*x[rs1]+round)>>(SEW-1))
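
As a rough scalar model of one vsmul element, the sketch below fixes SEW = 8 and assumes the rnu (round-to-nearest-up) rounding mode, so round is 1 << (SEW-2). The name VsmulE8 is made up for this example, other vxrm rounding modes would add a different value, and the vxsat flag is not modelled. It also assumes that >> on a negative value is an arithmetic shift.

```cpp
#include <cstdint>

// Hypothetical scalar model of one vsmul element with SEW = 8.
// Assumption: rnu rounding mode, i.e. round = 1 << (SEW - 2).
// Assumption: >> on a negative value is an arithmetic shift.
constexpr int kSew = 8;

inline std::int8_t VsmulE8(std::int8_t a, std::int8_t b) {
  // Widen before multiplying so the 2*SEW-bit product cannot overflow.
  const std::int32_t product = std::int32_t{a} * std::int32_t{b};
  const std::int32_t round = 1 << (kSew - 2);
  const std::int32_t shifted = (product + round) >> (kSew - 1);
  // clip(): saturate to the signed 8-bit range.
  if (shifted > INT8_MAX) return INT8_MAX;
  if (shifted < INT8_MIN) return INT8_MIN;
  return static_cast<std::int8_t>(shifted);
}
```

Under these assumptions VsmulE8(64, 64) gives 32 (0.5 * 0.5 = 0.25 with 7 fraction bits), and VsmulE8(-128, -128) saturates to 127, the one product that has a single sign bit.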
