static inline float sseHorizontalMin(const __m128 &p)
{
__m128 data = p; /* [0, 1, 2, 3] */
__m128 low = _mm_movehl_ps(data, data); /* [2, 3, 2, 3] */
__m128 low_accum = _mm_min_ps(low, data); /* [0|2, 1|3, 2|2, 3|3] */
__m128 elem1 = _mm_shuffle_ps(low_accum,
low_accum,
_MM_SHUFFLE(1,1,1,1)); /* [1|3, 1|3, 1|3, 1|3] */
__m128 accum = _mm_min_ss(low_accum, elem1);
return _mm_cvtss_f32(accum);
}
static inline float sseHorizontalMax(const __m128 &p)
{
__m128 data = p; /* [0, 1, 2, 3] */
__m128 high = _mm_movehl_ps(data, data); /* [2, 3, 2, 3] */
__m128 high_accum = _mm_max_ps(high, data); /* [0|2, 1|3, 2|2, 3|3] */
__m128 elem1 = _mm_shuffle_ps(high_accum,
high_accum,
_MM_SHUFFLE(1,1,1,1)); /* [1|3, 1|3, 1|3, 1|3] */
__m128 accum = _mm_max_ss(high_accum, elem1);
return _mm_cvtss_f32(accum);
}
Follow the project on Facebook : https://www.facebook.com/immersionengine
Follow me on twitter : twitter.com/lefebv_l
FYI, I profiled this version, and found it to be slower than a naïve implementation using three calls to std::min/std::max. The reason appears to be pipelining. The calls compile to two moves and then three mins or maxes. The moves and two of the mins/maxes can be pipelined together, so it actually ends up being faster overall.
ReplyDeleteThanks Ian, I'll take a look !
ReplyDeleteWhat processor do you have? Instruction latency depends on your architecture.
It's an Intel processor (990X), so it should follow the latencies given in the intrinsics guide--3 cycles for minss/maxss and 1 for all the moves/shuffles. The problem is data dependency--the naïve version only has a dependency on the last minss/maxss. This version has a dependency for every instruction.
ReplyDeleteHi there, I found your blog via Google while searching for such kinda informative post and your post looks very interesting for me خصم اي هيرب
ReplyDelete