Awsom3D Devlog : Computer Graphics, Virtual Reality & Marketing: SSE horizontal minimum and maximum

Thursday, June 19, 2014

SSE horizontal minimum and maximum

 #include <xmmintrin.h> /* SSE intrinsics */

 /* Returns the smallest of the four floats in p. */
 static inline float sseHorizontalMin(const __m128 &p)
 {
     __m128 data = p;                              /* [0, 1, 2, 3] */
     __m128 high = _mm_movehl_ps(data, data);      /* upper half copied down: [2, 3, 2, 3] */
     __m128 min_accum = _mm_min_ps(high, data);    /* [min(0,2), min(1,3), 2, 3] */
     __m128 elem1 = _mm_shuffle_ps(min_accum,
                                   min_accum,
                                   _MM_SHUFFLE(1,1,1,1)); /* min(1,3) broadcast to all lanes */
     __m128 accum = _mm_min_ss(min_accum, elem1);  /* lane 0 = min(0, 1, 2, 3) */
     return _mm_cvtss_f32(accum);
 }

 /* Returns the largest of the four floats in p. */
 static inline float sseHorizontalMax(const __m128 &p)
 {
     __m128 data = p;                              /* [0, 1, 2, 3] */
     __m128 high = _mm_movehl_ps(data, data);      /* upper half copied down: [2, 3, 2, 3] */
     __m128 max_accum = _mm_max_ps(high, data);    /* [max(0,2), max(1,3), 2, 3] */
     __m128 elem1 = _mm_shuffle_ps(max_accum,
                                   max_accum,
                                   _MM_SHUFFLE(1,1,1,1)); /* max(1,3) broadcast to all lanes */
     __m128 accum = _mm_max_ss(max_accum, elem1);  /* lane 0 = max(0, 1, 2, 3) */
     return _mm_cvtss_f32(accum);
 }
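
A minimal usage sketch, assuming the input is four contiguous floats in memory. The helper name minMaxOf4 is hypothetical and not part of the original code; the two reductions are restated in compact form only so that the snippet compiles on its own.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Compact restatement of the two reductions from the post.
static inline float sseHorizontalMin(__m128 v)
{
    __m128 high = _mm_movehl_ps(v, v);                           // [2, 3, 2, 3]
    __m128 m = _mm_min_ps(high, v);                              // pairwise minima in lanes 0 and 1
    __m128 e1 = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));   // lane 1 broadcast
    return _mm_cvtss_f32(_mm_min_ss(m, e1));
}

static inline float sseHorizontalMax(__m128 v)
{
    __m128 high = _mm_movehl_ps(v, v);
    __m128 m = _mm_max_ps(high, v);
    __m128 e1 = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_max_ss(m, e1));
}

// Hypothetical helper: min and max of four floats stored at `values`.
// Uses an unaligned load, so `values` needs no particular alignment.
static inline void minMaxOf4(const float *values, float &lo, float &hi)
{
    assert(values != nullptr);
    __m128 v = _mm_loadu_ps(values);
    lo = sseHorizontalMin(v);
    hi = sseHorizontalMax(v);
}
```

Note that _mm_set_ps takes its arguments from the highest lane to the lowest, so _mm_set_ps(3, 2, 1, 0) produces the [0, 1, 2, 3] layout used in the comments.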

Follow the project on Facebook: https://www.facebook.com/immersionengine
Follow me on Twitter: twitter.com/lefebv_l

Comments:

Agatha Mallett, February 7, 2016 at 12:45 AM
FYI, I profiled this version and found it to be slower than a naïve implementation using three calls to std::min/std::max. The reason appears to be pipelining. The naïve calls compile to two moves followed by three mins or maxes. The moves and two of the mins/maxes can be pipelined together, so the naïve version ends up being faster overall.
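
The comment does not include the naïve code. One plausible reading, offered here only as a sketch, is to spill the four lanes to memory and reduce them with scalar std::min/std::max. The function names and the grouping of the three calls are assumptions, not the commenter's actual code.

```cpp
#include <algorithm>    // std::min, std::max
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Assumption: "naive" means storing the lanes and reducing them as scalars.
static inline float naiveHorizontalMin(__m128 p)
{
    alignas(16) float f[4];
    _mm_store_ps(f, p);
    // Three std::min calls; the two inner ones do not depend on each other.
    return std::min(std::min(f[0], f[1]), std::min(f[2], f[3]));
}

static inline float naiveHorizontalMax(__m128 p)
{
    alignas(16) float f[4];
    _mm_store_ps(f, p);
    // Same shape as above, with std::max.
    return std::max(std::max(f[0], f[1]), std::max(f[2], f[3]));
}

// Small self-check on one hand-picked vector.
static inline void checkNaiveReductions()
{
    const __m128 v = _mm_set_ps(2.0f, 7.0f, -1.0f, 3.0f);  // lanes [3, -1, 7, 2]
    assert(naiveHorizontalMin(v) == -1.0f);
    assert(naiveHorizontalMax(v) == 7.0f);
}
```

Whether the compiler turns this into the "two moves and three mins" described above depends on the compiler and its optimization flags.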
Laurent Lefebvre, February 8, 2016 at 7:04 AM
Thanks Ian, I'll take a look!
What processor do you have? Instruction latency depends on your architecture.
Agatha Mallett, February 8, 2016 at 7:13 AM
It's an Intel processor (990X), so it should follow the latencies given in the intrinsics guide: 3 cycles for minss/maxss and 1 cycle for each of the moves/shuffles. The problem is data dependency. In the naïve version, only the last minss/maxss depends on an earlier result. In this version, every instruction depends on the one before it.
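
With the figures quoted in this comment, the four dependent instructions of the post's version add up to 1 + 3 + 1 + 3 = 8 cycles of latency on the critical path. Since the result depends on the architecture and the compiler, here is a rough timing sketch for trying both variants on your own machine. All names (sseMin, naiveMin, sumOfMinima) are hypothetical, and naiveMin is only one possible reading of the naïve version.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics

// Shuffle-based reduction, as in the post: each step needs the previous result.
static inline float sseMin(__m128 v)
{
    __m128 high = _mm_movehl_ps(v, v);
    __m128 m = _mm_min_ps(high, v);
    __m128 e1 = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_min_ss(m, e1));
}

// Scalar reduction: the two inner std::min calls are independent of each other.
static inline float naiveMin(__m128 v)
{
    alignas(16) float f[4];
    _mm_store_ps(f, v);
    return std::min(std::min(f[0], f[1]), std::min(f[2], f[3]));
}

// Sums the minimum of every group of four floats so the work cannot be
// optimized away, and writes the elapsed time in nanoseconds to outNs.
static float sumOfMinima(const std::vector<float> &data,
                         float (*reduce)(__m128),
                         double &outNs)
{
    assert(data.size() % 4 == 0);
    const auto start = std::chrono::steady_clock::now();
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); i += 4)
        sum += reduce(_mm_loadu_ps(&data[i]));
    const auto stop = std::chrono::steady_clock::now();
    outNs = std::chrono::duration<double, std::nano>(stop - start).count();
    return sum;
}
```

Call sumOfMinima once with sseMin and once with naiveMin on the same large buffer, with optimizations enabled, and compare the two timings over several runs. The indirect call and the running sum both influence the measurement, so treat the numbers as rough indications only.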


ABOUT ME

I am passionate about storytelling, dreams & adventures.

As a developer, I worked in R&D in high-end fields: the video game industry, artificial intelligence, and CGI.

As an entrepreneur, I created two companies. The first, Cre@activity, helps businesses get more clients on the internet. The second, Rendr Softworks, helps architects present their projects in an innovative way using virtual reality equipment.

Website (French) : http://www.rendr.fr

Website (Worldwide) : http://www.rendrsoftworks.com

Technology : http://www.immersion-engine.com

Web Agency : http://www.creactivity.fr