Just out of curiosity, did you profile the speed improvement compared to the current implementations?
I'm not an expert, but AFAIK most CPUs today support SIMD instructions. Does that mean this implementation can work on multiple CPUs, or is there some constraint?