Using C++ Instrinsic Functions For Pipelined Text Processing

Oiriginally Posted on his O'Reillynet blog on November 7, 2005 by Rick Jelliffe

(Retrieved and reformatted from : 

A good C++ programming technique that has almost no published material available on the WWW relates to using the special pipeline instructions in modern CPUs for faster text processing. 

Typically compilers (such as for C++ and Java) find it hard to detect possible parallelism in programs: indeed, this is because typically you have to structure the algorithm to suit the CPU’s capabilities, which goes against the desired requirements for cross-platform, Write Once Read Anywhere code. It used to require assembler programming, but modern compilers support a special class of functions called
intrinsics which are more-or-less macros that expand to assembler code sections.

Lets take the UTF-8 to UTF-16 conversion function used in  libxml  as our example.

Its pretty typical code; it doesn’t look too bad to me. UTF-8 is a variable-width encoding of Unicode: it uses one byte for ASCII characters (and with exactly the same codes), two bytes for most Western characters (both bytes have their top bit set), three bytes for most Chinese characters, and four bytes for new rarer characters. UTF-16 is a fixed width encoding of two bytes. (We ignore the issue of surrogates here.)

Here’s the original code (with some dead code removed):

* @outb:  a pointer to an array of bytes to store the result
* @outlen:  the length of @outb
* @in:  a pointer to an array of UTF-8 chars
* @inlen:  the length of @in
* Take a block of UTF-8 chars in and try to convert it to an UTF-16LE
* block of chars out.
* Returns the number of bytes written, or -1 if lack of space, or -2
*     if the transcoding failed.

static int
UTF8ToUTF16LE(unsigned char* outb, int *outlen,
			  const unsigned char* in, int *inlen)
	unsigned short* out = (unsigned short*) outb;
	const unsigned char* processed = in;
	const unsigned char *const instart = in;
	unsigned short* outstart= out;
	unsigned short* outend;
	const unsigned char* inend= in+*inlen;
	unsigned int c, d;
	int trailing;

	/* UTF16LE encoding has no BOM */
	if ((out == NULL) || (outlen == NULL) || (inlen == NULL)) return(-1);
	if (in == NULL) {
		*outlen = 0;
		*inlen = 0;
	outend = out + (*outlen / 2);


	// Output the rest the usual way
	while (in < inend) {
		d= *in++;
		if      (d < 0×80)  { c= d; trailing= 0; }
		else if (d < 0xC0) {
			/* trailing byte in leading position */
			*outlen = (out - outstart) * 2;
			*inlen = processed - instart;
		} else if (d < 0xE0)  { c= d & 0×1F; trailing= 1; }
		else if (d < 0xF0)  { c= d & 0×0F; trailing= 2; }
		else if (d < 0xF8)  { c= d & 0×07; trailing= 3; }
		else {
			/* no chance for this in UTF-16 */
			*outlen = (out - outstart) * 2;
			*inlen = processed - instart;

		if (inend - in < trailing) {

		for ( ; trailing; trailing–) {
			if ((in >= inend) || (((d= *in++) & 0xC0) != 0×80))
			c <<= 6;
			c |= d & 0×3F;

		/* assertion: c is a single UTF-4 value */
		if (c < 0×10000) {
			if (out >= outend)

			*out++ = c;
		else if (c < 0×110000) {
			if (out+1 >= outend)
			c -= 0×10000;
			*out++ = 0xD800 | (c >> 10);
		processed = in;
	*outlen = (out - outstart) * 2;
	*inlen = processed - instart;

The optimization is very low-impact and modest in scope, but it is enough to show how to use intrinsics for this purpose.

At the point marked
we add code that tests 16 bytes chunks at a time and tests whether they are all ASCII characters. If so, then we use an efficient unrolled series of assignments to transfer the input bytes to the output. The first time any non-ASCII byte is found in a 16 byte chunk, that chunk and all the rest of the file are processed in the original code’s way. In the worst case, when the input has no ASCII characters, there will only be a few cycles penalty; in the best case, when in the input has only ASCII characters, all characters (except any straggling characters where the length is not a multiple of 16) will be processed by the optimized code section; inputs which start off ASCII then get some non-ASCII later (such an English HTML page that may have a few special publishing characters) benefit.

The same optimization could be used for encoders for 8-bit fixed and variable encoders, such as a CP1252 to UTF-16 transcoder: any ASCII compatible encoding where code points less than 0×80 are only used for ASCII characters.

First, we need to add an extra header, which declares the intrinsics functions available to access the SSE2 operations on modern x86 CPUs subsequent to the Pentium 4:

#include <emmintrin.h>

And here is the code. __m128i is a datatype built into recent x86 compilers to represent a 16 byte chunk of memory used to contain integers of various sizes: in our case here, we use intrinsic functions that treat the datatype as 16 signed chars.

while (in +16  <= inend &&
		!( _mm_movemask_epi8(
				_mm_load_si128 ((__m128i*)in) ,
				_mm_setzero_si128() )))) { 

			// unrolled output
			*out++ = *in++;  // 1
			*out++ = *in++;  // 2
			*out++ = *in++;  // 3
			*out++ = *in++;  // 4
			*out++ = *in++;  // 5
			*out++ = *in++;  // 6
			*out++ = *in++;  // 7
			*out++ = *in++;  // 8
			*out++ = *in++;  // 9
			*out++ = *in++;  // 10
			*out++ = *in++;  // 11
			*out++ = *in++;  // 12
			*out++ = *in++;  // 13
			*out++ = *in++;  // 14
			*out++ = *in++;  // 15
			*out++ = *in++;  // 16

			processed += 16; // correct the count

The first and last lines are housekeeping. The second line contains the interesting bits: the intrinsic functions. The rest of the body are the simple unrolled data transfers. The code does assume that the input and output arrays are 16-byte aligned, as required by the intrinsic functions.

So what is going on? The _mm_load_si128() intrinsic function takes a pointer to the input (cast appropriately) and starts a parallelized load of that chunk. The next stage of the pipeline checks for bytes that are less than zero using _mm_cmplt_epi8(): this will find any bytes not in the ASCII range, because it treats the bytes as signed chars. This results in an notional __m128i containing 0×00 in each byte position if all the characters are ASCII. The _mm_movemask_epi8()

intrinsic function then constructs an integer from the top bits of each of the 16 bytes, and we use that integer for our test. Phew!

How much increase does this give us? Testing the best case on my new notebook with a recent compiler, all ASCII input, inputs larger than 1K have a four times increase in processing rate (at best, from 12.7 cycles per byte to 2.7 cycles per byte.) Even for small 100 byte input, the processing rate is doubled. So even this very basic optimization is worthwhile; and I am sure various improvements would suggest themselves to any reader who looked further.

The SSE2 streaming operations on x86 CPUs are not well suited for the needs of text processing, and could be improved. They are much more aimed at the needs of floating point and integer mathematics; however, I hope this example shows that they can be useful for some text operations too. One of the justifications for XML Binary Infosets is that text operations required by XML parsing are too CPU intensive; this example may perhaps counter that to some extent: optimizations based on the use of intrinsic functions will become more common as x86 systems which don’t support SSE2 become obsolete, and some of these optimizations may be relevant for XML parser writers, certainly for transcoder writers.

[Original concluded: "How geeky is that? This is not code I am maintaining; please put any fixes or suggestions below! "] 

This article was cited in: 

A Case Study in SIMD Text Processing with Parallel Bit Streams
UTF-8 to UTF-16 Transcoding,   Robert D. Cameron, PPoPP’08, 2008, ACM.