A Robust Interface for C Library Encoding Functions
Peter Zion, Wednesday, May 14th, 2008
A common task for library routines is to encode one buffer into another, transforming the data in some way in the process, where the size of the required destination buffer depends on the contents of the source buffer. Simple examples of routines which do this include:
- Functions that encode and decode between different character sets or code pages (e.g. ISO-8859-1 to UTF-8 conversion)
- Functions that escape and unescape binary data for text representation
- printf-type routines: the data is variable-length because, for instance, string arguments may be of variable length
- In C, even the strcpy routine falls into this category, since the length of the source string is determined by the contents of the source buffer
Typical C library routines, including many in the standard C library, deal with this situation by taking a buffer size as a parameter and never writing more than that many elements to the buffer. The problem with this approach is that it gives no information about what size of buffer would have sufficed to encode all the data. Most code I have seen simply doesn’t handle results larger than some fixed, arbitrarily sized output buffer, which often results in a bug that only appears for large inputs or edge cases.
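To make the problem concrete, here is a minimal sketch built on the standard strncpy. The helper name copy_truncating is made up for illustration; it stands in for the kind of fixed-size copy found in a lot of real code.

```c
#include <string.h>

/* Hypothetical helper: copy src into a fixed-size buffer, truncating if
 * necessary. strncpy never writes more than the count it is given, but it
 * returns dst, which says nothing about how large a buffer would have
 * been needed. */
static void copy_truncating( char *dst, size_t dst_len, const char *src )
{
    if ( dst_len == 0 )
        return;
    strncpy( dst, src, dst_len - 1 );
    dst[dst_len - 1] = '\0'; /* strncpy does not terminate on truncation */
}
```

After the call, the caller cannot tell whether the source fit or was silently cut short, let alone how much space a retry would need.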
Higher-level libraries typically return some sort of dynamically sized buffer to which they have appended the data, such as a high-level string class, but there are potential problems here too: there is usually a hidden cost, such as many memory reallocations, that goes along with appending large encodings to a dynamic buffer. There is also usually an overhead (however small it may be) to using dynamic buffers in cases where a raw, stack-allocated buffer would have sufficed.
One surprisingly simple and elegant but rarely seen way to handle such functions in C is to return the size of the buffer that would be required to hold the entire result instead of the number of elements written. To understand the benefits of this, imagine that such a function had the prototype
size_t Encode( char *buf, size_t buf_len, const char *data );
When you then use such an encoding function in your code, you can both avoid memory allocation in small cases and at the same time allow arbitrarily long encodings as a fallback, as this example illustrates:
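char small_buf[256];
char *buf = small_buf;
size_t required_len = Encode( buf, sizeof small_buf, data );
if ( required_len > sizeof small_buf ) // (use unlikely() here when possible)
{
    // The stack buffer was too small: allocate exactly what is needed and encode again
    buf = malloc( required_len );
    if ( buf == NULL )
        abort(); // or report the allocation failure in whatever way suits the caller
    Encode( buf, required_len, data );
}
// Do something with buf... when done, make sure to clean up:
if ( buf != small_buf )
    free( buf );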
The key point to take away here is that we don’t actually allocate any memory unless we are encoding a particularly long string, which is often an exceptional case, and yet the size of the encoding output is limited only by the largest contiguous virtual memory block available. Note also that min( buf_len, required_len ) is precisely the number of bytes written to the buffer, whether or not all the data was encoded; if all the data was encoded, this is precisely the encoding length.
When using this style, it is helpful to allow the passed buffer pointer to be NULL, in which case the function writes nothing and simply returns the size of the buffer necessary to hold the results.
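C99's snprintf follows a closely related convention, so it makes a convenient illustration of the NULL-buffer query: called with a NULL buffer and a size of zero, it writes nothing and returns the number of characters the full output would contain, not counting the terminating NUL. The sketch below uses that to allocate exactly the right amount; format_greeting is a hypothetical helper, not part of any library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a freshly allocated "Hello, <name>!" string,
 * or NULL on failure. The caller owns the result and must free() it. */
static char *format_greeting( const char *name )
{
    /* First pass: NULL buffer, size 0. Nothing is written; the return value
     * is the length the output would have, excluding the terminator. */
    int len = snprintf( NULL, 0, "Hello, %s!", name );
    if ( len < 0 )
        return NULL;

    size_t required_len = (size_t)len + 1; /* room for the terminating NUL */
    char *buf = malloc( required_len );
    if ( buf == NULL )
        return NULL;

    /* Second pass: the buffer is now exactly large enough. */
    snprintf( buf, required_len, "Hello, %s!", name );
    return buf;
}
```

Note that snprintf's return value excludes the terminating NUL, hence the + 1. Whether a returned size includes the terminator is a detail worth checking for each function written in this style.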
As an example of a routine using this style, here is a function to encode a string by backslashing double quotes and backslashes:
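size_t c_quote_str( char *dst, size_t dst_len, const char *str )
{
    size_t required_len = 0;
    for (;;)
    {
        // Double quotes and backslashes are preceded by an escaping backslash
        if ( *str == '"' || *str == '\\' )
        {
            if ( dst != NULL && dst_len > 0 )
            {
                *dst++ = '\\';
                --dst_len;
            }
            ++required_len;
        }
        // Copy the character itself (including the terminating NUL) if there is room
        if ( dst != NULL && dst_len > 0 )
        {
            *dst++ = *str;
            --dst_len;
        }
        // Count every output byte, whether or not it was actually written
        ++required_len;
        if ( *str++ == '\0' )
            break;
    }
    return required_len;
}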
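As a rough sketch of how a caller might combine this routine with the stack-buffer-first pattern described earlier, the block below writes a quoted string to a stream. The helper print_quoted and the 64-byte buffer size are assumptions chosen for illustration; c_quote_str is repeated from the listing above only so that the block compiles on its own.

```c
#include <stdio.h>
#include <stdlib.h>

/* c_quote_str as given above, repeated so this sketch is self-contained. */
static size_t c_quote_str( char *dst, size_t dst_len, const char *str )
{
    size_t required_len = 0;
    for (;;)
    {
        if ( *str == '"' || *str == '\\' )
        {
            if ( dst != NULL && dst_len > 0 )
            {
                *dst++ = '\\';
                --dst_len;
            }
            ++required_len;
        }
        if ( dst != NULL && dst_len > 0 )
        {
            *dst++ = *str;
            --dst_len;
        }
        ++required_len;
        if ( *str++ == '\0' )
            break;
    }
    return required_len;
}

/* Hypothetical helper: writes the quoted form of str to out.
 * Returns 0 on success, -1 on allocation or output failure. */
static int print_quoted( FILE *out, const char *str )
{
    char small_buf[64]; /* arbitrary size, chosen for illustration */
    char *buf = small_buf;
    size_t required_len = c_quote_str( buf, sizeof small_buf, str );
    if ( required_len > sizeof small_buf )
    {
        /* Rare case: the result did not fit, so encode again into a
         * heap buffer of exactly the reported size. */
        buf = malloc( required_len );
        if ( buf == NULL )
            return -1;
        c_quote_str( buf, required_len, str );
    }
    int rc = ( fputs( buf, out ) == EOF ) ? -1 : 0;
    if ( buf != small_buf )
        free( buf );
    return rc;
}
```

Because required_len counts the terminating NUL, a buffer of exactly that size always holds a complete, terminated string. When the buffer is too small, the output is cut off without a terminator, so it should be treated as incomplete and not used as a C string.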