Using the AddressSanitizer API


Recently, I’ve spent some time annotating string-processing C code with preconditions so that I can test our verification tool on it. All of these preconditions look quite similar—you get a pointer that you assume points to a null-terminated string; sometimes a pointer to a buffer with its size passed in another argument—but still I wanted to test somehow that they’re never violated at runtime. In this article I’ll show how to use AddressSanitizer to dynamically test such preconditions, including checking whether a pointer points to a correct null-terminated string or how much space is safely accessible in a buffer starting from a given pointer.

All code described in this article can be found here.

There is no way in standard C to write a function that checks whether a pointer points to a valid string or gives you the size of a buffer—that information simply does not exist anywhere at run-time. The proper way of maintaining it would be to keep a data structure that stores the metadata about all the valid regions of addresses, like in the tool described in the paper “An Optimized Memory Monitoring for Runtime Assertion Checking of C Programs”. Although a memory allocator might already have something similar to that for heap regions, tracking all the stack objects, automatic variables, and globals could be expensive.

AddressSanitizer, a popular memory error detection tool, uses a very efficient implementation of shadow memory to track the validity of individual memory cells instead of storing bounds of memory regions explicitly. However, it can still sometimes give you approximate information about ranges of valid addresses if these regions are separated by invalid memory. So, I was wondering how difficult it would be to extend it with support for the kinds of assertions that I want to check.

AddressSanitizer exposes an API as a handful of functions declared in the header sanitizer/asan_interface.h (code here). These functions let you access the shadow memory and query region bounds and variable names, although some of them did not do exactly what I expected. In particular, the documentation for __asan_address_is_poisoned says that it returns 1 if the given address is poisoned, that is, not safe to access. It does that for addresses in the general area of the stack, heap, or the globals section, but for some completely garbage pointers or NULL it misleadingly returns 0.

I’ve used LLVM 3.7. If you’re getting different results then maybe it’s the LLVM version since I don’t think this API is meant to be very stable. You probably shouldn’t rely on it too much or use it in production code. For testing or debugging it seems very useful, especially these functions:

  • __asan_locate_address gives you lots of useful information about a given pointer, including its location (whether it is on the stack, heap, globals section, etc.), its corresponding variable name, and, most importantly, the base address and size of the memory region it belongs to.
  • ASAN_POISON_MEMORY_REGION could be used if you have a larger block of allocated memory that you logically subdivide into smaller chunks (for example in your own allocator). You can leave some space between these chunks and poison it to be able to detect sequential buffer overflows.
  • __asan_describe_address seems very useful for debugging. It prints out the same kind of information about a given address that you get in ASAN error messages (including the allocation-time stack trace).
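The custom-allocator use of ASAN_POISON_MEMORY_REGION mentioned in the list above might look roughly like the sketch below. The arena, arena_alloc, ARENA_SIZE, GAP_SIZE and BUILT_WITH_ASAN names are hypothetical and chosen only for illustration. When the file is not compiled with -fsanitize=address, the annotation is replaced by a no-op so the code still builds.

```c
#include <stddef.h>

/* Detect -fsanitize=address (GCC defines __SANITIZE_ADDRESS__,
 * Clang provides __has_feature). BUILT_WITH_ASAN is a hypothetical name. */
#if defined(__SANITIZE_ADDRESS__)
#  define BUILT_WITH_ASAN 1
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define BUILT_WITH_ASAN 1
#  endif
#endif

#ifdef BUILT_WITH_ASAN
#  include <sanitizer/asan_interface.h>
#else
/* Without ASan the annotation does nothing. */
#  define ASAN_POISON_MEMORY_REGION(addr, size) ((void)(addr), (void)(size))
#endif

#define ARENA_SIZE 256 /* hypothetical arena capacity in bytes */
#define GAP_SIZE   8   /* hypothetical poisoned gap after each chunk */

static _Alignas(8) unsigned char arena[ARENA_SIZE];
static size_t arena_used; /* multiple of 8, never above ARENA_SIZE */

/* Hands out `n` bytes from the arena, followed by a poisoned gap. */
void *arena_alloc(size_t n)
{
	size_t rounded = (n + 7) & ~(size_t)7; /* keep chunks 8-byte aligned */
	size_t free_bytes = ARENA_SIZE - arena_used;

	if (rounded < n || free_bytes < GAP_SIZE || rounded > free_bytes - GAP_SIZE)
		return NULL;

	unsigned char *chunk = arena + arena_used;
	arena_used += rounded + GAP_SIZE;

	/* Poison everything between the end of this chunk and the next one. */
	ASAN_POISON_MEMORY_REGION(chunk + n, rounded - n + GAP_SIZE);
	return chunk;
}
```

With ASan enabled, writing past the end of a chunk returned by arena_alloc should then land in poisoned memory and be reported instead of silently corrupting the next chunk.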

To check whether a pointer points to a valid region, check whether the type returned by __asan_locate_address is one of “global”, “heap”, “stack”, or “stack-fake”. Since it’s a bit unwieldy to deal with these string descriptors every time, I ended up creating the following wrapper:

/**
 * If `ptr` points within a valid memory region, sets `base` to the start of
 * this region, `size` to its size and returns true. Otherwise, returns false.
 */
bool __asan_region_bounds(void *ptr, void **base, size_t *size);
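The prototype above leaves the body out. A minimal sketch of how such a wrapper might be written on top of __asan_locate_address follows; it is not necessarily the implementation from the linked code. The BUILT_WITH_ASAN macro, the 64-byte name buffer, the extra __asan_address_is_poisoned check, and the fallback used when ASan is not enabled are all assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Detect -fsanitize=address; BUILT_WITH_ASAN is a hypothetical name. */
#if defined(__SANITIZE_ADDRESS__)
#  define BUILT_WITH_ASAN 1
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define BUILT_WITH_ASAN 1
#  endif
#endif

#ifdef BUILT_WITH_ASAN
#  include <sanitizer/asan_interface.h>
#endif

bool __asan_region_bounds(void *ptr, void **base, size_t *size)
{
#ifdef BUILT_WITH_ASAN
	char name[64]; /* receives the variable name; not used here */
	const char *kind =
		__asan_locate_address(ptr, name, sizeof name, base, size);

	bool known_region = strcmp(kind, "global") == 0 ||
	                    strcmp(kind, "heap") == 0 ||
	                    strcmp(kind, "stack") == 0 ||
	                    strcmp(kind, "stack-fake") == 0;

	/* Assumption: also reject addresses that ASan considers poisoned
	 * (for example freed chunks or redzones next to a region). */
	return known_region && !__asan_address_is_poisoned(ptr);
#else
	/* Assumption: without ASan there is no shadow memory to query,
	 * so conservatively report every pointer as invalid. */
	(void)ptr;
	(void)base;
	(void)size;
	return false;
#endif
}
```

Compiled with -fsanitize=address, a call such as __asan_region_bounds(&some_local, &base, &size) is expected to succeed and report the bounds of the region holding some_local, while NULL is rejected.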

AddressSanitizer’s concept of a region roughly corresponds to the notion of an object in the C standard, although it is a lot less restrictive. Individual globals or locals live in their own dedicated regions, whereas structure fields or array elements will usually be part of a bigger region that contains at least the whole structure. That’s why ASAN will not detect sequential buffer overflows between adjacent buffers in the same struct. For dynamically allocated memory, the region corresponds to the whole chunk of the size requested in the malloc call.
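A small sketch of why overflows between struct fields go unnoticed: adjacent char arrays in a struct are laid out back to back, so there is no room for a poisoned gap between them. The struct pair type and the fields_are_contiguous helper are hypothetical names used only for illustration. The C standard technically allows padding between members, so the result reflects what mainstream ABIs do rather than a guarantee.

```c
#include <stdbool.h>
#include <stddef.h>

/* Two adjacent buffers that end up in one ASan region. */
struct pair {
	char first[8];
	char second[8];
};

/* True if `second` starts at the very byte where `first` ends,
 * i.e. there is no gap that could be poisoned between them. */
bool fields_are_contiguous(void)
{
	size_t end_of_first = offsetof(struct pair, first) +
	                      sizeof(((struct pair *)0)->first);

	return offsetof(struct pair, second) == end_of_first;
}
```

On typical platforms this returns true, so writing nine bytes into first simply spills into second, which is still valid memory as far as the shadow map is concerned.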

Another thing worth mentioning is the treatment of function pointers. It looks like __asan_locate_address classifies them as “heap-invalid” so they’re not considered to be valid pointers by __asan_region_bounds. Even though on many systems it might be possible to cast a function pointer to a data pointer and read the underlying machine code, the C standard doesn’t guarantee that so it’s a reasonable choice to treat them as invalid.

Example: strcpy()

To describe what I mean by checking preconditions for string functions, let’s take the C library function strcpy() as an example. The man page gives the following description:

The strcpy() function copies the string pointed to by src, including the terminating null byte ('\0'), to the buffer pointed to by dest. The strings may not overlap, and the destination string dest must be large enough to receive the copy.

To check this precondition dynamically we need to be able to check whether src is a properly null-terminated string, dest is big enough, and that the two regions do not overlap. In the following sections I will describe how to do it using the AddressSanitizer API. All the code together with a few testcases can be downloaded from here.

Note that AddressSanitizer alone will detect an invalid access if src is invalid or dest is too small. The reason I’ve implemented these functions is to test the preconditions that I already have and need for static analysis, to make sure that I’m not verifying the function based on wrong assumptions.

Checking if src is a valid string

Since we know the region bounds we can simply check whether a string is null-terminated with a linear walk over the region. Since it’s easy to calculate the string’s length as a side effect of this walk, the most convenient way of implementing this is as a “safe strlen” function which accepts any pointer—including ones not pointing to properly-allocated memory.

The following implementation will return the length of the string starting at str if this string is properly null-terminated. If it isn’t, it returns a negative value. We need to be very careful with types here—just returning an int (which usually is 32-bit) can lead to overflows for strings bigger than 2 GiB.

int64_t __asan_strlen(const char *str)
{
	void *base;
	size_t size;

	/* reject pointers that do not point into a valid region */
	if (!__asan_region_bounds((void *)str, &base, &size))
		return -1;

	const char *end = (const char *)base + size;
	int64_t result = 0;

	for (const char *ptr = str; ptr != end; ++ptr) {
		if (*ptr == '\0')
			return result;
		result++;
	}

	/* region ends without a terminating null byte */
	return -1;
}

Of course, the only thing that this procedure really verifies is that there is a valid sequence of readable bytes starting at str and ending with a null byte. Thus, we might end up with accidentally null-terminated “strings” by taking addresses of non-string variables or pointers inside large random memory regions. Just because the function dynamically discovers something to be a valid string doesn’t mean that it’s always guaranteed to be one.

For example, if i is a 32-bit variable equal to 16777215 (or 0x00ffffff), then ((char *)&i) + 2 will point to a null-terminated “string” on little-endian architectures but not on big-endian.
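The byte-order dependence of this example can be checked without ASan by inspecting the bytes of the variable directly. This is only an illustrative sketch; tail_is_terminated and is_little_endian are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Does ((char *)&i) + 2 reach a null byte before the storage of i ends? */
bool tail_is_terminated(uint32_t i)
{
	const unsigned char *bytes = (const unsigned char *)&i;

	return memchr(bytes + 2, 0, sizeof i - 2) != NULL;
}

/* Runtime probe: true when the least significant byte is stored first. */
bool is_little_endian(void)
{
	uint16_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}
```

For i equal to 0x00ffffff the bytes in memory are ff ff ff 00 on a little-endian machine, so the two bytes starting at offset 2 contain a terminator; on a big-endian machine they are 00 ff ff ff and the tail is ff ff with no terminator inside the variable.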

Space available in dest

Checking if the destination buffer is big enough to accommodate the input string boils down to determining the region bounds and doing some pointer arithmetic. For convenience, let’s define a function __asan_buffspace that, for a given pointer ptr, returns the number of bytes that are safely accessible starting at ptr, that is, the biggest x such that dereferencing ptr+i is safe for every i with 0 <= i < x. If ptr is a completely invalid pointer (e.g. NULL or pointing outside an allocated region), the function returns -1 to signal that not even ptr+0 is safe.

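int64_t __asan_buffspace(const void *ptr)
{
	void *base;
	size_t size;

	/* invalid pointer: not even the first byte is accessible */
	if (!__asan_region_bounds((void *)ptr, &base, &size))
		return -1;

	/* bytes left between ptr and the end of its region */
	return ((int64_t)size) - ((const char *)ptr - (const char *)base);
}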

Again, I return a signed integer for convenience. The cast of size from size_t to int64_t will overflow for buffers bigger than a few exabytes so if you’re reading this in the very far future, please modify the code accordingly.

Checking if buffers are non-overlapping

This is a much more interesting property, since it might allow us to catch bugs that would not be caught with AddressSanitizer alone. To check that the two buffers do not overlap, it might be tempting to just compare their base addresses. This would be too restrictive, e.g.

const char *hello = "Hello, world!";
char buff[256];
strcpy(buff, hello);
strcpy(buff + 100, buff);

Both invocations of strcpy are given non-overlapping arguments, even though in the second case they belong to the same allocated region. A similar thing would happen, for example, for two arrays in the same class or structure.

However, since we know the string’s length, checking for overlap is simpler and doesn’t require any additional API calls. strcpy reads length + 1 bytes starting at src and writes length + 1 bytes starting at dest, so the buffers overlap exactly when dest <= src + length and src <= dest + length. Note that if you want to be really strict about it, comparing pointers that don’t point within the same array is undefined behavior according to the C standard. I don’t really know how to deal with this (checking ASAN’s base addresses is not enough) but it’s likely a non-issue since we’re making platform-specific assumptions anyway.

The full precondition for strcpy can be checked like this:

bool check_strcpy(char *dest, const char *src)
{
	int64_t length = __asan_strlen(src);

	if (length < 0)
		return false; // src is not null-terminated

	if (length >= __asan_buffspace(dest))
		return false; // not enough space in `dest`

	// both ranges span length + 1 bytes (including the terminator)
	if (dest <= src + length && src <= dest + length)
		return false; // buffers overlap

	return true;
}
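As a standalone illustration of the overlap test, the sketch below checks whether two byte ranges of equal size share at least one byte, in whichever order they appear in memory. The ranges_overlap helper is a hypothetical name. It compares addresses through uintptr_t, a common way to avoid relational comparison of unrelated pointers; the result of that conversion is implementation-defined, so this is still a platform-specific assumption. For strcpy, n would be the string length plus one for the terminator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the n-byte ranges starting at a and b share at least one byte. */
bool ranges_overlap(const void *a, const void *b, size_t n)
{
	uintptr_t x = (uintptr_t)a;
	uintptr_t y = (uintptr_t)b;
	uintptr_t distance = x > y ? x - y : y - x;

	/* Equal-sized ranges overlap when their starts are less than
	 * n bytes apart; empty ranges never overlap. */
	return n > 0 && distance < n;
}
```

With the 14 bytes of "Hello, world!" (13 characters plus the terminator) stored at buff, copying to buff + 100 does not overlap, while copying to buff + 5 does, even though dest lies after src in that case.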