System memory allocation using the malloc subsystem

Memory is allocated to applications using the malloc subsystem.

The malloc subsystem is a memory management API that consists of the following subroutines:

malloc
calloc
realloc
free
mallopt
mallinfo
alloca
valloc
posix_memalign

The malloc subsystem manages a logical memory object called a heap. The heap is a region of memory that resides in the application's address space between the last byte of data allocated by the compiler and the end of the data region. The heap is the memory object from which memory is allocated and to which memory is returned by the malloc subsystem API.

The malloc subsystem performs the following fundamental memory operations:

Allocation:
Performed by the malloc, calloc valloc, alloca, and posix_memalign subroutines.
Deallocation:
Performed by the free subroutine.
Reallocation:
Performed by the realloc subroutine.

The mallopt and mallinfo subroutines are supported for System V compatibility. The mallinfo subroutine can be used during program development to obtain information about the heap managed by the malloc subroutine. The mallopt subroutine can be used to disclaim page-aligned, page-sized free memory, and to enable and disable the default allocator. Similar to the malloc subroutine, the valloc subroutine is provided for compatibility with the Berkeley Compatibility Library.

For additional information, see the following sections:

Working with the process heap

_edata is a symbol whose address is the first byte following the last byte of initialized program data. The _edata symbol refers to the start of the process heap, which is enlarged by the malloc subsystem when the first block of data is allocated. The malloc subsystem enlarges the process heap by increasing the process brk value, which denotes the end of the process heap. This is done by calling the sbrk subroutine. The malloc subsystem expands the process heap as the needs of the application dictate.

The process heap is divided into allocated and freed memory blocks. The free pool consists of the memory available for subsequent allocation. An allocation is completed by first removing a memory block from the free pool and then returning to the calling function a pointer to this block. A reallocation is completed by allocating a memory block of the new size, moving the data in the original block to the new block, and freeing the original block. The allocated memory blocks consist of the pieces of the process heap being used by the application. Because the memory blocks are not physically removed from the heap (they change state from free to allocated), the size of the process heap does not decrease when memory is freed by the application.

Process address space in 32-bit applications

A 32-bit application program running on the system has an address space that is divided into the following segments:

Segment	Description
`0x00000000` to `0x0fffffff`	Contains the kernel.
`0x10000000` to `0x1fffffff`	Contains the application program text.
`0x20000000` to `0x2fffffff`	Contains the application program data, the process heap, and the application stack.
`0x30000000` to `0xcfffffff`	Available for use by shared memory or mmap services.
`0xd0000000` to `0xdfffffff`	Contains shared library text.
`0xe0000000` to `0xefffffff`	Available for use by shared memory or mmap services.
`0xf0000000` to `0xffffffff`	Contains the application shared library data.

Process address space in 64-bit applications

A 64-bit application program running on the system has an address space that is divided into the following segments:

Segment	Description
`0x0000 0000 0000 0000` to `0x0000 0000 0fff ffff`	Contains the kernel.
`0x0000 0000 f000 0000` to `0x0000 0000 ffff ffff`	Reserved.
`0x0000 0001 0000 0000` to `0x07ff ffff ffff ffff`	Contains the application program text, application program data, the process heap, and shared memory or mmap services.
`0x0800 0000 0000 0000` to `0x08ff ffff ffff ffff`	Privately loaded objects.
`0x0900 0000 0000 0000` to `0x09ff ffff ffff ffff`	Shared library text and data.
`0x0f00 0000 0000 0000` to `0x0fff ffff ffff ffff`	Application stack.

Note: AIX® uses a delayed paging slot allocation technique for storage allocated to applications. When storage is allocated to an application with a subroutine, such as malloc, no paging space is assigned to that storage until the storage is referenced. This technique is useful for applications that allocate large sparse memory segments. However, this technique can affect portability of applications that allocate very large amounts of memory. If the application expects that calls to malloc will fail when there is not enough backing storage to support the memory request, the application might allocate too much memory. When this memory is referenced later, the machine quickly runs out of paging space and the operating system kills processes so that the system is not completely exhausted of virtual memory. The application that allocates memory must ensure that backing storage exists for the storage being allocated. Setting the PSALLOC environment variable to PSALLOC=early changes the paging space allocation technique to an early allocation algorithm. In early allocation, paging space is assigned once the memory is requested. For more information, see Paging space and virtual memory in Operating system and device management.

Understanding system allocation policy

The allocation policy refers to the set of data structures and algorithms employed to represent the heap and to implement allocation, deallocation, and reallocation. The malloc subsystem supports several different allocation policies, including the default allocation policy, the watson allocation policy, the malloc 3.1 allocation policy, and the user-defined allocation policy. The API for accessing the malloc subsystem is identical for all allocation policies; only the underlying implementation is different.

You can use the following environment variables to specify the allocation policy and any normal or debug options for that policy:

MALLOCTYPE specifies the allocation policy.
MALLOCOPTIONS specifies normal options to the chosen allocation policy.
MALLOCDEBUG specifies debug options to the chosen allocation policy.
MALLOCALIGN specifies the default malloc alignment external to a program.

The default allocation policy is generally more efficient and is the preferred choice for the majority of applications. The other allocation policies have some unique behavioral characteristics that can be beneficial in specific circumstances, as described in Comparing the Different Allocation Policies.

Some options to the various allocation policies are compatible with each other and can be used in tandem. When you are using options in tandem, use a comma (,) to separate options specified by the MALLOCOPTIONS and MALLOCDEBUG environment variables.

The MALLOCALIGN environment variable can be set to the default alignment desired for every malloc() allocation. An example is

MALLOCALIGN=16;  export MALLOCALIGN

The MALLOCALIGN environment variable can be set to any power of 2 value greater than or equal to the size of a pointer in the corresponding run mode (4 bytes for 32-bit mode, 8 bytes for 64-bit mode). For 32-bit vector-enabled programs, this environment variable can be set to 16, so all malloc()s will be suitably aligned for vector data types if necessary. Note that 64-bit vector programs will already receive 16-byte-aligned allocations.

Also, internally to a program, the program can use the mallopt(M_MALIGN, 16) routine to change the default malloc() to provide 16-byte aligned allocations. The mallopt(M_MALIGN) routine allows a program to control the default malloc alignment dynamically at runtime.

Understanding the default allocation policy

The default allocation policy maintains the free space in the heap as nodes in a cartesian binary search tree in which the nodes are ordered left-to-right by address (increasing address to the right) and top-to-bottom by length (such that no child is larger than its parent). This data structure imposes no limitation on the number of block sizes supported by the tree, allowing a wide range of potential block sizes. Tree-reorganization techniques optimize access times for node location, insertion, and deletion, and also protect against fragmentation.

The default allocation policy provides support for the following optional capabilities:

Allocation

A small amount of overhead is required to service an allocation request. This is due to the need for a metadata prefix and the need for suitable alignment of each block of memory. The size of the metadata prefix for all allocations is 8 and 16 bytes for 32-bit and 64-bit programs respectively. Each block must be aligned on a 16 or 32 byte boundary, thus the total amount of memory required for an allocation of size n is:

size = roundup(n + prefix_size, alignment requirement)

For example, an allocation of size 37 in a 32-bit process would require roundup(37 + 8, 16), which is equal to 48 bytes.

The node of the tree with the lowest address that is greater than or equal to the size required is removed from the tree. If the block found is larger than the needed size, the block is divided into two blocks: one of the needed size, and the second a remainder. The second block, called the runt, is returned to the free tree for future allocation. The first block is returned to the caller.

If a block of sufficient size is not found in the free tree, the heap is expanded, a block the size of the acquired extension is added to the free tree, and allocation continues as previously described.

Deallocation

Memory blocks deallocated with the free subroutine are returned to the tree, at the root. Each node along the path to the insertion point for the new node is examined to see if it adjoins the node being inserted. If it does, the two nodes are merged and the newly merged node is relocated in the tree. If no adjoining block is found, the node is simply inserted at the appropriate place in the tree. Merging adjacent blocks can significantly reduce heap fragmentation.

Reallocation

If the size of the reallocated block will be larger than the original block, the original block is returned to the free tree with the free subroutine so that any possible coalescence can occur. A new block of the requested size is then allocated, the data is moved from the original block to the new block, and the new block is returned to the caller.

If the size of the reallocated block is smaller than the original block, the block is split and the smaller one is returned to the free tree.

Limitations

The default allocation policy supports the following options:

Understanding the watson allocation policy

The Watson allocation policy maintains the free space in the heap as nodes in two separate red-black trees: one sorted by address, the other by size. Red-black trees provide simpler and more efficient tree operations than the cartesian trees of the default allocator, thus the watson allocation policy is often faster than the default.

Allocation

The Watson allocation policy has the same overhead requirements as the default allocation policy.

The size tree is searched for the smallest possible block that is greater than or equal to the size required. This block is then removed from the size tree. If the block found is larger than the needed size, the block is divided into two blocks: one block of the remaining size, and the second of the required size. The first block, called the runt, is returned to the size tree for future allocation. The second block is returned to the caller. If the block found in the size tree was exactly the required size, the block is removed from both the size and the address tree, and then returned to the caller.

If a block of sufficient size is not found in the free tree, the process heap is expanded, a block the size of this expansion is added to the size and address trees, and allocation continues as previously described.

Deallocation

Memory blocks deallocated with the free subroutine are returned to the address tree at the root. Each node along the path to the insertion point for the new node is examined to see if it adjoins the node being inserted. If it does, the two nodes are merged and the newly merged node is relocated in the size tree. If no adjoining block is found, the node is simply inserted at the appropriate place in the both the address and size tree.

After insertion, both red-black trees must be checked for correct balancing.

Reallocation

If the size of the reallocated block will be larger than the original block, the original block is returned to the free trees with the free subroutine so that any possible coalescence can occur. A new block of the requested size is then allocated, the data is moved from the original block to the new block, and the new block is returned to the caller.

If the size of the reallocated block is smaller than the original block, the block is split and the remainder is returned to the free tree.

Limitations

The Watson allocation policy supports the following options:

Understanding the malloc 3.1 allocation policy

The malloc 3.1 allocation policy can be selected by setting MALLOCTYPE=3.1 prior to process startup. Thereafter, all 32-bit programs run by the shell will use the malloc 3.1 allocation policy (64-bit programs will continue to use the default allocation policy).

The malloc 3.1 allocation policy maintains the heap as a set of 28 hash buckets, each of which points to a linked list. Each linked list contains blocks of a particular size. The index into the hash buckets indicates the size of the blocks in the linked list. The size of the block is calculated using the following formula:

size = 2 ^{i + 4}

where i identifies the bucket. This means that the blocks in the list anchored by bucket zero are 20+4 = 16 bytes long. Therefore, given that a prefix is 8 bytes in size, these blocks can satisfy requests for blocks between 0 and 8 bytes long. The following table illustrates how requested sizes are distributed among the buckets.

Note: This algorithm can use as much as twice the amount of memory actually requested by the application. An extra page is required for buckets larger than 4096 bytes because objects of a page in size or larger are page-aligned. Because the prefix immediately precedes the block, an entire page is required solely for the prefix.

Bucket	Block Size	Sizes Mapped	Pages Used
0	16	0 ... 8
1	32	9 ... 24
2	64	25 ... 56
3	128	57 ... 120
4	256	121 ... 248
5	512	249 ... 504
6	1K	505 ... 1K-8
7	2K	1K-7 ... 2K-8
8	4K	2K-7 ... 4K-8	2
9	8K	4K-7 ... 8K-8	3
10	16K	8K-7 ... 16K-8	5
11	32K	16K-7 ... 32K-8	9
12	64K	32K-7 ... 64K-8	17
13	128K	64K-7 ... 128K-8	33
14	256K	128K-7 ... 256K-8	65
15	512K	256K-7 ... 512K-8	129
16	1M	256K-7 ... 1M-8	257
17	2M	1M-7 ... 2M-8	513
18	4M	2M-7 ... 4M-8	1K + 1
19	8M	4M-7 ... 8M-8	2K + 1
20	16M	8M-7 ... 16M-8	4K + 1
21	32M	16M-7 ... 32M-8	8K + 1
22	64M	32M-7 ... 64M-8	16K + 1
23	128M	64M-7 ... 128M-8	32K + 1
24	256M	128M-7 ... 256M-8	64K + 1
25	512M	256M-7 ... 512M-8	128K + 1
26	1024M	512M-7 ... 1024M-8	256K + 1
27	2048M	1024M-7 ... 2048M-8	512K + 1

Allocation

A block is allocated from the free pool by first converting the requested bytes to an index in the bucket array, using the following equation:

needed = requested + 8

If needed <= 16,
then
bucket = 0

If needed > 16,
then
bucket = (log(needed)/log(2) rounded down to the nearest integer) - 3

The size of each block in the list anchored by the bucket is block size = 2^{bucket + 4}. If the list in the bucket is null, memory is allocated using the sbrk subroutine to add blocks to the list. If the block size is less than a page, then a page is allocated using the sbrk subroutine, and the number of blocks arrived at by dividing the block size into the page size are added to the list. If the block size is equal to or greater than a page, needed memory is allocated using the sbrk subroutine, and a single block is added to the free list for the bucket. If the free list is not empty, the block at the head of the list is returned to the caller. The next block on the list then becomes the new head.

Deallocation

When a block of memory is returned to the free pool, the bucket index is calculated as with allocation. The block to be freed is then added to the head of the free list for the bucket.

Reallocation

When a block of memory is reallocated, the needed size is compared against the existing size of the block. Because of the wide variance in sizes handled by a single bucket, the new block size often maps to the same bucket as the original block size. In these cases, the length of the prefix is updated to reflect the new size and the same block is returned. If the needed size is greater than the existing block, the block is freed, a new block is allocated from the new bucket, and the data is moved from the old block to the new block.

Limitations

Setting MALLOCTYPE=3.1 will only enable the malloc 3.1 policy for 32-bit programs. For 64-bit programs to use the malloc 3.1 policy, the MALLOCTYPE environment variable must be explicitly set to MALLOCTYPE=3.1_64BIT. This allocation policy is less efficient than the default and is not recommended for use in most cases.

The malloc 3.1 allocation policy supports the following options:

Malloc Disclaim
Understanding the no_overwrite Option

Understanding the pool allocation policy

Malloc pool is a high performance front end to the libc functions malloc, calloc, free, posix_memalign and realloc for managing storage objects smaller than 513 bytes. The performance benefits derive from dramatically shorter path lengths and better data cache utilization. For multithreaded applications, there is the further benefit that thread local pool anchors are used to avoid atomic operations. This front end can be used in conjunction with any of the storage management schemes currently provided in libc (yorktown, and watson).

To use malloc pool, run the following command:

export MALLOCOPTIONS=pool<:max_size>

When this option is specified, a collection of pools is created during malloc initialization where each pool is a linked list of fixed sized objects. The smallest pool can hold objects of pointer-size (such as 8 bytes for 32-bit applications or 16 bytes for 64-bit applications). Each successive pool can accommodate objects whose size is pointer-size larger than the previous pool. This means there are 128 pools for 32-bit applications and 64 pools for 64-bit applications. The collection of pools is represented as an array of pointers that "anchor" the linked lists.

Malloc pool uses its own memory, the pool heap, which is not shared with standard malloc. When specified, the max_size option is rounded up to the next higher 2 MB value and is used to control the size of the pool heap. The max_size option can be specified as a decimal number or a hexadecimal number preceded by 0x or 0X (for example, export MALLOCOPTIONS=pool:0x1700000 will set max_size to 24 MB after rounding up.

For 32-bit applications, the size of the pool heap starts at 2 MB. If more storage is required and the total pool heap storage is less than max_size, an additional 2 MB is acquired. Each 2 MBb area will be on a 2 MB boundary, but need not be contiguous to any of the other 2 MB areas. For 64-bit applications, a single contiguous pool heap of max_size is allocated during malloc initialization and never extended. If max_size is not specified, it defaults to 512 MB for 32-bit applications and 32 MB for 64-bit applications. For both 32- and 64-bit modes, max_size will be set to 512 MB if a larger size is specified. For 32-bit mode, the max_size is set to 512MB, and for 64-bit mode, the max_size is set to 3.7 GB if a larger size is specified.

Storage Utilization

All pool anchors are initially set to NULL or empty. When malloc pool services a request and the corresponding pool is empty, a routine is called that allocates storage from the pool heap in contiguous 1024-byte chunks on 1024-byte boundaries. Multiple objects of the requested size are "created". The address of the first is returned to satisfy the request, while the remaining objects are linked together and placed on the pool anchor. For each 1024-byte chunk, there is a 2-byte entry in an auxiliary table that is used by free to determine the size of a returned object.

When an object is freed by malloc pool, it is merely "pushed" onto the appropriate pool anchor. No attempt is made to coalesce blocks to create larger sized objects.

Because of this behavior, malloc pool may use more storage than other forms of malloc.

Alignment

The default alignment for the malloc(), calloc(), and realloc() subroutines must be specified by setting the MALLOCALIGN environment variable appropriately. The posix_memalign() subroutine continues working even if the MALLOCALIGN environment variable is not set. If MALLOCALIGN is greater than 512, malloc pool is not used.

Cache efficiency

The memory objects allocated with malloc pool have no prefixes or suffixes. Data cache lines are therefore more densely packed with application usable data. As all memory objects that are a power of 2 in size are aligned on a boundary equal to that size, each object is contained within the minimal number of cache lines. The malloc and free subroutines do not scan trees or linked lists and therefore do not “pollute” the cache.

Multithreaded support

Malloc pools can improve performance significantly in a multithreaded scenario because it reduces lock contention and the need of atomic operations.

Load balancing support

In some multithreaded scenarios, one-thread's free pool might grow very large due to repetitive freeing of dynamically allocated memory. However, other threads may not be able to use this memory.

Load-balancing support causes a thread to release half the memory in each pool to a global pool after the pool reaches a threshold value so that other threads can use it. You can tune the threshold values at which a thread's pool will be readjusted.

To turn on the load-balancing support, the following options must be exported:

export MALLOCOPTIONS=pool:0x80000000,pool_balanced

export MALLOCFREEPOOL=min_size<-max_size>:threshold_value<,min_size<-max_size>:
threshold_value, ... >,default:threshold

The following example sets the threshold value for the pools that provide a memory of 0 -16 bytes and 256 chunks, and the threshold value of the pool that serves 32-byte chunks to 512-byte chunks. For the rest of the pools, 128-byte chunks is the threshold value.

export MALLOCFREEPOOL=0-16:256,32:512,default:128

Debugging support

There is no debug version of this high-performance front end. If the MALLOCDEBUG environment variable is set, the pool option is ignored. It is expected that applications will be debugged using "normal" malloc prior to activating pooling.

Understanding the user-defined allocation policy

The malloc subsystem provides a mechanism through which users may develop their own algorithms for managing the system heap and allocating memory.

Understanding the no_overwrite option

An additional option available to all allocation policies is no_overwrite. To reduce the overhead of glink code within the malloc subsystem, the function descriptor for the malloc subsystem APIs are overwritten with the function descriptor for the actual underlying implementation. Because some programs, such as third-party debuggers, might not work when function pointers are modified in this manner, the no_overwrite option can be used to disable this optimization.

To disable this optimization, set MALLOCOPTIONS=no_overwrite prior to process startup.

Comparing the various allocation policies

The various malloc allocation policies elaborated above provide flexibility to application developers when used separately or combined in supported ways. It is the responsibility of the developer to recognize the unique needs of an application and to tune the various allocation policy parameters in a beneficial way.

Comparing the default and malloc 3.1 allocation policies

Because the malloc 3.1 allocation policy rounds up the size of each allocation request to the next power of 2, it can produce considerable virtual- and real-memory fragmentation and poor locality of reference. The default allocation policy is generally a better choice because it allocates exactly the amount of space requested and is more efficient about reclaiming previously used blocks of memory.

Unfortunately, some application programs may depend inadvertently on side effects of the malloc 3.1 allocation policy for acceptable performance or even for correct functioning. For example, a program that overruns the end of an array may function correctly when using the malloc 3.1 allocator only because of the additional space provided by the rounding-up process. The same program is likely to experience erratic behavior or even fail when used with default allocator because the default allocator allocates only the number of bytes requested.

As another example, because of the inefficient space reclamation of the malloc 3.1 allocation algorithm, the application program almost always receives space that has been set to zeros (when a process touches a given page in its working segment for the first time, that page is set to zeros). Applications may depend on this side effect for correct execution. In fact, zeroing out of the allocated space is not a specified function of the malloc subroutine and would result in an unnecessary performance penalty for programs that initialize only as required and possibly not to zeros. Because the default allocator is more aggressive about reusing space, programs that are dependent on receiving zeroed storage from malloc will probably fail when the default allocator is used.

Similarly, if a program continually reallocates a structure to a slightly greater size, the malloc 3.1 allocator may not need to move the structure very often. In many cases, the realloc subroutine can make use of the extra space provided by the rounding implicit in the malloc 3.1 allocation algorithm. The default allocator will usually have to move the structure to a slightly larger area because of the likelihood that something else has been called by the malloc subroutine just above it. This may present the appearance of a degradation in realloc subroutine performance when the default allocator is used instead of the malloc 3.1 allocator. In reality, it is the surfacing of a cost that is implicit in the application program's structure.

Debugging application mismanagement of the system heap

The malloc subsystem offers a collection of debugging tools intended to help the application developer debug and correct errors in a program's heap management. These debugging tools are controlled through the MALLOCDEBUG environment variable.

Synopsis of malloc environment variable and options

The following table shows the compatibility between the MALLOCTYPE and MALLOCOPTIONS environment variables.

Table 1. Compatibility between **MALLOCTYPE** and **MALLOCOPTIONS** Environment Variables
	multiheap (and sub-options)	buckets (and suboptions)	Thread Cache	disclaim	no_overwrite
Default Allocator	yes	yes	yes	yes	yes
3.1	no	no	yes	yes	yes
Watson	no	no	yes	no	no
Watson2	no	no	yes¹	no	no
user:	no	no	no	no	yes

1: With the Watson2 allocator, thread cache is always used and can not be disabled.

Table 2. Compatibility between **MALLOCDEBUG** and **MALLOCTYPE** Environment Variables
	York Town<Default Allocator>	3.1	Watson	Watson2	user:
catch_overflow (and suboptions)	yes	no	yes	no	no
report_allocations	yes	no	yes	yes	no
postfree_checking	yes	no	yes	no	no
validate_ptrs	yes	no	yes	yes	no
trace	yes	no	yes	yes	no
log	yes	no	yes	yes	no
verbose	yes	no	yes	no	no
check_arena	yes	no	yes	no	no

All MALLOCDEBUG options are compatible and supported with MALLOCOPTIONS.

Understanding the Watson2 allocation policy

The Watson2 malloc subsystem adapts to the behavior of the application when it changes from a single thread to multiple threads and from multiple threads to a single thread. It uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required. The Watson2 malloc subsystem has O (logN) amortized the cost per operation for many workloads because vast number of operations can be run at a constant time without synchronization.

Allocation

Allocation is handled through a combination of mechanisms. These mechanisms depend on parameters, such as the number of active threads, size of the request and the deallocation history of the process. The set of mechanisms reaches from a thread-specific caching and use a variable number of heaps, which has thread affinity to a Double-red-black tree and Page-based coalescing.

Deallocation

Deallocation depends on the same parameters as the allocation behavior. Typically, a returning block is captured in the thread-specific cache. Based on the heap affinity and capacity utilization, memory might be returned to one of the multiple heap structures. At times, the content from the multiple heap structure is consolidated to a common heap structure to improve coalescing and to reduce heap fragmentation. To improve robustness against application errors, the allocator identifies deallocation of invalid pointers or corrupted blocks, to a certain degree and filters these operations.

Reallocation

Large blocks of memory that are adequate are reused. If the current block cannot satisfy the request, it is replaced with a regular deallocation and allocation.

Limitations

The Watson2 malloc subsystem is adaptive to the application and does not require any further options, but the Watson2 malloc subsystem supports the following debug features that are controlled by the MALLOCDEBUG variable: validate_ptrs, report_allocations, and trace. The reports that are related to allocations can be redirected to a file by using the output:<filename> option. See Debug malloc tool for detailed information on the MALLOCDEBUG variable.