by Doug Lea
[A German adaptation and translation of this article appears in unix/mail, December 1996. This article is now out of date, and does not reflect details of the current version of malloc.]
Introduction
Memory allocators form interesting case studies in the engineering of infrastructure software. I started writing one in 1987, and have maintained and evolved it (with the help of many volunteer contributors) ever since. This allocator provides implementations of the standard C routines malloc(), free(), and realloc(), as well as a few auxiliary utility routines. The allocator has never been given a specific name. Most people just call it Doug Lea's Malloc, or dlmalloc for short.
The code for this allocator has been placed in the public domain (available from ftp://g.oswego.edu/pub/misc/malloc.c), and is apparently widely used: It serves as the default native version of malloc in some versions of Linux; it is compiled into several commonly available software packages (overriding the native malloc), and has been used in various PC environments as well as in embedded systems, and surely many other places I don’t even know about.
I wrote the first version of the allocator after writing some C++ programs that almost exclusively relied on allocating dynamic memory. I found that they ran much more slowly and/or with much more total memory consumption than I expected them to. This was due to characteristics of the memory allocators on the systems I was running on (mainly the then-current versions of SunOS and BSD). To counter this, at first I wrote a number of special-purpose allocators in C++, normally by overloading operator new for various classes. Some of these are described in a paper on C++ allocation techniques that was adapted into the 1989 C++ Report article Some storage allocation techniques for container classes.
However, I soon realized that building a special allocator for each new class that tended to be dynamically allocated and heavily used was not a good strategy when building the kinds of general-purpose programming support classes I was writing at the time. (From 1986 to 1991, I was the primary author of libg++, the GNU C++ library.) A broader solution was needed: to write an allocator that was good enough under normal C++ and C loads so that programmers would not be tempted to write special-purpose allocators except under very special conditions.
This article presents a description of some of the main design goals, algorithms, and implementation considerations for this allocator. More detailed documentation can be found with the code distribution.
Goals
A good memory allocator needs to balance a number of goals:
Maximizing Compatibility
An allocator should be plug-compatible with others; in particular it should obey ANSI/POSIX conventions.
Maximizing Portability
Reliance on as few system-dependent features (such as system calls) as possible, while still providing optional support for other useful features found only on some systems; conformance to all known system constraints on alignment and addressing rules.
Minimizing Space
The allocator should not waste space: It should obtain as little memory from the system as possible, and should maintain memory in ways that minimize fragmentation, that is, ``holes'' in contiguous chunks of memory that are not used by the program.
Minimizing Time
The malloc(), free() and realloc() routines should be as fast as possible in the average case.
Maximizing Tunability
Optional features and behavior should be controllable by users either statically (via #define and the like) or dynamically (via control commands such as mallopt).
Maximizing Locality
Allocating chunks of memory that are typically used together near each other. This helps minimize page and cache misses during program execution.
Maximizing Error Detection
It does not seem possible for a general-purpose allocator to also serve as a general-purpose memory error testing tool such as Purify. However, allocators should provide some means for detecting corruption due to overwriting memory, multiple frees, and so on.
Minimizing Anomalies
An allocator configured using default settings should perform well across a wide range of real loads that depend heavily on dynamic allocation: windowing toolkits, GUI applications, compilers, interpreters, development tools, network (packet)-intensive programs, graphics-intensive packages, web browsers, string-processing applications, and so on.
Paul Wilson and colleagues have written an excellent survey paper on allocation techniques that discusses some of these goals in more detail. See Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles, ``Dynamic Storage Allocation: A Survey and Critical Review'' in International Workshop on Memory Management, September 1995 (also available via ftp). (Note that the version of my allocator they describe is not the most current one, however.)
As they discuss, minimizing space by minimizing wastage (generally due to fragmentation) must be the primary goal in any allocator.
For an extreme example, among the fastest possible versions of malloc() is one that always allocates the next sequential memory location available on the system, and the corresponding fastest version of free() is a no-op. However, such an implementation is hardly ever acceptable: it will cause a program to run out of memory quickly since it never reclaims unused space. Wastage seen in some allocators used in practice can be almost this extreme under some loads. As Wilson also notes, wastage can be measured monetarily: Considered globally, poor allocation schemes cost people perhaps even billions of dollars in memory chips.
While time-space issues dominate, the set of trade-offs and compromises is nearly endless. Here are just a few of the many examples:
- Accommodating worst-case alignment requirements increases wastage by forcing the allocator to skip over bytes in order to align chunks.
- Most provisions for dynamic tunability (such as setting a debug mode) can seriously impact time efficiency by adding levels of indirection and increasing numbers of branches.
- Some provisions designed to catch errors limit range of applicability. For example, prior to version 2.6.6, regardless of platform, malloc internally handled allocation size arguments as if they were signed 32-bit integers, and treated nonpositive arguments as if they were requests for a size of zero. (However, as of V2.6.6, negative arguments result in null failure returns, to comply with POSIX standards.)
- Accommodating the oddities of other allocators to remain plug-compatible with them can reduce flexibility and performance. For the oddest example, some early versions of Unix allocators allowed programmers to realloc memory that had already been freed. Until 1993, I allowed this for the sake of compatibility. (However, no one at all complained when this ``feature'' was dropped.)
- Some (but by no means all) heuristics that improve time and/or space for small programs cause unacceptably worse time and/or space characteristics for larger programs that dominate the load on typical systems these days.
No set of compromises along these lines can be perfect. However, over the years, the allocator has evolved to make trade-offs that the majority of users find to be acceptable. The driving forces that continue to impact the evolution of this malloc include:
- Empirical studies of malloc performance by others (including the above-mentioned paper by Wilson et al, as well as others that it in turn cites). These papers find that versions of this malloc increasingly rank as simultaneously among the most time- and space-efficient memory allocators available. However, each reveals weaknesses or opportunities for further improvements.
- Changes in target workloads. The nature of the kinds of programs that are most sensitive to malloc implementations continually change. For perhaps the primary example, the memory characteristics of X and other windowing systems increasingly dominate.
- Changes in systems and processors. Implementation details and fine-tunings that try to make code readily optimizable for typical processors change across time. Additionally, operating systems (including Linux and Solaris) have themselves evolved, for example to make memory mapping an occasionally-wise choice for system-level allocation.
- Suggestions, experience reports, and code from users and contributors. The code has evolved with the help of several regular volunteer contributors. The majority of recent changes were instigated by people using the version supplied in Linux, and were implemented in large part by Wolfram Gloger for the Linux version and then integrated by me.
Algorithms
The two core elements of the malloc algorithm have remained unchanged since the earliest versions:
Boundary Tags
Chunks of memory carry around with them size information fields both before and after the chunk. This allows for two important capabilities:
- Two bordering unused chunks can be coalesced into one larger chunk. This minimizes the number of unusable small chunks.
- All chunks can be traversed starting from any known chunk in either a forward or backward direction.
The original versions implemented boundary tags exactly in this fashion. More recent versions omit trailer fields on chunks that are in use by the program. This is itself a minor trade-off: The fields are never used while chunks are active, so they need not be present. Eliminating them decreases overhead and wastage. However, the lack of these fields weakens error detection a bit by making it impossible to check whether users mistakenly overwrite fields that should have known values.
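The in-memory layout implied by boundary tags can be sketched as a small C structure. The field and macro names below are illustrative, not dlmalloc's actual identifiers, and keeping the in-use flag in the low bit of the size is a simplification:

```c
#include <stddef.h>

/* Hypothetical chunk layout with boundary tags (names are illustrative). */
struct chunk {
    size_t prev_size;   /* size of the previous chunk; valid only if it is free */
    size_t size;        /* this chunk's size; low bit tags "previous in use"    */
    struct chunk *fd;   /* forward bin link, meaningful only while free         */
    struct chunk *bk;   /* backward bin link, meaningful only while free        */
};

/* With sizes recorded at both borders, either neighbor of any chunk is
   reachable by plain pointer arithmetic, enabling coalescing and traversal. */
#define NEXT_CHUNK(p) ((struct chunk *)((char *)(p) + ((p)->size & ~(size_t)1)))
#define PREV_CHUNK(p) ((struct chunk *)((char *)(p) - (p)->prev_size))
```

Given two adjacent chunks, coalescing amounts to following these links and summing the size fields.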
Binning
Available chunks are maintained in bins, grouped by size. There are a surprisingly large number (128) of fixed-width bins, approximately logarithmically spaced in size. Bins for sizes less than 512 bytes each hold only exactly one size (spaced 8 bytes apart, simplifying enforcement of 8-byte alignment). Searches for available chunks are processed in smallest-first, best-fit order. As shown by Wilson et al, best-fit schemes (of various kinds and approximations) tend to produce the least fragmentation on real loads compared to other general approaches such as first-fit.
Until the versions released in 1995, chunks were left unsorted within bins, so that the best-fit strategy was only approximate. More recent versions instead sort chunks by size within bins, with ties broken by an oldest-first rule. (This was done after finding that the minor time investment was worth it to avoid observed bad cases.)
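The size-to-bin mapping described above can be sketched as follows. The constants and the function name are assumptions for illustration (dlmalloc's real index computation differs in detail), but the shape matches the text: exact 8-byte-spaced bins below 512 bytes, roughly logarithmic spacing above:

```c
#include <stddef.h>

/* Illustrative size-to-bin index: exact bins below 512 bytes (8-byte
   spacing), approximately logarithmic spacing above. A simplified
   sketch, not dlmalloc's actual computation. */
static int bin_index(size_t size)
{
    if (size < 512)
        return (int)(size >> 3);   /* one bin per 8-byte step: 16->2, 24->3, ... */

    int idx = 64;                  /* first of the larger, log-spaced bins */
    size_t s = size >> 9;          /* 512 -> 1, 1024 -> 2, ... */
    while (s > 1 && idx < 120) {   /* cap so the index stays within 128 bins */
        s >>= 1;
        idx += 8;                  /* a few bins per power of two (simplified) */
    }
    return idx;
}
```

Because every small bin holds exactly one size, the exact-bin lookup is a single shift; only the larger bins ever need a best-fit scan within the bin.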
Thus, the general categorization of this algorithm is best-first with coalescing: Freed chunks are coalesced with neighboring ones, and held in bins that are searched in size order.
This approach leads to fixed bookkeeping overhead per chunk. Because both size information and bin links must be held in each available chunk, the smallest allocatable chunk is 16 bytes in systems with 32-bit pointers and 24 bytes in systems with 64-bit pointers. These minimum sizes are larger than most people would like to see; they can lead to significant wastage, for example in applications allocating many tiny linked-list nodes. However, the 16 bytes minimum at least is characteristic of any system requiring 8-byte alignment in which there is any malloc bookkeeping overhead.
This basic algorithm can be made to be very fast. Even though it rests upon a search mechanism to find best fits, the use of indexing techniques, exploitation of special cases, and careful coding lead to average cases requiring only a few dozen instructions, depending of course on the machine and the allocation pattern.
While coalescing via boundary tags and best-fit via binning represent the main ideas of the algorithm, further considerations lead to a number of heuristic improvements. They include locality preservation, wilderness preservation, memory mapping, and caching.
Locality preservation
Chunks allocated at about the same time by a program tend to have similar reference patterns and coexistent lifetimes. Maintaining locality minimizes page faults and cache misses, which can have a dramatic effect on performance on modern processors. If locality were the only goal, an allocator might always allocate each successive chunk as close to the previous one as possible. However, this nearest-fit (often approximated by next-fit) strategy can lead to very bad fragmentation. In the current version of malloc, a version of next-fit is used only in a restricted context that maintains locality in those cases where it conflicts the least with other goals: If a chunk of the exact desired size is not available, the most recently split-off space is used (and resplit) if it is big enough; otherwise best-fit is used. This restricted use eliminates cases where a perfectly usable existing chunk fails to be allocated, thus eliminating at least this form of fragmentation. And, because this form of next-fit is faster than best-fit bin-search, it speeds up the average malloc.
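The restricted search order just described can be sketched as a toy decision function. The names and the simplified inputs are assumptions for illustration, not dlmalloc's interface:

```c
#include <stddef.h>

enum source { FROM_EXACT_BIN, FROM_LAST_REMAINDER, FROM_BEST_FIT, FROM_SYSTEM };

/* Toy model of the restricted next-fit policy: exact fit first, then the
   most recently split-off remainder (preserving locality), then best fit,
   and finally more memory from the system. Illustrative only. */
static enum source choose_source(int exact_bin_nonempty,
                                 size_t last_remainder_size,
                                 size_t best_fit_size,
                                 size_t request)
{
    if (exact_bin_nonempty)
        return FROM_EXACT_BIN;        /* perfect fit: no waste, no split */
    if (last_remainder_size >= request)
        return FROM_LAST_REMAINDER;   /* fast, and keeps nearby chunks together */
    if (best_fit_size >= request)
        return FROM_BEST_FIT;         /* smallest free chunk that fits */
    return FROM_SYSTEM;               /* nothing usable: obtain more memory */
}
```

The key design point is that the locality-friendly remainder is consulted only after the exact-fit check, so a perfectly sized existing chunk is never passed over.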
Wilderness Preservation
The ``wilderness'' (so named by Kiem-Phong Vo) chunk represents the space bordering the topmost address allocated from the system. Because it is at the border, it is the only chunk that can be arbitrarily extended (via sbrk in Unix) to be bigger than it is (unless of course sbrk fails because all memory has been exhausted).
One way to deal with the wilderness chunk is to handle it in about the same way as any other chunk. (This technique was used in most versions of this malloc until 1994.) While this simplifies and speeds up implementation, without care it can lead to some very bad worst-case space characteristics: Among other problems, if the wilderness chunk is used when another available chunk exists, you increase the chances that a later request will cause an otherwise preventable sbrk.
A better strategy is currently used: treat the wilderness chunk as ``bigger'' than all others, since it can be made so (up to system limitations) and use it as such in a best-first scan. This results in the wilderness chunk always being used only if no other chunk exists, further avoiding preventable fragmentation.
Memory Mapping
In addition to extending general-purpose allocation regions via sbrk, most versions of Unix support system calls such as mmap that allocate a separate non-contiguous region of memory for use by a program. This provides a second option within malloc for satisfying a memory request. Requesting and returning a mmapped chunk can further reduce downstream fragmentation, since a released memory map does not create a ``hole'' that would need to be managed. However, because of built-in limitations and overheads associated with mmap, it is only worth doing this in very restricted situations. For example, in all current systems, mapped regions must be page-aligned. Also, invoking mmap and mfree is much slower than carving out an existing chunk of memory. For these reasons, the current version of malloc relies on mmap only if (1) the request is greater than a (dynamically adjustable) threshold size (currently 1MB by default) and (2) the space requested is not already available in the existing arena, so it would have to be obtained via sbrk.
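The two-part condition above can be sketched directly. The function name, parameters, and the threshold constant are illustrative assumptions, not dlmalloc's actual identifiers:

```c
#include <stddef.h>

#define DEFAULT_MMAP_THRESHOLD (1024 * 1024)  /* 1MB default; dynamically adjustable */

/* Sketch of the decision described above: use mmap only when the request
   exceeds the threshold AND the existing arena cannot satisfy it. */
static int should_mmap(size_t request, size_t largest_free_chunk,
                       size_t mmap_threshold)
{
    return request >= mmap_threshold          /* (1) big enough to be worth it */
        && largest_free_chunk < request;      /* (2) arena cannot serve it anyway */
}
```

Condition (2) matters: even a huge request is served from the arena when a suitable chunk already exists, avoiding the page-alignment waste and syscall cost of a map.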
In part because mmap is not always applicable in most programs, the current version of malloc also supports trimming of the main arena, which achieves one of the effects of memory mapping: releasing unused space back to the system. When long-lived programs contain brief peaks where they allocate large amounts of memory, followed by longer valleys where they have more modest requirements, system performance as a whole can be improved by releasing unused parts of the wilderness chunk back to the system. (In nearly all versions of Unix, sbrk can be used with negative arguments to achieve this effect.) Releasing space allows the underlying operating system to cut down on swap space requirements and reuse memory mapping tables. However, as with mmap, the call itself can be expensive, so it is only attempted if the trailing unused memory exceeds a tunable threshold.
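Trimming can be sketched as a threshold check. The names, the one-page reserve, and the return convention are illustrative assumptions:

```c
#include <stddef.h>

/* Sketch of arena trimming: if unused space at the top of the heap exceeds
   a tunable threshold, compute how much to give back (the caller would then
   shrink the heap, e.g. with a negative sbrk on Unix). Illustrative only. */
static size_t trim_amount(size_t top_free, size_t trim_threshold,
                          size_t pagesize)
{
    if (top_free <= trim_threshold)
        return 0;                     /* not enough waste to justify a syscall */
    /* release whole pages only, keeping one page in reserve at the top */
    return (top_free - pagesize) / pagesize * pagesize;
}
```

The threshold keeps the expensive system call out of the common path; only sustained valleys after a peak trigger a trim.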
Caching
In the most straightforward version of the basic algorithm, each freed chunk is immediately coalesced with neighbors to form the largest possible unused chunk. Similarly, chunks are created (by splitting larger chunks) only when explicitly requested.
Operations to split and to coalesce chunks take time. This time overhead can sometimes be avoided by using either or both of two caching strategies:
Deferred Coalescing
Rather than coalescing freed chunks, leave them at their current sizes in hopes that another request for the same size will come along soon. This saves a coalesce, a later split, and the time it would take to find a non-exactly-matching chunk to split.
Preallocation
Rather than splitting out new chunks one by one, pre-split many at once. This is normally faster than doing it one at a time.
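Preallocation can be sketched as carving one large block into fixed-size nodes in a single pass. The structure and function here are illustrative, operating on ordinary memory rather than dlmalloc's internal chunks:

```c
#include <stddef.h>

/* A singly linked free list threaded through the pre-split nodes. */
struct node { struct node *next; };

/* Carve `block` into as many node_size-byte nodes as fit, pushing each onto
   a free list in one pass. node_size must be at least sizeof(struct node)
   and block suitably aligned. Illustrative sketch only. */
static struct node *presplit(void *block, size_t block_size, size_t node_size)
{
    struct node *head = NULL;
    char *p = (char *)block;
    size_t n = block_size / node_size;
    for (size_t i = 0; i < n; i++) {
        struct node *nd = (struct node *)(p + i * node_size);
        nd->next = head;              /* push: no per-node search or split */
        head = nd;
    }
    return head;
}
```

After the single pass, each allocation is just a pop from the list, which is the speedup the text describes.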
Because the basic data structures in the allocator permit coalescing at any time, in any of malloc, free, or realloc, corresponding caching heuristics are easy to apply.
The effectiveness of caching obviously depends on the costs of splitting, coalescing, and searching relative to the work needed to track cached chunks. Additionally, effectiveness less obviously depends on the policy used in deciding when to cache versus coalesce them.
Caching can be a good idea in programs that continuously allocate and release chunks of only a few sizes. For example, if you write a program that allocates and frees many tree nodes, you might decide that it is worth it to cache some nodes, assuming you know of a fast way to do this. However, without knowledge of the program, malloc cannot know whether it would be a good idea to coalesce cached small chunks in order to satisfy a larger request, or whether that larger request should be taken from somewhere else. And it is difficult for the allocator to make more informed guesses about this matter. For example, it is just as costly for an allocator to determine how much total contiguous space would be gained by coalescing chunks as it would be to just coalesce them and then resplit them.
Previous versions of the allocator used a few search-ordering heuristics that made adequate guesses about caching, although with occasionally bad worst-case results. But across time, these heuristics appear to be decreasingly effective under real loads. This is probably because actual programs that rely heavily on malloc increasingly tend to use a larger variety of chunk sizes. For example, in C++ programs, this probably corresponds to a trend for programs to use an increasing number of classes. Different classes tend to have different sizes.
[Note: More recent versions of malloc DO cache, but only small chunks.]
Lookasides
There remains one kind of caching that is highly desirable in some applications but not implemented in this allocator: lookasides for very small chunks. As mentioned above, the basic algorithm imposes a minimum chunk size that can be very wasteful for very small requests. For example, a linked list on a system with 4-byte pointers might allocate nodes holding only, say, two pointers, requiring only 8 bytes. Since the minimum chunk size is 16 bytes, user programs allocating only list nodes suffer 100% overhead.
Eliminating this problem while still maintaining portable alignment would require that the allocator not impose any overhead. Techniques for carrying this out exist. For example, chunks could be checked to see if they belong to a larger aggregated space via address comparisons. However, doing so can impose significant costs; in fact the cost would be unacceptable in this allocator. Chunks are not otherwise tracked by address, so unless arbitrarily limited, checking might lead to random searches through memory. Additionally, support requires the adoption of one or more policies controlling whether and how to ever coalesce small chunks.
Such issues and limitations lead to one of the very few kinds of situations in which programmers should routinely write their own special-purpose memory management routines (for example, in C++, by overloading operator new()). Programs relying on large but approximately known numbers of very small chunks may find it profitable to build very simple allocators. For example, chunks can be allocated out of a fixed array with an embedded freelist, along with a provision to rely on malloc as a backup if the array becomes exhausted. Somewhat more flexibly, these can be based on the C or C++ versions of obstack available with GNU gcc and libg++.
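Such a fixed-array allocator with an embedded freelist and a malloc backup might be sketched as follows; the names, the node type, and the pool size are illustrative assumptions:

```c
#include <stdlib.h>

#define POOL_NODES 1024   /* approximate expected number of nodes (assumed) */

typedef struct list_node {
    struct list_node *next;   /* doubles as the freelist link when free */
    int value;
} list_node;

static list_node pool[POOL_NODES];    /* fixed array of nodes */
static list_node *free_head = NULL;   /* embedded freelist of returned nodes */
static int pool_used = 0;             /* next never-used pool slot */

static list_node *node_alloc(void)
{
    if (free_head) {                  /* reuse a previously freed pool node */
        list_node *n = free_head;
        free_head = n->next;
        return n;
    }
    if (pool_used < POOL_NODES)       /* carve the next fresh slot */
        return &pool[pool_used++];
    return malloc(sizeof(list_node)); /* pool exhausted: fall back to malloc */
}

static void node_free(list_node *n)
{
    if (n >= pool && n < pool + POOL_NODES) {
        n->next = free_head;          /* pool node: push onto the freelist */
        free_head = n;
    } else {
        free(n);                      /* backup node: return it to malloc */
    }
}
```

Because pool nodes carry no per-chunk header, the 100% overhead described above disappears for the common case, at the cost of generality.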
Last modified: Tue Apr 4 06:57:17 EDT 2000
http://gee.cs.oswego.edu/dl/html/malloc.html