nedmalloc is a VERY fast, VERY scalable, multithreaded memory allocator with little memory fragmentation. If you’re running on an older operating system (e.g. Windows XP, Linux 2.4 series, FreeBSD 6 series, Mac OS X 10.4 or earlier) you will probably find it significantly improves your application’s performance (Windows 7, Linux 3.x, FreeBSD 8, Mac OS X 10.6 all contain state-of-the-art allocators and no third party allocator is likely to significantly improve on them in real world results). Unlike other allocators, it is written in C and so can be used anywhere and it also comes under the Boost software license which permits commercial usage.

It has been tested on some very high end hardware with more than eight processing cores and more than 8Gb of RAM. It is in daily use by some of the world’s major banks, root DNS servers, multinational airlines and consumer products (embedded). It also costs no money (though donations are welcome!). Thanks to work generously sponsored by Applied Research Associates, nedmalloc can patch itself into existing binaries to replace the system allocator on Windows - for example, Microsoft Word on Windows XP is noticeably quicker for very large documents after the nedmalloc DLL has been injected into it!

It is more than 125 times faster than the standard Windows XP memory allocator, 4-10 times faster than the standard FreeBSD 6 memory allocator and up to twice as fast as ptmalloc2, the standard Linux memory allocator. It can sustain a minimum of between 7.3m and 8.2m malloc & free pair operations per second on a 3400 (2.20GHz) AMD Athlon64 machine.
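For a rough feel of what a "malloc & free pairs per second" figure measures, here is a minimal single-threaded sketch. It is not the benchmark used for the numbers above (that lives in nedmalloc's own test program); the function name is made up for illustration, and it times the C library's malloc() and free(). With nedmalloc.h included, swapping in nedmalloc() and nedfree() should give the corresponding figure for nedmalloc.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: performs `pairs` malloc+free pairs of `size` bytes
   and returns the measured rate in pairs per second. Returns 0.0 if an
   allocation fails or the run was too short for clock() to resolve. */
double malloc_free_pairs_per_second(size_t pairs, size_t size)
{
    clock_t start = clock();
    for (size_t i = 0; i < pairs; i++) {
        /* volatile discourages the optimiser from deleting the pair */
        void *volatile p = malloc(size);
        if (p == NULL)
            return 0.0;
        free(p);
    }
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? (double)pairs / seconds : 0.0;
}
```

Calling it with, say, ten million pairs of 64 bytes gives a single-thread number only; the scalability claims on this page concern many threads allocating at once, which this sketch does not exercise.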

It scales with extra CPUs far better than either the standard Windows XP memory allocator or ptmalloc2 and can cause significantly less memory bloating than ptmalloc2. It avoids processor serialisation (locking) entirely when the requested memory size is in the thread cache, leading to the kind of scalability you can see in the graph on the right. In real world code:

ned Productions - nedmalloc - Figure 1

Allocator          Memory Mapped   Packetised   nedmalloc’s Improvement
Win32 (default)    123.72          46.29        45.38%
nedmalloc v1.02    179.87          71.3         -
nedmalloc v1.01    172.47          67.9         4.29%
Win32 (low frag)   164.28          58.74        9.49%
ptmalloc2          167.41          63.46        7.44%
Hoard v3.4         167.4           64.65        7.45%
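The Improvement column can be reproduced from the Memory Mapped figures: it is the percentage by which nedmalloc v1.02's score exceeds the other allocator's score. A small sketch of that arithmetic (the function name is made up for illustration):

```c
/* Percentage by which `best` exceeds `other`,
   e.g. nedmalloc v1.02's score versus another allocator's score. */
double improvement_percent(double best, double other)
{
    return (best - other) / other * 100.0;
}
```

For example, improvement_percent(179.87, 123.72) comes to roughly 45.38, matching the Win32 (default) row. Applying the same formula to the Packetised column for that row (71.3 against 46.29) gives roughly 54%.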

If you want an explanation of the difference between the Packetised and Memory Mapped benchmarks, please see the Tn homepage (but basically, the Packetised involves performing a lot more memory ops in a more loaded multithreaded environment). As you can see above, the benefits of nedmalloc translate into real world code with more than a 50% speed increase over the default win32 allocator. The Tn speed test is very heavy on the memory bus, so you can expect your own applications to see greater improvements than this.
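The lock-free thread cache path mentioned earlier is the main reason for this kind of scaling. The sketch below shows the general idea only and is not nedmalloc's implementation: each thread keeps its own free lists, so the common case touches no shared state and needs no lock. The names, the 16-byte minimum bin, and the requirement that the caller passes the size back on free are all simplifying assumptions; the 8192-byte limit mirrors the 8Kb default mentioned on this page.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_BINS 10   /* bin b holds blocks of (16 << b) bytes: 16 ... 8192 */

typedef struct free_block { struct free_block *next; } free_block;

/* One set of free lists per thread (C11 thread-local storage),
   so no locking is needed to push or pop. */
static _Thread_local free_block *bins[NUM_BINS];

/* Returns the bin index for a request, or -1 if it is too big to cache. */
static int bin_for(size_t size)
{
    size_t cap = 16;
    for (int b = 0; b < NUM_BINS; b++, cap <<= 1)
        if (size <= cap)
            return b;
    return -1;
}

void *cached_alloc(size_t size)
{
    int b = bin_for(size);
    if (b < 0)
        return malloc(size);            /* too big: use the shared allocator */
    if (bins[b] != NULL) {              /* fast path: pop from this thread's list */
        free_block *blk = bins[b];
        bins[b] = blk->next;
        return blk;
    }
    return malloc((size_t)16 << b);     /* cache miss: allocate a full-size block */
}

void cached_free(void *p, size_t size)
{
    if (p == NULL)
        return;
    int b = bin_for(size);
    if (b < 0) {
        free(p);
        return;
    }
    free_block *blk = p;                /* fast path: push onto this thread's list */
    blk->next = bins[b];
    bins[b] = blk;
}
```

A real thread cache also has to trim itself and hand blocks back to the underlying allocator; this sketch never does, so cached blocks stay with the thread that freed them.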

See below for a Frequently Asked Questions list. Below and to the right is a series of comparisons between nedmalloc, system allocators and a number of other replacement memory allocators such as tcmalloc and Hoard. The graphs below are for v1.00 and still give a good idea of performance on a wide variety of systems, but note that nedmalloc has become much faster in recent revisions (as you can see on the right).

ned Productions - nedmalloc - Figure 2

The next generation of memory allocator: the v1.2x series

Since v1.10, and given the outstanding default performance of the Windows 7, Apple Mac OS X 10.6 and FreeBSD 7+ system allocators, nedmalloc has taken a different approach to improve performance: it has begun to implement changes to the 1970s malloc API and kernel VM design, both of which increasingly constrain performance on modern systems.

To my knowledge, nedmalloc is among the fastest portable memory allocators available, and it has many features and outstanding configurability useful in themselves. However it cannot consistently beat the excellent system allocators in Windows 7, Apple Mac OS X 10.6+ or FreeBSD 7+ (and neither can any other allocator I know of in real world testing). It isn’t any slower than these allocators, but for now we have plateaued with current API and VM design.

For a next generation API design allocator, see the C1X change proposal N1527 at http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1527.pdf. Two reference C implementations of N1527 are also available at http://github.com/ned14/C1X_N1527. This proposed API substantially reduces whole program memory allocation latencies, and the ISO C1X committee have not rejected the idea in principle (they are currently considering whether to make it into a Technical Specification).

v1.10 beta 1 had a first attempt at an improved malloc API. N1527 introduced a second attempt, and resulting from the feedback from the March 2011 ISO C1X committee meeting in London, v1.20 intends to introduce a third attempt at getting the API right. The committee has had an idea of attributed arenas, so basically one creates memory pools which have certain configurable characteristics. This is fairly complex, but solves a whole load of problems present and future at once.

For an example of a next generation VM design allocator (which by the way the new malloc API allows you to use directly through the alignment and size rounding pool attributes i.e. you set both to the page size), you can try the user mode page allocator in nedmalloc v1.10 (Windows Vista or later only). It opens a whole new world of performance and scalability, but requires Administrator privileges to run. Want to know more? Here are two academic papers on the subject:

  1. Douglas, N, (2011-May), ‘User Mode Memory Page Management: An old idea applied anew to the memory wall problem’, ArXiv e-prints, vol: 1105.1815.
  2. Douglas, N, (2011-May), ‘User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation?’, ArXiv e-prints, vol: 1105.1811.
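To make the idea of pool attributes a little more concrete, here is a sketch of what "set both the alignment and the size rounding to the page size" amounts to. Everything in it is hypothetical: the struct, its field names and pool_alloc() are invented for illustration and are not the N1527 or nedmalloc API, and it simply forwards to C11 aligned_alloc() rather than to a page allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical attribute set for illustration only; these names are not
   taken from N1527 or from nedmalloc. Both fields must be powers of two. */
typedef struct {
    size_t alignment;      /* every block starts on a multiple of this */
    size_t size_rounding;  /* every request is rounded up to a multiple of this */
} pool_attrs;

/* Rounds size up to a multiple of unit (unit must be a power of two).
   The caller must ensure size + unit - 1 does not overflow. */
size_t round_up_to(size_t size, size_t unit)
{
    return (size + unit - 1) & ~(unit - 1);
}

void *pool_alloc(const pool_attrs *attrs, size_t size)
{
    /* The larger of two powers of two is a multiple of the smaller, so
       rounding to it satisfies both attributes at once. */
    size_t unit = attrs->alignment > attrs->size_rounding
                      ? attrs->alignment : attrs->size_rounding;
    if (size == 0)
        size = 1;
    if (size > SIZE_MAX - (unit - 1))
        return NULL;                      /* rounding would overflow */
    /* C11 aligned_alloc wants the size to be a multiple of the alignment. */
    return aligned_alloc(attrs->alignment, round_up_to(size, unit));
}
```

With both attributes set to an assumed page size of 4096, a 100 byte request becomes one whole, page-aligned page. Real code would query the page size from the operating system rather than hard-code it.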

Downloads:

ned Productions - nedmalloc - Figure 3

ChangeLog (from GIT). GIT HEAD (both are identical mirrors):

Current bleeding edge: v1.10 beta 4 in GIT HEAD.

Current betas: Beta 3 of v1.10 (455Kb). You should use this in preference to any other (it’s a very mature beta).

Previous:

  • Beta 2 of v1.06 (svn 1159) (963Kb)
  • Beta 1 of v1.06 (svn 1151) (957Kb)
  • v1.05 (svn 1078) of nedmalloc (80Kb)
  • v1.04 (svn 1040) of nedmalloc (80Kb)
  • v1.03 of nedmalloc (76.4Kb)
  • v1.02 of nedmalloc (76.3Kb)
  • v1.01 of nedmalloc (71.9Kb)
  • v1.00 of nedmalloc (69.7Kb)

Changes last few releases:

v1.10 beta 3 17th July 2012:

  • [master 5f26c1a] Due to a bug introduced in sha 7a9dd5c (17th April 2010), nedmalloc has never allocated more than a single mspace when using the system pool. This had effectively disabled concurrency for any allocation > THREADCACHEMAX (8Kb), which no doubt made nedmalloc v1.10 betas 1 and 2 appear no faster than system allocators. My thanks to the eagle eyes of Gavin Lambert for spotting this.

v1.10 beta 2 10th July 2012:

  • [master 51ab2a2] scons now tests for C++0x support before turning it on and tries multiple libraries for clock_gettime() rather than assuming it lives in librt. This ought to fix miscompilation on Mac OS X. Thanks to Robert D. Blanchet Jr. for reporting this.
  • [master b2c3517] Mac defines malloc_size to take const void *ptr, not void *ptr
  • [master 9333e50] Updated to use the new O(1) Cfind(rounds=1) feature in nedtries
  • [master 54c7e44] Avoid overflowing allocation size. Thanks to Xi Wang for supplying a patch fixing this.
  • [master 5b614a0] Removed try1 and finally1 from MinGW support as x64 target no longer supports SEH. Thanks to Geri for reporting this.
  • [master 48f1aa9] Tidied up bitrot which had broken compilation due to mismatched #if…#endif.
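The changelog entry above about avoiding an overflowing allocation size does not say where the overflow was, so the following is only a generic sketch of that class of fix, not Xi Wang's patch. It shows the usual guard for a count-times-size computation; the function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for `count` elements of `size` bytes each, refusing
   requests whose total byte count would wrap around size_t. */
void *checked_array_alloc(size_t count, size_t size)
{
    /* count * size overflows exactly when count > SIZE_MAX / size */
    if (size != 0 && count > SIZE_MAX / size)
        return NULL;
    return malloc(count * size);
}
```

Without the guard, a wrapped product would produce a block far smaller than the caller believes it has, which is how this kind of bug turns into memory corruption.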

v1.10 beta 1 19th May 2011:

  • [master 89f1806] Moved from SVN to GIT. Bumped version to v1.10 as new ARA contract will involve significant further improvements mainly centering around realloc() performance.
  • [master 254fe7c] Added nedmemsize() for API compatibility with other allocators. Added DEFAULTMAXTHREADSINPOOL and set it to FOUR which is a BREAKING CHANGE from previous versions of nedalloc (which set it to 16).
  • [nedmalloc_fast_realloc 97d1420] Added win32mremap() implementation.
  • [nedmalloc_fast_realloc 8a1001e] Significantly improved test.c with new test options TESTCPLUSPLUS, BLOCKSIZE, TESTTYPE and MAXMEMORY.
  • [nedmalloc_fast_realloc 7ea606d] Implemented two variants of direct mremap() on Windows, one using file mappings and the other using over-reservation. The former is used on 32 bit and the latter on 64 bit.
  • [nedmalloc_fast_realloc 26ff9a7] Added the malloc2() interface to nedalloc.
  • [nedmalloc_fast_realloc 5bc5d97] Rewrote Readme.txt to become Readme.html which makes it much clearer to read.
  • [nedmalloc_fast_realloc 2efa595] Added doxygen markup to nedmalloc.h and a first go at a policy driven STL allocator class.
  • [nedmalloc_fast_realloc d851bde] Added a CHM documenting the nedalloc API.
  • [nedmalloc_fast_realloc dbd3991] Added a fast malloc operations logger which outputs a CSV log on process exit.
  • [nedmalloc_fast_realloc d6a8585] Added stack backtracing to the logger.
  • [master c7ea06d] Finished user mode page allocator, so merged nedmalloc_fast_realloc branch.
  • [master 9a8800f] Fixed small bug which was preventing the windows patcher from correctly finding the proper MSVCRT.
  • [master 37c58b1] Fixed leak of mutexes when using pthread or win32 mutexes as locks. Thanks to Gavin Lambert for reporting this.
  • [master f67e284] Fixed nedflushlogs() not actually flushing data and/or causing a segfault. Thanks to Roman Tatkin for reporting this.
  • [master 1324bf3] Finally got round to retiring the MSVC project files as they were sources of never ending hassle due to being out of sync with the SConstruct config. Rebuilt scons build system to be fully compatible with MSVC instead (long overdue!)
  • [master 068494e] As the release of v1.10 RC1 approaches, fixed a long standing problem with the binary patcher where multiple MSVCRT versions in the process weren’t handled - everything was sent to one MSVCRT only, and needless to say that sorta worked sometimes and sometimes not. Now when nedmalloc passes a foreign block to the system allocator, it runs a stack backtrace to figure out what MSVCRT in the process it ought to pass it to. It’s slow, but fixes a very common segfault on process exit on VS2010.
  • [master 4cca52c] Very embarrassingly, nedmalloc has been severely but unpredictably broken on POSIX for over a year now when built with DEBUG defined. This was turning on DEFAULT_GRANULARITY_ALIGNED whose POSIX implementation was causing random segfaults so mysterious that neither gdb nor valgrind could pick them up - in other words, the very worst kind of memory corruption: undetectable, untraceable and undebuggable. I only found them myself due to a recent bug report for TnFOX on POSIX where due to luck, very recent Linux kernels just happened by pure accident to cause this bug to manifest itself as preventing process init right at the very start - so early that no debugger could attach. After over a week of trial & error I narrowed it down to being somewhere in nedmalloc, then having something to do with DEBUG being defined or not, then two hours ago the eureka moment arrived and I quite literally did a jig around the room in joy. Problem is now fixed thank the heavens!!!
  • [master 3d55a01] Fixed a problem where the binary patcher was early outing too soon and therefore failing to patch all the binaries properly. It would seem that the Microsoft linker doesn’t sort the import table like I had thought it did - I would guess it sorts per DLL location, otherwise is unsorted. Thanks to Roman Tatkin for reporting this bug.
  • [master 6c74071] Added override of _GNU_SOURCE for when HAVE_MREMAP is auto-detected. Thanks to Maxim Zakharov for reporting this issue.
  • [master dee2d27] Marked off the v2 malloc API as deprecated in preparation for beta release. Updated CHM documentation.

Frequently Asked Questions:

  1. When should I replace my memory allocator? If you want your program to run at the maximum possible speed on operating systems before Windows 7, Apple Mac OS X 10.6, FreeBSD 7 or Linux kernel 3.x, you should consider replacing your memory allocator. Fixing up your code to use a new memory allocator is usually easy for most C and C++ projects, but can become tricky if you must maintain compatibility with your system allocator (you must tag each memory block so you can discern between what has been allocated by the system and your custom allocator). If you are running on Windows then nedmalloc can binary patch existing binaries thus avoiding the need to recompile.
  2. Is nedmalloc faster than all other memory allocators? No, there are faster ones, especially for specialised circumstances e.g. tcmalloc. However, nedmalloc is an excellent general-purpose allocator and it is based on dlmalloc, one of the most tried & tested memory allocators available as it is the core allocator in Linux. If you use nedmalloc, you will never be far from the best performing specialised allocator. As you might note in the real world benchmarks above, you get severely diminishing returns to allocator improvement once they get into a certain performance range.
  3. How space-efficient is nedmalloc? dlmalloc does not fragment the memory space as much as other allocators, but it does have a sixteen or thirty-two byte minimum allocation with an eight or sixteen byte granularity. nedmalloc’s thread cache is a simple power-of-two allocator which does cause bloating for items small enough to enter the thread cache (by default, 8Kb or less) but in general, this wastage across the entire program is small. You can configure nedmalloc to use finer grained bins to quarter the average wastage but this comes at a performance cost. When configured to only permit one memory space per thread, memory bloating is considerably less than that of ptmalloc2.
  4. Is tcmalloc better or worse than nedmalloc? As you can see in the graph above, nedmalloc is about equal to tcmalloc for threadcache-only ops and substantially beats it for non-threadcache ops. nedmalloc is also written in C rather than C++ and v0.5 of tcmalloc only works on Unix systems and not win32. tcmalloc achieves its speed by not doing free space coalescing (free space reclamation is one of the slowest parts of any allocator, and is rarely constant time) and simply decommits unused 4Kb pages instead. That means that in a 32 bit process, address space exhaustion is a real concern with tcmalloc, and even in a 64 bit process certain allocation patterns can keep expanding address space consumption indefinitely, all of which requires extra kernel memory to track (i.e. it’s a form of slow memory leak). Therefore consider carefully whether tcmalloc is right for your particular application.
  5. Is Hoard better or worse than nedmalloc? As of v1.01, nedmalloc is close enough to Hoard to make little difference in real world code (see real world benchmarks above). nedmalloc’s synthetic test seems to trigger a bug in Hoard causing dismal performance, however I trust its author and its design enough to say that Hoard may be slightly faster in certain circumstances, e.g. if code allocates a large block in one thread and frees it in another. However, Hoard is licensed under the GPL unless you pay, which is not the case with nedmalloc.
  6. Is ptmalloc3 better or worse than nedmalloc? ptmalloc3 is also a new implementation of ptmalloc2 and is also based on a newer dlmalloc. ptmalloc3 currently outperforms nedmalloc for a low number of threads especially on uniprocessor hardware, but on dual processor and above or with a lot of threads nedmalloc is faster. nedmalloc also runs fine on Windows whereas ptmalloc3 would (to my knowledge) require extra support code.
  7. Is jemalloc better or worse than nedmalloc? Good question! There are many similarities between the designs, and like nedmalloc, jemalloc keeps changing its internals over time so whatever I say here is likely out of date! Last time I looked, jemalloc uses red-black trees internally which are considerably slower than binary bitwise trees. On the other hand, jemalloc has the big advantage of a fully integrated threadcache whereas nedmalloc’s is literally bolted on top of dlmalloc and its lack of integration does cost a few percent of performance (but eases my maintenance). jemalloc allocates small blocks more tightly and therefore wastes less memory, but this can introduce cache line sloshing when multiple CPU cores are writing to the same cache line. jemalloc is generally developed on Linux and Mac OS X first and Windows after, whereas I’d target Windows first due to its popularity and the others after. nedmalloc definitely is more experimental with C1X N1527 support (though I’d love if Jason added this too - hint hint!). In short, I doubt you’ll find ANY performance difference in real world code.
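As a footnote to question 3 above, the bloating caused by a power-of-two thread cache is easy to quantify. The sketch below rounds a request up to the next power of two and reports the bytes lost to rounding. The 16 byte minimum is an assumption made for the example, not a documented nedmalloc constant, and both function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds a request up to the next power of two, with an assumed 16 byte
   minimum. Requests above the largest representable power of two are
   returned unchanged rather than looping forever. */
size_t round_up_pow2(size_t size)
{
    size_t cap = 16;
    while (cap < size && cap <= SIZE_MAX / 2)
        cap <<= 1;
    return cap < size ? size : cap;
}

/* Bytes lost to rounding for a single request. */
size_t wasted_bytes(size_t size)
{
    return round_up_pow2(size) - size;
}
```

A 100 byte request lands in a 128 byte bin and wastes 28 bytes; the worst case is a request just over a power of two, such as 4097 bytes, which takes an 8192 byte block and wastes 4095. Finer grained bins narrow that gap at the cost of more bins to search and manage.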

Source: https://www.nedprod.com/programs/portable/nedmalloc/index.html