Linux HugePages Configuration and Its Effect on Oracle Performance -- cnDBA.cn

1. HugePages Overview

1.1 Introduction to HugePages

HugePages is a feature integrated into the Linux kernel as of release 2.6. It provides an alternative to the default 4K page size (16K on IA64) by offering larger pages.

Several terms are related to HugePages:

（1） Page Table: A page table is the data structure of a virtual memory system in an operating system to store the mapping between virtual addresses and physical addresses. This means that on a virtual memory system, the memory is accessed by first accessing a page table and then accessing the actual memory location implicitly.

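-- The page table is the data structure that an operating system's virtual memory system uses to store the mapping between virtual and physical addresses. On such a system every memory access first consults the page table and is then redirected, implicitly, to the physical memory location given by that mapping.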

（2） TLB: A Translation Lookaside Buffer (TLB) is a buffer (or cache) in a CPU that contains parts of the page table. This is a fixed size buffer being used to do virtual address translation faster.

-- The TLB (Translation Lookaside Buffer) is a fixed-size buffer (cache) in the CPU. It holds part of the page table so that virtual addresses can be translated quickly.

（3） hugetlb: This is an entry in the TLB that points to a HugePage (a page larger than the regular 4K and predefined in size). HugePages are implemented via hugetlb entries, i.e. we can say that a HugePage is handled by a "hugetlb page entry". The "hugetlb" term is also (and mostly) used synonymously with a HugePage (See Note 261889.1). In this document the term "HugePage" is going to be used, but keep in mind that "hugetlb" mostly refers to the same concept.

-- hugetlb is a TLB entry that points to a HugePage (a page larger than the regular 4K page, of a predefined size). HugePages are implemented through hugetlb entries, so a HugePage can be said to be handled by a hugetlb page entry. In MOS Note 261889.1 the two terms are treated as nearly the same concept.

（4） hugetlbfs: This is a new in-memory filesystem, similar to tmpfs, introduced with the 2.6 kernel. Pages allocated on a hugetlbfs type filesystem are allocated in HugePages.

-- hugetlbfs is a new in-memory filesystem introduced with the 2.6 kernel, much like tmpfs.

1.2 Common Misconceptions

WRONG: HugePages is a method to be able to use large SGA on 32-bit VLM systems
RIGHT: HugePages is a method to have larger pages, which is useful for working with very large memory. It is useful in both 32-bit and 64-bit configurations.

WRONG: HugePages cannot be used without USE_INDIRECT_DATA_BUFFERS
RIGHT: HugePages can be used without indirect buffers. 64-bit systems do not need indirect buffers to have a large buffer cache for the RDBMS instance, and HugePages can be used there too.

WRONG: hugetlbfs means hugetlb
RIGHT: hugetlbfs is a filesystem type **BUT** hugetlb is the mechanism employed behind it, and hugetlb can be employed WITHOUT hugetlbfs.

WRONG: hugetlbfs means hugepages
RIGHT: hugetlbfs is a filesystem type **BUT** HugePages is the mechanism employed behind it (synonymously with hugetlb), and HugePages can be employed WITHOUT hugetlbfs.

1.3 Regular Pages and HugePages

When a single process works with a piece of memory, the pages that the process uses are referenced in a local page table for that specific process. The entries in this table also contain references to the System-Wide Page Table, which actually has references to actual physical memory addresses. So theoretically a user mode process (i.e. Oracle processes) follows its local page table to reach the system page table and can then reference the actual physical memory virtually. As you can see below, it is also possible (and very common for Oracle RDBMS due to SGA use) that two different O/S processes point to the same entry in the system-wide page table.

-- When a process works with a piece of memory, the pages it uses are referenced from its local page table. The entries in the local page table in turn reference pages in the System-Wide Page Table, and those point to the actual physical memory addresses.

So, in theory, a user process (such as an Oracle process) follows an entry in its local page table to an entry in the system page table, and that entry points to the actual physical memory.

It is also possible for two different O/S processes to point to the same entry in the system-wide page table, as the diagram below shows. The most common cause is use of the Oracle SGA.

http://www.cndba.cn/dave/article/310

When HugePages are in play, the usual page tables are still employed. The basic difference is that the entries in both the process page table and the system page table have attributes about huge pages. So any page in a page table can be a huge page or a regular page. The following diagram illustrates 4096K hugepages, but the diagram would be the same for any huge page size.

-- Once HugePages are configured, the basic difference is that the entries in both the process page table and the system page table carry huge page attributes. Any page in a page table can therefore be either a huge page or a regular page.

1.4 Some HugePages Facts/Features

(1) HugePages can be allocated on-the-fly but they must be reserved during system startup. Otherwise the allocation might fail as the memory is already paged in 4K mostly.

(2) HugePage sizes vary from 2MB to 256MB based on kernel version and HW architecture (See related section below.)

(3) HugePages are not subject to reservation / release after the system startup unless there is system administrator intervention, basically changing the hugepages configuration (i.e. number of pages available or pool size)

1.5 Advantages of HugePages Over Normal Sharing Or AMM

(1) Not swappable

HugePages are not swappable. Therefore there is no page-in/page-out mechanism overhead. HugePages are universally regarded as pinned.

(2) Relief of TLB pressure

1) HugePages use fewer pages to cover the physical address space, so the amount of "bookkeeping" (mapping from virtual to physical addresses) decreases and fewer entries are required in the TLB.

2) TLB entries cover a larger part of the address space when HugePages are used, so there are fewer TLB misses before all or most of the SGA is mapped in the TLB.

3) Fewer TLB entries for the SGA also means more for other parts of the address space.

(3) Decreased page table overhead

Each page table entry can be as large as 64 bytes, so handling 50GB of RAM requires a page table of approximately 800MB, which in practice will not fit in the 880MB lowmem (in 2.4 kernels; the page table is not necessarily in lowmem in 2.6 kernels) considering the other uses of lowmem. When 95% of memory is accessed via 256MB hugepages, this can work with a page table of approximately 40MB in total.

-- Each page table entry needs up to 64 bytes of memory. Managing 50GB of memory therefore takes a page table of about 800MB. With 256MB hugepages, the same 50GB needs a page table of only about 40MB.

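Dave's note:

In regular mode each page is 4K, so the number of entries needed is: (50*1024*1024/4)

Each entry is 64 bytes, so the total memory is: (50*1024*1024/4) * 64/1024/1024 = 800M

Note that this is the page table of a single process. With 10 processes, handling these pages alone takes 800*10, about 8G of memory, while total memory is only 50G. This is why HugePages matter so much on large-memory systems.

HugePage sizes range from 2MB to 256MB. Taking 2MB:

(50*1024/2)*64/1024/1024 = about 1.6M

Even 10 processes need only about 16M.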
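The arithmetic in the note above can be reproduced with a short shell sketch. It assumes, as the text does, 64 bytes per page table entry and 50GB of memory. The function name pt_bytes is made up for this illustration.

```shell
#!/bin/sh
# Estimate the size of one process's page table.
# Assumption taken from the text: each page table entry is 64 bytes.
ENTRY_BYTES=64

# pt_bytes MEM_KB PAGE_KB -> page table size in bytes
pt_bytes() {
    mem_kb=$1
    page_kb=$2
    entries=$(( mem_kb / page_kb ))
    echo $(( entries * ENTRY_BYTES ))
}

MEM_KB=$(( 50 * 1024 * 1024 ))   # 50GB expressed in KB

echo "4K pages : $(pt_bytes "$MEM_KB" 4) bytes"      # 838860800 bytes = 800MB
echo "2MB pages: $(pt_bytes "$MEM_KB" 2048) bytes"   # 1638400 bytes, about 1.6MB
```

Multiplying either result by the number of processes gives the per-system totals quoted in the note.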

(4) Eliminated page table lookup overhead

Since the pages are not subject to replacement, page table lookups are not required.

(5) Faster overall memory performance

On virtual memory systems each memory operation is actually two abstract memory operations. Since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.

-- On a virtual memory system every memory operation actually takes two memory operations. HugePages reduce the number of pages and so avoid the bottleneck on page table access.

1.6 HugePage Size

The size of a single HugePage varies with:

(1) Kernel version/linux distribution

(2) HW Platform

The actual HugePage size can be checked with the following command:

$ grep Hugepagesize /proc/meminfo
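If the size is needed in a script rather than on screen, the field can be extracted with awk. The sketch below is one way to do it; the function name hugepage_kb is hypothetical, and the sample input only mimics the /proc/meminfo layout.

```shell
#!/bin/sh
# hugepage_kb: read /proc/meminfo-style text on stdin and print the
# HugePage size in kB; exits non-zero when the field is missing.
hugepage_kb() {
    awk '/^Hugepagesize:/ { print $2; found = 1 } END { exit !found }'
}

# Demonstration on sample text (the values are illustrative only).
printf 'MemTotal:       65916384 kB\nHugepagesize:       2048 kB\n' | hugepage_kb

# On a live system: hugepage_kb < /proc/meminfo
```

A non-zero exit status means the kernel reports no Hugepagesize line, which usually indicates that HugePages are not supported or not compiled in.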

The table below shows the sizes of HugePages on different configurations. Note that these are general numbers taken from the most recent versions of the kernels. For a specific kernel source package, you can check for the HPAGE_SIZE macro value (based on HPAGE_SHIFT) for a different (more recent) kernel source tree.

-- The table below lists the HugePage size on different platforms:

HW Platform                     Source Code Tree   Kernel 2.4   Kernel 2.6
Linux x86 (IA32)                i386               4 MB         4 MB *
Linux x86-64 (AMD64, EM64T)     x86_64             2 MB         2 MB
Linux Itanium (IA64)            ia64               256 MB       256 MB
IBM Power Based Linux (PPC64)   ppc64/powerpc      N/A **       16 MB
IBM zSeries Based Linux         s390               N/A          N/A
IBM S/390 Based Linux           s390               N/A          N/A

* Some older packaging for the 2.6.5 kernel on SLES8 (like 2.6.5-7.97) can have 2 MB Hugepagesize.

** Oracle RDBMS is also not certified in this configuration. See Document 341507.1

1.7 HugePages and Oracle 11g Automatic Memory Management (AMM)

The AMM and HugePages are not compatible. One needs to disable AMM on 11g to be able to use HugePages. See Document 749851.1 for further information.

-- Note that Oracle 11g AMM is not compatible with HugePages.

1.8 Risks of Not Configuring HugePages Properly

Configuring HugePages is a delicate process. If it is not done properly on Linux, you may run into the following problems:

(1) HugePages not used at all (HugePages_Total = HugePages_Free), wasting the memory configured for them

(2) Poor database performance

(3) System running out of memory or swapping excessively

(4) Some or all database instances cannot be started

(5) Crucial system services failing (e.g.: CRS)

To avoid / help with such situations Bug 10153816 was filed to introduce a database initialization parameter in 11.2.0.2 (use_large_pages) to help manage which SGAs will use huge pages and potentially give warnings or not start up at all if they cannot get those pages.

1.9 Why Configure HugePages

HugePages is crucial for faster Oracle database performance on Linux if you have a large RAM and SGA. If your combined database SGAs are large (for example more than 8GB, though it can matter for smaller ones too), you will need HugePages configured. Note that the size of the SGA matters. Advantages of HugePages are:

-- With a large amount of RAM and a large SGA, HugePages are very important for database performance. If the database SGA is large, for example over 8GB, HugePages should be configured. Configuring HugePages brings the following benefits:

(1) Larger Page Size and Fewer Pages: Default page size is 4K whereas the HugeTLB size is 2048K. That means the system needs to handle 512 times fewer pages.

(2) No Page Table Lookups: Since HugePages are not subject to replacement (unlike regular pages), page table lookups are not required.

(3) Better Overall Memory Performance: On virtual memory systems (any modern OS) each memory operation is actually two abstract memory operations. With HugePages, since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.

(4) No Swapping: Swapping must be avoided altogether on Linux (see Document 1295478.1). HugePages are not swappable (whereas regular pages are). Therefore there is no page replacement mechanism overhead. HugePages are universally regarded as pinned.

(5) No 'kswapd' Operations: kswapd will get very busy if there is a very large area to be paged (i.e. 13 million page table entries for 50GB memory) and will use an incredible amount of CPU resource. When HugePages are used, kswapd is not involved in managing them. See also Document 361670.1

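2. Configuring HugePages

2.1 Step 1: Set memlock

Add memlock limits to /etc/security/limits.conf. The value is given in KB and should be slightly smaller than the installed physical memory. For example, with 64GB of RAM it could be set as follows: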

* soft memlock 60397977

* hard memlock 60397977

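Setting this value higher than the SGA requires has no adverse effect.

If you use the oracle-validated package on Oracle Linux, or an Exadata DB compute node, this parameter is configured automatically.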
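The example value 60397977 works out to 90% of 64GB expressed in KB. The sketch below derives the limits.conf lines that way. The 90% factor is only inferred from that one example, not a rule stated in the source, and the function name memlock_kb is made up here.

```shell
#!/bin/sh
# memlock_kb RAM_KB -> suggested memlock limit in KB.
# Assumption: 90% of RAM, rounded down, which matches the 64GB example.
memlock_kb() {
    echo $(( $1 * 9 / 10 ))
}

RAM_KB=$(( 64 * 1024 * 1024 ))   # 64GB expressed in KB
LIMIT=$(memlock_kb "$RAM_KB")

# Print the two lines in limits.conf format.
printf '* soft memlock %s\n* hard memlock %s\n' "$LIMIT" "$LIMIT"
```

Any value that stays below physical memory and above the combined SGA size serves the same purpose, so the factor can be adjusted.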

2.2 Step 2: Verify memlock

Check the value with the following command:

$ ulimit -l

60397977
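The reported limit can also be compared against the memory the SGA needs. The following is a minimal sketch: memlock_ok is a hypothetical helper, and the 48GB SGA is an invented figure for illustration.

```shell
#!/bin/sh
# memlock_ok LIMIT NEEDED_KB
# Succeeds when LIMIT (the output of `ulimit -l`, in KB or the word
# "unlimited") is large enough to lock NEEDED_KB of memory.
memlock_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

SGA_KB=$(( 48 * 1024 * 1024 ))   # hypothetical 48GB SGA, in KB

if memlock_ok 60397977 "$SGA_KB"; then
    echo "memlock limit covers the SGA"
else
    echo "memlock limit is too low for the SGA"
fi

# On a live system, run as the database owner: memlock_ok "$(ulimit -l)" "$SGA_KB"
```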

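2.3 Step 3: Disable AMM in 11g

On Oracle 11g and later, instances created with the defaults use Automatic Memory Management (AMM), which is not compatible with HugePages.

AMM must be disabled before HugePages are set up. Setting the initialization parameters MEMORY_TARGET and MEMORY_MAX_TARGET to 0 is enough.

With AMM, all SGA memory is allocated under /dev/shm, so HugePages are not used when the SGA is allocated. This is why AMM and HugePages are incompatible.

In addition, ASM instances also use AMM by default. Because an ASM instance does not need a large SGA, there is little point in using HugePages for it.

To use HugePages, first make sure that the MEMORY_TARGET / MEMORY_MAX_TARGET parameters are not set.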
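One way to check this offline is to scan a text parameter file (pfile) for the two parameters. This is only a sketch: amm_enabled is a hypothetical helper, it assumes the usual name=value pfile syntax with an optional "*." or SID prefix, and it does not read a binary spfile.

```shell
#!/bin/sh
# amm_enabled: read pfile text on stdin; succeed (exit 0) when
# memory_target or memory_max_target is set to a non-zero value.
amm_enabled() {
    awk '
        { line = tolower($0); gsub(/[ \t]/, "", line) }
        line ~ /^([^#=]*\.)?memory_(max_)?target=/ {
            split(line, kv, "=")
            if (kv[2] != "0") on = 1
        }
        END { exit !on }'
}

# Demonstration with invented parameter values.
if printf '*.memory_target=2G\n*.sga_target=0\n' | amm_enabled; then
    echo "AMM is configured: disable it before using HugePages"
fi
```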

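2.4 Step 4: Calculate the recommended value of vm.nr_hugepages

Make sure that all database instances, including ASM instances, are up. Then use the hugepages_settings.sh script to obtain the recommended value for the vm.nr_hugepages kernel parameter.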

$ ./hugepages_settings.sh
...
Recommended setting: vm.nr_hugepages = 1496
$
You can also calculate the value yourself, based on experience.
The script is as follows:
#!/bin/bash
#
# hugepages_settings.sh
#
# Linux bash script to compute values for the
# recommended HugePages/HugeTLB configuration
#
# Note: This script does calculation for all shared memory
# segments available when the script is run, no matter it
# is an Oracle RDBMS shared memory segment or not.
#
# This script is provided by Doc ID 401749.1 from My Oracle Support 
# http://support.oracle.com
# Welcome text
echo "
This script is provided by Doc ID 401749.1 from My Oracle Support 
(http://support.oracle.com) where it is intended to compute values for 
the recommended HugePages/HugeTLB configuration for the current shared 
memory segments. Before proceeding with the execution please note following:
 * For ASM instance, it needs to configure ASMM instead of AMM.
 * The 'pga_aggregate_target' is outside the SGA and 
   you should accommodate this while calculating SGA size.
 * In case you changes the DB SGA size, 
   as the new SGA will not fit in the previous HugePages configuration, 
   it had better disable the whole HugePages, 
   start the DB with new SGA size and run the script again.
And make sure that:
 * Oracle Database instance(s) are up and running
 * Oracle Database 11g Automatic Memory Management (AMM) is not setup 
   (See Doc ID 749851.1)
 * The shared memory segments can be listed by command:
     # ipcs -m
Press Enter to proceed..."
read
# Check for the kernel version
KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'`
# Find out the HugePage size
HPG_SZ=`grep Hugepagesize /proc/meminfo | awk '{print $2}'`
if [ -z "$HPG_SZ" ];then
    echo "The hugepages may not be supported in the system where the script is being executed."
    exit 1
fi
# Initialize the counter
NUM_PG=0
# Cumulative number of pages required to handle the running shared memory segments
for SEG_BYTES in `ipcs -m | cut -c44-300 | awk '{print $1}' | grep "[0-9][0-9]*"`
do
    MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q`
    if [ $MIN_PG -gt 0 ]; then
        NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q`
    fi
done
RES_BYTES=`echo "$NUM_PG * $HPG_SZ * 1024" | bc -q`
# An SGA less than 100MB does not make sense
# Bail out if that is the case
if [ $RES_BYTES -lt 100000000 ]; then
    echo "***********"
    echo "** ERROR **"
    echo "***********"
    echo "Sorry! There are not enough total of shared memory segments allocated for 
HugePages configuration. HugePages can only be used for shared memory segments 
that you can list by command:
    # ipcs -m
of a size that can match an Oracle Database SGA. Please make sure that:
 * Oracle Database instance is up and running 
 * Oracle Database 11g Automatic Memory Management (AMM) is not configured"
    exit 1
fi
# Finish with results
case $KERN in
    '2.4') HUGETLB_POOL=`echo "$NUM_PG*$HPG_SZ/1024" | bc -q`;
           echo "Recommended setting: vm.hugetlb_pool = $HUGETLB_POOL" ;;
    # Kernels 2.6 and later (3.x, 4.x, ...) use vm.nr_hugepages
    '2.6'|[3-9].*|[1-9][0-9].*) echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
     *) echo "Unrecognized kernel version $KERN. Exiting." ;;
esac
# End
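
The core of the calculation can be tried in isolation with the condensed sketch below. It follows the same rule as the loop in the script (skip segments smaller than one HugePage, otherwise add one extra page per segment) but takes the sizes as arguments instead of reading `ipcs -m`. The function name nr_hugepages and the segment sizes are invented for the example.

```shell
#!/bin/sh
# nr_hugepages HPG_KB SEG_BYTES... -> recommended number of HugePages
nr_hugepages() {
    hpg_bytes=$(( $1 * 1024 ))
    shift
    total=0
    for seg in "$@"; do
        pages=$(( seg / hpg_bytes ))
        # Segments smaller than one HugePage are ignored; larger ones
        # get one extra page to cover the remainder.
        if [ "$pages" -gt 0 ]; then
            total=$(( total + pages + 1 ))
        fi
    done
    echo "$total"
}

# A 3GB shared memory segment plus a 4KB one, with 2MB HugePages.
nr_hugepages 2048 3221225472 4096    # prints 1537
```

The 3GB segment needs 1536 pages of 2MB and receives one more as headroom, while the 4KB segment contributes nothing.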

2.5 Step 5: Set the vm.nr_hugepages parameter in /etc/sysctl.conf

...

vm.nr_hugepages = 1496

...
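If the file is edited from a script, it helps to replace any existing vm.nr_hugepages line rather than append a second one. The sketch below does that. set_nr_hugepages is a hypothetical helper, the kernel.shmmax line is invented sample content, and the demonstration works on a temporary file instead of the real /etc/sysctl.conf.

```shell
#!/bin/sh
# set_nr_hugepages FILE N
# Drop any existing vm.nr_hugepages line from FILE and append the new value.
set_nr_hugepages() {
    file=$1
    n=$2
    tmp=$(mktemp) || return 1
    awk '!/^[ \t]*vm\.nr_hugepages[ \t]*=/' "$file" > "$tmp"
    echo "vm.nr_hugepages = $n" >> "$tmp"
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Demonstration on a scratch copy.
conf=$(mktemp)
printf 'kernel.shmmax = 68719476736\nvm.nr_hugepages = 512\n' > "$conf"
set_nr_hugepages "$conf" 1496
cat "$conf"
rm -f "$conf"
```

Running the function twice leaves a single vm.nr_hugepages line, so it is safe to re-run after recalculating the value.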

2.6 Step 6: Stop all instances and reboot the server

2.7 Verify the Configuration

After the reboot, make sure that all database instances are up, then check the HugePages status with the following command:

# grep HugePages /proc/meminfo
HugePages_Total:    1496
HugePages_Free:      485
HugePages_Rsvd:      446
HugePages_Surp:        0

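For the HugePages configuration to be valid, HugePages_Free should be smaller than HugePages_Total, and some pages should show as reserved in HugePages_Rsvd.

The HugePages_Free and HugePages_Rsvd values should be smaller than the number of pages allocated for the SGA.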
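The same check can be scripted. The sketch below reads /proc/meminfo-style text and reports whether HugePages are in use. hugepages_status is a hypothetical helper, and the sample input repeats the figures shown above.

```shell
#!/bin/sh
# hugepages_status: read /proc/meminfo-style text on stdin and summarize
# the HugePages counters.
hugepages_status() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free  = $2 }
        /^HugePages_Rsvd:/  { rsvd  = $2 }
        END {
            if (total + 0 == 0)
                print "no HugePages configured"
            else if (free + 0 == total + 0)
                print "HugePages configured but unused"
            else
                printf "allocated: %d of %d pages, reserved: %d\n", total - free, total, rsvd
        }'
}

printf 'HugePages_Total:    1496\nHugePages_Free:      485\nHugePages_Rsvd:      446\nHugePages_Surp:        0\n' | hugepages_status
# prints: allocated: 1011 of 1496 pages, reserved: 446

# On a live system: hugepages_status < /proc/meminfo
```

The "configured but unused" case corresponds to HugePages_Total = HugePages_Free, which the text treats as a sign that no instance is using HugePages.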

2.8 Troubleshooting

Some common problems are listed below:

Symptom: System is running out of memory or swapping
Possible Cause: Not enough HugePages to cover the SGA(s), so the area reserved for HugePages is wasted while the SGAs are allocated through regular pages.
Troubleshooting Action: Review your HugePages configuration to make sure that all SGA(s) are covered.

Symptom: Databases fail to start
Possible Cause: memlock limits are not set properly
Troubleshooting Action: Make sure the settings in limits.conf apply to the database owner account.

Symptom: One of the databases fails to start while another is up
Possible Cause: The SGA of the specific database could not find available HugePages and the remaining RAM is not enough.
Troubleshooting Action: Make sure that the RAM and HugePages are enough to cover all your database SGAs.

Symptom: Cluster Ready Services (CRS) fail to start
Possible Cause: HugePages configured too large (maybe larger than installed RAM)
Troubleshooting Action: Make sure the total SGA is less than the installed RAM and re-calculate HugePages.

Symptom: HugePages_Total = HugePages_Free
Possible Cause: HugePages are not used at all. No database instances are up, or they are using AMM.
Troubleshooting Action: Disable AMM and make sure that the database instances are up.

Symptom: Database started successfully but performance is slow
Possible Cause: The SGA of the specific database could not find available HugePages and is therefore handled by regular pages, which leads to slow performance.
Troubleshooting Action: Make sure that there are enough HugePages to cover all your database SGAs.

2.9 Related MOS Documents

HugePages and Oracle Database 11g Automatic Memory Management (AMM) on Linux [ID 749851.1]

Hugepages are Not used by Database Buffer Cache [ID 829850.1]

Oracle Not Utilizing Hugepages [ID 803238.1]

/proc/meminfo Does Not Provide HugePages Information on Oracle Enterprise Linux (OEL5) [ID 860350.1]

HugePages Not Released On Oracle RDBMS Instance Shutdown with RHEL / EL 5 Update 1 (5.1) [ID 550443.1]

Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration [ID 401749.1]

HugePages on Oracle Linux 64-bit [ID 361468.1]

HugePages on Linux: What It Is... and What It Is Not... [ID 361323.1]

Setup HugePages in an Guest Does Not Work with Oracle VM 2.1 or 2.1.1 [ID 728063.1]
