Notes from day-to-day work on system operations, virtualization and cloud computing, databases, network security, and related topics.

Oracle startup hang caused by MAXAIO


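    An Oracle database, 10.2.0.4 on Linux x86, hung at the open stage during a normal restart. At the OS level, several user processes started by scheduled jobs were running at almost 100% CPU and were clearly stuck waiting rather than making progress. Trace (.trc) files also appeared quickly in Oracle's bdump directory. The key lines in those files were: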

    WARNING:io_submit failed due to kernel limitations MAXAIO for process=0 pending aio=0

    WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=65536 AIO-NR=65536

    WARNING:Oracle process running out of OS kernel I/O resources (1)

    Read literally, the operating system's MAXAIO limit is restricting what the Oracle processes can do.
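The two numbers in the second warning can be pulled out of a trace line mechanically. The following is a minimal sketch, not part of the original troubleshooting; the function names `aio_limits` and `aio_headroom` are hypothetical, and the only assumption is that the warning keeps the `AIO-MAX-NR=... AIO-NR=...` layout shown above.

```shell
#!/bin/sh
# Print "<AIO-MAX-NR> <AIO-NR>" parsed from a warning line read on stdin.
# Prints nothing if the line does not contain both fields.
aio_limits() {
    sed -n 's/.*AIO-MAX-NR=\([0-9][0-9]*\) AIO-NR=\([0-9][0-9]*\).*/\1 \2/p'
}

# Print how many AIO event slots were still free according to the line
# (AIO-MAX-NR minus AIO-NR). Returns 1 if the line could not be parsed.
aio_headroom() {
    set -- $(aio_limits)
    [ $# -eq 2 ] || return 1
    echo $(( $1 - $2 ))
}
```

For example, `grep 'AIO-MAX-NR' some_trace.trc | head -1 | aio_headroom` prints 0 for the line above, meaning no slots were left.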

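The Linux kernel parameter AIO-MAX-NR relates to asynchronous I/O. A brief explanation of asynchronous I/O follows.

When a program performs disk I/O, there are two cases:
1. The program waits for the I/O operation to complete before the CPU carries on with the rest of the program (the CPU sits idle for as long as the I/O takes).
2. The program does not wait for the I/O operation to complete, and the CPU is free to handle other tasks in the meantime (that is, tasks that do not depend on the result of the I/O).
The first case, synchronous I/O, wastes CPU time. The second case makes fuller use of the CPU; this is asynchronous I/O.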
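The two cases can be illustrated with a loose shell-level analogy. This sketch uses a background job to overlap a slow operation with other work; it is process-level concurrency, not the kernel AIO interface that Oracle uses, and `slow_io` is a hypothetical stand-in for a disk operation.

```shell
#!/bin/sh
# Hypothetical stand-in for a slow disk operation.
slow_io() {
    sleep 1
    echo "io done"
}

# Case 1, synchronous: the caller blocks until slow_io returns,
# and only then moves on to the other work.
run_sync() {
    slow_io
    echo "other work"
}

# Case 2, asynchronous: slow_io is started in the background, the caller
# carries on immediately, and collects the result later with wait.
run_async() {
    slow_io &
    echo "other work"
    wait
}
```

Calling `run_sync` prints the I/O result first, while `run_async` prints "other work" first because it does not block on the slow operation.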

Check the kernel parameters related to asynchronous I/O:

[root@db01 ~]# sysctl -a| grep aio
fs.aio-max-nr = 65536
fs.aio-nr = 0
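The same two values are exposed under /proc/sys/fs, so the share of the limit currently in use can be watched while the instance starts. A minimal sketch, assuming a Linux host; `aio_usage_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the integer percentage of aio-max-nr consumed by aio-nr.
# Arguments: <aio-nr> <aio-max-nr>
aio_usage_pct() {
    nr=$1
    max=$2
    # Avoid dividing by zero if the limit is reported as 0.
    [ "$max" -gt 0 ] || { echo "n/a"; return 0; }
    echo $(( nr * 100 / max ))
}

# Report live values when the kernel exposes them (Linux only).
if [ -r /proc/sys/fs/aio-nr ] && [ -r /proc/sys/fs/aio-max-nr ]; then
    echo "AIO slots in use: $(aio_usage_pct "$(cat /proc/sys/fs/aio-nr)" "$(cat /proc/sys/fs/aio-max-nr)")%"
fi
```

With the values reported in the trace file (AIO-NR=65536 against AIO-MAX-NR=65536) the function prints 100.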

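aio-nr is the running total of events requested by all currently active asynchronous I/O contexts on the system, and it cannot exceed the value of aio-max-nr.

In this case the database had gone down abruptly because of an unexpected power failure in the machine room, with a large number of transactions in flight. On restart it therefore had to roll back uncommitted work and write out committed changes that had not yet reached the data files, which generates very heavy I/O. Because asynchronous I/O was enabled, the AIO resources requested by the Oracle processes reached the limit set by the operating system, and Oracle stopped responding. According to Metalink, Oracle has confirmed this as a bug, and the official recommendation is to set fs.aio-max-nr to 1M (1048576) or higher.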

Here I raise the kernel parameter fs.aio-max-nr to 3145728, that is 3M (3 x 1048576).

[root@db01 ~]# echo "fs.aio-max-nr = 3145728" >> /etc/sysctl.conf
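Note that `>>` appends a new line every time the command is run, so the file can end up with several assignments of the same key. A minimal sketch of an idempotent alternative follows; `set_sysctl_conf` is a hypothetical helper, not a standard tool, and it should be tried on a copy of the file first.

```shell
#!/bin/sh
# Set "key = value" in a sysctl-style config file, replacing any existing
# assignment of the same key instead of appending a duplicate.
# Arguments: <key> <value> <file>
set_sysctl_conf() {
    key=$1
    value=$2
    file=$3
    [ -f "$file" ] || : > "$file"
    # Escape dots so the key is matched literally in the regular expression.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    tmp=$(mktemp) || return 1
    # Keep every line except earlier assignments of this key.
    grep -v "^[[:space:]]*${pattern}[[:space:]]*=" "$file" > "$tmp" || true
    echo "${key} = ${value}" >> "$tmp"
    # Overwrite in place so the original file's permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage would be `set_sysctl_conf fs.aio-max-nr 3145728 /etc/sysctl.conf` as root. The new value only takes effect once the file is reloaded or the host is rebooted.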


    Further reading again pointed to a bug, with two workarounds: (1) increase the kernel parameter AIO-MAX-NR, or (2) disable asynchronous disk I/O. I chose to increase AIO-MAX-NR.
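    The second option, which was not used here, corresponds to the DISK_ASYNCH_IO=FALSE workaround quoted in the reference material at the end. A minimal sketch of how it might be applied, assuming the instance uses an spfile and that a local `sqlplus` login as SYSDBA is available; `disable_aio_sql` is a hypothetical helper that only prints the statement.

```shell
#!/bin/sh
# Print the SQL that turns off asynchronous disk I/O.
# disk_asynch_io is a static parameter, so the change is written to the
# spfile and only takes effect after the instance is restarted.
disable_aio_sql() {
    cat <<'EOF'
ALTER SYSTEM SET disk_asynch_io = FALSE SCOPE = SPFILE;
EOF
}

# Possible usage, as the oracle OS user with ORACLE_SID set:
#   disable_aio_sql | sqlplus -S "/ as sysdba"
# followed by a restart of the instance.
```

    The steps that follow cover the first option, the one actually applied.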

    1. Change aio-max-nr temporarily (the setting is lost at the next reboot):

    # echo 1048576 > /proc/sys/fs/aio-max-nr

    2. To make the change permanent, add the following line to /etc/sysctl.conf:

    fs.aio-max-nr = 1048576

    Apply the setting with:

    # /sbin/sysctl -p
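    After reloading, it is worth confirming that the running kernel really has the new limit before restarting Oracle. A minimal sketch; `check_aio_max` is a hypothetical helper, and its optional second argument exists only so the check can be exercised without touching /proc.

```shell
#!/bin/sh
# Compare the live aio-max-nr against a required minimum.
# Arguments: <required> [current]
# If [current] is omitted it is read from /proc/sys/fs/aio-max-nr.
check_aio_max() {
    required=$1
    current=${2:-$(cat /proc/sys/fs/aio-max-nr)}
    if [ "$current" -ge "$required" ]; then
        echo "ok: aio-max-nr=$current"
    else
        echo "too low: aio-max-nr=$current, want >= $required"
    fi
}
```

    Running `check_aio_max 1048576` on the host reports whether the value configured above is in effect.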

    Appendix: top output

    Tasks: 568 total,   6 running, 562 sleeping,   0 stopped,   0 zombie

    Cpu(s): 20.4%us,  0.1%sy,  0.0%ni, 79.1%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st

    Mem:  132051284k total, 117157820k used, 14893464k free,   197072k buffers

    Swap:  5751260k total,  2404292k used,  3346968k free, 114662552k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

    12975 oracle    25   0 1687m  25m  19m R 99.8  0.0   9:38.01 ora_p004_oncz

    12981 oracle    25   0 1687m  25m  19m R 99.8  0.0   9:38.00 ora_p007_oncz

    12983 oracle    25   0 1687m  25m  19m R 99.8  0.0   9:38.01 ora_p008_oncz

    12985 oracle    25   0 1687m  25m  19m R 99.8  0.0   9:38.00 ora_p009_oncz

    12002 oracle    25   0 1968m 1.6g 1.3g R 90.5  1.3  21:25.03 ora_j000_ofdb
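    Processes in this state can be picked out of top's batch output with a small filter. A minimal sketch, assuming the default 12-column layout shown above; `spinning_oracle` is a hypothetical name and the 90% threshold is an arbitrary choice.

```shell
#!/bin/sh
# Read top-style process lines on stdin and print "PID COMMAND" for every
# process owned by oracle that is runnable (state R) at 90% CPU or more.
# Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
spinning_oracle() {
    awk '$2 == "oracle" && $8 == "R" && $9 + 0 >= 90 { print $1, $12 }'
}

# Possible usage on a live system:
#   top -b -n 1 | spinning_oracle
```

    Applied to the listing above, the filter would print all five processes, since each is in state R with more than 90% CPU.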

    Appendix: trace file from the bdump directory

    /u01/app/oracle/admin/oncz/bdump/oncz_p008_12983.trc

    Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production

    With the Partitioning, OLAP, Data Mining and Real Application Testing options

    ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1

    System name:    Linux

    Node name:      db-172-17-2-8

    Release:        2.6.18-348.el5

    Version:        #1 SMP Tue Jan 8 17:53:53 EST 2013

    Machine:        x86_64

    Instance name: oncz

    Redo thread mounted by this instance: 1

    Oracle process number: 29

    Unix process pid: 12983, image: oracle@db-172-17-2-8 (P008)

    *** SERVICE NAME:() 2013-02-19 15:55:08.764

    *** SESSION ID:(142.1) 2013-02-19 15:55:08.764

    ORA-27090: Message 27090 not found;  product=RDBMS; facility=ORA

    Additional information: 3

    Additional information: 128

    Additional information: 65536

    WARNING:io_submit failed due to kernel limitations MAXAIO for process=0 pending aio=0

    WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=65536 AIO-NR=65536

    WARNING:Oracle process running out of OS kernel I/O resources (1)

    WARNING:Oracle process running out of OS kernel I/O resources (1)

    WARNING:Oracle process running out of OS kernel I/O resources (1)

    WARNING:Oracle process running out of OS kernel I/O resources (1)
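    To see which processes hit the limit, the bdump directory can be searched for trace files containing the first warning. A minimal sketch; `find_maxaio_traces` is a hypothetical helper, and the directory is passed in rather than assumed.

```shell
#!/bin/sh
# List the .trc files in a directory that contain the MAXAIO warning.
# Argument: <bdump directory>
find_maxaio_traces() {
    dir=$1
    grep -l 'io_submit failed due to kernel limitations MAXAIO' "$dir"/*.trc 2>/dev/null
}

# Possible usage with the path from the trace file above:
#   find_maxaio_traces /u01/app/oracle/admin/oncz/bdump
```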

    Appendix: reference material

    Bug 9949948  Linux: Process spin under ksfdrwat0 if OS Async IO not configured high enough

    This note gives a brief overview of bug 9949948.

    The content was last updated on: 28-OCT-2011

    Affects:

    Product (Component)  Oracle Server (Rdbms)

    Range of versions believed to be affected  Versions >= 10.2.0.4 but BELOW 11.1

    Versions confirmed as being affected

    10.2.0.5

    Platforms affected

    Linux X86-64bit

    Linux 32bit

    It is believed to be a regression in default behaviour thus:

    Regression introduced in 10.2.0.5

    Fixed:

    This issue is fixed in

    11.1.0.6 (Base Release)

    10.2.0.5.2 Patch Set Update

    10.2.0.5 Patch 5 on Windows Platforms

    Symptoms:

    Related To:

    Hang (Process Spins)

    Waits for "i/o slave wait"

    DISK_ASYNCH_IO

    Description

    This problem is introduced in 10.2.0.5

    It only affects platforms where Oracle has to reserve async IO slots,

    such as Linux platforms.

    If the OS async IO layer is underconfigured and an Oracle process

    cannot get sufficient AIO slots then rather than reverting to

    using non AIO call the process may go into an infinite spin

    under ksfdrwat0.

    Rediscovery notes:

    The spin will be preceded by messages in the trace

    file of the form:

    WARNING:io_submit failed due to kernel limitations MAXAIO

    for process=0 pending aio=0

    WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=65536 AIO-NR=65518

    WARNING:1 Oracle process running out of OS kernel I/O resources aiolimit=0

    Notice specifically that the value for aiolimit is reported as "0"

    for this bug.

    The process then spins in ksfdrwat0 typically with a stack showing

    skgfqio ()

    ksfdgo ()

    ksfdwtio ()

    ksfdwat1 ()

    ksfdrwat0 ()   <<< Spin point

    ksfdblock ()

    kcflwi ()

    kcflci ()

    kcblci ()

    kcblcio ()

    kcblgt ()

    kcbldrget ()

    It will show repeated waits for "i/o slave wait", which can be

    misleading as that is normally considered an idle wait event.

    Workaround

    Raise the OS AIO limits such that the number of concurrent slot

    requirements never exceeds the OS limit.

    ie: Increase AIO-MAX-NR

    OR

    Disable async IO (Set DISK_ASYNCH_IO=FALSE)

    See Note:1313555.1 for additional notes on this issue.

    Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.

    References

    Bug:9949948

    Note:245840.1 Information on the sections in this article


