<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>We've had some trouble with the AMI sometimes locking up. While
locked up, it won't accept new connections and stops sending out
data. Running asterisk -rx "manager show eventq" shows the event
queue can grow to over 100K events and many megabytes of content.
Sometimes the problem resolves itself, though not always. Recently
the AMI locked up for over an hour, and only restarting Asterisk
resolved the situation. This also seems to affect incoming calls,
which can be blocked.<br>
</p>
<p>We did some gdb thread dumps on Asterisk and found two problem
threads. One thread is stuck attempting to obtain a session lock in
manager.c purge_sessions; the other is holding that session lock in
manager.c process_events. I dumped the threads twice during the same
AMI lockup, and both were stuck in the same spots.</p>
<p>In addition, we found the send queue (sendq) on the Asterisk side
of the AMI connection to be very high. After we killed this
connection, the AMI recovered. We've long suspected this problem is
related to the Asterisk issue below, and this finding is further
evidence.<br>
</p>
<p><a class="moz-txt-link-freetext" href="https://issues.asterisk.org/jira/browse/ASTERISK-28469">https://issues.asterisk.org/jira/browse/ASTERISK-28469</a></p>
<p><br>
</p>
<p><br>
</p>
<p>Code snippets:</p>
<p>One thread is stuck waiting to acquire a session lock so it can
check whether the session should be purged:<br>
</p>
<pre>
static void purge_sessions(int n_max)
{
    struct ao2_container *sessions;
    struct mansession_session *session;
    time_t now = time(NULL);
    struct ao2_iterator i;

    sessions = ao2_global_obj_ref(mgr_sessions);
    if (!sessions) {
        return;
    }
    i = ao2_iterator_init(sessions, 0);
    ao2_ref(sessions, -1);
    while ((session = ao2_iterator_next(&i)) && n_max > 0) {
        ao2_lock(session); /* *****STUCK HERE***** */
        if (session->sessiontimeout && (now > session->sessiontimeout) && !session->inuse) {
            if (session->authenticated
                && VERBOSITY_ATLEAST(2)
                && manager_displayconnects(session)) {
                ast_verb(2, "HTTP Manager '%s' timed out from %s\n",
                    session->username,
                    ast_sockaddr_stringify_addr(&session->addr));
            }
            ao2_unlock(session);
            session_destroy(session);
            n_max--;
        } else {
            ao2_unlock(session);
            unref_mansession(session);
        }
    }
    ao2_iterator_destroy(&i);
}
</pre>
<br>
<p><br>
</p>
<p>Thread holding the lock:<br>
</p>
<pre>
static int process_events(struct mansession *s)
{
    int ret = 0;

    ao2_lock(s->session);
    if (s->session->stream != NULL) {
        struct eventqent *eqe = s->session->last_ev;

        while ((eqe = advance_event(eqe))) {
            if (eqe->category == EVENT_FLAG_SHUTDOWN) {
                ast_debug(3, "Received CloseSession event\n");
                ret = -1;
            }
            if (!ret && s->session->authenticated &&
                (s->session->readperm & eqe->category) == eqe->category &&
                (s->session->send_events & eqe->category) == eqe->category) {
                if (match_filter(s, eqe->eventdata)) {
                    if (send_string(s, eqe->eventdata) < 0) /* ***** STUCK HERE ***** */
                        ret = -1; /* don't send more */
                }
            }
            s->session->last_ev = eqe;
        }
    }
    ao2_unlock(s->session);
    return ret;
}
</pre>
If I trace send_string back via my thread dump, it leads to:<br>
<pre>
int ast_wait_for_output(int fd, int ms)
{
    struct pollfd pfd[1];

    memset(pfd, 0, sizeof(pfd));
    pfd[0].fd = fd;
    pfd[0].events = POLLOUT;
    return ast_poll(pfd, 1, ms); /* ****STUCK HERE**** */
}
</pre>
<p><br>
</p>
<p>Which ultimately calls poll() in glibc (poll.c). I haven't
written C in any real capacity for 20 years, so this is largely
where my ability to debug ends.<br>
</p>
<p>I don't know why this never times out. I did notice that
send_string sets a timeout based on the writetimeout option, which
we leave at the default of 100ms in manager.conf.<br>
</p>
<p><br>
</p>
<p>We're running CentOS 6, with glibc-2.12-1.212.el6_10.3.x86_64,
and Asterisk 16.8-cert5.</p>
<p><br>
</p>
<p><br>
</p>
<p>Here's the dump of the two threads (I took another dump a few
minutes later that's very similar):<br>
</p>
<p>Thread 99 (Thread 0x7fbee6b87700 (LWP 24931)):<br>
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136<br>
No locals.<br>
#1 0x00000039c04095f3 in _L_lock_892 () from
/lib64/libpthread.so.0<br>
No symbol table info available.<br>
#2 0x00000039c04094d7 in __pthread_mutex_lock
(mutex=0x7fbf18012900) at pthread_mutex_lock.c:82<br>
ignore1 = 128<br>
ignore2 = 402729216<br>
ignore3 = -512<br>
type = 402729216<br>
id = 24931<br>
#3 0x000000000045b7a6 in __ao2_lock (user_data=<value
optimized out>, lock_how=<value optimized out>,
file=0x66aa32 "manager.c", func=0x66f388 "purge_sessions",
line=7047, var=<value optimized out>) at astobj2.c:241<br>
obj = <value optimized out><br>
obj_mutex = <value optimized out><br>
obj_rwlock = <value optimized out><br>
res = 0<br>
#4 0x00000000005f448a in purge_sessions (data=<value optimized
out>) at manager.c:7047<br>
i = {c = 0x1140d30, last_node = 0x7fbf18004058, complete =
0, flags = 0}<br>
n_max = 1<br>
sessions = <value optimized out><br>
session = 0x7fbf18012940<br>
now = 1609358394<br>
#5 purge_old_stuff (data=<value optimized out>) at
manager.c:8640<br>
No locals.<br>
#6 0x000000000058cc41 in ast_tcptls_server_root (data=0x8d9c20)
at tcptls.c:270<br>
i = <value optimized out><br>
desc = 0x8d9c20<br>
fd = -1<br>
addr = {ss = {ss_family = 2, __ss_padding =
"\200\320\330\021\000\374\000\000\000\000\000\000\000\000\260\b\000Ⱦ\177\000\000
\000\000\000\000\000\000\000
\341\070\300\071\000\000\000\355\247\a\300\071", '\000'
<repeats 11 times>"\340,
\370\024\001\000\000\000\000\021\220a\000\000\000\000\000\230\001\000\000\000\000\000\000p\325a",
'\000' <repeats 13 times>,
"\003\000\000\000\000\000\000\000v\254E", '\000' <repeats 12
times>, __ss_align = 0}, len = 16}<br>
tcptls_session = <value optimized out><br>
launched = 140456899585792<br>
__PRETTY_FUNCTION__ = "ast_tcptls_server_root"<br>
__FUNCTION__ = "ast_tcptls_server_root"<br>
#7 0x000000000059bc77 in dummy_start (data=<value optimized
out>) at utils.c:1249<br>
__clframe = {__cancel_routine = <value optimized
out>, __cancel_arg = 0x7fbee6b87700, __do_it = 1, __cancel_type
= 0}<br>
ret = <value optimized out><br>
a = {start_routine = 0x58cbf0
<ast_tcptls_server_root>, data = 0x8d9c20, name = <value
optimized out>}<br>
__PRETTY_FUNCTION__ = "dummy_start"<br>
#8 0x00000039c0407aa1 in start_thread (arg=0x7fbee6b87700) at
pthread_create.c:301<br>
__res = <value optimized out><br>
pd = 0x7fbee6b87700<br>
now = <value optimized out><br>
unwind_buf = {cancel_jmp_buf = {{jmp_buf =
{140457891362560, -1282427221842092215, 140736079386656,
140457891363264, 0, 3, 1246980655494963017, -1278908234687602871},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data =
{prev = 0x0, cleanup = 0x0, canceltype = 0}}}<br>
not_first_call = <value optimized out><br>
pagesize_m1 = <value optimized out><br>
sp = <value optimized out><br>
freesize = <value optimized out><br>
#9 0x00000039c00e8c4d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115<br>
No locals.</p>
<p><br>
</p>
<p><br>
</p>
<p>Thread 36 (Thread 0x7fbee634b700 (LWP 20342)):<br>
#0 0x00000039c00df403 in __poll (fds=<value optimized out>,
nfds=<value optimized out>, timeout=<value optimized
out>) at ../sysdeps/unix/sysv/linux/poll.c:87<br>
__arg2 = 1<br>
_a3 = 100<br>
_a1 = 140457882722704<br>
resultvar = <value optimized out><br>
__arg3 = 100<br>
__arg1 = 140457882722704<br>
_a2 = 1<br>
resultvar = <value optimized out><br>
oldtype = 0<br>
result = <value optimized out><br>
#1 0x000000000059cc27 in ast_wait_for_output (fd=<value
optimized out>, ms=<value optimized out>) at utils.c:1357<br>
pfd = {{fd = 29, events = 4, revents = 0}}<br>
#2 0x00000000004f2621 in ast_iostream_write
(stream=0x7fbec8004598, buffer=0x7fbe94f22358, size=421) at
iostream.c:487<br>
start = {tv_sec = 1609360234, tv_usec = 335812}<br>
ms = <value optimized out><br>
res = <value optimized out><br>
written = 0<br>
remaining = 421<br>
__PRETTY_FUNCTION__ = "ast_iostream_write"<br>
#3 0x00000000005ecbfd in send_string (s=0x7fbee634acd0) at
manager.c:3041<br>
stream = 0x7fbec8004598<br>
len = 421<br>
res = <value optimized out><br>
#4 process_events (s=0x7fbee634acd0) at manager.c:6241<br>
eqe = <value optimized out><br>
ret = 0<br>
__PRETTY_FUNCTION__ = "process_events"<br>
#5 0x0000000000602a6f in do_message (data=0x7fbec8002d60) at
manager.c:6865<br>
m = {hdrcount = 0, headers = {0x0 <repeats 128
times>}}<br>
hdr_loss = 0<br>
header_buf = '\000' <repeats 1024 times><br>
res = <value optimized out><br>
now = 1609332593<br>
#6 session_do (data=0x7fbec8002d60) at manager.c:7004<br>
ser = 0x7fbec8002d60<br>
session = 0x7fbf18012940<br>
s = {session = 0x7fbf18012940, stream = 0x7fbec8004598,
tcptls_session = 0x7fbec8002d60, parsing = MESSAGE_OKAY,
write_error = 1, hook = 0x0, lock = {mutex = {__data = {__lock =
0, __count = 0, __owner = 0, __nusers = 0, __kind = 1, __spins =
0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000'
<repeats 16 times>, "\001", '\000' <repeats 22 times>,
__align = 0}, _track = 0x0, _flags = {tracking = 0, setup = 0}}}<br>
res = -1<br>
arg = 1<br>
ser_remote_address_tmp = {ss = {ss_family = 2,
__ss_padding =
"\256\212\330\021\000\374\000\000\000\000\000\000\000\000[\000\000\000\003\000\000\000
\000\000\030\277\177\000\000`\242a\000\000\000\000\000\200<\002\000\000\000\000\000\200S\001\030\277\177\000\000\004\000\000\000\000\000\000\000
\341\070\300\071\000\000\000\355\247\a\300\071\000\000\000
\341\070\300\071\000\000\000\260Y\215\000\000\000\000\000`\242a\000\000\000\000\000\305\000\000\000\000\000\000\000\360\260e\000\000\000\000",
__ss_align = 140457375968976}, len = 16}<br>
__PRETTY_FUNCTION__ = "session_do"<br>
#7 0x000000000059bc77 in dummy_start (data=<value optimized
out>) at utils.c:1249<br>
__clframe = {__cancel_routine = <value optimized
out>, __cancel_arg = 0x7fbee634b700, __do_it = 1, __cancel_type
= 0}<br>
ret = <value optimized out><br>
a = {start_routine = 0x58d2c0
<handle_tcptls_connection>, data = 0x7fbec8002d60, name =
<value optimized out>}<br>
__PRETTY_FUNCTION__ = "dummy_start"<br>
#8 0x00000039c0407aa1 in start_thread (arg=0x7fbee634b700) at
pthread_create.c:301<br>
__res = <value optimized out><br>
pd = 0x7fbee634b700<br>
now = <value optimized out><br>
unwind_buf = {cancel_jmp_buf = {{jmp_buf =
{140457882728192, -1282427221842092215, 140457891359648,
140457882728896, 140457375968976, 3, 1246981791513812809,
-1278908234687602871}, mask_was_saved = 0}}, priv = {pad = {0x0,
0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype =
0}}}<br>
not_first_call = <value optimized out><br>
pagesize_m1 = <value optimized out><br>
sp = <value optimized out><br>
freesize = <value optimized out><br>
#9 0x00000039c00e8c4d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115<br>
No locals.<br>
</p>
</body>
</html>