[asterisk-bugs] [JIRA] (ASTERISK-24850) Lockup with no calls being processed, probably caused by IAX

Matt Jordan (JIRA) noreply at issues.asterisk.org
Mon Mar 9 11:01:35 CDT 2015


    [ https://issues.asterisk.org/jira/browse/ASTERISK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=225294#comment-225294 ] 

Matt Jordan commented on ASTERISK-24850:
----------------------------------------

Since we discussed this a lot in {{#asterisk-dev}}, I'm including the conversation here.

tl;dr: this is actually a locking inversion in {{chan_lcr}}.

{quote}
{noformat}
21:22.10	mjordan	WIMPy: nope.
21:22.30	mjordan	WIMPy: chan_lcr is doing something wrong by causing a locking inversion
21:23.19	WIMPy	Hmm. And that causes IAX to go to zombie state?
21:23.34	mjordan	lcr_request is holding a channel lock while attempting to get the channels container lock.
21:24.05	mjordan	It would, because lcr_request is holding a channel lock that causes everyone else to pile up
21:24.21	WIMPy	Ok.
21:24.42	mjordan	at least that's what I can tell from the 'core show locks' output.
21:24.55	WIMPy	Now the interesting thing is where to look for that lock.
21:25.25	mjordan	some line number in lcr_request.
21:25.36	mjordan	I would suspect it is attempting to link it into the channels container.
21:25.43	WIMPy	Or why it only seem to happen if IAX is involved.
21:26.19	mjordan	because chan_iax2 has to grab the channels container?
21:26.30	mjordan	Honestly, if you ran a 'core show channels' it would probably block that as well.
21:27.03	mjordan	it also looks like you've got IAX channels in a bridge with the LCR channels
21:27.25	mjordan	which would cause those IAX channels to block while their thread attempts to write to the LCR channel (which is locked by chan_lcr)
21:27.31	WIMPy	The result looks extremely similar to theissue I ran in to on the upgrade from 11.7.0 to 11.7.1, but that did not involve lcr, only iax and sip.
21:27.47	mjordan	that core show locks doesn't show anything that I would blame on chan_iax.
21:28.10	WIMPy	But it seems to have trouble writing to IAX not from.
21:28.46	mjordan	WIMPy: " * \note Absolutely _NO_ channel locks should be held before calling this function."
21:28.57	mjordan	that's ast_channel_alloc, which chan_lcr is calling into, with a channel lock held
21:29.00	WIMPy	All these locking things don't really mean much to me :-(
21:29.02	mjordan	that will absolutely cause a deadlock.
21:29.15	mjordan	Once a channel deadlock happens, everyone gets screwed
21:29.30	mjordan	the channels container is permanently locked, and everyone grinds to a halt
21:29.48	WIMPy	alloc? So that would be when a new channel is set up?
21:30.26	mjordan	WIMPy: that is when you create a new channel, and I can look at chan_lcr and see them doing it, which I should *not* be doing since that is code that isn't licensed back to the project.
21:30.47	mjordan	line 2103, followed by line 2125. https://github.com/osmobuntu/lcr/blob/master/chan_lcr.c
21:30.59	mjordan	It violates the contract of the API, and will absolutely cause deadlocks willy nilly.
21:31.41	WIMPy	Ok. Thanks for pointing me in the right direction.
21:33.02	WIMPy	just wonders what version that is yu found, BTW
21:33.10	WIMPy	you
21:33.42	mjordan	er
21:34.00	mjordan	WIMPy: that lock isn't a channel lock. They globally lock everytime they make a channel. Weird.
21:34.03	mjordan	Hm.
21:34.32	mjordan	they are still deadlocking it.
21:34.37	mjordan	but it's a weird one.
21:34.44	mjordan	mostly because of their stupid global lock.
21:34.57	mjordan	they grab their global lock (chan_lock), then attempt to grab the channel container lock.
21:35.05	WIMPy	They way you describe it makes it interesting that it works at all.
21:35.21	mjordan	that causes a locking inversion when the core - which holds a channel lock - calls ast_write on them
21:35.38	mjordan	that channel lock blocks another thread which has grabbed the global channels lock and is attempting to grab that channel lock
21:35.47	mjordan	that global lock is still the problem.
21:35.56	mjordan	but it is a bit more complex than I thought.
21:36.37	mjordan	WIMPy: generally, you shouldn't hold locks all over the place, and the locking order is supposed to be "channel container, then channel"
21:37.07	mjordan	what they've done is thrown a third lock into the mix (generally a bad idea), then complicate that by having it grabbed when a channel lock may be held.
21:37.29	mjordan	They create the inversion when they hold that global lock and grab the channel container lock by creating a new channel (when the new channel is done being created, it has to go into the container).
21:38.05	mjordan	Why they have a global lock and are locking it left & right, I don't know. That seems like a terrible idea for performance. They probably thought they were solving some concurrency problem, and stepped right into creating worse ones.
21:38.12	WIMPy	That again sounds like it should happen regularly.
21:38.25	mjordan	well, kind of. It involves at least three threads.
21:38.43	WIMPy	The shit is horribly old :-(
21:50.41	WIMPy	Do you think there needs to be a bridged lcr channel when a new lcr channel is created for the issue to happen?
21:56.24	WIMPy	Could it be that the whole lock isn't neccessay at all?
22:35.07	WIMPy	is confused.
22:35.51	WIMPy	While chan_misdn (which obviousely has been the template for chan_cr at some point) doesn't seem to do locking, while chan_dahdi looks like chan_lcr looks like now.
22:38.07	WIMPy	Also alsa does it an console doesn't.
23:05.17	mjordan	chan_dahdi doesn't have a global lock.
23:06.13	mjordan	it has an interface lock for its private structures, but it doesn't use it in the same way.
23:06.30	mjordan	chan_misdn is garbage, so I can't comment on that.
23:07.12	mjordan	granted, everything has issues, but chan_misdn does some horrible stuff with holding onto memory in unsafe ways
23:07.24	WIMPy	I have to admit that I'm too clueless about that locking stuff to see the difference :-(
23:07.30	mjordan	and generally can't be compiled any more, thanks to libmisdn's reliance on an old version of the kernel.
23:08.03	mjordan	WIMPy: lock inversions are hard. In this case, I'm positive it is chan_lcr. They are holding onto their global lock irrespective of the locking contract that a channel driver must adhere to.
23:08.05	WIMPy	Well, it has been "old and deprecated" for many years.
23:08.43	mjordan	to be frank, while deadlocks do still happen in Asterisk from time to time, in the core channel drivers, we have been aggressive about killing them.
23:09.03	WIMPy	Do I get it right that this lock would prevent other threads from accessing an unfinished structure?
23:09.13	mjordan	Richard's latest fix for one in chan_sip is one of the few I've seen crop up in the last year.
23:09.58	WIMPy	Maybe that was the one I got with SIP transfers?
23:10.37	mjordan	WIMPy: maybe. It occurs rarely when you poke a peer while inside of certain messages. He just committed the fix, I think (I was out most last week)
23:11.14	mjordan	WIMPy: I'm not sure what they are doing with that global lock. You don't need to protect something that is allocated on the stack, so I'm not sure why they are aggressively holding it.
23:11.41	WIMPy	Actually that one was a lot more interesting on the also involved IAX side.
23:12.14	mjordan	keep in mind that once something deadlocks the channels container, all channel drivers are screwed
23:12.25	mjordan	what is on the other side of a bridge usually doesn't matter.
23:12.29	WIMPy	I have tried to just remove the locking which does work. But I guess that doesn't really mean much.
23:12.43	mjordan	heh, yeah, now all you know is that you won't deadlock. May blow up instead.
23:12.58	WIMPy	For the sip transfer issue it did.
23:12.58	mjordan	honestly, we get very few deadlocks on the issue tracker when using just the core channel drivers.
23:13.08	mjordan	shrugs
23:13.17	mjordan	I can't recall what that issue was, so I'm not sure
23:14.29	WIMPy	That was the very strange thing that my box would lock up in the same way as on this issue when I tied to transfer a call that come in via IAX on the sip phone.
23:15.19	WIMPy	The thing that made it interesting was that it would happen again on the next try until I restarted the other Asterisk box where the call came from.
23:15.55	WIMPy	That's why I asked that time if there is anything in IAX in any way related to realtime.
23:17.22	*** join/#asterisk-dev superscrat (~asanders at 173-17-133-2.client.mchsi.com)
23:20.17	WIMPy	But back to this one: Can you think of some action that might have an issue with the lock removed?
23:20.45	WIMPy	Something like 'core show channels' displaying incomplete information would be obvious, but harmless.
23:21.05	mjordan	which lock?
23:21.08	mjordan	the one in chan_lcr?
23:21.17	WIMPy	yes
23:23.24	mjordan	I'd say go for it. Particularly the one in the request callback.
23:23.40	mjordan	the other option would be to unlock around the ast_channel_alloc call and then re-lock.
23:23.48	WIMPy	Yes, that's the one I removed.
23:23.58	mjordan	not knowing all of their code, I'm really not sure, but that would prevent the locking inversion.
23:26.43	WIMPy	alloc_call is the only thig happening before. Doesn't sound like something that needs any locks. Maybe I just move the lock to after ast_channel_alloc? That would seem to be the safe bet.
23:27.00	WIMPy	That was actually my first idea until I looked in to the other channels.
23:40.14	WIMPy	OTOH, if other channels work without loking anything, there can't be a need to do it, can there?
23:40.49	file	that's an implementation detail of the channel driver itself
23:41.07	WIMPy	I will try to find out if Jolly remembers why it was there.
23:41.33	WIMPy	But can it be called again for the channel to be created before the function returns?
23:42.08	*** join/#asterisk-dev sgriepentrog (sgriepentr at nat/digium/x-qvecpniglwsxdign)
23:50.07	file	I don't understand the question
23:52.04	WIMPy	Is there any chance the channel will will be called for the channel that might be incompletely created, befor that very function completes?
23:53.11	file	it is possible for something to get the channel after it is allocated.
23:54.38	WIMPy	Yes, but is there anyhting that could have an issue with the incomplete channel?
23:55.03	file	no?
23:55.35	WIMPy	That's what I hoped.
23:56.16	WIMPy	And in the end there would always remain the possibility to hit the very moment beween ast_channel_alloc and any locking.
23:56.52	WIMPy	I think I'm positive that the lock can just go.
23:59.19	WIMPy	I'll keep it running without here and see what Jolly can tell about it.
{noformat}
{quote}



> Lockup with no calls being processed, probably caused by IAX
> ------------------------------------------------------------
>
>                 Key: ASTERISK-24850
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-24850
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: General
>    Affects Versions: 11.16.0
>         Environment: Two Asterisks connected via IAX
>            Reporter: Birger "WIMPy" Harzenetter
>         Attachments: Asterisk-debug.txt
>
>
> The issue starts with warnings like the follogin one
> [Mar  6 09:32:04] WARNING[9013] channel.c: Exceptionally long voice queue length queuing to IAX2/hbc-hah-73
> about 1-2 times per second.
> After some random time in the range from half a minute to 4 minutes it changes to
> [Mar  6 09:36:23] WARNING[9009] chan_iax2.c: Max retries exceeded to host 144.76.51.196 on IAX2/hbc-hah-73 (type = 6, subclass = 2, ts=377730, seqno=114)
> every 10 seconds until I kill (-9) Asterisk.
> In that state Asterisk still answers IAX packets so that it still qualifies OK on the calling instance.
> However it completely stops processing any calls.
> On the calling box all following Dial()s to the locked up one end with DIALSTATUS=CANCEL and HANGUPCAUSE=0 so even they seem to be accepted, just not processed.
> This particular issue started 21 days after upgrading from 11.15.0, but occurred 3 times withing two weeks so far. So I'm unsure if it's related to the upgrade.



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list