[Asterisk-code-review] pjsip: Rewrite OPTIONS support with new eyes. (asterisk[13])

Richard Mudgett asteriskteam at digium.com
Sun Jan 14 14:28:08 CST 2018


Richard Mudgett has posted comments on this change. ( https://gerrit.asterisk.org/7710 )

Change subject: pjsip: Rewrite OPTIONS support with new eyes.
......................................................................


Patch Set 1: Code-Review-1

(46 comments)

Overall, I like the layered status compositor approach and the pre-setup links between the layers.  It makes the normal status update path faster by stopping updates going higher if they cannot change the next layer's status report.  It also makes the normal path faster because the next higher layer does not need to be found each time.

Updates to the sip_options_aor layer structure seem to be better controlled because the critical work is done by the sip_options_aor serializer to keep the available status accurate.

However, how the relationship of the endpoint and sip_options_aor layers is updated is currently quite fragile.  The order of observer callbacks between endpoint and aor objects is indeterminate as they are executed by different serializers.  The rebuilding and update of the status compositor reporting tree is problematic as a result.  The endpoint and aor layers need to be as independent as possible.  There doesn't seem to be enough protection from normal status updates interfering with the restructuring.  I think it would be much simpler for the endpoint compositor's aor container to also keep the last sip_options_aor available status contribution to make the endpoint compositor available status layer and the sip_options_aor available status layer more independent.

https://gerrit.asterisk.org/#/c/7710/1/funcs/func_pjsip_contact.c
File funcs/func_pjsip_contact.c:

https://gerrit.asterisk.org/#/c/7710/1/funcs/func_pjsip_contact.c@147
PS1, Line 147: 	contact_status = ast_sip_get_contact_status(contact_obj);
There is no protection from contact_status == NULL (Pre-existing)


https://gerrit.asterisk.org/#/c/7710/1/include/asterisk/res_pjsip.h
File include/asterisk/res_pjsip.h:

https://gerrit.asterisk.org/#/c/7710/1/include/asterisk/res_pjsip.h@264
PS1, Line 264: 	/*! If true authenticate the qualify if needed */
             : 	int authenticate_qualify;
contact authenticate_qualify is passed around but not used for anything (Pre-existing)


https://gerrit.asterisk.org/#/c/7710/1/include/asterisk/res_pjsip.h@303
PS1, Line 303: struct ast_sip_contact_status {
The contact_status sorcery object needs to be kept to not change the API.  A third party may have setup an observer.  Mitigating factors are that we never declared the table in alembic so nobody should have a realtime table setup.  Only third party modules may have setup an observer.  Not sure how much we want to care though.


https://gerrit.asterisk.org/#/c/7710/1/include/asterisk/res_pjsip.h@342
PS1, Line 342: 	/*! If true authenticate the qualify if needed */
             : 	int authenticate_qualify;
aor authenticate_qualify is passed around but not used for anything (Pre-existing)


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/location.c
File res/res_pjsip/location.c:

https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/location.c@1252
PS1, Line 1252: 	ast_sorcery_object_set_congestion_levels(sorcery, "contact", -1,
              : 		3 * AST_TASKPROCESSOR_HIGH_WATER_LEVEL);
We should be able to use the default congestion levels for the sorcery/contact serializer with this rewrite.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_configuration.c
File res/res_pjsip/pjsip_configuration.c:

https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_configuration.c@44
PS1, Line 44: 	/*! \brief AORs that we should react to */
            : 	char *aors;
aors no longer referenced


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_configuration.c@1115
PS1, Line 1115: int ast_sip_persistent_endpoint_publish_contact_state(const char *endpoint_name, const struct ast_sip_contact_status *contact_status)
Return void.  The only caller does not care about the return value.  It only wants to publish the contact status for each endpoint it knows about.  What could any caller do about a failed publish anyway?


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c
File res/res_pjsip/pjsip_options.c:

https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@102
PS1, Line 102: 	/*! \brief The names of the AORs feeding this compositor */
             : 	struct ao2_container *aors;
             : 	/*! \brief The number of AORs that are available */
             : 	unsigned int available;
             : 	/*! \brief The name of the endpoint */
Keeping the total available aor count correct is fragile.

More thought on structure updates from observers is needed.  I think it would be much simpler for the compositor's aor container to also keep the last sip_options_aor available status contribution to make the endpoint compositor available status layer and the sip_options_aor available status layer more independent.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@122
PS1, Line 122: 	/*! \brief The endpoint state compositors we are feeding */
             : 	AST_VECTOR(, struct sip_options_endpoint_state_compositor *) compositors;
Does it make sense for multiple endpoints to reference the same aor?
A vector is not needed if it doesn't.  Though, we might have two endpoints temporarily while a config loads.

Add a note that the vector holds a ref to each compositor


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@128
PS1, Line 128: 	/*! \brief If true authenticate the qualify if needed */
             : 	int authenticate_qualify;
authenticate_qualify value is not needed.  We send OPTIONS not receive them.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@364
PS1, Line 364: 	if (!sip_options_contact_statuses) {
             : 		sip_options_contact_statuses = ao2_container_alloc_hash(AO2_ALLOC_OPT_LOCK_MUTEX, 0, CONTACT_BUCKETS,
             : 			sip_contact_status_hash, NULL, sip_contact_status_cmp);
             : 		if (!sip_options_contact_statuses) {
             : 			return NULL;
             : 		}
             : 	}
Put this into a helper routine so the container is always created the same way by the two locations that can create it.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@479
PS1, Line 479: 	AST_VECTOR_FREE(&aor_options->compositors);
Should add an assert here to ensure that the vector is empty.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@485
PS1, Line 485: static int sip_contact_hash(const void *obj, const int flags)
sip_contact_hash replaced by int ast_sorcery_object_id_hash(const void *obj, int flags)


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@506
PS1, Line 506: static int sip_contact_cmp(void *obj, void *arg, int flags)
sip_contact_cmp replaced by int ast_sorcery_object_id_compare(void *obj, void *arg, int flags)


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@613
PS1, Line 613: 		if (ao2_container_count(endpoint_state_compositor->aors)) {
It maybe better if we search for the aor name in the aors container rather than seeing if the container is empty.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@927
PS1, Line 927: 	aor_options->serializer = ast_sip_create_serializer_named(tps_name);
Need to use ast_sip_create_serializer_group_named() to create the serializer


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@979
PS1, Line 979: 	ast_statsd_log_string_va("PJSIP.contacts.states.%s", AST_STATSD_GAUGE,
             : 		"+1", 1.0, ast_sip_get_contact_status_label(contact_status->status));
Should remove this.  Nothing brings the REMOVED contact status statsd guage down except asterisk shutting down.  Temporary contacts from rewrite_contact increase this count.  (Pre-existing)


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1043
PS1, Line 1043: 	int new)
Change new to is_new to avoid using C++ keywords for variables.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1055
PS1, Line 1055: 	existing_permanent_contacts = ao2_container_clone(aor_options->permanent_contacts, 0);
> It seems like this could be expensive?  It can be optimized away if (!aor->
The clone shouldn't be that expensive as there won't be many items in the container.

The container must exist as the sip_options_aor allocation fails if the container couldn't be created.

The locking concern is not a problem as manipulation of the sip_options_aor structure is done by the struct's serializer.  Only one thread can manipulate the struct instance at a time.  Besides, the container was intentionally created without a lock.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1071
PS1, Line 1071: 			ao2_find(existing_permanent_contacts, ast_sorcery_object_get_id(contact), OBJ_NODATA | OBJ_UNLINK);
Add OBJ_SEARCH_KEY to ao2_find here


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1103
PS1, Line 1103: 		aor_options->available = ao2_container_count(aor_options->dynamic_contacts) +
              : 			ao2_container_count(aor_options->permanent_contacts);
All the permanent and dynamic contact statuses need to be set to created/unknown.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1108
PS1, Line 1108: 		aor_options->available = 0;
All the permanent and dynamic contact statuses need to be set to created/unknown/unavailable so the OPTIONS ping will increment the available count.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1125
PS1, Line 1125: 		/* If there is still a qualify frequency then schedule this */
aor_options->qualify_frequency should be updated here to avoid sched thread fetching a partially set value.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1138
PS1, Line 1138: 	aor_options->qualify_frequency = aor->qualify_frequency;
aor_options->qualify_frequency should not be set here


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1139
PS1, Line 1139: 	aor_options->qualify_timeout = aor->qualify_timeout;
aor_options->qualify_timeout should be updated before rescheduling the OPTIONS qualify


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1140
PS1, Line 1140: 	aor_options->authenticate_qualify = aor->authenticate_qualify;
options->authenticate_qualify is set but not used


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1348
PS1, Line 1348: 			ast_sip_persistent_endpoint_update_state(ast_sorcery_object_get_id(endpoint), AST_ENDPOINT_OFFLINE);
This should be deferred until after the loop.  If we don't have a compositor at that point then set endpoint to offline


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1398
PS1, Line 1398: 	for (i = 0; i < AST_VECTOR_SIZE(&aor_options->compositors); ++i) {
The processing of the compositors vector should be done under the aor_options serializer.

Also we should reset the compositors vector after the loop or remove the elements in the loop.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1404
PS1, Line 1404: 		endpoint_state_compositor->aors = 0;
endpoint_state_compositor->aors is leaked here.  Need to ast_str_container_remove(endpoint_state_compositor->aors, aor_options->name).


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1405
PS1, Line 1405: 		endpoint_state_compositor->available = 0;
We should be decrementing the available count if this aor was available before.  The endpoint may have more than one aor contributor.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1422
PS1, Line 1422: 	if (endpoint_state_compositor->aors) {
Need to check if endpoint_state_compositor->aors is not an empty container instead


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1492
PS1, Line 1492: 	ast_sip_push_task_synchronous(management_serializer, sip_options_synchronize_task, &task_data);
ast_sip_push_task_synchronous() can fail.  If so then nothing cleans up the passed in task_data if needed.  Needs to be done throughout the file.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1512
PS1, Line 1512: 	for (; (aor = ao2_iterator_next(&it_aors)); ao2_ref(aor, -1)) {
The compositor lock needs to be held while finding the next compositor->aor since the compositor->aor container has no lock


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1557
PS1, Line 1557: static const struct ast_sorcery_observer endpoint_observer_callbacks = {
The order of observer callbacks between endpoint and aor objects is indeterminate as they are executed by different serializers.  The rebuilding and update of the status compositor reporting tree is problematic as a result.

The endpoint observers must rebuild the endpoint-aor links because they know the relationships.  If an added/updated endpoint references an aor that actually exists but doesn't have a sip_options_aor object yet then a stub object needs to be created until the aor observer callback can add/update the real object into the tree.

The aor observers need to add/update/delete the sip_options_aor objects and update their status to any linked endpoint compositors.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1660
PS1, Line 1660: 	for (i = 0; i < AST_VECTOR_SIZE(&aor_options->compositors); ++i) {
The processing of the compositors vector should be done under the aor_options serializer.

Also we should reset the compositors vector after the loop or remove the elements in the loop.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1723
PS1, Line 1723: 		/* If this is the first contact we need to schedule up qualification */
              : 		if ((ao2_container_count(task_data->aor_options->dynamic_contacts) + ao2_container_count(task_data->aor_options->permanent_contacts)) == 1) {
              : 			ast_debug(3, "Starting scheduled callback on AOR '%s' for qualifying as there is now a contact on it\n",
              : 				task_data->aor_options->name);
              : 			/* We immediately schedule the initial qualify so that we get reachable/unreachable as soon as possible.
              : 			 * Realistically since they pretty much just registered they should be reachable.
              : 			 */
              : 			task_data->aor_options->sched_id = ast_sched_add_variable(sched, 1, sip_options_qualify_aor,
              : 				ao2_bump(task_data->aor_options), 1);
              : 			if (task_data->aor_options->sched_id < 0) {
              : 				ao2_t_ref(task_data->aor_options, -1, "Cleanup failed scheduler add");
              : 				ast_log(LOG_ERROR, "Unable to schedule qualify for contacts of AOR '%s'\n", task_data->aor_options->name);
              : 			}
We need to push the start onto the management_serializer to avoid a race condition that could start the scheduler twice.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@1790
PS1, Line 1790: 		if ((ao2_container_count(task_data->aor_options->dynamic_contacts) + ao2_container_count(task_data->aor_options->permanent_contacts)) == 0) {
              : 			ast_debug(3, "Terminating scheduled callback on AOR '%s' as there are no contacts to qualify\n",
              : 				task_data->aor_options->name);
              : 			AST_SCHED_DEL_UNREF(sched, task_data->aor_options->sched_id, ao2_t_ref(task_data->aor_options, -1, "Delete scheduler entry ref"));
              : 		}
Stoping the scheduler here can deadlock.  Need to push the stopping onto the management_serializer to avoid deadlock and other race conditions with the scheduler.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2009
PS1, Line 2009: 		ast_sched_context_destroy(sched);
Any scheduled AOR qualify ping sip_options_aor ref is leaked here.  Need to push a stopping of all scheduled items task to the management_serializer.  Also need a shutdown flag.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2012
PS1, Line 2012: 	ast_sorcery_observer_remove(ast_sip_get_sorcery(), "contact", &contact_observer_callbacks);
              : 	ast_sorcery_observer_remove(ast_sip_get_sorcery(), "aor", &aor_observer_callbacks);
              : 	ast_sorcery_observer_remove(ast_sip_get_sorcery(), "endpoint", &endpoint_observer_callbacks);
Need to uninstall the observer callbacks before stopping the scheduler.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2015
PS1, Line 2015: 	ast_taskprocessor_unreference(management_serializer);
The management_serializer needs to be shutdown after the ao2_cleanup() of the global containers


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2016
PS1, Line 2016: 	ao2_cleanup(sip_options_aors);
              : 	ao2_cleanup(sip_options_contact_statuses);
              : 	ao2_cleanup(sip_options_endpoint_state_compositors);
Need to set these global containers to NULL after ao2_cleanup()


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2019
PS1, Line 2019: 	pjsip_endpt_unregister_module(ast_sip_get_pjsip_endpoint(), &options_module);
Insert the following before unregistering the module:
* Need to send a final flush task to the management_serializer here to destroy it.
* Need to wait for the management_serializer and the options_aor serializers to shutdown using ast_serializer_shutdown_group_join().
* Need to destroy the scheduler here


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2036
PS1, Line 2036: 	if (pjsip_endpt_add_capability(ast_sip_get_pjsip_endpoint(), NULL, PJSIP_H_ALLOW,
Need to create a shudown_group here using ast_serializer_shutdown_group_alloc()


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2065
PS1, Line 2065: 	ast_taskprocessor_build_name(tps_name, sizeof(tps_name), "pjsip/options/manage");
There is one and only one pjsip/options/manage serializer so no need to add a sequence number by calling ast_taskprocessor_build_name().

Just pass the string to ast_sip_create_serializer_group_named().  The name cannot conflict with an aor named "manage" options serializer because that serializer name has a sequence number appended while the management_serializer will not.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2066
PS1, Line 2066: 	management_serializer = ast_sip_create_serializer_named(tps_name);
Need to use ast_sip_create_serializer_group_named() to create the serializer.


https://gerrit.asterisk.org/#/c/7710/1/res/res_pjsip/pjsip_options.c@2087
PS1, Line 2087: 	sched = ast_sched_context_create();
              : 	if (!sched) {
              : 		ast_res_pjsip_cleanup_options_handling();
              : 		return -1;
              : 	}
              : 
              : 	if (ast_sched_start_thread(sched)) {
              : 		ast_res_pjsip_cleanup_options_handling();
              : 		return -1;
              : 	}
Need to start the scheduler before installing the observer callbacks and creating the management_serializer.  The callbacks depend on the management_serializer and scheduler being already setup.



-- 
To view, visit https://gerrit.asterisk.org/7710
To unsubscribe, visit https://gerrit.asterisk.org/settings

Gerrit-Project: asterisk
Gerrit-Branch: 13
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a5ebbfca9001dfe933eaeac4d3babd8d2e6f082
Gerrit-Change-Number: 7710
Gerrit-PatchSet: 1
Gerrit-Owner: Joshua Colp <jcolp at digium.com>
Gerrit-Reviewer: Corey Farrell <git at cfware.com>
Gerrit-Reviewer: Jenkins2
Gerrit-Reviewer: Joshua Colp <jcolp at digium.com>
Gerrit-Reviewer: Richard Mudgett <rmudgett at digium.com>
Gerrit-Comment-Date: Sun, 14 Jan 2018 20:28:08 +0000
Gerrit-HasComments: Yes
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-code-review/attachments/20180114/d97264e3/attachment-0001.html>


More information about the asterisk-code-review mailing list