Leehodgkinson.com

A stroll through the python-social-auth code

2015-03-28

This post contains a rather long-winded meander through a typical python-social-auth sign-in flow. What's the point? Well, personally I found the best way to really get to grips with how python-social-auth works was to just trace through the code, and if nothing else, this blog post gives me some notes to look back on in a few months when everything else has been forgotten. For the purpose of demonstration, I'll be taking as our example a google-plus sign-in flow using Django.

It begins

Like everything in Django, we enter the flow essentially at urls.py. Python-social-auth ("psa" from here on) evolved from django-social-auth, so of course it includes a django app in the source code, and the app is at social/apps/django_app/ (you will also see there are other apps here, such as the one for tornado). The great feature of psa is that the way the code is written makes plugging into different platforms and authentication backends very easy, by writing a new "strategy", "storage" and "backend", but at first glance this can also make the code a little more cryptic.

In this app, first point your text editor to urls.py. When a user initiates a login flow for google-plus they should be directed to the url social/login/google-plus/ (relative to wherever you included the psa urls file in your main django urls file of course). The string "google-plus" is captured by the regex capture group in the psa urls.py file and passed to the view (in views.py of the app) called "auth" (tip: this URL is assigned the name "begin" and can normally be accessed by reversing "social:begin" with django reverse, remembering that the psa namespace is "social").

Thus, we next turn to the view function auth in the views.py file. At first glance it doesn't seem to do very much except call another function. The captured string "google-plus" will be taken as the backend argument, along with the request, and we just call another function, do_auth (social/actions.py), with this backend and with the redirect_name (the name of the field specifying where to redirect after the login flow is finished, which is passed in the query string. Usually this field name is "next", and it is passed in the query string like ?next=/someurl/&). The value of this field determines where the user will finally be directed when the entire sign-up/in flow is over, as we will see.

The most interesting thing about this view is the psa decorator. The NAMESPACE param can be set by the user with the URL_NAMESPACE variable in their django settings.py file, but by default it is just the string "social", and so we find ourselves calling the decorator as @psa('social:complete'). This argument is the django name for the psa complete url; in other words, reverse("social:complete", args=("google-plus",)) would give the (relative) url social/complete/google-plus. This is going to be the redirect URI that we tell google to call back to (don't confuse this with the redirect we just discussed above, which sends the user some place after the whole flow is over!).

tip: you will actually set SOCIAL_AUTH_URL_NAMESPACE = ..., prepending "SOCIAL_AUTH_" as with all psa settings. The function setting_name of utils.py actually adds the SOCIAL_AUTH prefix to the given setting referenced in the source code.

Let's go to social/apps/django_app/utils.py to see how the psa decorator is defined, and how it works.

The psa decorator

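The psa decorator is defined as a function, psa(redirect_uri), that defines and returns an inner function, decorator(func), which in turn defines and returns a function wrapper(request, backend, *args, **kwargs).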

This is pretty dizzying code (functions within functions within...), but really it is just a common-or-garden "function decorator with arguments", and once you have understood one of them, you will understand this too. The psa wrapper is going to build and return a decorator (the clue is in the name!). Now is a good time to revise some Python 101 decorators!

Aside: Inside the decorator function being built, you see the
@wraps(func) line, and so you need to understand what the functools
wraps does. See here for some help with that understanding, but
in brief the wraps utility solves the problem of func losing
certain information, such as its docstring, __name__ attribute,
number of arguments taken etc, when it has been replaced
by the wrapper function the decorator returns. If this is not clear,
see the very readable stack-exchange link I mentioned. It's not
terribly important you understand what wraps does to understand what
psa does, so don't worry if it is still not clear, you can
continue...
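To see what wraps buys you, here is a minimal sketch comparing a decorator written without it and one written with it. The names plain, polite and the two greet functions are made up for illustration and have nothing to do with psa.

```python
from functools import wraps


def plain(func):
    # Without wraps: the returned function keeps its own metadata.
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def polite(func):
    # With wraps: __name__, __doc__ etc. are copied over from func.
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain
def greet_plain():
    """Say hello."""
    return "hello"


@polite
def greet_polite():
    """Say hello."""
    return "hello"


print(greet_plain.__name__, greet_plain.__doc__)    # wrapper None
print(greet_polite.__name__, greet_polite.__doc__)  # greet_polite Say hello.
```

Both versions behave identically when called; only the metadata that tools like debuggers and doc generators look at differs.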

If you remember well, decorators in python take a function, here called func, and use it to build and return another function, here called wrapper (since it wraps func with some extra code). This wrapper function is returned and it replaces the original function. The notation of writing @decorator on the line above the definition of func is just shorthand for writing func = decorator(func) straight after the definition; they are completely equivalent. So in the psa code, the decorator we are building, called "decorator", if used would take the function, "func", and replace it with the function they call "wrapper".

Ignoring what the wrapper function actually does for now, the first question you are probably asking here (especially if previously you have only worked with basic decorators in python) is "why do we seem to have a decorator within a decorator?". Why did we not just define psa itself as a plain decorator, taking func and returning wrapper, and use it as a bare @psa above the view?

The answer is to do with the fact we wanted it to be a function decorator that could take an argument. All the extra layer of function embedding does is allow us to feed the decorator the redirect_uri argument (the callback for google); other than that it just returns the decorator anyway (see Decorator functions with Decorator arguments). The decorator with arguments should return a function that will take a function and return another function, i.e. it should really return a normal decorator. In code terms, an outer function taking arg defines and returns real_decorator, which takes func and returns wrapper. If we apply it using the shorthand, writing @outer(arg) above the definition of func, this translates to func = outer(arg)(func).

At this stage the arg is available inside the real_decorator, but otherwise we have the usual decorator thing going down.
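A small runnable sketch of the pattern may help. Apart from real_decorator, func and wrapper, the names here (decorator_with_args, shorthand, longhand) are invented for the example. It shows that the @ form and the explicit double call give the same result, and that arg is visible inside wrapper through the closure.

```python
from functools import wraps


def decorator_with_args(arg):
    # Outer layer: only exists to capture arg in a closure.
    def real_decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # arg is visible here thanks to the closure.
            return "{0}:{1}".format(arg, func(*args, **kwargs))
        return wrapper
    return real_decorator


@decorator_with_args("pre")
def shorthand():
    return "body"


def longhand():
    return "body"


# Exactly what the @ line above does, spelled out by hand.
longhand = decorator_with_args("pre")(longhand)

print(shorthand())  # pre:body
print(longhand())   # pre:body
```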

To cut a long story short, psa is a "function decorator with arguments", it replaces the function it acts on by another function that runs some pre-code before running the original code of the function it replaced.

Wrapper

After all that, what does this replacement function "wrapper" actually do then? Well first we note it has access to all the same arguments and keyword arguments that func did. It takes out specifically the request and backend arguments for convenience, and takes the rest of the args and kwargs via the syntax *args, **kwargs.

The argument redirect_uri that we called psa with in the first place (remember in our Django google-plus walkthrough this is the string "social:complete" that Django can reverse to get the google-plus auth complete URI) is reversed with the backend name as its argument, i.e. uri = reverse(redirect_uri, args=(backend,)), so uri will literally read something like (relative to the rest of your Django urls) social/complete/google-plus. This will eventually be passed to google as the callback uri.

Next the wrapper loads the strategy (see the aside below), and sets the social_strategy attribute of the request with the result. Similarly for the backend. Finally, like most decorators, it calls the original function and returns its result, func(request, backend, *args, **kwargs).
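Putting the pieces of this section together, the wrapper's job can be sketched roughly as follows. This is not the psa source: fake_reverse, fake_load_strategy, fake_load_backend and FakeRequest are hypothetical stand-ins so the example runs without django, and the request.backend attribute name is my assumption.

```python
from functools import wraps


class FakeRequest(object):
    """Hypothetical stand-in for django's HttpRequest."""


def fake_reverse(name, args=()):
    # Stand-in for django's reverse(); it only knows one URL name.
    routes = {"social:complete": "/social/complete/{0}/"}
    return routes[name].format(*args)


def fake_load_strategy(request):
    return "strategy-object"


def fake_load_backend(strategy, name, redirect_uri):
    return {"name": name, "strategy": strategy, "redirect_uri": redirect_uri}


def psa_like(redirect_uri=None):
    def decorator(func):
        @wraps(func)
        def wrapper(request, backend, *args, **kwargs):
            uri = redirect_uri
            if uri:
                uri = fake_reverse(uri, args=(backend,))
            # Hang the strategy and backend on the request for the view.
            request.social_strategy = fake_load_strategy(request)
            request.backend = fake_load_backend(
                request.social_strategy, backend, uri)
            # Call the original view and hand back whatever it returns.
            return func(request, backend, *args, **kwargs)
        return wrapper
    return decorator


@psa_like("social:complete")
def auth(request, backend):
    return request.backend["redirect_uri"]


print(auth(FakeRequest(), "google-plus"))  # /social/complete/google-plus/
```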

Aside: But what is a "strategy" anyway?
In essence a strategy is a collection of methods that will be needed by psa, implemented for a specific platform, e.g. django. So let's say psa needs to build an absolute uri. We know that in Django that is done with request.build_absolute_uri(path), so the Django strategy has a "build_absolute_uri" function that does it in just that way, whereas on another platform, webpy, we have to implement build_absolute_uri so it's specific to that platform: web.ctx.protocol + '://' + web.ctx.host + path. Having strategies like this for each platform means that the psa main code is written to work across lots of platforms like django, tornado and webpy with a universal code body, i.e. to build an absolute uri just set your strategy, and then do strategy.build_absolute_uri('/somepath'), and the function will be implemented according to your platform. Very nice. Similarly, the concept of "storage" is a clever way to allow universal code to run cross platform without worrying about the individual details of storage specific to each platform (e.g. now storage.user.create_user translates to UserSocialAuth.create_user(...) in django, where this is now a django model using the django orm).
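The strategy idea can be illustrated with a toy version. Everything below (the Fake* classes, callback_url, the example.com host) is hypothetical and only mimics the shape described in the aside; the real strategies have many more methods.

```python
class BaseStrategy(object):
    def build_absolute_uri(self, path=None):
        raise NotImplementedError("Implement in subclass")


class FakeDjangoRequest(object):
    # Mimics the one HttpRequest method we need.
    def build_absolute_uri(self, path=None):
        return "https://example.com" + (path or "")


class FakeDjangoStrategy(BaseStrategy):
    def __init__(self, request):
        self.request = request

    def build_absolute_uri(self, path=None):
        # Django requests already know how to do this.
        return self.request.build_absolute_uri(path)


class FakeWebpyStrategy(BaseStrategy):
    def __init__(self, protocol, host):
        self.protocol, self.host = protocol, host

    def build_absolute_uri(self, path=None):
        # webpy-style: glue protocol, host and path together by hand.
        return self.protocol + "://" + self.host + (path or "")


def callback_url(strategy):
    # "Universal" code: it never asks which platform it is running on.
    return strategy.build_absolute_uri("/somepath")


print(callback_url(FakeDjangoStrategy(FakeDjangoRequest())))
print(callback_url(FakeWebpyStrategy("https", "example.com")))
```

Both calls go through the same callback_url function; only the strategy object passed in decides how the URI is actually built.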

load_strategy and load_backend

This does what it says on the tin, using the get_strategy function of social.strategies.utils to import the strategy and storage classes from their dotted paths.

Here the "strategy" argument is STRATEGY, which by default is just the string 'social.strategies.django_strategy.DjangoStrategy' (unless you've changed it in your django settings.py by setting SOCIAL_AUTH_STRATEGY), and storage is STORAGE, which by default is social.apps.django_app.default.models.DjangoStorage. Once imported, the Strategy is initialized with the Storage (this initialization just involves setting the storage attribute of the strategy to the storage we passed in) and then returned.
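Loading a class from a dotted-path string like the ones above is a common Python trick. The sketch below shows one way to do it with importlib; import_by_path and FakeStrategy are names I made up, and collections.OrderedDict is used purely as an importable stand-in for the storage class so the example runs without psa installed.

```python
from importlib import import_module


def import_by_path(dotted_path):
    # "package.module.ClassName" -> the ClassName object.
    module_name, attr = dotted_path.rsplit(".", 1)
    return getattr(import_module(module_name), attr)


class FakeStrategy(object):
    # Hypothetical strategy: initialising it just stores the storage.
    def __init__(self, storage):
        self.storage = storage


# A stdlib class stands in for the storage dotted path from settings.
Storage = import_by_path("collections.OrderedDict")
strategy = FakeStrategy(Storage)

print(strategy.storage.__name__)  # OrderedDict
```

In psa both the strategy and the storage come from such strings, which is what lets you swap either one from settings.py without touching the library code.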

As for load_backend: you guessed it, it just initializes and returns the backend. AUTHENTICATION_BACKENDS is set by the user in settings.py as a list of backends to try in order, and this in turn sets the BACKENDS parameter in apps/django_app/utils. The load_backend function takes the strategy we just loaded, the backend name which was captured from the URL and passed to the auth view ('google-plus' for us), and the redirect uri that we constructed, called uri. The get_backend function takes this BACKENDS list, plus the name of the one we actually want ("google-plus"), and returns the appropriate Backend class. Finally, we initialize this Backend by setting the strategy attribute to the strategy object and the redirect_uri attribute to the redirect uri we have for the callback, and it is returned.
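As a rough illustration of the lookup-by-name step, here is a sketch that works on classes directly rather than on dotted-path strings. The Fake* backend classes, the *_sketch function names and the local MissingBackend exception are all hypothetical stand-ins.

```python
class MissingBackend(Exception):
    """Local stand-in for a 'no such backend' error."""


class FakeGooglePlusAuth(object):
    name = "google-plus"

    def __init__(self, strategy=None, redirect_uri=None):
        self.strategy = strategy
        self.redirect_uri = redirect_uri


class FakeFacebookAuth(FakeGooglePlusAuth):
    name = "facebook"


def get_backend_sketch(backend_classes, name):
    # Index the configured backends by their name attribute.
    by_name = dict((cls.name, cls) for cls in backend_classes)
    try:
        return by_name[name]
    except KeyError:
        raise MissingBackend(name)


def load_backend_sketch(strategy, name, redirect_uri, backend_classes):
    Backend = get_backend_sketch(backend_classes, name)
    return Backend(strategy, redirect_uri)


BACKENDS = [FakeFacebookAuth, FakeGooglePlusAuth]
backend = load_backend_sketch(
    "strategy-object", "google-plus", "/social/complete/google-plus/", BACKENDS)
print(type(backend).__name__, backend.redirect_uri)
```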

actions.py do_auth

Now, with the detour through the psa decorator out of the way, we continue with the google-plus sign-in flow at do_auth of actions.py.

The data is grabbed from the request object (a QueryDict in django) via the strategy's request_data function.

Using the django strategy, the request_data function is implemented in a way that simply checks if the request method is 'GET' or 'POST' and returns either the request.GET or request.POST parameters accordingly. The exception is when merge is enabled, in which case the data from GET and POST is merged and returned.
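The behaviour just described could be sketched like this, using plain dicts instead of django's QueryDict. FakeRequest and request_data_sketch are illustrative names, and letting POST win on a key clash is my assumption about the merge order.

```python
class FakeRequest(object):
    # Hypothetical stand-in for an HttpRequest with GET/POST dicts.
    def __init__(self, method, GET=None, POST=None):
        self.method = method
        self.GET = GET or {}
        self.POST = POST or {}


def request_data_sketch(request, merge=True):
    if merge:
        data = dict(request.GET)
        data.update(request.POST)  # assumed: POST wins on a key clash
        return data
    if request.method == "POST":
        return request.POST
    return request.GET


req = FakeRequest("GET", GET={"next": "/home/"}, POST={"code": "abc"})
print(request_data_sketch(req, merge=False))  # {'next': '/home/'}
print(sorted(request_data_sketch(req).items()))
```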

Some of this data is then saved into the session.

backend.setting means get the setting through the strategy, passing the current backend along with the setting 'name' (see strategies/base.py). The inclusion of the backend means that instead of looking for SOCIAL_AUTH_FIELDS_STORED_IN_SESSION we actually look for a backend specific setting, SOCIAL_AUTH_GOOGLE_PLUS_FIELDS_STORED_IN_SESSION. This is all done with the setting_name function (which is also the function responsible for prefixing all the settings with "SOCIAL_AUTH"). The function get_setting will be the strategy specific way of getting settings that must be implemented per platform, e.g. in django_strategy.py.

In other words, a user can set SOCIAL_AUTH_GOOGLE_PLUS_FIELDS_STORED_IN_SESSION in their django settings.py, and this will force psa to store in the session these fields and their associated data if they are present in the request data.
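The name-building logic can be sketched as below. The function names carry a _sketch suffix because this is my reconstruction rather than the library code, and falling back to the generic SOCIAL_AUTH_... name when no backend specific value exists is an assumption on my part.

```python
SETTING_PREFIX = "SOCIAL_AUTH"


def setting_name_sketch(*names):
    # Join the non-empty parts, upper-cased, with dashes made underscores.
    parts = (SETTING_PREFIX,) + names
    return "_".join(p.upper().replace("-", "_") for p in parts if p)


def get_setting_sketch(settings, name, backend_name=None, default=None):
    # Try the backend specific name first, then the generic one (assumed).
    candidates = [
        setting_name_sketch(backend_name, name),
        setting_name_sketch(name),
    ]
    for candidate in candidates:
        if candidate in settings:
            return settings[candidate]
    return default


settings = {"SOCIAL_AUTH_GOOGLE_PLUS_FIELDS_STORED_IN_SESSION": ["key"]}
print(setting_name_sketch("google-plus", "FIELDS_STORED_IN_SESSION"))
print(get_setting_sketch(settings, "FIELDS_STORED_IN_SESSION", "google-plus", []))
```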

Next we look for the redirect_name (remember by default this is the string "next") in the data, and if it's there we grab its value and save it as redirect_uri. (NB there is an important distinction to be made here: this redirect_uri is grabbed from the "next" parameter in the GET string on a request if present, e.g. "social/login/google-plus/?next=/after_the_signin_view&", and it is used to redirect the user to some URL when the entire sign-in flow is over. Earlier in the psa wrapper we saw that another redirect_uri, which for the auth view was a URL identified by the reverse of "social:complete", was used in load_backend when initializing the Backend to set the attribute redirect_uri of the backend. The latter will be passed to google during the auth request as the URL to call back to with the auth code.)

If the user has opted in their settings to sanitize redirects, then the sanitizing function tests the redirect uri grabbed from the next parameter in the query string, by making sure that 1) it's a valid uri and 2) if the uri is absolute and has a hostname, this hostname matches what the hostname should be.
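A check along those lines might look like the following Python 3 sketch. sanitize_redirect_sketch is a hypothetical name and the logic is only my reading of the two conditions above, not the library's implementation.

```python
from urllib.parse import urlparse


def sanitize_redirect_sketch(host, redirect_to):
    """Return redirect_to if it looks safe to use, else None."""
    if not redirect_to or not isinstance(redirect_to, str):
        return None
    try:
        # Relative URLs have no netloc, so fall back to our own host.
        netloc = urlparse(redirect_to).netloc or host
    except ValueError:
        return None
    return redirect_to if netloc == host else None


print(sanitize_redirect_sketch("example.com", "/after_the_signin_view"))
print(sanitize_redirect_sketch("example.com", "https://evil.test/phish"))
```

The point of the hostname comparison is to stop an attacker crafting a login link whose next parameter bounces the freshly signed-in user to a site they control.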

We then store this URI in the session (again using the django strategy associated with the backend to make sure the appropriate session setting logic is run). The session will now have a next key whose value is the redirect_uri.

Finally we run

backend.start()

The start method is implemented in backends/base.py in the BaseAuth class, which is the parent class for the OAuthAuth class, which in turn is a parent to BaseOAuth2, which finally is a parent to GooglePlusAuth (along with BaseGoogleOAuth2API).

It's a very short method that runs through the following steps:

Clean partial pipeline: we will discuss partial pipelines later, but for now this just pops the key "partial_pipeline" from the session (see strategies/base.py).

Tests if backend uses a redirect: in backends/base.py this is a function that just returns True, and it is not overridden in oauth.py or google.py, so the answer is yes, we do use a redirect.

Builds an auth_url: (see backends/oauth.py and BaseOAuth2) First in the auth_url building comes the get_or_create_state. State is the OAuth state parameter that according to the spec can be passed to google and will be returned by google (it's useful for things like CSRF tokens to validate requests). For google-plus the STATE_PARAMETER and REDIRECT_STATE are both None (why?), so the state is always just set to None. The function auth_params contributes next, adding the client_id (the client_secret is fetched alongside it but is not put into the URL) and the response_type (which for us will be "code"). It also sets the redirect_uri for google to call back to (ultimately using the redirect_uri attribute of the backend that the psa decorator set). Further we get scope arguments; ultimately this relies on get_scope of backends/google.py, which adds any scopes the user has defined in settings.py to the default scopes (e.g. "https://www.googleapis.com/auth/userinfo.email" to grab a user's email). The auth_extra_arguments line parses extra authentication arguments the user may have given in settings.py with the AUTH_EXTRA_ARGUMENTS setting (e.g. you could use this setting to pass something like {'access_type': 'offline'}). Finally this dictionary of params is url encoded and appended to self.authorization_url(), which returns self.AUTHORIZATION_URL (see backends/oauth.py) and which for google-plus is defined:

AUTHORIZATION_URL = 'https://accounts.google.com/o/oauth2/auth'

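Ultimately we have built an auth url that consists of the AUTHORIZATION_URL followed by a query string carrying the client_id, redirect_uri, response_type, scope and any extra arguments. We redirect to it through the strategy's redirect method, which the django strategy translates into the appropriate django logic, i.e. a normal django redirect.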
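To make the shape of that URL concrete, here is a sketch that assembles one with the standard library. auth_url_sketch and all the parameter values (client id, host) are placeholders, and the exact set of parameters psa sends may differ from this. The client secret is left out on purpose, since in OAuth2 it travels with the later token request and not in the browser redirect.

```python
from urllib.parse import urlencode, urlparse, parse_qs

AUTHORIZATION_URL = "https://accounts.google.com/o/oauth2/auth"


def auth_url_sketch(client_id, redirect_uri, scopes, extra=None):
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": " ".join(scopes),  # space separated list of scopes
    }
    params.update(extra or {})
    # urlencode takes care of escaping the values for the query string.
    return AUTHORIZATION_URL + "?" + urlencode(params)


url = auth_url_sketch(
    "my-client-id",
    "https://example.com/social/complete/google-plus/",
    ["https://www.googleapis.com/auth/userinfo.email"],
    extra={"access_type": "offline"},
)
print(url)
```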

apps/django_app/views.py complete

Because of the redirect_uri that we sent to Google in the GET params of the auth request, Google should call back to the reversal of "social:complete" after the user has signed in and consented, along with an "auth code".

The psa wrapper does its magic again, loading the request with the strategy, backend, and again setting the redirect uri to "social:complete" (i.e. we callback to the same view) ... and we call do_complete.

do_complete

The data is grabbed from the request, and we next call user_is_authenticated(user). This function basically just calls the is_authenticated() method if available and returns the boolean result. The line user = is_authenticated and user or None means that if is_authenticated is True, the user object will be assigned to user, else user will be assigned None (since if is_authenticated is False there is no need to check the user obj for truth, and python logic jumps to the second branch of the "or"). On the first login we expect the user object to be AnonymousUser, and so here user is set to None.
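The "A and B or C" idiom is easy to try out in isolation. FakeUser and resolve_user below are made-up names, and is_authenticated is modelled as a method as the text describes.

```python
class FakeUser(object):
    # Hypothetical user whose is_authenticated() answer we control.
    def __init__(self, authenticated):
        self._authenticated = authenticated

    def is_authenticated(self):
        return self._authenticated


def resolve_user(user):
    is_authenticated = bool(user) and user.is_authenticated()
    # Old-style conditional expression: "A and B or C".
    return is_authenticated and user or None


anonymous = FakeUser(False)
member = FakeUser(True)
print(resolve_user(anonymous))        # None
print(resolve_user(member) is member)  # True
```

The idiom only misbehaves when B itself is falsy; in modern code the same thing is usually spelled user if is_authenticated else None.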

Next we come to the partial_pipeline_data: this function pops the partial_pipeline var from the session, and if it exists calls partial_from_session on it (see strategies/base.py and then pipeline/utils.py).

We will come to the pipelines and partial pipelines later, so for now do not worry too much about this. On the first pass, with no partial pipelines involved, the next step of the code will be backend.complete.

backend.complete

If not partial, we run backend.complete, which as backends/base.py shows, is just a proxy for auth_complete(*args, **kwargs) (see backends/google.py). We make sure that either the access token or the code param is present, else we raise an exception for the missing param. We attempt to grab the access token (this won't be present on the first pass). Note that the redirect uri is not used for a callback by google at this stage: we just issue a simple request/response and the user is not sent off anywhere, but google requires the redirect uri as an extra security consistency check.

Assuming we don't have the access token yet we want to use our auth code to make an access token request:

where for google-plus:

the headers are (see backends/oauth.py)

the auth params are the same as those defined for BaseOAuth2 in oauth.py, but also include the access token should it exist, e.g.

With those ingredients, we turn to the request_access_token code. In backends/oauth.py this is just a wrapper for get_json, and in backends/base.py we see this is just a wrapper for self.request(.....).json().
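For orientation, the body of an access token request in the standard OAuth2 authorization code grant carries the fields below. This is a sketch based on the OAuth2 spec rather than on the psa source; access_token_params_sketch and every value passed to it are placeholders.

```python
def access_token_params_sketch(code, client_id, client_secret, redirect_uri):
    # Standard OAuth2 "authorization code" grant fields.
    return {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        # Must match the redirect_uri sent in the original auth request;
        # the provider checks it but does not redirect anyone here.
        "redirect_uri": redirect_uri,
    }


params = access_token_params_sketch(
    "auth-code-from-google",
    "my-client-id",
    "my-client-secret",
    "https://example.com/social/complete/google-plus/",
)
print(sorted(params))
```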

The request method (base.py) is nothing complicated: it just uses python requests and the ingredients to actually make the request to google on the provided URL, with the provided method, headers etc., and does some Exception catching. Before returning the response, we call response.raise_for_status(), which just raises an HTTPError if we get a 4xx or 5xx response code; you can read more in the python-requests docs.

After getting this JSON response, we check it for errors (we are back in backends/google.py auth_complete now), which involves calling process_error (see backends/oauth.py). This checks the response data for the 'error' key, and if it exists and has the value 'denied' or 'access_denied' we raise the exception AuthCanceled (the user bailed on us), or otherwise just a simple AuthFailed.
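The error check described above boils down to something like the following sketch. The exception classes are local stand-ins defined just for this example, not imports from psa, and process_error_sketch is a hypothetical name.

```python
class AuthException(Exception):
    """Local stand-in base class for auth errors."""


class AuthFailed(AuthException):
    pass


class AuthCanceled(AuthException):
    pass


def process_error_sketch(data):
    error = data.get("error")
    if not error:
        return  # nothing to complain about
    if error in ("denied", "access_denied"):
        # The user declined on the provider's consent screen.
        raise AuthCanceled(error)
    raise AuthFailed(error)


for payload in ({"access_token": "abc"}, {"error": "access_denied"},
                {"error": "invalid_grant"}):
    try:
        process_error_sketch(payload)
        print("ok")
    except AuthException as exc:
        print(type(exc).__name__)
```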

If there are no errors, we finally call do_auth with the access token (that we should now have if there were no errors) and the rest of the response.

See backends/oauth.py for the do_auth code. It involves first using our access token to grab some user data (see backends/google.py BaseGoogleOAuth2API), ensuring we use the correct endpoint, e.g. https://www.googleapis.com/plus/v1/people/me. Remember that get_json, as we discussed above, is really just a request with .json() applied to the response. In do_auth we add whatever user data we grabbed back to the response, and update the kwargs with the new response.

With this user info injected into the response, and in turn the response injected into the kwargs, we run the strategy's authenticate method, which for django just means using the usual django authenticate (from django.contrib.auth).

Django authenticate

This method (see django/contrib/auth/__init__.py) simply loops through the list of backends supplied in the django settings.py, trying the authenticate method of each, one-by-one from top to bottom with the credentials provided, until one accepts and returns a user. Upon getting a user the backend attr is set accordingly and the user is returned.
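The loop just described can be mimicked in a few lines. This is a simplified sketch of the idea and not django's actual code: authenticate_sketch and the Fake* classes are hypothetical, and details such as how django records the backend path are glossed over.

```python
class FakeUser(object):
    backend = None


class FakePasswordBackend(object):
    def authenticate(self, **credentials):
        # Only handles username/password style credentials.
        if "password" not in credentials:
            return None
        return FakeUser()


class FakeSocialBackend(object):
    def authenticate(self, **credentials):
        if credentials.get("backend") != "google-plus":
            return None
        return FakeUser()


def authenticate_sketch(backends, **credentials):
    for backend in backends:
        user = backend.authenticate(**credentials)
        if user is None:
            continue  # this backend does not handle these credentials
        # Remember which backend vouched for the user.
        user.backend = "{0}.{1}".format(
            type(backend).__module__, type(backend).__name__)
        return user
    return None


user = authenticate_sketch(
    [FakePasswordBackend(), FakeSocialBackend()],
    backend="google-plus", response={})
print(user.backend.rsplit(".", 1)[-1])  # FakeSocialBackend
```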

The authenticate method for the google-plus backend

See backends/base.py. All that happens here is that we check the kwargs backend key matches "google-plus" and if not we return None (which would cause django authenticate to continue its search for a matching backend for these creds). For us the google-plus backend matches and we grab the pipeline (this just involves grabbing the list of function path strings from the SOCIAL_AUTH_PIPELINE of settings.py), set the user as not new by default, then if the pipeline_index kwarg is present (we've already run some of the pipeline and are up to a given index in the list), we truncate the list from that point, before calling the method pipeline with the remaining list (for us on the first pass the index is non-existent, so zero is assumed as the starting point).

self.pipeline

This just runs the pipeline, checking afterwards that the output of this bit of the pipeline is a dict. NB if it's not a dict the output is simply returned to the user, which is useful to know, since combined with the partial decorator that we'll meet later it means that we can effectively pause the pipeline at some stage while we return an HttpResponse, like a form or some such, to the user for more input (this could be used to gather extra data from the user during registration, for example).

If the pipeline just returned a dict then the user is grabbed from the dict, the attrs social_user and is_new are set accordingly, and the user is returned.
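A stripped-down pipeline runner shows the two behaviours discussed here: dict outputs are accumulated and fed to later steps, while any non-dict output is handed straight back. This sketch takes callables directly instead of dotted-path strings, and run_pipeline_sketch and the three step functions are invented for the example.

```python
def run_pipeline_sketch(pipeline, pipeline_index=0, **kwargs):
    out = kwargs.copy()
    steps = pipeline[pipeline_index:]
    for idx, func in enumerate(steps, start=pipeline_index):
        out["pipeline_index"] = idx
        result = func(**out) or {}
        if not isinstance(result, dict):
            return result  # e.g. an HTTP response: hand it straight back
        out.update(result)
    return out


def details_step(response, **kwargs):
    return {"details": {"email": response["email"]}}


def uid_step(details, **kwargs):
    return {"uid": details["email"]}


def form_step(**kwargs):
    return "<form asking for more data>"


out = run_pipeline_sketch(
    [details_step, uid_step], response={"email": "a@example.com"})
print(out["uid"])  # a@example.com
print(run_pipeline_sketch([details_step, form_step, uid_step],
                          response={"email": "a@example.com"}))
```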

After the pipeline has run, the flow continues in actions.py do_complete, and involves checking that the final user obtained is of the correct type specified by the user in settings.py with USER_MODEL. We also pop the redirect_value from the session that was set earlier, under the session key specified by the redirect_name variable.

If authenticated, redirect to the post flow url, else try to login using the login function provided (remember that the complete view passed the _do_login function (apps/django_app/views.py) to do_complete for this purpose, which just wraps around django login). That essentially finishes the flow.

The pipeline

The power of psa comes from the fact it uses a pipeline and by virtue of this the sign-in flow can be highly customized. Here I do a brief run down of the default pipeline:
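For reference, the default pipeline corresponds to a setting along the following lines. I am writing this list from memory of the psa docs of this era, so treat the exact dotted paths as an assumption and check them against the version you have installed.

```python
# Assumed default pipeline for this generation of python-social-auth;
# verify against your installed version before relying on it.
SOCIAL_AUTH_PIPELINE = (
    'social.pipeline.social_auth.social_details',
    'social.pipeline.social_auth.social_uid',
    'social.pipeline.social_auth.auth_allowed',
    'social.pipeline.social_auth.social_user',
    'social.pipeline.user.get_username',
    'social.pipeline.user.create_user',
    'social.pipeline.social_auth.associate_user',
    'social.pipeline.social_auth.load_extra_data',
    'social.pipeline.user.user_details',
)
```

Customizing the flow is then a matter of copying this tuple into settings.py and inserting, removing or reordering entries.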

social_details

See pipeline/social_auth.py. Usually each pipeline function will return a dict, whose content will be added to the "out" dict, the combined dicts of all the pipeline functions so far. The social details function is no exception, and it returns a dict whose 'details' key holds the output of backend.get_user_details(response).

For google-plus, this get_user_details func is implemented in backends/google.py in the BaseGoogleAuth class. All it really does is parse the response for keys like 'email', 'username', 'fullname', 'first_name', etc...

social_uid

For g+ this function is implemented in backends/google.py in the BaseGoogleAuth class. If the user has set USE_UNIQUE_ID in settings.py, then the id from the response is used (for g+ this is known as the "sub id"), else the "email" that was just added to the 'details' key by the previous pipeline function is used as the user's id.

auth_allowed

Calls backend.auth_allowed(response, details) raising an exception if the answer is False.

As you can see in backends/base.py, this allows the psa user to use a whitelist of emails/domains, allowing only those emails/domains on the whitelist to sign up. By default there is no whitelist and anyone can sign up...

social_user

psa uses the UserSocialAuth model (apps/django_app/default/models.py), which is Foreign Key'd to the user model, and stores certain extra info related to social authentication of the user: for example the provider, e.g. 'google-plus', and the uid, which must be unique together with the provider. It also carries methods for doing various things like create_user (follow the DjangoMixin hierarchy back), and finally, via the storage/base.py UserMixin, storage of the access and refresh tokens, methods to do a refresh of the token etc...

The social user pipeline function tries to get this UserSocialAuth object and set it to social. If it exists, we check that the user it is associated with matches our user, and we add it to the dictionary under the key 'social'.

get_username

Creates a username for the user, based on settings, if one doesn't already exist.

create_user

If the user is already present, just return {'is_new': False}. Otherwise find out which fields a user should have (things like username, email) and, as long as some fields exist, pass them to strategy.create_user, returning is_new as True (since we created a new user) in the dict along with the new user.

The strategy's create_user method is found in strategies/base.py and is just a wrapper for storage.user.create_user, which ultimately comes from apps/django_app/default/models.py DjangoStorage, which inherits from the DjangoUserMixin which has the create_user method implemented. It just calls whatever create user method your user model itself defines.

associate_user

If the user exists but the social auth user does not exist, the first task is to create the UserSocialAuth object to be linked with this user. Remember backend.strategy.storage is the DjangoStorage model of apps/django_app/default/models.py and the user attribute refers to the UserSocialAuth class. The UserSocialAuth class inherits from the DjangoUserMixin (see storage/django_orm.py) the create_social_auth class method (so cls in this method refers to UserSocialAuth), so ultimately all we really do is call something that translates to UserSocialAuth.objects.create..., to create a new social auth user. If no exceptions are raised we return a dict containing the new social object, its user, and a flag marking this as a new association, which will be appended to the out dictionary. The purpose of the UserSocialAuth model is to store things like access tokens and other credentials.

load_extra_data

This function does what the name suggests, and there is not much to discuss. For google-plus the function is defined in backends/oauth.py and just adds the access token to the data, plus if user has set EXTRA_DATA this is also grabbed and added.

user_details

This is also pretty straightforward. For each name and value in the details dict we get the current attribute of the user under the 'name' key. If there is no current value and this name attribute is not disabled from updating (i.e. not in the SOCIAL_AUTH_PROTECTED_FIELDS setting list), then we update the attribute to the new value, making a note of the change under the "changed" bool of the user.
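The update rule above can be sketched as a small function. update_user_details_sketch and FakeUser are hypothetical names, the protected tuple is passed in by hand rather than read from settings, and skipping None values is my own simplification.

```python
class FakeUser(object):
    # Hypothetical user with one empty and one filled-in field.
    first_name = ""
    email = "old@example.com"


def update_user_details_sketch(user, details, protected=()):
    changed = False
    for name, value in details.items():
        if value is None or name in protected:
            continue
        # Only fill in attributes that currently have no value.
        if not getattr(user, name, None):
            setattr(user, name, value)
            changed = True
    return changed


user = FakeUser()
changed = update_user_details_sketch(
    user,
    {"first_name": "Lee", "email": "new@example.com"},
    protected=("email",),
)
print(changed, user.first_name, user.email)  # True Lee old@example.com
```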

Extra: Partial pipelines

The partial decorator is a simple decorator. Instead of just running a function in the pipeline, wrapping the function with the partial decorator first calls the function as usual, then decides what to do based on the response. If your function returns a dictionary then its output dict is returned like any regular pipeline function and the pipeline trundles on just as if you hadn't bothered using the decorator at all, but if your function returned something other than a dict (say, for example, an HttpResponse like a form to collect more user data) then the pipeline up to this point is stored in the session temporarily while the response is returned to the user. The kwargs etc are cleaned and returned as a dict, stored as values, including the pipeline index as 'next', and this is then stored in the session.

Remember that in actions.py do_complete I deferred discussion of the partial pipeline logic. Now when partial_pipeline_data (implemented in utils.py) is called, we find we do have a partial_pipeline key set in the session, so we call partial_from_session to get all the stuff we stored in the session for the pipeline that had run up to this point, such as the user, social, kwargs etc, and importantly next, which will be the same pipeline index as it was before (we didn't increment it). This means partial does exist and we run backend.continue_pipeline(*xargs, **xkwargs), which (backends/base.py) just wraps around the backend authenticate method, which for google as we saw earlier is implemented also in backends/base.py and involves running the pipeline at the appropriate pipeline index.

Ultimately all this means is that we have a way to pause the pipeline whilst we do something like gather user data. The pipeline up to that point is kept in the session, and the current pipeline function runs as many times as necessary until the response returned is a dict (i.e. you can keep returning something like a form, or a form with errors as many times as you need, and on success you return a dict to make the pipeline continue where it left off).
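The pause-and-resume idea can be demonstrated with a toy decorator. This is a heavily simplified sketch: partial_sketch and require_email are made-up names, a plain dict plays the role of the session, and the session and pipeline index are passed in explicitly, whereas psa gets them from the strategy and the pipeline kwargs.

```python
from functools import wraps


def partial_sketch(func):
    @wraps(func)
    def wrapper(session, pipeline_index, *args, **kwargs):
        out = func(*args, **kwargs) or {}
        if not isinstance(out, dict):
            # Pause: remember where we are so this same step runs again.
            session["partial_pipeline"] = {
                "next": pipeline_index,
                "kwargs": kwargs,
            }
        return out
    return wrapper


@partial_sketch
def require_email(email=None):
    if not email:
        return "<form asking for an email>"
    return {"email": email}


session = {}
first = require_email(session, 3)
print(first, session["partial_pipeline"]["next"])

# Later, once the user has submitted the form, the same step is re-run.
second = require_email(session, 3, email="a@example.com")
print(second)
```

Because the stored index is not incremented, the paused step itself runs again on resume, which is what lets it keep returning a form until the input is acceptable.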