2014-06-03

I just got done with a three-week thing I've dubbed "researchcation".
It's exactly what it sounds like: research + vacation.

It's hard for me to take time away from MediaGoblin right now and have
it still meet its goals as a project. On the other hand, there's a
lot that we have planned for the year ahead, and some of it I'm not
really prepared enough for to make optimal decisions on. In addition,
the last year and a half really hasn't given me much of a break at
all, and running a crowdfunding campaign (not to mention two over two
years) is really exhausting. (Not that I'm complaining about
success!)

I was feeling pretty close to burnout, but given how much there is to
get done, I decided to compromise on this break... instead of taking a
full-fledged vacation, I'd take a "researchcation": three weeks to
recharge my batteries and step away from the day-to-day of the
project. In the meantime, I'd work on some projects to prepare me for
the year ahead. A number of good things came out of it, though not
exactly the same things I expected coming in. But I think it was
worth the time invested.

My original plan going in was that I would work on two things:
something related to the Pump API and federation, and something
related to deployment. It turns out I didn't get around to the
deployment part, but working on the federation part was insightful,
though not in all the ways I anticipated. Though I've read the
Pump API document
and helped advise a bit on the design of
PyPump (not to take credit for
that; credit clearly belongs to Jessica Tallon, not me),
there's nothing quite like having a solid project to toss you into
things, and I wanted to take a non-MediaGoblin-codebase approach to
playing around with the Pump API.

I started out by hacking on a small project called PumpBus, which was
going to be a daemon that wrapped PyPump and exposed a D-Bus API. I
figured this would make writing clients easier (and even make it
possible to write an emacs client... yeah, I know). I got far enough
that I was able to post a message from emacs lisp, then decided
that what I was working on just wasn't that interesting and wasn't
teaching me much more than I already knew.
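
For flavor, here's roughly the shape PumpBus had in my head, sketched
from memory rather than pulled from the actual code (the bus name,
object path, and PostNote method are invented for the example; it
assumes dbus-python plus GLib for the main loop and an
already-authenticated PyPump instance):

```python
# Sketch only: a tiny session-bus service wrapping PyPump, so that
# clients (emacs included) only have to speak D-Bus.
import dbus
import dbus.service
from dbus.mainloop.glib import DBusGMainLoop
from gi.repository import GLib


class PumpBusService(dbus.service.Object):
    def __init__(self, bus, pump):
        dbus.service.Object.__init__(self, bus, "/org/pumpbus/PumpBus")
        self.pump = pump  # an authenticated pypump.PyPump instance

    @dbus.service.method("org.pumpbus.PumpBus",
                         in_signature="s", out_signature="b")
    def PostNote(self, content):
        # PyPump models a note as pump.Note(...); send() posts it.
        note = self.pump.Note(content)
        note.send()
        return True


def run_pumpbus(pump):
    DBusGMainLoop(set_as_default=True)
    bus = dbus.SessionBus()
    # Keep a reference to the bus name so it stays claimed.
    name = dbus.service.BusName("org.pumpbus.PumpBus", bus)
    PumpBusService(bus, pump)
    GLib.MainLoop().run()
```

An emacs client (or any other) would then just call PostNote over the
session bus instead of speaking OAuth and HTTP itself.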

Given that there was both a "research" and a "-cation" component to
this, I figured the risks of failure were low, so I'd up the challenge
of what I was working on a bit. I instead started working on
something I've dubbed Pydraulics:
a Python-powered implementation of the Pump API. If worst came to
worst, I'd at least learn a few things.

I decided from the outset to keep a few assumptions related to
Pydraulics:

The end goal would be to have something that provided interfaces for
object storage and retrieval... not to wrap the database itself, but
hopefully there would not be too many views, and maybe this could
happen on a per-view basis. This way you could easily wrap
Pydraulics around whatever application and use the storage/database
it's already using. That's the end goal. (I didn't get there ;))
There's a rough sketch of what such an interface might look like
below, after these assumptions.

I'd keep things simple database-wise: assuming you're not providing
your own interface, the default interface provided is Postgres +
SQLAlchemy only. There are some
new JSON-related features in Postgresql
that are pretty exciting and would be appropriate here.

I'd use this as an opportunity to think about MediaGoblin's codebase.
I decided I'd see how easy or hard it would be to split components
out of MediaGoblin, as I needed them, into something I dubbed
"libgoblin". For now, I'd allow this to be messy, but hopefully it
would give me a chance to think about what libgoblin should be.

I'd also use this as a chance to think about where MediaGoblin fits in
terms of recent developments in asynchronous Python coding.
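
As for that first assumption, the storage interface, here's roughly
the kind of thing I had in mind (entirely hypothetical names, since
again, I didn't get there):

```python
# Hypothetical sketch: the interface Pydraulics would ask a host
# application to provide, rather than owning the database itself.
class ObjectStorage(object):
    """Store and fetch activitystreams-style objects by id."""

    def get_object(self, object_id):
        """Return the stored object as a dict, or None."""
        raise NotImplementedError()

    def save_object(self, object_id, data):
        """Persist the object; data is a JSON-able dict."""
        raise NotImplementedError()


# A host application would subclass this, backing it with whatever
# storage it already uses; the Pump views would only ever talk to the
# interface (possibly a different one per view).
```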

So, what came out of it?

Turns out
SQLAlchemy does a nice job of making use of Postgres' built-in JSON support,
and early tests seemed to indicate that this choice would pay off
well; there's a small example of this below. (It left me wondering:
how hard would it be for someone to write a Python API-compatible
implementation of pymongo or something?)

I ended up spending a lot more time on the libgoblin side of things
than I expected. I didn't realize that MediaGoblin had become such
a self-contained microframework until this point. I wanted to port
the MediaGoblin OAuth views over to Pydraulics to save time, but it
turned out this required porting a significant amount of MediaGoblin
code over to libgoblin. I did get the OAuth views working though!

Asynchronous stuff turned out to be interesting to explore, and I'll
expand on what I've been thinking below.

I did end up getting a much, much stronger sense of the Pump API,
which of course was the main goal, though the implementation of that
is not yet complete.
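
To illustrate the Postgres JSON point from the first item above, here's
the kind of thing I was playing with (the model and column names are
just for the example, and it assumes a local PostgreSQL database you
can connect to):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class ActivityObject(Base):
    __tablename__ = "activity_objects"
    id = Column(Integer, primary_key=True)
    object_type = Column(String)
    data = Column(JSON)  # the raw activitystreams-style blob


engine = create_engine("postgresql:///pydraulics_test")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(ActivityObject(object_type="note",
                           data={"content": "Hello, pump!"}))
session.commit()

# Query into the JSON column itself; .astext compares the value as text.
notes = (session.query(ActivityObject)
         .filter(ActivityObject.data["content"].astext == "Hello, pump!")
         .all())
```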

Pondering asynchronous coding developments and
MediaGoblin/libgoblin/pydraulics turned out to be fruitful. Mostly
I have been looking at "what would it take for libgoblin to be
usefully integrated into asyncio?"

This turns out to be a bit more challenging than it appears at the
outset, for one reason: mg_globals. mg_globals is a pretty sad
design in MediaGoblin that I'd like to get rid of; basically, it
makes it easy to write functions that don't have to have the
database session, template environment, and friends passed into
them, because those are set at a global variable level. That
works (but is nasty) as long as you're not in a multithreaded
environment, but breaks as soon as you are. I recently
created a ticket reflecting as much,
suggesting switching over to
werkzeug context locals
(Flask makes heavy use of these). Werkzeug's hack is clever: it uses
thread locals so that even in a multi-threaded environment, the
objects you're accessing are still globals, but they're the right
globals.
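
Roughly, the werkzeug approach looks like this (the bind_request_state
helper is just mine for illustration):

```python
from werkzeug.local import Local, LocalProxy

_local = Local()


def bind_request_state(db_session, template_env):
    # Called at the start of handling a request, in whatever thread
    # happens to be handling it.
    _local.db = db_session
    _local.templates = template_env


# These look and act like module-level globals, but each resolves
# per-thread, so concurrent requests don't stomp on each other.
db = LocalProxy(lambda: _local.db)
templates = LocalProxy(lambda: _local.templates)
```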

But Werkzeug's solution is not good enough for integration with
asyncio, where you might be doing "asynchronous" behavior in the
same thread, suspending and resuming coroutines, coming back to
tasks, and so on. As such, it's almost guaranteed in this system that
you'll be clobbering the variables another task needs.

What to do? I did some research to see if anyone had ideas. It looks
like you could do such a thing with Task.current_task() in asyncio, and that would be
fairly equivalent. I think you'd need a careful implementation
though... if you're not paying close attention you might not attach
the right things to the right subtask, and that whole thing just
seems... fragile. But it's still a neat idea to play around with.
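
A minimal sketch of that idea, keying per-task state on the currently
running task (the helper names are mine, and this is the Python
3.4-era spelling; newer asyncio spells it asyncio.current_task()):

```python
import asyncio

# One context dict per running task.  (A real version would need to
# clean up after tasks finish.)
_task_contexts = {}


def current_context():
    task = asyncio.Task.current_task()
    return _task_contexts.setdefault(task, {})


@asyncio.coroutine
def handle_request():
    ctx = current_context()
    ctx["db"] = "a session for this task only"
    # Suspend and resume as much as you like; other tasks get their own
    # dict.  The fragile part: any subtask you spawn is a *different*
    # Task, so it won't see this context unless you hand it over.
    yield from asyncio.sleep(0)
```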

But here are some ideas that I think are neat when combined, related
to this problem:

The idea that the request or asyncio task is the main object that
you attach useful variables to, and you just pass that thing
around as a "universal context" like crazy. (The downside: what
happens when you aren't using asyncio, or don't need an http
request, like in a migration script?)

I like the idea of the application being multi-instance'able, and
then having requests and a local context as a layer on top of
that. So you've got the "instantiated application" layer, and on
top of that the "current request" layer, and at both levels you
can have variables attached (like the database engine on the
application level, but the database session on the request level).
That's an awesome distinction.

But you don't really know whether or not some bit of code is using an
asyncio task, a web request, or whatever to pass around. Here's the
thing though: it doesn't really matter most of the time. With rare
exceptions, you're just looking for $OBJECT.db or $OBJECT.templates or
something. You just need some kind of object you can tack attributes
on to.

So that's my idea in libgoblin/pydraulics: when you have an application
and you want to do something with it (handle a request, execute a task,
etc.), you tack stuff onto an object representing that unit of work.
Either create a fresh context object to tack stuff onto, or just start
tacking things onto an object you already have!

Currently, this looks roughly like the following sketch (aside from
gen_context and context.db, the names are illustrative):
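
```python
class Context(object):
    """A bare object to tack per-request / per-task stuff onto."""
    def __init__(self, app):
        self.app = app


class Application(object):
    """The instantiated-application layer: things shared by everyone."""
    def __init__(self, db_engine=None, template_env=None):
        self.db_engine = db_engine
        self.template_env = template_env

    def gen_context(self):
        """Make a fresh context for one unit of work (a request, a
        task, a script run), with its own database session and friends."""
        context = Context(self)
        context.db = self.make_db_session()
        context.templates = self.template_env
        return context

    def make_db_session(self):
        # Placeholder: the real thing would hand back a SQLAlchemy
        # session bound to self.db_engine.
        return None
```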

Anyway, simple enough. Then you have request.db available; or, if
you've just got a command line script, need the equivalent, and
already have your instantiated application, just run
application.gen_context(). Thus, for utilities that work with this
application and need a variety of instantiated things (the database,
the template engine, and so on) it's easy enough to just accept
"context" as the first argument of the function, then use context.db
and so on. (I've considered using just "c" or "ctx" instead of
"context" as the variable name, since it's so common and conflicts a
bit with template context and friends, though that's not very
explicit.) So, this seems good.
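
Continuing the sketch above, a utility then looks something like this
(describe_context here is purely illustrative):

```python
# A utility doesn't care whether its context came from a web request or
# a command line script; it just wants context.db and company.
def describe_context(context):
    return "db=%r, templates=%r" % (context.db, context.templates)


if __name__ == "__main__":
    # e.g. from a one-off maintenance script, using the Application
    # class sketched above:
    application = Application()
    context = application.gen_context()
    print(describe_context(context))
```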

At one point I got frustrated with the massive amount of porting to
libgoblin I was having to do and thought, "I really should probably
just use Django or Flask itself." However, I found that neither
framework really addresses the asyncio stuff I was dealing with above,
and once I had enough of libgoblin ported over, libgoblin-based
development became very fast and comfortable.

That said, working through those things took up enough time that I
didn't complete the Pump implementation. That's okay; I've got
enough to do what's required on my end for MediaGoblin (and we've got
good direction and help on the federation end this upcoming year,
where the most important thing is that I have a good understanding).
I still think Pydraulics is a pretty neat idea, and I may finish it,
though it'll be back-burnered for now.

However, libgoblin is something I'm likely to extract. I'm convinced
that MediaGoblin is at a point where it's stable enough to know what
works and what doesn't about the technical design, so that gives me a
good basis to know what to build from here. There are other
applications I'd like to build which should mesh nicely with
MediaGoblin but which really don't belong as part of MediaGoblin
itself; they'd end up as kind of hacky add-ons. Clearly this is not
the most important development, but towards the end of the summer, as
we hopefully get the Python 3 branch merged, I will be looking towards
this.

Aside from this, on the "-cation" end of things, I took some time to
relax and also reapproach my health. I may have a separate post on
that soon.

So, that's that. Overall it was productive, but again, not quite in
the ways I was expecting. I feel okay about that though... I wanted
to do some hacking and not feel deeply pressured or stressed about
it... if that weren't true, I think the "-cation" part wouldn't have
held up. So I feel okay that I wandered a bit; the other things I
worked on and figured out are important anyhow, and have me much
better prepared for the year ahead. Not to mention the most important
part: I feel pretty refreshed and capable of taking it on!

What's next for the coming week? Well, now that this
is all over, I'm organizing plans so we can get rewards out the door
and do project planning for the year ahead. We've got a lot of
promises to fulfill. Better get to it!
