2016-03-21

Introduction

One of the most common challenges facing those operating production SharePoint environments is the “missing playbook”. Even for deployments where operational service management (OSM) skills are strong, it is impossible to deliver a quality operational service without the playbook. It’s generally pretty uncommon for practitioners to factor OSM considerations into the design, or at least to do so well. Indeed, in many cases it is impossible to do so completely, as so much about the environment will not be known or understood prior to broad platform adoption.

Whilst the playbook is imperative for any system, there is no better product than SharePoint to highlight this. The issue is that the vast majority of SharePoint deployments don’t have one. And that’s a shocking, if unsurprising, state of affairs.

This article uses a worked example to demonstrate the importance of the playbook, how to define it, and then how to create the run book for operations staff. To demonstrate I will focus on everyone’s second favourite SharePoint Service Instance, the delightful and simple-to-manage Distributed Cache! :) I’ve picked this one as it’s relatively simple but comes with a catalogue of gotchas, and, thanks to SharePoint 2016, it now also has some version variance.

The Playbook

I work a lot with customers in troubleshooting situations, or otherwise in a reactive capacity when faults or issues crop up in the environment. In virtually all cases these could have been avoided by better OSM. The vast majority of deployments simply don’t have any. That’s a tough situation to resolve, because fixing it involves a lot of time and energy and probably doesn’t fit into the structure defined by outsourcing agreements, company procedures and so on. And the one thing that is virtually always non-existent is the playbook.

But what is the playbook? Simply put it’s the definition of what must be done in a given situation. A classic example of not following the playbook with SharePoint is deleting a Site Subscription before deleting the Site Subscription Profile Configuration. In this case you end up with orphaned data inside the UPA which cannot be removed in a supported fashion. Sounds simple, but the problem is the playbook isn’t documented so hundreds of people have made this mistake. In this example the playbook would include details of the order in which things should be deleted when you want to delete a Site Subscription. Note that the playbook doesn’t define how to perform the tasks. That’s a run book. More on run books later.

SharePoint is a complex product. When you add to that fact poor product implementation in some areas, missing and woeful documentation, poor management APIs, poor Windows PowerShell cmdlet implementation and a staggering amount of misinformation and worst practices out there across the interwebz, the importance of the playbook should be clear. All products have strange behaviours and things to be aware of and accommodate. In addition, the sheer variance in the size and topology of customer farms means the playbook remains a significant challenge. It’s okay if we need to do some slightly whacky things to manage an environment. However, if we are not given the information on what must be done, we can’t do it.

It’s somewhat interesting how often Microsoft’s marketing folks talk about how their experience running SharePoint Online can be baked into the on-premises product for everyone else to benefit from. That is a good thing, no doubt about it. However, what would be of vastly more benefit to on-premises customers would be for Microsoft to document some of their operational service management lessons and some of their playbook.

Now we are comfortable with the idea of the playbook, let’s apply it to a worked example.

Changing the Distributed Cache Service Account

This is a perfect example of how not to do things! We have a fairly straightforward requirement. To change the service account used by the Distributed Cache Service. We may want to do this because we made a mistake when deploying the farm originally, or perhaps we are rationalising service accounts. It’s a fundamental operational service procedure, the problem is it’s not actually as straightforward as it should be.

I’m also using this example as it’s something that none of the common scripting solutions out there accommodate, and even the superb xSharePoint doesn’t yet provide. It’s also something that I am forever being asked about. So it’s an excuse to get the scripts out there.

In SharePoint 2013 when we either create the farm, or join a server to the farm, we can use the -SkipRegisterAsDistributedCacheHost switch. We should do this. If we don’t, every server in the farm will run Distributed Cache and we’ll need to fix that up. So using the switch we have zero servers in the farm running Distributed Cache.
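As a sketch of how the switch is used when joining a server to the farm (the database, server and passphrase values below are placeholders, not from the original text):

```powershell
# Join a server to an existing SharePoint 2013 farm without
# registering it as a Distributed Cache host.
Add-PSSnapin Microsoft.SharePoint.PowerShell

Connect-SPConfigurationDatabase -DatabaseName "SP_Config" `
    -DatabaseServer "SQL01" `
    -Passphrase (ConvertTo-SecureString "farm passphrase" -AsPlainText -Force) `
    -SkipRegisterAsDistributedCacheHost
```

The same switch is available on New-SPConfigurationDatabase when creating the farm.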

However, due to the “special” nature of this service instance, we cannot change the service account until at least one machine is running it. Thus we have to start it on a server. We can’t do that from Services on Server, nor with Start-SPServiceInstance, because the service instance does not yet exist in the farm. We must run Add-SPDistributedCacheServiceInstance (on a server we wish to run Distributed Cache). This will take around two minutes to do its thing.
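This step is a one-liner, run on the server that should become the first Distributed Cache host:

```powershell
# Creates and provisions the Distributed Cache service instance on this
# server. Expect it to take around two minutes.
Add-PSSnapin Microsoft.SharePoint.PowerShell
Add-SPDistributedCacheServiceInstance
```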

However, it will always run as the Farm Account by default. It’s effectively baked into the Provision() method of the service instance and there is no possibility to change it prior to starting the service instance on one machine. This is, frankly, crap. Especially so given that SharePoint’s built-in health analyser will warn us about running services as this account. The product’s default behaviour is to configure defaults at odds with its own deployment best practice. In fact this behaviour is so crappy that Microsoft has addressed it in SharePoint 2016. More on that later.

Great, so now we have Distributed Cache up and running on a server. Let’s go ahead and change its account. As pretty much everyone knows, we can’t do this via Configure Service Accounts. Even though it will allow us to select the service and then a Managed Account, as soon as we hit OK we’ll see this delightful screen.



This, folks, is just lame. We all understand how easy it would be to filter the service out of the previous screen. And what’s up with “Sharepoint Powershell commandlets”? Yes, something went wrong indeed. So wrong that my Live Writer spell checker is all red and squiggly right now. The three minutes it would take to change this resource string would be well spent.

But no matter, we can easily find the details on how to do this with Windows PowerShell; it’s even documented by Microsoft over at Manage the Distributed Cache Service in SharePoint 2013. Sure, the all-on-one-line thing is a bit silly, but you’ll find this snippet all over the web in various forms. Thus the first simple piece of the playbook. The service account change must be performed using Windows PowerShell.
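For reference, this is the Microsoft-documented pattern, reformatted from one line for readability (the account name is a placeholder for an existing Managed Account):

```powershell
# Change the Distributed Cache service account - the pattern from the
# official documentation, split across lines.
Add-PSSnapin Microsoft.SharePoint.PowerShell

$farm = Get-SPFarm
$cacheService = $farm.Services | Where-Object { $_.Name -eq "AppFabricCachingService" }
$account = Get-SPManagedAccount -Identity "DOMAIN\svcDistCache"
$cacheService.ProcessIdentity.CurrentIdentityType = "SpecificUser"
$cacheService.ProcessIdentity.ManagedAccount = $account
$cacheService.ProcessIdentity.Update()
$cacheService.ProcessIdentity.Deploy()   # mandatory; without it the change never reaches the cache host
```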

However, this script will not work unless you run it on a machine hosting, and running, the Distributed Cache service instance. A pretty important detail that is not mentioned. Strictly speaking, it will complete just fine if the machine isn’t hosting Distributed Cache, or if it is but the service isn’t started. But it will leave your Cache Cluster broken. There are also a number of other considerations about using this ‘pattern’ to change the service account, which we’ll cover later. For now, the second and third parts of the playbook. The service account change must be performed on a server which runs Distributed Cache. And the service account change must take place on a server with the Distributed Cache service running.

Let’s assume we are building a fresh farm and we have joined all our servers to the farm using -SkipRegisterAsDistributedCacheHost, and then added one Distributed Cache server using Add-SPDistributedCacheServiceInstance. At this point we should change the service account using the above script. Then, when we add the second, third and subsequent Distributed Cache servers, they will pick up the correct identity. This is an important factor in the playbook of topology deployment. Let’s say we have an array of servers and we loop through it to provision the correct service instances. We need to build the service account change into that loop. Or be willing to change it later across all servers, if we want to maintain the beauty of our scripts! Thus the fourth part of the playbook. For new SharePoint 2013 farms, change the service account after deploying the first Distributed Cache server, and before deploying the additional Distributed Cache servers.

Of course, if we care about availability (note I said availability, not high availability) we need more than one Distributed Cache server. Actually there really isn’t much point to a “distributed” cache if it’s only on one server. But we really should have three. Yes. Three. Not two. Three. AppFabric, the real software that Distributed Cache simply wraps, uses a cluster quorum model. This means that three hosts are the minimum optimal configuration. You will get the best performance and reliability from three or more servers. Further, if you only have two, you will hit issues when attempting to gracefully shut down any single server. None of that is important to the playbook for changing the service account, but it is extremely important more generally.

Ok, so we’ll have three servers in our farm running Distributed Cache. When we deployed our farm we did as above and changed the service account after adding the first. Thus all three machines have the correct identity and life is good. However, in the future we may need to change the service account again. Now what?

We could go ahead and run that script again, right? Wrong. It will never work. If you try to run that script on a server in a multi-server Distributed Cache cluster, it will blow up in your face with a TCP port 22234 is already in use error:



What will also have happened is that the server you ran the script on will no longer be part of the Cache Cluster, and you will now have only two servers in the cluster. The other two servers, by the way, are still using the old service account. You have just broken your Distributed Cache by running the script provided in the official product documentation. Nice.



What makes this worse, is that many experts will tell you when you see this to run another script provided to “repair” a broken Cache Host, which actually means deleting it. And that’s just bad advice, period. The reality is you should have never tried to change the account like this in a multi server Distributed Cache setup.

Now, out there on the interwebz there is a virtual ton of misinformation about the little account change script. Much of it claims that the last line, which calls Deploy() on ProcessIdentity, is not required. It absolutely is required. If you don’t call Deploy() things will not work. Just because some Windows PowerShell completes without error does not mean it has worked! This is especially true with SharePoint PowerShell!

We can run the script without calling Deploy() but it won’t have any effect and will leave our Cache Cluster broken. A broken Cache Cluster is not good. It’s broken! :) What that means is items won’t be put in it. Bad!

There are many things we could try, such as stopping the service instances, running the service account script, then starting the service instances. That will appear to work, and the AppFabric service will be running as the correct account. However, the Service Status of each Cache Host will be UNKNOWN. And that means the Distributed Cache will crash within five minutes and it won’t be working.

Similarly, we could unprovision the service instance (yes, that’s different from merely stopping it), run the service account script, then provision the service instances. The exact same result. The AppFabric service will be running as the new account, but the service status of each Cache Host will be UNKNOWN. Your cluster is broken.

The bottom line is that any such attempts will fail, even though on first glance they seem to have worked. If the service status of your cache cluster hosts reports UNKNOWN, your Distributed Cache is broken and items are not being cached.

Obviously, this is no good. It’s not part of the playbook to have changed the account, but be left with a broken Cache Cluster!

What we need to do is actually pretty simple. Yup, that’s right: we must remove the second and third servers, so that we only have one. Then run the account change script. Then we can add back the other two servers. This is the way to change the service account for Distributed Cache when you have more than one server in the Cache Cluster. Not documented anywhere else until now. Makes you wonder how many broken Cache Clusters are out there, huh! Or rather, just how few people are bothered by the wrong account being used.
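As a sketch of that sequence (server names are placeholders, and PowerShell Remoting is assumed to be configured):

```powershell
# Multi-server sequence: remove all but one cache host, change the
# account on the remaining host, then re-add the removed hosts.
Add-PSSnapin Microsoft.SharePoint.PowerShell

$extraHosts = @("SERVER2", "SERVER3")

foreach ($server in $extraHosts) {
    Invoke-Command -ComputerName $server -ScriptBlock {
        Add-PSSnapin Microsoft.SharePoint.PowerShell
        Remove-SPDistributedCacheServiceInstance
    }
    Start-Sleep -Seconds 120   # allow the cluster port to free up
}

# ...now run the documented account change script on the remaining host...

foreach ($server in $extraHosts) {
    Invoke-Command -ComputerName $server -ScriptBlock {
        Add-PSSnapin Microsoft.SharePoint.PowerShell
        Add-SPDistributedCacheServiceInstance
    }
}
```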

The fifth part of the playbook. The service account change must take place when only a single server is in the Cache Cluster.

But wait, there is more. If you run the service account change script and it takes less than about eight minutes, something is wrong. Because what Deploy() is doing (and why it’s mandatory) is effectively exporting the configuration, removing the server as a Distributed Cache host, applying the account change, and then re-adding it with the same configuration. Whilst the PowerShell looks like the same “pattern” you would use to change any other service account in SharePoint, Deploy() is overridden inside the Distributed Cache service instance with its own special sauce. One of the things it also does is stop the AppFabric Cache Host, and it can often fail at that point yet fall through regardless, causing an error. This is another cause of the TCP port 22234 is already in use error. Bad code. Not tested thoroughly enough.

There is no other pattern we can use. We can’t stop or unprovision the service instance, as that will leave our Cache Cluster borked. We could choose to make use of the AppFabric cmdlets, but alas doing so is unsupported, so that’s not a road we can travel down. We shouldn’t be “hacking” it; we have to tell both SharePoint and AppFabric about the change, and to do that we should use the mechanisms provided by the product, whilst accepting and working around their liabilities. We only have one choice to change the identity. Now, this has some pretty serious implications. We are effectively removing all cache hosts. What that means is the sixth part of the playbook. The service account change will result in data loss as all cache hosts must be removed.

This is a significant planning consideration – i.e. when will this change take place – and plan on it not needing to be done often! It also means there is absolutely no point whatsoever in doing a “graceful” shutdown of any server, because we need to shut them all down.

There’s also one more very important thing to note here. We have no choice but to run the account change script when a single server is running Distributed Cache. Because of this, any previous change we have made to the Cache Size will be thrown away on the machines we remove before running the script. Let’s say, for example, we previously changed our Cache Size to 1200MB. Once we have changed the account and re-added the removed servers, our Cache Host configuration will look like this:

As you can see, only one server has our desired Cache Size. The other two servers have the default size (which in this example is 410MB on servers with 8GB RAM). Microsoft strongly recommend that the Cache Size be the same for all servers in the Cache Cluster, and that is why Update-SPDistributedCacheSize sets it that way. I, on the other hand, state it as a requirement. I’ve seen way too many messed up Distributed Caches to mince my words on this particular topic. Therefore, we need to fix up the Cache Size once the account change has been made. The seventh part of the playbook. The service account change requires that any cache size configuration be re-applied.

Doing this is actually very straightforward (in comparison to the service identity change at least). We need to stop all the Distributed Cache service instances, change the Cache Size, and finally restart all the Distributed Cache hosts. This time stopping and starting is entirely acceptable: we do NOT need to unprovision and provision, nor do we need to remove and add any hosts.
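A sketch of that stop/resize/start sequence, using the 1200MB example from earlier:

```powershell
# Stop every Distributed Cache service instance in the farm, apply the
# new size, then start them all again.
Add-PSSnapin Microsoft.SharePoint.PowerShell

$instances = Get-SPServiceInstance | Where-Object {
    $_.TypeName -eq "Distributed Cache"
}

$instances | Stop-SPServiceInstance -Confirm:$false

# Applies the size to all cache hosts in the cluster.
Update-SPDistributedCacheSize -CacheSizeInMB 1200

$instances | Start-SPServiceInstance
```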

All good. Well, not all good. But it’s what it is. What about the recently released SharePoint 2016?

Well, the good news is that they have refined the service instance plumbing so that the Distributed Cache service instance exists prior to a server being added as a Distributed Cache host. This means we can set the service identity before we ever add a Distributed Cache host. This works both when using MinRole (by specifying a -ServerRole of DistributedCache) and when using -ServerRoleOptional. Therefore, we can change the order of our farm provisioning solution to configure the service account before we join any additional machines to the farm. If we are using MinRole, the first server in the farm should be of role Web or Application, which also means -SkipRegisterAsDistributedCacheHost is no longer necessary. If using -ServerRoleOptional then we can continue to use -SkipRegisterAsDistributedCacheHost.

However, there is an important detail. In this scenario, we do not call Deploy() on the ProcessIdentity! If we do, it will fail with a CacheHostInfo is null error, which of course is expected at this point since no Cache Host exists yet. This is the one time that not calling Deploy() is OK. Our service identity is updated, and when we later add a Distributed Cache host that identity will be picked up. This is actually one of the best side benefits of MinRole. Another part of our playbook. For new SharePoint 2016 farms, the service account should be changed after farm creation, and before deploying any Distributed Cache servers.
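A sketch of the SharePoint 2016 pre-deployment variant (the account name is a placeholder):

```powershell
# SharePoint 2016: set the identity after farm creation but before any
# Distributed Cache host exists.
Add-PSSnapin Microsoft.SharePoint.PowerShell

$farm = Get-SPFarm
$cacheService = $farm.Services | Where-Object { $_.Name -eq "AppFabricCachingService" }
$cacheService.ProcessIdentity.CurrentIdentityType = "SpecificUser"
$cacheService.ProcessIdentity.ManagedAccount = Get-SPManagedAccount -Identity "DOMAIN\svcDistCache"
$cacheService.ProcessIdentity.Update()
# Do NOT call Deploy() here - with no cache host yet, it fails with
# "CacheHostInfo is null". The identity is picked up when a host is added.
```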

The not-so-good news is that changing the service account at a later date, once the Distributed Cache service is provisioned, remains exactly the same as with SharePoint 2013. This makes total sense.

Phew. Hopefully now the importance of the playbook is obvious. Even though what we are doing is relatively simple, the devil is in the detail. Our playbook for this operational aspect is as follows.

The Distributed Cache service account change…

must be performed using Windows PowerShell

must be performed on a server which runs Distributed Cache

must take place on a server with the Distributed Cache service running

For new SharePoint 2013 farms, the service account should be changed after deploying the first Distributed Cache server, and before deploying the additional Distributed Cache servers

must take place when only a single server is in the Cache Cluster

will result in data loss as all cache hosts must be removed

requires that any cache size configuration be re-applied

For new SharePoint 2016 farms, the service account should be changed after farm creation, and before deploying any Distributed Cache servers

If we didn’t have this playbook, we’d have no good chance of creating the run book, or the scripts to implement it. Because we have it, we can produce a run book and scripts much more easily: we have the essential details, and we won’t waste time thrashing out hacks that semi-work or have environment specifics hard-wired into them.

The Run book

Given our playbook above we can easily produce a run book and its implementation. We know it needs to be Windows PowerShell, and we also know that we will either be required to run everything on a server running Distributed Cache, or to use Windows PowerShell Remoting. From an OSM perspective we really should have adopted Remoting, as it’s such a common requirement within SharePoint to be able to run commands on a specific server. In addition, many organisations have a machine in the farm for Shell Access.

Now, the implementation choice is a lifestyle preference. We could use an uber-script, a set of advanced functions, or Desired State Configuration. It’s not the objective of this article to get into that discussion, and there is no single right answer for every situation. I normally try to avoid plumbing in my PowerShell examples, as it can often distract from the key concepts being described. In this case though, we really should be using Remoting. This is, after all, an article with OSM as its focus. I assume you are comfortable with the setup and configuration required for Remoting.

My manager tells me this stuff needs to be run by any monkey they hire in ops! Just kidding, I work for myself :) But I’m inventing a couple of sensible requirements here. I want a modular solution which I can run from a “controller” script. I also want my solution to work no matter how many Distributed Cache servers I have in the Farm. To do this, I’m going to break out the key functionality I need into three scripts:

Set-DistributedCacheIdentity.ps1 – runs on a Distributed Cache host and configures the service account

Remove-DistributedCache.ps1 – removes a server as a Distributed Cache host

Add-DistributedCache.ps1 – adds a server as a Distributed Cache host

I also need another three scripts to deal with the Cache Size:

Update-DistributedCacheSize.ps1 – runs on a Distributed Cache host and sets the cache size

Stop-DistributedCacheCluster.ps1 – stops all Distributed Cache service instances in the farm

Start-DistributedCacheCluster.ps1 – starts all Distributed Cache service instances in the farm

Obviously this could be done with “less”, but by building tools rather than scripts we have a far more robust and reusable (superior) end result. There are a bunch of additional things to build into production scripts, such as exception handling, but in order to keep this focused on the task at hand I’ve neglected that stuff (well, that’s my excuse and I’m sticking to it!). Let’s take a look at the “code”:

Set-DistributedCacheIdentity.ps1

All I do here is a little bit of semi-defensive “programming”: if the machine we are running on has a bundle of open connections on the cluster port, then when we call Deploy() it will blow up, so I avoid that by exiting. If something fails during the identity change, we add the removed hosts back into the cluster. It doesn’t cover all bases, but it’s better than nothing in the event of failure here. This leaves the configuration as it was (except for the cache size, which will need to be re-set).
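The original listing isn’t reproduced here; a minimal sketch of what such a script might contain, based on the description above (the parameter name and port check are my assumptions):

```powershell
# Hypothetical sketch of Set-DistributedCacheIdentity.ps1.
param([Parameter(Mandatory)][string]$ManagedAccountName)

Add-PSSnapin Microsoft.SharePoint.PowerShell

# Semi-defensive check: if connections are open on the cluster port,
# Deploy() will blow up, so bail out now.
$open = Get-NetTCPConnection -LocalPort 22234 -ErrorAction SilentlyContinue
if ($open) {
    Write-Warning "Open connections on the cluster port; try again later."
    exit 1
}

$farm = Get-SPFarm
$cacheService = $farm.Services | Where-Object { $_.Name -eq "AppFabricCachingService" }
$cacheService.ProcessIdentity.CurrentIdentityType = "SpecificUser"
$cacheService.ProcessIdentity.ManagedAccount = Get-SPManagedAccount -Identity $ManagedAccountName
$cacheService.ProcessIdentity.Update()
$cacheService.ProcessIdentity.Deploy()   # expect this to take ~8 minutes
```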

Remove-DistributedCache.ps1

Dead simple, this one. It simply removes the host from the Cache Cluster, then we wait around two minutes for the port to free up. This is something we should do to ensure later actions play nice.
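Based on that description, the body could be as small as this sketch:

```powershell
# Hypothetical sketch of Remove-DistributedCache.ps1: remove this server
# from the Cache Cluster, then wait for the cluster port to free up.
Add-PSSnapin Microsoft.SharePoint.PowerShell
Remove-SPDistributedCacheServiceInstance
Start-Sleep -Seconds 120
```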

Add-DistributedCache.ps1

As simple as it gets, we just add the server as a Distributed Cache server. Just like when we first created the Farm topology.

Update-DistributedCacheSize.ps1

This one stops all Distributed Cache service instances in the farm, sets the Cache Size and then restarts the service instances. It calls the following two scripts, which may be useful in other situations as well.

Stop-DistributedCacheCluster.ps1

Start-DistributedCacheCluster.ps1

With these six scripts I can use a “controller” script Update-DistributedCacheServiceIdentity.ps1, to wrap everything together. I also have another script that reports on the status of the AppFabric cluster.

This basically checks if I am on a Farm with a single Distributed Cache server, in which case I simply update the account. Otherwise I implement the required model: remove all but one Distributed Cache server, update the account, and add back the removed servers.
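A hypothetical sketch of that controller logic, assuming the helper scripts sit in the same folder and Remoting is configured (script names come from the text; parameters and the account name are illustrative):

```powershell
# Hypothetical sketch of Update-DistributedCacheServiceIdentity.ps1.
Add-PSSnapin Microsoft.SharePoint.PowerShell

$cacheHosts = @(Get-SPServiceInstance |
    Where-Object { $_.TypeName -eq "Distributed Cache" } |
    ForEach-Object { $_.Server.Name })

if ($cacheHosts.Count -le 1) {
    # Single-server cluster: change the account in place.
    .\Set-DistributedCacheIdentity.ps1 -ManagedAccountName "DOMAIN\svcDistCache"
}
else {
    # Multi-server cluster: remove all but one host, change the account
    # on the remaining host, then add the removed hosts back.
    $toRemove = $cacheHosts | Select-Object -Skip 1
    foreach ($server in $toRemove) {
        Invoke-Command -ComputerName $server -FilePath .\Remove-DistributedCache.ps1
    }
    Invoke-Command -ComputerName $cacheHosts[0] `
        -FilePath .\Set-DistributedCacheIdentity.ps1 `
        -ArgumentList "DOMAIN\svcDistCache"
    foreach ($server in $toRemove) {
        Invoke-Command -ComputerName $server -FilePath .\Add-DistributedCache.ps1
    }
}
```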

Now I can also add another layer to integrate with the Cache Size script and the AppFabric status script and have some fancy admin inputs. For example:

Here’s a sample output from running the above to change the service account:

Note the all-important status of Up for each Cache Host. Also note how long it takes to do everything in a Farm with three Distributed Cache servers – around 16 minutes.

Note: I have a DSC version of this tooling and it took less than an hour to create from the above. Whilst DSC is certainly *the* way to go, it is not yet ready for prime time with SharePoint and of course has considerable platform dependencies. The point here is if you make modular tools, it’s a snap to move them over to DSC resources.

Now, with these assets my run book can be:

Log on to a SharePoint Server with a Shell Admin account

Run Update-DistributedCacheServiceIdentity.ps1, providing the name of the Managed Account and Credentials

Run Update-DistributedCacheSize.ps1, providing the Cache Size and Credentials

Pretty simple. Now in the real world that run book would have a bunch of other things such as “don’t be running this whilst Windows Server Automatic Maintenance (TIWorker.exe) is running” and so on, but they are entirely superfluous to the concepts being discussed here.

Conclusion

As is often the case with a complex product like SharePoint and its myriad foundational technologies, even simple things like changing a service account can have distinct challenges and “devil in the detail”. Because of this, understanding the playbook is paramount to running successful operational service management for SharePoint farms. Unfortunately, this area of guidance is sorely lacking, and indeed the practice discipline itself is very immature across the industry. I deliberately picked a very simple (in the scheme of things) example to demonstrate the “playbook imperative”; there are a multitude of much more complex scenarios which SharePoint administrators face frequently. Sadly, most “experts” bail after a project is complete and never truly get to understand the real world of OSM. And far too frequently, customers fail to invest adequately in this area.

As more and more of our administration models move to Windows PowerShell, perhaps one of the best things out of Redmond in the last ten years, it is long past time that SharePoint the product, and SharePoint practitioners in general, embraced the idea of tools, not scripts. This after all is a driving force of the PowerShell Manifesto and a primary benefit of PowerShell. Not the ability to run a cmdlet, but to combine them together in interesting ways. Just like a band sounds pretty terrible when just one guy is beating the drums, but when the rest of the rhythm section joins in, beautiful things happen. Building tools is all about understanding the playbook. In some cases, you simply must have one. In others it will merely save you days of effort when building the tools.

I implore Microsoft to invest in providing better documentation and guidance, and pretty please share at least some of their operational service management experiences from operating the planet’s biggest and baddest SharePoint.

Oh, and I’ve shown you how you can change the service account for Distributed Cache, so that it works. In a real server farm instead of a single box.

Get your tools on.

s.
