We’ve all been reading a lot of articles that suggest adding a caching layer in front of a database is more efficient than hitting the database directly. Among different caching options there is Do-It-Yourself coding, using open source tools like Memcached, or using Velocity (part of the Microsoft AppFabric) or using Ehcache (for Java developers). Alternatively, you can now find automated caching solutions like SafePeak’s dynamic caching for SQL Server.
Why use a caching solution?
Let’s quickly recap the many reasons to use data caching with SQL Server:
To improve data access read/write performance
To reduce the number queries hitting the of database
To reduce database load (CPU, IO, Memory)
To improve application page load response time (the user’s actual experience)
To distribute repetitive query workloads to separate cache servers
To share the benefit of cached data on one server across all other servers.
To increase operational efficiency, by scaling more data and more users on smaller and fewer SQL Servers machines
Another benefit of caching is its impact on database scalability. SQL Server is expensive to scale with bigger hardware, which also can trigger increased license fees. Fast Flash storage systems, which can be helpful, are also very expensive. Whereas adding a new web/caching server can be much cheaper (and potentially more effective).
So, now that we know why we want to use caching with SQL SERVER, what’s the best way to go about it? Well, you’ll need to decide what’s best for you. Let’s look at three main SQL Server caching options:
SQL Server Memory and IO caching
Application data caching, a Do It Yourself approach using Memcached/similar APIs
SafePeak automated SQL caching
By the way, I’d also like to learn from your experiences, so I look forward to your comments and thoughts on what I discuss here.
SQL Server: Doesn’t it already have memory cache?
To start with, we need to remember that SQL Server has its own memory cache for objects in the database. When data is retrieved, SQL Server maintains its cache and will (if necessary) pull the row from its memory and not hit the disk.
So if SQL Server has its own cache, what’s the benefit of data caching layers such as Memcached, SafePeak or similar?
The answer is this: SQL Server only caches:
Query plans
Pages from the database files.
SQL Server does NOT cache results from a query.
This is the important distinction.
For example, imagine if you have a complex query which uses some aggregation on a lot of data (e.g.: how many different countries we have in our customer database: SELECT DISTINCT Country from Customers GROUP BY country). SQL Server will scan the WHOLE customer table, but your result-set will only be a few entries long. When you reissue your query, SQL Server will reuse the query plan and will rescan the customer table, (and if you are lucky the pages are still in memory)
When you use application cache, you store your result-sets in Memcached RAM. Then reuse them over and over again without connecting to the database server, thus offloading workloads from your database server.
Do-It-Yourself application data caching using Memcached / similar APIs
Application data caching is quite easy to start. However, changing (or building) code for large applications with effective data caching is challenging for a few reasons:
Correctly scoping required code changes;
Delivering and testing actual code changes required to handle caching of all (or most) SQL queries;
Preserving data integrity: how to invalidate all relevant cache items and to handle all kinds of data change events. And in real-life both READs / WRITEs can be complex involving stored-procedures, triggers, views, table joins etc.; and
Maintaining the additional code layer.
One thing to keep in mind is that because DIY and Memcached approaches need to touch app code, they can’t be used to accelerate any off-the-shelf software that relies on SQL Server.
So, while it can be complicated, application performance benefits are always good, so let’s dive into what you would do using Memcached.
About Memcached data caching
Memcached is a big hash table: a key/value store that lives entirely in RAM on multiple servers. It allows you to use RAM from multiple servers as single memory caching space.
Basic data caching using Memcached
Below is a basic tutorial showing (via pseudocode) how you can get started with integrating Memcached into your application. If you’re an application developer, it isn’t something you just “turn on” and then your site goes faster. You have to pay attention. For every query you want to be cached, you have to put (set()) it into cache, get it from cache (get()) and (probably most complicated) make sure you keep the data in the cache correct by removing from cache (delete()) when the data is updated.
The following pseudocode of Memcached example is written in #Perl but can be done in same way in most other languages, including .net, java and php – etc (memcached clients)
Initializing a Memcached Client with a list of your pre-configured Memcached servers:
Wrapping an SQL Query
Memcached is famous for reducing load on SQL databases. Here is some pseudocode showing how to wrap a database query access with a memcached caching layer, by implementing check-in-cache (function get()) and place-into-cache (function set()):
Notice that the SQL query result-set was entered to cache with five-minute expiration time, as an example.
Stale data in cache and invalidation of cached items
Unless you actively invalidate the cached item, when a user (or your database) makes a change, it can take up to five minutes (or more, depending on cache item expiration time) for users to see the correct new data. This is a key issue when implementing a Do-It-Yourself caching strategy.
When a user comes along and edits data, you can keep the cache in sync in two main ways:
You can update the item in cache; or
Delete the old item.
Expiration
The most basic invalidation happens by defining the Expiration period. Even if you‘re actively deleting or overwriting cached data, you’ll still want to have the cache expire occasionally.
Expiration times can be set from 0, meaning “never expire”, to 30 days. Any time higher than 30 days is interpreted as a timestamp date.
delete()
The simplest method of invalidation is to simply delete an entry and have your website re-cache the data the next time it’s fetched. For example, when a user updates her bio, you want her to see her latest info when she reloads the page:
sql = "SELECT * FROM user WHERE user_id = ?"
key = 'SQL:' . user_id<span class="GINGER_SOFTWARE_mark" id=