2016-12-05



In this blog post, I’ll discuss my top five go-to tips for MongoDB troubleshooting.

Every DBA has a war chest of their go-to solutions for any support issues they run into for a specific technology. MongoDB is no different. Even if you have picked it because it’s a good fit and it runs well for you, things will change. When things change – sometimes there is a new version of your application, or a new version of the database itself – you need to have a solid starting place.

To help new DBAs, I like to point out my top five things that cover the bulk of requests a DBA might need to work on.

Table of Contents

Common greps to use

Did any elections happen? Why did they happen?

Is replication lagged, do I have enough oplog?

Taming the profiler

CurrentOp and killOp explained

Common greps to use

This section is about ways to pare down the error log and make it a bit more manageable. The error log holds a slew of information, and without grep it can be challenging to correlate events.

Is an index being built?

As a DBA you will often get a call saying the database has “stopped.” The developer might say, “I didn’t change anything.” Looking at the error log is a great first port of call. With this particular grep, you just want to see if all index builds were done, if a new index was built and is still building, or an index was removed. This will help catch all of the cases in question.
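The kind of filter described above can be sketched in a few lines of Python. This is only an illustration, roughly equivalent to running grep -i index against the log. The pattern and the sample log lines are assumptions made up for the example, not the exact grep from this post.

```python
import re

# Assumption: index-related log lines contain the word "index" in some case.
INDEX_PATTERN = re.compile(r"index", re.IGNORECASE)


def index_lines(lines):
    """Return only the log lines that mention an index."""
    return [line for line in lines if INDEX_PATTERN.search(line)]


# Hypothetical log lines, invented for illustration.
sample_log = [
    "2016-12-05T10:00:01.000+0000 I NETWORK  [conn12] end connection 10.0.0.5:51234",
    "2016-12-05T10:00:02.000+0000 I INDEX    [conn7] build index on: app.users",
    "2016-12-05T10:00:09.000+0000 I INDEX    [conn7] build index done.  scanned 120000 total records. 7 secs",
]

for line in index_lines(sample_log):
    print(line)
```

If a build was started but no matching "done" line follows it, the index is probably still building.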

What’s happening right now?

As with the index example above, this helps you filter out the many messages you don’t care about. MongoDB does have some useful sub-component tags in the logs, such as “ReplicationExecutor” and “connXXX”, but I find it more helpful to remove the noisy lines than to filter by log facility type. In this example, I opted not to include “| grep -v connection” – typically I look at the log with connections first to see if they are acting funny, and then filter those out to see the core data of what is happening. If you only want to see the long queries and commands, replace “ms” with “connection” to make them easier to find.
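As a rough sketch of this kind of filtering, the Python below drops connection chatter and, separately, keeps only operations that were logged with a duration. The noise patterns and sample lines are assumptions chosen for illustration. The only MongoDB behavior relied on is that slow operations are logged with their duration (for example "153ms") at the end of the line.

```python
import re

# Hypothetical noise patterns; adjust to whatever is noisy in your own log.
NOISE = re.compile(r"connection accepted|end connection")
# Slow operations end with their duration, e.g. "... 153ms".
SLOW_OP = re.compile(r"\b\d+ms$")


def without_noise(lines):
    """Drop lines that match the noise patterns (like grep -v)."""
    return [line for line in lines if not NOISE.search(line)]


def slow_ops(lines):
    """Keep only lines that end with a duration in milliseconds."""
    return [line for line in lines if SLOW_OP.search(line.rstrip())]


# Hypothetical log lines, invented for illustration.
sample_log = [
    "2016-12-05T10:00:00.000+0000 I NETWORK  [initandlisten] connection accepted from 10.0.0.5:51234 #12",
    "2016-12-05T10:00:01.000+0000 I COMMAND  [conn12] command app.users command: find planSummary: COLLSCAN docsExamined:120000 nreturned:1 153ms",
    "2016-12-05T10:00:02.000+0000 I REPL     [ReplicationExecutor] Member 10.0.0.6:27017 is now in state SECONDARY",
    "2016-12-05T10:00:03.000+0000 I NETWORK  [conn12] end connection 10.0.0.5:51234",
]

for line in without_noise(sample_log):
    print(line)
```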

Did any elections happen? Why did they happen?

While this isn’t the most common command to run, it is very helpful if you aren’t using Percona Monitoring and Management (PMM) to track the historical frequency of elections. In this example, we want up to 20 lines before and after the word “SECONDARY”, which typically marks when a step-down or election takes place. You can then see whether, around that time, a command was issued, a network error occurred, a heartbeat failed, or something similar happened.
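The idea of pulling context around a keyword, much like grep -C 20 SECONDARY, can be sketched as follows. The function name and the sample lines are hypothetical. This is an illustration of the technique, not the command from this post.

```python
def context_around(lines, needle, context=20):
    """Return every line within `context` lines of a line containing `needle`."""
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    # Preserve the original order and avoid duplicates when windows overlap.
    return [lines[i] for i in sorted(keep)]


# Hypothetical log: 100 filler lines with one state change in the middle.
sample_log = ["line %d" % i for i in range(100)]
sample_log[50] = "[ReplicationExecutor] transition to SECONDARY"

for line in context_around(sample_log, "SECONDARY", context=2):
    print(line)
```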

Is replication lagged, do I have enough oplog?

Always write a single test document just to ensure replication has a recent write:
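A minimal sketch of such a test write is shown below. It assumes a collection object that exposes insert_one, as a PyMongo collection does. The document fields and the FakeCollection stand-in are hypothetical and exist only so the example runs without a server.

```python
from datetime import datetime, timezone


def heartbeat_document(now=None):
    """Build a small marker document with the current time (field names are hypothetical)."""
    return {"source": "replication-check", "written_at": now or datetime.now(timezone.utc)}


def write_heartbeat(collection, now=None):
    """Insert one marker document so the oplog has a fresh write to replicate."""
    doc = heartbeat_document(now)
    collection.insert_one(doc)
    return doc


class FakeCollection:
    """Stand-in for a real collection, used only for this illustration."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


fake = FakeCollection()
print(write_heartbeat(fake))
```

Without a recent write, an idle replica set can look lagged simply because nothing new has happened on the primary.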

Checking lag information:
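In the mongo shell of this era, rs.printSlaveReplicationInfo() prints lag per secondary. As a sketch of what that number means, the Python below computes lag from a dictionary shaped like the output of the replSetGetStatus command, using its members, name, stateStr, and optimeDate fields. The host names and timestamps are made up.

```python
from datetime import datetime


def replication_lag_seconds(status):
    """Seconds each secondary is behind the primary, keyed by member name."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical status document with one primary and two secondaries.
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": datetime(2016, 12, 5, 10, 0, 45)},
        {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": datetime(2016, 12, 5, 10, 0, 45)},
        {"name": "db3:27017", "stateStr": "SECONDARY", "optimeDate": datetime(2016, 12, 5, 10, 0, 0)},
    ]
}

print(replication_lag_seconds(sample_status))
```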

Oplog Size and Range:
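The shell helper rs.printReplicationInfo() reports the oplog size and the time range it covers. The arithmetic behind the range can be sketched as below. It assumes you already have the seconds component of the first and last oplog entry timestamps. The safety factor is an arbitrary illustrative choice, not a recommendation from this post.

```python
def oplog_window_hours(first_ts_seconds, last_ts_seconds):
    """Hours of history the oplog currently holds."""
    return (last_ts_seconds - first_ts_seconds) / 3600.0


def oplog_covers_lag(window_hours, lag_seconds, safety_factor=2.0):
    """True if the oplog window is at least `safety_factor` times the observed lag."""
    return window_hours * 3600 >= lag_seconds * safety_factor


# Hypothetical timestamps: the newest entry is 36 hours after the oldest.
first = 1480896000
last = first + 36 * 3600

window = oplog_window_hours(first, last)
print(window, oplog_covers_lag(window, lag_seconds=45))
```

If a secondary falls further behind than the oplog window, it can no longer catch up from the oplog and needs a full resync.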

Taming the profiler

The MongoDB profiler is filled with tons of data. I have highlighted some key points to know:

Metric: Description

Filter: The formulated query that was run. Right above it you can find the parsed query, and the two should be the same. It’s useful to know what the engine was actually sent in the end.

nReturned: Number of documents returned via the cursor to the client running the query or command.

executionTimeMillis: This used to be called just “ms”. It is how long the operation took. Typically you would treat it like a slow query measurement in any database.

total(Keys|Docs)Examined: Unlike nReturned, this is how many keys or documents had to be considered. Not all indexes have perfect coverage, and sometimes you scan many documents to find no results.

stage: While poorly named, this tells you whether a collection scan (table scan) or an index was used to answer a given operation. In the case of an index, it will say the name.
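To show how these fields fit together, here is a small sketch that summarizes a document shaped like the output of explain with executionStats. The field names are the ones listed above. The sample documents and the docs_per_result ratio are illustrative assumptions, not profiler features.

```python
def plan_stages(plan):
    """Walk a winning plan from the top stage down through its inputStage chain."""
    stages = [plan["stage"]]
    child = plan.get("inputStage")
    while child:
        stages.append(child["stage"])
        child = child.get("inputStage")
    return stages


def scan_report(execution_stats):
    """Summarize how much work was done per returned document."""
    returned = execution_stats["nReturned"]
    examined = execution_stats["totalDocsExamined"]
    return {
        "millis": execution_stats["executionTimeMillis"],
        # Avoid dividing by zero when nothing was returned.
        "docs_per_result": examined / max(returned, 1),
        "used_index": execution_stats["totalKeysExamined"] > 0,
    }


# Hypothetical explain output for a collection scan.
collscan = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {"nReturned": 1, "executionTimeMillis": 153,
                       "totalKeysExamined": 0, "totalDocsExamined": 120000},
}
# Hypothetical explain output for an indexed lookup.
indexed = {
    "queryPlanner": {"winningPlan": {"stage": "FETCH",
                                     "inputStage": {"stage": "IXSCAN", "indexName": "email_1"}}},
    "executionStats": {"nReturned": 1, "executionTimeMillis": 2,
                       "totalKeysExamined": 1, "totalDocsExamined": 1},
}

for doc in (collscan, indexed):
    print(plan_stages(doc["queryPlanner"]["winningPlan"]), scan_report(doc["executionStats"]))
```

A high ratio of examined to returned documents is usually the first hint that an index is missing or does not match the query.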

CurrentOp and killOp explained

When using db.currentOp() to see what is running, I frequently pass true, as in db.currentOp(true), so that I can see everything and not just limited items. This makes the currentOp function look and act much more like the full process list in MySQL. One significant difference that commonly catches a new DBA off guard is how killing operations differs between MySQL and MongoDB. While Mongo does have a handy db.killOp() function, it is important to know that unlike MySQL – which immediately kills the thread running the process – MongoDB is a bit different. When you run db.killOp(), MongoDB appends “killed: true” into the document structure. When the next yield occurs (if it occurs), it will tell the operation to quit. This is also how a shutdown works: if it seems like it’s not shutting down, it might be waiting for an operation to yield and notice the shutdown request.
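The flag-then-yield behavior described above can be illustrated with a toy model. This is not MongoDB's implementation. The class, the yield interval, and the kill_after parameter are all hypothetical and exist only to show why a kill request takes effect at the next yield, or never if the operation does not yield.

```python
class Operation:
    """Toy operation that only notices a kill request at yield points."""

    def __init__(self, opid, total_docs, yield_every=100):
        self.opid = opid
        self.total_docs = total_docs
        self.yield_every = yield_every
        self.killed = False
        self.processed = 0

    def run(self, kill_after=None):
        """Process documents; `kill_after` simulates a kill arriving mid-run."""
        for n in range(1, self.total_docs + 1):
            self.processed = n
            if kill_after is not None and n == kill_after:
                # A kill request only sets a flag; nothing stops yet.
                self.killed = True
            if n % self.yield_every == 0 and self.killed:
                # The flag is checked at the yield point, and the operation quits.
                return "interrupted"
        return "completed"


long_op = Operation(opid=1, total_docs=1000)
print(long_op.run(kill_after=250), long_op.processed)

short_op = Operation(opid=2, total_docs=50)
print(short_op.run(kill_after=10), short_op.processed)
```

In this model the first operation keeps working until its next yield after the kill request, and the second finishes normally because it never reaches a yield point.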

I’m not arguing that this is bad or good, just different from MySQL and something of which you should be aware. One thing to note, however, is that MongoDB has great built-in HA. Sometimes it is better to cause an election and let the drivers gracefully handle things, rather than running the killOp command (unless it’s a write, then you should always try to use killOp).
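Before deciding how to deal with a stuck operation, it helps to list what has actually been running for a while. The sketch below filters a dictionary shaped like the output of db.currentOp(), using its inprog array and the opid, op, ns, active, and secs_running fields. The threshold and the sample data are made up for illustration.

```python
def long_running_ops(current_op, min_secs):
    """Return active operations that have run for at least `min_secs` seconds."""
    return [
        {
            "opid": op["opid"],
            "op": op.get("op"),
            "ns": op.get("ns"),
            "secs_running": op.get("secs_running", 0),
        }
        for op in current_op["inprog"]
        if op.get("active") and op.get("secs_running", 0) >= min_secs
    ]


# Hypothetical currentOp output with three operations.
sample_current_op = {
    "inprog": [
        {"opid": 101, "active": True, "op": "query", "ns": "app.users", "secs_running": 320},
        {"opid": 102, "active": True, "op": "update", "ns": "app.orders", "secs_running": 2},
        {"opid": 103, "active": False, "op": "none", "ns": ""},
    ]
}

for op in long_running_ops(sample_current_op, min_secs=60):
    print(op)
```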

Conclusion

I hope you have found some of this insightful. Look for future posts from the MongoDB team around other MongoDB areas we like to look at (or in different parts of the system) to help ourselves and clients get to the root of an issue.
