Years ago I pursued my interest in InnoDB’s architecture and design, and became impressed with its sophistication. Another way to say it is that InnoDB is complicated, as are all MVCC databases. However, InnoDB manages to hide the bulk of its complexity entirely from most users.
I decided to at least outline a book on InnoDB. After researching it for a while, it became clear that it would need to be a series of books in multiple volumes, with somewhere between 1000 and 2000 pages total.
At one time I actually understood a lot of this material, but I have forgotten most of it now.
I never began writing. Although the outline is incomplete, outdated, and in some cases wrong, I share it here in case anyone is interested. It might be of particular interest to someone who thinks it’s an easy task to write a new database.
High-Level Outline:
Introduction
Intro to Major Features
The InnoDB Architecture
Indexes
Transactions in InnoDB
Locking in InnoDB
Deadlocks
Multi-Version Concurrency Control (MVCC)
Old Row Versions and the Undo Space
Data Storage and Layout
Data Types
Large Value Storage and Compression
The Transaction Logs
Ensuring Data Integrity
The Insert Buffer (Change Buffer)
The Adaptive Hash Index
Buffer Pool Management
Memory Management
Checkpoints and Flushing
Startup, Crash Recovery, and Shutdown
InnoDB’s I/O Behavior and File Management
Data Manipulation (DML) Operations
The System Tables
Data Definition (DDL) Operations
Foreign Keys
InnoDB’s Interface to MySQL
Index Implementation
Data Distribution Statistics
How MySQL executes queries with InnoDB
Internal Maintenance Tasks
Tuning InnoDB
Mutexes and Latches
InnoDB Threads
Internal Structures
XtraBackup
InnoDB Recovery Tools
Inspecting Status
Section-By-Section Detailed Outline:
Introduction
History of InnoDB: its roots, the context of its integration into
MySQL, the Oracle acquisition, etc
Based on Gray & Reuter’s book
high-level organization: USER-visible things first, after enough of a
high-level overview to understand the big features and moving parts; then
INTERNALS afterwards.
Intro to Major Features
Transactions
ACID
MVCC
multi-version read consistency
row-level locking
standard isolation levels
automatic deadlock detection
Foreign Keys
Clustered Indexes
Page Compression
Crash Recovery
Exercises
The InnoDB Architecture
This will be a high-level introduction, just enough to understand the
following chapters
Storage on disk
Pages, and page sizes
Extents
Segments
Tablespaces
The main tablespace and its major components
Data pages
data dictionary
insert buffer
undo log area
doublewrite buffer
reserved spaces for hardcoded offsets
see Vadim’s diagram on HPM blog
individual tablespaces for file-per-table
redo log files
Storage in memory
the buffer pool and its major components
Data pages
other memory usage
adaptive hash index
Major data structures
LRU list
flush list
free list
Exercises
Indexes
clustered
logically nearby pages can be physically distant, so sequential scan isn’t
guaranteed to be sequential
no need to update when rows are moved to different pages
changing PK value is expensive
insert order
random inserts are expensive: splits, fragmentation, bad space utilization/fill factor
sequential inserts build least fragmented table
consider auto_increment
optimize table rebuilds, will compact PK but not secondary
they are built by insertion, not by sort, except for plugin fast creation
secondaries can be dropped and recreated in fast mode in the plugin to defrag
but this won’t shrink the tablespace
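To illustrate the insert-order point, a sketch (hypothetical table and column names; a UUID stands in for any random key):

  -- Sequential inserts: each new row appends to the right-hand edge of the
  -- clustered index, so pages fill compactly with few splits.
  CREATE TABLE seq_pk (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    payload VARCHAR(255)
  ) ENGINE=InnoDB;

  -- Random inserts: a random key such as a UUID lands anywhere in the tree,
  -- causing page splits, fragmentation, and poor fill factor.
  CREATE TABLE random_pk (
    id CHAR(36) NOT NULL PRIMARY KEY,   -- e.g. populated with UUID()
    payload VARCHAR(255)
  ) ENGINE=InnoDB;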
primary vs secondary
secondary causes 2 lookups, also might cause lookup to PK to check row
visibility because secondary has version per-page, not per-row
PK ranges are very efficient
PK updates are in-place, secondary are delete+insert
automatic promotion of unique secondary
auto-creation of a primary key
primary columns are stored in secondary indexes
long primary keys expensive
uniqueness; is checking deferred, or immediate?
there is no prefix compression as there is in myisam, so indexes can be larger
rows contain trxn info, but secondary indexes do also at the page level
(enables index-covered queries; more on this later)
Exercises
Transactions in InnoDB
A transaction is a sequence of actions that starts in one legal state, and
ends in another legal state (need a good definition)
from Heikki’s slides: atomic (all committed or rolled back at once),
consistent (operate on a consistent view of the data, and leave the data
in a consistent state at the end); isolated (don’t see effects of other txns
on the system until after commit); durable (all changes persist, even after
a failure)
consistency means that any data I see is consistent with all other data I see
at a single point in time (there are exceptions to this)
How are they started, and in what conditions?
what is the transaction ID and its relationship to the LSN?
what is the system version number (LSN)?
what is a minitransaction/mtr?
when are they committed?
how are savepoints implemented?
what happens on commit?
XA and interaction with the binary logs
fsync()s
group commit
what happens on rollback?
they are meant to be short-lived and commit; what if they stay open a long
time or roll back?
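A sketch of the user-visible syntax for the above (standard MySQL; the table name is hypothetical):

  BEGIN;
  INSERT INTO t (id) VALUES (1);
  SAVEPOINT sp1;
  INSERT INTO t (id) VALUES (2);
  ROLLBACK TO SAVEPOINT sp1;   -- undoes only the second insert
  COMMIT;                      -- makes the first insert durable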
Locking in InnoDB
locking is needed to ensure consistency, and so that transactions can
operate in isolation from each other
row-level
non-locking consistent reads by default
pessimistic locking
locks are stored in a per-page bitmap
compact: 3-8 bits per lock
sparse pages have more memory and CPU overhead per locked row
no lock escalation to page-level or table-level
row locking
S locks
X locks
there is row-level locking on indexes, (but it may require access to PK
to see if there is a current version of the row to lock, right?)
Heikki says a delete-marked index record can carry a lock. What are the
cases where this would happen? Why does it matter? I imagine that a
DELETE statement locks the row, deletes it, and leaves it there locked.
supremum row can carry a lock, but infimum cannot (maybe we should discuss
later in another chapter)
table locking - IS and IX locks
InnoDB uses multiple-granularity locking, see http://en.wikipedia.org/wiki/Multiple_granularity_locking
also called two-phase locking, perhaps? Baron’s a bit confused
before setting a S row lock, it sets an intention lock IS on the table
before setting a X row lock, it sets an intention lock IX on the table
auto-increment locks
needed for stmt replication
before 5.0: table-level for the whole statement, even if the insert provided the value
released at statement end, not txn end
this is a serious bottleneck
5.1 and later: two more types of behavior (Look up _____’s issues, I think they
had some problem with it; also Mark Callaghan talked about a catch-22 with it)
row replication lets lock be released faster, or avoid it completely
complex behavior: interleaved numbers, gaps in sequences
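The 5.1 modes live behind innodb_autoinc_lock_mode; a sketch of the values as I understand them from the docs:

  -- 0 = traditional: table-level AUTO-INC lock held for the whole statement
  -- 1 = consecutive (default): lightweight mutex for simple inserts
  -- 2 = interleaved: fastest; allows interleaved numbers and gaps, needs RBR
  SELECT @@innodb_autoinc_lock_mode;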
locks set by types of statements
select doesn’t set locks unless it’s serializable
select lock in share mode
sets shared next-key locks on all index records it sees
select for update
sets exclusive next-key locks on all index records it sees, as well as locking the PK
insert into tbl1 select from tbl2
sets share locks on all rows it scans in tbl2 to prevent phantoms
can be reduced in 5.1 with read-committed; does a consistent non-locking read,
but requires rbr or it will be unsafe for replication
update and delete set exclusive next-key locks on all records they find (in pk/2nd?)
insert sets an x-lock on the inserted record (not next-key, so it doesn’t prevent others)
if there is a duplicate key error, sets s-lock on the index record (why?)
replace is like an insert
if there is a key collision, places x-next-key-lock on the updated row
insert…select…where sets x-record-lock on each destination row, and S-next-key-locks
on source rows, unless innodb_locks_unsafe_for_binlog is set
create…select is an insert..select in disguise
if there is a FK, anything that checks the constraint sets S-record-locks on the records
it looks at. Also sets locks if the constraint fails (why?)
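A sketch contrasting the statement types above (hypothetical table; assumes an index on parent_id):

  -- Non-locking consistent read (the default):
  SELECT * FROM child WHERE parent_id = 10;

  -- Shared next-key locks on every index record scanned:
  SELECT * FROM child WHERE parent_id = 10 LOCK IN SHARE MODE;

  -- Exclusive next-key locks, for read-then-modify:
  SELECT * FROM child WHERE parent_id = 10 FOR UPDATE;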
lock types and compatibility matrix
how is lock wait timeout implemented, and what happens to trx that times out?
infimum/supremum “records” can be locked but don’t correspond to real rows
(should we defer this till later?)
gap locking, etc are discussed later
Deadlocks
how are deadlocks detected?
cycle in the waits-for graph
or, too many recursions (200)
or, too many locks checked (1 million, I think?)
performance impact, disabling
do txns check when they set a lock (I think so), or is there a deadlock thread, as Peter’s slides say?
deadlock behavior
how is a victim chosen? which one is rolled back?
txn that modified the fewest rows
how about deadlocks with > 2 transactions involved and SHOW INNODB STATUS
causes of deadlocks
foreign keys
gap locking
transactions accessing table through different indexes
how to reduce deadlocks
make txns use same index in same order
short txns, fewer modifications, commit often
use rbr and read_committed in 5.1, reduces gap locking
in 5.0, use innodb_locks_unsafe_for_binlog to remove gap locking
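A minimal two-session deadlock to open the chapter with (hypothetical table t containing rows id = 1 and id = 2):

  -- Session A:
  BEGIN;
  UPDATE t SET c = c + 1 WHERE id = 1;

  -- Session B:
  BEGIN;
  UPDATE t SET c = c + 1 WHERE id = 2;

  -- Session A (blocks, waiting for B's lock on id = 2):
  UPDATE t SET c = c + 1 WHERE id = 2;

  -- Session B (closes the cycle; InnoDB rolls back a victim):
  UPDATE t SET c = c + 1 WHERE id = 1;
  -- ERROR 1213 (40001): Deadlock found when trying to get lock;
  -- try restarting transaction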
Multi-Version Concurrency Control (MVCC)
ensures isolation with minimal locking
isolation means that while transactions are changing data, other transactions
see only a legal state of the database – either as of the start of the txn
that is changing stuff, or at the end after it commits, but not midway
readers can read without being blocked by writers, and vice versa
writers block each other
TODO: some of the details here need to be moved to the next chapter, or they need to be merged
old row versions are kept until no longer visible, then deleted in background (purge)
Read views
how LSN is used for mvcc, “txn sees between X and Y”: DB_TRX_ID (6 bytes?) column
updates write a new version, move the old one to the undo space, even before commit
short rows are faster to update
whole rows are versioned, except for BLOBs
deletes update the txn ids, leave it in place – special-case of update
inserts have a txn id in the future – but they still write to undo space and it is
discarded after commit
mvcc causes index bloat for deletions, but not for updates; updates cause undo space bloat
what the oldest view is used for, in srv0srv.c
Transaction isolation levels
SERIALIZABLE
Locking reads as if LOCK IN SHARE MODE.
Bypasses multi-versioning
No consistent reads; everyone sees latest state of DB.
REPEATABLE-READ (default)
Read committed data as it was at the start of the transaction
the snapshot begins with the first consistent read in the txn
begin transaction with consistent snapshot starts it “now”
update/delete use next-key locking
Is only for reads; multiversioning is for reads, not for writes.
For example, you can delete a row and commit from one session. Another session can
still see the row, but if it tries to update it, the update will affect 0 rows.
FOR UPDATE and LOCK IN SHARE MODE and INSERT..SELECT will drop out of
this isolation level, because you can only lock or intend to update the
most recent version of the row, not an old version.
http://bugs.mysql.com/bug.php?id=17228
“transaction isolation level question from a student in class” thread Mar 2011
“Re: REPEATABLE READ doesn’t work correctly in InnoDB tables?” ditto
READ-COMMITTED
Read committed data as it was at the start of the statement – “up to date”
each select uses its own snapshot; internally consistent, but not over whole txn
in 5.1, most gap-locking removed; requires row-based logging.
unique key checks in 2nd indexes, and some FK checks, still need to set gap locks
prevents inserting child row after parent is deleted
which FK checks?
READ-UNCOMMITTED
Read uncommitted data as it is changing, live
No consistency, even within a single statement
which is best to use? is read-committed really better, as a lot of people believe?
http://www.mysqlperformanceblog.com/2011/01/12/innodb-undo-segment-siz-and-transaction-isolation/
phantom rows: you don’t see it in the first query, you see it when you query again;
stmt replication can’t tolerate them (not a problem in row-based); avoided by gap locking
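A sketch of the REPEATABLE-READ reads-vs-writes asymmetry described above (two sessions; hypothetical table):

  -- Session A:
  SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  BEGIN;
  SELECT * FROM t WHERE id = 5;      -- first read establishes the snapshot

  -- Session B (autocommit):
  DELETE FROM t WHERE id = 5;

  -- Session A: the snapshot still shows the row...
  SELECT * FROM t WHERE id = 5;      -- returns the row
  -- ...but writes operate on the latest version, so this matches nothing:
  UPDATE t SET c = 1 WHERE id = 5;   -- affects 0 rows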
Lock types for MVCC:
Next-key locks
gap locks
gap locks are simply a prohibition from inserting into the gap
they don’t give permission to do anything, they just block others
example: if I hold an exclusive next-key lock on a record, I can’t always insert
into the gap before it; someone might have a gap lock or a waiting next-key lock
request.
different modes of locks are allowed to coexist, because of gap merging via purge
holding a gap lock prevents others from inserting, but doesn’t permit me to insert;
I must wait for conflicting locks to be released (many txns can hold the same gap lock)
supremum can be gap-locked, infimum can’t – why not?
which isolation levels use them?
types: next-key (locks key and gap before it), gap lock (just the gap before the key),
record-only (just the key), insert-intention gap lock (held while waiting to insert
into a gap).
searches for a unique key use record-only locks to minimize gap locking (pk updates,
for example)
insert-intention locking
index locking (http://dom.as/2011/07/03/innodb-index-lock/)
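A sketch for the gap-locking discussion (hypothetical table; assumes a secondary index on k with existing values 10 and 20):

  -- Session A:
  BEGIN;
  SELECT * FROM t WHERE k BETWEEN 10 AND 20 FOR UPDATE;
  -- A holds exclusive next-key locks on 10, 20, and the gaps around them.

  -- Session B (blocks: its insert-intention lock conflicts with A's gap lock):
  INSERT INTO t (k) VALUES (15);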
when MVCC is bypassed:
updates: can only update a row that exists currently (what about delete?)
select lock in share mode
sets shared next-key locks on all index records it sees
select for update
sets exclusive next-key locks on all index records it sees, as well as locking the PK
insert into tbl1 select from tbl2
sets share locks on all rows it scans in tbl2 to prevent phantoms
can be reduced in 5.1 with read-committed; does a consistent non-locking read,
but requires rbr or it will be unsafe for replication
locking reads are slower, because they have to set locks (check for deadlocks)
How locks on a secondary index must lock the primary key, the impact of
this on the txn isolation level
indexes contain pointers to all versions
Index key 5 will point to all rows which were 5 in the past
TODO: clarify this with Peter
Exercises
is there really such a thing as a unique index in InnoDB?
Old Row Versions and the Undo Space
old row versions are used for MVCC and rollback of uncommitted txns that fail
they provide Consistency: each txn sees data at a consistent point in time
so old row versions are needed
cannot be updated: history cannot change; thus old row versions aren’t locked
each row in index contains DB_ROLL_PTR column, 7 bytes, points to older version
They are stored in a linked list, so a txn that reads old rows is slow,
and rows that are updated many times are very slow, and long running txns
that update a lot of rows can impact other txns
“rows read” is logical at the mysql level, but at the innodb level, many rows could be read
there is no limit on number of old versions to keep
history list
rollback segment / rseg
what is the difference between this and an undo segment / undo tablespace?
purge of old row versions
how it is done in the main loop
how it is done in a separate thread
interaction with long-running transactions, when a row version can be
purged, txn isolation level of long-running transactions
it leaves a hole / fragmentation (see __________?id=17673)
purging can change gaps, which are gap-locked; what happens to them?
a deleted row is removed from an index; two gaps merge. The new bigger gap
inherits both the locks from the gaps, and the lock that was on the deleted row
innodb_max_purge_lag slows down updates when purge falls behind
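To make purge lag concrete, a sketch of how to inspect it (output format varies by version; the information_schema table needs the plugin):

  SHOW ENGINE INNODB STATUS\G
  -- Under TRANSACTIONS, look for a line such as:
  --   History list length 3472
  -- A steadily growing value suggests purge is falling behind, often because
  -- a long-running transaction is pinning old row versions. Find the oldest:
  SELECT trx_id, trx_started FROM information_schema.innodb_trx
  ORDER BY trx_started LIMIT 5;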
Data Storage and Layout
Tablespaces
what is in the global tablespace; why it grows; why it can’t be shrunk
main tablespace can be multiple files concatenated
legacy: raw device
tablespace header => id, size
segments, how they are used
leaf and non-leaf node segments for each index (makes scans more sequential IO)
thus each index has two segments for these types of pages
rollback segment
insert buffer segment
segment allocation: small vs large, page-at-time vs extent-at-a-time
how free pages are recycled within the same segment
when a segment can be reused
all pages in an extent must be free before it can be used in a
different segment of the same tablespace
file-per-table
advantages: reclaim space, store data on different drives (symlinking and its pitfalls),
backup/restore single tables, supports compression
disadvantages: filesystem per-inode mutexes, longer recovery, uses more space
free space within a segment can be used by same table only
how to import and export tables with xtradb, how this is different from
import and export in standard innodb
file growth/extension, and how it is done
ibdata1: innodb_autoextend_increment
individual table .ibd files don’t respect that setting
http://bugs.mysql.com/56433 bug about mutex lock during extension, Yasufumi patched
file formats (redundant, compact, barracuda, etc)
Page format
types of pages
Row format
never fragmented, except blobs are stored in multiple pieces
there can be a lot of empty space between rows on the page
infimum/supremum records
how SHOW TABLE STATUS works: see issue 17673
Data Types
Data types supported, and their storage format
Nulls
are nulls equal with respect to foreign keys? what about unique indexes?
Exercises
Large Value Storage and Compression
Page compression
new in plugin
requires file-per-table and Barracuda
Pages are kept uncompressed in memory
TODO: Peter says both compressed and uncompressed can be kept in memory
compression is mostly per-page
uses zlib; the zlib library version must match exactly on recovery, because inflate/deflate sizes must match exactly
so recovery can’t be done on a different mysql version than the one that crashed; ditto for xtrabackup backup/prepare
if libz is not linked statically, this can cause problems (use ldd to see)
recovery might be immature for compressed tablespaces: http://bugs.mysql.com/bug.php?id=62011
TODO: peter says Uses fancy tricks: Per page update log to avoid re-compression
not really configurable
syntax: ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4; Estimate how well the data will compress
problems:
fs-level might be more efficient b/c page size is too small for good compression ratio
we have to guess/hint how much it can be compressed
setting is per-table, not per-index, but indexes vary in suitability
TODO: Peter says KEY_BLOCK_SIZE=16; - Only compress externally stored BLOBs - Can reduce size without overhead
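The syntax above as a full sketch (plugin-only; assumes file-per-table and Barracuda are enabled dynamically, as the plugin allows):

  SET GLOBAL innodb_file_per_table = 1;
  SET GLOBAL innodb_file_format = 'Barracuda';
  CREATE TABLE t_compressed (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    doc TEXT
  ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;  -- the 4KB size is a guess/hint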
Blob/large value storage (large varchar/text has same behavior)
small blobs are stored in the page whole, if they fit (max row len: ~8000 bytes, ~1/2 page)
http://www.facebook.com/notes/mysql-at-facebook/how-many-pages-does-innodb-for-tables-with-large-columns/10150481019790933
large blobs are stored out-of-page on a dedicated extent, first 768 bytes in-page
same allocation rules as for any extent: page by page, then extent at a time
this can waste a lot of space; it makes sense to combine blobs if possible
Barracuda format lets us store the whole thing out-of-page, without the 768-byte prefix
no need to move blobs to their own table – innodb won’t read them unless needed
but the 768-byte prefix can make rows larger anyway
blob I/O is always “pessimistic”
how are BLOBs handled with MVCC and old row versions?
externally stored blobs are not updated in-place, a new version is created
does a row that contains a blob, which gets updated without touching the blob, create a new
row that refers to the same copy of the blob as the old row does?
how is undo space handled? TODO: in-page/in-extent with the row, or in the undo log area?
The Transaction Logs
circular
File format: not page formatted, record formatted
512 byte units (prevents o_direct, causes read-around writes)
if logs fit in os buffer, may improve performance; otherwise puts
pressure on memory, written circularly and never read except for
read-around writes, so OS caching is useless
tunable in XtraDB
records are physiological: page # and operation to perform
records are idempotent
only redo, not undo
what LSN is (bytes written to tx log, tx ID, system version number)
where it is used (each page is versioned, each row has 2 lsns in the pk)
given a LSN, where does it point to in the physical files? (it’s modulo, I think)
Changing the log file size
Headers and magic offsets, e.g. where the last checkpoint is written
Never implemented:
multiple log groups
log archiving (maybe implemented once, then removed?)
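A sketch of where these quantities surface (line format varies across versions; older servers print two-part LSNs):

  SHOW ENGINE INNODB STATUS\G
  -- The LOG section reports:
  --   Log sequence number ...   (total bytes ever written to the redo log)
  --   Log flushed up to   ...
  --   Last checkpoint at  ...
  -- LSN minus the checkpoint LSN bounds how much redo crash recovery must scan.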
Ensuring Data Integrity
Page checksums (old, new, and faster implementations)
checked when page is read in
updated when it is flushed
how much overhead this causes
disable-able, not recommended
the doublewrite buffer
it isn’t really a “buffer” like other buffers
avoiding torn pages / partial page writes
it is a short-term page-level log; pages contain tablespaceid+pageid
process: write to buffer; sync; write original location; sync
after crash recovery, we check the buffer and the original location, update original if needed
unlike postgres, we don’t write a full page to log after checkpoint (logs aren’t page-oriented)
how big it is; configuring it to a different file in xtradb
uses sequential IO, so overhead is not 2x
higher overhead on SSD, plus more wear
safe to disable on ZFS
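A sketch showing that doublewrite I/O is batched and sequential (standard status counters):

  SHOW GLOBAL STATUS LIKE 'Innodb_dblwr%';
  -- Innodb_dblwr_pages_written / Innodb_dblwr_writes = pages per doublewrite
  -- batch; a high ratio is why the overhead is far less than 2x.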
Exercises
The Insert Buffer (Change Buffer)
changes to non-unique secondary index leaf pages that aren’t in the buffer pool are saved for later
it is transactional, not a volatile cache
how much performance improvement it gives: said to be 15x reduction in random IO
how it is purged/merged
in the background, when there is time. This is why STOP SLAVE can trigger a huge flood of IO.
the rate is controlled by innodb_io_capacity, innodb_ibuf_accel_rate
done by main thread (?)
might not be fast enough – dedicated thread in xtradb
also (transparently) in the foreground, when the page that has un-applied changes is read from disk for some other reason.
if a lot of changes need to be merged, it can slow down page reads.
it is changed to “change buffer” in recent plugin
tunability
it can take up to 1/2 of the buffer pool, which isn’t tunable in standard innodb
xtradb lets you disable it, and set the max size
newer innodb plugin changed insert buffer to change buffering, and lets you disable them
disabling can be good for SSDs (why?)
inspecting status; status info after restart only shows since restart, not over the lifetime of the database
it is stored in ibdata1 file. Pages are treated same as normal buffer pool pages, subject to LRU etc.
what happens on shutdown: you can set fast_shutdown off, so a full merge happens (slow)
after a restart, it can slow down, because pages aren’t in the buffer pool, so random IO is needed to find them and merge changes into them
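For the status-inspection point above, a sketch (counter names and format vary by version):

  SHOW ENGINE INNODB STATUS\G
  -- In INSERT BUFFER AND ADAPTIVE HASH INDEX, look for something like:
  --   Ibuf: size 1, free list len 0, seg size 2,
  --   12345 inserts, 12000 merged recs, 11890 merges
  -- inserts minus merged recs approximates the un-merged backlog.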
Things that were designed but never implemented
multiple insert buffers
The Adaptive Hash Index
think of it as “recently accessed index cache”
it’s a kind of partial index: built for values that are accessed often
fast lookups for records recently accessed, which are in the buffer pool
is a hash index built over the B-tree pages; works for both pk and secondary indexes
can be built for full index entries, and for prefixes of them, depending on how they are looked up
not configurable, except you can disable it
how much does it help performance?
there is only one, it has a single mutex, can slow things a lot
xtradb lets you partition it
Buffer Pool Management
dirty pages vs clean pages
the insert buffer in memory
configuration and status inspection
the LRU list: pages in least recently used order; has a midpoint in newer innodb
young-old sublists and performance: http://www.mysqlperformanceblog.com/2011/11/13/side-load-may-massively-impact-your-mysql-performance/
the flush list: pages in oldest modified order
the free list: pages that are not used
overhead per buffer pool page makes it consume more memory than expected,
e.g. see http://www.facebook.com/note.php?note_id=491430345932 and
http://www.mysqlperformanceblog.com/2010/08/23/innodb-memory-allocation-ulimit-and-opensuse/
multiple buffer pools
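A sketch of the basic inspection for this chapter (standard status counters):

  SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';
  -- Innodb_buffer_pool_pages_data   pages holding data (clean + dirty)
  -- Innodb_buffer_pool_pages_dirty  modified, not yet flushed
  -- Innodb_buffer_pool_pages_free   on the free list
  -- dirty/total is the ratio flushing tries to keep under
  -- innodb_max_dirty_pages_pct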
Memory Management
system versus own malloc
the additional memory pool: stores dictionary
the adaptive hash index in memory
latch/lock storage (is that stored in the buffer pool?)
Checkpoints and Flushing
fuzzy vs sharp checkpoints
what happens when innodb has not enough free pages, or no space in the log files?
checkpoint spikes/stalls/furious flushing
smoothing out checkpoint writes
flush algorithms
standard in old innodb
adaptive checkpointing (xtradb)
adaptive flushing (innodb-plugin)
neighbor page flushing: next/prev pages (hurts SSD; tunable in xtradb)
flushing and page replacement
why page replacement? must clean a page before we can replace it with another from disk.
server tries to keep some pages clean: 10% in older versions, 25% in newer (innodb_max_dirty_pages_pct)
LRU algorithms: old, new
two-part lru list to guard against wiping out on scans (midpoint insertion)
the lru page replacement algorithm is explained by Inaam:
http://www.mysqlperformanceblog.com/2011/01/13/different-flavors-of-innodb-flushing/
1) if a block is available in the free list grab it.
2) else scan around 10% of the LRU list to find a clean block
3) if a clean block is found grab it
4) else trigger LRU flush and increment Innodb_buffer_pool_wait_free
5) after the LRU flush is finished try again
6) if able to find a block grab it otherwise repeat the process scanning deeper into the LRU list
There are some other areas to take care of like having an additional LRU
for compressed pages with uncompressed frames etc. And
Innodb_buffer_pool_wait_free is not indicative of total number of LRU
flushes. It tracks flushes that are triggered above. There are other
places in the code which will trigger an LRU flush as well.
flush list
contains a list of pages that are dirty, in LSN order
the main thread schedules some flushes to keep clean pages available
this is a checkpoint, as well, because it flushes from the end of the flush list
innodb_io_capacity is used by innodb here, but not by xtradb
assumed to be the disk’s writes-per-second capacity
Peter writes: Affects number of background flushes and insert buffer
merges (5% for each). What does 5% mean?
when the server is idle, it’ll do more flushing
flushing to replace is done in the user thread
What happens on shutdown
Startup, Crash Recovery, and Shutdown
What is done to boot the system up at start?
Fast vs slow shutdown
implications for the insert buffer,
http://dev.mysql.com/doc/innodb/1.1/en/innodb-downgrading-issues-ibuf.html
What is done to prepare for shutdown?
setting innodb_max_dirty_pages_pct to prepare
shutdown is blocking and you can’t safely kill the server, so it can otherwise take a while
What structures in the server have to be warmed up or cooled off? e.g. LRU
list, dirty pages…
stages of recovery: doublewrite restore, redo, undo
redo is synchronous: scan logs, read pages, compare LSNs
it happens in batches
undo is in the background since 5.0
faster with big logs (why?)
alter table commits every 10k rows to avoid long undos
very large dml is a problem, causes long undo after crash; don’t kill long txns lightly
how long will recovery take? 5.0 and 5.1 had slow algorithm; fixed in newer releases
buffer pool size also matters; in old versions, configure for small size, then restart
http://bugs.mysql.com/bug.php?id=29847
larger logs = longer recovery, but it also depends on row sizes, database size, workload
are there cases when recovery is impossible? during DDL, .FRM file is not atomic
how innodb checks and uses the binary log during recovery
the recovery threads – transactions are replayed w/o mysql threads, so they
look different
InnoDB’s I/O Behavior and File Management
How files are created, deleted, shrunk, expanded
How InnoDB opens data files: o_direct, etc
buffered vs direct IO
buffered:
advantage: faster warmup, faster flushes, reduce inode locking on ext3
bad: swap pressure, double buffering, loss of effective memory
direct:
Optimistic vs pessimistic IO
http://dom.as/2011/07/03/innodb-index-lock/
How InnoDB opens log files
always buffered, except in xtradb
How InnoDB writes and flushes data files and log files
the log buffer
flushing logs to disk; innodb_flush_log_at_trx_commit; what is safe in what conditions
I/O threads
the dedicated IO threads
the main thread does IO in its main loop
dedicated threads for purge, insert buffer merge etc
read-ahead/prefetches for random and sequential IO; how an extent is determined to
need prefetching
don’t count on it much
random read-ahead removed in version X, added back; impact of it
merging operations together, reordering, neighbor page operations
http://dom.as/2011/07/03/innodb-index-lock/
async io
simulated: arrays, slots
native on Windows and in Linux in version 1.1
which I/O operations can be foregrounded and backgrounded
most writes are in the background
flushes can be sync if there are no free pages
log writes can be sync or async, configurable
thresholds: 75% and 85% by default (confirm)
what operations block in innodb? background threads sometimes block
foreground threads; MarkC has written about;
http://bugs.mysql.com/bug.php?id=55004
I/O operations for things like insert buffer merge (causes reads) and old row version purge
the purpose of files like “/tmp/ibPR9NL1 (deleted)”
Data Manipulation (DML) Operations
select
insert
update
delete
does it compact? ___________?id=14473
The System Tables
sys_tables
sys_indexes
sys_foreign
sys_stats
sys_fields
Data Definition (DDL) Operations
How CREATE TABLE works
How ALTER TABLE works
Doesn’t it internally commit every 10k rows?
create index
fast index creation; sort buffers; sort buffer size
Creates multiple transactions: see email Re: ALTER TABLE showing up more than once in ‘SHOW ENGINE INNODB STATUS’ Transaction list?
optimize table
analyze table
How DROP TABLE works
with file-per-table, it is DROP TABLESPACE, which blocks the server; see
https://bugs.launchpad.net/percona-server/+bug/712591
InnoDB’s internal stored procedure language
Foreign Keys
implications for locking: causes additional locking, opportunities for deadlocks
cascades
nulls
rows that point to themselves, or rows that have cycles; can they be deleted?
is checking immediate, or deferred? it is immediate, not done at commit.
names are case sensitive
indexes required; change in behavior in 4.1
data types must match exactly
how they interact with indexes
appearance in SHOW INNODB STATUS
they use the internal stored procedures
InnoDB has to parse the SQL of the CREATE statement
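A sketch tying the FK points together (hypothetical tables; InnoDB creates the needed index on parent_id if one doesn’t exist):

  CREATE TABLE parent (
    id INT NOT NULL PRIMARY KEY
  ) ENGINE=InnoDB;

  CREATE TABLE child (
    id INT NOT NULL PRIMARY KEY,
    parent_id INT,                    -- NULLs never fail the check
    FOREIGN KEY (parent_id) REFERENCES parent (id)
      ON DELETE CASCADE               -- deleting a parent cascades to children
  ) ENGINE=InnoDB;                    -- key column types must match exactly

  DELETE FROM parent WHERE id = 1;    -- checks (and locks) child rows immediately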
InnoDB’s Interface to MySQL
The Handler interface
Relationship with .frm file
Built-In InnoDB
The InnoDB Plugin
Converting rows to MySQL’s row format
what columns are in every table (and can’t be used in a real table)
Communicating ha::rows_in_range and ha::info and other statistics
Communicating index capability bits (enables covering index queries)
Interaction with the query cache
MySQL thread statuses
they appear in INNODB STATUS
what statuses can be present while query is inside innodb: “statistics”
for example
Implementation in ha_innodb.cc
Hacks: magic CREATE TABLE statements like innodb_table_monitor,
parsing the SQL for FK definitions
how table and row locks are communicated between engine and server
innodb_table_locks=1 means that innodb knows about server table locks; what does it do
with them?
the server knows about row locks – and it can tell innodb to release non-matched rows?
Exercises
Index Implementation
built on b-trees
leaf vs non-leaf nodes, the row format on them
secondary indexes
data pages store the rows in a heap within the page
page fill factor
page merges and splits
is something special done during deletes?
Data Distribution Statistics
how they are gathered
inaccuracy
configurability of whether to gather them or not
stability/randomness (older InnoDB isn’t properly random and is non-uniform)
how many samples? config options that affect that
ability to stop resampling
ability to store persistently with innodb_use_sys_stats_table
How MySQL executes queries with InnoDB
high-level overview of optimization, statistics
table locking and lock releasing
releasing row locks for rows eliminated by WHERE clause in 5.1; isolation
index-only (covering index) queries
how it works
when an index query must look up the pk (when a page has a newer lsn
than the query’s lsn; rows in secondary indexes don’t have LSNs, only the
page does)
what extra columns are included in the secondary indexes
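A sketch of a covered vs non-covered query (hypothetical table):

  CREATE TABLE t (
    id BIGINT NOT NULL PRIMARY KEY,
    a INT,
    b INT,
    KEY idx_a (a)            -- secondary index; implicitly carries the PK column
  ) ENGINE=InnoDB;

  -- Covered: a and id both live in idx_a, so no PK lookup is needed.
  EXPLAIN SELECT id FROM t WHERE a = 5;   -- Extra: "Using index"

  -- Not covered: b isn't in idx_a, so each match costs a PK lookup.
  EXPLAIN SELECT b FROM t WHERE a = 5;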
Internal Maintenance Tasks
Old Row Purge
Insert Buffer Merge
The statistics collector
rules for when stats are recomputed
by mysql: at first open, when SHOW TABLE STATUS / INDEX commands are used (configured with innodb_stats_on_metadata), or when ANALYZE TABLE is used
by innodb: after size changes 1/16th or after 2B row insertions
(disable with innodb_stats_auto_update=false)
stats are computed when table is first opened, too
bug: stats not valid for an index after fast-create (http://bugs.mysql.com/bug.php?id=62516)
Jervin’s blog: http://www.mysqlperformanceblog.com/?p=7516&preview=true
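A sketch of the user-facing knobs mentioned above (5.1-era names; the table name is hypothetical):

  ANALYZE TABLE t;                          -- recompute statistics on demand
  SET GLOBAL innodb_stats_on_metadata = 0;  -- don't resample on SHOW TABLE STATUS etc.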
Tuning InnoDB
buffer pool size
using multiple buffer pools
log file size (1h worth of log writes)
log buffer size (10 sec worth of log writes; bigger for blobs; error if too small)
checkpoint behavior
flush_logs_at_trx_commit
dirty page pct in buffer pool
setting it lower doesn’t smooth IO by causing constant writing – it causes much more
IO and doesn’t give a buffer to absorb spikes.
o_direct
all configuration variables
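A sketch of the 1h log-sizing rule of thumb (standard status counter; run during representative load):

  SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';  -- note the value
  SELECT SLEEP(60);
  SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';  -- note it again
  -- (delta in bytes) * 60 ~ one hour of redo writes; divide by the number of
  -- log files (usually 2) to get innodb_log_file_size.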
Mutexes and Latches
how innodb implements rw-locks and mutexes
list of the major ones, and what they are for
the order they are locked in
log buffer
buffer pool
adaptive hash index
new_index->mutex
sync array / sync_array
ut_delay
kernel->mutex
http://www.mysqlperformanceblog.com/2011/12/02/kernel_mutex-problem-cont-or-triple-your-throughput/comment-page-1/#comment-850337
approaches to scalability: XtraDB (split into many), InnoDB 1.1 (multiple buffer pools)
what are spins, spin rounds, OS waits, how are they configurable
what is the OS wait array
what are condition variables, and what are they for
broadcasts are expensive as the txn list grows, according to mark callaghan?
what are innodb events? See http://mysqlha.blogspot.com/2011/02/this-happens-when-you-dont-have.html
InnoDB Threads
thread per connection
thread statuses (innodb internal ones, not mysql ones)
user threads
recovery threads
IO threads
main thread
schedules other things
flush
purge
checkpoint
insert buffer merge
deadlock detector
monitoring thread
error monitor thread; see sync/sync0arr.c
statistics thread (??)
log thread
purge thread
innodb_thread_concurrency and the queue, concurrency tickets
limit includes threads doing disk io or storing data in tmp table
Internal Structures
Data dictionary
auto-increment values; populated at startup
statistics
system info
the size overhead per table can be 4-10k; this is version dependent
xtradb lets you limit this
Arrays
os_aio_read_array, os_aio_write_array, os_aio_ibuf_array
mutex/latch/semaphore/whatever arrays
data structures and what they’re used for
heaps (rows in the page, for example)
b-trees
linked lists (?)
arrays
XtraBackup
InnoDB Recovery Tools
Inspecting Status
SHOW STATUS counters
Innodb_buffer_pool_wait_free, per Inaam, is not indicative of total number of
LRU flushes. It tracks flushes that are triggered from an LRU flush; there are more
places in the code which will trigger an LRU flush as well.
show innodb status
innodb status monitor
lock monitor
tablespace monitor
writing to a file
truncation and the in-memory copy and its filehandle
information_schema tables in the plugin (esp. locks and trx)
how it attempts to provide a consistent view of the tables; consult
https://bugs.launchpad.net/bugs/677407
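A sketch of the waits-for view those tables enable (plugin/5.1+ information_schema tables):

  SELECT r.trx_id              AS waiting_trx,
         r.trx_mysql_thread_id AS waiting_thread,
         b.trx_id              AS blocking_trx,
         b.trx_mysql_thread_id AS blocking_thread
  FROM information_schema.innodb_lock_waits w
  JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
  JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;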
show mutex status
Further reading:
http://blogs.innodb.com/wp/2011/04/mysql-5-6-multi-threaded-purge/
XDES
The InnoDB core sub-systems are:
The Locking sub-system
The Transaction sub-system
MVCC views
(http://blogs.innodb.com/wp/2011/04/mysql-5-6-innodb-scalability-fix-kernel-mutex-removed/)
The Wikipedia article on InnoDB?
InnoDB does bulk commits for things like ALTER, every 10k rows, to avoid problems internally.