2013-06-19

‎Character sets: new section

← Older revision

Revision as of 14:05, 19 June 2013

Line 675:

Line 675:

Is section [[Manual:Backing_up_a_wiki#Tables|Tables]] still relevant? Maybe it would be useful to have an explanation of which character sets to look out for, by MediaWiki version. When installing 1.19.2 -- 1.20.2 there is no option for latin1, it's utf8 or a default of binary. --[[User:Robkam|Robkam]] ([[User talk:Robkam|talk]]) 22:16, 22 December 2012 (UTC)

Is section [[Manual:Backing_up_a_wiki#Tables|Tables]] still relevant? Maybe it would be useful to have an explanation of which character sets to look out for, by MediaWiki version. When installing 1.19.2 -- 1.20.2 there is no option for latin1, it's utf8 or a default of binary. --[[User:Robkam|Robkam]] ([[User talk:Robkam|talk]]) 22:16, 22 December 2012 (UTC)

+

+

== Character sets ==

+

+

I've moved some overly detailed information here that used to be on the main page. [[User:Graham87|Graham87]] ([[User talk:Graham87|talk]]) 14:05, 19 June 2013 (UTC)

+

=== Character set ===

+

+

{{warning| In some common configurations of '''MySQL 4.1 and later''', mysqldump can corrupt MediaWiki's stored text. If your database's character set is set to "latin1" rather than "UTF-8", mysqldump in 4.1+ will apply a character set conversion step which can '''corrupt text containing non-English characters''' as well as punctuation like "smart quotes" and long dashes used in English text.}}

+

+

You can see which character set your tables are using with a mysql statement like SHOW CREATE TABLE text; (including the semicolon). The last line will include a DEFAULT CHARSET clause.

+

+

If the last line does not include a DEFAULT CHARSET clause then there is another way if you know that nobody has changed the character set of the database server since it was installed and the wiki's database was created using the default character set of the database. The STATUS command displays the database server's default character set next to Server characterset:. Here is an example output:

+

+

mysql> status

+

- - - - - - - - -

+

mysql Ver 12.22 Distrib 4.0.20a, for Win95/Win98 (i32)

+

+

Connection id: 13601

+

Current database:

+

Current user: root@localhost

+

SSL: Not in use

+

Server version: 4.0.20a-nt

+

Protocol version: 10

+

Connection: localhost via TCP/IP

+

Client characterset: latin1

+

'''Server characterset: latin1'''

+

TCP port: 3306

+

Uptime: 27 days 4 hours 58 min 26 sec

+

+

Use the option --default-character-set=latin1 on the mysqldump command line to avoid the conversion if you find it set to "latin1".

+

+

Like this:

+

+

+

Also one can try --default-character-set=binary . [http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL “Convert latin1 to UTF-8 in MySQL”] on Gentoo Linux Wiki has more information.

+

+

==== Latin-1 to UTF-8 conversion ====

+

{{note|In the following I intentionally use different input and output file names for commands using ''sed'' because the -i (inplace) option of ''sed'' throws problems on very big dumps. The described procedure was used several times and works 100% reliably. The steps do not change your existing database. You can use the old wiki with the old database until your new wiki runs with the new database, the UTF-8 copy clone of the old one. This section is contributed and updated by --[[User:Wikinaut|Wikinaut]] 10:31, 18 February 2010 (UTC). Feedback is welcome.}}

+

+

When you want to upgrade from a rather old Mediawiki installation with Latin-1 to UTF-8 which might be tricky depending on your operating system and MySQL settings - in my example from Mediawiki 1.5 (2004) to 1.15.1 (2009) - perform the following steps as found in the article ''[http://tlug.dnho.net/?q=node/276 Convert a MySQL DB from latin1 to UTF8]'' and further adapted to Mediawiki specialities (DBNAME is the name of your wiki database):

+

mysqldump -u root -p --opt --default-character-set=latin1 --skip-set-charset DBNAME > DBNAME.sql

+

+

Then use ''sed'' to change character settings latin1 to utf8:

+

+

+

Every character in UTF-8 needs up to 3 bytes, thus it is necessary to decrease one key which is done with the following command

+

Relevant sources:

+

* [[bugzilla:1322|Bug 1322 "Specified key was too long" (MySQL error) during installation"]]

+

* [http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_15/phase3/maintenance/tables.sql?revision=50060&view=markup MediaWiki maintenance/tables.sql]

+

* I found different MediaWiki versions already using different cl_sortkey length and took this into account below

+

* [http://bugs.mysql.com/bug.php?id=4541 "Specified key was too long; max key length is 1000 bytes"]: ''"Truncate so that the cl_sortkey key fits in 1000 bytes"''

+

A further problem which prevents reimporting the database was the ''math'' table (''ERROR line 389: Duplicate entry
''
for key 1'' when trying to import the mysqldump). I solved it by simply deleting the ''math'' table content, as this is only a cache and need not to be imported when upgrading.

+

+

sed -e 's/`cl_sortkey` varchar(255)/`cl_sortkey` varchar(70)/gi' DBNAME2.sql > DBNAME21.sql

+

sed -e 's/`cl_sortkey` varchar(86)/`cl_sortkey` varchar(70)/gi' DBNAME21.sql > DBNAME22.sql

+

sed -e 's/`cl_sortkey`(128)/`cl_sortkey`(70)/gi' DBNAME22.sql > DBNAME23.sql

+

sed -e '/^INSERT INTO `math/d' DBNAME23.sql > DBNAME3.sql

+

+

From here I then created a new database DBNEW and then imported the dumpfile

+

mysql -u root -p -e "create database DBNEW"

+

mysql -u root -p --default-character-set=utf8 DBNEW

+

+

Now start a fresh MediaWiki installation and use your new wiki database name DBNEW - actually the UTF-8 converted copy of your untouched old DBNAME wiki - and the database copy will be automatically upgraded to the recent MediaWiki database scheme. Several successful conversions from MediaWiki 1.5 to MediaWiki 1.15.1 under PHP 5.2.12 (apache2handler) and MySQL 4.1.13 have been made.

+

+

==== Latin-1 to UTF-8 conversion under Windows ====

+

#Dump your Database as usual.

+

#Convert your Database using the character set conversion utility [http://www.marblesoftware.com/Marble_Software/Charco.html Charco]

+

#Replace all ''latin1'' thru ''utf8'' inside the dump.

+

#Import the dump into a new DB or overwrite the old.

+

#Ready

+

Tested under WindowsXP. Mediawiki 1.13.2 dumped under EasyPHP 1.8.0.1. Converted with Chargo 0.8.1. Imported to XAMPP 1.7.3. Updated to Mediawiki 1.15.1.

+

+

==== Latin-1 to UTF-8 conversion under Mac ====

+

+

# First export your Database as usual, separated into schema and data. You can use the terminal command mysqldump, which the official installer places in /usr/local/mysql/. Note that in the following lines, the lack of spaces between -u and username and -p and password is deliberate:

+

#*./mysqldump --default-character-set=latin1 --skip-set-charset -d -u''user'' -p''password'' ''DBNAME'' > ~/db_schema.sql

+

#*./mysqldump --default-character-set=latin1 --skip-set-charset -t -u''user'' -p''password'' ''DBNAME'' > ~/db_data.sql

+

# The database exports are now in your personal folder. Convert both exports with [http://www.marblesoftware.com/Marble_Software/Charco.html Charco] from ISOlatin1 to UTF-8. Append "_utf8" to the output file names and fix the .txt extension that Charco enforces back to .sql.

+

# Open the file ~/db_schema_utf8.sql with Text Editor and replace each "DEFAULT CHARSET=latin1" phrase with "DEFAULT CHARSET=utf8"

+

# Make a new database using [http://www.sequelpro.com/ Sequel Pro], with encoding "UTF-8 Unicode (utf8)".

+

#* Import the ~/db_schema_utf8.sql file into your new database

+

#* Import the ~/db_data_utf8.sql file into your new database

+

#* ensure that your wiki user has access to the new database by adding a relevant line in your MYSQL database in the DB table

+

# Change the variable ''$wgDBname'' in your LocalSettings.php to reflect the name of the new database. Then test if everything works. If not, flip back to the old database and try a different method.

+

# (Optional) Delete the old database, and (also optional) rename the new database to the old database and revert the change in ''$wgDBname''.

+

+

This sequence was adapted from [http://khelll.com/blog/mysql/changing-database-encoding-from-latin-to-utf8/ Khelll's Blog], and used for Mediawiki 1.19.2 and MySQL 5.1.57. It will fix encodings that already show up as garbled under an updated wiki installation as well.

+

+

==== Repairing corrupted character sets ====

+

In case your database's character set got corrupted (see warning above), an easy way to fix the corrupted characters and remedy the situation for future backups has been posted in this [http://www.webyog.com/faq/content/34/152/en/my-special-characters-display-as-_noisy_-latin-characters-in-sqlyog.html source]

+

+

Directly changing all latin1-encoded columns to UTF-8 won't help, as MySQL will just transform the erroneous characters directly. The remedy is to change the wrongly encoded latin1 string type (char/varchar/TEXT) into a binary type (binary/varbinary/BLOB). A conversion into a UTF8-encoded string type (char/varchar/TEXT) will then fix all your previously erroneous characters to their proper representation.

+

+

In short: latin1 char/varchar/TEXT -> binary/varbinary/BLOB -> UTF8 char/varchar/TEXT

+

+

Also don't forget to change the default charset for your database and the single tables to UTF-8, so your character sets won't get corrupted again.

Show more