utf8 vs utf8mb4 in MySQL: The Complete Technical Guide
MySQL's utf8 character set is a misnomer β it is not a true UTF-8 implementation. It encodes characters using only 1 to 3 bytes, which means it silently drops or rejects any Unicode code point above U+FFFF, including every emoji and a significant portion of supplementary CJK characters. utf8mb4 is MySQL's correct, full UTF-8 implementation, supporting 1 to 4 bytes per character and the complete Unicode range. For any production database built after 2010, utf8mb4 is the only defensible choice.
This guide explains exactly why that distinction matters, where the original utf8 design went wrong, how to migrate safely, and how to configure MySQL correctly at the server, database, table, and connection level.
The Core Problem: Why MySQL's utf8 Is Broken by Design
The UTF-8 encoding standard (RFC 3629) defines a variable-width scheme that uses 1 to 4 bytes to represent every valid Unicode code point β over 1.1 million possible characters. When MySQL introduced its `utf8` character set in version 4.1, the implementation was intentionally capped at 3 bytes per character. This was a deliberate engineering shortcut, not an oversight.
At the time, the InnoDB row format imposed a 767-byte limit on index key prefixes. Supporting 4-byte characters would have reduced the maximum indexed prefix length for `VARCHAR` columns, creating index compatibility problems. The 3-byte cap was a pragmatic workaround that became a long-term liability.
The practical consequence: any Unicode code point in the Supplementary Multilingual Plane (SMP) β code points U+10000 and above β cannot be stored in a `utf8` column. This includes:
- All standard emoji (U+1F600 and beyond)
- Mathematical alphanumeric symbols (U+1D400βU+1D7FF)
- Musical notation symbols
- Historic scripts such as Linear B, Gothic, and Cuneiform
- Supplementary CJK Unified Ideographs (U+20000βU+2A6DF)
- Certain currency symbols and technical operators added in recent Unicode versions
When an application attempts to insert a 4-byte character into a `utf8` column, MySQL either returns a `Incorrect string value` error or, if `sql_mode` is permissive, silently truncates the data. Silent truncation is arguably the more dangerous outcome β your application receives no error, but your data is corrupted.
utf8mb4: The Correct Implementation
MySQL introduced utf8mb4 in version 5.5.3 (released 2010) specifically to address this deficiency. The `mb4` suffix stands for "multi-byte, 4 bytes maximum." It is a strict superset of `utf8` β every character representable in `utf8` is identically representable in `utf8mb4`. There is no data loss when migrating from `utf8` to `utf8mb4`.
utf8mb4 maps directly to the RFC 3629 UTF-8 standard. It handles the full Unicode code space from U+0000 to U+10FFFF without restriction.
utf8 vs utf8mb4: Feature Comparison
| Feature | utf8 (MySQL) | utf8mb4 |
|---|
| — | — | — |
|---|
| Bytes per character | 1β3 | 1β4 |
|---|
| Unicode coverage | BMP only (U+0000βU+FFFF) | Full (U+0000βU+10FFFF) |
|---|
| Emoji support | No | Yes |
|---|
| Supplementary CJK | No | Yes |
|---|
| RFC 3629 compliant | No | Yes |
|---|
| Max index prefix (InnoDB, 4KB pages) | 767 bytes | 767 bytes (191 chars) |
|---|
| Max index prefix (innodb_large_prefix) | 3072 bytes | 3072 bytes (768 chars) |
|---|
| Storage overhead vs latin1 | Identical for ASCII | Identical for ASCII |
|---|
| Recommended for new projects | No | Yes |
|---|
| MySQL version introduced | 4.1 | 5.5.3 |
|---|
Collation Choices Within utf8mb4
Selecting utf8mb4 as your character set is only half the decision. The collation determines how strings are compared, sorted, and indexed. The wrong collation causes subtle, hard-to-debug query behavior.
utf8mb4_unicode_ci
Based on the Unicode Collation Algorithm (UCA). Handles language-specific sorting rules correctly. Slightly slower than `utf8mb4_general_ci` due to more complex comparison logic, but the performance difference is negligible on modern hardware.
utf8mb4_general_ci
A simplified collation that does not fully implement UCA. Faster in benchmarks from the early 2010s, but the speed advantage is irrelevant on current CPUs. It handles some edge cases incorrectly β for example, it treats certain German characters as equivalent when they should not be. Avoid for new projects.
utf8mb4_0900_ai_ci
Available in MySQL 8.0+. Based on Unicode 9.0 with accent-insensitive (`ai`) and case-insensitive (`ci`) comparison. This is the recommended default for MySQL 8.0 and later. It is faster than `utf8mb4_unicode_ci` and more accurate.
utf8mb4_bin
Binary comparison β case-sensitive, accent-sensitive, no locale-specific rules. Use when you need exact byte-level matching, such as for password hashes or case-sensitive identifiers.
Recommendation: Use `utf8mb4_0900_ai_ci` on MySQL 8.0+. Use `utf8mb4_unicode_ci` on MySQL 5.7 and earlier.
Storage and Index Implications
A common concern when migrating from utf8 to utf8mb4 is storage overhead. In practice, the impact is minimal:
- ASCII characters (U+0000βU+007F) still occupy exactly 1 byte in both encodings.
- Most Latin, Greek, Cyrillic, Arabic, and Hebrew characters occupy 2 bytes in both encodings.
- CJK characters in the BMP occupy 3 bytes in both encodings.
- Only supplementary characters (emoji, supplementary CJK) require 4 bytes β and these were simply unrepresentable in utf8 before.
The real index concern is the 767-byte InnoDB index prefix limit on older configurations. With utf8mb4, a 4-byte-per-character worst case means a 191-character `VARCHAR` index prefix hits the 767-byte ceiling. With `utf8`, the same ceiling allowed 255 characters. If you have `VARCHAR(255)` columns with full-column indexes, you may encounter `Specified key was too long` errors during migration.
Solutions:
- Enable `innodb_large_prefix = ON` (MySQL 5.6/5.7) to raise the limit to 3072 bytes.
- Use `ROW_FORMAT=DYNAMIC` or `ROW_FORMAT=COMPRESSED` on affected tables.
- In MySQL 8.0, `innodb_large_prefix` is enabled by default and the parameter is removed.
- Shorten index prefixes: `INDEX (column(191))` instead of `INDEX (column(255))`.
This is the most common migration failure point and the one most frequently underdocumented in basic guides.
How to Migrate a MySQL Database from utf8 to utf8mb4
Migration is straightforward but requires precision. Skipping any layer β server, database, table, or connection β leaves your application silently falling back to the old encoding.
Step 1: Back Up the Database
Never modify character encoding on a live database without a verified backup.
“`bash
mysqldump -u username -p –single-transaction –routines –triggers
database_name > database_backup_$(date +%F).sql
“`
The `–single-transaction` flag ensures a consistent snapshot for InnoDB tables without locking. Store the backup in a location separate from the database server before proceeding.
Step 2: Update the MySQL Server Configuration
Edit `/etc/mysql/my.cnf` or `/etc/mysql/mysql.conf.d/mysqld.cnf` depending on your distribution:
“`ini
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
For MySQL 5.6/5.7 only β remove on MySQL 8.0
innodb_large_prefix = ON
innodb_file_format = Barracuda
innodb_file_per_table = ON
“`
Restart MySQL:
“`bash
sudo systemctl restart mysql
“`
Step 3: Convert the Database
“`sql
ALTER DATABASE database_name
CHARACTER SET = utf8mb4
COLLATE = utf8mb4_unicode_ci;
“`
Step 4: Convert All Tables
Generate and execute `ALTER TABLE` statements for every table. Running them manually on large schemas is error-prone. Use this query to generate the statements automatically:
“`sql
SELECT CONCAT(
'ALTER TABLE `', TABLE_NAME, '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;'
)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'database_name'
AND TABLE_TYPE = 'BASE TABLE';
“`
Execute each generated statement. The `CONVERT TO CHARACTER SET` syntax changes both the table default and all existing character columns in a single operation.
Step 5: Fix Index Length Errors
If you encounter `Specified key was too long; max key length is 767 bytes`, identify the offending index:
“`sql
— Change full-column index to prefix index
ALTER TABLE table_name DROP INDEX index_name;
ALTER TABLE table_name ADD INDEX index_name (column_name(191));
“`
For WordPress databases specifically, the `wp_options` table's `option_name` column and the `wp_postmeta` `meta_key` column are common sources of this error.
Step 6: Verify the Conversion
“`sql
— Check server-level variables
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
— Check a specific table
SHOW CREATE TABLE table_nameG
— Check all columns in a database
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'database_name'
AND DATA_TYPE IN ('char', 'varchar', 'text', 'tinytext', 'mediumtext', 'longtext');
“`
Every `CHARACTER_SET_NAME` value should read `utf8mb4`.
Step 7: Update Application Connection Strings
The server and schema encoding means nothing if your application connects using the wrong character set. The connection-level encoding overrides the server default.
PHP (PDO):
“`php
$dsn = 'mysql:host=localhost;dbname=database_name;charset=utf8mb4';
$pdo = new PDO($dsn, $user, $pass, [
PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci"
]);
“`
PHP (MySQLi):
“`php
$mysqli = new mysqli('localhost', $user, $pass, $db);
$mysqli->set_charset('utf8mb4');
“`
Python (mysql-connector-python):
“`python
cnx = mysql.connector.connect(
host='localhost', user=user, password=pass,
database=db, charset='utf8mb4', collation='utf8mb4_unicode_ci'
)
“`
Node.js (mysql2):
“`javascript
const pool = mysql2.createPool({
host: 'localhost', user: user, password: pass,
database: db, charset: 'utf8mb4'
});
“`
Failing to set the connection charset is the single most common reason why emoji still fail to insert after a supposedly complete migration.
WordPress-Specific Considerations
WordPress has shipped with utf8mb4 as its default character set since version 4.2 (April 2015). If you are running a WordPress installation on an older database that was never migrated, the `wp-config.php` file may still contain:
“`php
define('DB_CHARSET', 'utf8');
“`
Change this to:
“`php
define('DB_CHARSET', 'utf8mb4');
define('DB_COLLATE', 'utf8mb4_unicode_ci');
“`
WordPress also includes a built-in upgrade routine (`maybe_convert_table_to_utf8mb4()`) that runs during core updates. However, this routine does not always catch every table, particularly those created by plugins. Running the manual `ALTER TABLE` approach described above is more reliable.
On a VPS Hosting environment with root access, you can automate this entire process with a shell script and schedule it as a one-time cron job, giving you full control over timing and logging.
Performance Considerations
The performance impact of utf8mb4 versus utf8 is negligible for the vast majority of workloads:
- Read queries: No measurable difference for BMP characters. Supplementary characters require one additional byte of I/O, which is absorbed by buffer pool caching.
- Write queries: Identical for ASCII and BMP content. Marginally higher for supplementary characters.
- Index operations: The reduced maximum prefix length (191 vs 255 characters for full-width indexes) can affect query plans if you have full-column indexes on long `VARCHAR` columns. Audit your indexes before and after migration.
- Memory: MySQL allocates fixed-width buffers for string operations based on the maximum bytes per character. Switching from utf8 (3 bytes max) to utf8mb4 (4 bytes max) increases the memory allocated for in-memory sort buffers and temporary tables by approximately 33% for string-heavy operations. On a Dedicated Server with ample RAM, this is inconsequential. On a memory-constrained shared environment, monitor `sort_buffer_size` and `tmp_table_size` after migration.
When utf8 Is Still Acceptable
There is a narrow set of legitimate reasons to retain `utf8`:
- Strict legacy compatibility: An application using an unmaintained ORM or database driver that cannot handle 4-byte characters. This is a technical debt problem, not a reason to keep utf8 indefinitely.
- Read-only archival databases: If a database will never receive new writes and the existing data contains no supplementary characters, migration adds risk with no benefit.
- Hard storage constraints: In extreme edge cases β embedded systems or severely capacity-constrained environments β the marginal storage difference might matter. This does not apply to any standard web hosting scenario.
In every other case, utf8mb4 is the correct choice. The argument that utf8 saves storage space is technically true only for supplementary characters, which were unrepresentable in utf8 anyway. You are not saving space on data you could not store.
Choosing the Right Hosting Environment for MySQL utf8mb4
Proper utf8mb4 configuration requires access to the MySQL server configuration file (`my.cnf`). This rules out most shared hosting environments where you cannot modify server-level variables.
For full control over MySQL character encoding, collation, InnoDB settings, and connection parameters, you need either a VPS Hosting plan with root access or a Dedicated Server. Both give you direct access to `/etc/mysql/my.cnf`, the ability to restart the MySQL service, and the freedom to configure `innodb_large_prefix`, `ROW_FORMAT`, and other parameters that affect utf8mb4 migration success.
If you manage multiple databases or client sites, a VPS with cPanel provides a graphical interface for database management while retaining the underlying server access needed for character set configuration. For teams that prefer command-line flexibility with a lightweight panel, VPS Control Panels offer several alternatives suited to different operational workflows.
For projects that also require secure data transmission, pairing your database migration with a properly configured SSL Certificate ensures that utf8mb4-encoded data is protected in transit, not just at rest.
Technical Decision Checklist
Use this checklist before and after any utf8 to utf8mb4 migration:
Pre-migration:
- [ ] Full `mysqldump` backup verified and restorable
- [ ] MySQL version confirmed (5.5.3+ required for utf8mb4)
- [ ] `innodb_large_prefix` status checked (enable if on MySQL 5.6/5.7)
- [ ] All `VARCHAR(255)` columns with full-column indexes identified
- [ ] Application connection charset code reviewed and updated
- [ ] Maintenance window scheduled for production databases
Post-migration:
- [ ] `SHOW VARIABLES LIKE 'character_set%'` shows `utf8mb4` at server level
- [ ] `SHOW CREATE TABLE` confirms `utf8mb4` on all converted tables
- [ ] `information_schema.COLUMNS` query confirms no remaining `utf8` columns
- [ ] Application-level `SET NAMES utf8mb4` or equivalent confirmed in connection code
- [ ] Emoji insertion test passed on a representative table
- [ ] Query performance baseline compared to pre-migration metrics
- [ ] Index lengths verified β no silent truncation of long indexed values
FAQ
Does migrating from utf8 to utf8mb4 cause data loss?
No. utf8mb4 is a strict superset of MySQL's utf8. Every character stored in a utf8 column is identically representable in utf8mb4. The migration is non-destructive for existing data. The only risk is index length errors on `VARCHAR(255)` columns with full-column indexes, which must be resolved by shortening the index prefix.
Why do emoji still fail to insert after I converted my tables to utf8mb4?
The most common cause is the application connection charset. If your PHP, Python, or Node.js code connects without explicitly specifying `utf8mb4`, MySQL uses the server's `character_set_client` default for that session. Add `SET NAMES utf8mb4` or the equivalent charset parameter to your connection configuration.
What is the difference between utf8mb4_unicode_ci and utf8mb4_0900_ai_ci?
`utf8mb4_unicode_ci` is based on Unicode 4.0 collation rules and is the standard choice for MySQL 5.7. `utf8mb4_0900_ai_ci` is based on Unicode 9.0, is the default in MySQL 8.0, and is both faster and more linguistically accurate. Use `utf8mb4_0900_ai_ci` on MySQL 8.0+ for new projects.
Will switching to utf8mb4 increase my database storage size significantly?
In practice, no. ASCII and most BMP characters use the same number of bytes in both encodings. Only supplementary characters (emoji, supplementary CJK) use 4 bytes β and those were unrepresentable in utf8 before. The memory overhead for sort buffers increases by roughly 33% for string-heavy operations, but this is negligible on any modern server.
Can I configure utf8mb4 on shared hosting?
Partially. You can set the character set at the database and table level using SQL `ALTER` statements, and you can specify the charset in your application's connection string. However, you cannot modify `my.cnf` or restart MySQL on shared hosting. Server-level defaults will remain unchanged, which means new databases created through the hosting panel may default to utf8. Full utf8mb4 configuration requires a VPS or dedicated server with root access.
