What is Base64 Encoding? A Complete Developer Guide
The first time Base64 cost me real money, I had embedded a 1.4 MB hero image as a data URI in a marketing email.
The QA inbox rendered it fine. Gmail truncated the email at around 102 KB and silently dropped half the body. The campaign went out with a broken layout to roughly 60,000 subscribers before anyone noticed.
I had used Base64 the way most developers first learn it: as a vague "wrap binary in text" tool, without understanding the size cost, the encoding rules, or where it stops being a good idea. This guide is the explanation I wish I had read before that email went out.
Table of Contents
- What Base64 Actually Is
- The Algorithm: 6-Bit Chunking, Worked by Hand
- Why the Equals Signs Show Up
- The 33 Percent Tax (and Where It Comes From)
- URL-Safe Base64 and Why It Exists
- Where You Actually Meet Base64 in the Wild
- When Base64 Is the Wrong Answer
- Mistakes I Have Made and Reviewed
- Key Takeaways
- Closing Thoughts
Base64 is one of those primitives that everybody uses and almost nobody understands deeply. It shows up in JWTs, in HTTP Basic Auth headers, in email attachments, in data URIs, in PEM certificates, in OAuth flows, and in every "encode this image and paste it" Stack Overflow answer ever written.
I have been writing back-end and integration code for about six years. Most of the bugs I have shipped that touched Base64 came from treating it as a black box. The fixes were always the same: read the spec, count bits, accept that the output is bigger than the input, and choose the right variant for the channel you are putting it through.
This is the long-form explanation. If you only ever needed to convert an image, the StackConvert Base64 tool already does that in your browser. This guide is for the day you need to know why the output looks the way it does.
What Base64 Actually Is
Base64 is a binary-to-text encoding. It takes any sequence of bytes and represents them using a fixed alphabet of 64 printable ASCII characters: A through Z, a through z, 0 through 9, and the two symbols + and /. There is also a 65th character, =, used only for padding at the end.
The reason it exists at all is older than most people guess. The original use case in the 1980s was email. SMTP was a 7-bit text protocol, so anything binary, like an image or a Word document, had to be re-expressed as printable text before it could traverse a mail relay. The encoding was specified for Privacy-Enhanced Mail in RFC 1421 and adopted by MIME in RFC 2045, and the modern definition lives in RFC 4648.
Three things to internalize before anything else:
- Base64 is not encryption. Anyone with the string can decode it in milliseconds. Treating Base64 as obfuscation is a classic junior-engineer mistake and a classic security review finding.
- Base64 is not compression. The output is always larger than the input: about 33% larger, for reasons we will work out in a minute.
- Base64 is not a hash. A hash is one-way and lossy. Base64 is two-way and lossless. Encode then decode and you get the original bytes back, byte for byte.
The single line that captures it: Base64 is a way to put binary data inside a text channel without anything breaking on the way through.
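To make the "two-way and lossless" point concrete, here is a minimal sketch using Python's standard `base64` module. The sample bytes are arbitrary and chosen only because they are not valid text on their own.

```python
import base64

# Arbitrary binary data, including bytes that are not printable text.
original = bytes([0x00, 0xFF, 0x10, 0x80, 0x7F])

encoded = base64.b64encode(original)   # bytes made up only of ASCII characters
decoded = base64.b64decode(encoded)    # no key or secret needed to reverse it

print(encoded.decode("ascii"))
print(decoded == original)             # the round trip is lossless
print(len(original), len(encoded))     # the encoded form is longer
```

Nothing in the round trip needs a secret, which is the whole reason Base64 cannot serve as encryption.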
The Algorithm: 6-Bit Chunking, Worked by Hand
Bytes are 8 bits. The Base64 alphabet has 64 symbols, which fits exactly into 6 bits (2^6 = 64). The whole encoding is built around that mismatch between 8 and 6.
The algorithm:
- Take the input bytes and write out their bits as one long stream.
- Split that stream into groups of 6 bits instead of 8.
- Look up each 6-bit value in the Base64 alphabet (0 maps to A, 1 to B, and so on, up to 63 mapping to /).
- If the input did not divide evenly into 6-bit groups, pad with = at the end.
Let me walk through it with the three bytes that spell Cat:
Input bytes:     C        a        t
ASCII codes:     67       97       116
Binary (8-bit):  01000011 01100001 01110100
Concatenate:     010000110110000101110100
Regroup as 6:    010000 110110 000101 110100
Decimal:         16     54     5      52
Base64 char:     Q      2      F      0
Result:          "Q2F0"
Three input bytes (24 bits total) became four output characters (4 × 6 = 24 bits). That ratio, 3 input bytes to 4 output characters, is the heart of Base64 and the source of every other property of the encoding.
Anything you can compute about Base64 size, performance, or padding falls out of that 3-to-4 relationship.
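The hand-worked steps can also be written out as code. This is a teaching sketch, not how production encoders are built (they work on integers, not bit strings), and the names `ALPHABET` and `encode_by_hand` are made up for illustration.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_by_hand(data: bytes) -> str:
    # Write every byte as 8 bits and join them into one long bit string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Zero-fill the tail so the stream splits evenly into 6-bit groups.
    bits += "0" * (-len(bits) % 6)
    # Map each 6-bit group to its character in the alphabet.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad with "=" so the output length is a multiple of 4 characters.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)

print(encode_by_hand(b"Cat"))  # Q2F0

# Cross-check against the standard library for a range of input lengths.
for n in range(0, 20):
    sample = bytes(range(n))
    assert encode_by_hand(sample) == base64.b64encode(sample).decode("ascii")
```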
Why the Equals Signs Show Up
The 3-to-4 ratio is clean only when the input length is a multiple of 3. The world is rarely that polite. So Base64 has to define what to do with leftover bytes.
The rule is simple. Base64 always emits output in 4-character blocks. If the input does not provide enough bits to fill a final block, the encoder appends = characters as placeholders.
- Input length divisible by 3: no padding. Cat -> Q2F0, no equals signs.
- Input length leaves 1 leftover byte: two padding characters. C (1 byte) -> Qw==
- Input length leaves 2 leftover bytes: one padding character. Ca (2 bytes) -> Q2E=
That gives you a useful party trick. If you see exactly one = at the end of a Base64 string, the original payload's length mod 3 was 2. If you see two, the original payload's length mod 3 was 1. If you see none, the original was a clean multiple of 3.
Here is a subtlety I have seen burn teams more than once. The padding is optional in some contexts: RFC 4648 lets a referring specification omit it when the data length is known by other means, and the JWT specs require stripping it. But many strict decoders will reject unpadded input. Whether you can drop the = depends entirely on which decoder is on the other end.
Default to keeping the padding. Strip it only when the spec for the destination format says you must.
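When a format does strip the padding, the decoder side has to put it back. A minimal sketch of that, assuming the input is otherwise valid Base64; `strip_padding` and `restore_padding` are hypothetical helper names, not library functions.

```python
import base64

def strip_padding(encoded: str) -> str:
    return encoded.rstrip("=")

def restore_padding(unpadded: str) -> str:
    # Padded Base64 is always a multiple of 4 characters long, so the number
    # of missing "=" characters follows from the length alone.
    # (A length that is 1 mod 4 is never valid; the decoder will reject it.)
    return unpadded + "=" * (-len(unpadded) % 4)

for payload in (b"Cat", b"Ca", b"C"):
    padded = base64.b64encode(payload).decode("ascii")
    bare = strip_padding(padded)
    assert base64.b64decode(restore_padding(bare)) == payload
    print(payload, padded, bare)
```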
The 33 Percent Tax (and Where It Comes From)
The most common Base64 myth I hear from junior developers is "it just adds a few bytes." It does not. The output is exactly 4/3 the size of the input, before padding. That is a multiplier that scales with the input, not a fixed overhead.
The arithmetic:
output_length = ceil(input_length / 3) * 4
For a 1 MB file:
1,048,576 bytes in
ceil(1,048,576 / 3) * 4 = 349,526 * 4 = 1,398,104 bytes out
Overhead: ~33.3%
For a 10 MB file:
+3.33 MB of pure encoding overhead
That overhead is why my email blew up. A 1.4 MB image became roughly 1.87 MB of Base64 text, plus the surrounding HTML, plus quoted-printable encoding from the email client adding line breaks and escapes. The final body comfortably exceeded Gmail's 102 KB clipping threshold.
Two more things compound the cost in real systems:
- JSON re-escaping. If a Base64 string travels inside a JSON string field and gets re-escaped (for example through a logging pipeline), you can pay the encoding tax twice.
- Network frames. A 33% larger payload means more TCP segments, more chance of fragmentation, and more bytes for TLS to encrypt and transmit, which adds up to measurably slower transfers for large blobs.
If you are routinely Base64-encoding files larger than a few hundred kilobytes and shipping them inside JSON, the overhead is no longer an academic concern. It is a budget line.
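The size formula is easy to turn into a budgeting helper. A small sketch, with `encoded_length` as an illustrative name; it uses integer arithmetic in place of `ceil` and counts the padded output only, not line breaks or JSON escaping.

```python
import base64

def encoded_length(input_length: int) -> int:
    """Length in characters of padded Base64 output for input_length bytes."""
    # Integer form of ceil(input_length / 3) * 4.
    return (input_length + 2) // 3 * 4

one_mib = 1_048_576
print(encoded_length(one_mib))                # 1398104
print(encoded_length(one_mib) / one_mib - 1)  # overhead as a fraction, about 0.333

# Spot-check the formula against the real encoder for small inputs.
for n in range(0, 50):
    assert encoded_length(n) == len(base64.b64encode(bytes(n)))
```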
URL-Safe Base64 and Why It Exists
Standard Base64 uses + and / in its alphabet. Both are reserved characters in URLs. + is interpreted as a space in application/x-www-form-urlencoded bodies, and / is a path separator. Putting a regular Base64 string into a URL without escaping it is asking for breakage.
RFC 4648 section 5 defines a variant called base64url:
- + becomes -
- / becomes _
- Padding = is usually stripped
Same encoding, same size, just a different alphabet that is safe to drop into a URL or an HTTP header without percent-escaping.
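A short sketch of the difference, using Python's standard library. The three input bytes are picked on purpose because they encode to nothing but + and / in the standard alphabet, and the `sig` parameter name is made up.

```python
import base64
from urllib.parse import parse_qs

raw = bytes([0xFB, 0xFF, 0xBE])  # encodes to only "+" and "/" characters

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")
print(standard)  # +/++
print(url_safe)  # -_--

# A query-string parser turns an unescaped "+" into a space,
# so the standard form does not survive the trip; the URL-safe form does.
print(parse_qs("sig=" + standard)["sig"][0] == standard)  # False
print(parse_qs("sig=" + url_safe)["sig"][0] == url_safe)  # True
```

Note that `urlsafe_b64encode` keeps the = padding. If the destination format wants it stripped, that is a separate step.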
You meet base64url in three places almost daily:
- JWT. Every part of a JWT, the header, the payload, the signature, is base64url-encoded. The signature is raw bytes, base64url'd. The header and payload are JSON, base64url'd.
- OAuth 2.0 PKCE. The code challenge is a SHA-256 hash, base64url-encoded.
- WebPush and modern crypto. Anything that lives in a URL fragment or an authorization header tends to use base64url for the same reason.
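As one example of that list, a PKCE code challenge can be sketched as follows. This assumes the S256 method, where the challenge is the unpadded base64url form of the SHA-256 hash of the code verifier. The function names are illustrative, and 32 random bytes for the verifier is a common choice rather than a requirement.

```python
import base64
import hashlib
import secrets

def b64url_nopad(data: bytes) -> str:
    # base64url with the trailing "=" padding removed.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def new_code_verifier() -> str:
    # 32 random bytes become a 43-character verifier.
    return b64url_nopad(secrets.token_bytes(32))

def code_challenge(verifier: str) -> str:
    # Hash the verifier's ASCII bytes, then base64url-encode the raw digest.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return b64url_nopad(digest)

verifier = new_code_verifier()
challenge = code_challenge(verifier)
print(len(verifier), len(challenge))  # 43 43
```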
The cost of using the wrong variant is usually invisible until production. A standard Base64 string with a / in it gets percent-encoded by your client as %2F. If any layer on the server hands the still-escaped string to the Base64 decoder, the decoder sees a literal %2F instead of /, and the signature check fails, but only for the tokens that happen to contain that character.
That bug is hard to reproduce in a unit test, easy to ship, and a nightmare to debug when it hits only a small fraction of traffic.
Where You Actually Meet Base64 in the Wild
1. Data URIs in HTML and CSS
The classic data:image/png;base64,iVBORw0K... pattern. Useful for small inline assets where avoiding an HTTP request is worth more than the bigger payload. Bad for anything over a few kilobytes, and bad for anything that browsers should be able to cache independently.
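Building a data URI is a one-liner, which is part of why it gets overused. A minimal sketch with a size guard; `to_data_uri` and the 4096-byte default limit are assumptions for illustration, not a standard threshold.

```python
import base64

def to_data_uri(content: bytes, mime_type: str, max_bytes: int = 4096) -> str:
    # Refuse large assets up front: they are better served as separate,
    # cacheable files. The limit is an arbitrary example value.
    if len(content) > max_bytes:
        raise ValueError(f"{len(content)} bytes is too large to inline")
    encoded = base64.b64encode(content).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

print(to_data_uri(b"Cat", "text/plain"))  # data:text/plain;base64,Q2F0
```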
2. JWT
A JWT is three base64url segments joined with dots: HEADER.PAYLOAD.SIGNATURE. The header and payload are JSON objects encoded as base64url so that they can travel safely in Authorization headers and URL fragments. The signature is raw bytes, encoded the same way.
Important: the encoding is not the security. A JWT payload is trivially decodable by anyone who sees the token. The integrity comes from the signature, not from the encoding. If you find yourself thinking "I will Base64 the user ID before putting it in the JWT," you have misunderstood the model.
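The sketch below shows how little stands between a token and its claims. It builds a token-shaped string by hand (the claims and the "signature" bytes are made up, and nothing here signs or verifies anything), then reads the payload back with no key.

```python
import base64
import json

def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(segment: str) -> bytes:
    # Restore the stripped padding before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

# A token-shaped string with placeholder contents.
header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url_encode(json.dumps({"sub": "user-123", "admin": False}).encode())
signature = b64url_encode(b"not-a-real-signature")
token = f"{header}.{payload}.{signature}"

# Reading the claims needs no key at all.
claims = json.loads(b64url_decode(token.split(".")[1]))
print(claims)
```

Decoding like this is fine for debugging. Trusting the claims still requires verifying the signature with a proper JWT library.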
3. HTTP Basic Auth
The Basic Auth header is Authorization: Basic <base64(username:password)>. That is it. The credentials are joined with a colon, the whole string is Base64-encoded, and that is what goes over the wire.
Three implications people miss:
- Without HTTPS, Basic Auth is functionally plaintext. Base64 is not encryption. I have seen production APIs leak credentials in logs because someone wrote a debug line that dumped the Authorization header.
- Colons in usernames break the protocol. The decoder splits on the first colon. Some libraries handle this gracefully, some do not.
- Non-ASCII characters in passwords are ambiguously specified. Different clients encode them differently. If your user base is multilingual, test it.
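A rough sketch of both sides of the header, to make the colon and charset points visible. The helper names are hypothetical, and encoding the credentials as UTF-8 is an assumption that not every client shares.

```python
import base64

def build_basic_auth(username: str, password: str) -> str:
    if ":" in username:
        raise ValueError("username must not contain a colon")
    # UTF-8 is an assumption; clients disagree on non-ASCII credentials.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

def parse_basic_auth(header_value: str) -> tuple[str, str]:
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization value")
    decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    # Split on the first colon only, so passwords may contain colons.
    username, separator, password = decoded.partition(":")
    if not separator:
        raise ValueError("missing colon between username and password")
    return username, password

value = build_basic_auth("Aladdin", "open sesame")
print(value)                    # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
print(parse_basic_auth(value))  # ('Aladdin', 'open sesame')
```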
4. Email and MIME Attachments
The original use case. When you attach a file to an email, the attachment is Base64-encoded into the message body with a Content-Transfer-Encoding: base64 header. The line length is wrapped at 76 characters because old SMTP relays would corrupt longer lines.
If you have ever seen an email "source" view full of long blocks of letters and digits, often ending in =, that is what you were looking at.
5. PEM-Encoded Keys and Certificates
The -----BEGIN CERTIFICATE----- blocks you see in TLS configs are Base64-encoded DER bytes wrapped in delimiters. SSH public keys (ssh-rsa AAAAB3NzaC1y...) use Base64 the same way, although the bytes inside follow SSH's own binary format rather than DER. Inside the long string is a binary key, Base64'd so it can live in a config file.
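The PEM wrapper itself is thin enough to sketch by hand. The bytes below are placeholder data, not a real certificate, and `pem_wrap` / `pem_unwrap` are illustrative helpers. For real keys and certificates, a maintained library is the safer choice.

```python
import base64

def pem_wrap(der: bytes, label: str) -> str:
    body = base64.b64encode(der).decode("ascii")
    # PEM bodies are conventionally wrapped at 64 characters per line.
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return "\n".join([f"-----BEGIN {label}-----", *lines, f"-----END {label}-----"]) + "\n"

def pem_unwrap(pem: str) -> bytes:
    # Drop the delimiter lines, join the rest, and decode.
    body_lines = [line.strip() for line in pem.splitlines()
                  if line.strip() and not line.startswith("-----")]
    return base64.b64decode("".join(body_lines))

placeholder = bytes(range(100))  # stand-in for DER bytes, NOT a real certificate
pem = pem_wrap(placeholder, "CERTIFICATE")
print(pem)
print(pem_unwrap(pem) == placeholder)
```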
6. Storing Small Binary Blobs in JSON or Databases
When you need to round-trip binary through a system that only handles text, Base64 is often the cheapest option. Storing thumbnails in a JSON column. Sending a small PDF in a webhook. Embedding a binary key in a configuration file.
Cheapest is not the same as best, which brings us to the next section.
When Base64 Is the Wrong Answer
I have torn out more Base64 than I have written. It is easy to reach for, hard to remove later. Here is when I push back in code review.
1. Large file uploads through a JSON API
If your API accepts user uploads, do not Base64 the file into a JSON field. Use multipart/form-data or pre-signed S3 URLs. The 33% size tax, the memory overhead of holding the entire encoded string, and the JSON parser having to ingest a multi-megabyte string field will all bite you in production. I have seen ingestion services run out of memory at 60 RPS for exactly this reason.
2. Performance-critical paths
Base64 encode/decode is fast in absolute terms but easy to misuse. Encoding a binary blob, sending it over the wire, decoding it, then immediately re-encoding it for storage is a pattern I see often. Each round trip costs CPU and allocations. If the data is binary on both ends, find a binary channel.
3. As an obfuscation layer
If your reasoning is "I do not want a casual viewer to see the contents," Base64 fails. Anyone with a browser console can decode it in two seconds. If the data needs to be hidden, it needs to be encrypted, full stop.
4. For binary data inside a binary protocol
Base64 is for getting binary through text. If you control both ends and the channel is already binary (gRPC, MessagePack, raw TCP), there is no reason to encode. I have reviewed PRs that Base64'd a payload before sending it over a protobuf field, "for safety." The protobuf field already accepts arbitrary bytes. The Base64 was a 33% tax for nothing.
5. As a "cache key" for binary content
A Base64-encoded blob is a long string. Long strings are bad cache keys: they bloat memory, they slow down hash table lookups, and they tend to get logged. Hash the binary content and use the hash as the key. If you need a refresher on the trade-offs there, the hash algorithm comparison walks through which to pick.
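A minimal sketch of the alternative, assuming SHA-256 is an acceptable choice for the workload. `cache_key` is a hypothetical helper and the blob is placeholder data.

```python
import base64
import hashlib

def cache_key(content: bytes) -> str:
    # SHA-256 is an assumption here; its hex digest is always 64 characters,
    # no matter how large the content is.
    return hashlib.sha256(content).hexdigest()

blob = bytes(100_000)  # placeholder binary content

print(len(base64.b64encode(blob)))  # 133336 characters if used directly as a key
print(len(cache_key(blob)))         # 64
```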
Mistakes I Have Made and Reviewed
Treating Base64 as confidential
The classic. A junior dev hides an internal user ID by Base64-encoding it in a URL parameter. Two weeks later a security review flags it as exposing sensitive data. Base64 is reversible. If it is sensitive, encrypt or sign it.
Mixing Base64 and base64url
Producing a token with one variant and consuming it with the other. The decoder either fails outright (good) or silently produces wrong bytes (bad). Always pick a variant explicitly. If your library has both encode and encodeURLSafe, never use the default for anything that touches a URL or an HTTP header.
Forgetting that line length matters in MIME
Some legacy SMTP relays corrupt Base64 strings longer than 76 characters per line. Most modern Base64 functions emit one long line by default. If you are generating MIME bodies, wrap the lines, or use the language's MIME-specific Base64 helper.
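In Python, for instance, the difference between the two behaviors looks like this. `base64.b64encode` emits one long line, while `base64.encodebytes` inserts a newline after every 76 characters. The sample data is arbitrary.

```python
import base64

data = bytes(range(256)) * 2  # 512 arbitrary bytes

one_line = base64.b64encode(data)   # a single 684-character line
wrapped = base64.encodebytes(data)  # newline after every 76 characters

print(len(one_line))
print(max(len(line) for line in wrapped.splitlines()))

# The wrapped form decodes back to the same bytes.
assert base64.decodebytes(wrapped) == data
```

`encodebytes` separates lines with a bare newline. When building a full message, the standard `email` package takes care of the transfer encoding and line endings for you.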
Decoding without validating the input
If a Base64 string contains invalid characters or wrong padding, decoders behave inconsistently. Some throw. Some silently produce garbage. Some produce a partial result and stop. If your input is untrusted, validate it explicitly before decoding, and treat any error as a hard rejection at the boundary.
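One way to get a hard rejection at the boundary, sketched with Python's standard library. Passing `validate=True` makes `b64decode` raise on characters outside the standard alphabet instead of silently discarding them. `decode_strict` is an illustrative wrapper name.

```python
import base64
import binascii

def decode_strict(text: str) -> bytes:
    """Decode standard Base64, rejecting anything malformed."""
    try:
        # validate=True refuses non-alphabet characters rather than skipping them.
        return base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError(f"rejected invalid Base64 input: {exc}") from exc

print(decode_strict("Q2F0"))  # b'Cat'

for bad in ("Q2F", "Q2F0!", "Q2 F0", "-_--"):
    try:
        decode_strict(bad)
    except ValueError as exc:
        print(repr(bad), "->", exc)
```

The last sample is valid base64url but not valid standard Base64, so the strict decoder rejects it rather than guessing the variant.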
Re-encoding what is already encoded
I once debugged a webhook that arrived with a Base64 payload that decoded into another Base64 payload. The producer had encoded once for the JSON layer and once for "safety." The result was a 1.78x size cost and an extra decode step on every consumer. If you find yourself encoding twice, something earlier in the stack is wrong.
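The 1.78x figure is just 4/3 applied twice, which a few lines can confirm. The 900-byte payload is an arbitrary size chosen so the numbers come out round.

```python
import base64

payload = bytes(900)              # placeholder binary payload

once = base64.b64encode(payload)  # 1200 characters
twice = base64.b64encode(once)    # 1600 characters

print(len(once) / len(payload))   # about 1.33
print(len(twice) / len(payload))  # about 1.78, i.e. 16/9

# Consumers now need two decode steps to get the bytes back.
assert base64.b64decode(base64.b64decode(twice)) == payload
```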
Key Takeaways
- Base64 is a binary-to-text encoding, not a security primitive. Reversible, public, predictable.
- 3 input bytes become 4 output characters. Everything else about size and padding follows from that ratio.
- Output is 33 percent larger than input. Treat it as a budget line, not a rounding error.
- Use base64url anywhere a URL or header is involved. Standard Base64 has + and / that break URL contexts.
- Default to keeping padding. Strip it only when the destination spec demands.
- Never use Base64 for confidentiality. If it needs to be hidden, encrypt it or sign it.
- Avoid Base64 for large binaries inside JSON. Use multipart uploads or signed URLs.
Closing Thoughts
Base64 is forty years old and still everywhere. It is one of those layers of the stack that everybody touches and almost nobody examines. Most of the time, that is fine. The encoding does its job, the decoder gets the original bytes back, and life moves on.
But the cases where Base64 hurts are quiet ones. A marketing email that gets clipped in Gmail. A signature check that fails on a small fraction of traffic because of a stray +. A JSON ingestion service that runs out of memory because somebody decided to upload a 30 MB PDF as an inline string. None of those failures look like Base64 at first glance, and that is exactly why the encoding is worth understanding.
The mental model I use now: Base64 is a translator between two worlds, binary and text. It is fast, lossless, and predictable. It is not free, it is not secret, and it is not a substitute for picking the right channel in the first place.
If you want to play with it interactively, the StackConvert Base64 tool handles the encode/decode round trip in your browser, so the bytes never leave your machine. Drop an image in, look at the size, look at the trailing equals signs, and the whole thing stops being a black box.
What is the strangest place you have seen Base64 show up in production?