
Encoding the world

The same bit pattern can mean a number, a letter, a colour, or a sound. The meaning lives in the encoding — the agreed rule that assigns meaning to bits. Without knowing which encoding was used, a bit pattern is meaningless noise.


You already know how to write a number in binary. Now consider this: the 8-bit pattern 01000001 is 65 in decimal. But in a text file, your computer displays that same pattern on screen as the letter A. Tomorrow, in a photo, those same eight bits might be the red channel of one pixel. Next week, in an audio file, they might represent the loudness of a sound at one instant in time.

The bits did not change. What changed is the encoding — the agreed rule that says “interpret this pattern as a letter” or “interpret it as a brightness value.” Without the encoding, bits are meaningless. The encoding is the entire story.

Goal

After this lesson you can explain what an encoding is and why one must be agreed in advance, describe how ASCII maps numbers to characters (including the codes for key characters), explain at a conceptual level how Unicode/UTF-8 extends that idea, describe how RGB uses 8 bits per channel to encode colour, and explain how digital audio samples continuous sound as a sequence of numbers.

Bits have no inherent meaning. A bit is just a physical state: high voltage or low voltage, magnetised or not, charged or uncharged. The hardware stores and moves these states without knowing what they represent. Meaning is imposed from the outside by software and conventions. Two programs that both read the byte 01000001 can arrive at completely different conclusions (one says “the number 65”, another says “the letter A”), and both are correct, each under its own encoding. There is no contradiction, because the bits themselves have no opinion.

This is the central insight of this lesson: an encoding is a shared convention that assigns meaning to a bit pattern. The sender and the receiver must agree on the encoding in advance, or communication fails.

ASCII: the first universal agreement for text. In the early 1960s, every computer manufacturer used a different convention for mapping numbers to characters. Sharing files between machines was a constant headache. In 1963 the American Standards Association published ASCII — the American Standard Code for Information Interchange — a 7-bit encoding that assigns a specific integer (0 through 127) to each English letter, digit, punctuation mark, and a handful of control codes (newline, tab, backspace, and so on).

Because 7 bits hold 2⁷ = 128 values, ASCII covers exactly 128 code points. The important ones to remember: uppercase ‘A’ is 65, uppercase ‘Z’ is 90; lowercase ‘a’ is 97, lowercase ‘z’ is 122; the digit ‘0’ is 48, ‘9’ is 57. Notice that uppercase and lowercase letters are exactly 32 apart — that is not an accident, it is a design choice that lets you flip between cases by toggling a single bit.
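The codes above, and the single-bit case trick, can be checked with a short Python sketch. The helper name `toggle_case` and the constant `CASE_BIT` are made up for this illustration; `ord` and `chr` are Python built-ins.

```python
# ord() gives the code of a character; chr() goes the other way.
codes = {ch: ord(ch) for ch in "AZaz09"}
print(codes)  # {'A': 65, 'Z': 90, 'a': 97, 'z': 122, '0': 48, '9': 57}

CASE_BIT = 0b0010_0000  # 32: the one bit that differs between 'A' (65) and 'a' (97)


def toggle_case(letter: str) -> str:
    """Flip the case of one ASCII letter by toggling a single bit (illustrative helper)."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(letter) ^ CASE_BIT)


print(format(ord("A"), "08b"), format(ord("a"), "08b"))  # 01000001 01100001
print(toggle_case("A"), toggle_case("z"))                # a Z
```

The XOR only makes sense for letters, which is why the helper rejects everything else: toggling the same bit on a digit or punctuation mark produces an unrelated character.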

In practice, ASCII text is stored one character per 8-bit byte, with the high bit always zero. A text file that uses only ASCII characters is a sequence of bytes in the range 0–127.

▸Edge cases

What about accented letters like é or ñ, Cyrillic, Chinese, or emoji? ASCII does not cover them: it has only 128 slots, all assigned to English letters, digits, punctuation, and control codes. Dozens of incompatible 8-bit extensions were invented to fill the upper 128 slots, which led to the “mojibake” problem: open a file with the wrong encoding and every non-ASCII character becomes garbage (e.g., “Garçon” saved as UTF-8 but read as Latin-1 is displayed as “GarÃ§on”, because the two bytes that encode ç are shown as two separate characters). The world needed a single unified encoding.
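One common way this garbling arises can be reproduced in a few lines. The sketch below assumes the two encodings involved are UTF-8 and Latin-1 (ISO 8859-1); other pairs of mismatched encodings produce different garbage.

```python
original = "Garçon"

utf8_bytes = original.encode("utf-8")      # ç becomes two bytes: C3 A7
latin1_bytes = original.encode("latin-1")  # ç becomes one byte:  E7

# Wrong decoder for these bytes: each UTF-8 byte is shown as its own Latin-1 character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # GarÃ§on

# The opposite mismatch: a lone E7 byte is not valid UTF-8, so a lenient
# decoder substitutes U+FFFD, the replacement character.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced.encode("unicode_escape"))  # b'Gar\\ufffdon'

# Decoding with the encoding that was actually used recovers the text.
assert utf8_bytes.decode("utf-8") == original
assert latin1_bytes.decode("latin-1") == original
```

Nothing in either byte string says which encoding produced it; the decoder has to be told.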

Unicode and UTF-8: one encoding for every writing system. Unicode is not itself an encoding: it is a standard that assigns a unique number (called a code point, written U+XXXX) to each character it covers, with the aim of covering every human writing system. As of 2024, Unicode defines well over 140,000 characters covering more than 160 scripts, plus emoji and historic symbols. ‘A’ is U+0041 (decimal 65, the same as ASCII), ‘Ж’ is U+0416, ‘😀’ is U+1F600.

UTF-8 is the most common way to actually store Unicode code points as bytes. It is variable-length: code points 0–127 (the ASCII range) are stored as a single byte, identical to ASCII. Code points above 127 use 2, 3, or 4 bytes. The first byte of a multi-byte sequence signals how many bytes follow, so a decoder always knows where each character starts. Because all ASCII text is valid UTF-8, adding UTF-8 support is backward-compatible — no existing ASCII file breaks.
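The variable-length rule can be observed directly. In this sketch the sample characters are arbitrary choices, and `sequence_length` is a hypothetical helper that reads only the leading bits of a first byte; it is not a full UTF-8 validator.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged from its first byte."""
    if first_byte >> 7 == 0b0:      # 0xxxxxxx: single byte, same as ASCII
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: one continuation byte follows
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: two continuation bytes follow
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: three continuation bytes follow
        return 4
    raise ValueError("not a first byte (continuation bytes start with 10)")


for ch in ["A", "é", "Ж", "€", "😀"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {bits}")
    assert sequence_length(encoded[0]) == len(encoded)
```

In the printed bit patterns, every byte after the first in a multi-byte sequence starts with 10, which is how a decoder can tell a continuation byte from the start of a new character.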

The conceptual leap is that Unicode separates two things ASCII conflated: the abstract identity of a character (its code point, an integer) and the concrete bytes used to store it (the encoding, UTF-8). You can store the same code point in UTF-8, UTF-16, or UTF-32 — different bytes, same character.
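A small sketch of that separation: one code point, three byte representations. The big-endian variants (`utf-16-be`, `utf-32-be`) are chosen here only so the bytes are easy to read; the byte order is an assumption of this example.

```python
ch = "Ж"  # code point U+0416
print(f"code point: U+{ord(ch):04X}")

encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    hex_bytes = " ".join(format(b, "02x") for b in data)
    print(f"{name:10} {hex_bytes}")
    # Different bytes, but each decodes back to the same character.
    assert data.decode(name) == ch

# utf-8      d0 96
# utf-16-be  04 16
# utf-32-be  00 00 04 16
```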

01000001   65 = 'A'
01000010   66 = 'B'
01000011   67 = 'C'
01000001   65 = 'A'

Four bytes of ASCII-encoded text. The highlighted byte 01000001 (decimal 65) decodes as the letter 'A' under ASCII. Under a numeric encoding the same byte would be the integer 65.

RGB: encoding colour as three numbers. A computer screen is a grid of tiny dots called pixels. Each pixel emits light by mixing three colour components: Red, Green, and Blue. The intensity of each component is stored as a number. With 8 bits per channel, each component has 2⁸ = 256 levels (0 means none of that colour, 255 means full intensity). Three 8-bit channels together give 24 bits per pixel.

24 bits can represent 2²⁴ = 16,777,216 distinct colours — enough for photorealistic images. Pure red is (255, 0, 0); pure green is (0, 255, 0); pure blue is (0, 0, 255); white is (255, 255, 255) — all channels at maximum; black is (0, 0, 0) — all channels at zero. The colour you see is an encoding: three bytes that a display circuit interprets as light intensities.
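As a rough illustration, three 8-bit channels can be packed into one 24-bit integer and taken apart again. The function names are hypothetical, and placing red in the highest byte is an assumption of this sketch; real image formats differ in channel order.

```python
def pack_rgb(r: int, g: int, b: int) -> int:
    """Pack three 0-255 channel values into one 24-bit integer (red in the high byte)."""
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError("each channel must fit in 8 bits (0-255)")
    return (r << 16) | (g << 8) | b


def unpack_rgb(pixel: int) -> tuple:
    """Split a 24-bit pixel back into its (r, g, b) channels."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF


print(2 ** 24)                            # 16777216 distinct values
print(f"#{pack_rgb(255, 0, 0):06X}")      # #FF0000  pure red
print(f"#{pack_rgb(255, 255, 255):06X}")  # #FFFFFF  white
print(unpack_rgb(0x0000FF))               # (0, 0, 255)  pure blue
```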

Higher-end formats use 10 or 12 bits per channel (common in HDR video and displays), giving more than a billion colours. Lower-end formats use fewer bits and sacrifice colour smoothness. The principle stays the same: colour is a convention for interpreting numbers.

PCM audio: encoding sound as a sequence of measurements. Why does this matter to an engineer? Because every time you work with audio APIs, streaming pipelines, or voice interfaces, you are handling PCM data — and the artefacts you will debug (clicks, pops, distortion) trace directly back to the parameters below.

Sound is a physical wave: air pressure that rises and falls over time. To store sound digitally, a microphone converts the pressure into an electrical signal, and an ADC (analogue-to-digital converter) measures that signal at regular intervals and records each measurement as an integer. This technique is called PCM (Pulse-Code Modulation).

Two parameters determine quality: sample rate (how many measurements per second) and bit depth (how many bits each measurement uses). CD-quality audio uses 44,100 samples per second (44.1 kHz) and 16 bits per sample, giving 2¹⁶ = 65,536 possible amplitude levels. Professional audio often uses 24 bits (2²⁴ = 16,777,216 levels) at 96 kHz.
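To make the two parameters concrete, here is a minimal sketch that samples a pure tone the way an ADC would. The 440 Hz tone, the single (mono) channel, and the signed representation (values from -32,768 to 32,767) are assumptions of this example, and `sample_sine` is a hypothetical helper.

```python
import math

SAMPLE_RATE = 44_100                      # measurements per second
BIT_DEPTH = 16                            # bits per measurement
MAX_AMPLITUDE = 2 ** (BIT_DEPTH - 1) - 1  # 32767 for signed 16-bit samples


def sample_sine(frequency_hz: float, n_samples: int) -> list:
    """Measure a sine wave at regular intervals and round each value to an integer."""
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE                                  # time of this sample, in seconds
        pressure = math.sin(2 * math.pi * frequency_hz * t)  # idealised wave in [-1, 1]
        samples.append(round(pressure * MAX_AMPLITUDE))      # quantise to an integer level
    return samples


samples = sample_sine(440.0, 5)
print(samples)  # the first five integer samples of the tone

bytes_per_second = SAMPLE_RATE * BIT_DEPTH // 8  # one channel only
print(bytes_per_second)  # 88200
```

Raising the sample rate adds more measurements per second; raising the bit depth makes each measurement finer. Both increase the amount of data to store.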

The key insight is the same as before: those 16-bit integers are meaningful only if the decoder knows they represent air-pressure amplitudes sampled at 44.1 kHz. Store the same integers in a text file and a text editor will display garbage characters, because it is applying the wrong encoding.
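The same point in code, as a rough sketch: a handful of 16-bit samples are packed into bytes and then read back two ways. The sample values are arbitrary, and the little-endian signed layout (`"<4h"`) and the Latin-1 text decoder are assumptions chosen for illustration.

```python
import struct

samples = [0, 2053, 4097, 6126]  # arbitrary 16-bit amplitude values

# Pack as four little-endian signed 16-bit integers: 8 bytes in total.
raw = struct.pack("<4h", *samples)
print(len(raw))  # 8

# Wrong encoding: a text decoder turns the same bytes into meaningless characters.
as_text = raw.decode("latin-1")
print(repr(as_text))

# Right encoding: the reader knows these are 16-bit samples and recovers them.
recovered = list(struct.unpack("<4h", raw))
print(recovered)  # [0, 2053, 4097, 6126]
```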

▸Why this works

Why must encodings be agreed in advance? Because bit patterns carry no self-describing label. A byte 01000001 cannot say “I am a letter.” The sender and receiver must share a prior agreement — a protocol, a file-format header, a MIME type, a file extension — that names the encoding. Without that agreement, the receiver is guessing. Sometimes the guess is right (ASCII-range bytes in a text file are probably text). Often it is wrong, producing mojibake, image corruption, or audio distortion.

Worked example

Decoding the same three bytes three different ways.

Bytes: 01000001 01000010 01000011

As decimal numbers: 65, 66, 67.

As ASCII text: ‘A’, ‘B’, ‘C’ (because A=65, B=66, C=67).

As RGB colour: red=65, green=66, blue=67. Each channel sits at roughly a quarter of full intensity (about 25% of 255), and the three values are nearly equal, so the result is a dark grey with a barely perceptible bias toward blue (#414243 in hex notation).

All three interpretations are valid. The bytes are identical. The meaning depends entirely on the encoding applied by the reader.
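The worked example translates almost line for line into Python. This is only a sketch; the variable names are illustrative.

```python
data = bytes([0b01000001, 0b01000010, 0b01000011])

# Encoding 1: plain unsigned integers.
as_numbers = list(data)
print(as_numbers)  # [65, 66, 67]

# Encoding 2: ASCII text.
as_text = data.decode("ascii")
print(as_text)  # ABC

# Encoding 3: one RGB pixel, one byte per channel.
r, g, b = data
as_hex_colour = f"#{r:02X}{g:02X}{b:02X}"
print(as_hex_colour)  # #414243
```

The `data` object is never modified; only the rule used to read it changes.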

Practice

The ASCII code for 'A' is 65. What is the ASCII code for 'C'? (Hint: letters are consecutive in ASCII.)

How many bits are used for one RGB colour channel (0–255)?

How many total bits does a single RGB pixel use (8 bits per channel, 3 channels)?

ASCII uses 7 bits. How many distinct characters can it encode?

The ASCII code for lowercase 'a' is 97. What is the ASCII code for lowercase 'z'? (There are 26 letters; 'a' is the first.)

Check yourself

Quiz

You open a file and see garbled text like 'GarÃ§on' instead of 'Garçon'. What is the most likely cause?

Recap

Bits have no inherent meaning — they are just patterns of 0s and 1s. An encoding is a shared convention that assigns meaning to those patterns. ASCII uses 7 bits to map 128 code points to English characters (‘A’=65, ‘a’=97, ‘0’=48). Unicode extends the idea to every writing system by assigning a unique code point to every character; UTF-8 stores those code points in 1–4 bytes, backward-compatible with ASCII. RGB colour uses 3 channels of 8 bits each (0–255), giving 16,777,216 possible colours per pixel. PCM audio samples sound pressure at regular intervals (e.g. 44,100 per second) and stores each sample as a fixed-width integer. The same byte sequence is a number, a letter, a colour, or a sound depending only on which encoding the reader applies — without that agreement, communication breaks. Now when you see garbled text, a corrupted image, or audio that sounds like static, your first question should be: “Did the sender and receiver agree on the encoding?”
