
PII redaction and log injection

Logs are the easiest signal to leak sensitive data into and the hardest to clean up. Redact at the source, defend in depth at the collector, and never interpolate user input into log messages.

A team completes their SOC 2 audit. The auditor finds raw email addresses and auth tokens in three months of indexed log storage. The logs were never flagged as a data store. The remediation takes six weeks and costs more than the original logging infrastructure.

Why logs are a PII risk

Logs are generated in the hot path of every request. They accumulate request bodies, headers, database queries, stack traces — any of which can carry sensitive data if the engineer was not deliberate.

Common PII patterns found in real incidents: full request bodies including Authorization headers and session tokens, email addresses logged as userId context, phone numbers in user-submitted form fields, raw SQL queries containing user input, stack traces with object dumps that include customer records.

The problem compounds: logs are retained for weeks or months, replicated to cold storage, and readable by a much wider audience than the primary database. A breach of the log backend exposes data that was never intended to live there.

PII source	How it enters logs	Mitigation
Auth tokens / passwords	Full request body logged at DEBUG or on error	Deny-list `req.headers.authorization`, `body.password`
Email addresses	Logged as user context or search query parameter	Log opaque `user_id` only; deny-list `user.email`
Payment data (PAN, CVV)	Raw payment form body logged on validation error	Deny-list `payment.card.*`; log only last-4 and token
Stack traces with data	Exception serializer dumps the full object graph	Truncate exception body at the logger; log error type and message only
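The last row of the table can be sketched as a small serializer. This is a minimal illustration, not a library API: the helper name `toLoggableError` and the 200-character cap are assumptions made for the example.

```javascript
// Hypothetical helper: reduce any thrown value to a small, predictable shape
// so attached objects (customer records, request bodies) never reach the log.
const MAX_MESSAGE_LENGTH = 200; // assumed cap, tune per service

function toLoggableError(err) {
  const isError = err instanceof Error;
  const type = isError ? err.name : typeof err;
  const message = String(isError ? err.message : err).slice(0, MAX_MESSAGE_LENGTH);
  // Every other property on the error object is deliberately ignored.
  return { type, message };
}

// Example: an error that carries a customer record as extra context.
const err = new Error("validation failed");
err.customer = { email: "alice@example.com", card: "4111111111111111" };

console.log(JSON.stringify({ level: "error", err: toLoggableError(err) }));
```

The error message itself can still contain user input, so this narrows the leak surface rather than closing it; the collector-tier scrubber remains the backstop.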

Redaction at the source

The logger SDK is the right first line of defence. Every major production logger supports a deny-list that strips known-sensitive paths before serialization:

pino (Node.js): redact option takes an array of dot-notation paths (['req.headers.authorization', 'body.password', 'user.email']). Pino replaces the values with [Redacted] before writing. Overhead: approximately 2 ms per 1,000 messages for 5 fields.
log4j2 (JVM): pattern converters and Rewrite appenders can strip fields by key name.
structlog (Python): a processor in the processor chain can strip or hash sensitive keys before rendering.
slog (Go): a custom Handler wraps the base handler and filters attributes before writing.

Static redaction (known field paths) is cheap. Dynamic redaction (regex over the full message body) is 4-10x more expensive and should be reserved for unstructured fields where you cannot enumerate the paths in advance.

The cost gap is the whole layering argument: cheap static deny-listing belongs in the request hot path, while regex scrubbing, at 4-10x the cost, runs off-path at the collector as the safety net.

Together these SDK-level hooks give you one place to add a field path and have it stripped from every log line that passes through that logger, regardless of which call site emits it. Without this layer, a single refactor that widens the logged object is enough to leak data you thought was safe.
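To make the deny-list mechanism concrete, here is a dependency-free sketch of static path redaction. It is not pino's implementation (pino does this natively through its `redact` option); the function name `redact` and the constants are chosen for illustration, and only plain dot-notation paths are handled.

```javascript
// Hypothetical deny-list using the paths named in this lesson.
const DENY_LIST = ["req.headers.authorization", "body.password", "user.email"];
const CENSOR = "[Redacted]";

// Return a copy of `record` with every deny-listed path replaced by CENSOR.
function redact(record, paths = DENY_LIST) {
  // Log records are JSON-serializable anyway; copy so the caller's object is untouched.
  const copy = JSON.parse(JSON.stringify(record));
  for (const path of paths) {
    const keys = path.split(".");
    const last = keys.pop();
    let node = copy;
    for (const key of keys) {
      if (node === null || typeof node !== "object") {
        node = undefined;
        break;
      }
      node = node[key];
    }
    // Only overwrite fields that exist; never create new keys.
    if (node !== null && typeof node === "object" && last in node) {
      node[last] = CENSOR;
    }
  }
  return copy;
}

const line = redact({
  msg: "login failed",
  req: { headers: { authorization: "Bearer abc123", "user-agent": "curl/8.0" } },
  body: { username: "alice", password: "hunter2" },
});
console.log(JSON.stringify(line));
```

Because the paths are known in advance, the work per record is a handful of property lookups, which is why this layer is cheap enough for the hot path.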

Defence in depth: the collector tier

Application-level redaction catches the deliberate paths. The collector tier catches what the application misses: third-party libraries that log request objects, errors that serialize unexpected context, or a new service that has not yet adopted the organisation's shared wrapper logger (the one that carries the deny-list).

The OTel Collector, Fluent Bit, and Vector each support redaction processors that run regex patterns over log record attributes before shipping. The pattern: strip known PII formats (email addresses, phone number patterns, credit-card-like sequences) on every line regardless of field name. This adds latency to the collector pipeline (20-50 µs per line for a 5-pattern set) but runs off the critical application path and is worth the cost.
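The scrubbing logic itself can be sketched in a few lines. Real collectors express this in their own configuration languages, so treat the following only as an illustration of the technique: the patterns are deliberately simple, phone numbers are omitted for brevity, and the Luhn checksum is an added assumption to avoid redacting ordinary long digit runs such as order numbers.

```javascript
// Luhn checksum over a string of digits.
function passesLuhn(digits) {
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}

// Illustrative patterns only; real rule sets need tuning against your own traffic.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const CARD_LIKE = /\b\d(?:[ -]?\d){12,18}\b/g; // 13 to 19 digits, optional separators

function scrub(value) {
  if (typeof value !== "string") return value;
  return value
    .replace(EMAIL, "[email]")
    .replace(CARD_LIKE, (m) => (passesLuhn(m.replace(/[ -]/g, "")) ? "[card]" : m));
}

// Scrub every string attribute of a flat record, regardless of field name.
function scrubAttributes(attributes) {
  return Object.fromEntries(
    Object.entries(attributes).map(([key, value]) => [key, scrub(value)])
  );
}

const scrubbed = scrubAttributes({
  msg: "payment rejected for alice@example.com, card 4111 1111 1111 1111",
  order_id: "4111111111111112",
  attempt: 3,
});
console.log(scrubbed);
```

Every pattern runs over every string on every line, which is where the extra cost relative to path-based deny-listing comes from.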

Production rule: deny-list at the source for known fields; regex scrubber at the collector as the safety net. Never rely on the log backend’s own masking as the first or only layer — by then the data has already left the host.

Why this works

Use opaque internal identifiers in logs. A log line should carry user_id: 42 (an internal integer), never email: alice@example.com. When a support engineer needs to debug Alice’s issue, they look up her internal id in the primary database (an audited operation), then query logs by id. This is intentional friction: the audit trail records who looked up whose data. Right-to-erasure under GDPR is also simpler: when Alice’s account is deleted from the primary store, the mapping from id to person is severed and her log lines are de-identified without rewriting any logs, provided they carry no other identifying fields.
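The lookup-then-query flow might look like the following sketch. The in-memory `usersByEmail` map and `auditTrail` array are hypothetical stand-ins for the primary database and its audit log, and the function names are invented for the example.

```javascript
// Hypothetical stand-ins for the primary store and its audit trail.
const usersByEmail = new Map([["alice@example.com", { id: 42 }]]);
const auditTrail = [];

// Audited lookup: the only place an email is exchanged for an internal id.
function lookupUserId(email, operator) {
  const user = usersByEmail.get(email);
  const subjectId = user ? user.id : null;
  auditTrail.push({ operator, action: "lookup_user_id", subject_id: subjectId });
  return subjectId;
}

// Logs are queried by opaque id only.
function findLogs(lines, userId) {
  return lines.filter((line) => line.user_id === userId);
}

const logs = [
  { user_id: 42, event: "checkout_failed" },
  { user_id: 7, event: "login" },
];

const id = lookupUserId("alice@example.com", "support-engineer-1");
const matches = findLogs(logs, id);
console.log(matches);

// Erasure: deleting the mapping leaves the log lines in place but unlinkable.
usersByEmail.delete("alice@example.com");
const afterErasure = lookupUserId("alice@example.com", "support-engineer-1");
console.log(afterErasure);
```

The log lines never change; only the ability to tie `user_id: 42` back to a person does.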

The redaction gate sits before the sink: sensitive fields are stripped in-process, so only the de-identified record crosses the trust boundary to the collector and indexed store.

Log injection: the security failure-mode

Log injection (CWE-117) occurs when user-supplied input is concatenated or interpolated into a log message without sanitization. Two attack shapes:

Newline injection: the input contains a literal newline. A log line such as INFO: received comment: <user input> splits into two lines when the input embeds a newline followed by a crafted payload — ERROR: admin user X deleted production database. Downstream parsers, SIEM rules, and audit-trail consumers treat both lines as real records. Audit-trail manipulation is the textbook scenario.

JSON structural injection: in JSON output, if user input is concatenated into a string field without escaping — rather than being passed as a value to the JSON serializer — special characters can close the current string and inject new key-value pairs, corrupting the log record and potentially bypassing SIEM filters that rely on specific field values.
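Both attack shapes can be reproduced in a few lines. The payloads below are made up for the demonstration; the point is only to show what string building does with them compared with a JSON serializer.

```javascript
// Shape 1: newline injection into an interpolated message.
const comment = "nice post\nERROR: admin user X deleted production database";
const interpolated = `INFO: received comment: ${comment}`;
const physicalLines = interpolated.split("\n");
console.log(physicalLines.length); // 2: a line-based parser sees two records

// Shape 2: structural injection into hand-concatenated JSON.
const crafted = 'hi", "level": "debug';
const concatenated = '{"level": "warn", "comment": "' + crafted + '"}';
const forged = JSON.parse(concatenated);
console.log(forged.level); // "debug": the injected duplicate key overrides the real one

// Passing the same inputs as values lets the serializer escape them.
const safeNewline = JSON.stringify({ level: "info", comment });
const safeQuote = JSON.stringify({ level: "warn", comment: crafted });
console.log(safeNewline.includes("\n")); // false: the newline is written as \n inside the string
console.log(JSON.parse(safeQuote).level); // "warn"
```

In the serialized versions the payload survives intact as data, which is what you want for forensics, but it can no longer change the shape of the record.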

The Log4Shell vulnerability (CVE-2021-44228, December 2021) was a related but distinct failure: log4j2’s message lookup feature treated log message content as a template, allowing JNDI lookups in user-controlled strings to trigger remote code execution. The lesson generalizes: log message content is an input surface.

Prevention is structural: never interpolate user input into log message strings. Always pass user input as a typed field attribute, letting the JSON serializer handle escaping:

// Vulnerable — user input in message string
logger.info(`received comment: ${req.body.comment}`);

// Safe — user input as a typed field
logger.info({ event: "comment_received", comment: req.body.comment }, "comment received");

The serializer escapes newlines, quotes, and control characters in the field value, so the input cannot break out of its field or start a new record, whatever it contains.

Quiz

A team logs raw user-submitted strings into a message field. A user submits text containing a literal newline followed by a crafted JSON object that looks like a real ERROR log line. What is the failure class?

Quiz

In a production-ready logging setup, where should PII redaction run?

Order the steps

Order the steps to remediate a PII leak discovered in production logs:

1 Identify the source: which log call or library is emitting the sensitive field
2 Add the field path to the logger SDK deny-list immediately to stop new leakage
3 Add the regex scrubber at the collector tier as a second-pass safety net
4 Assess the exposure window: how long were logs stored, who had access
5 File a data-incident report per GDPR / compliance requirements if personal data was exposed
6 Add a CI check or PR template field to prevent reintroduction
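One possible shape for the CI check in step 6 is a heuristic that flags log calls whose message is an interpolated template literal, since values embedded in the message string bypass a path-based deny-list. This is a rough sketch under stated assumptions: the logger variable is assumed to be named `logger`, the level names are assumed, and a regex is used instead of a real parser, so multi-line calls and other logger names will be missed.

```javascript
// Hypothetical CI helper: flag logger calls whose first argument is a
// template literal containing ${...}. A heuristic, not a parser.
const INTERPOLATED_LOG_CALL =
  /\blogger\.(?:trace|debug|info|warn|error|fatal)\(\s*`[^`]*\$\{/;

function findInterpolatedLogCalls(source) {
  return source
    .split("\n")
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter(({ text }) => INTERPOLATED_LOG_CALL.test(text));
}

// The two call styles from this lesson, as source text.
const sample = [
  "logger.info(`received comment: ${req.body.comment}`);",
  'logger.info({ event: "comment_received", comment: req.body.comment }, "comment received");',
].join("\n");

const findings = findInterpolatedLogCalls(sample);
for (const finding of findings) {
  console.log(`line ${finding.line}: ${finding.text}`);
}
```

A CI job could run this over changed files and fail when `findings` is non-empty; a lint rule built on a real parser would be more reliable if one is available for your stack.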

Recall before you leave

01 Explain log injection (CWE-117): the two attack shapes and the structural prevention.
02 Why is redaction at the log backend insufficient as a sole layer?
03 What is the right way to handle 'I need to debug a specific user's issue' without logging PII?

Recap

Logs accumulate PII through request bodies, headers, SQL queries, and stack traces — often without engineers intending it. Two layers of redaction are required: the logger SDK deny-list strips known-sensitive field paths before serialization (cheap, deterministic), and a collector-tier regex scrubber catches what the application misses (slightly more expensive, catches accidents and third-party libraries). Never interpolate user input into log message strings — pass it as a typed attribute and let the JSON serializer escape it, making log injection structurally impossible. Use opaque internal IDs (not email, phone, or name) so logs are safe by design and GDPR erasure is handled by severing the ID mapping rather than rewriting log archives. The cost of a PII postmortem — compliance notification, audit, remediation — dwarfs the overhead of source-level redaction. Now when you see a PR that adds a new field to a log call, you know the first question to ask: does this field touch user data, and is the deny-list entry already there?
