
Observability

PII redaction and log injection

Crux: Logs are the easiest signal to leak sensitive data into and the hardest to clean up. Redact at the source, defend in depth at the collector, and never interpolate user input into log messages.

A team completes their SOC 2 audit. The auditor finds raw email addresses and auth tokens in three months of indexed log storage. The logs were never flagged as a data store. The remediation takes six weeks and costs more than the original logging infrastructure.

Why logs are a PII risk

Logs are generated in the hot path of every request. They accumulate request bodies, headers, database queries, stack traces — any of which can carry sensitive data if the engineer was not deliberate.

Common PII patterns found in real incidents: full request bodies including Authorization headers and session tokens, email addresses logged as userId context, phone numbers in user-submitted form fields, raw SQL queries containing user input, stack traces with object dumps that include customer records.

The problem compounds: logs are retained for weeks or months, replicated to cold storage, and readable by a much wider audience than the primary database. A breach of the log backend exposes data that was never intended to live there.

PII source | How it enters logs | Mitigation
Auth tokens / passwords | Full request body logged at DEBUG or on error | Deny-list req.headers.authorization, body.password
Email addresses | Logged as user context or search query parameter | Log opaque user_id only; deny-list user.email
Payment data (PAN, CVV) | Raw payment form body logged on validation error | Deny-list payment.card.*; log only last-4 and token
Stack traces with data | Exception serializer dumps the full object graph | Truncate exception body at the logger; log error type and message only

Redaction at the source

The logger SDK is the right first line of defence. Every major production logger supports a deny-list that strips known-sensitive paths before serialization:

  • pino (Node.js): redact option takes an array of dot-notation paths (['req.headers.authorization', 'body.password', 'user.email']). Pino replaces the values with [Redacted] before writing. Overhead: approximately 2 ms per 1,000 messages for 5 fields.
  • log4j2 (JVM): pattern converters and Rewrite appenders can strip fields by key name.
  • structlog (Python): a processor in the processor chain can strip or hash sensitive keys before rendering.
  • slog (Go): a custom Handler wraps the base handler and filters attributes before writing.
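To make the deny-list idea concrete, here is a minimal sketch in plain JavaScript. It is not pino's implementation: redactPaths, CENSOR, and DENY_LIST are hypothetical names used only for illustration, and the sketch assumes log records are plain JSON-serializable objects.

```javascript
// Hypothetical helper illustrating a static deny-list; not a real logger API.
const CENSOR = "[Redacted]";

function redactPaths(record, paths, censor = CENSOR) {
  // Copy first so logging never mutates the caller's object.
  // Assumes the record is JSON-serializable (no functions, cycles, etc.).
  const copy = JSON.parse(JSON.stringify(record));
  for (const path of paths) {
    const keys = path.split(".");
    let node = copy;
    // Walk to the parent of the final key; stop early if the path is absent.
    for (let i = 0; i < keys.length - 1 && node !== null && typeof node === "object"; i++) {
      node = node[keys[i]];
    }
    const last = keys[keys.length - 1];
    if (
      node !== null &&
      typeof node === "object" &&
      Object.prototype.hasOwnProperty.call(node, last)
    ) {
      node[last] = censor;
    }
  }
  return copy;
}

// Example record and deny-list (field names taken from the list above).
const entry = {
  msg: "login failed",
  user: { id: 42, email: "alice@example.com" },
  req: { headers: { authorization: "Bearer abc123" } },
};
const DENY_LIST = ["req.headers.authorization", "body.password", "user.email"];
const safe = redactPaths(entry, DENY_LIST);
console.log(JSON.stringify(safe));
```

Paths that do not exist in a record (body.password here) are skipped rather than created, so the deny-list can be shared across services with different record shapes.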

Static redaction (known field paths) is cheap. Dynamic redaction (regex over the full message body) is 4-10x more expensive and should be reserved for unstructured fields where you cannot enumerate the paths in advance.

Defence in depth: the collector tier

Application-level redaction catches the deliberate paths. The collector tier catches what the application misses — third-party libraries that log request objects, errors that serialize unexpected context, or a new service that has not yet adopted the per-org wrapper logger.

The OTel Collector, Fluent Bit, and Vector each support redaction processors that run regex patterns over log record attributes before shipping. The pattern: strip known PII formats (email addresses, phone number patterns, credit-card-like sequences) on every line regardless of field name. This adds latency to the collector pipeline (20-50 µs per line for a 5-pattern set) but runs off the critical application path and is worth the cost.
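The kind of pattern set such a processor applies can be sketched as a standalone function. This is an illustration in plain JavaScript, not the configuration syntax of the OTel Collector, Fluent Bit, or Vector; the three patterns, the placeholder tokens, and the names scrubLine and passesLuhn are assumptions chosen for the example. The Luhn check is one possible way to cut false positives on long digit runs such as order numbers.

```javascript
// Illustrative patterns only; real deployments tune these to their own data.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// 13 to 19 digits, optionally separated by single spaces or hyphens.
const CARD_LIKE = /\b\d(?:[ -]?\d){12,18}\b/g;
// Assumes international format with a leading "+".
const PHONE = /\+\d[\d -]{7,14}\d/g;

// Luhn checksum over a string of digits.
function passesLuhn(digits) {
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}

// Scrub one log line regardless of which field the value came from.
function scrubLine(line) {
  return line
    .replace(EMAIL, "[EMAIL]")
    .replace(CARD_LIKE, (match) =>
      passesLuhn(match.replace(/[ -]/g, "")) ? "[CARD]" : match
    )
    .replace(PHONE, "[PHONE]");
}

console.log(
  scrubLine("user alice@example.com paid with 4111 1111 1111 1111, call +1 415-555-0100")
);
```

Regex scrubbing is a heuristic: it will miss formats it was not written for and can still redact harmless values, which is why it works as a safety net rather than the primary control.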

Production rule: deny-list at the source for known fields; regex scrubber at the collector as the safety net. Never rely on the log backend’s own masking as the first or only layer — by then the data has already left the host.

Why this works

Use opaque internal identifiers in logs. A log line should carry user_id: 42 (an internal integer), never email: alice@example.com. When a support engineer needs to debug Alice’s issue, they look up her internal id in the primary database (an audited operation), then query logs by id. This is intentional friction: the audit trail records who looked up whose data. Right-to-erasure under GDPR is also simpler: when Alice’s account is deleted from the primary store, all her logs are de-identified by severing the mapping — no logs need to be rewritten.

Log injection: the security failure-mode

Log injection (CWE-117) occurs when user-supplied input is concatenated or interpolated into a log message without sanitization. Two attack shapes:

Newline injection: the input contains a literal newline. A log line such as INFO: received comment: <user input> splits into two lines when the input embeds a newline followed by a crafted payload — ERROR: admin user X deleted production database. Downstream parsers, SIEM rules, and audit-trail consumers treat both lines as real records. Audit-trail manipulation is the textbook scenario.
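A small sketch shows the split. The comment value is a made-up attacker payload, and countRecords is a hypothetical stand-in for a line-oriented parser that treats every newline as a record boundary.

```javascript
// Hypothetical attacker-controlled input with an embedded newline.
const comment = "nice post\nERROR: admin user X deleted production database";

// Vulnerable: interpolation writes the raw newline into the log stream.
const vulnerable = `INFO: received comment: ${comment}`;

// Structured: the value is serialized as a JSON string, so the newline
// is emitted as the two characters "\" and "n" instead of a line break.
const structured = JSON.stringify({
  level: "info",
  event: "comment_received",
  comment,
});

// Stand-in for a downstream parser that splits records on newlines.
function countRecords(output) {
  return output.split("\n").length;
}

console.log(countRecords(vulnerable));
console.log(countRecords(structured));
```

The interpolated message is read as two records, the second one forged, while the serialized record stays on one line and round-trips to the original input when parsed.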

JSON structural injection: in JSON output, if user input is concatenated into a string field without escaping — rather than being passed as a value to the JSON serializer — special characters can close the current string and inject new key-value pairs, corrupting the log record and potentially bypassing SIEM filters that rely on specific field values.
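The same effect can be reproduced in a few lines. The payload below is a made-up example, and the field names level, comment, and msg are assumptions about the record shape.

```javascript
// Hypothetical payload: closes the string, then injects extra keys.
const input = 'hi","level":"error","msg":"forged';

// Vulnerable: the record is built by string concatenation.
const concatenated = '{"level":"info","comment":"' + input + '"}';

// Safe: the input is passed as a value and the serializer escapes the quotes.
const serialized = JSON.stringify({ level: "info", comment: input });

const forgedRecord = JSON.parse(concatenated);
const intactRecord = JSON.parse(serialized);

// With duplicate keys, JSON.parse keeps the last one, so the attacker's
// "level" overrides the real one in the concatenated record.
console.log(forgedRecord.level, forgedRecord.msg);
console.log(intactRecord.level, intactRecord.comment === input);
```

In the concatenated record the attacker controls the level field and adds a msg field, which is enough to trip or evade rules keyed on those values. In the serialized record the payload remains an inert string inside comment.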

The Log4Shell vulnerability (CVE-2021-44228, December 2021) was a related but distinct failure: log4j2’s message lookup feature treated log message content as a template, allowing JNDI lookups in user-controlled strings to trigger remote code execution. The lesson generalizes: log message content is an input surface.

Prevention is structural: never interpolate user input into log message strings. Always pass user input as a typed field attribute, letting the JSON serializer handle escaping:

// Vulnerable — user input in message string
logger.info(`received comment: ${req.body.comment}`);

// Safe — user input as a typed field
logger.info({ event: "comment_received", comment: req.body.comment }, "comment received");

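The serializer escapes quotes and control characters in the field value, so the input cannot break out of its field or start a new record, whatever it contains. This holds only while the output stays in a serialized format such as JSON; a plain-text formatter that prints field values raw reintroduces the problem.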

Quiz

A team logs raw user-submitted strings into a message field. A user submits text containing a literal newline followed by a crafted JSON object that looks like a real ERROR log line. What is the failure class?

Quiz

In a production-ready logging setup, where should PII redaction run?

Order the steps

Order the steps to remediate a PII leak discovered in production logs:

  1. Identify the source: which log call or library is emitting the sensitive field
  2. Add the field path to the logger SDK deny-list immediately to stop new leakage
  3. Add the regex scrubber at the collector tier as a second-pass safety net
  4. Assess the exposure window: how long were logs stored, who had access
  5. File a data-incident report per GDPR / compliance requirements if personal data was exposed
  6. Add a CI check or PR template field to prevent reintroduction
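One possible shape for the CI check in the last step is a line-based scan of source files for log calls that mention deny-listed names. This is a rough heuristic sketch, not a complete linter: it assumes the logger variable is called logger, that calls fit on one line, and that the SENSITIVE list below (a hypothetical example) matches your own deny-list.

```javascript
// Hypothetical list of name fragments that should not appear in log calls.
const SENSITIVE = ["email", "password", "authorization", "token"];

// Assumes the logger variable is named "logger" and uses common level names.
const LOG_CALL = /\blogger\.(?:trace|debug|info|warn|error|fatal)\s*\(/;

// Returns one finding per source line that logs a sensitive-looking name.
function findRiskyLogCalls(source) {
  const findings = [];
  source.split("\n").forEach((text, index) => {
    if (!LOG_CALL.test(text)) return;
    const lowered = text.toLowerCase();
    const hits = SENSITIVE.filter((name) => lowered.includes(name));
    if (hits.length > 0) findings.push({ line: index + 1, hits });
  });
  return findings;
}

// Example input: the second line would be flagged.
const sample = [
  'logger.info({ user_id: user.id }, "login ok");',
  'logger.error({ email: user.email }, "login failed");',
].join("\n");

console.log(findRiskyLogCalls(sample));
```

A CI job could run a check like this over changed files and fail when the findings list is non-empty. Because it matches on names, it produces false positives and misses renamed fields, so it complements the collector-tier scrubber rather than replacing it.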

Recall before you leave

  1. Explain log injection (CWE-117): the two attack shapes and the structural prevention.
  2. Why is redaction at the log backend insufficient as a sole layer?
  3. What is the right way to handle 'I need to debug a specific user's issue' without logging PII?

Recap

Logs accumulate PII through request bodies, headers, SQL queries, and stack traces — often without engineers intending it. Two layers of redaction are required: the logger SDK deny-list strips known-sensitive field paths before serialization (cheap, deterministic), and a collector-tier regex scrubber catches what the application misses (slightly more expensive, catches accidents and third-party libraries). Never interpolate user input into log message strings — pass it as a typed attribute and let the JSON serializer escape it, making log injection structurally impossible. Use opaque internal IDs (not email, phone, or name) so logs are safe by design and GDPR erasure is handled by severing the ID mapping rather than rewriting log archives. The cost of a PII postmortem — compliance notification, audit, remediation — dwarfs the overhead of source-level redaction.


Trademarks belong to their respective owners. Editorial reference only.