Monitoring Your Mail Stack: Proving It Works

A correctly configured mail server that silently stops processing mail is worse than a misconfigured one. Here's how to monitor every layer and catch failures before users do.

Configuration Is Half the Job

A correctly configured mail server can still stop processing mail: a crashed service, a full disk, or a stale certificate will do it, and because nothing looks misconfigured, the operator may not notice until users report missing mail.

The gap between "the server is configured correctly" and "the server is delivering mail right now" is where monitoring lives. Every service in the mail stack should be continuously verified, and failures should be detected and remediated automatically.

Protocol-Level Health Checks

Process monitoring ("is the PID running?") is not sufficient. A Postfix process that is running but not accepting SMTP connections is functionally dead. The monitoring system should verify that each service responds to its native protocol:

Service                 Health Check                      Recovery Action
Postfix                 SMTP handshake on port 25         Automatic restart
IMAP server             IMAP protocol test on port 143    Automatic restart
DKIM signing service    TCP connection on signing port    Automatic restart
SRS service             TCP connection on SRS port        Automatic restart
Spam filter             HTTP health endpoint              Automatic restart; disable after repeated failures
DNS resolver            DNS protocol query                Automatic restart; disable after repeated failures
Redis                   Redis PING command                Automatic restart; alert on excessive memory
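The SMTP check in the table can be sketched in a few lines. This is a minimal example using Python's standard smtplib; the host and port are placeholders for your own Postfix instance. The point is that the check passes only if the server completes a real EHLO exchange, not merely a TCP connect:

```python
import smtplib
import socket

def check_smtp(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Protocol-level health check: perform an actual SMTP handshake.

    Returns True only if the server sends a greeting and answers EHLO
    with a 2xx reply code. A listening-but-dead daemon fails this check
    even though a bare TCP connect would succeed.
    """
    try:
        with smtplib.SMTP(host=host, port=port, timeout=timeout) as smtp:
            code, _ = smtp.ehlo()
            return 200 <= code < 300
    except (smtplib.SMTPException, socket.error, OSError):
        return False
```

The same pattern generalizes to the other rows: swap in imaplib for the IMAP check, a raw socket connect for the DKIM/SRS ports, and an HTTP GET for the spam filter's health endpoint.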

The "disable after repeated failures" pattern (typically 3 restarts within 5 monitoring cycles) prevents a crash loop from consuming all system resources. If a service cannot stay running after three restart attempts, something is fundamentally wrong and human intervention is needed.
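The restart-budget logic is small enough to show directly. This sketch uses the thresholds from the text (3 restarts within a sliding window of 5 monitoring cycles); the class name and API are illustrative, not taken from any particular monitoring tool:

```python
from collections import deque

class RestartBudget:
    """Sliding-window circuit breaker for service restarts.

    Record one boolean per monitoring cycle: did this service need a
    restart? If the number of restarts within the window reaches the
    budget, stop restarting and escalate to a human instead.
    """

    def __init__(self, max_restarts: int = 3, window: int = 5):
        self.max_restarts = max_restarts
        # deque(maxlen=...) silently drops the oldest cycle, giving us
        # the sliding window for free.
        self.history = deque(maxlen=window)

    def record_cycle(self, restarted: bool) -> None:
        self.history.append(restarted)

    def should_disable(self) -> bool:
        return sum(self.history) >= self.max_restarts
```

Because old cycles age out of the deque, a service that crashed three times last week but has been stable since does not stay permanently disabled.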

Monitoring intervals should be short enough to catch failures quickly (2 minutes is reasonable) but not so frequent that they generate excessive load or false positives during brief service restarts.

Pipeline Integrity Monitoring

Beyond individual service checks, the entire mail processing pipeline should be validated end-to-end. For a forwarding service, this means verifying that:

  1. Postfix is accepting mail — the SMTP protocol check covers this
  2. The BCC mailbox is being drained — if the logging script crashes, the mailbox grows without bound
  3. Log entries are reaching the database — the processing script may be running but failing silently
  4. The spam filter is scanning messages — not silently passing everything through

A simple file size check on the BCC mailbox catches the most common failure mode: the processing script crashes or hangs, and the mailbox grows indefinitely. If the mailbox exceeds a threshold (50MB is reasonable for a server processing thousands of messages daily), an alert should fire immediately.
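The size check itself is a thin wrapper around os.path.getsize. The threshold constant matches the 50MB figure above; the mailbox path is an example, not a fixed convention:

```python
import os

# 50 MB, per the guideline above; tune for your message volume.
MAILBOX_THRESHOLD = 50 * 1024 * 1024

def bcc_mailbox_alert(path: str, threshold: int = MAILBOX_THRESHOLD) -> bool:
    """Return True if the BCC mailbox has grown past the threshold.

    A mailbox that keeps growing usually means the logging script has
    crashed or hung and stopped draining it.
    """
    try:
        return os.path.getsize(path) > threshold
    except FileNotFoundError:
        # A missing mailbox is a different failure mode; it is not
        # "too large", but may deserve its own alert.
        return False
```

Run it on the same cadence as the service checks; the two together cover both "the service is up" and "the data is flowing".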

This is a separate check from the service health checks — it validates the data flow, not just the service availability.

System-Level Monitoring

Three system metrics directly affect mail server reliability:

Memory usage — a mail server under memory pressure will start rejecting connections. Alert at 90% sustained over several monitoring cycles. Redis, the spam filter, and the Bayesian classifier are the primary memory consumers.

Load average — sustained high load indicates a spam flood, a misconfiguration, or a resource exhaustion problem. Alert when the 5-minute load average exceeds the number of CPU cores for more than two monitoring cycles.

Disk usage — a full disk causes catastrophic failures across all services. Mail queues, log files, and the BCC mailbox are the primary disk consumers. Alert at 85% (warning) and 95% (critical).
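All three metrics are available without third-party agents. This sketch assumes a Linux host (memory is read from /proc/meminfo, which does not exist elsewhere) and applies the thresholds named above:

```python
import os
import shutil

def system_alerts(path="/", cores=None):
    """Evaluate the three system-level checks from the text.

    Returns a dict of booleans: load above core count, disk at the
    85%/95% warning/critical marks, and memory above 90%.
    Linux-only, because memory comes from /proc/meminfo.
    """
    cores = cores or os.cpu_count() or 1
    load_5min = os.getloadavg()[1]  # (1min, 5min, 15min) tuple

    usage = shutil.disk_usage(path)
    disk_pct = 100.0 * usage.used / usage.total

    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])  # values are in kB
    mem_pct = 100.0 * (1 - meminfo["MemAvailable"] / meminfo["MemTotal"])

    return {
        "load_high": load_5min > cores,
        "disk_warning": disk_pct >= 85.0,
        "disk_critical": disk_pct >= 95.0,
        "memory_high": mem_pct >= 90.0,
    }
```

For the "sustained over several cycles" rule, feed these booleans into the same sliding-window pattern used for restart budgets rather than alerting on a single reading.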

Alerting Strategy

Monitoring without alerting is just logging. The alerting strategy should follow two principles:

  1. Alert on actionable conditions. "Postfix restarted successfully" is informational. "Postfix failed to restart after 3 attempts" is actionable. Only the second should wake someone up.
  2. Alert via a channel that works when the mail server is down. If the monitoring system sends alerts via the same mail server it's monitoring, a mail server failure means no alert is sent. Use an independent alerting channel — a separate mail server, SMS, or a push notification service.
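An independent channel can be as simple as an HTTP POST to a push or webhook service, since that path never touches the mail server being monitored. The URL and JSON payload below are placeholders; adapt them to whatever provider you use:

```python
import json
import urllib.request

def send_alert(webhook_url: str, service: str, message: str) -> int:
    """POST an alert to an out-of-band HTTP endpoint.

    Deliberately avoids SMTP so that a dead mail server cannot
    swallow its own alert. Returns the HTTP status code.
    """
    payload = json.dumps({"service": service, "message": message}).encode()
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```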

The Operational Discipline

Monitoring is not a one-time setup. It requires ongoing attention:

  • Review alerts weekly. If an alert fires frequently but never requires action, it should be tuned or removed — alert fatigue is the enemy of operational reliability.
  • Test the monitoring system itself. Deliberately stop a service and verify that the alert fires and the automatic restart works.
  • Monitor the monitoring. If the monitoring service itself crashes, everything appears healthy when it isn't. A separate external uptime check (even a simple HTTP ping from an external service) provides a safety net.
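One cheap way to monitor the monitoring is a heartbeat file: the monitor touches it once per cycle, and any external check (a cron job on another host, an uptime service probing a tiny status endpoint) alerts when it goes stale. The path and the 300-second staleness window here are illustrative, sized for the 2-minute cycle suggested earlier:

```python
import os
import time

HEARTBEAT = "/var/run/mailmon.heartbeat"  # example path

def beat(path: str = HEARTBEAT) -> None:
    """Called once per monitoring cycle: create the file if needed
    and bump its modification time to now."""
    with open(path, "a"):
        pass
    os.utime(path, None)

def monitor_is_alive(path: str = HEARTBEAT, max_age: float = 300.0) -> bool:
    """External safety net: the monitor is considered dead if its
    heartbeat is older than max_age seconds (two 2-minute cycles
    plus slack) or has never been written."""
    try:
        return (time.time() - os.path.getmtime(path)) <= max_age
    except FileNotFoundError:
        return False
```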

The goal is not zero alerts. The goal is that every alert represents a real problem that either was automatically remediated or requires human attention — and that no real problem goes undetected.
