Zevs @ zevs.gg

Building SMSCode: A Production Rust Service from Scratch

Mar 6 · 20min

SMSCode is a virtual number platform — users purchase temporary phone numbers to receive SMS verification codes. Think of it as a marketplace that connects users who need phone numbers with SMS providers who supply them, handling payments, order lifecycle, and real-time delivery in between.

This isn’t a toy project or a demo. It’s a production system with real money, real users, and real SMS providers that can fail in creative ways. Building it taught me more about Rust in production than any tutorial or book ever could.

This post is a deep dive into how it works — the architecture decisions, the patterns that survived production, and the ones that didn’t.

The Migration Story

SMSCode started on Directus, a Node.js headless CMS. For the early stage, Directus was perfect — it gave me a database, an admin panel, and a REST API with zero custom code. I could focus on building the product instead of building infrastructure.

But as the platform grew past a few thousand users, the cracks showed:

  • Query performance — Directus abstracts SQL behind a generic query builder. As relationships got complex, generated queries became inefficient and hard to optimize.
  • Business logic limitations — order lifecycle management, atomic balance operations, and provider integration needed logic that didn’t fit Directus’s hook system.
  • Memory usage — the Node.js process was consuming 300-400MB at rest, spiking higher under load.
  • Concurrency — background tasks (order expiration, provider reconciliation) competed with API requests on the same event loop.

I needed a custom backend. The question was: what do I build it with?

I considered staying in the Node.js ecosystem — Hono or Fastify would have been the safe choice. But the workload profile of SMSCode is unusual: it’s not just request-response. There are 10 background tasks running on configurable intervals, real-time SSE streams, webhook processing, and atomic financial operations. All running 24/7 with zero tolerance for memory leaks or GC pauses.

Rust with Axum was the answer. Axum gave me the web framework, Tokio gave me the async runtime for background tasks and SSE, and Rust’s type system gave me guarantees that the financial logic would be correct.

The migration took about three weeks. Memory usage dropped from ~400MB to ~12MB. Tail latency dropped by 10x. And the codebase became easier to reason about, not harder — because the type system caught entire categories of bugs at compile time. You can see the result live at smscode.gg.

Architecture Overview

The system has three main components:

┌──────────────────┐     ┌──────────────────┐
│   Astro (Web)    │     │  Admin SSR       │
│   SSR pages      │     │  TanStack Start  │
│   Port 4321      │     │  Port 3001       │
└────────┬─────────┘     └────────┬─────────┘
         │ axumFetch()            │ axumFetch()
         │ (internal auth)        │ (admin auth)
         ▼                        ▼
┌──────────────────────────────────────────────┐
│               Axum API (Rust)                │
│                  Port 3000                   │
│                                              │
│  /internal/*  ── Web frontend calls          │
│  /v1/*        ── Public REST API             │
│  /webhooks/*  ── SMS + payment callbacks     │
│  + 10 background tasks (Tokio)               │
└──────┬──────────┬──────────┬─────────────────┘
       │          │          │
       ▼          ▼          ▼
  PostgreSQL  Redis × 3   SMS Providers
  (17 tables) (session,   Payment Gateways
              cache,
              ratelimit)

Astro is a thin proxy — it handles SSR rendering and session cookies but delegates all business logic to Axum via internal HTTP calls. This is a deliberate choice: Axum is the single source of truth. The web frontend (what you see at smscode.gg) can be rebuilt or replaced without touching business logic.

The admin panel is a separate TanStack Start (React SSR) app. It communicates with Axum over the internal network, never exposed publicly. Different deployment cadence, different auth model, clean separation.

Axum handles everything else: API routes, webhooks, background tasks, metrics, and real-time event streaming. A single Rust binary.

Application State

One of the first things you design in an Axum app is your AppState — the shared state injected into every request handler. Getting this wrong means refactoring hundreds of handlers later.

#[derive(Clone)]
pub struct AppState {
    inner: Arc<AppStateInner>,
}

struct AppStateInner {
    pool: PgPool,
    redis_session: ConnectionManager,
    redis_cache: ConnectionManager,
    redis_ratelimit: ConnectionManager,
    config: Config,
    providers: ProviderRegistry,
    payment_client: reqwest::Client,
    webhook_client: reqwest::Client,
    email_client: reqwest::Client,
    metrics: Metrics,
    prom_handle: PrometheusHandle,
    push_service: Option<WebPushService>,
}

A few things to note:

Three separate Redis connections — session, cache, and rate limit data live on three different Redis instances. This means I can flush the cache or rate limit data without killing user sessions. During development, I accidentally flushed Redis once. With a single instance, that would have logged out every user. With isolation, it was just a cache miss.

Separate HTTP clients — payment gateway calls get a 10-second timeout. Webhook dispatch gets a 3-second timeout with no redirect following. SMS provider calls go through a proxy with their own circuit breaker. Each client is tuned for its specific use case.

Arc<AppStateInner> pattern — AppState is a thin Clone-able wrapper around an Arc. Cloning the state for each handler is just incrementing a reference count — zero allocation. The inner struct holds the actual data, which is shared across all handlers and background tasks.

Optional services — push_service: Option<WebPushService> is None when VAPID keys aren’t configured. The app gracefully degrades instead of panicking on startup. Same pattern for payment gateways and Google OAuth.
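As a dependency-free sketch of the Arc-wrapper pattern (the state fields are reduced to a single string here for illustration — not the real struct):

```rust
use std::sync::Arc;

// Inner struct owns the actual data, shared across handlers and tasks.
struct AppStateInner {
    config_name: String,
}

// Outer wrapper is what handlers receive; cloning it only bumps a refcount.
#[derive(Clone)]
struct AppState {
    inner: Arc<AppStateInner>,
}

impl AppState {
    fn new(config_name: &str) -> Self {
        AppState {
            inner: Arc::new(AppStateInner {
                config_name: config_name.to_string(),
            }),
        }
    }

    // Accessor methods hide the Arc indirection from call sites.
    fn config_name(&self) -> &str {
        &self.inner.config_name
    }
}
```

Cloning AppState never copies the inner data; every handler and background task sees the same AppStateInner.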

Error Handling

Every Axum handler returns Result<impl IntoResponse, AppError>. The AppError enum covers every failure mode:

#[derive(Debug, thiserror::Error)]
pub enum AppError {
    #[error("Authentication required")]
    Unauthorized,

    #[error("Insufficient balance")]
    InsufficientBalance,

    #[error("{0}")]
    NotFound(String),

    #[error("{0}")]
    Validation(String),

    #[error("Too many requests")]
    RateLimit,

    #[error("Too many requests")]
    RateLimitWithRetry(i64),

    #[error("Provider error: {0}")]
    Provider(String),

    #[error(transparent)]
    Internal(#[from] anyhow::Error),

    #[error("Database error")]
    Database(#[from] sqlx::Error),
}

The IntoResponse implementation maps each variant to the correct HTTP status code and a consistent JSON envelope:

{
  "success": false,
  "error": {
    "code": "INSUFFICIENT_BALANCE",
    "message": "Not enough balance"
  }
}
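Stripped of the axum glue, the variant-to-response mapping is just a match. This is a sketch — the specific status codes and error-code strings below are my assumptions, not copied from the real codebase:

```rust
// Simplified subset of the AppError enum from above.
enum AppError {
    Unauthorized,
    InsufficientBalance,
    NotFound(String),
    RateLimitWithRetry(i64),
    Internal(String),
}

// Map each variant to an HTTP status and a stable machine-readable code.
fn status_and_code(err: &AppError) -> (u16, &'static str) {
    match err {
        AppError::Unauthorized => (401, "UNAUTHORIZED"),
        AppError::InsufficientBalance => (402, "INSUFFICIENT_BALANCE"),
        AppError::NotFound(_) => (404, "NOT_FOUND"),
        AppError::RateLimitWithRetry(_) => (429, "RATE_LIMITED"),
        // Internal details are logged server-side, never serialized.
        AppError::Internal(_) => (500, "INTERNAL"),
    }
}
```

The real IntoResponse impl builds the JSON envelope around this mapping and attaches headers (like Retry-After) where needed.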

Two important details:

Internal errors are opaque — AppError::Internal and AppError::Database both return a generic "Internal server error" to the client while logging the full error with tracing::error!. Users never see stack traces or database error messages.

Rate limit with retry — RateLimitWithRetry(i64) adds a Retry-After header to the response, telling clients exactly when they can retry. This is critical for the public API where automated clients need to back off gracefully.

The ? operator makes error propagation clean across the entire call stack:

async fn create_order(
    State(state): State<AppState>,
    auth: AuthUser,
    Json(input): Json<CreateOrderInput>,
) -> Result<Json<OrderResponse>, AppError> {
    let product = get_product(&state, input.product_id)
        .await?  // propagates DB errors
        .ok_or_else(|| AppError::NotFound("Product not found".into()))?;

    check_cancel_limit(&state, auth.user_id)
        .await?;  // → AppError::RateLimit if exceeded

    let balance = atomic_debit(&state, auth.user_id, product.price)
        .await?;  // → AppError::InsufficientBalance

    let number = state.providers()
        .get_number(&params)
        .await
        .map_err(|e| AppError::Provider(e.to_string()))?;

    // ... create order record
    Ok(Json(order.into()))
}

Every ? either propagates to the correct HTTP error or converts automatically via #[from]. No try/catch chains, no forgotten error handling.

Atomic Balance Operations

Financial operations are the most critical part of the system. A user’s balance must always be consistent with their transaction ledger. Getting this wrong means users lose money or get free credit.

The core pattern is debit-first, refund-on-failure:

1. BEGIN TRANSACTION
   a. UPDATE users SET balance = balance - $price
      WHERE id = $user_id AND balance >= $price
      RETURNING balance
   b. INSERT transaction (ORDER_DEBIT, -$price, balance_after)
   c. COMMIT

2. Call SMS provider (external API — OUTSIDE transaction)

3. On success: INSERT order record
4. On failure: BEGIN → credit balance back + INSERT ORDER_REFUND → COMMIT

The debit happens before the provider call. Why? Because the provider call can take seconds. If we deducted after, a user could spend the same balance twice by firing two concurrent requests.

The SQL itself is a single atomic statement with a CAS (compare-and-swap) guard:

UPDATE users
SET balance = balance - $1
WHERE id = $2 AND balance >= $1
RETURNING balance

If the balance is insufficient, zero rows are returned, and we map that to AppError::InsufficientBalance. No race condition possible — PostgreSQL’s row-level locking guarantees atomicity.
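The whole debit-first flow can be modeled in a few lines of plain Rust — a toy model with an in-memory i64 standing in for the database row, and a closure standing in for the provider call:

```rust
// Toy model of debit-first / refund-on-failure. `provider` represents the
// external SMS API call, which may fail.
fn try_order(
    balance: &mut i64,
    price: i64,
    provider: impl Fn() -> Result<String, String>,
) -> Result<String, String> {
    // CAS-style guard: only debit if the balance covers the price
    // (the WHERE balance >= $price clause in SQL).
    if *balance < price {
        return Err("insufficient balance".to_string());
    }
    *balance -= price; // step 1: debit inside the DB transaction

    match provider() {
        // step 2: external call, outside the transaction
        Ok(number) => Ok(number), // step 3: create the order record
        Err(e) => {
            *balance += price; // step 4: refund on provider failure
            Err(e)
        }
    }
}
```

In production the debit and refund are each their own SQL transaction with a ledger insert; the shape of the control flow is the same.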

To prevent integer overflow on refunds (balance is BIGINT):

UPDATE users
SET balance = balance + $1
WHERE id = $2 AND balance <= 9223372036854775807 - $1
RETURNING balance
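That WHERE clause is the SQL equivalent of i64::checked_add — refuse the credit rather than wrap:

```rust
// Mirror of the SQL overflow guard: return None instead of wrapping
// past i64::MAX (the BIGINT ceiling).
fn credit(balance: i64, amount: i64) -> Option<i64> {
    balance.checked_add(amount)
}
```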

A background task (reconcile_balances) runs every 30 minutes, comparing each user’s balance against the sum of their transaction ledger. Any mismatch is logged and alerted. In months of production, it has never found one — but I sleep better knowing it’s checking.
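The invariant the task checks is simple — a sketch, assuming a ledger of signed amounts:

```rust
// A user's balance must equal the sum of their transaction ledger.
// Returns the discrepancy on mismatch so it can be logged and alerted.
fn reconcile(balance: i64, ledger: &[i64]) -> Result<(), i64> {
    let expected: i64 = ledger.iter().sum();
    if balance == expected {
        Ok(())
    } else {
        Err(balance - expected)
    }
}
```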

Background Tasks with Leader Election

SMSCode runs 10 background tasks on configurable intervals. These handle everything from expiring old orders to syncing the product catalog from SMS providers.

The challenge: I want to run multiple Axum instances for high availability, but background tasks should only run on one instance at a time. The solution is Redis-based leader election using SETNX:

async fn acquire_leader_lock(
    redis: &ConnectionManager,
    task: &str,
    instance_id: &str,
    ttl_secs: u64,
) -> bool {
    let key = format!("task_lock:{task}");
    let result: RedisResult<Option<bool>> = redis::cmd("SET")
        .arg(&key)
        .arg(instance_id)
        .arg("NX")        // Only set if not exists
        .arg("EX")        // Expire after TTL
        .arg(ttl_secs)
        .query_async(&mut redis.clone())
        .await;

    matches!(result, Ok(Some(true)))
}

Each task tick: try to acquire the lock. If another instance already holds it, skip. If we get it, run the task, then release the lock with a Lua CAS script (only delete if we still own it):

if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0

The lock TTL is max(interval × 2, 30s) — a safety net so that if an instance crashes mid-task, the lock expires and another instance picks up. Normal runs release immediately.
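The acquire/release semantics — SET NX plus the Lua owner-check — can be modeled in-memory (a sketch; the real thing also has the TTL expiry Redis provides):

```rust
use std::collections::HashMap;

// In-memory model of the Redis task lock: key → owning instance id.
struct LockTable(HashMap<String, String>);

impl LockTable {
    fn acquire(&mut self, key: &str, owner: &str) -> bool {
        // NX: fail if the lock is already held by anyone.
        if self.0.contains_key(key) {
            return false;
        }
        self.0.insert(key.to_string(), owner.to_string());
        true
    }

    fn release(&mut self, key: &str, owner: &str) -> bool {
        // CAS release (the Lua script): only delete if we still own it.
        if self.0.get(key).map(String::as_str) == Some(owner) {
            self.0.remove(key);
            true
        } else {
            false
        }
    }
}
```

The owner check on release matters: if instance A stalls past the TTL and instance B takes over, a naive DEL from A would free B's lock.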

The task loop itself is shutdown-aware using Tokio’s CancellationToken:

loop {
    tokio::select! {
        _ = shutdown.cancelled() => break,
        _ = interval.tick() => {
            if !acquire_leader_lock(...).await { continue; }

            tokio::select! {
                _ = shutdown.cancelled() => {
                    release_leader_lock(...).await;
                    break;
                }
                result = func(state.clone()) => {
                    // log and record metrics
                    release_leader_lock(...).await;
                }
            }
        }
    }
}

The nested select! ensures that even a long-running task is interrupted promptly on SIGTERM — critical for zero-downtime deployments.

Provider Integration with Circuit Breaker

SMS providers are external APIs that fail in exciting ways — timeouts, rate limits, malformed responses, or just going down entirely. A failing provider shouldn’t take down the entire platform.

Each provider implements the ProviderClient trait:

#[async_trait]
pub trait ProviderClient: Send + Sync {
    fn code(&self) -> &str;
    fn circuit_state(&self) -> (u32, bool);

    async fn get_number(&self, params: &GetNumberParams)
        -> Result<GetNumberResult, ProviderError>;
    async fn cancel_activation(&self, id: &str)
        -> Result<bool, ProviderError>;
    async fn get_status(&self, id: &str)
        -> Result<StatusResult, ProviderError>;
    // ... more methods
}

The Send + Sync bounds are required because providers are stored as Arc<dyn ProviderClient> in the ProviderRegistry and shared across all Tokio tasks. This is where Rust’s type system shines — the compiler enforces that provider implementations are thread-safe.

Each provider has a built-in circuit breaker. After N consecutive failures, the circuit opens and subsequent calls fail fast without hitting the external API. After a cooldown period, the circuit enters a half-open state and allows one probe request. If it succeeds, the circuit closes and normal operation resumes.
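A minimal version of that state machine, with illustrative thresholds rather than the production values:

```rust
use std::time::{Duration, Instant};

// Closed: calls pass through. Open: fail fast until the cooldown elapses,
// then allow a probe (half-open). A success closes the circuit again.
struct CircuitBreaker {
    failures: u32,
    threshold: u32,
    cooldown: Duration,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Self { failures: 0, threshold, cooldown, opened_at: None }
    }

    // May we attempt a call right now?
    fn allow(&self) -> bool {
        match self.opened_at {
            None => true,                              // closed
            Some(t) => t.elapsed() >= self.cooldown,   // half-open probe
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.opened_at = None; // close the circuit
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.opened_at = Some(Instant::now()); // open: fail fast
        }
    }
}
```

The production version additionally limits the half-open state to a single in-flight probe; this sketch omits that bookkeeping.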

The ProviderRegistry loads provider configs from the database with a 5-minute cache, dispatching to the correct client based on the protocol column (e.g., sms_activate, sms_bower). Adding a new provider protocol means implementing the trait — the registry handles discovery and caching automatically.

Real-Time with Server-Sent Events

When a user purchases a number and waits for the SMS code, they need to know the moment it arrives. Polling every few seconds works but wastes resources. SSE gives us push-based updates with minimal overhead.

The architecture:

Browser EventSource
    → Astro /api/orders/stream (session check)
        → Axum /internal/user/orders/stream (Redis pub/sub)

    Redis PUBLISH ← webhook handler (OTP received)
                  ← background task (order expired)
                  ← order action (created, canceled)

When an SMS webhook arrives with an OTP code, the handler updates the database and publishes an event to the user’s Redis channel:

// Publish order event to Redis pub/sub
publish_order_event(&state, user_id, OrderEvent {
    order_id,
    status: "OTP_RECEIVED".into(),
    phone_number: Some(phone),
    otp_code: Some(code),
    otp_message: Some(message),
}).await;

The SSE handler subscribes to the user’s channel and streams events as they arrive. The Astro layer validates the session cookie and proxies the stream, adding auth without exposing the internal Axum endpoint.

Fallback: If the SSE connection drops (tab backgrounded, network hiccup), the frontend falls back to polling. On reconnect, it polls once immediately to catch any events missed during the gap — because Redis pub/sub is fire-and-forget, events during a disconnection are lost.

This works across multiple Axum replicas because Redis pub/sub broadcasts to all subscribers. A webhook hitting replica A publishes to Redis, and the SSE connection on replica B receives it instantly.
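The fire-and-forget property is easy to demonstrate with a toy pub/sub built on std channels — events published while nobody is subscribed simply vanish, which is exactly why the frontend polls once on reconnect:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

// Toy model of per-user pub/sub channels (user_id → live subscribers).
struct PubSub {
    channels: HashMap<u64, Vec<Sender<String>>>,
}

impl PubSub {
    fn new() -> Self {
        Self { channels: HashMap::new() }
    }

    fn subscribe(&mut self, user_id: u64) -> Receiver<String> {
        let (tx, rx) = channel();
        self.channels.entry(user_id).or_default().push(tx);
        rx
    }

    fn publish(&mut self, user_id: u64, event: &str) {
        if let Some(subs) = self.channels.get_mut(&user_id) {
            // Deliver to live subscribers; drop ones that disconnected.
            subs.retain(|tx| tx.send(event.to_string()).is_ok());
        }
        // No subscriber? The event is gone — fire-and-forget.
    }
}
```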

Four Auth Layers

Different consumers need different auth mechanisms:

Layer       Mechanism                             Used by
Internal    Shared secret + user identity header  Astro → Axum
Admin       Session header over internal network  Admin panel → Axum
Public API  Authorization: Bearer <token>         External API consumers
Webhook     Signature verification                SMS providers, payment gateways

The internal auth uses a shared secret between Astro and Axum — only valid over the internal network (never exposed to the internet). Astro reads the session cookie, decrypts it (AES-256-GCM), and forwards the user identity in a separate header.

The admin auth uses a separate session store in Redis. Admin sessions carry permissions (role, permissions array), checked by the dashboard_auth middleware before any handler runs.

All key comparisons use constant-time comparison to prevent timing attacks. Password hashing uses argon2id via spawn_blocking — heavy CPU work runs on Tokio’s blocking thread pool, keeping the async runtime responsive.
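The comparison itself looks like this — a hand-rolled sketch to show the idea; in production you would reach for a vetted crate such as subtle or constant_time_eq:

```rust
// Compare two byte slices without early exit on the first mismatch,
// so the comparison time doesn't leak how many prefix bytes matched.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without branching
    }
    diff == 0
}
```

A naive `a == b` can short-circuit on the first differing byte, letting an attacker recover a secret byte-by-byte by measuring response times.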

Monitoring

You can’t run a financial platform blind. SMSCode exports metrics at two levels:

In-memory metrics — a custom Metrics struct using DashMap and ring buffers tracks per-endpoint latency (p50/p95/p99), provider call performance, cache hit rates, balance operations, and background task health. Exposed via /internal/metrics as JSON for the admin dashboard.
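A rough sketch of percentile tracking over a fixed-size ring buffer of latency samples — the real Metrics struct is more involved (DashMap keyed by endpoint, concurrent writers), but the core idea fits in a few lines:

```rust
// Fixed-capacity ring buffer: new samples overwrite the oldest ones,
// keeping a sliding window of recent latencies.
struct RingBuffer {
    samples: Vec<u64>, // latencies in microseconds
    cap: usize,
    next: usize,
}

impl RingBuffer {
    fn new(cap: usize) -> Self {
        Self { samples: Vec::with_capacity(cap), cap, next: 0 }
    }

    fn record(&mut self, sample: u64) {
        if self.samples.len() < self.cap {
            self.samples.push(sample);
        } else {
            self.samples[self.next] = sample; // overwrite oldest
        }
        self.next = (self.next + 1) % self.cap;
    }

    // p in [0, 100]; sorts a copy, so fine for a small window.
    fn percentile(&self, p: f64) -> Option<u64> {
        if self.samples.is_empty() {
            return None;
        }
        let mut sorted = self.samples.clone();
        sorted.sort_unstable();
        let idx = ((sorted.len() - 1) as f64 * p / 100.0).round() as usize;
        Some(sorted[idx])
    }
}
```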

Prometheus metrics — standard counters and histograms (http_requests_total, http_request_duration_seconds, provider_api_calls_total, etc.) scraped by Prometheus every 15 seconds. Grafana dashboards visualize trends and AlertManager sends notifications to Telegram when thresholds are breached.

The monitoring stack runs on a separate LXC container from the application — if Axum crashes, monitoring stays up and alerts fire.

Infrastructure

Production runs on a single EPYC server with Proxmox VE, using LXC containers (not Docker):

Container   Purpose
App         Axum API + Astro SSR + Admin panel
Database    PostgreSQL 17
Cache       Redis × 3 (session / cache / ratelimit)
Gateway     Traefik + Cloudflare Tunnel
Monitoring  Grafana + Prometheus

LXC over Docker because: near-native performance, lower resource overhead, and systemd service management instead of Docker’s restart policies. Each service (smscode-api, smscode-web, smscode-admin) is a systemd unit with proper dependency ordering and env file loading.

Traffic flows through Cloudflare (DDoS protection, TLS termination) → Cloudflare Tunnel → Traefik (routing) → the appropriate service. The only ports exposed to the internet are behind Cloudflare — the actual server IP is hidden.

VLAN segmentation isolates production from infrastructure. The monitoring container can scrape metrics from the app container, but not the other way around.

Deployment

Deployments are automated via a single script:

./scripts/deploy.sh api    # Build + deploy Axum
./scripts/deploy.sh web    # Build + deploy Astro
./scripts/deploy.sh admin  # Build + deploy Admin
./scripts/deploy.sh all    # All three

The script handles building (cross-compilation for Rust, bun run build for the frontends), uploading artifacts to the server, creating backups, restarting systemd services, and running health checks. If the health check fails, it rolls back automatically.

For the Rust binary, I build with --release on my local machine and scp the binary to the server. The binary is ~15MB and starts in milliseconds. Compare that to deploying a Node.js app with node_modules — it’s a refreshingly simple deployment model.

Lessons from Production

After running this system in production, a few hard-won lessons:

Debit first, refund on failure. Never rely on a successful external API call before touching balances. Provider APIs are unreliable. Your database isn’t. Debit the balance, call the provider, and refund if it fails. The alternative — reserving balance or deducting after — creates race conditions that are nearly impossible to reproduce in testing but trivial to trigger in production.

Three Redis instances, not one. The first time I needed to flush the rate limit store during a configuration change, I was glad sessions and cache were on separate instances. Isolation costs almost nothing and saves you from cascading failures.

Circuit breakers are essential. A failing SMS provider without a circuit breaker will eat your timeout budget across every request. With a circuit breaker, failures are detected in milliseconds after the circuit opens, and the system gracefully routes to other providers.

SSE with polling fallback. Pure SSE is fragile — browsers throttle background tabs, mobile connections drop, proxies have idle timeouts. Always have a polling fallback that activates seamlessly. Users don’t care about the transport mechanism; they care that the OTP appears when it arrives.

Reconciliation tasks are non-negotiable. For anything financial, run periodic reconciliation. Our reconcile_balances task has never found a discrepancy — which tells me the atomic operations are correct. But the moment I remove it is the moment a bug will slip through.

Rust’s compile times are the real cost. The initial build of the workspace takes ~2 minutes. Incremental rebuilds are 10-15 seconds. Coming from the instant feedback loop of Node.js, this requires an adjustment in workflow — I batch changes and think more carefully before compiling. But in exchange, when it compiles, it usually works.

What’s Next

The platform is stable and handling production traffic at smscode.gg. Upcoming work includes:

  • Horizontal scaling — the leader-elected task system already supports multiple replicas. Next step is load balancing across multiple Axum instances behind Traefik.
  • More providers — the ProviderClient trait makes adding new SMS providers straightforward. Each new protocol is a separate module implementing the same trait.
  • GraphQL for the admin — the admin panel currently makes many small REST calls. A GraphQL layer would let it fetch exactly the data it needs in a single request.

Building a production Rust service from scratch was a significant investment — but seeing SMSCode handle thousands of concurrent operations at 12MB of memory with sub-millisecond latency made every fight with the borrow checker worth it.

You can find my full tech stack here.

Thanks for reading!
