📁 Volume IV: Distributed Systems

👁️ Topic 29: Observability

The Three Pillars: Logging, Metrics, Tracing

"You cannot improve what you cannot measure.
Logs tell you WHAT happened.
Metrics tell you HOW OFTEN it happens.
Traces tell you WHERE time was lost.
Observability is the foundation of performance tuning."

⚠️ THE OBSERVABILITY GAP

Most Laravel developers rely solely on logs. Logs record that something went wrong, but on their own they rarely show trends or patterns, and they cannot show where time is lost across microservices. True observability requires all three pillars: Logging, Metrics, and Tracing. With only one of them, you're flying blind.

🔍 The Three Pillars of Observability

THE THREE PILLARS
═══════════════════════════════════════════════════════════════════

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│     LOGGING     │  │     METRICS     │  │     TRACING     │
│                 │  │                 │  │                 │
│ "User 123       │  │ "500 errors     │  │ "Request        │
│  logged in      │  │  per minute"    │  │  A → B → C → D  │
│  at 10:05"      │  │                 │  │  12ms"          │
└─────────────────┘  └─────────────────┘  └─────────────────┘
  WHAT happened?        HOW OFTEN?         WHERE did time go?

EXAMPLE: User reports "page is slow"
  Logs:    Show errors, warnings, user actions
  Metrics: Show 95th percentile latency spiked to 5 seconds
  Traces:  Show that database query X took 4.8 seconds

  → You fix the slow query. Problem solved.

WITHOUT tracing: You know the page is slow, but not why.
WITHOUT metrics: You don't know it's getting worse.
WITHOUT logs:    You can't see the error messages.

Pillar	Question Answered	Example	Tools
Logging	What happened? (discrete events)	"User 123 failed to process order"	Laravel Logs, ELK Stack, Loki
Metrics	How often / how much? (aggregated)	"500 errors increased to 50 per minute"	Prometheus, Datadog, New Relic
Tracing	Where did time go? (distributed request flow)	"Database query took 4.8 seconds out of 5s total"	Jaeger, Zipkin, OpenTelemetry

📝 Logging: What Happened?

BAD LOGGING PATTERNS

Logging inside loops (10,000 I/O operations)
Logging sensitive data (passwords, tokens)
Inconsistent log formats (hard to parse)
No correlation IDs across requests
Logging to local files in distributed systems
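
Two of the patterns above have simple remedies that fit in a few lines of plain PHP. The sketch below is only an illustration: `redactContext()` and `summarizeBatch()` are hypothetical helper names, not Laravel APIs, and the list of sensitive key fragments is an assumption you would adapt to your own data.

```php
<?php
// Hypothetical helper: mask values whose keys look sensitive before logging.
function redactContext(array $context, array $sensitiveKeys = ['password', 'token', 'secret']): array
{
    $clean = [];
    foreach ($context as $key => $value) {
        $isSensitive = false;
        foreach ($sensitiveKeys as $needle) {
            if (is_string($key) && stripos($key, $needle) !== false) {
                $isSensitive = true;
                break;
            }
        }
        if ($isSensitive) {
            $clean[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            $clean[$key] = redactContext($value, $sensitiveKeys); // recurse into nested context
        } else {
            $clean[$key] = $value;
        }
    }
    return $clean;
}

// Hypothetical helper: build one summary entry instead of one log write per item.
function summarizeBatch(array $results): array
{
    $failedIds = array_keys(array_filter($results, fn (bool $ok): bool => !$ok));
    return [
        'processed' => count($results),
        'failed' => count($failedIds),
        'failed_ids' => array_slice($failedIds, 0, 20), // cap the list size
    ];
}

$context = redactContext(['user_id' => 123, 'password' => 'hunter2', 'api_token' => 'abc']);
$summary = summarizeBatch([101 => true, 102 => false, 103 => true]);

// A single structured (JSON) line replaces thousands of per-item writes.
echo json_encode(['message' => 'Import finished'] + $context + $summary), PHP_EOL;
```

In a Laravel app the resulting array would typically be passed as the context argument of a single `Log::info()` call after the loop finishes.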

GOOD LOGGING PATTERNS

// Structured logging (JSON) - machine readable
Log::channel('stack')->info('User registered', [
    'user_id' => $user->id,
    'ip' => $request->ip(),
    'user_agent' => $request->userAgent(),
    'correlation_id' => $this->correlationId,
]);

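// With context (available since Laravel 8.x)
Log::withContext([
    'user_id' => auth()->id(),
    'correlation_id' => $this->correlationId,
    // The Request class has no id() method; read the ID from a header instead
    'request_id' => request()->header('X-Request-Id'),
]);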

Log::info('Processing order');
Log::info('Order processed');  // Both include context

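// Always include a correlation ID so logs can be matched to traces
// Str::uuid() returns an object, so cast it to a string
$correlationId = request()->header('X-Correlation-ID', (string) Str::uuid());
// Use the same key name as the other log context fields
Log::shareContext(['correlation_id' => $correlationId]);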

CENTRALIZED LOGGING IN LARAVEL

# config/logging.php
'channels' => [
    'cloudwatch' => [
        'driver' => 'custom',
        // App-defined factory class that builds the Monolog logger
        'via' => App\Logging\CloudWatchLogger::class,
        'level' => env('LOG_LEVEL', 'error'),
    ],
    'paper-trail' => [
        'driver' => 'monolog',
        'handler' => \Monolog\Handler\SyslogUdpHandler::class,
        // Handler constructor arguments must be passed via handler_with
        'handler_with' => [
            'host' => env('PAPERTRAIL_URL'),
            'port' => env('PAPERTRAIL_PORT'),
        ],
    ],
],

# .env
LOG_CHANNEL=cloudwatch

📊 Metrics: How Often / How Much?

TYPES OF METRICS
═══════════════════════════════════════════════════════════════════

Counter: Always increases (never decreases)
  • Total requests received
  • Total errors
  • Total orders placed

Gauge: Can go up and down
  • Current memory usage
  • Active users
  • Queue size

Histogram: Distribution of values
  • Request duration (p50, p95, p99)
  • Database query time
  • Response size

Timer: Specialized histogram for durations
  • API call latency
  • Job processing time
  • Cache hit/miss latency
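
To make the difference between these types concrete, here is a minimal in-memory sketch in plain PHP. The class names and the millisecond bucket bounds are illustrative assumptions, not the API of any Prometheus client; the histogram uses cumulative buckets, which is the convention Prometheus follows.

```php
<?php
// Minimal in-memory models of the metric types; names are illustrative only.
final class Counter
{
    private int $value = 0;

    public function inc(int $by = 1): void
    {
        if ($by < 0) {
            throw new InvalidArgumentException('A counter can only increase.');
        }
        $this->value += $by;
    }

    public function value(): int { return $this->value; }
}

final class Gauge
{
    private float $value = 0.0;

    public function set(float $value): void { $this->value = $value; } // may go up or down
    public function value(): float { return $this->value; }
}

final class Histogram
{
    private array $buckets = []; // cumulative counts keyed by upper bound
    private int $count = 0;
    private float $sum = 0.0;

    public function __construct(private array $upperBounds)
    {
        sort($this->upperBounds);
        foreach ($this->upperBounds as $bound) {
            $this->buckets[$bound] = 0;
        }
    }

    public function observe(float $value): void
    {
        $this->count++;
        $this->sum += $value;
        foreach ($this->upperBounds as $bound) {
            if ($value <= $bound) {
                $this->buckets[$bound]++; // every bucket at or above the value
            }
        }
    }

    public function snapshot(): array
    {
        return ['buckets' => $this->buckets, 'count' => $this->count, 'sum' => $this->sum];
    }
}

$requests = new Counter();
$requests->inc();
$requests->inc();

$queueSize = new Gauge();
$queueSize->set(42);
$queueSize->set(17);

$latencyMs = new Histogram([100, 500, 1000]);
foreach ([80, 120, 450, 900, 4800] as $ms) {
    $latencyMs->observe($ms);
}
print_r($latencyMs->snapshot());
```

A timer in this model is simply a histogram fed with elapsed durations, for example the difference between two `hrtime(true)` readings.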

METRICS IN LARAVEL (with Prometheus)

// Install package
composer require ssmdd/laravel-prometheus-exporter

// Create a metric
use Prometheus\Facades\Prometheus;

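// Counter
// Label by route pattern rather than raw path: paths containing IDs
// would create a new time series for every distinct URL
Prometheus::counter('http_requests_total', 'Total HTTP requests')
    ->labels(['method' => $request->method(), 'route' => $request->route()?->uri() ?? 'unmatched'])
    ->inc();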

// Histogram (request duration)
$duration = microtime(true) - LARAVEL_START;
Prometheus::histogram('http_request_duration_seconds', 'Request duration in seconds')
    ->labels(['method' => $request->method()])
    ->observe($duration);

// Gauge (active users)
Prometheus::gauge('active_users', 'Currently active users')
    ->set(Cache::get('active_users_count', 0));

// Export metrics endpoint
Route::get('/metrics', fn() => Prometheus::render());

CRITICAL METRICS FOR LARAVEL

Metric	Why It Matters
HTTP request duration (p95, p99)	User experience threshold
Database query count per request	N+1 detection
Error rate (500s per minute)	Service health
Queue size and processing time	Background job health
Cache hit ratio	Cache effectiveness
PHP-FPM active processes	Server capacity
Redis memory usage	Cache health
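
Two of these metrics are derived values, and it can help to see how they are computed. The following is a sketch in plain PHP using the nearest-rank percentile method, which is one of several common definitions; `percentile()` and `hitRatio()` are hypothetical helper names and the sample durations are made up for illustration.

```php
<?php
// Nearest-rank percentile: the smallest sample such that at least
// $p percent of all samples are less than or equal to it.
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('Need at least one sample and 0 < p <= 100.');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Cache hit ratio: hits divided by all lookups (0.0 when there were none).
function hitRatio(int $hits, int $misses): float
{
    $total = $hits + $misses;
    return $total === 0 ? 0.0 : $hits / $total;
}

// Illustrative request durations in milliseconds, with one slow outlier.
$durationsMs = [120, 95, 110, 4800, 105, 98, 130, 101, 99, 115];

$p50 = percentile($durationsMs, 50);
$p95 = percentile($durationsMs, 95);
$ratio = hitRatio(900, 100);

printf("p50=%.0fms p95=%.0fms cache hit ratio=%.0f%%\n", $p50, $p95, $ratio * 100);
```

The median barely moves when one request is slow, while the p95 exposes it. This is why the table tracks p95 and p99 rather than the average.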

🔗 Tracing: Where Did Time Go?

DISTRIBUTED TRACING (Request Flow Across Services)
═══════════════════════════════════════════════════════════════════

User Request: GET /api/order/123

Timeline:
  [Frontend]
    └─ Nginx (2ms)
         └─ Laravel (15ms)
              ├─ DB (10ms)
              ├─ Redis (2ms)
              └─ API (50ms)

WITHOUT tracing: You see "total 79ms" but don't know the API call took 50ms
WITH tracing:    You see exactly where time is lost

COMPONENTS:
═══════════════════════════════════════════════════════════════════
Trace:          The entire request journey
Span:           One operation within the trace (DB query, API call, etc.)
Trace ID:       Unique ID for the entire request (shared across services)
Span ID:        Unique ID for each operation
Parent Span ID: Links spans together
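
The components listed above can be modelled with plain arrays. The sketch below reuses the durations from the timeline and treats each figure as the time spent in that operation itself, which is an assumption about how to read the diagram. The field names and helper functions are illustrative, not an OpenTelemetry data format.

```php
<?php
// One trace = one request; every span carries the same trace ID.
function makeSpan(string $traceId, string $name, int $selfMs, ?string $parentSpanId = null): array
{
    return [
        'trace_id' => $traceId,
        'span_id' => bin2hex(random_bytes(8)),
        'parent_span_id' => $parentSpanId, // links this span to its caller
        'name' => $name,
        'self_ms' => $selfMs,              // time spent in this operation itself
    ];
}

// All spans whose parent is the given span.
function childrenOf(array $spans, string $spanId): array
{
    return array_values(array_filter(
        $spans,
        fn (array $span): bool => $span['parent_span_id'] === $spanId
    ));
}

// The span with the largest self time.
function slowestSpan(array $spans): array
{
    usort($spans, fn (array $a, array $b): int => $b['self_ms'] <=> $a['self_ms']);
    return $spans[0];
}

$traceId = bin2hex(random_bytes(16));
$nginx = makeSpan($traceId, 'Nginx', 2);
$laravel = makeSpan($traceId, 'Laravel', 15, $nginx['span_id']);
$spans = [
    $nginx,
    $laravel,
    makeSpan($traceId, 'DB', 10, $laravel['span_id']),
    makeSpan($traceId, 'Redis', 2, $laravel['span_id']),
    makeSpan($traceId, 'API', 50, $laravel['span_id']),
];

$totalMs = array_sum(array_column($spans, 'self_ms'));
$slowest = slowestSpan($spans);
printf("total %dms, slowest: %s (%dms)\n", $totalMs, $slowest['name'], $slowest['self_ms']);
// prints "total 79ms, slowest: API (50ms)"
```

The total alone matches what a metric would report; the parent links are what let a tracing backend attribute 50 of those 79 milliseconds to the external API call.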

TRACING IN LARAVEL (with OpenTelemetry)

// Install OpenTelemetry
composer require open-telemetry/opentelemetry
composer require open-telemetry/exporter-jaeger

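// app/Http/Middleware/TracingMiddleware.php
namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use OpenTelemetry\API\Trace\SpanKind;
use OpenTelemetry\API\Trace\StatusCode;
use OpenTelemetry\API\Trace\TracerInterface;
use Throwable;

class TracingMiddleware
{
    public function handle(Request $request, Closure $next)
    {
        // Assumes a TracerInterface binding is registered in the container
        $tracer = app(TracerInterface::class);

        $span = $tracer->spanBuilder('HTTP ' . $request->method())
            ->setSpanKind(SpanKind::KIND_SERVER)
            ->setAttribute('http.method', $request->method())
            ->setAttribute('http.url', $request->fullUrl())
            ->startSpan();

        // Make the span current so nested spans become its children
        $scope = $span->activate();

        try {
            $response = $next($request);
            $span->setAttribute('http.status_code', $response->getStatusCode());

            return $response;
        } catch (Throwable $e) {
            // Mark the span as failed before rethrowing
            $span->recordException($e);
            $span->setStatus(StatusCode::STATUS_ERROR, $e->getMessage());

            throw $e;
        } finally {
            // Always detach the scope and end the span, even on exceptions
            $scope->detach();
            $span->end();
        }
    }
}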

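// Manual tracing in code
use OpenTelemetry\API\Trace\Span;
use OpenTelemetry\API\Trace\TracerInterface;

// Assumes a TracerInterface binding is registered in the container
$tracer = app(TracerInterface::class);

// Annotate the span that is currently active
$span = Span::getCurrent();
$span->setAttribute('user.id', $user->id);

// Create a nested span for the database query.
// The builder uses the current context as parent by default; setParent()
// expects a Context, not the SpanContext returned by $span->getContext().
$dbSpan = $tracer->spanBuilder('DB Query')
    ->startSpan();

try {
    $dbSpan->setAttribute('db.statement', 'SELECT * FROM users');
    $users = DB::table('users')->get();
} finally {
    // End the span even if the query throws
    $dbSpan->end();
}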

🔗 Correlation IDs: Connecting Logs, Metrics, and Traces

WITHOUT CORRELATION ID:
═══════════════════════════════════════════════════════════════════
Log: "Payment failed"
Log: "User 123"
Log: "Order 456"
→ Are they related? Unknown. Debugging nightmare.

WITH CORRELATION ID:
═══════════════════════════════════════════════════════════════════
Log:   [correlation-id=abc-123] "Processing payment for user 123"
Log:   [correlation-id=abc-123] "Calling payment gateway..."
Log:   [correlation-id=abc-123] "Payment gateway returned error: insufficient funds"
Trace: [correlation-id=abc-123] Total time: 250ms
       [correlation-id=abc-123] ├─ Payment gateway API: 200ms
       [correlation-id=abc-123] └─ Database update: 5ms
→ Everything connected. Easy debugging.

CORRELATION ID IN LARAVEL

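// app/Http/Middleware/CorrelationIdMiddleware.php
namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Str;

class CorrelationIdMiddleware
{
    public function handle(Request $request, Closure $next)
    {
        // Reuse the caller's ID if present, otherwise generate one
        $correlationId = $request->header('X-Correlation-Id', (string) Str::uuid());

        // Share with logging (same key name as the other log context fields)
        Log::shareContext(['correlation_id' => $correlationId]);

        // Share with application
        app()->instance('correlation-id', $correlationId);

        $response = $next($request);

        // headers->set() also works for streamed and file responses,
        // which do not provide the header() helper
        $response->headers->set('X-Correlation-Id', $correlationId);

        return $response;
    }
}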

// Usage in code
$correlationId = app('correlation-id');
// The shared log context already attaches the ID, so no per-call context is needed
Log::info("Processing order {$orderId}");

// Pass to external services
Http::withHeaders([
    'X-Correlation-Id' => $correlationId
])->post('https://api.payment.com', $data);
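
The same reuse-or-generate logic can be sketched without the framework, which also makes it easy to add input validation. The version below is an illustration under assumptions: `resolveCorrelationId()` and `outgoingHeaders()` are hypothetical names, the 64-character limit and allowed character set are arbitrary choices, and the fallback is a random hex string rather than a UUID. Validating the incoming value keeps a client-supplied header from injecting newlines or oversized strings into your logs.

```php
<?php
// Reuse an incoming ID when it looks sane; otherwise mint a new one.
function resolveCorrelationId(array $headers): string
{
    foreach ($headers as $name => $value) {
        if (strcasecmp((string) $name, 'X-Correlation-Id') === 0   // header names are case-insensitive
            && is_string($value)
            && preg_match('/\A[A-Za-z0-9._-]{1,64}\z/', $value) === 1) {
            return $value;
        }
    }
    return bin2hex(random_bytes(16)); // 32 hex characters
}

// Headers to attach to every outbound call made while handling the request.
function outgoingHeaders(string $correlationId, array $extra = []): array
{
    return ['X-Correlation-Id' => $correlationId] + $extra;
}

$reused = resolveCorrelationId(['x-correlation-id' => 'abc-123']);
$generated = resolveCorrelationId([]);
$rejected = resolveCorrelationId(['X-Correlation-Id' => "abc\nforged log line"]);

echo json_encode(outgoingHeaders($reused, ['Accept' => 'application/json'])), PHP_EOL;
```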

🔭 Laravel Telescope (Local/Staging Observability)

INSTALL AND CONFIGURE

composer require laravel/telescope --dev
php artisan telescope:install
php artisan migrate

// config/telescope.php
'enabled' => env('TELESCOPE_ENABLED', true),

// .env (development)
TELESCOPE_ENABLED=true

// .env (production)
TELESCOPE_ENABLED=false  # Keep disabled in high-traffic production

WHAT TELESCOPE SHOWS

Requests — All HTTP requests with timing
Queries — All database queries with N+1 detection
Jobs — Queue job execution and failures
Logs — All log entries
Mail — Sent emails
Notifications — Sent notifications
Cache — Cache operations
Redis — Redis commands

⚠️ TELESCOPE IN PRODUCTION

Telescope stores every recorded entry in the database. In high-traffic production, this will fill your database and degrade performance. Use Telescope only in development/staging, or, if you must run it in production, restrict what it records with a filter (for example, sample 1% of requests).
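
One way to implement "sample 1% of requests" is a deterministic decision based on a stable key such as the correlation ID, so that every entry belonging to the same request is kept or dropped together. The sketch below is plain PHP and not part of Telescope; `shouldSample()` is a hypothetical helper, and using `crc32()` as the hash is an assumption that favors simplicity over statistical quality.

```php
<?php
// Deterministic sampling: the same key always yields the same decision.
function shouldSample(string $key, float $percent): bool
{
    $bucket = crc32($key) % 10000;                 // 0..9999
    return $bucket < (int) round($percent * 100);  // 1.0 percent -> buckets 0..99
}

$key = 'abc-123'; // e.g. the request's correlation ID

$first = shouldSample($key, 1.0);
$second = shouldSample($key, 1.0);

var_dump($first === $second);          // the decision is stable for a given key
var_dump(shouldSample($key, 0.0));     // nothing is sampled at 0 percent
var_dump(shouldSample($key, 100.0));   // everything is sampled at 100 percent
```

In a Laravel app, a decision like this could be returned from the callback registered with `Telescope::filter()` in the Telescope service provider, alongside whatever always-record rules you keep for exceptions and failed jobs.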

📦 Complete Open-Source Observability Stacks

🟢 LGTM Stack (Grafana)

Loki — Logging
Grafana — Visualization
Tempo — Tracing
Mimir — Metrics
Best for: Grafana users, single pane of glass

🔵 ELK Stack

Elasticsearch — Storage
Logstash — Processing
Kibana — Visualization
Plus: APM for tracing, Beats for metrics
Best for: Logging-first observability

DOCKER-COMPOSE FOR LGTM STACK

# docker-compose.yml
version: '3'
services:
  loki:
    image: grafana/loki:latest
  tempo:
    image: grafana/tempo:latest
  mimir:
    image: grafana/mimir:latest
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=tempoLokiSearch

📈 Key Metrics Every Laravel App Should Monitor

Category	Metric	Alert Threshold
Application	Response time (p95)	> 500ms
	Error rate (5xx)	> 1%
	Requests per second (RPS)	Monitor baseline
Database	Slow queries (> 100ms)	> 10 per minute
	Connection count	> 80% of max
	Replication lag	> 5 seconds
Queue	Queue size	> 1000
	Failed jobs	> 5 per minute
	Job processing time (p95)	> 60 seconds
Infra	CPU usage	> 80%
	Memory usage	> 90%
	Disk usage	> 85%
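
The thresholds in this table translate directly into a rule check. The following plain PHP sketch shows the idea; the metric key names and the `breachedAlerts()` helper are hypothetical, the current values are made up, and a real setup would normally express these rules in the alerting system (for example Prometheus or Grafana) rather than in application code.

```php
<?php
// Thresholds copied from the table above; the key names are hypothetical.
const THRESHOLDS = [
    'response_time_p95_ms' => 500,
    'error_rate_percent' => 1,
    'slow_queries_per_minute' => 10,
    'replication_lag_seconds' => 5,
    'queue_size' => 1000,
    'failed_jobs_per_minute' => 5,
    'cpu_percent' => 80,
    'memory_percent' => 90,
    'disk_percent' => 85,
];

// Returns every metric whose current value is strictly above its limit.
function breachedAlerts(array $current, array $thresholds = THRESHOLDS): array
{
    $breaches = [];
    foreach ($thresholds as $metric => $limit) {
        if (isset($current[$metric]) && $current[$metric] > $limit) {
            $breaches[$metric] = ['value' => $current[$metric], 'limit' => $limit];
        }
    }
    return $breaches;
}

// Illustrative readings: slow p95 and high CPU, everything else within limits.
$current = [
    'response_time_p95_ms' => 620,
    'error_rate_percent' => 0.4,
    'queue_size' => 1000, // exactly at the limit, so not a breach
    'cpu_percent' => 91,
];

$breaches = breachedAlerts($current);
foreach ($breaches as $metric => $info) {
    printf("ALERT %s: %s > %s\n", $metric, $info['value'], $info['limit']);
}
```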

📝 Topic 29 Summary: Observability

Pillar	Purpose	Tools	Cost
Logging	Debugging, audits, security	ELK, Loki, CloudWatch	Storage & ingestion
Metrics	Trends, alerts, capacity planning	Prometheus, Datadog	Storage & query
Tracing	Performance bottlenecks, dependencies	Jaeger, Tempo, Zipkin	Sampling (1-10% of traffic)

📌 THE RULE: Logging tells you what happened. Metrics tell you how often. Tracing tells you where time went. You need all three to truly understand your system. Start with logs, add metrics, then add tracing when you need to debug distributed performance issues.

NEXT TOPIC PREVIEW

Topic 30: Zero-Trust Architecture — Trust nothing, verify everything. How to build systems that don't rely on internal network security. The future of cloud-native applications.