Volume IV: Distributed Systems

Topic 29: Observability

The Three Pillars: Logging, Metrics, Tracing

"You cannot improve what you cannot measure.
Logs tell you WHAT happened.
Metrics tell you HOW OFTEN it happens.
Traces tell you WHERE time was lost.
Observability is the foundation of performance tuning."
⚠️ THE OBSERVABILITY GAP

Many Laravel developers rely solely on logs. Logs tell you that something went wrong, but rarely why. On their own they reveal little about trends and patterns, or about where time is lost across microservices. True observability requires three pillars: Logging, Metrics, and Tracing. Without all three, you're flying blind.

The Three Pillars of Observability

THE THREE PILLARS

LOGGING: "User 123 logged in at 10:05" (WHAT happened?)
METRICS: "500 errors per minute" (HOW OFTEN?)
TRACING: "Request A → B → C → D, 12ms" (WHERE did time go?)

EXAMPLE: User reports "page is slow"

Logs:    Show errors, warnings, user actions
Metrics: Show 95th percentile latency spiked to 5 seconds
Traces:  Show that database query X took 4.8 seconds

→ You fix the slow query. Problem solved.

WITHOUT tracing: You know the page is slow, but not why.
WITHOUT metrics: You don't know it's getting worse.
WITHOUT logs: You can't see the error messages.

Pillar | Question Answered | Example | Tools
Logging | What happened? (discrete events) | "User 123 failed to process order" | Laravel Logs, ELK Stack, Loki
Metrics | How often / how much? (aggregated) | "500 errors increased to 50 per minute" | Prometheus, Datadog, New Relic
Tracing | Where did time go? (distributed request flow) | "Database query took 4.8 seconds out of 5s total" | Jaeger, Zipkin, OpenTelemetry

Logging: What Happened?

BAD LOGGING PATTERNS
GOOD LOGGING PATTERNS
// Structured logging: pass context as an array so a JSON formatter can emit machine-readable fields
Log::channel('stack')->info('User registered', [
    'user_id' => $user->id,
    'ip' => $request->ip(),
    'user_agent' => $request->userAgent(),
    'correlation_id' => $this->correlationId,
]);

// With context (available since Laravel 8.x): added to later entries on this channel
Log::withContext([
    'user_id' => auth()->id(),
    'correlation_id' => $this->correlationId,
    // Assumes a proxy or load balancer sets this header; Laravel's Request has no id() method
    'request_id' => request()->header('X-Request-Id'),
]);

Log::info('Processing order');
Log::info('Order processed');  // Both include context

// Always include a correlation ID so log lines can be matched to traces.
// shareContext() (Laravel 9+) applies the context to every channel.
$correlationId = request()->header('X-Correlation-ID', (string) Str::uuid());
Log::shareContext(['correlation_id' => $correlationId]);
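To make "structured" concrete, here is a minimal framework-free sketch of what a JSON log line with shared context might look like. The function name `formatLogLine` and the field layout are assumptions for illustration, not Laravel's or Monolog's actual output format.

```php
<?php
declare(strict_types=1);

/**
 * Build one JSON log line. Shared context (for example a correlation ID)
 * is merged with per-call context so every line carries the same identifiers.
 */
function formatLogLine(string $level, string $message, array $shared, array $context = []): string
{
    $record = [
        'timestamp' => gmdate('Y-m-d\TH:i:s\Z'),
        'level'     => $level,
        'message'   => $message,
        'context'   => array_merge($shared, $context), // per-call keys win on conflict
    ];

    return json_encode($record, JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR);
}

// Hypothetical shared context, set once per request.
$shared = ['correlation_id' => 'abc-123', 'user_id' => 123];

echo formatLogLine('info', 'Processing order', $shared, ['order_id' => 456]), PHP_EOL;
echo formatLogLine('info', 'Order processed', $shared, ['order_id' => 456]), PHP_EOL;
```

Because every line is a single JSON object, a log aggregator can filter on `context.correlation_id` instead of pattern-matching free text.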
CENTRALIZED LOGGING IN LARAVEL
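// config/logging.php
'channels' => [
    'cloudwatch' => [
        'driver' => 'custom',
        // CloudWatchLogger is a class you write; its __invoke() must return a Monolog logger
        'via' => App\Logging\CloudWatchLogger::class,
        'level' => env('LOG_LEVEL', 'error'),
    ],
    'paper-trail' => [
        'driver' => 'monolog',
        'handler' => \Monolog\Handler\SyslogUdpHandler::class,
        // Handler constructor arguments go in handler_with (the key is 'with' in older Laravel versions)
        'handler_with' => [
            'host' => env('PAPERTRAIL_URL'),
            'port' => env('PAPERTRAIL_PORT'),
        ],
    ],
],

# .env
LOG_CHANNEL=cloudwatch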

Metrics: How Often / How Much?

TYPES OF METRICS

Counter: Always increases (never decreases)
  • Total requests received
  • Total errors
  • Total orders placed

Gauge: Can go up and down
  • Current memory usage
  • Active users
  • Queue size

Histogram: Distribution of values
  • Request duration (p50, p95, p99)
  • Database query time
  • Response size

Timer: Specialized histogram for durations
  • API call latency
  • Job processing time
  • Cache hit/miss latency
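The difference between these types is easiest to see in code. The following is a minimal in-memory sketch, not a Prometheus client: the class names and the bucket bounds are assumptions chosen for illustration. A timer would simply be a histogram fed with durations.

```php
<?php
declare(strict_types=1);

// Counter: a value that only ever goes up.
final class Counter
{
    private float $value = 0.0;

    public function inc(float $by = 1.0): void
    {
        if ($by < 0) {
            throw new InvalidArgumentException('A counter can only increase.');
        }
        $this->value += $by;
    }

    public function value(): float
    {
        return $this->value;
    }
}

// Gauge: a value that can be set to anything at any time.
final class Gauge
{
    private float $value = 0.0;

    public function set(float $value): void
    {
        $this->value = $value;
    }

    public function value(): float
    {
        return $this->value;
    }
}

// Histogram: counts observations into cumulative "less than or equal" buckets.
final class Histogram
{
    /** @var int[] cumulative count per upper bound, same order as $bounds */
    private array $counts;
    private int $total = 0;

    /** @param float[] $bounds ascending upper bounds */
    public function __construct(private array $bounds)
    {
        $this->counts = array_fill(0, count($bounds), 0);
    }

    public function observe(float $value): void
    {
        foreach ($this->bounds as $i => $upperBound) {
            if ($value <= $upperBound) {
                $this->counts[$i]++;
            }
        }
        $this->total++;
    }

    /** @return int[] */
    public function cumulativeCounts(): array
    {
        return $this->counts;
    }

    public function total(): int
    {
        return $this->total;
    }
}

$requests = new Counter();
$requests->inc();
$requests->inc();

$queueSize = new Gauge();
$queueSize->set(42);
$queueSize->set(7); // gauges may go down

// Hypothetical bucket bounds in seconds.
$latency = new Histogram([0.1, 0.5, 1.0, 5.0]);
foreach ([0.05, 0.2, 0.2, 4.8] as $seconds) {
    $latency->observe($seconds);
}

print_r($latency->cumulativeCounts());
```

Cumulative buckets are what make percentile estimates cheap: the monitoring system only needs the per-bucket counts, not every raw observation.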
METRICS IN LARAVEL (with Prometheus)
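// Install package
composer require ssmdd/laravel-prometheus-exporter

// Create a metric. The facade calls below are shown as this tutorial presents them;
// check them against the package version you install.
use Prometheus\Facades\Prometheus;

// Counter. Label with the route pattern (e.g. api/order/{id}), not the raw path,
// so that each distinct URL does not create its own time series.
Prometheus::counter('http_requests_total', 'Total HTTP requests')
    ->labels(['method' => $request->method(), 'route' => $request->route()?->uri() ?? 'unmatched'])
    ->inc();

// Histogram (request duration). LARAVEL_START is defined in public/index.php.
$duration = microtime(true) - LARAVEL_START;
Prometheus::histogram('http_request_duration_seconds', 'Request duration in seconds')
    ->labels(['method' => $request->method()])
    ->observe($duration);

// Gauge (active users)
Prometheus::gauge('active_users', 'Currently active users')
    ->set(Cache::get('active_users_count', 0));

// Export metrics endpoint (restrict access to it in production)
Route::get('/metrics', fn() => Prometheus::render());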
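It can help to know what a `/metrics` endpoint actually returns. The sketch below renders a counter in the Prometheus text exposition format by hand; the function name `renderCounter` and the sample data are assumptions, and a real exporter package would do this for you.

```php
<?php
declare(strict_types=1);

/**
 * Render one counter in the Prometheus text exposition format.
 *
 * @param array<int, array{0: array<string, string>, 1: int|float}> $samples
 *        list of [labels, value] pairs
 */
function renderCounter(string $name, string $help, array $samples): string
{
    $lines = ["# HELP {$name} {$help}", "# TYPE {$name} counter"];

    foreach ($samples as [$labels, $value]) {
        $pairs = [];
        foreach ($labels as $key => $labelValue) {
            // Label values must escape backslash, double quote, and newline.
            $escaped = str_replace(['\\', '"', "\n"], ['\\\\', '\\"', '\\n'], (string) $labelValue);
            $pairs[] = "{$key}=\"{$escaped}\"";
        }
        $labelPart = $pairs === [] ? '' : '{' . implode(',', $pairs) . '}';
        $lines[] = $name . $labelPart . ' ' . $value;
    }

    return implode("\n", $lines) . "\n";
}

echo renderCounter('http_requests_total', 'Total HTTP requests', [
    [['method' => 'GET', 'status' => '200'], 5],
    [['method' => 'POST', 'status' => '500'], 1],
]);
```

Prometheus scrapes this text on a schedule and stores each label combination as its own time series, which is why unbounded label values (user IDs, raw URLs) are best avoided.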
CRITICAL METRICS FOR LARAVEL
Metric | Why It Matters
HTTP request duration (p95, p99) | User experience threshold
Database query count per request | N+1 detection
Error rate (500s per minute) | Service health
Queue size and processing time | Background job health
Cache hit ratio | Cache effectiveness
PHP-FPM active processes | Server capacity
Redis memory usage | Cache health
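Two of these metrics are simple calculations once the raw numbers exist. The sketch below computes percentiles with the nearest-rank method and a cache hit ratio; the function names and the sample durations are made up for illustration, and monitoring systems typically estimate percentiles from histogram buckets rather than raw samples.

```php
<?php
declare(strict_types=1);

/**
 * Nearest-rank percentile: the smallest sample such that at least
 * $p percent of all samples are less than or equal to it.
 *
 * @param array<int|float> $samples
 */
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p < 0 || $p > 100) {
        throw new InvalidArgumentException('Need at least one sample and 0 <= p <= 100.');
    }
    sort($samples);
    $rank = max(1, (int) ceil($p * count($samples) / 100));

    return (float) $samples[$rank - 1];
}

/** Cache hit ratio as a fraction between 0 and 1. */
function hitRatio(int $hits, int $misses): float
{
    $total = $hits + $misses;

    return $total === 0 ? 0.0 : $hits / $total;
}

// 20 hypothetical request durations in milliseconds, with one slow outlier.
$durationsMs = [
    120, 95, 110, 130, 105, 98, 102, 115, 125, 108,
    99, 101, 97, 112, 118, 104, 109, 111, 4800, 107,
];

printf("p50 = %.0f ms\n", percentile($durationsMs, 50));
printf("p95 = %.0f ms\n", percentile($durationsMs, 95));
printf("p99 = %.0f ms\n", percentile($durationsMs, 99));
printf("cache hit ratio = %.2f\n", hitRatio(900, 100));
```

With this sample, the single 4800 ms request only shows up at p99, which is one reason to track more than one percentile rather than an average.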

Tracing: Where Did Time Go?

DISTRIBUTED TRACING (Request Flow Across Services)

User Request: GET /api/order/123

Timeline:
[Frontend]
  ├─ Nginx (2ms)
  └─ Laravel (15ms)
       ├─ DB (10ms)
       ├─ Redis (2ms)
       └─ API (50ms)

WITHOUT tracing: You see "total 79ms" but don't know the API call took 50ms
WITH tracing: You see exactly where time is lost

COMPONENTS:
Trace: The entire request journey
Span: One operation within the trace (DB query, API call, etc.)
Trace ID: Unique ID for the entire request (shared across services)
Span ID: Unique ID for each operation
Parent Span ID: Links spans together
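The components above can be modelled in a few lines. This is a toy data model, not the OpenTelemetry API: the class `SpanRecord`, the IDs, and the choice of Nginx as the root span are assumptions, and the durations are the ones from the timeline, treated as time spent in each span itself.

```php
<?php
declare(strict_types=1);

final class SpanRecord
{
    public function __construct(
        public string $traceId,        // shared by every span of one request
        public string $spanId,         // unique per operation
        public ?string $parentSpanId,  // null marks the root span
        public string $name,
        public int $selfMs,            // time spent in this span itself
    ) {
    }
}

/** @param SpanRecord[] $spans */
function totalMs(array $spans): int
{
    return array_sum(array_map(fn (SpanRecord $s): int => $s->selfMs, $spans));
}

/** @param SpanRecord[] $spans non-empty list */
function slowest(array $spans): SpanRecord
{
    usort($spans, fn (SpanRecord $a, SpanRecord $b): int => $b->selfMs <=> $a->selfMs);

    return $spans[0];
}

/**
 * Render the trace as an indented tree by following parent span IDs.
 *
 * @param SpanRecord[] $spans
 */
function renderTree(array $spans, ?string $parentId = null, int $depth = 0): string
{
    $out = '';
    foreach ($spans as $span) {
        if ($span->parentSpanId === $parentId) {
            $out .= str_repeat('  ', $depth) . "{$span->name} ({$span->selfMs}ms)\n";
            $out .= renderTree($spans, $span->spanId, $depth + 1);
        }
    }

    return $out;
}

$traceId = 'trace-abc'; // hypothetical; real trace IDs are random
$spans = [
    new SpanRecord($traceId, 's1', null, 'Nginx', 2),
    new SpanRecord($traceId, 's2', 's1', 'Laravel', 15),
    new SpanRecord($traceId, 's3', 's2', 'DB', 10),
    new SpanRecord($traceId, 's4', 's2', 'Redis', 2),
    new SpanRecord($traceId, 's5', 's2', 'API', 50),
];

echo renderTree($spans);
printf("total: %dms, slowest: %s\n", totalMs($spans), slowest($spans)->name);
```

A tracing backend does essentially this: it collects spans from every service, groups them by trace ID, and rebuilds the tree from the parent links.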
TRACING IN LARAVEL (with OpenTelemetry)
// Install the OpenTelemetry SDK and the OTLP exporter (Jaeger can ingest OTLP directly)
composer require open-telemetry/sdk
composer require open-telemetry/exporter-otlp

// app/Http/Middleware/TracingMiddleware.php
use OpenTelemetry\API\Trace\SpanKind;
use OpenTelemetry\API\Trace\StatusCode;
use OpenTelemetry\API\Trace\TracerInterface;

class TracingMiddleware
{
    public function handle($request, $next)
    {
        // Assumes a TracerInterface binding is registered in a service provider
        $tracer = app(TracerInterface::class);

        $span = $tracer->spanBuilder('HTTP ' . $request->method())
            ->setSpanKind(SpanKind::KIND_SERVER)
            ->startSpan();
        $span->setAttribute('http.method', $request->method());
        $span->setAttribute('http.url', $request->fullUrl());

        // Make this span the active one so nested spans attach to it
        $scope = $span->activate();

        try {
            $response = $next($request);
            $span->setAttribute('http.status_code', $response->getStatusCode());

            return $response;
        } catch (\Throwable $e) {
            $span->recordException($e);
            $span->setStatus(StatusCode::STATUS_ERROR, $e->getMessage());

            throw $e;
        } finally {
            // Always detach the scope and end the span, even when the request throws
            $scope->detach();
            $span->end();
        }
    }
}

// Manual tracing in code
use OpenTelemetry\API\Trace\Span;

$tracer = app(TracerInterface::class);

$span = Span::getCurrent();
$span->setAttribute('user.id', $user->id);

// Create nested span for database query.
// The currently active span becomes the parent by default, so setParent() is not needed.
$dbSpan = $tracer->spanBuilder('DB Query')->startSpan();

try {
    $users = DB::table('users')->get();
    $dbSpan->setAttribute('db.statement', 'SELECT * FROM users');
} finally {
    $dbSpan->end();
}

Correlation IDs: Connecting Logs, Metrics, and Traces

WITHOUT CORRELATION ID:

Log: "Payment failed"
Log: "User 123"
Log: "Order 456"

→ Are they related? Unknown. Debugging nightmare.

WITH CORRELATION ID:

Log: [correlation-id=abc-123] "Processing payment for user 123"
Log: [correlation-id=abc-123] "Calling payment gateway..."
Log: [correlation-id=abc-123] "Payment gateway returned error: insufficient funds"

Trace: [correlation-id=abc-123] Total time: 250ms
       [correlation-id=abc-123] ├─ Payment gateway API: 200ms
       [correlation-id=abc-123] └─ Database update: 5ms

→ Everything connected. Easy debugging.
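To illustrate why a shared ID makes debugging easier, here is a small sketch that groups JSON log lines by correlation ID. The function name and the `correlation_id` and `message` field names are assumptions; a log platform would do the same thing with a query.

```php
<?php
declare(strict_types=1);

/**
 * @param string[] $jsonLines one JSON object per line
 * @return array<string, string[]> messages keyed by correlation ID
 */
function groupByCorrelationId(array $jsonLines): array
{
    $groups = [];
    foreach ($jsonLines as $line) {
        $record = json_decode($line, true);
        if (!is_array($record)) {
            continue; // skip lines that are not valid JSON objects
        }
        $id = (string) ($record['correlation_id'] ?? 'unknown');
        $groups[$id][] = (string) ($record['message'] ?? '');
    }

    return $groups;
}

// Interleaved lines from two different requests.
$lines = [
    '{"correlation_id":"abc-123","message":"Processing payment for user 123"}',
    '{"correlation_id":"def-456","message":"User registered"}',
    '{"correlation_id":"abc-123","message":"Calling payment gateway..."}',
    '{"correlation_id":"abc-123","message":"Payment gateway returned error: insufficient funds"}',
];

print_r(groupByCorrelationId($lines)['abc-123']);
```

Even though the two requests are interleaved in the log stream, the three payment messages come back together and in order.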
CORRELATION ID IN LARAVEL
// app/Http/Middleware/CorrelationIdMiddleware.php
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Str;

class CorrelationIdMiddleware
{
    public function handle($request, $next)
    {
        // Reuse the caller's ID if present; otherwise start a new one
        $correlationId = $request->header('X-Correlation-Id', (string) Str::uuid());

        // Share with logging
        Log::shareContext(['correlation_id' => $correlationId]);

        // Share with application
        app()->instance('correlation-id', $correlationId);

        $response = $next($request);

        // headers->set() is available on every response type, including streamed and file responses
        $response->headers->set('X-Correlation-Id', $correlationId);

        return $response;
    }
}

// Usage in code
$correlationId = app('correlation-id');
Log::info("Processing order {$orderId}");  // the shared context already adds correlation_id

// Pass to external services
Http::withHeaders([
    'X-Correlation-Id' => $correlationId
])->post('https://api.payment.com', $data);
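The decision the middleware makes (reuse the caller's ID or create a new one) can be sketched without the framework. The function below also rejects values that would be unsafe to write into logs; the name `resolveCorrelationId` and the allowed character set are assumptions, not a standard.

```php
<?php
declare(strict_types=1);

/**
 * Return the caller's correlation ID if it is present and log-safe,
 * otherwise generate a fresh one.
 *
 * @param array<string, mixed> $headers incoming request headers
 */
function resolveCorrelationId(array $headers): string
{
    foreach ($headers as $name => $value) {
        $isMatch = strcasecmp((string) $name, 'X-Correlation-Id') === 0; // header names are case-insensitive
        if ($isMatch && is_string($value) && preg_match('/\A[A-Za-z0-9._-]{1,64}\z/', $value) === 1) {
            return $value;
        }
    }

    return bin2hex(random_bytes(16)); // 32 hex characters
}

echo resolveCorrelationId(['x-correlation-id' => 'abc-123']), PHP_EOL; // reuses the incoming ID
echo resolveCorrelationId([]), PHP_EOL;                                // generates a new ID
```

Validating the incoming value is a design choice: the header comes from outside the application, so accepting arbitrary content would let a caller inject line breaks or very long strings into every log entry.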

Laravel Telescope (Local/Staging Observability)

INSTALL AND CONFIGURE
composer require laravel/telescope --dev
php artisan telescope:install
php artisan migrate

// config/telescope.php
'enabled' => env('TELESCOPE_ENABLED', true),

# .env (development)
TELESCOPE_ENABLED=true

# .env (production): keep disabled on high-traffic sites
TELESCOPE_ENABLED=false
WHAT TELESCOPE SHOWS
⚠️ TELESCOPE IN PRODUCTION

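By default, Telescope writes an entry to the database for every request, query, job, and other event it watches. Under high production traffic this fills the database and degrades performance. Use Telescope only in development and staging. If you must run it in production, restrict what it records with a filter callback in TelescopeServiceProvider (for example, sample about 1% of requests) and prune old entries regularly with php artisan telescope:prune.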
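One way to implement a "sample 1% of requests" rule is to hash a request identifier into a bucket, so the decision is stable for a given request. This is a framework-free sketch: the function name `shouldRecord` and the `req-N` identifiers are hypothetical, and wiring it into Telescope's filter callback is left out.

```php
<?php
declare(strict_types=1);

/**
 * Decide deterministically whether a request falls into the sampled fraction.
 * The same ID always yields the same answer, so all entries of one request
 * are either kept or dropped together.
 */
function shouldRecord(string $requestId, float $rate): bool
{
    if ($rate <= 0.0) {
        return false;
    }
    if ($rate >= 1.0) {
        return true;
    }

    // Map the ID to a stable bucket in 0..9999 using the first 28 bits of a SHA-256 hash.
    $bucket = hexdec(substr(hash('sha256', $requestId), 0, 7)) % 10000;

    return $bucket < (int) round($rate * 10000);
}

$sampled = 0;
for ($i = 0; $i < 100000; $i++) {
    if (shouldRecord("req-{$i}", 0.01)) {
        $sampled++;
    }
}

printf("recorded %d of 100000 requests\n", $sampled);
```

Hash-based sampling is preferable to a random draw per entry when several records belong to the same request, because a random draw would keep some of them and drop others.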

Complete Open-Source Observability Stacks

LGTM Stack (Grafana)

  • Loki: Logging
  • Grafana: Visualization
  • Tempo: Tracing
  • Mimir: Metrics
  • Best for: Grafana users, single pane of glass

ELK Stack

  • Elasticsearch: Storage
  • Logstash: Processing
  • Kibana: Visualization
  • Plus: APM for tracing, Beats for metrics
  • Best for: Logging-first observability
DOCKER-COMPOSE FOR LGTM STACK
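# docker-compose.yml
# Skeleton only: a working stack also needs configuration files for the backends
# (Tempo, for instance, expects one passed with -config.file), and pinned image
# tags are safer than :latest.
version: '3'
services:
  loki:
    image: grafana/loki:latest
  tempo:
    image: grafana/tempo:latest
  mimir:
    image: grafana/mimir:latest
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Feature toggles vary by Grafana version; confirm this one applies to yours
      - GF_FEATURE_TOGGLES_ENABLE=tempoLokiSearch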

Key Metrics Every Laravel App Should Monitor

Category | Metric | Alert Threshold
Application | Response time (p95) | > 500ms
Application | Error rate (5xx) | > 1%
Application | Requests per second (RPS) | Monitor baseline
Database | Slow queries (> 100ms) | > 10 per minute
Database | Connection count | > 80% of max
Database | Replication lag | > 5 seconds
Queue | Queue size | > 1000
Queue | Failed jobs | > 5 per minute
Queue | Job processing time (p95) | > 60 seconds
Infra | CPU usage | > 80%
Infra | Memory usage | > 90%
Infra | Disk usage | > 85%
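A threshold table like this translates directly into an alert check. The sketch below compares current readings against upper limits; the metric key names and the sample readings are made up for illustration, and in practice this logic usually lives in the monitoring system's alert rules rather than in application code.

```php
<?php
declare(strict_types=1);

/**
 * @param array<string, int|float> $metrics    current readings
 * @param array<string, int|float> $thresholds upper limits
 * @return string[] names of metrics that exceed their limit
 */
function breachedAlerts(array $metrics, array $thresholds): array
{
    $breached = [];
    foreach ($thresholds as $name => $limit) {
        if (array_key_exists($name, $metrics) && $metrics[$name] > $limit) {
            $breached[] = $name;
        }
    }

    return $breached;
}

// Limits taken from the table above; key names are hypothetical.
$thresholds = [
    'response_time_p95_ms'   => 500,
    'error_rate_5xx_percent' => 1,
    'slow_queries_per_min'   => 10,
    'queue_size'             => 1000,
    'cpu_percent'            => 80,
    'memory_percent'         => 90,
    'disk_percent'           => 85,
];

// Hypothetical current readings.
$current = [
    'response_time_p95_ms'   => 620,
    'error_rate_5xx_percent' => 0.4,
    'queue_size'             => 1500,
    'cpu_percent'            => 80,
];

print_r(breachedAlerts($current, $thresholds));
```

Note that a reading exactly at the limit (CPU at 80%) does not fire, matching the strict ">" in the table. Metrics without a reading are skipped rather than treated as healthy or breached.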

Topic 29 Summary: Observability

Pillar | Purpose | Tools | Cost
Logging | Debugging, audits, security | ELK, Loki, CloudWatch | Storage & ingestion
Metrics | Trends, alerts, capacity planning | Prometheus, Datadog | Storage & query
Tracing | Performance bottlenecks, dependencies | Jaeger, Tempo, Zipkin | Sampling (1-10% of traffic)

THE RULE: Logging tells you what happened. Metrics tell you how often. Tracing tells you where time went. You need all three to truly understand your system. Start with logs, add metrics, then add tracing when you need to debug distributed performance issues.
NEXT TOPIC PREVIEW

Topic 30: Zero-Trust Architecture. Trust nothing, verify everything. How to build systems that don't rely on internal network security. The future of cloud-native applications.