Volume IV: Distributed Systems
Topic 29: Observability
The Three Pillars: Logging, Metrics, Tracing
"You cannot improve what you cannot measure.
Logs tell you WHAT happened.
Metrics tell you HOW OFTEN it happens.
Traces tell you WHERE time was lost.
Observability is the foundation of performance tuning."
⚠️ THE OBSERVABILITY GAP
Many Laravel applications rely on logs alone. Logs record that something went wrong, but they rarely show why, and they say little about trends, patterns, or where time is lost across microservices. Observability in the full sense rests on three pillars: logging, metrics, and tracing. Without all three, you are flying blind.
The Three Pillars of Observability
THE THREE PILLARS
───────────────────────────────────────────────────────────────────
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐      │
│  │   LOGGING   │      │   METRICS   │      │   TRACING   │      │
│  │             │      │             │      │             │      │
│  │  "User 123  │      │    "500     │      │  "Request   │      │
│  │   logged    │      │   errors    │      │  A → B →    │      │
│  │   in at     │      │    per      │      │  C → D      │      │
│  │   10:05"    │      │   minute"   │      │   12ms"     │      │
│  └─────────────┘      └─────────────┘      └─────────────┘      │
│                                                                 │
│  WHAT happened?       HOW OFTEN?           WHERE did time go?   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
EXAMPLE: User reports "page is slow"
┌─────────────────────────────────────────────────────────────────┐
│ Logs:    Show errors, warnings, user actions                    │
│ Metrics: Show 95th percentile latency spiked to 5 seconds       │
│ Traces:  Show that database query X took 4.8 seconds            │
│                                                                 │
│ → You fix the slow query. Problem solved.                       │
└─────────────────────────────────────────────────────────────────┘
WITHOUT tracing: You know the page is slow, but not why.
WITHOUT metrics: You don't know it's getting worse.
WITHOUT logs: You can't see the error messages.
| Pillar | Question Answered | Example | Tools |
| Logging | What happened? (discrete events) | "User 123 failed to process order" | Laravel Logs, ELK Stack, Loki |
| Metrics | How often / how much? (aggregated) | "500 errors increased to 50 per minute" | Prometheus, Datadog, New Relic |
| Tracing | Where did time go? (distributed request flow) | "Database query took 4.8 seconds out of 5s total" | Jaeger, Zipkin, OpenTelemetry |
Logging: What Happened?
BAD LOGGING PATTERNS
- Logging inside loops (10,000 iterations means 10,000 I/O operations)
- Logging sensitive data (passwords, tokens)
- Inconsistent log formats (hard to parse)
- No correlation IDs across requests
- Logging to local files in distributed systems
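Two of the patterns above have simple counterparts: mask sensitive fields before they reach the logger, and summarize a loop in one entry instead of logging every iteration. The following is a minimal sketch in plain PHP. The helper name `redactContext` and the list of sensitive keys are assumptions made for illustration, not part of Laravel.

```php
<?php

declare(strict_types=1);

/**
 * Replace the values of sensitive keys before a context array is logged.
 * Hypothetical helper for illustration; the default key list is an assumption.
 */
function redactContext(array $context, array $sensitiveKeys = ['password', 'token', 'secret']): array
{
    foreach ($context as $key => $value) {
        if (in_array(strtolower((string) $key), $sensitiveKeys, true)) {
            // Mask the whole value, even if it is itself an array.
            $context[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            // Recurse so nested payloads are masked as well.
            $context[$key] = redactContext($value, $sensitiveKeys);
        }
    }

    return $context;
}

// Instead of one log entry per item, count outcomes and log a single summary.
$items = [['id' => 1, 'ok' => true], ['id' => 2, 'ok' => false], ['id' => 3, 'ok' => true]];
$summary = ['processed' => 0, 'failed' => 0];
foreach ($items as $item) {
    $summary[$item['ok'] ? 'processed' : 'failed']++;
}

$safe = redactContext(['user_id' => 123, 'password' => 'hunter2', 'meta' => ['token' => 'abc']]);

// In a Laravel app these arrays would be passed as the context of one Log::info() call.
echo json_encode(['summary' => $summary, 'context' => $safe]), PHP_EOL;
```

The summary replaces thousands of per-item writes with one structured entry, and the redaction step runs once at the boundary rather than relying on every call site to remember it.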
GOOD LOGGING PATTERNS
// Structured logging (JSON) - machine readable
Log::channel('stack')->info('User registered', [
    'user_id' => $user->id,
    'ip' => $request->ip(),
    'user_agent' => $request->userAgent(),
    'correlation-id' => $this->correlationId,
]);
// With context (Log::withContext; Laravel 11+ also offers the Context facade)
Log::withContext([
    'user_id' => auth()->id(),
    'correlation-id' => $this->correlationId,
]);
Log::info('Processing order');
Log::info('Order processed'); // Both include context
// Always include a correlation ID so entries can be joined across services
$correlationId = request()->header('X-Correlation-ID') ?? (string) Str::uuid();
Log::shareContext(['correlation-id' => $correlationId]); // shared across all channels
CENTRALIZED LOGGING IN LARAVEL
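# config/logging.php
'channels' => [
    'cloudwatch' => [
        'driver' => 'custom',
        // CloudWatchLogger is an application class you provide; its __invoke() must return a Monolog logger
        'via' => App\Logging\CloudWatchLogger::class,
        'level' => env('LOG_LEVEL', 'error'),
    ],
    'paper-trail' => [
        'driver' => 'monolog',
        'handler' => \Monolog\Handler\SyslogUdpHandler::class,
        // Constructor arguments for the handler go under handler_with
        'handler_with' => [
            'host' => env('PAPERTRAIL_URL'),
            'port' => env('PAPERTRAIL_PORT'),
        ],
    ],
],
# .env
LOG_CHANNEL=cloudwatch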
Metrics: How Often / How Much?
TYPES OF METRICS
───────────────────────────────────────────────────────────────────
Counter: Always increases (never decreases)
┌─────────────────────────────────────────────────────────────────┐
│ • Total requests received                                       │
│ • Total errors                                                  │
│ • Total orders placed                                           │
└─────────────────────────────────────────────────────────────────┘
Gauge: Can go up and down
┌─────────────────────────────────────────────────────────────────┐
│ • Current memory usage                                          │
│ • Active users                                                  │
│ • Queue size                                                    │
└─────────────────────────────────────────────────────────────────┘
Histogram: Distribution of values
┌─────────────────────────────────────────────────────────────────┐
│ • Request duration (p50, p95, p99)                              │
│ • Database query time                                           │
│ • Response size                                                 │
└─────────────────────────────────────────────────────────────────┘
Timer: Specialized histogram for durations
┌─────────────────────────────────────────────────────────────────┐
│ • API call latency                                              │
│ • Job processing time                                           │
│ • Cache hit/miss latency                                        │
└─────────────────────────────────────────────────────────────────┘
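The difference between these metric types is easiest to see in a few lines of plain PHP. The sketch below is not tied to any metrics library. `percentile` is a hypothetical helper using the nearest-rank method, and the sample durations are invented for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Nearest-rank percentile: the smallest sample such that at least
 * $p percent of all samples are less than or equal to it.
 */
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('At least one sample is required.');
    }
    sort($samples);
    // Multiply before dividing to avoid rounding artifacts such as 0.7 * 10.
    $rank = max(1, (int) ceil($p * count($samples) / 100));

    return (float) $samples[$rank - 1];
}

// Counter: only ever incremented.
$requestsTotal = 0;
// Gauge: a current value that can rise and fall.
$queueSize = 0;
// Histogram/timer input: individual observations (request durations in ms).
$durationsMs = [];

foreach ([12, 15, 11, 14, 13, 16, 12, 18, 950, 17] as $durationMs) {
    $requestsTotal++;             // one more request served
    $durationsMs[] = $durationMs; // record the observation
}
$queueSize += 3; // three jobs pushed
$queueSize -= 2; // two jobs finished

printf(
    "requests=%d queue=%d p50=%.0fms p95=%.0fms avg=%.1fms\n",
    $requestsTotal,
    $queueSize,
    percentile($durationsMs, 50),
    percentile($durationsMs, 95),
    array_sum($durationsMs) / count($durationsMs)
);
```

With these samples the average is 107.8 ms while the median is 14 ms: a single 950 ms outlier dominates the mean, which is why latency is usually reported as percentiles rather than averages.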
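METRICS IN LARAVEL (with Prometheus)
// Install a Prometheus client (this example uses promphp/prometheus_client_php)
composer require promphp/prometheus_client_php
// Create a registry (InMemory is for demos only; under PHP-FPM use the Redis or APCu adapter)
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\InMemory;
$registry = new CollectorRegistry(new InMemory());
// Counter (raw paths can create very many label values; prefer route patterns where possible)
$registry->getOrRegisterCounter('app', 'http_requests_total', 'Total HTTP requests', ['method', 'path'])
    ->inc([$request->method(), $request->path()]);
// Histogram (request duration)
$duration = microtime(true) - LARAVEL_START;
$registry->getOrRegisterHistogram('app', 'http_request_duration_seconds', 'Request duration in seconds', ['method'])
    ->observe($duration, [$request->method()]);
// Gauge (active users)
$registry->getOrRegisterGauge('app', 'active_users', 'Currently active users')
    ->set(Cache::get('active_users_count', 0));
// Export metrics endpoint (metric names are prefixed with the namespace, e.g. app_http_requests_total)
Route::get('/metrics', fn () => response(
    (new RenderTextFormat())->render($registry->getMetricFamilySamples()),
    200,
    ['Content-Type' => RenderTextFormat::MIME_TYPE]
));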
CRITICAL METRICS FOR LARAVEL
| Metric | Why It Matters |
| HTTP request duration (p95, p99) | User experience threshold |
| Database query count per request | N+1 detection |
| Error rate (500s per minute) | Service health |
| Queue size and processing time | Background job health |
| Cache hit ratio | Cache effectiveness |
| PHP-FPM active processes | Server capacity |
| Redis memory usage | Cache health |
Tracing: Where Did Time Go?
DISTRIBUTED TRACING (Request Flow Across Services)
───────────────────────────────────────────────────────────────────
User Request: GET /api/order/123
Timeline:
┌─────────────────────────────────────────────────────────────────┐
│ [Frontend]                                                      │
│   └─ Nginx (2ms)                                                │
│      └─ Laravel (15ms)                                          │
│         ├─ DB (10ms)                                            │
│         ├─ Redis (2ms)                                          │
│         └─ API (50ms)                                           │
└─────────────────────────────────────────────────────────────────┘
WITHOUT tracing: You see "total 79ms" but don't know the API call took 50ms
WITH tracing: You see exactly where time is lost
COMPONENTS:
───────────────────────────────────────────────────────────────────
Trace: The entire request journey
Span: One operation within the trace (DB query, API call, etc.)
Trace ID: Unique ID for the entire request (shared across services)
Span ID: Unique ID for each operation
Parent Span ID: Links spans together
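To make these components concrete, the sketch below models the spans from the timeline above as plain PHP arrays. It is not the OpenTelemetry API: `newSpan` and the array layout are hypothetical, and each duration is treated as the span's own time, an assumption that matches the 79ms total quoted above.

```php
<?php

declare(strict_types=1);

/** Build a span record; ID sizes follow W3C Trace Context (16-byte trace ID, 8-byte span ID). */
function newSpan(string $traceId, ?string $parentSpanId, string $name, int $durationMs): array
{
    return [
        'trace_id'       => $traceId,
        'span_id'        => bin2hex(random_bytes(8)),
        'parent_span_id' => $parentSpanId, // null marks the root span
        'name'           => $name,
        'duration_ms'    => $durationMs,
    ];
}

$traceId = bin2hex(random_bytes(16)); // shared by every span in the request

$nginx   = newSpan($traceId, null, 'Nginx', 2);
$laravel = newSpan($traceId, $nginx['span_id'], 'Laravel', 15);
$spans   = [
    $nginx,
    $laravel,
    newSpan($traceId, $laravel['span_id'], 'DB', 10),
    newSpan($traceId, $laravel['span_id'], 'Redis', 2),
    newSpan($traceId, $laravel['span_id'], 'API', 50),
];

// Total time and the span that contributed most of it.
$totalMs = array_sum(array_column($spans, 'duration_ms'));
usort($spans, fn (array $a, array $b): int => $b['duration_ms'] <=> $a['duration_ms']);
$slowest = $spans[0];

printf("total=%dms slowest=%s (%dms)\n", $totalMs, $slowest['name'], $slowest['duration_ms']);
```

The parent span ID is what lets a tracing backend rebuild the tree, and the shared trace ID is what lets it collect spans emitted by different services into one request.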
TRACING IN LARAVEL (with OpenTelemetry)
// Install the OpenTelemetry SDK and an OTLP exporter (Jaeger accepts OTLP directly)
composer require open-telemetry/sdk open-telemetry/exporter-otlp
// app/Http/Middleware/TracingMiddleware.php
use OpenTelemetry\API\Trace\SpanKind;
use OpenTelemetry\API\Trace\TracerInterface;
class TracingMiddleware
{
    public function handle($request, $next)
    {
        // Assumes a TracerInterface binding has been registered in a service provider
        $tracer = app(TracerInterface::class);
        $span = $tracer->spanBuilder('HTTP ' . $request->method())
            ->setSpanKind(SpanKind::KIND_SERVER)
            ->startSpan();
        $scope = $span->activate();
        try {
            $response = $next($request);
            $span->setAttribute('http.method', $request->method());
            $span->setAttribute('http.url', $request->fullUrl());
            $span->setAttribute('http.status_code', $response->getStatusCode());
            return $response;
        } finally {
            // Always restore the previous context and end the span, even if an exception is thrown
            $scope->detach();
            $span->end();
        }
    }
}
// Manual tracing in code
use OpenTelemetry\API\Trace\Span;
$tracer = app(TracerInterface::class);
$span = Span::getCurrent();
$span->setAttribute('user.id', $user->id);
// Create nested span for database query; the currently active span becomes its parent
$dbSpan = $tracer->spanBuilder('DB Query')->startSpan();
try {
    $users = DB::table('users')->get();
    $dbSpan->setAttribute('db.statement', 'SELECT * FROM users');
} finally {
    $dbSpan->end();
}
Correlation IDs: Connecting Logs, Metrics, and Traces
WITHOUT CORRELATION ID:
───────────────────────────────────────────────────────────────────
Log: "Payment failed"
Log: "User 123"
Log: "Order 456"
→ Are they related? Unknown. Debugging nightmare.
WITH CORRELATION ID:
───────────────────────────────────────────────────────────────────
Log: [correlation-id=abc-123] "Processing payment for user 123"
Log: [correlation-id=abc-123] "Calling payment gateway..."
Log: [correlation-id=abc-123] "Payment gateway returned error: insufficient funds"
Trace: [correlation-id=abc-123] Total time: 250ms
       [correlation-id=abc-123] ├─ Payment gateway API: 200ms
       [correlation-id=abc-123] └─ Database update: 5ms
→ Everything connected. Easy debugging.
CORRELATION ID IN LARAVEL
// app/Http/Middleware/CorrelationIdMiddleware.php
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Str;
class CorrelationIdMiddleware
{
    public function handle($request, $next)
    {
        // Reuse the caller's ID if one was sent; otherwise start a new one
        $correlationId = $request->header('X-Correlation-Id') ?: (string) Str::uuid();
        // Share with logging
        Log::shareContext(['correlation-id' => $correlationId]);
        // Share with application
        app()->instance('correlation-id', $correlationId);
        $response = $next($request);
        // headers->set() works on every response type, including streamed and file responses
        $response->headers->set('X-Correlation-Id', $correlationId);
        return $response;
    }
}
// Usage in code
$correlationId = app('correlation-id');
Log::info("Processing order {$orderId}"); // correlation-id is already attached via shareContext
// Pass to external services
Http::withHeaders([
    'X-Correlation-Id' => $correlationId,
])->post('https://api.payment.com', $data);
Laravel Telescope (Local/Staging Observability)
INSTALL AND CONFIGURE
composer require laravel/telescope --dev
php artisan telescope:install
php artisan migrate
// config/telescope.php
'enabled' => env('TELESCOPE_ENABLED', true),
// .env (development)
TELESCOPE_ENABLED=true
// .env (production) - use sparingly
TELESCOPE_ENABLED=false # Never enable in high-traffic production
WHAT TELESCOPE SHOWS
- Requests: All HTTP requests with timing
- Queries: All database queries with bindings and timing (slow queries are tagged)
- Jobs: Queue job execution and failures
- Logs: All log entries
- Mail: Sent emails
- Notifications: Sent notifications
- Cache: Cache operations
- Redis: Redis commands
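⚠️ TELESCOPE IN PRODUCTION
Telescope records every watched entry in its storage driver, which is the database by default. Under high production traffic this can fill the database and degrade performance. Use Telescope in development and staging. If you need it in production, restrict what is recorded with Telescope::filter (for example, sample about 1% of requests) and prune old entries regularly with the telescope:prune command.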
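One way to approximate the 1% sampling mentioned above is a deterministic decision derived from a request identifier, so that all entries belonging to a sampled request are kept together. This is a sketch under assumptions: `shouldSample` is a hypothetical helper, not part of Telescope, and calling it from the `Telescope::filter` callback in `TelescopeServiceProvider` is only one possible place to wire it in.

```php
<?php

declare(strict_types=1);

/**
 * Decide whether a request should be recorded, based on a hash of its ID.
 * The same ID always yields the same decision. $rate is a fraction (0.01 = 1%).
 */
function shouldSample(string $requestId, float $rate): bool
{
    if ($rate <= 0.0) {
        return false;
    }
    if ($rate >= 1.0) {
        return true;
    }
    // Map the ID onto 10,000 buckets; the mask keeps the hash non-negative on any platform.
    $bucket = (crc32($requestId) & 0x7fffffff) % 10000;

    // Keep the lowest $rate share of the buckets.
    return $bucket < (int) round($rate * 10000);
}

// Rough check of the effective rate over synthetic IDs.
$kept = 0;
for ($i = 0; $i < 100000; $i++) {
    if (shouldSample('request-' . $i, 0.01)) {
        $kept++;
    }
}
printf("kept %d of 100000 (%.2f%%)\n", $kept, $kept / 1000);
```

Hashing an ID instead of calling a random number generator per entry means the query, log, and job entries of one request are either all recorded or all dropped, which keeps sampled requests readable.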
Complete Open-Source Observability Stacks
LGTM Stack (Grafana)
- Loki: Logging
- Grafana: Visualization
- Tempo: Tracing
- Mimir: Metrics
- Best for: Grafana users, single pane of glass
ELK Stack
- Elasticsearch: Storage
- Logstash: Processing
- Kibana: Visualization
- Plus: APM for tracing, Beats for metrics
- Best for: Logging-first observability
DOCKER-COMPOSE FOR LGTM STACK
# docker-compose.yml
version: '3'
services:
  loki:
    image: grafana/loki:latest
  tempo:
    image: grafana/tempo:latest
  mimir:
    image: grafana/mimir:latest
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=tempoLokiSearch
Key Metrics Every Laravel App Should Monitor
| Category | Metric | Alert Threshold |
| Application | Response time (p95) | > 500ms |
| Application | Error rate (5xx) | > 1% |
| Application | Requests per second (RPS) | Monitor baseline |
| Database | Slow queries (> 100ms) | > 10 per minute |
| Database | Connection count | > 80% of max |
| Database | Replication lag | > 5 seconds |
| Queue | Queue size | > 1000 |
| Queue | Failed jobs | > 5 per minute |
| Queue | Job processing time (p95) | > 60 seconds |
| Infra | CPU usage | > 80% |
| Infra | Memory usage | > 90% |
| Infra | Disk usage | > 85% |
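Thresholds like these are normally configured in the alerting tool itself rather than in application code. Purely to illustrate the comparison logic, here is a minimal sketch in plain PHP. The metric names, the readings, and the `findBreaches` helper are all hypothetical; only the limits come from the table above.

```php
<?php

declare(strict_types=1);

/**
 * Return the metrics whose current value exceeds its alert threshold.
 *
 * @param array<string, float> $thresholds metric name => upper limit
 * @param array<string, float> $current    metric name => latest reading
 * @return array<string, array{value: float, limit: float}>
 */
function findBreaches(array $thresholds, array $current): array
{
    $breaches = [];
    foreach ($thresholds as $metric => $limit) {
        // A metric with no reading is skipped here; real systems often alert on missing data too.
        if (isset($current[$metric]) && $current[$metric] > $limit) {
            $breaches[$metric] = ['value' => $current[$metric], 'limit' => $limit];
        }
    }

    return $breaches;
}

// Limits taken from the table above (ms, percent, count).
$thresholds = [
    'response_time_p95_ms' => 500.0,
    'error_rate_5xx_pct'   => 1.0,
    'queue_size'           => 1000.0,
    'cpu_usage_pct'        => 80.0,
];

// Made-up readings for illustration.
$current = [
    'response_time_p95_ms' => 620.0,
    'error_rate_5xx_pct'   => 0.4,
    'queue_size'           => 1500.0,
    'cpu_usage_pct'        => 71.0,
];

foreach (findBreaches($thresholds, $current) as $metric => $breach) {
    printf("ALERT %s: %.1f > %.1f\n", $metric, $breach['value'], $breach['limit']);
}
```

In practice an alert would also require the condition to hold for some duration before firing, so that a single noisy reading does not page anyone.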
Topic 29 Summary: Observability
| Pillar | Purpose | Tools | Cost |
| Logging | Debugging, audits, security | ELK, Loki, CloudWatch | Storage & ingestion |
| Metrics | Trends, alerts, capacity planning | Prometheus, Datadog | Storage & query |
| Tracing | Performance bottlenecks, dependencies | Jaeger, Tempo, Zipkin | Sampling (1-10% of traffic) |
THE RULE: Logging tells you what happened. Metrics tell you how often. Tracing tells you where time went. You need all three to truly understand your system. Start with logs, add metrics, then add tracing when you need to debug distributed performance issues.
NEXT TOPIC PREVIEW
Topic 30: Zero-Trust Architecture. Trust nothing, verify everything. How to build systems that don't rely on internal network security. The future of cloud-native applications.