Observability Implementation Plan: Grafana Stack on Railway

Overview

Add observability to Diskover using Loki (logs), Prometheus (metrics), and Grafana (visualization), deployed on Railway with local Docker Compose for development.

Key Design Principle: Fully stateless and reproducible - all Grafana dashboards, datasources, and alerts are provisioned from config files. Spin up locally with make observability and get a fully configured stack.


Phase 1: Backend Observability Package

1.1 Add Dependencies

File: backend/go.mod

github.com/prometheus/client_golang v1.19.0

1.2 Create Observability Package

New file: backend/internal/observability/logger.go
- Enhanced slog logger with service metadata (name, version, environment)
- Consistent labels for Loki indexing
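
One possible shape for the enhanced logger, as a minimal sketch using only the standard library's log/slog. The ServiceInfo type, the NewLogger and parseLevel names, and the attribute keys ("service", "version", "environment") are assumptions made for illustration; the plan does not fix them.

```go
package main

import (
	"io"
	"log/slog"
	"os"
)

// ServiceInfo holds the metadata stamped on every log record.
// The type and its field names are hypothetical.
type ServiceInfo struct {
	Name        string
	Version     string
	Environment string
}

// parseLevel maps a LOG_LEVEL-style string ("debug", "info", "warn",
// "error") to a slog level and falls back to info for unknown values.
func parseLevel(s string) slog.Level {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(s)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

// NewLogger returns a JSON logger that attaches the service metadata to
// every record, so the same keys are available for filtering in Loki.
func NewLogger(w io.Writer, info ServiceInfo, level string) *slog.Logger {
	h := slog.NewJSONHandler(w, &slog.HandlerOptions{Level: parseLevel(level)})
	return slog.New(h).With(
		slog.String("service", info.Name),
		slog.String("version", info.Version),
		slog.String("environment", info.Environment),
	)
}

func main() {
	info := ServiceInfo{Name: "diskover-api", Version: "dev", Environment: "development"}
	logger := NewLogger(os.Stdout, info, "info")
	logger.Info("server starting", "port", 8080)
}
```

Writing JSON to stdout keeps the backend unaware of Loki; Promtail can parse the fields and decide which of them become labels.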

New file: backend/internal/observability/metrics.go
- Prometheus metrics definitions:
  - diskover_http_requests_total (counter by method, path, status)
  - diskover_http_request_duration_seconds (histogram)
  - diskover_transactions_total (business metric)
  - diskover_db_connections_active (gauge)

New file: backend/internal/observability/middleware.go
- HTTP metrics middleware for Chi router
- Records request count and duration
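
The middleware described above could look roughly like the following sketch. It uses only the standard library so it can run on its own: the Recorder interface is a hypothetical seam, and a real implementation would back it with the Prometheus counter and histogram from metrics.go. The countRecorder type and demo function exist only to exercise the middleware here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Recorder is a hypothetical interface between the middleware and the
// metrics backend. A Prometheus-backed version would increment
// diskover_http_requests_total and observe
// diskover_http_request_duration_seconds.
type Recorder interface {
	ObserveRequest(method, path string, status int, d time.Duration)
}

// statusWriter remembers the status code written by the wrapped handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Metrics returns middleware with the func(http.Handler) http.Handler
// shape that Chi's Use method accepts.
func Metrics(rec Recorder) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()
			// Default to 200 because handlers may never call WriteHeader.
			sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
			next.ServeHTTP(sw, r)
			// r.URL.Path is used for brevity; labelling by route pattern
			// instead would keep label cardinality lower.
			rec.ObserveRequest(r.Method, r.URL.Path, sw.status, time.Since(start))
		})
	}
}

// countRecorder is an in-memory Recorder used only for this demonstration.
type countRecorder struct{ counts map[string]int }

func (c *countRecorder) ObserveRequest(method, path string, status int, _ time.Duration) {
	c.counts[fmt.Sprintf("%s %s %d", method, path, status)]++
}

// demo sends one GET /missing request through the middleware to a handler
// that answers 404, then returns how often the given key was recorded.
func demo(key string) int {
	rec := &countRecorder{counts: map[string]int{}}
	h := Metrics(rec)(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/missing", nil))
	return rec.counts[key]
}

func main() {
	fmt.Println(demo("GET /missing 404")) // prints 1
}
```

Keeping the middleware behind a small interface also makes it easy to honour MetricsEnabled: when metrics are off, the router can simply skip registering it.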

1.3 Update Configuration

File: backend/internal/config/config.go
- Add ObservabilityConfig struct:
  - MetricsEnabled (bool)
  - Environment (string)
  - LogLevel (string)
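
A minimal sketch of the struct and one way to populate it from the METRICS_ENABLED, ENVIRONMENT, and LOG_LEVEL variables listed later in this plan. The loadObservability function and the default values are assumptions; the existing config package may load values differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ObservabilityConfig mirrors the three fields named in the plan.
type ObservabilityConfig struct {
	MetricsEnabled bool
	Environment    string
	LogLevel       string
}

// loadObservability reads the settings through a lookup function (for
// example os.LookupEnv) so the parsing can be exercised without touching
// the real process environment. The defaults below are assumptions.
func loadObservability(lookup func(string) (string, bool)) ObservabilityConfig {
	cfg := ObservabilityConfig{
		MetricsEnabled: false,
		Environment:    "development",
		LogLevel:       "info",
	}
	if v, ok := lookup("METRICS_ENABLED"); ok {
		// Unparseable values leave the default in place.
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.MetricsEnabled = b
		}
	}
	if v, ok := lookup("ENVIRONMENT"); ok && v != "" {
		cfg.Environment = v
	}
	if v, ok := lookup("LOG_LEVEL"); ok && v != "" {
		cfg.LogLevel = v
	}
	return cfg
}

func main() {
	cfg := loadObservability(os.LookupEnv)
	fmt.Printf("%+v\n", cfg)
}
```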

1.4 Integrate with Application

File: backend/cmd/api/main.go
- Replace logger initialization with enhanced logger
- Pass environment/version to logger

File: backend/internal/api/router.go
- Add metrics middleware to Chi router
- Add /metrics endpoint using promhttp.Handler()
- Enhance logging middleware with request_id, query params, user_agent
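
The logging enhancement could be sketched as below, using only the standard library. The X-Request-ID header name, the RequestLogger and newRequestID names, and the log field keys are assumptions; Chi also ships its own middleware.RequestID, which may be preferable to generating IDs by hand.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"os"
)

// newRequestID returns 16 random hex characters.
func newRequestID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// RequestLogger logs one line per request with request_id, query, and
// user_agent. It reuses an incoming X-Request-ID header if present and
// echoes the ID back on the response.
func RequestLogger(logger *slog.Logger) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			id := r.Header.Get("X-Request-ID")
			if id == "" {
				id = newRequestID()
			}
			w.Header().Set("X-Request-ID", id)
			next.ServeHTTP(w, r)
			logger.Info("http request",
				"request_id", id,
				"method", r.Method,
				"path", r.URL.Path,
				"query", r.URL.RawQuery,
				"user_agent", r.UserAgent(),
			)
		})
	}
}

// echoedID runs one request through the middleware and returns the
// X-Request-ID header found on the response.
func echoedID(incoming string) string {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	h := RequestLogger(logger)(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/products?page=2", nil)
	if incoming != "" {
		req.Header.Set("X-Request-ID", incoming)
	}
	rr := httptest.NewRecorder()
	h.ServeHTTP(rr, req)
	return rr.Header().Get("X-Request-ID")
}

func main() {
	echoedID("") // logs one JSON line with a generated request_id
}
```

Logging the raw query string can leak tokens or personal data, so it may be worth redacting sensitive parameters before they reach Loki.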


Phase 2: Frontend Logging

2.1 Create Logger Service

New file: frontend/src/services/logger.ts
- Buffered log entries (flush every 5s or on 50 entries)
- Log levels: debug, info, warn, error
- Captures: URL, user agent, timestamps
- Sends logs to backend /api/logs endpoint

2.2 Integrate Logging

File: frontend/src/services/api-client.ts
- Log API requests (method, url, status, duration)
- Log API errors with context

File: frontend/src/main.tsx (or a new reportWebVitals.ts)
- Report web vitals (CLS, INP, FCP, LCP, TTFB) to logger

2.3 Error Boundary

New file: frontend/src/components/error-boundary.tsx
- React error boundary that logs errors to observability system

2.4 Backend Log Ingestion Endpoint

File: backend/internal/handlers/ (new handler)
- Add POST /api/logs endpoint to receive frontend logs
- Write to stdout in structured format for Loki pickup
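
A sketch of what the ingestion handler might look like, again with the standard library only. The frontendLog wire format, the 1 MiB body limit, the 204 response, and the LogsHandler and post names are all assumptions; the plan does not define the payload the frontend logger sends.

```go
package main

import (
	"encoding/json"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
)

// frontendLog is an assumed wire format for one buffered frontend entry.
type frontendLog struct {
	Level     string `json:"level"`
	Message   string `json:"message"`
	URL       string `json:"url"`
	UserAgent string `json:"userAgent"`
	Timestamp string `json:"timestamp"`
}

// LogsHandler accepts a JSON array of entries and re-emits each one
// through the structured logger, tagged with source=frontend.
func LogsHandler(logger *slog.Logger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// Cap the body so a misbehaving client cannot flood the pipeline.
		body := http.MaxBytesReader(w, r.Body, 1<<20)
		var entries []frontendLog
		if err := json.NewDecoder(body).Decode(&entries); err != nil {
			http.Error(w, "invalid log payload", http.StatusBadRequest)
			return
		}
		for _, e := range entries {
			var lvl slog.Level
			if err := lvl.UnmarshalText([]byte(e.Level)); err != nil {
				lvl = slog.LevelInfo // unknown levels are logged as info
			}
			logger.Log(r.Context(), lvl, e.Message,
				"source", "frontend",
				"url", e.URL,
				"user_agent", e.UserAgent,
				"client_time", e.Timestamp,
			)
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

// post sends body to the handler and returns the response status code.
func post(body string) int {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	rr := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/logs", strings.NewReader(body))
	LogsHandler(logger)(rr, req)
	return rr.Code
}

func main() {
	post(`[{"level":"error","message":"render failed","url":"/cart"}]`)
}
```

Because this endpoint lets browsers write into the backend's log stream, rate limiting or authentication is worth considering before it is exposed in production.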


Phase 3: Grafana Provisioning (Stateless Dashboards)

3.1 Directory Structure

observability/
├── grafana/
│   ├── Dockerfile                    # Custom image for Railway
│   └── provisioning/
│       ├── datasources/
│       │   └── datasources.yaml      # Auto-configure Prometheus + Loki
│       ├── dashboards/
│       │   └── dashboards.yaml       # Dashboard provider config
│       └── dashboards-json/
│           ├── diskover-overview.json     # Main overview dashboard
│           ├── diskover-http.json         # HTTP metrics dashboard
│           ├── diskover-logs.json         # Log explorer dashboard
│           └── diskover-business.json     # Business metrics dashboard
├── loki/
│   └── loki-config.yaml
├── promtail/
│   └── promtail-config.yaml
└── prometheus/
    ├── prometheus.yml
    └── alerts/
        └── diskover-alerts.yml    # Alert rules

3.2 Grafana Datasource Provisioning

File: observability/grafana/provisioning/datasources/datasources.yaml

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

3.3 Grafana Dashboard Provisioning

File: observability/grafana/provisioning/dashboards/dashboards.yaml

apiVersion: 1
providers:
  - name: 'Diskover'
    folder: 'Diskover'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards-json

3.4 Pre-built Dashboards

diskover-overview.json - Main Overview Dashboard:
- Request rate (requests/sec)
- Error rate percentage
- P95 latency
- Active connections
- Recent error logs
- Top endpoints by traffic

diskover-http.json - HTTP Performance Dashboard:
- Latency heatmap by endpoint
- Status code distribution
- Slowest endpoints table
- Request duration histogram

diskover-logs.json - Log Explorer Dashboard:
- Live log stream
- Log volume by level (info/warn/error)
- Error log count stat
- Search by service, level, path

diskover-business.json - Business Metrics Dashboard:
- Transactions created
- Transaction success/failure rate
- Products viewed
- User registrations


Phase 4: Local Development Stack (Docker Compose)

4.1 Update Docker Compose

File: docker-compose.yml

Add services:

grafana:
  image: grafana/grafana:latest
  ports:
    - "3001:3000"
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=admin
    - GF_AUTH_ANONYMOUS_ENABLED=true
    - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
  volumes:
    - ./observability/grafana/provisioning:/etc/grafana/provisioning
  depends_on:
    - loki
    - prometheus

loki:
  image: grafana/loki:2.9.0
  ports:
    - "3100:3100"
  volumes:
    - ./observability/loki/loki-config.yaml:/etc/loki/local-config.yaml
  command: -config.file=/etc/loki/local-config.yaml

promtail:
  image: grafana/promtail:2.9.0
  volumes:
    - ./observability/promtail/promtail-config.yaml:/etc/promtail/config.yml
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
  command: -config.file=/etc/promtail/config.yml
  depends_on:
    - loki

prometheus:
  image: prom/prometheus:latest
  ports:
    - "9090:9090"
  volumes:
    - ./observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--web.enable-lifecycle'

4.2 Makefile Targets

File: Makefile (add targets)

# Start observability stack only
.PHONY: observability
observability:
    docker-compose up -d grafana loki promtail prometheus

# Start everything including observability
.PHONY: dev-full
dev-full:
    docker-compose up -d

# Stop observability stack
.PHONY: observability-down
observability-down:
    docker-compose stop grafana loki promtail prometheus

# View Grafana logs
.PHONY: observability-logs
observability-logs:
    docker-compose logs -f grafana loki prometheus

# Reset observability data (clean slate)
.PHONY: observability-reset
observability-reset:
    docker-compose rm -s -f -v grafana loki promtail prometheus
    docker volume rm diskover-app_grafana_data diskover-app_loki_data diskover-app_prometheus_data 2>/dev/null || true

4.3 Configuration Files

- loki-config.yaml - Filesystem storage, 24h index period
- promtail-config.yaml - Docker log scraping with JSON parsing
- prometheus.yml - Scrape backend at backend:8080/metrics


Phase 5: Railway Deployment

5.1 Deploy Grafana Stack Services

On Railway, create 3 services:

  1. Loki Service
     - Image: grafana/loki:2.9.0
     - Port: 3100
     - Volume: /loki

  2. Prometheus Service
     - Image: prom/prometheus:latest
     - Port: 9090
     - Volume: /prometheus
     - Config: scrape backend.railway.internal:8080/metrics

  3. Grafana Service
     - Image: grafana/grafana:latest
     - Port: 3000
     - Volume: /var/lib/grafana
     - Env: GF_SECURITY_ADMIN_PASSWORD, GF_SERVER_ROOT_URL
     - Mount provisioning configs for stateless dashboards

5.2 Railway Provisioning Options

Option A: Custom Docker Image (Recommended)

Create observability/grafana/Dockerfile:

FROM grafana/grafana:latest
COPY provisioning /etc/grafana/provisioning

Build and push to Railway from this directory.

Option B: Railway Volume + Init Script

Use Railway's volume mounting to copy provisioning files on startup.

5.3 Update Backend Environment

METRICS_ENABLED=true
ENVIRONMENT=production
LOG_LEVEL=info

Phase 6: Environment Variables

Backend (.env.example)

# Observability
METRICS_ENABLED=true
ENVIRONMENT=development
LOG_LEVEL=info

Frontend (.env)

VITE_ENABLE_LOGGING=true

Files to Create/Modify

New Files

| File | Purpose |
|------|---------|
| backend/internal/observability/logger.go | Enhanced slog logger |
| backend/internal/observability/metrics.go | Prometheus metrics |
| backend/internal/observability/middleware.go | HTTP metrics middleware |
| frontend/src/services/logger.ts | Frontend logger service |
| frontend/src/components/error-boundary.tsx | React error boundary |
| observability/grafana/provisioning/datasources/datasources.yaml | Datasource config |
| observability/grafana/provisioning/dashboards/dashboards.yaml | Dashboard provider |
| observability/grafana/provisioning/dashboards-json/diskover-overview.json | Main dashboard |
| observability/grafana/provisioning/dashboards-json/diskover-http.json | HTTP dashboard |
| observability/grafana/provisioning/dashboards-json/diskover-logs.json | Logs dashboard |
| observability/grafana/provisioning/dashboards-json/diskover-business.json | Business dashboard |
| observability/loki/loki-config.yaml | Loki config |
| observability/promtail/promtail-config.yaml | Promtail config |
| observability/prometheus/prometheus.yml | Prometheus config |
| observability/grafana/Dockerfile | Custom Grafana image for Railway |

Modified Files

| File | Changes |
|------|---------|
| backend/go.mod | Add prometheus dependency |
| backend/internal/config/config.go | Add observability config |
| backend/cmd/api/main.go | Use enhanced logger |
| backend/internal/api/router.go | Add metrics endpoint + middleware |
| frontend/src/services/api-client.ts | Add logging |
| frontend/src/main.tsx | Add error boundary, web vitals |
| docker-compose.yml | Add Grafana stack services |
| Makefile | Add observability targets |
| .env.example | Add observability env vars |

Quick Start (After Implementation)

Local Development

# Start full stack with observability
make dev-full

# Or start observability only (if backend already running)
make observability

Then open:
- Grafana: http://localhost:3001 (admin/admin)
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100

All dashboards are pre-configured and ready to use.

Reset to Clean State

make observability-reset
make observability

Verification

Local Testing Checklist

  1. make dev-full starts all services
  2. Grafana at http://localhost:3001 shows "Diskover" folder with 4 dashboards
  3. Datasources (Prometheus, Loki) show "Working" status
  4. Make some API requests to generate traffic
  5. Overview dashboard shows request rate, latency, error logs
  6. Logs dashboard shows live log stream from backend

Metrics Endpoint Check

curl http://localhost:8080/metrics | grep diskover
# Should show: diskover_http_requests_total, diskover_http_request_duration_seconds

Railway Verification

  1. Deploy custom Grafana image with baked-in provisioning
  2. Verify datasources connect to internal URLs
  3. Dashboards appear automatically in "Diskover" folder