Observability Implementation Plan: Grafana Stack on Railway
Overview
Add observability to Diskover using Loki (logs), Prometheus (metrics), and Grafana (visualization), deployed on Railway with local Docker Compose for development.
Key Design Principle: Fully stateless and reproducible - all Grafana dashboards, datasources, and alerts are provisioned from config files. Spin up locally with make observability and get a fully configured stack.
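Phase 1: Backend Observability Package
1.1 Add Dependencies
File: backend/go.mod
- Add the Prometheus Go client library (github.com/prometheus/client_golang), which provides the promhttp.Handler() used for the /metrics endpoint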
1.2 Create Observability Package
New file: backend/internal/observability/logger.go
- Enhanced slog logger with service metadata (name, version, environment)
- Consistent labels for Loki indexing
New file: backend/internal/observability/metrics.go
- Prometheus metrics definitions:
  - diskover_http_requests_total (counter by method, path, status)
  - diskover_http_request_duration_seconds (histogram)
  - diskover_transactions_total (business metric)
  - diskover_db_connections_active (gauge)
New file: backend/internal/observability/middleware.go
- HTTP metrics middleware for Chi router
- Records request count and duration
1.3 Update Configuration
File: backend/internal/config/config.go
- Add ObservabilityConfig struct:
  - MetricsEnabled (bool)
  - Environment (string)
  - LogLevel (string)
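As an illustration, the struct and a loader for it might look like this. The plan does not name the environment variables or defaults, so METRICS_ENABLED, ENVIRONMENT, LOG_LEVEL, and the default values below are assumptions; the loader takes a getenv function so it can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ObservabilityConfig mirrors the three fields listed above.
type ObservabilityConfig struct {
	MetricsEnabled bool
	Environment    string
	LogLevel       string
}

// LoadObservabilityConfig reads the config through getenv.
// Variable names and defaults are assumptions made for this sketch.
func LoadObservabilityConfig(getenv func(string) string) ObservabilityConfig {
	cfg := ObservabilityConfig{
		MetricsEnabled: true,
		Environment:    "development",
		LogLevel:       "info",
	}
	if v := getenv("METRICS_ENABLED"); v != "" {
		// Keep the default when the value is not a valid boolean.
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.MetricsEnabled = b
		}
	}
	if v := getenv("ENVIRONMENT"); v != "" {
		cfg.Environment = v
	}
	if v := getenv("LOG_LEVEL"); v != "" {
		cfg.LogLevel = v
	}
	return cfg
}

func main() {
	cfg := LoadObservabilityConfig(os.Getenv)
	fmt.Printf("%+v\n", cfg)
}
```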
1.4 Integrate with Application
File: backend/cmd/api/main.go
- Replace logger initialization with enhanced logger
- Pass environment/version to logger
File: backend/internal/api/router.go
- Add metrics middleware to Chi router
- Add /metrics endpoint using promhttp.Handler()
- Enhance logging middleware with request_id, query params, user_agent
Phase 2: Frontend Logging
2.1 Create Logger Service
New file: frontend/src/services/logger.ts
- Buffers log entries and flushes every 5 s or once 50 entries are queued, whichever comes first
- Log levels: debug, info, warn, error
- Captures: URL, user agent, timestamps
- Sends logs to backend /api/logs endpoint
2.2 Integrate Logging
File: frontend/src/services/api-client.ts
- Log API requests (method, url, status, duration)
- Log API errors with context
File: frontend/src/main.tsx or create reportWebVitals.ts
- Report web vitals (CLS, INP, FCP, LCP, TTFB) to logger
2.3 Error Boundary
New file: frontend/src/components/error-boundary.tsx
- React error boundary that logs errors to observability system
2.4 Backend Log Ingestion Endpoint
File: backend/internal/handlers/ (new handler)
- Add POST /api/logs endpoint to receive frontend logs
- Write to stdout in structured format for Loki pickup
Phase 3: Grafana Provisioning (Stateless Dashboards)
3.1 Directory Structure
observability/
├── grafana/
│   ├── Dockerfile                      # Custom image for Railway
│   └── provisioning/
│       ├── datasources/
│       │   └── datasources.yaml        # Auto-configure Prometheus + Loki
│       ├── dashboards/
│       │   └── dashboards.yaml         # Dashboard provider config
│       └── dashboards-json/
│           ├── diskover-overview.json  # Main overview dashboard
│           ├── diskover-http.json      # HTTP metrics dashboard
│           ├── diskover-logs.json      # Log explorer dashboard
│           └── diskover-business.json  # Business metrics dashboard
├── loki/
│   └── loki-config.yaml
├── promtail/
│   └── promtail-config.yaml
└── prometheus/
    ├── prometheus.yml
    └── alerts/
        └── diskover-alerts.yml         # Alert rules
3.2 Grafana Datasource Provisioning
File: observability/grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
3.3 Grafana Dashboard Provisioning
File: observability/grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: 'Diskover'
    folder: 'Diskover'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards-json
3.4 Pre-built Dashboards
diskover-overview.json - Main Overview Dashboard:
- Request rate (requests/sec)
- Error rate percentage
- P95 latency
- Active connections
- Recent error logs
- Top endpoints by traffic
diskover-http.json - HTTP Performance Dashboard:
- Latency heatmap by endpoint
- Status code distribution
- Slowest endpoints table
- Request duration histogram
diskover-logs.json - Log Explorer Dashboard:
- Live log stream
- Log volume by level (info/warn/error)
- Error log count stat
- Search by service, level, path
diskover-business.json - Business Metrics Dashboard:
- Transactions created
- Transaction success/failure rate
- Products viewed
- User registrations
Phase 4: Local Development Stack (Docker Compose)
4.1 Update Docker Compose
File: docker-compose.yml
Add services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - loki
      - prometheus
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - ./observability/loki/loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - ./observability/promtail/promtail-config.yaml:/etc/promtail/config.yml
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml
    depends_on:
      - loki
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'
4.2 Makefile Targets
File: Makefile (add targets)
# Start observability stack only
.PHONY: observability
observability:
	docker-compose up -d grafana loki promtail prometheus

# Start everything including observability
.PHONY: dev-full
dev-full:
	docker-compose up -d

# Stop observability stack
.PHONY: observability-down
observability-down:
	docker-compose stop grafana loki promtail prometheus

# Follow Grafana, Loki, and Prometheus logs
.PHONY: observability-logs
observability-logs:
	docker-compose logs -f grafana loki prometheus

# Reset observability data (clean slate).
# `down` does not accept service names in every Compose version, so the
# containers are stopped and removed with `rm` instead.
.PHONY: observability-reset
observability-reset:
	docker-compose rm --stop --force -v grafana loki promtail prometheus
	docker volume rm diskover-app_grafana_data diskover-app_loki_data diskover-app_prometheus_data 2>/dev/null || true
4.3 Configuration Files
loki-config.yaml - Filesystem storage, 24h index period
promtail-config.yaml - Docker log scraping with JSON parsing
prometheus.yml - Scrape backend at backend:8080/metrics
Phase 5: Railway Deployment
5.1 Deploy Grafana Stack Services
On Railway, create 3 services:
- Loki Service
  - Image: grafana/loki:2.9.0
  - Port: 3100
  - Volume: /loki
- Prometheus Service
  - Image: prom/prometheus:latest
  - Port: 9090
  - Volume: /prometheus
  - Config: scrape backend.railway.internal:8080/metrics
- Grafana Service
  - Image: grafana/grafana:latest
  - Port: 3000
  - Volume: /var/lib/grafana
  - Env: GF_SECURITY_ADMIN_PASSWORD, GF_SERVER_ROOT_URL
  - Mount provisioning configs for stateless dashboards
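5.2 Railway Provisioning Options
Option A: Custom Docker Image (Recommended)
Create observability/grafana/Dockerfile, a custom Grafana image with the provisioning/ directory baked in at /etc/grafana/provisioning, so datasources and dashboards load without any mounted files.
Option B: Railway Volume + Init Script
Use Railway's volume mounting to copy provisioning files on startup.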
5.3 Update Backend Environment
Phase 6: Environment Variables
Backend (.env.example)
Frontend (.env)
Files to Create/Modify
New Files
| File | Purpose |
|---|---|
| backend/internal/observability/logger.go | Enhanced slog logger |
| backend/internal/observability/metrics.go | Prometheus metrics |
| backend/internal/observability/middleware.go | HTTP metrics middleware |
| frontend/src/services/logger.ts | Frontend logger service |
| frontend/src/components/error-boundary.tsx | React error boundary |
| observability/grafana/provisioning/datasources/datasources.yaml | Datasource config |
| observability/grafana/provisioning/dashboards/dashboards.yaml | Dashboard provider |
| observability/grafana/provisioning/dashboards-json/diskover-overview.json | Main dashboard |
| observability/grafana/provisioning/dashboards-json/diskover-http.json | HTTP dashboard |
| observability/grafana/provisioning/dashboards-json/diskover-logs.json | Logs dashboard |
| observability/grafana/provisioning/dashboards-json/diskover-business.json | Business dashboard |
| observability/loki/loki-config.yaml | Loki config |
| observability/promtail/promtail-config.yaml | Promtail config |
| observability/prometheus/prometheus.yml | Prometheus config |
| observability/grafana/Dockerfile | Custom Grafana image for Railway |
Modified Files
| File | Changes |
|---|---|
| backend/go.mod | Add prometheus dependency |
| backend/internal/config/config.go | Add observability config |
| backend/cmd/api/main.go | Use enhanced logger |
| backend/internal/api/router.go | Add metrics endpoint + middleware |
| frontend/src/services/api-client.ts | Add logging |
| frontend/src/main.tsx | Add error boundary, web vitals |
| docker-compose.yml | Add Grafana stack services |
| Makefile | Add observability targets |
| .env.example | Add observability env vars |
Quick Start (After Implementation)
Local Development
# Start full stack with observability
make dev-full
# Or start observability only (if backend already running)
make observability
Then open:
- Grafana: http://localhost:3001 (admin/admin)
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
All dashboards are pre-configured and ready to use.
Reset to Clean State
# Remove the observability containers and their data
make observability-reset
Verification
Local Testing Checklist
- make dev-full starts all services
- Grafana at http://localhost:3001 shows "Diskover" folder with 4 dashboards
- Datasources (Prometheus, Loki) show "Working" status
- Make some API requests to generate traffic
- Overview dashboard shows request rate, latency, error logs
- Logs dashboard shows live log stream from backend
Metrics Endpoint Check
curl http://localhost:8080/metrics | grep diskover
# Should show: diskover_http_requests_total, diskover_http_request_duration_seconds
Railway Verification
- Deploy custom Grafana image with baked-in provisioning
- Verify datasources connect to internal URLs
- Dashboards appear automatically in "Diskover" folder