Building Distributed Systems: Lessons from the Trenches
Over the past decade, I've built distributed systems that serve millions of users - from payment reconciliation services processing 5,000 recurring payments per hour to warehouse management systems optimizing operations across geographies. Here are the hard-won lessons that textbooks don't teach you.
The Fundamental Truth
Distributed systems are fundamentally about managing failure. Everything else - scaling, performance, consistency - flows from accepting that things will fail and designing around it.
Your network will partition. Your servers will crash. Your database will slow down. Your assumptions will be wrong.
Lesson 1: Design for Failure, Not Success
The Fallacies of Distributed Computing
Remember these fallacies, attributed to Peter Deutsch and colleagues at Sun Microsystems (James Gosling added the eighth):
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Every production incident I've debugged violated at least one of these assumptions.
Practical Example: Circuit Breakers
When calling external services, implement circuit breakers to prevent cascade failures:
```go
type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	maxFailures int
	timeout     time.Duration
	state       State
	failures    int
	lastAttempt time.Time
	mu          sync.Mutex
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	if cb.state == StateOpen {
		if time.Since(cb.lastAttempt) > cb.timeout {
			cb.state = StateHalfOpen
		} else {
			return errors.New("circuit breaker is open")
		}
	}

	// Note: holding the lock across fn() serializes callers; a production
	// breaker would release it around the call.
	err := fn()
	cb.lastAttempt = time.Now()

	if err != nil {
		cb.failures++
		if cb.failures >= cb.maxFailures {
			cb.state = StateOpen
		}
		return err
	}

	cb.failures = 0
	cb.state = StateClosed
	return nil
}
```
This pattern saved us during a third-party API outage - instead of all requests timing out (30s each), we failed fast and degraded gracefully.
Lesson 2: Eventual Consistency is Your Friend
At NTUC Enterprise, we built an attendance system handling 20,000 concurrent check-ins during peak times. The secret? Embracing eventual consistency.
The Pattern: Write-Ahead Log + Async Processing
- Accept the request and write to a local WAL (Write-Ahead Log)
- Return success immediately
- Process asynchronously with retries
- Update UI via WebSockets when complete
```go
// Accept check-in request
func (s *AttendanceService) CheckIn(ctx context.Context, studentID string) error {
	// Write to local WAL (fast, reliable)
	event := CheckInEvent{
		StudentID: studentID,
		Timestamp: time.Now(),
		Status:    "pending",
	}
	if err := s.wal.Write(event); err != nil {
		return err
	}
	// Return immediately; the async processor picks it up
	return nil
}

// Async processor
func (s *AttendanceService) ProcessEvents() {
	for event := range s.wal.Read() {
		if err := s.processWithRetry(event); err != nil {
			// Move to DLQ for manual review
			s.dlq.Add(event, err)
		}
	}
}
```
Result
- 99.9% success rate even during peak load
- Sub-100ms response times
- Zero data loss (WAL persists everything)
Lesson 3: Observability is Not Optional
You can't fix what you can't see. At Cisco ThousandEyes, we instrumented everything:
The Three Pillars
1. Metrics - What's happening right now?
```go
// Prometheus metrics
requestDuration := prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
	},
	[]string{"method", "endpoint", "status"},
)

// http.ResponseWriter doesn't expose the status code, so wrap it
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (sr *statusRecorder) WriteHeader(code int) {
	sr.status = code
	sr.ResponseWriter.WriteHeader(code)
}

// Instrument every request
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
	start := time.Now()
	defer func() {
		requestDuration.WithLabelValues(
			r.Method,
			r.URL.Path,
			strconv.Itoa(rec.status),
		).Observe(time.Since(start).Seconds())
	}()
	h.next.ServeHTTP(rec, r)
}
```
2. Logs - What happened?
```go
// Structured logging with context
logger.Info("processing_test_request",
	zap.String("router_id", routerID),
	zap.String("test_type", testType),
	zap.String("trace_id", traceID),
	zap.Duration("duration", elapsed),
)
```
3. Traces - Where did time go?
```go
// Distributed tracing
ctx, span := tracer.Start(ctx, "process_test")
defer span.End()

span.SetAttributes(
	attribute.String("router.id", routerID),
	attribute.String("test.type", testType),
)
// Spans propagate across service boundaries
```
The 25,000 Metric Points We Captured Daily
At NTUC, capturing 25k metric points daily meant we could:
- Detect anomalies in under 5 minutes
- Debug issues with full context
- Plan capacity with confidence
- Prove SLA compliance
Lesson 4: Data Consistency Patterns
Pattern 1: Saga Pattern (Distributed Transactions)
When you need to coordinate across multiple services, use the Saga pattern:
```go
// Saga for order processing
type OrderSaga struct {
	steps []SagaStep
}

type SagaStep struct {
	Execute    func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

func (s *OrderSaga) Execute(ctx context.Context) error {
	executed := []SagaStep{}
	for _, step := range s.steps {
		if err := step.Execute(ctx); err != nil {
			// Compensate in reverse order. Compensation errors are
			// swallowed here; production code should log and retry them.
			for i := len(executed) - 1; i >= 0; i-- {
				executed[i].Compensate(ctx)
			}
			return err
		}
		executed = append(executed, step)
	}
	return nil
}
```
```go
// Usage
saga := &OrderSaga{
	steps: []SagaStep{
		{Execute: reserveInventory, Compensate: releaseInventory},
		{Execute: chargePayment, Compensate: refundPayment},
		{Execute: scheduleShipment, Compensate: cancelShipment},
	},
}
```
Pattern 2: Event Sourcing
Store events instead of current state. Rebuild state by replaying events.
At HelloFresh, we used this for recipe and SKU management:
```go
// Events implement a marker interface so the aggregate can switch on them
type Event interface{}

type RecipeCreated struct {
	RecipeID  string
	Name      string
	Timestamp time.Time
}

type IngredientAdded struct {
	RecipeID   string
	Ingredient Ingredient
	Timestamp  time.Time
}

// Aggregate
type Recipe struct {
	ID          string
	Name        string
	Ingredients []Ingredient
}

func (r *Recipe) Apply(event Event) {
	switch e := event.(type) {
	case RecipeCreated:
		r.ID = e.RecipeID
		r.Name = e.Name
	case IngredientAdded:
		r.Ingredients = append(r.Ingredients, e.Ingredient)
	}
}
```
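Rebuilding current state is then just replaying the history in order; replaying a prefix of the history gives you the "time travel" mentioned below. A self-contained sketch, with `Ingredient` flattened to a string for brevity:

```go
type Event interface{}

type RecipeCreated struct{ RecipeID, Name string }
type IngredientAdded struct{ RecipeID, Ingredient string }

type Recipe struct {
	ID, Name    string
	Ingredients []string
}

func (r *Recipe) Apply(event Event) {
	switch e := event.(type) {
	case RecipeCreated:
		r.ID, r.Name = e.RecipeID, e.Name
	case IngredientAdded:
		r.Ingredients = append(r.Ingredients, e.Ingredient)
	}
}

// Replay rebuilds state from the event history; pass a prefix of the
// history to reconstruct state as of any earlier point in time.
func Replay(events []Event) *Recipe {
	r := &Recipe{}
	for _, e := range events {
		r.Apply(e)
	}
	return r
}
```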
Benefits:
- Complete audit trail
- Time travel (rebuild state at any point)
- Easy to add new projections
- Natural fit for event-driven architecture
Lesson 5: Backpressure and Rate Limiting
When I joined Airtel Digital, we had a problem: traffic spikes would overwhelm downstream services. The solution? Layers of backpressure.
Token Bucket Rate Limiter
```go
type TokenBucket struct {
	capacity   int
	tokens     int
	refillRate time.Duration
	mu         sync.Mutex
}

func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()
	if tb.tokens > 0 {
		tb.tokens--
		return true
	}
	return false
}

// refill must be started once, typically from a constructor: go tb.refill()
func (tb *TokenBucket) refill() {
	ticker := time.NewTicker(tb.refillRate)
	defer ticker.Stop()
	for range ticker.C {
		tb.mu.Lock()
		if tb.tokens < tb.capacity {
			tb.tokens++
		}
		tb.mu.Unlock()
	}
}
```
Load Shedding
```go
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Shed load if the queue is too full
	if h.queue.Len() > h.maxQueueSize {
		http.Error(w, "Service Unavailable", http.StatusServiceUnavailable)
		return
	}
	// Or shed based on request priority
	priority := r.Header.Get("X-Request-Priority")
	if priority == "low" && h.isOverloaded() {
		http.Error(w, "Service Unavailable", http.StatusServiceUnavailable)
		return
	}
	h.next.ServeHTTP(w, r)
}
```
Lesson 6: The Importance of Idempotency
Every operation in a distributed system should be idempotent.
At HelloFresh, payment processing had to be idempotent - customers shouldn't be charged twice if a request is retried.
```go
func (s *PaymentService) ProcessPayment(ctx context.Context, req PaymentRequest) error {
	idempotencyKey := req.IdempotencyKey

	// Check if already processed. Note: this check-then-set is not atomic;
	// concurrent retries need a lock or an atomic set-if-absent.
	if result, found := s.cache.Get(idempotencyKey); found {
		return result.Error
	}

	// Process payment
	result := s.chargeCard(ctx, req)

	// Store result with TTL
	s.cache.Set(idempotencyKey, result, 24*time.Hour)
	return result.Error
}
```
Lesson 7: Cache Carefully
Caching is a double-edged sword. At NTUC, we used Redis master-slave for caching, and learned:
When to Cache
- Expensive computations
- High-read, low-write data
- Acceptable slight staleness
When NOT to Cache
- Rapidly changing data
- Data requiring strong consistency
- Small datasets (overhead > benefit)
Cache Invalidation Strategy
```go
// Write-through cache
func (s *Service) UpdateUser(ctx context.Context, user User) error {
	// Update the database first; it is the source of truth
	if err := s.db.Update(user); err != nil {
		return err
	}
	// Update cache
	if err := s.cache.Set(user.ID, user); err != nil {
		// Log but don't fail; a cache miss is acceptable
		logger.Error("cache_update_failed", zap.Error(err))
	}
	return nil
}

// Cache-aside pattern
func (s *Service) GetUser(ctx context.Context, userID string) (*User, error) {
	// Try cache first
	if user, found := s.cache.Get(userID); found {
		return user, nil
	}
	// Cache miss: fall back to the database
	user, err := s.db.Get(userID)
	if err != nil {
		return nil, err
	}
	// Populate cache for subsequent reads
	s.cache.Set(userID, user)
	return user, nil
}
```
Conclusion
Building distributed systems is hard. There's no silver bullet, no perfect architecture. What works at 1,000 RPS fails at 100,000 RPS.
But these patterns have served me well across industries and scales:
- Design for failure
- Embrace eventual consistency
- Observe everything
- Plan for data consistency
- Implement backpressure
- Make operations idempotent
- Cache thoughtfully
The real lesson? Start simple, measure everything, and evolve based on real data. Don't over-engineer for scale you don't have. But do design for the failure modes you'll definitely encounter.
Recommended Reading:
- Designing Data-Intensive Applications by Martin Kleppmann
- Site Reliability Engineering by Google
- Release It! by Michael Nygard
Questions? War stories to share? Let's connect!