Why?

  • economic efficiency ( do more, faster, with less )
  • a major goal of the code refactoring/optimization phase ( build first, optimize later )

Aspects of performance

Metrics ( monitoring )

  • availability: downtime, crash time, GC pause time
  • response time: average, p95 quantile
  • throughput: queries per second, degree of concurrency
  • resource usage: CPU idle, memory used, disk usage & IO, network IO, instance counts
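
A minimal sketch of collecting these metrics from a Go service with the Prometheus client library ( metric names, port and paths are arbitrary choices ):

    package main

    import (
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // reqDuration records response time; averages and the p95 quantile
    // are derived from the histogram buckets at query time.
    var reqDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency distribution.",
        Buckets: prometheus.DefBuckets,
    })

    // reqTotal records throughput ( queries per second after rate() ).
    var reqTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of handled requests.",
    })

    func handler(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        defer func() {
            reqDuration.Observe(time.Since(start).Seconds())
            reqTotal.Inc()
        }()
        w.Write([]byte("ok"))
    }

    func main() {
        prometheus.MustRegister(reqDuration, reqTotal)
        http.HandleFunc("/", handler)
        http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
        http.ListenAndServe(":8080", nil)
    }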

Profiling analysis ( what costs how much )

request cost distribution over phases:

  • browser rendering time
  • external network latency
  • internal network latency
  • business backend execution time
  • database execution time

resource usage distribution over lines of code:

  • memory allocation & release ( memory leak detection )
  • file handle allocation & release ( file handle leaks )
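
For Go services, per-function CPU and heap cost can be sampled with the standard net/http/pprof package; a minimal sketch ( port 6060 is only a convention ):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Expose CPU, heap, goroutine and block profiles on a side port;
        // inspect them with `go tool pprof http://localhost:6060/debug/pprof/heap`
        // or .../profile?seconds=30 for a CPU profile.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        select {} // the real application work would run here
    }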

Dependency analysis ( what depends on what )

  • dependency definition & management ( package management tools, artifact repos )
  • dependency loop detection ( the graph must be a DAG to build and run; see the sketch after this list )
  • conflicts resolution
  • visualization of dependencies
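
A sketch of dependency loop detection as a three-color depth-first search in Go ( the package names are made up ):

    package main

    import "fmt"

    // hasCycle reports whether the dependency graph ( package -> direct deps )
    // contains a loop.
    func hasCycle(deps map[string][]string) bool {
        const (
            white = iota // not visited
            gray         // on the current DFS path
            black        // fully explored
        )
        color := map[string]int{}

        var visit func(n string) bool
        visit = func(n string) bool {
            color[n] = gray
            for _, d := range deps[n] {
                switch color[d] {
                case gray:
                    return true // back edge: a loop
                case white:
                    if visit(d) {
                        return true
                    }
                }
            }
            color[n] = black
            return false
        }

        for n := range deps {
            if color[n] == white && visit(n) {
                return true
            }
        }
        return false
    }

    func main() {
        deps := map[string][]string{
            "app": {"lib"}, "lib": {"util"}, "util": {"app"}, // deliberate loop
        }
        fmt.Println(hasCycle(deps)) // true
    }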

Benchmarking (comparison of changes)

  • set up test group and control group (baseline)
  • input & output standardization & quantification
  • statistical comparison of metrics across groups and runs
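
A sketch of a test-vs-baseline benchmark with Go's testing package ( json.Marshal stands in for the code under change ); running it with -count on both the baseline and the modified code and feeding the outputs to benchstat gives the statistical comparison:

    package codec

    import (
        "encoding/json"
        "testing"
    )

    type payload struct {
        ID   int    `json:"id"`
        Name string `json:"name"`
    }

    // Run on the baseline commit and on the change, e.g.
    //   go test -bench=Marshal -count=10 > old.txt   ( or new.txt )
    //   benchstat old.txt new.txt
    func BenchmarkMarshal(b *testing.B) {
        p := payload{ID: 1, Name: "example"}
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            if _, err := json.Marshal(p); err != nil {
                b.Fatal(err)
            }
        }
    }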

Common performance bottlenecks

single points of failure: any “master” node without a backup or an automatic leader-election mechanism

Database slow queries

  • no index
  • vague or incorrect index usage ( e.g., functions on indexed columns, leading-wildcard LIKEs )
  • too many data rows
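
A hedged sketch of diagnosing and fixing a missing index from Go with database/sql ( the orders table, user_id column and index name are hypothetical ):

    package store

    import "database/sql"

    // checkAndIndex inspects a slow query's plan and adds an index for
    // its filter predicate.
    func checkAndIndex(db *sql.DB) error {
        // EXPLAIN shows whether the query falls back to a full table scan
        // ( no usable key ) instead of an index lookup.
        rows, err := db.Query("EXPLAIN SELECT * FROM orders WHERE user_id = ?", 42)
        if err != nil {
            return err
        }
        rows.Close()

        // A narrow index on the filtered column turns the scan into a lookup.
        _, err = db.Exec("CREATE INDEX idx_orders_user_id ON orders (user_id)")
        return err
    }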

Business logic defects

  • unnecessary nested loops or recursion
  • unnecessary synchronous and/or sequential execution flow ( the N+1 query issue; see the sketch below )
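
A sketch of the N+1 issue and its batched fix in Go ( the orders schema is hypothetical ):

    package store

    import (
        "database/sql"
        "fmt"
        "strings"
    )

    // N+1 pattern: one query per user, issued sequentially in a loop.
    func ordersNPlusOne(db *sql.DB, userIDs []int) error {
        for _, id := range userIDs {
            rows, err := db.Query("SELECT id, amount FROM orders WHERE user_id = ?", id)
            if err != nil {
                return err
            }
            rows.Close() // row handling elided
        }
        return nil
    }

    // Batched alternative: a single IN (...) query fetches everything at once.
    func ordersBatched(db *sql.DB, userIDs []int) error {
        placeholders := make([]string, len(userIDs))
        args := make([]interface{}, len(userIDs))
        for i, id := range userIDs {
            placeholders[i] = "?"
            args[i] = id
        }
        rows, err := db.Query(
            fmt.Sprintf("SELECT id, amount FROM orders WHERE user_id IN (%s)",
                strings.Join(placeholders, ", ")),
            args...,
        )
        if err != nil {
            return err
        }
        return rows.Close()
    }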

How to improve performance

Code level

  • add specific and proper indexes
  • use batch execution
  • use connection pools ( with random sleep/jitter on reconnects to avoid thundering herds; see the pool sketch after this list )
  • guard read with cache middlewares
  • throttle write with message queues
  • use statics and consts when possible
  • proactively clean up costly objects
  • reduce synchronous logging
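
A sketch of the connection-pool item above using database/sql's built-in pool ( the driver and the limits are assumptions to be tuned per workload ):

    package store

    import (
        "database/sql"
        "time"

        _ "github.com/go-sql-driver/mysql" // driver choice is an assumption
    )

    // NewPool configures the connection pool so that bursts of requests
    // reuse connections instead of opening one per query.
    func NewPool(dsn string) (*sql.DB, error) {
        db, err := sql.Open("mysql", dsn)
        if err != nil {
            return nil, err
        }
        db.SetMaxOpenConns(50)                  // cap concurrent connections to protect the database
        db.SetMaxIdleConns(10)                  // keep warm connections for reuse
        db.SetConnMaxLifetime(30 * time.Minute) // recycle before server-side timeouts bite
        return db, nil
    }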

Arch level

  • parallelize and run work asynchronously when possible ( see the sketch after this list )
  • scale service instances up and down with traffic & load changes
  • partition data
  • read and write flow separation
  • hot and cold data separation
  • build up performance environments for testing
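
A sketch of the parallelization item above using goroutines with a bounded semaphore ( the limit of 8 is arbitrary ):

    package main

    import (
        "fmt"
        "sync"
    )

    // fetchAll runs the per-item work in parallel instead of sequentially;
    // the semaphore channel bounds concurrency so downstream services are
    // not overwhelmed when the input grows.
    func fetchAll(items []string, worker func(string) string) []string {
        results := make([]string, len(items))
        sem := make(chan struct{}, 8) // at most 8 in flight
        var wg sync.WaitGroup

        for i, it := range items {
            wg.Add(1)
            go func(i int, it string) {
                defer wg.Done()
                sem <- struct{}{}
                defer func() { <-sem }()
                results[i] = worker(it)
            }(i, it)
        }
        wg.Wait()
        return results
    }

    func main() {
        out := fetchAll([]string{"a", "b", "c"}, func(s string) string {
            return s + "!" // stand-in for a network call
        })
        fmt.Println(out)
    }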

Build up performance environments for testing

  • real request recording & playback ( sketched after this list )
  • containerization and provisioning
  • test management for benchmarking
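
A sketch of request recording as an HTTP middleware in Go ( file path and record separator are arbitrary ); a replayer would parse the same file and re-issue the requests against the performance environment:

    package main

    import (
        "bufio"
        "log"
        "net/http"
        "net/http/httputil"
        "os"
    )

    // record appends every incoming request, in wire format, to a file
    // so it can be replayed later.
    func record(next http.Handler, logPath string) http.Handler {
        f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            log.Fatal(err)
        }
        w := bufio.NewWriter(f)
        return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
            if dump, err := httputil.DumpRequest(r, true); err == nil {
                w.Write(dump)
                w.WriteString("\n---\n") // record separator for the replayer
                w.Flush()
            }
            next.ServeHTTP(rw, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        http.ListenAndServe(":8080", record(mux, "requests.log"))
    }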

Tools

  • metrics collection: Filebeat, Fluentd for logs; Prometheus for key-value metrics
  • metrics aggregation: Logstash, Fluentd
  • frontend solutions: Kibana, Grafana
  • visualization diagrams: histograms for comparison; flame graphs for analysis; force-directed graphs and Circos for relationships
  • HTTP request load & stress tests: Gatling, ab ( a bare-bones load generator is sketched after this list )
  • profiling: pprof for Go
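
A bare-bones load generator in the spirit of ab, sketched in Go ( target URL, request count and concurrency are placeholders ):

    package main

    import (
        "fmt"
        "net/http"
        "sort"
        "sync"
        "time"
    )

    func main() {
        const (
            url         = "http://localhost:8080/"
            total       = 200
            concurrency = 10
        )

        latencies := make([]time.Duration, total)
        sem := make(chan struct{}, concurrency)
        var wg sync.WaitGroup

        for i := 0; i < total; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                sem <- struct{}{}
                defer func() { <-sem }()

                start := time.Now()
                if resp, err := http.Get(url); err == nil {
                    resp.Body.Close()
                }
                latencies[i] = time.Since(start)
            }(i)
        }
        wg.Wait()

        sort.Slice(latencies, func(a, b int) bool { return latencies[a] < latencies[b] })
        fmt.Println("p95 latency:", latencies[total*95/100])
    }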

Design patterns & mindsets

  • epoll in a loop ( servers )
  • async & await ( avoiding callback hell )
  • divide & conquer ( map then reduce )
  • trading time for space ( e.g., external file sorts ), or space for time ( any kind of index )
  • event based ( fire and forget; see the sketch after this list )
  • master-minions ( one coordinator, many workers )
  • separate list from detail
  • in distributed systems: pick two among consistency, availability, and partition tolerance ( CAP )
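
A sketch of the event-based, fire-and-forget pattern with a buffered channel and a background worker ( a real system would put a message queue in the middle ):

    package main

    import (
        "fmt"
        "time"
    )

    type Event struct{ Name string }

    func main() {
        events := make(chan Event, 1024) // buffer absorbs bursts

        // background consumer: the hot path never waits for it
        go func() {
            for e := range events {
                fmt.Println("handled:", e.Name) // stand-in for the slow side effect
            }
        }()

        // hot path: emit and move on
        for i := 0; i < 3; i++ {
            events <- Event{Name: fmt.Sprintf("user-signup-%d", i)}
        }

        time.Sleep(100 * time.Millisecond) // demo only: let the worker drain
        close(events)
    }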