Categories
Emails Process Site Reliability Troubleshooting

Flying Blind

(Decorist : 5/19-10/19)

Challenge

Lack of logging/monitoring (including for email sending/deliverability) made it impossible to know if integrations were affecting system/application uptime.

Action

Odd for an email marketing company: stakeholders would ask about email deliverability, and engineering had no insight because no logging or tracking had ever been instrumented.

Planned an initiative to add integration monitoring while digging deep into the application code and Amazon SES. Given time and resource constraints, the plan was to instrument new features going forward rather than retrofit existing ones.
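As an illustration of what instrumenting a send path going forward can look like, here is a minimal Python sketch. It is not the code that was written at the time: the `instrumented_send` function, the `email.delivery` logger name, the `send_counts` tally, and the shape of `send_fn` are all hypothetical, and `send_fn` stands in for whatever wrapper actually calls Amazon SES.

```python
import logging
import time
from collections import Counter
from typing import Callable

logger = logging.getLogger("email.delivery")  # hypothetical logger name

# In-process tallies for illustration only; a real setup would export
# these to whatever metrics backend is in use.
send_counts = Counter()


def instrumented_send(send_fn: Callable[[str, str], str], recipient: str, subject: str) -> bool:
    """Call send_fn(recipient, subject) and record the outcome and latency.

    send_fn is an assumed stand-in for the code that talks to the email
    provider. It is expected to return a provider message ID and to raise
    an exception on failure.
    """
    started = time.monotonic()
    try:
        message_id = send_fn(recipient, subject)
    except Exception:
        send_counts["failed"] += 1
        # logger.exception records the traceback alongside the message.
        logger.exception("email send failed recipient=%s subject=%r", recipient, subject)
        return False
    elapsed_ms = (time.monotonic() - started) * 1000
    send_counts["sent"] += 1
    logger.info(
        "email sent recipient=%s message_id=%s elapsed_ms=%.1f",
        recipient, message_id, elapsed_ms,
    )
    return True
```

Routing every new feature's sends through one wrapper like this gives a single place to answer "did it send, and how long did it take" without touching existing call sites.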

Results

  • Created visibility into metrics for the company’s key component: email marketing.

Categories
Database Monitoring ORM Performance Engineering Troubleshooting

40s to 10s

(Decorist : 10/19-10/19)

Challenge

Business-critical page used internally and externally was taking 40s to load, then started timing out for everyone.

Action

  • Dug in, found DB CPU utilization pegged at 100%, then found and killed a runaway DB process.
  • Set up an alarm to notify whenever CPU utilization exceeds 70%.
  • Page was still taking 40s; realized queries weren’t being logged, turned query logging on, and saw LEFT OUTER JOINs across eight large tables.
  • Removed six lines of Django ORM select_related syntax without affecting page functionality.
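The write-up does not say how query logging was turned on. One common way to do it in a Django project is to route the `django.db.backends` logger to a handler, sketched below. The console handler and the overall layout are assumptions, not the original configuration.

```python
import logging.config

# Candidate value for LOGGING in a Django settings module (assumed layout).
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # Django emits each executed SQL statement on this logger at DEBUG
        # level, but only while settings.DEBUG is True.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
            "propagate": False,
        },
    },
}

# Django applies LOGGING itself at startup; calling dictConfig here only
# lets the snippet be checked on its own.
logging.config.dictConfig(LOGGING)
```

With SQL visible, a `select_related()` call that follows nullable foreign keys shows up as LEFT OUTER JOIN clauses in the logged SELECT, which is the kind of pattern described above.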

Results

  • Cut page load time for the critical internal Admin UX from 40s to 10s.