
Fixing Database Replication

(6/20-6/20)

Challenge

Our MySQL->Postgres data replication, which powers the business’s analytics dashboard, usually runs without issue. One day, it completely failed.

Action

  • Spotted an AWS alert notification and worked with the Data Engineer to realize this wasn’t a normal hiccup in our replication pipeline.
  • Stopped the AWS DMS replication task (first sketch after this list).
  • Examined CloudWatch logs for direction on where the problem was; they showed a table (awsdms_apply_exceptions) that didn’t exist.
  • Dug into online documentation about the issue.
  • Created a new Postgres copy target of the analytics database.
  • Created a new AWS DMS task against the copy-target DB, which should create the table public.awsdms_apply_exceptions.
  • Grabbed the DDL statement (e.g. CREATE TABLE) for the awsdms_apply_exceptions table.
  • In the (original) analytics DB, created the ‘public’ schema.
  • Also in the (original) analytics DB, applied the CREATE TABLE for awsdms_apply_exceptions (second sketch after this list).
  • Deleted 1) the new Postgres copy target and 2) the AWS DMS task as cleanup.
  • (Never did figure out why the public schema and table disappeared.)
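
For reference, a minimal sketch of stopping and later resuming the task with boto3. The region, account ID, and task ARN below are hypothetical placeholders, not values from the original incident:

    import boto3

    dms = boto3.client("dms", region_name="us-east-1")  # region is an assumption

    # Hypothetical ARN; substitute the real replication task's ARN.
    TASK_ARN = "arn:aws:dms:us-east-1:123456789012:task:ANALYTICS"

    # Stop the failing task before touching the target DB.
    dms.stop_replication_task(ReplicationTaskArn=TASK_ARN)

    # ...after restoring the missing schema/table (next sketch), pick up
    # where replication left off instead of reloading from scratch.
    dms.start_replication_task(
        ReplicationTaskArn=TASK_ARN,
        StartReplicationTaskType="resume-processing",
    )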
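
And a sketch of the restore step itself, assuming a psycopg2 connection to the original analytics DB. In practice you would run the exact DDL grabbed from the copy target; the column layout here follows AWS’s documented awsdms_apply_exceptions control table and may differ by DMS version:

    import psycopg2

    # Hypothetical connection details; point at the original analytics DB.
    conn = psycopg2.connect(host="analytics-db.example.com", dbname="analytics",
                            user="admin", password="...")
    conn.autocommit = True

    with conn.cursor() as cur:
        # Recreate the schema DMS expects its control tables to live in.
        cur.execute("CREATE SCHEMA IF NOT EXISTS public;")
        # Apply the CREATE TABLE grabbed from the copy target.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS public.awsdms_apply_exceptions (
                "TASK_NAME"   varchar(128) NOT NULL,
                "TABLE_OWNER" varchar(128) NOT NULL,
                "TABLE_NAME"  varchar(128) NOT NULL,
                "ERROR_TIME"  timestamp    NOT NULL,
                "STATEMENT"   text         NOT NULL,
                "ERROR"       text         NOT NULL
            );
        """)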

Results

  • Resolved the data replication issue with minimal downtime for the analytics dashboard.

40s to 10s

(Decorist: 10/19-10/19)

Challenge

A business-critical page, used both internally and externally, was taking 40s to load, then started timing out for everyone.

Action

  • Dug in, saw DB CPU utilization pegged at 100%, then found and killed a runaway DB process.
  • Set up an alarm to be notified whenever CPU utilization rose above 70% (first sketch after this list).
  • The page was still taking 40s; realized queries weren’t being logged, turned query logging on (second sketch), and saw LEFT OUTER JOINs across eight large tables.
  • Removed six lines of Django ORM select_related syntax without affecting page functionality (third sketch).
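
A minimal sketch of that alarm with boto3, assuming the database is an RDS instance; the alarm name, instance identifier, region, and SNS topic are hypothetical:

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")  # region assumed

    cw.put_metric_alarm(
        AlarmName="db-cpu-over-70",              # hypothetical alarm name
        Namespace="AWS/RDS",                     # assumes an RDS database
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier",
                     "Value": "analytics-db"}],  # hypothetical instance id
        Statistic="Average",
        Period=300,                              # 5-minute datapoints
        EvaluationPeriods=2,                     # alarm after two high periods
        Threshold=70.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-alerts"],
    )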
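
One way to turn query logging on in a Django app is the django.db.backends logger; this sketch assumes the logging happened in settings.py (the original fix may have enabled logging on the database side instead):

    # settings.py -- log every SQL query Django issues.
    # Note: django.db.backends only emits queries when DEBUG=True.
    LOGGING = {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "console": {"class": "logging.StreamHandler"},
        },
        "loggers": {
            "django.db.backends": {
                "handlers": ["console"],
                "level": "DEBUG",
            },
        },
    }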
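
The select_related change, sketched with hypothetical model and field names: each select_related argument pulls a related table into the single generated query (a LEFT OUTER JOIN for nullable relations), and joining eight large tables can cost far more than the handful of extra queries it avoids:

    from myapp.models import Order  # hypothetical app and model

    # Before: one giant query with LEFT OUTER JOINs across eight large tables.
    orders = Order.objects.select_related(
        "customer", "customer__address", "designer",
        "project", "project__room", "invoice",
    )

    # After: fetch related rows lazily, only when the page actually touches
    # them; several small indexed queries beat one massive join here.
    orders = Order.objects.all()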

Results

  • Dropped the page load time for the critical internal Admin UX from 40s to 10s.