Categories
Collaboration, Database, Troubleshooting

Fixing Database Replication

(6/20-6/20)

Challenge

Our MySQL->Postgres data replication, which powers the business’s analytics dashboard, usually runs without issue. One day, it failed completely.

Action

  • Spotted an AWS alert notification and worked with the Data Engineer to determine this wasn’t a normal hiccup in our replication pipeline.
  • Stopped the AWS DMS task for replication.
  • Examined CloudWatch logs for direction on where the problem was; they showed that a table the task depends on (awsdms_apply_exceptions) didn’t exist.
  • Dug into online documentation about the issue.
  • Created a new Postgres copy target of the analytics database.
  • Created a new AWS DMS task against the copy target DB, which should create the table public.awsdms_apply_exceptions.
  • Grabbed the DDL statement (i.e. CREATE TABLE) for the awsdms_apply_exceptions table.
  • In the (original) analytics DB, created the ‘public’ schema.
  • Also in the (original) analytics DB, applied the CREATE TABLE for awsdms_apply_exceptions (see the sketch after this list).
  • Deleted 1) the new Postgres copy target and 2) the AWS DMS task as cleanup.
  • (Never did figure out why the public schema and table disappeared.)
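
The fix in the original analytics DB boiled down to two DDL statements. The exact column definitions came from the DDL grabbed off the copy target; the sketch below is a hypothetical approximation in Python using psycopg2, with placeholder connection details and column names that follow AWS DMS’s documented control-table layout.

    # Sketch: recreate the DMS control table in the original analytics DB.
    # Connection details are placeholders; column types should be copied
    # from the DDL that DMS generated in the copy target.
    import psycopg2

    DDL = """
    CREATE SCHEMA IF NOT EXISTS public;
    CREATE TABLE IF NOT EXISTS public.awsdms_apply_exceptions (
        task_name   VARCHAR(128),
        table_owner VARCHAR(128),
        table_name  VARCHAR(128),
        error_time  TIMESTAMP,
        statement   TEXT,
        error       TEXT
    );
    """

    with psycopg2.connect("dbname=analytics host=... user=... password=...") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)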

Results

  • Resolved the data replication issue with minimal downtime for the analytics dashboard.

Categories
Infrastructure, Process, Security

Shoring Up The Site

(10/19-6/20)

Challenge

Remediate CVEs without specialized resources.

Action

In an annual process around springtime, I worked with corporate IT and its security consulting company, operating as Red/Blue Teams, to navigate the rules of engagement, coordinate a window for the penetration tests, and then fix the identified CVEs.

Lacking other resources to replicate the penetration findings, I discovered and leveraged cobalt.io and detectify to reproduce the vulnerabilities well enough to create actionable tickets for engineers to address.

Result

Remediated High-severity vulnerabilities within 30 days and added Medium-severity ones to the backlog.

Categories
Forecasting, Management, Process, Troubleshooting

Rolling With the Punches

(11/18-5/20)

Challenge

Right-sizing the engineering team as the business ebbed and flowed.

Action

Nov 2018 – Mar 2019

Two months after joining, in Oct 2018, I was surprised by a request to provide a KLO (keep-the-lights-on) budget slashing engineering by 60%.

Having only a basic understanding of team members’ strengths and weaknesses, I anticipated the following year’s needs and then, alongside the SVP PROD and CEO, presented guidance to the corporate parent’s COO.

We secured fiscal-year funding to ensure team and business continuity at 38 headcount.

Dec 2019

In Nov 2019, I was informed we needed to reduce our 25-person Delhi team to 8 for the 2020 fiscal year. I looked over the team’s skillsets and, with a solid year under my belt and a clear sense of who the top performers were, decided who would stay.

Saved the company $800K by reducing engineering headcount from 38 to 25.

Mar 2020

Then, with the onset of COVID, we needed to further reduce team size for both the India and Pakistan teams.

Given the impact of COVID, I made decisions leading to an additional $840K reduction, from 25 down to 9.

Results

  • Adjusted team size as necessary to meet the needs of the business.

Categories
Architecture, Chat, Collaboration, Integration, Site Reliability, SOA, Troubleshooting

Connecting Supply with Demand

(10/18-5/20)

Challenge

Ongoing reliability issues with a 3rd-party chat solution crucial to business operation, with neither documentation nor integration monitoring.

Action

The key aspect of the Decorist experience is the connection between the Client (Demand-side) and the Designer (Supply-side). To facilitate that connection, many years prior, the business had included a then-nascent chat-solution provider as part of the user experience; it made sense to “Buy” instead of “Build.”

Architecture Overview

The chat UX is loaded into an iframe. When users chat, their payload is posted to the 3rd party’s backend. The 3rd party then fires a webhook back to Decorist, which tracks the event in the DB and fires a transactional email depending on business logic.
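
A minimal sketch of the Decorist-side receiver in this flow is below, in Python with Flask. The route, payload fields, and helper logic are hypothetical, since the actual contract with the chat provider isn’t documented here; it only illustrates the track-then-maybe-email shape described above.

    # Hypothetical sketch of the webhook receiver: record the chat event,
    # then send a transactional email only when business logic says so.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def save_chat_event(event: dict) -> None:
        # Placeholder for the DB write that tracks the event.
        print("saving chat event:", event.get("type"))

    def should_notify(event: dict) -> bool:
        # Placeholder business logic, e.g. only email an offline recipient.
        return event.get("recipient_online") is False

    def send_transactional_email(event: dict) -> None:
        # Placeholder for the transactional email send.
        print("emailing:", event.get("recipient_email"))

    @app.route("/webhooks/chat", methods=["POST"])
    def chat_webhook():
        event = request.get_json(force=True) or {}
        save_chat_event(event)
        if should_notify(event):
            send_transactional_email(event)
        return jsonify({"ok": True}), 200

    if __name__ == "__main__":
        app.run(port=5000)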

The Problem

The ways the bugs manifested boiled down to:

  • chat UX not appearing (as in the screenshot below, often the result of a 3rd-party deploy gone bad)
  • emails not sending because the webhooks were never called

Because we lacked integration monitoring, issues often bubbled up through first-tier support.

[Screenshot: chat not loading]

Improving the Dependency

Noticing the reliability issues, I first delegated triage to one experienced engineer, and then to another.

I also dug in on my own, discovering and remedying bugs in our own webhooks while providing data-backed outage information to the 3rd party, escalating to their CTO when necessary.
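
One way to produce that kind of data-backed outage information is to track webhook delivery gaps on our side. A rough sketch, with hypothetical table and column names and an illustrative threshold:

    # Sketch: flag a likely provider outage when no chat webhooks have been
    # recorded recently. Table/column names and the threshold are assumptions,
    # and timestamps are assumed to be stored as naive UTC.
    import datetime
    import psycopg2

    STALENESS_THRESHOLD = datetime.timedelta(hours=2)

    def webhook_gap(conn) -> datetime.timedelta:
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(received_at) FROM chat_webhook_events;")
            (last_seen,) = cur.fetchone()
        if last_seen is None:
            return STALENESS_THRESHOLD * 2  # no events at all counts as stale
        return datetime.datetime.utcnow() - last_seen

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=decorist host=... user=... password=...")
        gap = webhook_gap(conn)
        if gap > STALENESS_THRESHOLD:
            # In practice, this is the sort of evidence that backed escalations.
            print(f"ALERT: no chat webhooks for {gap}; possible provider outage")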

Almost quarterly, the company revisits the decision to continue using the 3rd party given the reliability issues. Each time, though all stakeholders are aware of the pain collectively experienced, the decision has been to punt on replacing the solution.

Results

  • Ensured remediation of 3rd-party issues within 24-36 hours, even on the lowest support tier.