Categories
Architecture Chat Collaboration Integration Site Reliability SOA Troubleshooting

Connecting Supply with Demand

(10/18-5/20)

Challenge

Ongoing reliability issues with a 3rd-party chat solution crucial to business operations, with no documentation or integration monitoring.

Action

The key aspect of the Decorist experience is the connection between the Client (Demand-side) and the Designer (Supply-side). To facilitate that connection, many years prior, the business had adopted a then-nascent Chat-as-a-Service provider as part of the user experience; it made sense to “Buy” instead of “Build.”

Architecture Overview

The chat UX is loaded into an iframe. When users chat, their payload is posted to the 3rd party’s backend. The 3rd party then fires a webhook to Decorist, which records the event in the DB and sends a transactional email depending on business logic.
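The write-up doesn’t include the handler itself; the sketch below is only an illustration of that flow, assuming a Django-style view (the `ChatEvent` model, the `send_chat_notification` helper, and the payload fields are hypothetical).

```python
# Hypothetical sketch of the inbound chat webhook: record the event, then
# decide whether a transactional email should go out. Names are illustrative.
import json

from django.http import HttpResponse, HttpResponseBadRequest
from django.views.decorators.csrf import csrf_exempt

from chat.models import ChatEvent               # hypothetical model
from chat.emails import send_chat_notification  # hypothetical helper


@csrf_exempt
def chat_webhook(request):
    try:
        payload = json.loads(request.body)
    except ValueError:
        return HttpResponseBadRequest("invalid JSON")

    # Track the event so downstream reporting/monitoring has a record of it.
    event = ChatEvent.objects.create(
        conversation_id=payload.get("conversation_id"),
        sender_role=payload.get("sender_role"),   # e.g. "client" or "designer"
        event_type=payload.get("type"),           # e.g. "message.created"
        raw_payload=payload,
    )

    # Business logic: only notify when the recipient isn't already in the chat.
    if event.event_type == "message.created" and payload.get("recipient_offline"):
        send_chat_notification(event)

    return HttpResponse(status=200)
```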

The Problem

The bugs manifested in two main ways:

  • the chat UX not appearing (like below, often the result of a 3rd-party deploy gone bad)
  • emails not sending because the webhooks were never called

With no integration monitoring in place, issues often surfaced first through first-tier support.

Chat not loading
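Below is a minimal sketch of the kind of integration check that was missing, written as an assumption rather than a record of what was built: it probes the provider’s embed endpoint and flags silence from the webhooks (the URL, the one-hour threshold, and the event lookup are placeholders).

```python
# Hypothetical cron-style check covering both failure modes:
#  1) the chat embed endpoint stops responding,
#  2) webhooks silently stop arriving (no events recorded recently).
import datetime
import sys

import requests

CHAT_EMBED_URL = "https://chat-provider.example.com/embed.js"  # placeholder URL
MAX_WEBHOOK_SILENCE = datetime.timedelta(hours=1)              # placeholder threshold


def chat_embed_is_up() -> bool:
    try:
        resp = requests.get(CHAT_EMBED_URL, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False


def webhooks_are_flowing(last_event_at: datetime.datetime) -> bool:
    return datetime.datetime.utcnow() - last_event_at < MAX_WEBHOOK_SILENCE


if __name__ == "__main__":
    # last_event_at would come from the chat-event table; hard-coded here.
    last_event_at = datetime.datetime.utcnow() - datetime.timedelta(minutes=10)

    problems = []
    if not chat_embed_is_up():
        problems.append("chat embed endpoint unreachable")
    if not webhooks_are_flowing(last_event_at):
        problems.append("no chat webhooks received recently")

    if problems:
        print("ALERT: " + "; ".join(problems))  # in practice: page or Slack alert
        sys.exit(1)
```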

Improving the Dependency

Noticing the reliability issues, I first delegated triage to one experienced engineer, and then to another.

I also dug in on my own, discovering and remedying bugs in our own webhooks, while providing data-backed outage reports to the 3rd party and escalating to their CTO when necessary.

Almost quarterly, the company revisits the decision to continue using the 3rd party, given the reliability issues. Each time, though all stakeholders are aware of the pain collectively experienced, the decision has been to punt on replacing the solution.

Results

  • Ensured remediation of 3rd-party issues within 24-36 hours, even on the lowest support tier.
Categories
eCommerce Performance Engineering Site Reliability

Always Be Improving

(Decorist : 8/18-12/19)

Challenge

Leading by example to own and improve systems as the sole engineer with SRE/DevOps/Frontend/Backend experience.

Action

Mar 2019

Watching our AWS costs rise ~8% monthly…

Costs Rising

I learned about and purchased Reserved Instances to realize cost savings on our hosting spend:
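For context on how this kind of spend can be watched programmatically, here is a minimal sketch against the AWS Cost Explorer API via boto3; the date range is illustrative, and this is not necessarily the tooling used at the time.

```python
# Hypothetical cost-tracking sketch using the AWS Cost Explorer API (boto3),
# to watch month-over-month spend and Reserved Instance utilization.
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Monthly unblended cost for a sample window (dates are illustrative).
costs = ce.get_cost_and_usage(
    TimePeriod={"Start": "2019-01-01", "End": "2019-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)
for period in costs["ResultsByTime"]:
    amount = float(period["Total"]["UnblendedCost"]["Amount"])
    print(period["TimePeriod"]["Start"], f"${amount:,.2f}")

# Reserved Instance utilization: low numbers mean paying for reserved capacity
# that isn't being used; 100% means the reservations are fully applied.
ri = ce.get_reservation_utilization(
    TimePeriod={"Start": "2019-01-01", "End": "2019-07-01"},
    Granularity="MONTHLY",
)
for period in ri["UtilizationsByTime"]:
    print(period["TimePeriod"]["Start"],
          period["Total"]["UtilizationPercentage"] + "%")
```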

Dec 2019

Though not leading to cost savings or revenue generation, part of my responsibilities has been database administration: jumping in when the production DB would spike like below, figuring out whether a runaway process needed to be terminated, whether a slow query was bringing it to its knees, whether a cron job was introducing load, or whatever else needed to be done to keep the site up.
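The database engine isn’t named above; as one example of this kind of triage, here is a minimal sketch assuming PostgreSQL and psycopg2 that surfaces long-running queries so a runaway process can be spotted (and, if confirmed, terminated).

```python
# Hypothetical triage script, assuming PostgreSQL + psycopg2 (the actual
# database isn't named above): list queries running longer than 30 seconds
# so a runaway process or pathological query can be spotted quickly.
import psycopg2

DSN = "dbname=app user=readonly host=db.internal"  # placeholder connection string

LONG_RUNNING_SQL = """
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 120) AS query
      FROM pg_stat_activity
     WHERE state <> 'idle'
       AND now() - query_start > interval '30 seconds'
  ORDER BY runtime DESC;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(LONG_RUNNING_SQL)
        for pid, runtime, state, query in cur.fetchall():
            print(f"pid={pid} runtime={runtime} state={state} :: {query}")
        # To kill a confirmed runaway query:
        #   cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```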

Or when bots would crawl the site, bringing it down, necessitating an IP block:
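The write-up doesn’t say at which layer the block was applied; below is one possible sketch, assuming the deny rule is pushed to an AWS network ACL via boto3 (the ACL ID and IP are placeholders).

```python
# Hypothetical emergency IP block, assuming the block is applied as a DENY
# rule on the subnet's network ACL (the ACL ID and IP below are placeholders).
import boto3

ec2 = boto3.client("ec2")

ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # placeholder ACL ID
    RuleNumber=90,          # must sort before the ALLOW rules to take effect
    Protocol="-1",          # all protocols
    RuleAction="deny",
    Egress=False,           # inbound traffic
    CidrBlock="203.0.113.45/32",  # placeholder: the crawling bot's IP
)
```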

Or when digging into the logs to find that a route was 500ing and had to be fixed:
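Finding the offending route can be scripted; a minimal sketch follows, assuming combined-format nginx/Apache access logs (the log path is a placeholder), counting 5xx responses per path.

```python
# Hypothetical log triage: count 5xx responses per path in a combined-format
# access log, so the offending route stands out. Log path is a placeholder.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder

# Combined log format: ... "GET /some/path HTTP/1.1" 500 1234 ...
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')

errors_by_path = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            errors_by_path[match.group("path")] += 1

for path, count in errors_by_path.most_common(10):
    print(f"{count:6d}  {path}")
```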

Mar 2020

Using Cloudcraft, I diagrammed our AWS infrastructure, identifying and deleting 1,000 unused SQS queues.
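As an illustration of how such a cleanup can be scripted (not necessarily how it was done here), a boto3 sketch that flags queues which are empty and long untouched as deletion candidates:

```python
# Hypothetical cleanup sketch: flag SQS queues that are empty and haven't been
# modified in 90+ days as deletion candidates. Thresholds are illustrative.
import time

import boto3

sqs = boto3.client("sqs")
STALE_AFTER = 90 * 24 * 3600  # seconds

queue_urls = sqs.list_queues().get("QueueUrls", [])
for url in queue_urls:
    attrs = sqs.get_queue_attributes(
        QueueUrl=url,
        AttributeNames=["ApproximateNumberOfMessages", "LastModifiedTimestamp"],
    )["Attributes"]

    empty = attrs["ApproximateNumberOfMessages"] == "0"
    stale = time.time() - int(attrs["LastModifiedTimestamp"]) > STALE_AFTER

    if empty and stale:
        print("deletion candidate:", url)
        # sqs.delete_queue(QueueUrl=url)  # uncomment only after review
```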

Also identified and deleted numerous unused RDS snapshots:
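Similarly, a hedged boto3 sketch for surfacing old manual RDS snapshots as deletion candidates (the 180-day threshold is illustrative):

```python
# Hypothetical sketch: list manual RDS snapshots older than 180 days as
# deletion candidates (the age threshold is illustrative).
from datetime import datetime, timedelta, timezone

import boto3

rds = boto3.client("rds")
cutoff = datetime.now(timezone.utc) - timedelta(days=180)

paginator = rds.get_paginator("describe_db_snapshots")
for page in paginator.paginate(SnapshotType="manual"):
    for snap in page["DBSnapshots"]:
        if snap["SnapshotCreateTime"] < cutoff:
            print("deletion candidate:", snap["DBSnapshotIdentifier"])
            # rds.delete_db_snapshot(
            #     DBSnapshotIdentifier=snap["DBSnapshotIdentifier"])
```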

All of these changes led to yet another 37% reduction in MoM AWS costs:

Results

  • Saved the company 115% of my salary in 2019 through process improvements.
Categories
Emails Process Site Reliability Troubleshooting

Flying Blind

(Decorist : 5/19-10/19)

Challenge

Lack of logging/monitoring (including for email sending/deliverability) made it impossible to know if integrations were affecting system/application uptime.

Action

Odd for a company that depends on email marketing: stakeholders would ask about email deliverability, and engineering had no insight because no logging/tracking had ever been instrumented.

Planned an initiative to add integration monitoring while digging deep into the application code and Amazon SES, instrumenting new features going forward but, given time and resource constraints, not retrofitting existing ones.
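One common way to instrument SES deliverability, shown here as an assumption rather than the exact approach taken, is to route SES bounce/complaint/delivery notifications through SNS to an endpoint that records them; the Flask framing and storage are placeholders.

```python
# Hypothetical deliverability instrumentation: an endpoint that receives SES
# bounce/complaint/delivery notifications forwarded by SNS and records them.
# Framework (Flask) and storage are assumptions; signature verification omitted.
import json

from flask import Flask, request

app = Flask(__name__)


@app.route("/ses-events", methods=["POST"])
def ses_events():
    body = json.loads(request.data)

    # SNS asks the endpoint to confirm its subscription before delivering events.
    if body.get("Type") == "SubscriptionConfirmation":
        print("confirm subscription at:", body["SubscribeURL"])
        return "", 200

    if body.get("Type") == "Notification":
        message = json.loads(body["Message"])
        notification_type = message.get("notificationType")  # Bounce/Complaint/Delivery
        recipients = message.get("mail", {}).get("destination", [])
        # In the real system this would be written to a metrics table/log store.
        print(notification_type, recipients)

    return "", 200


if __name__ == "__main__":
    app.run(port=8000)
```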

Results

  • Created visibility into metrics for the company’s key component: email marketing.
Categories
Process Site Reliability

Preventing Business Failure

(Decorist : 5/19-5/19)

Challenge

Realized that a key analytics ETL server was a crucial component of the engineering infrastructure and had no redundancy.

Action

  • Crafted a plan to remediate the risk.
  • Performed the AWS DevOps work necessary to bring up a 2nd instance.
  • Trained up the data engineer.
  • Worked with the Data Engineer and an offshore Tiger Team of 2 to deliver a process for spinning up a Docker-based backup server.

Results

  • Created a replacement Docker image (and recovery process) that can be spun up on demand, ensuring business continuity in a catastrophic situation.
Categories
Management Monitoring Process Site Reliability Troubleshooting

All the False Positives

(Decorist : 9/18-12/18)

Challenge

Inherited a situation with an incomprehensible 1,000+ issues per day in the exception-reporting software (Sentry).

Action

Identified a KR and delegated it to the FE lead: create a daily process to chip away at remediation. Encouraged accountability by having the FE lead give a weekly status report.
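The write-up doesn’t detail how the noise was cut; one common tactic, sketched below as an assumption (in Python, with the frontend SDK offering the equivalent `ignoreErrors`/`beforeSend` options), is to drop known-noisy exceptions at the SDK level so only actionable errors reach Sentry.

```python
# Hypothetical noise-reduction sketch using the Sentry SDK's before_send hook
# to drop known-noisy, non-actionable exceptions before they reach Sentry.
import sentry_sdk

IGNORED_EXCEPTIONS = (ConnectionResetError, BrokenPipeError)  # illustrative


def before_send(event, hint):
    exc_info = hint.get("exc_info")
    if exc_info and isinstance(exc_info[1], IGNORED_EXCEPTIONS):
        return None  # returning None drops the event entirely
    return event


sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    before_send=before_send,
)
```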

Results

  • Reduced system alert notifications from 1,000+ to 7.