Categories
Architecture Chat Collaboration Integration Site Reliability SOA Troubleshooting

Connecting Supply with Demand

(10/18-5/20)

Challenge

Ongoing reliability issues with a 3rd-party chat solution crucial to business operations, with no documentation or integration monitoring.

Action

The key aspect of the Decorist experience is the connection between the Client (Demand-side) and the Designer (Supply-side). To facilitate that connection, many years prior, the business had adopted a then-nascent Chat-as-a-Service provider as part of the user experience; it made sense to “Buy” instead of “Build.”

Architecture Overview

The chat UX is loaded into an iframe. When users chat, their payload is posted to the 3rd party’s backend. The 3rd party then fires a webhook to Decorist, which records the event in the DB and sends a transactional email depending on business logic.
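The write-up doesn’t include the handler itself; the sketch below is only an illustration of that flow, assuming a Django-style view (the `ChatEvent` model, the `send_chat_notification` helper, and the payload fields are hypothetical).

```python
# Hypothetical sketch of the inbound chat webhook: record the event, then
# decide whether a transactional email should go out. Names are illustrative.
import json

from django.http import HttpResponse, HttpResponseBadRequest
from django.views.decorators.csrf import csrf_exempt

from chat.models import ChatEvent               # hypothetical model
from chat.emails import send_chat_notification  # hypothetical helper


@csrf_exempt
def chat_webhook(request):
    try:
        payload = json.loads(request.body)
    except ValueError:
        return HttpResponseBadRequest("invalid JSON")

    # Track the event so downstream reporting/monitoring has a record of it.
    event = ChatEvent.objects.create(
        conversation_id=payload.get("conversation_id"),
        sender_role=payload.get("sender_role"),   # e.g. "client" or "designer"
        event_type=payload.get("type"),           # e.g. "message.created"
        raw_payload=payload,
    )

    # Business logic: only notify when the recipient isn't already in the chat.
    if event.event_type == "message.created" and payload.get("recipient_offline"):
        send_chat_notification(event)

    return HttpResponse(status=200)
```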

The Problem

The bugs manifested in two main ways:

  • the chat UX not appearing (like below, often the result of a 3rd-party deploy gone bad)
  • emails not sending because the webhooks were never called

With no integration monitoring in place, issues often surfaced first through first-tier support.

Chat not loading
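Below is a minimal sketch of the kind of integration check that was missing, written as an assumption rather than a record of what was built: it probes the provider’s embed endpoint and flags silence from the webhooks (the URL, the one-hour threshold, and the event lookup are placeholders).

```python
# Hypothetical cron-style check covering both failure modes:
#  1) the chat embed endpoint stops responding,
#  2) webhooks silently stop arriving (no events recorded recently).
import datetime
import sys

import requests

CHAT_EMBED_URL = "https://chat-provider.example.com/embed.js"  # placeholder URL
MAX_WEBHOOK_SILENCE = datetime.timedelta(hours=1)              # placeholder threshold


def chat_embed_is_up() -> bool:
    try:
        resp = requests.get(CHAT_EMBED_URL, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False


def webhooks_are_flowing(last_event_at: datetime.datetime) -> bool:
    return datetime.datetime.utcnow() - last_event_at < MAX_WEBHOOK_SILENCE


if __name__ == "__main__":
    # last_event_at would come from the chat-event table; hard-coded here.
    last_event_at = datetime.datetime.utcnow() - datetime.timedelta(minutes=10)

    problems = []
    if not chat_embed_is_up():
        problems.append("chat embed endpoint unreachable")
    if not webhooks_are_flowing(last_event_at):
        problems.append("no chat webhooks received recently")

    if problems:
        print("ALERT: " + "; ".join(problems))  # in practice: page or Slack alert
        sys.exit(1)
```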

Improving the Dependency

Noticing the reliability issues, I first delegated triage to one experienced engineer, and then to another.

I also dug in on my own, discovering and remedying bugs in our own webhooks, while providing data-backed outage reports to the 3rd party and escalating to their CTO when necessary.

Almost quarterly, the company revisits the decision to continue using the 3rd party, given the reliability issues. Each time, though all stakeholders are aware of the pain collectively experienced, the decision has been to punt on replacing the solution.

Results

  • Ensured remediation of 3rd-party issues within 24-36 hours, even on the lowest support tier.
Categories
eCommerce Performance Engineering Site Reliability

Always Be Improving

(Decorist : 8/18-12/19)

Challenge

Leading by example to own and improve systems as the sole engineer with SRE/DevOps/Frontend/Backend experience.

Action

Mar 2019

Watching our AWS costs rise ~8% monthly…

Costs Rising

I learned about and purchased Reserved Instances to realize cost savings on our hosting spend:
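For context on how this kind of spend can be watched programmatically, here is a minimal sketch against the AWS Cost Explorer API via boto3; the date range is illustrative, and this is not necessarily the tooling used at the time.

```python
# Hypothetical cost-tracking sketch using the AWS Cost Explorer API (boto3),
# to watch month-over-month spend and Reserved Instance utilization.
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Monthly unblended cost for a sample window (dates are illustrative).
costs = ce.get_cost_and_usage(
    TimePeriod={"Start": "2019-01-01", "End": "2019-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)
for period in costs["ResultsByTime"]:
    amount = float(period["Total"]["UnblendedCost"]["Amount"])
    print(period["TimePeriod"]["Start"], f"${amount:,.2f}")

# Reserved Instance utilization: low numbers mean paying for reserved capacity
# that isn't being used; 100% means the reservations are fully applied.
ri = ce.get_reservation_utilization(
    TimePeriod={"Start": "2019-01-01", "End": "2019-07-01"},
    Granularity="MONTHLY",
)
for period in ri["UtilizationsByTime"]:
    print(period["TimePeriod"]["Start"],
          period["Total"]["UtilizationPercentage"] + "%")
```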

Dec 2019

Though not leading to cost savings or revenue generation, part of my responsibilities has been database administration: jumping in when the production DB would spike like below, figuring out whether a runaway process needed to be terminated, whether a slow query was bringing it to its knees, whether a cron job was introducing load, or whatever else needed to be done to keep the site up.
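The database engine isn’t named above; as one example of this kind of triage, here is a minimal sketch assuming PostgreSQL and psycopg2 that surfaces long-running queries so a runaway process can be spotted (and, if confirmed, terminated).

```python
# Hypothetical triage script, assuming PostgreSQL + psycopg2 (the actual
# database isn't named above): list queries running longer than 30 seconds
# so a runaway process or pathological query can be spotted quickly.
import psycopg2

DSN = "dbname=app user=readonly host=db.internal"  # placeholder connection string

LONG_RUNNING_SQL = """
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 120) AS query
      FROM pg_stat_activity
     WHERE state <> 'idle'
       AND now() - query_start > interval '30 seconds'
  ORDER BY runtime DESC;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(LONG_RUNNING_SQL)
        for pid, runtime, state, query in cur.fetchall():
            print(f"pid={pid} runtime={runtime} state={state} :: {query}")
        # To kill a confirmed runaway query:
        #   cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```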

Or when bots would crawl the site, bringing it down, necessitating an IP block:
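The write-up doesn’t say at which layer the block was applied; below is one possible sketch, assuming the deny rule is pushed to an AWS network ACL via boto3 (the ACL ID and IP are placeholders).

```python
# Hypothetical emergency IP block, assuming the block is applied as a DENY
# rule on the subnet's network ACL (the ACL ID and IP below are placeholders).
import boto3

ec2 = boto3.client("ec2")

ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # placeholder ACL ID
    RuleNumber=90,          # must sort before the ALLOW rules to take effect
    Protocol="-1",          # all protocols
    RuleAction="deny",
    Egress=False,           # inbound traffic
    CidrBlock="203.0.113.45/32",  # placeholder: the crawling bot's IP
)
```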

Or when digging into the logs to find that a route was 500ing and had to be fixed:
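Finding the offending route can be scripted; a minimal sketch follows, assuming combined-format nginx/Apache access logs (the log path is a placeholder), counting 5xx responses per path.

```python
# Hypothetical log triage: count 5xx responses per path in a combined-format
# access log, so the offending route stands out. Log path is a placeholder.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder

# Combined log format: ... "GET /some/path HTTP/1.1" 500 1234 ...
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')

errors_by_path = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            errors_by_path[match.group("path")] += 1

for path, count in errors_by_path.most_common(10):
    print(f"{count:6d}  {path}")
```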

Mar 2020

Using Cloudcraft, I diagrammed our AWS infrastructure, identifying and deleting 1,000 unused SQS queues.
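As an illustration of how such a cleanup can be scripted (not necessarily how it was done here), a boto3 sketch that flags queues which are empty and long untouched as deletion candidates:

```python
# Hypothetical cleanup sketch: flag SQS queues that are empty and haven't been
# modified in 90+ days as deletion candidates. Thresholds are illustrative.
import time

import boto3

sqs = boto3.client("sqs")
STALE_AFTER = 90 * 24 * 3600  # seconds

queue_urls = sqs.list_queues().get("QueueUrls", [])
for url in queue_urls:
    attrs = sqs.get_queue_attributes(
        QueueUrl=url,
        AttributeNames=["ApproximateNumberOfMessages", "LastModifiedTimestamp"],
    )["Attributes"]

    empty = attrs["ApproximateNumberOfMessages"] == "0"
    stale = time.time() - int(attrs["LastModifiedTimestamp"]) > STALE_AFTER

    if empty and stale:
        print("deletion candidate:", url)
        # sqs.delete_queue(QueueUrl=url)  # uncomment only after review
```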

Also identified and deleted numerous unused RDS snapshots:
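Similarly, a hedged boto3 sketch for surfacing old manual RDS snapshots as deletion candidates (the 180-day threshold is illustrative):

```python
# Hypothetical sketch: list manual RDS snapshots older than 180 days as
# deletion candidates (the age threshold is illustrative).
from datetime import datetime, timedelta, timezone

import boto3

rds = boto3.client("rds")
cutoff = datetime.now(timezone.utc) - timedelta(days=180)

paginator = rds.get_paginator("describe_db_snapshots")
for page in paginator.paginate(SnapshotType="manual"):
    for snap in page["DBSnapshots"]:
        if snap["SnapshotCreateTime"] < cutoff:
            print("deletion candidate:", snap["DBSnapshotIdentifier"])
            # rds.delete_db_snapshot(
            #     DBSnapshotIdentifier=snap["DBSnapshotIdentifier"])
```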

All of these changes led to yet another 37% reduction in MoM AWS costs:

Results

  • Saved the company 115% of my salary in 2019 through process improvements.
Categories
Emails Process Site Reliability Troubleshooting

Flying Blind

(Decorist : 5/19-10/19)

Challenge

Lack of logging/monitoring (including for email sending/deliverability) made it impossible to know if integrations were affecting system/application uptime.

Action

Odd for a company that depends on email marketing: stakeholders would ask about email deliverability, and engineering had no insight because no logging/tracking had ever been instrumented.

Planned an initiative to add integration monitoring while digging deep into the application code and Amazon SES, instrumenting new features going forward but, given time and resource constraints, not retrofitting existing ones.
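One common way to instrument SES deliverability, shown here as an assumption rather than the exact approach taken, is to route SES bounce/complaint/delivery notifications through SNS to an endpoint that records them; the Flask framing and storage are placeholders.

```python
# Hypothetical deliverability instrumentation: an endpoint that receives SES
# bounce/complaint/delivery notifications forwarded by SNS and records them.
# Framework (Flask) and storage are assumptions; signature verification omitted.
import json

from flask import Flask, request

app = Flask(__name__)


@app.route("/ses-events", methods=["POST"])
def ses_events():
    body = json.loads(request.data)

    # SNS asks the endpoint to confirm its subscription before delivering events.
    if body.get("Type") == "SubscriptionConfirmation":
        print("confirm subscription at:", body["SubscribeURL"])
        return "", 200

    if body.get("Type") == "Notification":
        message = json.loads(body["Message"])
        notification_type = message.get("notificationType")  # Bounce/Complaint/Delivery
        recipients = message.get("mail", {}).get("destination", [])
        # In the real system this would be written to a metrics table/log store.
        print(notification_type, recipients)

    return "", 200


if __name__ == "__main__":
    app.run(port=8000)
```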

Results

  • Created visibility into metrics for the company’s key component: email marketing.
Categories
Process Site Reliability

Preventing Business Failure

(Decorist : 5/19-5/19)

Challenge

Realized that a key analytics ETL server was a crucial component of the engineering infrastructure and had no redundancy.

Action

  • Crafted a plan to remediate the risk.
  • Performed the AWS DevOps work necessary to bring up a 2nd instance.
  • Trained up the data engineer.
  • Worked with the Data Engineer and an offshore Tiger Team of 2 to deliver a process for spinning up a Docker-based backup server.

Results

  • Created a replacement Docker image (and recovery process) that can be spun up on demand, ensuring business continuity in a catastrophic situation.
Categories
Management Monitoring Process Site Reliability Troubleshooting

All the False Positives

(Decorist : 9/18-12/18)

Challenge

Inherited a situation with an incomprehensible 1,000+ issues per day in the exception-reporting software (Sentry).

Action

Identified a KR and delegated it to the FE lead: create a daily process to chip away at remediation. Encouraged accountability by having the FE lead give a weekly status report.
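The write-up doesn’t detail how the noise was cut; one common tactic, sketched below as an assumption (in Python, with the frontend SDK offering the equivalent `ignoreErrors`/`beforeSend` options), is to drop known-noisy exceptions at the SDK level so only actionable errors reach Sentry.

```python
# Hypothetical noise-reduction sketch using the Sentry SDK's before_send hook
# to drop known-noisy, non-actionable exceptions before they reach Sentry.
import sentry_sdk

IGNORED_EXCEPTIONS = (ConnectionResetError, BrokenPipeError)  # illustrative


def before_send(event, hint):
    exc_info = hint.get("exc_info")
    if exc_info and isinstance(exc_info[1], IGNORED_EXCEPTIONS):
        return None  # returning None drops the event entirely
    return event


sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    before_send=before_send,
)
```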

Results

  • Reduced system alert notifications from 1,000+ to 7.