Reliability Engineering at Coinbase

Why Reliability Engineering?

Why is Reliability Engineering relevant at a company like Coinbase? Why would we need to build a Reliability Engineering team?

“Our goal is to make Coinbase the most trusted and easiest to use digital currency exchange.”

-Brian Armstrong, Co-founder & CEO

It all comes back to what our CEO Brian Armstrong said about Coinbase wanting to be the most trusted. Our goal in the cryptocurrency industry is to create an open financial system for the world, and part of that requires us to build the most trusted digital currency exchange. In order to be the most trusted exchange, we must be the most reliable. Being reliable is a competitive advantage in our industry, whereas being unreliable is a serious risk to our business.

Before you get too deep into this article, please note that we’re actively hiring great Reliability Engineers, so if any of this sounds interesting to you please head over to our Senior Reliability Engineer job posting here.

What is Reliability Engineering?

The mission of the Reliability Engineering team at Coinbase is:

“Help engineers design & keep their promises in production.”

The word “promise” in our mission statement is a reference to Promise Theory, which was invented by Mark Burgess. While we use many of the principles from the Google SRE books, we found Promise Theory to be more human-friendly than the term “Service Level Objective”, which is a bit jargon-y. Based on the investigations into safety and reliability by people like Sidney Dekker and companies such as Toyota (see the Toyota Way), we consider reliability to be ultimately a human challenge. That is why we preferred to reference a concept which every human already understands: that of making and keeping promises.
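To make the link between a promise and a Service Level Objective concrete, here is a minimal sketch in Python. It is not Coinbase tooling: the `Promise` class, the function names, and the 99.9% target are all hypothetical, chosen only to illustrate how a promise can be expressed as a target on a measured indicator.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Promise:
    """Hypothetical promise (SLO): a target for a measured indicator (SLI)."""
    description: str
    target: float  # minimum fraction of good events, e.g. 0.999


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were good. No traffic is treated as fully good."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def is_kept(promise: Promise, sli: float) -> bool:
    """A promise is kept when the measured indicator meets or exceeds the target."""
    return sli >= promise.target


def error_budget_remaining(promise: Promise, good_events: int, total_events: int) -> int:
    """Bad events still tolerable before the promise is broken (negative if broken)."""
    # Round before flooring to guard against floating-point noise in (1 - target).
    allowed_bad = math.floor(round(total_events * (1 - promise.target), 6))
    actual_bad = total_events - good_events
    return allowed_bad - actual_bad


if __name__ == "__main__":
    # Illustrative numbers only: 100,000 requests, 50 of which failed.
    promise = Promise("99.9% of API requests succeed", target=0.999)
    sli = availability_sli(good_events=99_950, total_events=100_000)
    print(is_kept(promise, sli))
    print(error_budget_remaining(promise, 99_950, 100_000))
```

Framing the target as a promise keeps the question simple for the owning team: given what we measured, did we keep our word, and how much room is left before we break it?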

Major differences between Reliability Engineering at Coinbase and Site Reliability Engineering (SRE) at some other companies:

  • We are generalist software engineers first and foremost. We focus on solving challenges by writing better software rather than adding more and more humans to push buttons. Everyone on the team is a strong software engineer, working on multiple software systems in a variety of programming languages.
  • We do not have front-line pager responsibility. We are on-call for the systems that we ourselves own (e.g. the Coinbase observability stack), but we are not the first line of incident response for other teams. Service and product teams have their own pager rotations.

We like to apply the metaphor of ‘teaching a person to fish vs giving them a fish’ to how we operate: our mission is to “teach teams to fish” in terms of reliability. This is in contrast to “giving them a fish” by handling front-line pager duties on their behalf. Another way of putting it is that our goal is to up-level every engineering team at Coinbase to be self-sufficient in Reliability Engineering.

How do Reliability Engineers work?

One of the important things to realize about reliability engineering is that it is inherently cross-cutting throughout the organization. Reliability is not itself a functional silo; it is a value and a business output. Our customers are every single engineering team at Coinbase. Since we work with so many customers, we have defined different models of engagement to meet their needs:

  • Advisory. This is answering questions, or responding to ad-hoc requests without formal deliverables. For example, responding to “Help me monitor/scale/improve my thing” questions in Slack, or jumping into production incidents to assist responders.
  • Consulting. We often run structured reliability workshops and pairing sessions with other teams. In these engagements, we have a shared goal (in our case, an OKR) with the team we’re consulting with, so there is a measurable outcome. While consulting engagements are formal, they are typically part-time endeavours.
  • Embedding. Sometimes teams will need full-time reliability support from our engineers, and they request that we physically sit and work with them, participating in their standups, sprint plannings, and so on. This is where we use embedding. Similar to Consulting, this work has a shared goal and measurable outcome (OKR); the difference is that the reliability engineer is a temporary (typically, one calendar quarter) member of the customer team.

Beyond the various ways we engage with customers, we follow a standard “agile” software engineering process. We have a weekly planning meeting to update our Kanban board, conduct monthly retrospectives and hold daily standups. Longer-term strategy and measurements are captured in quarterly OKRs which we derive from customer feedback and internal discussion.

Introducing the Coinbase Reliability Engineering Team

The Reliability Team was founded in 2018 with one engineer (Luke Demi) and myself (Niall O’Higgins) as manager. Since then, we’ve grown to 7 engineers and shipped a lot of improvements.

In the words of folks on the team, here are some accomplishments we can discuss publicly, as well as impressions and experiences from working on reliability!

Luke Demi

After joining Coinbase in 2016, my initial efforts within the company focused on building self-service infrastructure for engineers. However, in 2017, as interest in cryptocurrency surged, Coinbase began to experience outages across our systems. Solving these sorts of reliability problems excited me, so I dove in head first to resolve these issues.

We were able to survive 2017, but it was clear that in order to withstand future surges and provide a reliable experience for our customers we would need to make reliability a core component of the engineering culture at Coinbase.

I find the Reliability Team exciting because we’re able to both advise teams on best practices for choosing reliability indicators (Service Level Indicators, AKA SLIs) and promises (AKA SLOs) as well as build the tools that let engineers understand the performance of their systems in production.

Maksym Naboka

I joined Coinbase in July 2018. Being the third engineer on the Reliability Team was an amazing experience. There are so many things I love about the company and I’d like to highlight a few of them:

  • An opportunity to work with / learn from smart and talented people.
  • Project ownership. An engineer on the Reliability Team owns a project all the way from design to shipping.
  • Ability to contribute to Open Source.
  • Learn, learn and learn. Coinbase offers so many opportunities to learn new skills. It feels like we are using every spare minute to learn new things! We have Lunch & Learn sessions with guests from leading technology companies, and every engineer has an annual educational budget to go to conferences or take online classes.
  • Delicious food on site 🙂

Amy Li

When I first joined the Reliability Team in November 2018, I was under the impression that I would be thrown into the deep end of blockchain, drowning in Bitcoin, Ethereum, and smart contracts. Colleagues also warned me of endless firefighting and nightmarish on-call rotations. Fortunately, this was not the case.

The Reliability Team doesn’t work with blockchains directly and isn’t the first to be paged for every single incident. Each Coinbase team owns the daily operations of its specific product or service. This allows for distributed knowledge across the organization.

As a new college graduate I initially felt overwhelmed, but everyone on the team has been incredibly supportive and willing to share their knowledge. Within a month, Niall and I improved our incident management system by integrating it with JIRA. I wrote my first design doc to further integrate PagerDuty with our incident management system and I am continually making incremental changes to our system.

One of the most important things I’ve learned is that working with amazing team members is priceless. The Reliability Team is a group of curious, empathetic, and intelligent individuals and there’s no other group I would rather be with for five days a week.

Paul Henry

The most interesting part of being on the Reliability Team for me is our high-level perspective across the organization. Since we are not tasked with handling day-to-day operations of any specific Coinbase product (Coinbase Pro, Coinbase Wallet, etc.), we can focus on improving the ability of teams to monitor and understand their systems. This means that teams can move faster, incidents are resolved quicker, and there’s a decentralization of knowledge across the organization.

Here are some examples of improvements that I’ve contributed to over the past year:

  • Writing lightweight stats, tracing, and logging libraries for the various languages in use across the organization.
  • Contributing to “paved roads” for various languages and ensuring that developers have a good starting point for new services, with sane defaults.
  • Introducing new vendors (such as Datadog) to bring more dimensions of observability, unlocking new ways of monitoring systems.
  • Bringing a perspective of reliability to technology decisions made by teams and helping them ask the right questions.
  • Contributing to our deployment tooling to integrate high-level monitoring by default on all services.
  • Enabling the use of gRPC across the organization through client generation in various languages and integration into our AWS architecture. See the blog post “gRPC to AWS Lambda: Is it Possible?”

In addition to shared tooling, we engage with many teams across the organization by running workshops, review sessions, and office hours.

Workshops are hands-on sessions that focus on topics like observability tooling and promise construction, within the context of that team’s services or problem domain.

Review sessions happen both early in the design process for services and later when they are nearing production. These reviews do not act as a gate or “green check mark” for teams, but instead ensure that they are asking the right questions and highlight ways in which the reliability team can level up teams across the organization.

Office hours are open time every week for any engineer to bring problems or feedback to our team by pairing with an engineer. Topics usually include how to build effective monitors and dashboards, integrating tracing or metrics libraries, which database to use for a particular problem, and more.

At the end of the day, my favorite part about the Reliability Team is the diverse set of engineers we have. The breadth and depth of knowledge shared by everyone is a great support structure for tackling a problem of any scale.

Jordan Sitkin

I have an unusual background for an infrastructure engineer. I studied graphic design in school and worked for the first half of my career as a designer. Joining the Reliability Team was, for me, the latest step in a long, ongoing journey away from the front end. I’ve really enjoyed the new challenges I’ve faced on this team and have been pleasantly surprised at how often my experience as a designer ends up being relevant here.

My favorite part about being on the Reliability Team is being close to where the excitement is happening across the company. The greatest need for reliability expertise is often around new product launches or new-found success of some existing product. We’ve been pursuing a new model of embedding reliability engineers in other teams where their expertise is needed most. I’m personally currently embedded in the Consumer team, which is responsible for the Coinbase mobile apps. I’ve enjoyed feeling close to the front lines of product development while still focusing on infrastructure.

Another rewarding aspect of being on the Reliability Team has been turning our work into conference talks. Over the past year I had the chance to speak at MongoDB World and QCon about designing load testing strategies. I had never given a talk before, so this was a great learning opportunity for me and I ended up having a lot of fun doing it.

Working on the Reliability Team is one of the most exciting positions at Coinbase because we get to be a part of so many different initiatives and projects across the company. We’ve got a great diversity of expertise on the team. I’ve never learned so much so quickly.

Reliability Engineering and the Future

In the past year, our team has helped all of Coinbase build a culture of reliability in the following ways:

  • Moving the entire engineering team from a reactive stance on reliability (firefighting, etc.) to a proactive one (installing smoke detectors) with service level indicators and promises.
  • Providing a world-class observability stack built on three pillars: tracing, metrics and logs.
  • Designing and implementing high-performance infrastructure services.

We look forward to doing much more over the next year, such as:

  • Building the serverless foundation to accelerate feature development.
  • Helping move to a service-oriented architecture by building core infrastructure such as the service mesh.
  • Leveling up every single team in terms of performance engineering, quality and incident response.

If any of this sounds interesting to you please head over to our Senior Reliability Engineer job posting here.

This website contains links to third-party websites or other content for information purposes only (“Third-Party Sites”). The Third-Party Sites are not under the control of Coinbase, Inc., and its affiliates (“Coinbase”), and Coinbase is not responsible for the content of any Third-Party Site, including without limitation any link contained in a Third-Party Site, or any changes or updates to a Third-Party Site. Coinbase is not responsible for webcasting or any other form of transmission received from any Third-Party Site. Coinbase is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement, approval or recommendation by Coinbase of the site or any association with its operators.

Reliability Engineering at Coinbase was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
