SREcon22 Asia/Pacific is a three-day conference organized by the USENIX Association, which was held on 7-9 December 2022 in Sydney, Australia. I am Kazuki Higashiguchi, a senior site reliability engineer in the Autify engineering team. At this conference, I had the great fortune and privilege to talk about a stateful system.

This blog reports on SREcon22 and describes the technical details of our unique showcase of a stateful system – a WebSocket server for secure tunneling called Autify Connect.

SREcon

SREcon is one of the biggest global conferences, a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale. SREcon has been held annually since 2014, and recently in three regions: Americas, Europe/Middle East/Africa, and Asia/Pacific. SREcon22 Asia/Pacific was the first in-person SREcon in the Asia/Pacific region since the emergence of COVID-19.

There were hundreds of participants in the conference hall (I remember roughly 300 attendees), and they came from a variety of countries. I took a picture in the main hall, showing the atmosphere where many engineers enjoyed this conference in person.

The conference program is awe-inspiring, with a keynote by Brendan Gregg, the author of “Systems Performance,” and other talks by speakers from big tech companies.

My talk

My talk, A Better Way to Manage Stateful Systems: Design for Observability and Robust Deployment, is, as the title says, about a better way of managing stateful systems. Stateful systems hold state that is critical to business requirements, which makes it difficult to secure good observability and build a robust deployment pipeline. For that reason (and others), we prefer to build our system components to be as stateless as possible, which is known as one of the best practices for operating servers these days.

When providing a typical web application, we don’t need stateful servers in most cases. Still, we had an exceptional situation where building a stateful system was more effective for user experience.

We call this system component Autify Connect. It provides functionality that establishes a secure tunnel between customers’ local machines and Autify so that it’s possible to access customers’ private applications under test from Autify.

My talk uses this project as a case study and explains what we’ve learned. I published the presentation material on SpeakerDeck, so you can see what I spoke about at SREcon.

https://speakerdeck.com/hgsgtk/srecon2022apac-design-of-stateful-system

It was challenging for me to speak in front of roughly one hundred participants in the conference room because it had been years since I last gave a presentation in person. I was therefore relieved to hear a big round of applause from the participants after I finished my presentation and the Q&A.

I took a picture right after I finished my presentation. From here, I’ll briefly highlight the key points of my talk.

Autify for Web’s infrastructure

A little background context – Autify is a test automation platform, so it is mission-critical to ensure stable test execution infrastructure for our customers. To support this, there are some primary infrastructure components in Autify for Web:

  • Web server: management console (e.g., create/edit test scenarios)
  • Worker: test execution engine
  • Device farm: Browsers and devices running tests
  • Autify Connect server / client

This diagram shows the high-level infrastructure overview when a customer executes a test.

A customer can trigger test executions via either our management console or the Autify API, and then the web server asks a worker to execute the test. A worker facilitates test executions and sends commands to a device farm. A device sends requests to the system under test while it browses the application.

When customers’ applications under test are not accessible from the public internet, the Connect server and a Connect client are added to the network route to execute tests. Their responsibility is to establish a secure bidirectional tunnel connection with customers’ private networks.

Autify Connect technical requirement

As you can see, Autify Connect is a part of this mission-critical business workflow to enable customers to run tests against their applications on their private networks.

Technically, Autify Connect uses the WebSocket protocol, a typical stateful protocol, for its secure bidirectional tunnel connection. This connection lives for a long time, until a customer disconnects it.

In addition, it behaves as a proxy server to transfer requests from the device farm. A Connect server accepts a proxy request and transfers test connections over WebSocket.

These system components are written in Go and developed from scratch. We ship code frequently into production.

Customers use this service right now to test applications on their private networks. As previously mentioned, being a test automation platform, the stability of test executions is essential for our business.

Therefore, we tried to build an automated, zero-downtime deployment approach since it is a mission-critical system component for customers’ test execution.

An issue with a typical Blue Green Deployment workflow

A commonly used approach for zero-downtime deployment is Blue Green Deployment, a deployment strategy in which we create two separate but identical environments (Blue and Green).

To sum it up briefly, Blue Green Deployment proceeds with the following steps:

  1. Launch and test new servers in the Green environment.
  2. Switch a router so incoming requests go to the Green environment from the original Blue environment.
  3. Keep the original servers in the Blue environment running for a while so that we can roll back to the original quickly.
  4. Terminate the original servers in the Blue environment.

This idea works well without any additional steps in most cases, but there is an issue when the target server is stateful. In the case of Autify Connect, a tunnel connection between a server and a client can live for a long time even after new requests are switched to the Green environment. This means that the original servers in the Blue environment could still hold connections that shouldn’t be disconnected when a Blue Green Deployment tool terminates those servers.

A key point is that terminating a server that has a living state leads to customer-visible failures.

An enhanced Blue Green Deployment workflow

The approach that we adopt in Autify Connect is adding two steps to a typical Blue Green Deployment workflow.

  1. Launch and test new servers in the Green environment.
  2. Switch a router so incoming requests go to the Green environment from the original Blue environment.
  3. Shut down idle sessions (tunnel connections) in the original servers in the Blue environment.
  4. Keep the original servers in the Blue environment running for a while so that we can roll back to the original quickly.
  5. Verify if the original servers in the Blue environment are ready to terminate.
  6. Terminate the original servers in the Blue environment.

The first step we added is to shut down idle sessions (tunnel connections) in the original servers in the Blue environment. It is not realistic to wait for all clients to terminate sessions with an old server, so we should have a proactive approach to shut down long-living states properly and navigate them to a new server.

We shouldn’t terminate busy tunnel connections which are used by customers’ test execution, or customers will see unexpected network errors in their test execution results.

Fortunately, it is possible to shut down these connections safely, because of two attributes of the test execution service.

The first attribute is that a customer does not continue to use this session to run tests all the time. Let’s say I am an Autify customer and want to run test scenarios against my localhost machine. I launch an Autify Connect client on my laptop and establish a new session with the Autify Cloud. Then, the server has a session but this session is not busy until I actually run a test. The second attribute is that a browser doesn’t transmit network requests all the time even when a customer heavily uses Autify Connect (such as for monitoring purposes). Let’s see the below example.

A typical browser automation scenario contains various actions, such as opening a page, inputting a value in a form, finding an element, and asserting that an actual value matches an expected one. Not every action emits a network request. For example, an assertion can be done locally.

The second step we added is verifying whether the Blue environment’s original servers are ready to terminate. A stateful system tends to be a black box, invisible from external systems. Here is a key point: we should design a stateful system so that the necessary details of its internal state are visible. This can be achieved in a few ways, such as logging, infrastructure metrics, and writing state to a file. Still, I recommend implementing an HTTP endpoint to show metrics of internal states, because it is easy to use from external programs, such as a monitoring agent or a script in a deployment pipeline.

Implementation in AWS

Cloud vendors such as AWS offer excellent managed services for Blue Green Deployment, but they were not a fit for our stateful workflow. Therefore, we’ve developed an in-house deployment pipeline. There are lots of options to implement such a custom deployment pipeline, but we decided to use AWS Step Functions and AWS Lambda. AWS Step Functions is a serverless orchestration service that lets you integrate with AWS Lambda functions and other AWS services to build business-critical applications.

The below picture is a masked image that I took of our actual state machine modeled in AWS Step Functions. It is a visual workflow service, so we can see the results more easily.

In-person conference in Sydney

SREcon22 Asia/Pacific was held at Sheraton Grand Sydney Hyde Park, a famous hotel among tourists.

The hotel has big conference rooms, bars, restaurants, and clean hotel rooms. Thanks to USENIX’s special offer to conference attendees, I could book the hotel room at a big discount.

My hotel room was comfortable and an excellent place to rehearse my presentation. I spent much time in this hotel room, including late at night, to finalize the presentation material and manuscript.

The hotel is located near a famous landmark, the Sydney Opera House. It is a 15-minute walk to get there. Sydney is a beautiful city, so I really enjoyed staying there for five days.

The conference offered attendees lots of opportunities to communicate with other participants. Its program had a “Morning Coffee and Tea time,” “Lunch break,” and “Break with Refreshments” every day, which gave us good chances to talk with other participants. There, we could enjoy tasty Australian meals and coffee.

I enjoyed the conversations I had with the other participants. Some attendees encouraged me to be confident in the content of my presentation. Some recommended good places to buy souvenirs, and some let me know their working style in a big tech company. My experience of the event would not have been the same without these memorable conversations, so I truly appreciate them.

On the final day, I asked Liz Fong-Jones to take a picture with me. She is a program co-chair of this conference and one of the authors of the book “Observability Engineering” published by O’Reilly. This book helped me so much to build the logic of my presentation about observability, so I was so happy to tell her that I am a big fan of this book.

Reflection

It was a huge honor for me to stand as a speaker at SREcon. I really appreciate the hard work and dedication of the USENIX conference organizers in creating such an amazing event.

SREcon gave me a lot of confidence in speaking in front of a huge global audience for 20 minutes. I hope I will come back to a future event as a speaker again.