Wednesday, May 15, 2024

Moving over to Medium!

 Thanks for checking out my TODO: Fix This blog!


Please continue over to Medium, where I'll continually update and post content! Thank you!

Tuesday, April 30, 2024

Slashing Latency: How Uber's Cloud Proxy Transformed India's User Experience

In the fast-paced world of globally scaled technology, every millisecond counts. I joined Uber in April of 2015, and if I'm being truly honest, I'm not sure if this story happened at the end of 2015 or in 2016. It all sort of blurs together when you're working at a startup like Uber. For the sake of this story, I'm going to say that it's late 2015. We were already a global company and were constantly under the immense pressure of scale.

One day, while working in the office at 555 Market Street in downtown San Francisco, I overheard a conversation that piqued my interest. If I'm being honest, it sort of pissed me off. A team was discussing the high latency issues faced by our users in India. The latency was so severe that the first fetch alone took over 900 milliseconds to initialize, with requests stacking on top of each other and making the experience miserable. In hindsight, anger may have been a telling sign of my emotional headspace at the time, but that's for another story. But I was angry for our users in India. I was angry they had to wait SECONDS just for the page to load. On top of that, most of India at the time was on 3G networks, often over remote or rural connectivity, and their experience would have been mind-numbing.

Think about that. When most of us load an app that takes longer than 250 ms for its initial data fetch, we usually do one of two things.

  1. Close the app and re-open it, hoping it's a transient issue.
  2. Close the app and never return.
Back at Uber HQ (or what was HQ at the time), I continued to listen to a group of engineers talking about how they wanted to approach this problem. They had been discussing using Squid, Nginx, or even a CDN to cache data at the edge of our data centers... They wanted to cache assets in our data centers... IN THE UNITED STATES! This was simply unacceptable. Anyone who has experienced any type of scale before can tell you this wouldn't even make a dent in terms of actual latency reduction. If I had to guess, it'd be on the order of tens of milliseconds, if that. As I continued to listen, it came up that they had been working on solving this problem for three weeks! Three whole weeks we had known about the issue, and they were still at the drawing board in terms of a solution.

Now, I had worked at Nest Labs prior to Uber and while it wasn't at all the same size in terms of scale, we did face problems with devices connecting from around the world. I knew the main culprit as soon as I heard the engineer outline the problem. The ocean has a lot of dead space in terms of traveling packets, and we're limited by the speed of light when it comes to global latency.

My mind was blown. I was motivated by their lack of progress and confident enough in my Go skills that in 15 minutes, while two engineers and a senior manager brainstormed a solution, I had written a proof of concept tool that would eventually become known as Cloudley, a companion to the service mesh tool known as Muttley. 

I don't have the original code, but it was small, something like 120 lines, and the real driving factor was Go's ReverseProxy. Using it, I was able to establish and cache a connection to the Uber front end, persisting that secure connection and eliminating the need for a 900+ millisecond TLS handshake each time a new request was made.
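A minimal sketch of the idea in modern Go looks something like the following. The upstream URL and transport settings here are illustrative placeholders, not the original Cloudley configuration, but the core trick is the same: terminate the client's connection nearby and reuse warm, long-lived TLS connections to the faraway front end.

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
    "time"
)

func main() {
    // Hypothetical upstream; the real proxy pointed at Uber's front end.
    upstream, err := url.Parse("https://frontend.example.com")
    if err != nil {
        log.Fatal(err)
    }

    proxy := httputil.NewSingleHostReverseProxy(upstream)

    // The key idea: keep a pool of idle, already-established TLS connections
    // to the upstream so each user request reuses a warm connection instead
    // of paying for a fresh cross-ocean TCP + TLS handshake.
    proxy.Transport = &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100,
        IdleConnTimeout:     90 * time.Second,
        TLSHandshakeTimeout: 10 * time.Second,
        ForceAttemptHTTP2:   true,
    }

    // The proxy itself runs at the edge (a POP near the user), so the
    // client's own connection setup happens over a short local round trip.
    log.Fatal(http.ListenAndServe(":8080", proxy))
}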

With this proof of concept built and tested, the goal was now simple: reduce latency and improve the user experience in India. Seven days later, we launched the product, pushing it as close to the end user as possible in a POP (point of presence), and the results were astounding. Latency dropped from 900+ milliseconds to a mere 400 milliseconds. The graph was used in various brown bags and all-hands for a good 8 months, which felt like decades at Uber.


The Science Behind the Speed

To understand the significance of this improvement, let's dive into the math. Our primary ingress at the time was located in California, with a secondary data center in Virginia. The distance from California to India is approximately 13,000 kilometers. The speed of light is around 300,000 km/s, but packets don't actually travel that fast: propagation through fiber plus the 3G last mile that most of India was on at the time brings the effective speed down to roughly 133,333 km/s. With that, we can calculate the round-trip time (RTT) as follows:

RTT = (13,142 km / 133,333 km/s) * 2 ≈ 197ms

Now, let's factor in connection setup. Establishing a secure connection means a TCP three-way handshake followed by the TLS negotiation, roughly three round trips across the ocean in total. That translates to a p50 (median) latency of roughly 600 milliseconds just to establish the TLS connection. Add in the final round trip carrying the actual data, and you get a total latency of around 800 milliseconds.
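Putting that together with the ~197 ms RTT from above, the back-of-the-envelope breakdown looks like this:

Connection setup ≈ 3 × 197 ms ≈ 590 ms
First request + response ≈ 1 × 197 ms ≈ 197 ms
Total ≈ 790 ms, or roughly 800 ms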

[Image credit: Halub3, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons]



The Power of Go and Edge Computing

Our cloud proxy leveraged Go's reverse proxy functionality to establish a secure TLS connection to Uber's frontend. By deploying the proxy onto cloud providers right at the edge in India, we effectively eliminated the need for multiple round trips across the ocean. This simple yet powerful solution slashed nearly 600 milliseconds from each new request, resulting in a dramatically improved user experience.

While I've since pivoted my career toward AI risk and security, I'm thankful for this (and many other) experiences I had at Uber. This was just one of many examples of the ingenuity we developed, and it will always be a fond memory for me.


Sunday, January 21, 2024

DEF CON 31 AIV Post Mortem

 DEF CON is always a flurry of activity, pushing technology to its limits and often revealing unforeseen challenges. This year, amidst all the buzz, there was a notable mention of the platform in Sven Cattell's "Generative Red Team Recap." While the recap provided an extensive overview, as one of the primary maintainers of the platform during the event, I'd like to address and clarify a few points, particularly concerning the network challenges and our response.



We observed high latency issues on August 11th between 12:00 and 14:30 PDT. Given the high stakes and our commitment to seamless service, our team quickly sprang into action. We implemented and deployed a reverse proxy, written in Go, and the latency and throttling issues within the DEF CON network vanished almost immediately. This improved performance was consistent across various test locations, including on-site at DEF CON, two locations in the Bay Area (San Francisco and Santa Cruz), and Argentina.


However, later in the day, around 15:30, there was a complete network outage lasting approximately 15 minutes. As the primary observer of our network activity, I noticed signs pointing to internal throttling within the DEF CON network. External factors remained consistent, leading to speculation best reserved for another discussion.


To highlight our system's robustness during this event, we successfully handled 134.5k requests on the first day. Of these, 27k were direct vendor requests to Large Language Models (LLMs). A minuscule 289 of these vendor requests resulted in a 4xx error, indicating client or request issues, while only 64 led to a 5xx error, implying vendor-side server problems. Of the 134.5k requests our service received, a total of 390 resulted in 5xx errors, giving us an uptime of 99.66% on the first day. It's worth noting that a bug in our system caused duplicate email entries to trigger a 500 error; accounting for that known bug, our availability against genuinely unhandled errors was 99.947%.



One of the pain points ended up being the physical referral codes used in part to gate access to the event. While these codes were pre-generated and intended to be single-use per attendee, some ended up being recycled and handed out to several attendees. This caused errors during signup, potentially contributing to the overall error rate and degradation in SLA, but more importantly, it eroded the user experience and caused confusion among attendees.


Another pain point was an error believed to have leaked user credentials. During the DEF CON network throttling at approximately 15:30 PDT mentioned above, some elements of the user experience failed to load for a small number of users, causing other elements, such as forms, to fall back to their default behavior, which included submitting GET requests with user event information in the query path. Because of the TLS proxy between DEF CON and the platform's backend, there is little to no chance outside parties could have accessed this information for the few users affected, and all logs that may have contained it have been erased. If this were to happen again, however, credentials could end up in the web server access logs, something we should be aware of and check for.
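One safeguard worth keeping in place for next time, sketched below in Go with hypothetical parameter names (this is not the platform's actual code), is scrubbing sensitive query parameters before a request is ever written to an access log:

package main

import (
    "log"
    "net/http"
    "net/url"
)

// sensitiveParams lists query parameters that should never appear in logs.
// These names are hypothetical placeholders.
var sensitiveParams = []string{"token", "email", "referral_code"}

// redactQuery returns the URL as a string with sensitive parameters masked.
func redactQuery(u *url.URL) string {
    q := u.Query()
    for _, p := range sensitiveParams {
        if q.Has(p) {
            q.Set(p, "REDACTED")
        }
    }
    clean := *u
    clean.RawQuery = q.Encode()
    return clean.String()
}

// loggingMiddleware logs each request using the redacted URL rather than the
// raw request line, so credentials in a query string never reach the logs.
func loggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        log.Printf("%s %s", r.Method, redactQuery(r.URL))
        next.ServeHTTP(w, r)
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    log.Fatal(http.ListenAndServe(":8080", loggingMiddleware(mux)))
}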


Even with these pain points, day one ran effectively, with a large number of attendees greatly enjoying their experience in the GRT Challenge. Day two ran even more smoothly, with none of the previously encountered issues, a testament to our team's swift and effective response.



DEF CON 31 was an intense, learning-filled experience. We remain committed to pushing boundaries, innovating rapidly in the face of challenges, and ensuring the best service possible.
