Thanks for checking out my TODO: Fix This blog!
Please continue over to Medium, where I'll continually update and post content! Thank you!
Just another engineer wrestling with AI. From Silicon Valley startups to congressional showcases, I’m hacking away at the future of responsible AI, one post at a time.
In the fast-paced world of globally scaled technology, every millisecond counts. I joined Uber in April of 2015, and if I'm being truly honest, I'm not sure if this story happened at the end of 2015 or in 2016. It all sort of blurs together when you're working at a startup like Uber. For the sake of this story, I'm going to say that it's late 2015. We were already a global company and were constantly under the immense pressure of scale.
One day, while working in the office at 555 Market Street in downtown San Francisco, I overheard a conversation that piqued my interest. If I'm being honest, it sort of pissed me off. A team was discussing the high latency issues faced by our users in India. The latency was so severe that the first fetch alone took over 900 milliseconds, with each subsequent request stacking on top of it and making the experience miserable. In hindsight, anger may have been a telling sign of my emotional headspace at the time, but that's for another story. I was angry for our users in India. I was angry they had to wait SECONDS just for the page to load. On top of that, most users in India at the time were on 3G networks, often in remote or rural areas, and experiencing the app that way would have made my mind go numb.
Think about that. When an app takes longer than 250 milliseconds to fetch its initial data, most of us will do one of two things: back out and retry, or abandon the app altogether.
To understand the significance of this improvement, let's dive into the math. Our primary ingress at the time was located in California, with a secondary data center in Virginia. The distance from California to India is approximately 13,000 kilometers. Considering the speed of light (around 300,000 km/s) and the fact that India was primarily on 3G networks at the time (which reduces the effective propagation speed to about 133,333 km/s), we can calculate the round-trip time (RTT) as follows: 2 × 13,000 km ÷ 133,333 km/s ≈ 195 milliseconds per round trip.
Now, let's factor in connection setup. TCP's three-way handshake costs one round trip, and the TLS 1.2 handshake adds two more, so three round trips cross the ocean before any data flows. That translates to a p50 (median) latency of roughly 600 milliseconds just to establish a secure connection. Add the final round trip that actually carries data, and you get a total latency of around 800 milliseconds.
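The arithmetic above can be sketched directly. Note the 13,000 km distance and the 133,333 km/s effective speed are the rough estimates from this post, not measured values:

```go
package main

import "fmt"

func main() {
	const distanceKM = 13000.0 // California to India, one way (estimate)
	const speedKMps = 133333.0 // assumed effective propagation speed over 3G-era paths

	rtt := 2 * distanceKM / speedKMps * 1000 // one round trip, in milliseconds
	setup := 3 * rtt                         // TCP handshake (1 RTT) + TLS 1.2 handshake (2 RTTs)
	firstData := setup + rtt                 // plus the request/response that finally carries data

	fmt.Printf("round trip:       ~%.0f ms\n", rtt)       // ~195 ms
	fmt.Printf("connection setup: ~%.0f ms\n", setup)     // ~585 ms
	fmt.Printf("first data:       ~%.0f ms\n", firstData) // ~780 ms
}
```

Those three round trips of pure setup are what the edge proxy described below eliminates.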
Halub3, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
Our cloud proxy leveraged Go's reverse proxy functionality to establish a secure TLS connection to Uber's frontend. By deploying the proxy onto cloud providers right at the edge in India, we effectively eliminated the need for multiple round trips across the ocean. This simple yet powerful solution slashed nearly 600 milliseconds from each new request, resulting in a dramatically improved user experience.
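A minimal sketch of that kind of edge proxy, using the Go standard library's `httputil.ReverseProxy`. The upstream URL, the certificate paths, and the `newEdgeProxy` helper are placeholders of mine, not Uber's actual configuration:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newEdgeProxy builds a reverse proxy that forwards requests to the
// given upstream. The proxy's pooled transport reuses TLS connections
// to the upstream, so the cross-ocean handshake is paid once, not on
// every client request.
func newEdgeProxy(upstream string) (*httputil.ReverseProxy, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

func main() {
	// Hypothetical upstream; the real frontend address isn't public.
	proxy, err := newEdgeProxy("https://frontend.example.com")
	if err != nil {
		log.Fatal(err)
	}
	// Terminate client TLS at the edge (placeholder certificate files).
	log.Fatal(http.ListenAndServeTLS(":443", "edge.crt", "edge.key", proxy))
}
```

Because clients now complete their TCP and TLS handshakes against a server a few milliseconds away, only the already-established, long-lived upstream connection crosses the ocean.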
While I've since pivoted my career toward AI Risk and Security, I'm thankful for this (and many other) experiences I had at Uber. This was just one of many examples of the ingenuity we developed, and it will always be a fond memory for me.
DEF CON is always a flurry of activity, pushing technology to its limits and often revealing unforeseen challenges. This year, amidst all the buzz, there was a notable mention of the platform in Sven Cattell's "Generative Red Team Recap." While the recap provided an extensive overview, as one of the primary maintainers of the platform during the event, I'd like to address and clarify a few points, particularly concerning the network challenges and our response.
We observed high latency issues on August 11th between 12:00 and 14:30 PDT. Given the high stakes and our commitment to seamless service, our team quickly sprang into action. We implemented and deployed a reverse proxy, written in Go, and the latency and throttling issues within the DEF CON network vanished immediately. The improved performance was consistent across our test locations: on-site at DEF CON, two locations in the Bay Area (San Francisco and Santa Cruz), and Argentina.
However, later in the day, around 15:30, there was a complete network outage lasting approximately 15 minutes. As the primary observer of our network activity, I noticed signs of internal throttling within the DEF CON network. External factors remained consistent, which led to some speculation best reserved for another discussion.
To highlight the system's robustness during this event, we successfully handled 134.5k requests on the first day. Of these, 27k were direct vendor requests to Large Language Models (LLMs). A minuscule 289 of those vendor requests returned a 4xx error, indicating client or request issues, while only 64 returned a 5xx error, implying vendor-side server problems. Of the 134.5k requests our service received, 390 in total resulted in 5xx errors, giving us an uptime of 99.66% on the first day. It's worth noting that a bug in our system caused duplicate email entries to trigger a 500 error; accounting for that, our availability against genuinely unhandled errors was 99.947%.
One of the pain points of the event ended up being the physical referral codes used in part to gate access to the event. While these codes were pre-generated and meant to be single-use per attendee, some ended up being recycled and given out to several attendees. This caused errors during signup, potentially contributing to the overall error rate and degradation in SLA, but more importantly, it eroded the user experience and caused confusion among attendees.
Another pain point was an error that was believed to have leaked user credentials. Due to the network throttling within DEF CON at approximately 15:30 PDT mentioned above, some elements of the user experience failed to load for a small number of users, causing other elements, such as forms, to fall back to their default behaviors. This included forming GET requests with user event information in the query string. Because of the TLS proxy between DEF CON and the platform's backend, there is little to no chance outside parties could have accessed this information for the few users affected, and all logs that may have included it have been erased. The lasting lesson is that, were this to happen again, credentials could end up in the web server access logs, something we should be aware of and check for.
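One cheap mitigation for that class of leak, sketched here under the assumption of a Go web server (the `redactQuery` helper is mine, not the platform's actual logging code), is to scrub query-string values before request URLs ever reach the access logs:

```go
package main

import (
	"fmt"
	"net/url"
)

// redactQuery replaces every query-string value in a request URL with a
// placeholder, so form fallbacks that put sensitive data in GET requests
// don't persist it in access logs. Paths without a query pass through.
func redactQuery(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.RawQuery == "" {
		return raw
	}
	q := u.Query()
	for key := range q {
		q.Set(key, "REDACTED")
	}
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	// Keys survive for debugging; values are scrubbed before logging.
	fmt.Println(redactQuery("/signup?user=alice&code=abc123"))
}
```

Keeping the keys while dropping the values preserves enough signal to debug an incident like this one without retaining what the forms actually submitted.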
Even with these pain points, day one ran effectively, and a large number of attendees greatly enjoyed their experience in the GRT Challenge. Day two ran even more smoothly, with none of the earlier issues recurring, a testament to our team's swift and effective response.
DEF CON 31 was an intense, learning-filled experience. We remain committed to pushing boundaries, innovating rapidly in the face of challenges, and ensuring the best service possible.