Pictured: Ms. Majors near Linden Lab's erstwhile Brighton, UK office. Yes, her shirt says "Will work for L$."
Charity Majors has an excellent blog post on creating culture and rituals within a tech company's engineering team, and she knows whereof she writes. She's CTO and co-founder of Honeycomb, a leading observability provider -- i.e., the service your company uses if it has a highly complex computer network and you want to see, in real time and across the whole system, what's working well and what's not. Charity cut her teeth at Linden Lab -- a company with an insanely, wackily complex computer network -- and shares some of the insider rituals that the Linden engineers practiced to keep up morale and keep Second Life online and running as smoothly as (relatively) possible.
Which brings us to Shrek Ears:
SHREK EARS
We had a matted green felt headband with ogre ears on it, called the Shrek Ears. The first time an engineer broke production, they would put on the Ears for a day. This might sound unpleasant, like a dunce cap, but no — it was a rite of passage. It was a badge of honor! Everyone breaks production eventually, if they’re working on something meaningful.
If you were wearing the Shrek Ears, people would stop you throughout the day and excitedly ask what happened, and reminisce about the first time they broke production. It became a way for 1) new engineers to meet lots of their teammates, 2) to socialize lots of production wisdom and risk factors, and 3) to normalize the fact that yes, things break sometimes, and it’s okay — nobody is going to yell at you.
So that's an inside insight for Second Life users: whenever the virtual world was offline or constantly crashing, chances are someone in San Francisco was wandering around the office with green felt ears on their head. Those ears came out, Charity tells me, for "literally every outage."
But it wasn't a mandatory hazing ritual, she quickly adds: "The spirit of it was always very light hearted and fun. It wasn't to shame people, it was to congratulate people. You took down production? Congratulations, you're finally one of us now. Our thesis was that if you never took down prod, you probably weren't moving fast enough or taking enough risks. (Second Life was a crazy complex distributed system, even by today's standards, at a time before devops or IaaC or any modern tooling. The world was [a] much more fragile one.)"
I remember Shrek Ears well, because wearing them began as a ritual during my own time at Linden Lab, from 2003 to 2006. In fact, they spread beyond the engineering team, so that random people across the company would wear them after they goofed up in their own department. (I even put them on for a while after I messed up a New World Notes detail, but I don't think anybody noticed.)
Speaking of Second Life downtimes, Charity told me about some major incidents from her time on the engineering team, from 2004 to 2010. Strap in, we're getting into the technical weeds, sharing a Shrek Ears photo, and talking about "Transactpocalypse" too:
Pictured: Fellow Linden alumna Erica Firment in 2008, showing off the Shrek Ears: "I messed up a Subversion commit and overwrote Callum [Linden]'s code. I wore the Shrek ears for a couple hours as penance." (Via her Flickr)
"There was the time we did the MYSQL 4.1 - 5.0 upgrade and ended up having to roll back and lose a whole day's worth of updates, because the performance was so terrible. (I spent a year after that working on MYSQL performance testing tools trying to de-risk the upgrade).
"Another fun thing from the MYSQL downtime was the fact that the world literally could never come back up again after going down completely... because people trying to log in would DoS the login service and mysql.agni. That's what led to our developing the 'velvet rope' process to slowly and selectively let people log back in in waves.
"There was, what did we call it, 'transactpocalypse' or something? When we noticed the transaction database had an auto-increment column that was just days away from running out of integers and taking the whole world down indefinitely."
The Lindens fixed that, she says, "By creating a new column with a different integer type, then copying the contents from the old column to the new -- first the entire thing, then syncing the changes since you began the copy in smaller and smaller incremental passes, then briefly locking the table and moving the old transaction column to a temporary name, and the new one to the old name. Nowadays you can do fancy 'online migrations' for your tables, but then we had to do it all by hand.
"As far as the users were concerned, the only impact was about two seconds of write errors -- effectively unnoticeable. But if we had run out, the system would likely have been down for days (that's how long it took to copy the column, it was enormous and these were the days of spinning rust -- pre-SSD)." Fortunately, the world was saved by the sharp eyes of a Linden engineer named Ryan, who quickly came up with a solution.
Read Charity's full post here, and marvel at the rituals needed to keep a fully 3D, user-created virtual world, accessed by some 1 million people a month, from becoming a chaos of offline data.
Very entertaining behind the scenes look, thanks Hamlet. Love her shirt.
Posted by: Valentina kendal | Friday, September 02, 2022 at 05:34 AM
Reading this just a year later, it made me smile. This should give the strongest anti-LL naysayers — claiming that LL is made of sheer incompetents from top to bottom — some room for thought.
As I love to say: it's impossible to have a simple GUI for what is an extremely complex virtual environment beneath. As it is, the SL Viewer is merely a pale reflection of the layers upon layers of complexity that keep Second Life running as smoothly as possible.
The very few who have, indeed, experienced what it feels like are current OpenSimulator grid operators. Granted, OpenSimulator is not Second Life (not by far!), and you'd be hard-pressed to find a grid with more than a few tens of thousands of monthly active users, with a few hundred of those actively logged in (OpenSimulator deals very easily with such low demands). But it should give you pause for thought, as you start looking at what exactly happens to one simulator when just one avatar is logged in to it, and not even doing much, beyond moving their camera around...
Now look at the logs and whatever measurements you use to figure out what's going on with your system, and multiply it by 60,000 (roughly the number of simultaneous users that Second Life has every day). Then you can start to get a feeling of what it takes to "run" a platform such as the Second Life Grid, and you start to appreciate those Lindens working hard every day to keep the grid going a lot more...
Posted by: Gwyneth Llewelyn | Tuesday, October 10, 2023 at 02:11 AM