Names changed to protect the dead and innocent.
The bright afternoon sun shines into my open office. It's 2:05 PM, I notice, as I stare blankly at my laptop.
“It’s gone,” says Fred, one of the senior DevOps engineers.
“What’s gone?” I ask.
“ALL OF IT,” Fred replies.
I fire up Chrome and type https://player.bigcorp.tv. Nothing. A white background in small Monaco font reads: “Server Error”. I press F12 and check developer tools. The status code column reads a singular 502. Our front-end is served from one service and the backend API, for which I am responsible, is served from another. Using a terminal, I check the backend with a curl command to see if I can hit publicly accessible data. Another 502. I quickly type https://internal.bigcorp.tv/status into the location bar. Our internal status page is also throwing the now-familiar “Server Error.”
“Where is Logan?” I ask.
“He left for lunch,” replies Fred.
We log in to the root AWS account and start checking our super resilient, uncrashable Mesos infrastructure that was costing up to $100,000 a month. Mesos has worker nodes that require a connection to the control plane nodes to operate. Fred explains that the control plane nodes, running on virtual machines in AWS, had been corrupted and rebuilt that morning. The reason our infrastructure was uncrashable was that our lead DevOps engineer, Ryan, had extra control plane nodes running in Mesos itself. This self-hosted model was supposed to let us always spin up Mesos masters in such an event. The DevOps Manager, Logan, was away at lunch.
The control plane running in Mesos had split-brained and was unable to take over the leader role when the AWS nodes were rebuilt. Fred had thought restarting the self-hosted nodes would quickly synchronize all the control plane nodes, and that if anything went wrong we had backups of the control plane data. When he restarted them, unbeknownst to Fred, they lost the only good copy of the master data. Docker containers have ephemeral disk space unless a volume is mounted from their host. “The last backup was 4 months ago,” Fred informs me. “Does Will know?” I ask. Will is my boss and the Director of our Digital TV Product.
“IF THIS IS NOT UP IMMEDIATELY THEY ARE GOING TO FIRE ME AND ALL OF YOU,” yells Will. As usual, the prospect of Will not earning income while his high-profile attorney wife cucks him is first among his concerns. “WHERE THE FUCK ARE LOGAN AND BRADEN?” Braden is the mastermind of this Mesos architecture and could likely fix it. However, Braden is spending his vacation visiting Burning Man for the first time. His boss Logan is MIA at lunch. Will gives Logan a call and informs him the entire platform is down for customers and for every internal media management employee at the company.
2:45 PM.
Logan enters the office and scowls at me. Surprisingly, he walks straight to where the action is. I half-expected him to retreat to the DevOps room and tinker with his modular synthesizer, letting everyone else clean up the mess. I’ve had multiple public fights with him about the cost of running this Mesos Frankenstein. My budget partially pays for the $100,000-a-month infrastructure. “The point of a container orchestration platform is to scale down so we can spend LESS money,” I would scream after noting the monthly bill had gone up more than 10x after migrating to his Mesos infrastructure. Now, despite how awesome the Mesos platform is, and how much all my teams would love it, it is down.
By 4:00 PM, afternoon bar patrons, I mean alcoholics, have been without bar TV for almost 3 hours. The only thing Logan has accomplished is pacing nervously around the office and occasionally breathing stale, hungover breath into Fred’s ear. The head of North America is now hovering, along with Will, my boss. Unfortunately, the mobile and TV apps are crashing every time you tap or click the TV icon. Our platform also provides digital TV to a European TV product. Europe is asleep by now, but the customers, and our European execs, will definitely start calling by morning. If you have never been yelled at by a German exec, it’s just as terrifying as old WW2 clips would have you believe. Everyone looks increasingly nervous. “We can just run the services on bare AWS without Mesos,” I suggest.
Logan had finally made contact with Braden via text. Logan told him the Mesostein was in jeopardy and that if they didn’t fix it, he would certainly lose his job. Even though Braden had not slept the night before, and may or may not have had pharmaceutical assistance with not sleeping, he decided to hurriedly leave Burning Man and drive the 8 hours back to the office, targeting arrival before 11 PM.
The backend is a simple Golang app which is easy to run with a single command. I demo to Will and the NA lead how I can route the DNS directly to the box via auto scaling groups, which gets us scaling out of the box. The database for the backend runs in RDS and was not impacted, so this works, and we see TV show titles and M3U8 playlist URLs in a JSON blob. At 4:15 PM we have a strategy.
It didn’t take long to get the front-end application, a simple Node.js app, running again. When we hit the staging URL, the site was back up, though its internal management tools were not. At 5:15 PM we shout “HOORAY!” The site is back up. Unfortunately, the service which serves the M3U8 playlists responsible for playing the video is a Java web service, whose lead has just left the company. In parallel, the video playout team had been trying to get that service running, but they are not familiar with the dark arts of the Linux CLI and the AWS console. Fred and the DevOps team are still trying to jolt Frankenmesos back to life, so they are of little help.
A new engineer on the video team, frustrated with the verbosity and complexity of the service, had built a skunkworks project which generated M3U8 playlists for video. It was missing the advertising stitching capability, but it would play video, without ads, if you pointed our video player at it. We demoed this to Will. “We can just change the M3U8 URL in the database from the non-working video service to this service,” we say. “But it’s completely untested,” Will says. “Yeah, but no video works. What do we have to lose?” I reply. “Fuck it,” he says.
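For anyone who has never looked inside one: an M3U8 playlist is just a plain-text file that lists video segments for the player to fetch in order. The sketch below is not the junior engineer's code (the post never shows it); it is a minimal Python illustration of the general idea, and the function name `build_media_playlist` and the segment file names are made up for the example.

```python
import math


def build_media_playlist(segments):
    """Return HLS media playlist text for a list of (uri, duration_seconds) pairs."""
    if not segments:
        raise ValueError("need at least one segment")

    # The target duration is an integer; rounding the longest segment up
    # is a safe choice because no segment can then exceed it.
    target = math.ceil(max(duration for _, duration in segments))

    lines = [
        "#EXTM3U",                          # required first line of every playlist
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for uri, duration in segments:
        lines.append(f"#EXTINF:{duration:.3f},")  # duration of the next segment
        lines.append(uri)                          # where the player fetches it
    lines.append("#EXT-X-ENDLIST")                 # marks a finished, on-demand playlist
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical segment names, purely for illustration.
    print(build_media_playlist([("seg0.ts", 9.5), ("seg1.ts", 4.0)]))
```

A live channel would leave off `#EXT-X-ENDLIST` and keep advancing the media sequence as new segments appear, and ad stitching (the part the skunkworks service lacked) would splice extra segments into that list.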
We spent the next 4 hours spinning up new services by hand on AWS, SSHing into them on public IP addresses and running the video service with nohup. Around 7 PM I receive a phone call from Braden. He tells me to check my text messages. I look and see a message he’s sent with a picture. The picture shows a small BMW hatchback connected to a small U-Haul. Both are totally destroyed. The contents of the U-Haul are strewn about the highway, complete with a bunny onesie, with California mountains in the background.
Midnight.
We were now watching video on our player, streaming from a completely untested service. The database was updated with the new service URL just as our European customers started waking up to watch. The video team was now free to go home, and the DevOps team was directed to help them get the old video service running in the morning.
At 7 AM I arrive back in the office with a $5 pour-over coffee in hand. The video team beat me there. They were still trying to save face and get their application working after being bested by a junior engineer and a weekend project. Despite the panic of the previous day, they were close to getting back up and running.
I take the team and Fred out for a takeout lunch from a food truck. We pass Logan as we approach the second office building. “I’m out,” he says. “Leaving already after yesterday?” I ask, but shit like this isn’t surprising for him. “They fired me,” he replies. “I guess we will always have Mesos,” I say.
“Hey, come over into my office.” Will catches me off guard holding my butter chicken, which is burning my hand, but soon to burn my asshole. “How would you like to run the DevOps team as well?” he says. “Only if we can delete that hellspawn of an infrastructure,” I reply.
Our next month’s bill came to $7,000, a savings of $93,000 a month.
submitted by /u/Impressive_Act5198 to r/cscareerquestions