Fault Tolerant Router

Fault Tolerant Router is a daemon, running in the background on a Linux router or firewall, that monitors the state of multiple internet uplinks/providers and changes the routing accordingly. LAN/DMZ internet traffic (outgoing connections) is load balanced between the uplinks using Linux multipath routing. The daemon monitors the state of the uplinks by routinely pinging well-known IP addresses (Google public DNS servers, etc.) through each outgoing interface: once an uplink goes down, it is excluded from the multipath routing; when it comes back up, it is included again. The administrator is notified of all routing changes by email.

via Fault Tolerant Router.
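
A minimal sketch of the monitoring loop that description implies, in Python rather than the project's own code: ping a reference address out of each uplink interface, then rebuild the multipath default route from whichever uplinks respond. The interface names, gateways, ping targets, and the use of iproute2's "ip route replace ... nexthop" are illustrative assumptions, not the daemon's actual implementation.

    import subprocess
    import time

    # Hypothetical uplinks: (interface, gateway) pairs -- adjust to the real setup.
    UPLINKS = [("eth1", "203.0.113.1"), ("eth2", "198.51.100.1")]
    TEST_IPS = ["8.8.8.8", "8.8.4.4"]   # well-known addresses to ping
    INTERVAL = 30                        # seconds between checks

    def uplink_alive(interface):
        """An uplink counts as up if any test IP answers one ping sent out that interface."""
        for ip in TEST_IPS:
            result = subprocess.run(
                ["ping", "-c", "1", "-W", "2", "-I", interface, ip],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            if result.returncode == 0:
                return True
        return False

    def apply_multipath(alive):
        """Rebuild the multipath default route from the uplinks that are currently up."""
        cmd = ["ip", "route", "replace", "default", "scope", "global"]
        for interface, gateway in alive:
            cmd += ["nexthop", "via", gateway, "dev", interface, "weight", "1"]
        subprocess.run(cmd, check=True)

    previous = None
    while True:
        alive = [u for u in UPLINKS if uplink_alive(u[0])]
        if alive and alive != previous:
            apply_multipath(alive)   # a real daemon would also email the administrator here
            previous = alive
        time.sleep(INTERVAL)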

Snowflake-shaped networks are easiest to mend

They found the best networks are made from partial loops around the units of the grid, with exactly one side of each loop missing. All of these partial loops link together, back to a central source. These have a low repair cost because if a link breaks, the repair simply involves adding back the missing side of a loop. What’s more, they are resistant to multiple breaks over time, as each repair preserves the network’s fundamental design.

via Snowflake-shaped networks are easiest to mend – tech – 03 October 2014 – New Scientist.

Tension and Flaws Before Health Website Crash

Thanks to a huge effort to fix the most obvious weaknesses and the appointment at last of a single contractor, QSSI, to oversee the work, the website now crashes much less frequently, officials said. That is a major improvement from a month ago, when it was up only 42 percent of the time and 10-hour failures were common. Yet an enormous amount of work remains to be done, all sides agree.

via Tension and Flaws Before Health Website Crash – NYTimes.com.

Systems like this should require five nines of availability from the beginning. This means the system should be operationally up 99.999% of the time, which allows for only about 5.3 minutes of downtime per year. I suspect companies like Amazon, Facebook, and Google meet this standard for high availability. There are all kinds of methods and tricks for achieving this that have been learned over the past century in telecommunication systems.
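
For reference, the arithmetic behind the nines is simple: allowed downtime is the fraction of a year the system may be unavailable. A quick sketch of the usual figures (using a 365-day year):

    # Allowed downtime per year for common availability targets.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
        downtime = MINUTES_PER_YEAR * (1 - availability)
        print(f"{nines} nines ({availability:.5%}): {downtime:8.2f} minutes/year")

Five nines works out to roughly 5.26 minutes of downtime a year; three nines allows nearly nine hours.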

In the last week of September, the disastrous results of the project’s inept management and execution were becoming fully apparent. The agency pressed CGI to explain why a performance test showed that the site could not handle more than 500 simultaneous users. The response once again exhibited the blame-shifting that had plagued the project for months.

Cloud Providers Work To Disperse Points Of Failure

In the end, cloud providers — many of which aim for 99.9 percent uptime, or “three nines” — are likely to offer individual companies a more reliable service than those companies attain for themselves, the CSA’s Howie says.

via Cloud Providers Work To Disperse Points Of Failure – Dark Reading.

Note that telecom systems typically operate at five nines of uptime. The point of this article may be that end users need to implement their own backup plans to get more than the three nines of reliability a single provider offers.
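
One way to read "implement their own backup plans": if the same service runs on two providers that fail independently, it is down only when both are down at once, so the unavailabilities multiply. A rough sketch of that calculation follows; the big assumption is independence, since correlated failures (shared DNS, shared region, shared software bug) break it.

    # Combined availability of redundant, independently failing providers:
    # the service is down only when every provider is down at the same time.
    def combined_availability(*availabilities):
        downtime_probability = 1.0
        for a in availabilities:
            downtime_probability *= (1 - a)
        return 1 - downtime_probability

    single = 0.999                                  # "three nines" from one cloud provider
    paired = combined_availability(0.999, 0.999)    # two independent three-nines providers
    print(f"one provider:  {single:.6f}")           # 0.999000
    print(f"two providers: {paired:.6f}")           # 0.999999 -- about six nines, if truly independent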

Mars Rover Curiosity in Safe Mode After Computer Glitch

The issue cropped up Wednesday (Feb. 27), when the spacecraft failed to send its recorded data back to Earth and did not switch into its daily sleep mode as planned. After looking into the issue, engineers decided to switch the Curiosity rover from its primary “A-side” computer to its “B-side” backup on Thursday at 5:30 p.m. EST (22:30 GMT).

via Mars Rover Curiosity in Safe Mode After Computer Glitch | Space.com.

NYC Data Centers Struggle to Recover After Sandy

The fight now is to keep those generators fueled while pumps clear the basement areas, allowing the standard backup generators to begin operating. It’s also unclear whether the critical elements of infrastructure (power and communications) will both be up and running in time to restore services.

Below is a list of some of the data centers and services in the area, and how they’re faring:

via NYC Data Centers Struggle to Recover After Sandy.

Pirate Bay Moves to The Cloud, Becomes Raid-Proof

“If one cloud-provider cuts us off, goes offline or goes bankrupt, we can just buy new virtual servers from the next provider. Then we only have to upload the VM-images and reconfigure the load-balancer to get the site up and running again.”

via Pirate Bay Moves to The Cloud, Becomes Raid-Proof | TorrentFreak.

The load balancer and transit-routers are still owned and operated by The Pirate Bay, which allows the site to hide the location of the cloud provider. It also helps to secure the privacy of the site’s users.
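The site's actual stack isn't described beyond "load balancer" and "VM images," but the failover step they outline, repointing the load balancer at VMs bought from a replacement provider, can be sketched as a small script that regenerates an nginx-style upstream block and reloads the proxy. The backend addresses, config path, and the choice of nginx here are all hypothetical, purely to illustrate how little has to change when the provider changes.

    import subprocess

    # Hypothetical VM addresses obtained from the replacement cloud provider.
    NEW_BACKENDS = ["203.0.113.10:80", "203.0.113.11:80"]
    UPSTREAM_CONF = "/etc/nginx/conf.d/upstream.conf"   # assumed path

    def repoint_load_balancer(backends):
        """Rewrite the upstream block to the new provider's VMs and reload the proxy."""
        lines = ["upstream site_backend {"]
        lines += [f"    server {addr};" for addr in backends]
        lines.append("}")
        with open(UPSTREAM_CONF, "w") as f:
            f.write("\n".join(lines) + "\n")
        subprocess.run(["nginx", "-s", "reload"], check=True)

    repoint_load_balancer(NEW_BACKENDS)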

That smooth SpaceX launch? Turns out one of the engines came apart

The Falcon 9, as its name implies, has nine engines, and is designed to go to orbit if one of them fails. On-board computers will detect engine failure, cut the fuel supply, and then distribute the unused propellant to the remaining engines, allowing them to burn longer. This seems to be a case where that was required, and the computers came through. The engines are also built with protection to limit the damage in cases where a neighboring engine explodes, which appears to be the case here.

via That smooth SpaceX launch? Turns out one of the engines came apart | Ars Technica.
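
The engine-out logic described above amounts to a propellant-budget trade: with a fixed propellant load and roughly constant flow per engine, shutting down one of nine engines cuts thrust and total flow to 8/9 of nominal, so the remaining engines can burn about 9/8 as long. A back-of-the-envelope sketch, idealized and ignoring gravity losses and trajectory effects; the nominal burn time below is an assumed placeholder, not a Falcon 9 figure:

    # Idealized engine-out arithmetic for a nine-engine first stage.
    engines_total = 9
    engines_after_failure = 8
    nominal_burn_time = 170.0   # seconds -- assumed placeholder, not an actual Falcon 9 number

    # Fixed propellant, constant flow per engine: burn time scales inversely with engine count.
    extended_burn_time = nominal_burn_time * engines_total / engines_after_failure
    thrust_fraction = engines_after_failure / engines_total

    print(f"thrust after shutdown: {thrust_fraction:.1%} of nominal")             # 88.9%
    print(f"burn time: {nominal_burn_time:.0f} s -> {extended_burn_time:.0f} s")  # 170 s -> 191 s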

Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more

An outage of Amazon’s Elastic Compute Cloud in North Virginia has taken down Netflix, Pinterest, Instagram, and other services. According to numerous Twitter updates and our own checks, all three services are unavailable as of Friday evening at 9:10 p.m. PT.

via Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more | VentureBeat.

With the critical Amazon outage, which is the second this month, we wouldn’t be surprised if these popular services started looking at other options, including Rackspace, SoftLayer, Microsoft’s Azure, and Google’s just-introduced Compute Engine. Some of Amazon’s biggest EC2 outages occurred in April and August of last year.