Fault Tolerant Router is a daemon that runs in the background on a Linux router or firewall, monitoring the state of multiple internet uplinks/providers and adjusting the routing accordingly. Outgoing LAN/DMZ internet traffic is load balanced across the uplinks using Linux multipath routing. The daemon monitors the state of the uplinks by routinely pinging well-known IP addresses (Google public DNS servers, etc.) through each outgoing interface: once an uplink goes down, it is excluded from the multipath routing; when it comes back up, it is included again. The administrator is notified of every routing change by email.
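Fault Tolerant Router itself is a Ruby project; the following is only a rough Python sketch of the monitoring loop described above. The interface names, gateways, and test IPs are assumptions, and the sketch prints the multipath `ip route` command it would use instead of applying it.

```python
#!/usr/bin/env python3
"""Rough sketch of an uplink monitor in the spirit of Fault Tolerant Router.

Assumptions (not taken from the project itself): interface names, gateways,
test IPs, and that `ping -I <iface>` is a good enough liveness test.
The sketch only prints the multipath route it would install.
"""
import subprocess
import time

# Hypothetical uplinks: interface -> gateway
UPLINKS = {"eth1": "203.0.113.1", "eth2": "198.51.100.1"}
TEST_IPS = ["8.8.8.8", "8.8.4.4"]   # well-known addresses to ping
CHECK_INTERVAL = 30                 # seconds between checks


def uplink_alive(iface: str) -> bool:
    """An uplink is up if any test IP answers a ping sent out of iface."""
    for ip in TEST_IPS:
        result = subprocess.run(
            ["ping", "-I", iface, "-c", "1", "-W", "2", ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return True
    return False


def multipath_route(alive: dict) -> str:
    """Build the `ip route` command for the currently healthy uplinks."""
    hops = " ".join(
        f"nexthop via {gw} dev {iface} weight 1"
        for iface, gw in UPLINKS.items() if alive[iface]
    )
    return f"ip route replace default scope global {hops}" if hops else ""


if __name__ == "__main__":
    previous = None
    while True:
        alive = {iface: uplink_alive(iface) for iface in UPLINKS}
        if alive != previous:   # state changed: reroute and notify
            print("uplink state:", alive)
            print("would run:", multipath_route(alive) or "(no uplinks available)")
            previous = alive
        time.sleep(CHECK_INTERVAL)
```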
Snowflake-shaped networks are easiest to mend
They found the best networks are made from partial loops around the units of the grid, with exactly one side of each loop missing. All of these partial loops link together, back to a central source. These have a low repair cost because if a link breaks, the repair simply involves adding back the missing side of a loop. What’s more, they are resistant to multiple breaks over time, as each repair preserves the network’s fundamental design.
via Snowflake-shaped networks are easiest to mend – tech – 03 October 2014 – New Scientist.
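To picture the repair rule in the quote, here is a toy sketch: the network is modeled abstractly as a set of loops, each deployed with exactly one side left out, and repairing a broken link just means adding back the missing side of its loop. The loop layout and data structures are illustrative assumptions, not the researchers' grid model.

```python
"""Toy illustration of the 'one missing side per loop' repair idea.

The loops below are arbitrary examples, not the grid model from the paper:
each loop is a closed cycle of links, deployed with one side omitted.
Repairing a broken link means adding back the missing side of its loop.
"""

# Each loop is a list of links (frozensets of endpoints); the last one is left out.
LOOPS = [
    [frozenset({"S", "A"}), frozenset({"A", "B"}), frozenset({"B", "S"})],
    [frozenset({"S", "C"}), frozenset({"C", "D"}), frozenset({"D", "S"})],
]

deployed = {link for loop in LOOPS for link in loop[:-1]}   # one side missing per loop
spares = {loop[-1]: loop for loop in LOOPS}                  # the omitted sides


def repair(broken: frozenset) -> frozenset:
    """Remove the broken link and add back the missing side of its loop."""
    deployed.discard(broken)
    for missing, loop in spares.items():
        if broken in loop:
            deployed.add(missing)
            return missing
    raise ValueError("link not part of any loop")


if __name__ == "__main__":
    print("before:", sorted(tuple(sorted(l)) for l in deployed))
    added = repair(frozenset({"A", "B"}))
    print("repaired by adding back:", tuple(sorted(added)))
    print("after: ", sorted(tuple(sorted(l)) for l in deployed))
```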
Tension and Flaws Before Health Website Crash
Thanks to a huge effort to fix the most obvious weaknesses and the appointment at last of a single contractor, QSSI, to oversee the work, the website now crashes much less frequently, officials said. That is a major improvement from a month ago, when it was up only 42 percent of the time and 10-hour failures were common. Yet an enormous amount of work remains to be done, all sides agree.
via Tension and Flaws Before Health Website Crash – NYTimes.com.
Systems like this should require five-nines availability from the beginning, meaning the system is operationally up 99.999% of the time, which allows for roughly 5.26 minutes of downtime per year (see the quick calculation after the quote below). I suspect companies like Amazon, Facebook, and Google meet this standard for high availability. There are all kinds of methods and tricks for achieving it that have been learned over the past century of building telecommunication systems.
In the last week of September, the disastrous results of the project’s inept management and execution were becoming fully apparent. The agency pressed CGI to explain why a performance test showed that the site could not handle more than 500 simultaneous users. The response once again exhibited the blame-shifting that had plagued the project for months.
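For concreteness, here is the downtime arithmetic behind the nines mentioned above; it is a straightforward calculation, nothing specific to HealthCare.gov.

```python
"""Downtime budget per year for a given number of nines of availability."""

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

for nines in (2, 3, 4, 5):
    availability = 1 - 10 ** -nines           # e.g. 3 nines -> 0.999
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.5%}): "
          f"{downtime_min:8.2f} minutes of downtime per year")

# 3 nines -> ~525.6 minutes (~8.8 hours) per year
# 5 nines -> ~5.26 minutes per year
```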
Cloud Providers Work To Disperse Points Of Failure
In the end, cloud providers — many of which aim for 99.9 percent uptime, or “three nines” — are likely to offer individual companies a more reliable service than those companies attain for themselves, the CSA’s Howie says.
via Cloud Providers Work To Disperse Points Of Failure – Dark Reading.
Note that telecom systems typically operate at five-nines uptime. The point of this article may be that end users need to implement their own backup plans to get better than three-nines reliability.
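A rough way to see the value of those backup plans: if you run across two providers whose failures are independent (a big assumption in practice, since regions and vendors share dependencies), the combined availability compounds. The numbers below are illustrative only.

```python
"""Combined availability of redundant, independently failing providers.

Assumes failures are truly independent, which real outages often violate.
"""

def combined_availability(per_provider: float, copies: int) -> float:
    """Probability that at least one of `copies` independent providers is up."""
    return 1 - (1 - per_provider) ** copies


three_nines = 0.999
for copies in (1, 2, 3):
    a = combined_availability(three_nines, copies)
    print(f"{copies} provider(s) at 99.9%: {a:.7%} combined availability")

# 1 provider  -> 99.9%      (three nines)
# 2 providers -> 99.9999%   (six nines, if failures really were independent)
```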
Mars Rover Curiosity in Safe Mode After Computer Glitch
The issue cropped up Wednesday (Feb. 27), when the spacecraft failed to send its recorded data back to Earth and did not switch into its daily sleep mode as planned. After looking into the issue, engineers decided to switch the Curiosity rover from its primary “A-side” computer to its “B-side” backup on Thursday at 5:30 p.m. EST (22:30 GMT). [Curiosity Rover’s Latest Amazing Mars Photos]
via Mars Rover Curiosity in Safe Mode After Computer Glitch | Space.com.
NYC Data Centers Struggle to Recover After Sandy
The fight now is to keep those generators fueled while pumps clear the basement areas, allowing the standard backup generators to begin operating. It’s also unclear whether the critical elements of infrastructure (power and communications) will both be up and running in time to restore services.
Below is a list of some of the data centers and services in the area, and how they’re faring:
Pirate Bay Moves to The Cloud, Becomes Raid-Proof
“If one cloud-provider cuts us off, goes offline or goes bankrupt, we can just buy new virtual servers from the next provider. Then we only have to upload the VM-images and reconfigure the load-balancer to get the site up and running again.”
via Pirate Bay Moves to The Cloud, Becomes Raid-Proof | TorrentFreak.
The load balancer and transit-routers are still owned and operated by The Pirate Bay, which allows the site to hide the location of the cloud provider. It also helps to secure the privacy of the site’s users.
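The quoted recovery procedure (bring up the VM images at another provider, then repoint the load balancer) can be pictured with a small sketch. Everything here is hypothetical, including the provider names and the `boot_image` and `set_backend` hooks, since the actual tooling isn't described beyond the quote.

```python
"""Hypothetical sketch of 'move to the next cloud provider' failover.

The provider list and the boot/repoint hooks are made up for illustration;
the quote only says VM images are re-uploaded and the load balancer reconfigured.
"""

PROVIDERS = ["provider-a", "provider-b", "provider-c"]   # hypothetical names


def boot_image(provider: str) -> str:
    """Stand-in for uploading the VM image and booting it at `provider`."""
    print(f"uploading VM image to {provider} and booting it...")
    return f"10.0.0.10@{provider}"          # pretend backend address


def set_backend(address: str) -> None:
    """Stand-in for reconfiguring the load balancer to the new backend."""
    print(f"load balancer now forwards to {address}")


def fail_over(current: str) -> str:
    """Pick the next provider in the list and repoint the load balancer."""
    remaining = [p for p in PROVIDERS if p != current]
    if not remaining:
        raise RuntimeError("no providers left to fail over to")
    new_provider = remaining[0]
    set_backend(boot_image(new_provider))
    return new_provider


if __name__ == "__main__":
    fail_over("provider-a")   # e.g. provider-a cut us off or went bankrupt
```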
That smooth SpaceX launch? Turns out one of the engines came apart
The Falcon 9, as its name implies, has nine engines, and is designed to go to orbit if one of them fails. On-board computers will detect engine failure, cut the fuel supply, and then distribute the unused propellant to the remaining engines, allowing them to burn longer. This seems to be the case where that was required, and the computers came through. The engines are also built with protection to limit the damage in cases where a neighboring engine explodes, which appears to be the case here.
via That smooth SpaceX launch? Turns out one of the engines came apart | Ars Technica.
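A back-of-the-envelope way to read "distribute the unused propellant to the remaining engines, allowing them to burn longer": with roughly the same remaining propellant mass flowing through 8 engines instead of 9, the burn stretches by about a factor of 9/8, ignoring gravity losses and throttle limits. A quick sketch of that arithmetic:

```python
"""Back-of-the-envelope burn-time extension after an engine-out.

Simplification: constant per-engine propellant flow, no throttling,
no gravity-loss accounting -- just conservation of propellant mass.
"""

ENGINES_TOTAL = 9
ENGINES_AFTER_FAILURE = 8
NOMINAL_REMAINING_BURN_S = 100.0   # hypothetical seconds left at failure time

# With 8/9 of the mass flow, the same remaining propellant lasts 9/8 as long.
extended_burn = NOMINAL_REMAINING_BURN_S * ENGINES_TOTAL / ENGINES_AFTER_FAILURE
print(f"remaining burn stretches from {NOMINAL_REMAINING_BURN_S:.0f}s "
      f"to about {extended_burn:.1f}s on {ENGINES_AFTER_FAILURE} engines")
```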
Chaos Monkey released into the wild
Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group.
via The Netflix Tech Blog: Chaos Monkey released into the wild.
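Chaos Monkey is Netflix's own open-source project; purely to illustrate the idea in the quote (pick an instance inside an Auto Scaling Group and terminate it), here is a minimal boto3 sketch. The region, the random choice of group, and the dry-run default are assumptions, and this is not Netflix's implementation.

```python
"""Minimal chaos-monkey-style sketch with boto3 (not Netflix's implementation).

Picks a random instance from a random Auto Scaling Group and terminates it.
Defaults to dry_run=True so nothing is actually killed unless you opt in.
"""
import random

import boto3


def terminate_random_instance(region: str = "us-east-1", dry_run: bool = True) -> None:
    autoscaling = boto3.client("autoscaling", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
    groups = [g for g in groups if g["Instances"]]
    if not groups:
        print("no Auto Scaling Groups with running instances found")
        return

    group = random.choice(groups)
    victim = random.choice(group["Instances"])["InstanceId"]
    print(f"selected {victim} from {group['AutoScalingGroupName']}")

    if dry_run:
        print("dry run: not terminating")
    else:
        ec2.terminate_instances(InstanceIds=[victim])
        print(f"terminated {victim}")


if __name__ == "__main__":
    terminate_random_instance()
```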
Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more
An outage of Amazon’s Elastic Compute Cloud in North Virginia has taken down Netflix, Pinterest, Instagram, and other services. According to numerous Twitter updates and our own checks, all three services are unavailable as of Friday evening at 9:10 p.m. PT.
via Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more | VentureBeat.
With the critical Amazon outage, which is the second this month, we wouldn’t be surprised if these popular services started looking at other options, including Rackspace, SoftLayer, Microsoft’s Azure, and Google’s just-introduced Compute Engine. Some of Amazon’s biggest EC2 outages occurred in April and August of last year.