Following a major outage last week, during Thanksgiving week, Amazon Web Services revealed that adding capacity to its complex system was the cause of the failure.
AWS explained that it had added capacity to the front-end fleet of its Kinesis service. Each front-end server creates a communication thread for each of the other servers in the fleet, so every new server adds a thread on every existing one. Because the fleet already comprised thousands of servers, it can take up to an hour for news of an addition to reach the whole fleet.
As a result, the added capacity pushed the servers past the maximum number of threads allowed, causing the outage.
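The arithmetic behind this failure mode can be sketched in a few lines of Python. This is only an illustration of the one-thread-per-peer pattern described above: the fleet sizes, the thread limit, and the baseline thread count are hypothetical numbers chosen for the example, not figures from AWS.

```python
def threads_per_server(fleet_size: int, baseline_threads: int = 0) -> int:
    """Threads one server needs: one per peer, plus threads for other work."""
    return baseline_threads + (fleet_size - 1)


def exceeds_limit(fleet_size: int, thread_limit: int, baseline_threads: int = 0) -> bool:
    """True if a server in a fleet of this size would go over the thread limit."""
    return threads_per_server(fleet_size, baseline_threads) > thread_limit


# Hypothetical values, for illustration only.
THREAD_LIMIT = 10_000   # assumed per-server cap
BASELINE = 500          # assumed threads used for non-peer work
FLEET_BEFORE = 9_400    # assumed fleet size before the capacity addition
FLEET_AFTER = 9_600     # assumed fleet size after the capacity addition

for size in (FLEET_BEFORE, FLEET_AFTER):
    needed = threads_per_server(size, BASELINE)
    status = "over limit" if exceeds_limit(size, THREAD_LIMIT, BASELINE) else "ok"
    print(f"fleet={size}: {needed} threads per server ({status})")
```

Because every server keeps a thread for every peer, a modest addition affects the whole fleet at once: in this sketch, each existing server crosses the limit, not just the new ones.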
Resolving the incident required AWS to restart all of Kinesis. With thousands of servers involved, the recovery was very slow.
AWS published a report explaining how it intends to avoid similar incidents in the future. The first step is to move to servers with more CPU and memory, which reduces the number of servers in the fleet and, in turn, the number of threads each one requires.
The company also plans to raise the thread count limits in its operating system configuration, allowing more threads per server and a larger safety margin. In addition, AWS intends to isolate heavily used services such as CloudWatch so that they run on dedicated Kinesis servers.
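A rough way to see why these two measures help is to compute the thread headroom per server under different fleet layouts. The sketch below assumes one thread per peer server; the load, per-server capacities, and limits are made-up values for illustration and do not come from the AWS report.

```python
import math


def fleet_size(total_load: float, capacity_per_server: float) -> int:
    """Number of servers needed to carry the load."""
    return math.ceil(total_load / capacity_per_server)


def headroom(total_load: float, capacity_per_server: float, thread_limit: int) -> int:
    """Threads left under the limit, assuming one thread per peer server."""
    peers = fleet_size(total_load, capacity_per_server) - 1
    return thread_limit - peers


# Hypothetical values, for illustration only.
LOAD = 9_600             # assumed total load, in arbitrary units
SMALL, LARGE = 1, 4      # assumed per-server capacity, small vs. larger servers
LIMIT, RAISED = 10_000, 20_000  # assumed thread limits before and after raising

print(headroom(LOAD, SMALL, LIMIT))   # many small servers, original limit
print(headroom(LOAD, LARGE, LIMIT))   # fewer, larger servers
print(headroom(LOAD, LARGE, RAISED))  # larger servers plus a raised limit
```

In this model, quadrupling per-server capacity cuts the peer count to a quarter, and raising the limit adds margin on top of that, which matches the two-step approach described in the report.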
The company is determined to learn from this incident and to improve its systems.