UPDATE 2009-10-12: I’m happy to let you know this post is no longer relevant. The Amazon AWS team successfully deployed a fix, and the scenario used to simulate a Denial of Service attack using a UDP flood is no longer applicable. All that in less than 24 hours after publishing the link on Twitter. Good job!
Original post follows.
The unfortunate events surrounding the DDoS attack against BitBucket kicked off heated discussions about the nature of this vulnerability. While Amazon officially acknowledged this to be a single isolated incident, many others started asking why it happened in the first place:
- Was BitBucket’s security group configuration set to block UDP traffic?
- Why didn’t they have better visibility of the ongoing attack?
- Is this really Amazon’s fault?
Both personal and professional interest led me to find out more. Having designed a series of tests to replicate this scenario, I started the first instance and set up the target environment.
instance: c1.medium (us-east-1d)
EBS volume: 200 GB attached to /dev/sdf
monitoring: vmstat, netstat, iptraf, Amazon CloudWatch
security group: allowed SSH only (port 22/TCP)
The UDP flood was generated from a second instance (c1.medium) using a simple Perl script, which managed to produce a whopping 650 Mbit/s of traffic (according to iptraf) using 1 KB packets sent to random ports on the target IP.
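For a sense of scale, the figures above can be turned into an approximate packet rate. This is only a back-of-the-envelope sketch: it assumes "1 KB" means 1024 bytes of UDP payload, that iptraf reports decimal megabits, and it ignores header overhead. The function name is purely illustrative.

```python
def packets_per_second(rate_mbit: float, payload_bytes: int = 1024) -> float:
    """Approximate packet rate for a given bit rate, ignoring header overhead."""
    bits_per_packet = payload_bytes * 8
    return rate_mbit * 1_000_000 / bits_per_packet


if __name__ == "__main__":
    # 650 Mbit/s is the rate reported by iptraf in this test.
    print(f"{packets_per_second(650):,.0f} packets/s")
```

Under these assumptions the flood works out to roughly 79,000 packets per second arriving at the target's host.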
Test 1. Simply letting the flood run was successful in the sense that nothing was visible on the target machine. Still surprised by the traffic level generated on the source box, I pointed the UDP flood at another machine, one whose security group allowed UDP traffic (ports 0 – 65535), to check whether the traffic could actually reach another box. And it could. Not only within the same availability zone, but also from different ones (tested us-east-1c and us-east-1b).
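The difference between the two targets comes down to allow-list semantics: inbound traffic is dropped unless a rule permits it. The toy model below illustrates that idea only. It is not Amazon's implementation, and `Rule`, `is_allowed` and both rule sets are hypothetical names made up for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One inbound allow rule: a protocol plus an inclusive port range."""
    protocol: str
    from_port: int
    to_port: int


def is_allowed(rules: list[Rule], protocol: str, port: int) -> bool:
    """Default deny: a packet passes only if some allow rule matches it."""
    return any(
        rule.protocol == protocol and rule.from_port <= port <= rule.to_port
        for rule in rules
    )


# Hypothetical rule sets mirroring the two targets described above.
ssh_only = [Rule("tcp", 22, 22)]
udp_open = [Rule("udp", 0, 65535)]

if __name__ == "__main__":
    print(is_allowed(ssh_only, "udp", 40000))  # flood packet, first target
    print(is_allowed(udp_open, "udp", 40000))  # flood packet, second target
```

Note that the model says nothing about where the dropped packets are discarded, which is exactly the question the remaining tests try to answer.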
Test 2 consisted of formatting the prepared EBS volume, with 5 samples for each scenario, with and without the UDP flood.
No traffic: 1m15s
UDP flood: 2m54s
During the test there was only a moderate increase in IO wait (somewhere between 2 – 4%).
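Taking the two reported times at face value, the slowdown can be worked out directly. The helper names below are illustrative, and the parser only handles the `XmYs` form used in this post.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '2m54s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def slowdown(baseline: str, loaded: str) -> float:
    """How many times longer the loaded run took than the baseline."""
    return to_seconds(loaded) / to_seconds(baseline)


if __name__ == "__main__":
    print(f"{slowdown('1m15s', '2m54s'):.2f}x")
```

That is 75 seconds against 174 seconds, so formatting took about 2.3 times longer under the flood.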
Test 3. Bonnie++ performance test of the EBS volume. Running with no incoming traffic, it took around 8 minutes to produce a quite reasonable report. Having switched on the UDP flood, I repeated the same test, expecting to see results in a similar time. Fifteen minutes later, bonnie still hadn’t finished even the third step (rewriting). Another 10 minutes without any significant progress prompted me to investigate what was going on. The box was performing virtually no IO operations, and the time spent waiting for IO reached 100% on every other reading (1s interval). Bingo!
To verify whether the problem was really caused by the incoming UDP flood, I stopped the traffic for a brief interval (around 7 seconds) and monitored the box using vmstat:
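The IO wait figure reported by tools such as vmstat can be derived from the kernel's CPU counters. As a hedged sketch, the function below computes the iowait share between two `cpu` lines in the format of Linux's /proc/stat. The sample lines are invented for illustration and are not measurements from this test.

```python
from __future__ import annotations


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two 'cpu' lines of /proc/stat."""

    def counters(line: str) -> list[int]:
        # Layout: cpu user nice system idle iowait irq softirq steal ...
        # Only the first eight counters are used; guest time is already
        # included in user time and would otherwise be counted twice.
        return [int(value) for value in line.split()[1:9]]

    deltas = [b - a for a, b in zip(counters(before), counters(after))]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


if __name__ == "__main__":
    # Hypothetical snapshots taken one second apart.
    first = "cpu 100 0 50 1000 200 0 0 0"
    second = "cpu 100 0 50 1000 300 0 0 0"
    print(f"{iowait_percent(first, second):.0f}% iowait")
```

In the invented example every tick of the interval went to iowait, which is what a fully stalled volume looks like from inside the instance.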
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
0 1 893480 11272 3112 1699240 0 0 0 0 10 11 0 0 66 34
0 1 893480 11272 3112 1699240 0 0 0 0 9 6 0 0 0 100
0 1 893480 11272 3112 1699240 0 0 0 0 10 9 0 0 67 33
0 1 893480 11272 3112 1699240 0 0 0 0 11 8 0 0 0 100
0 1 893480 8824 3100 1700052 0 0 23808 24864 962 697 0 1 68 31
0 1 893480 12284 3084 1697988 0 0 16384 16576 711 424 0 2 4 93
0 1 893480 9020 3084 1700088 0 0 20480 20720 817 563 0 1 68 31
0 1 893480 10432 3072 1700192 0 0 20864 20720 907 612 0 4 5 90
0 1 893480 10976 3040 1699724 0 0 15620 12432 588 423 0 1 68 31
0 1 893480 10872 3044 1698556 0 0 12676 16576 600 350 0 2 2 96
0 1 893480 10328 3024 1700676 0 0 19976 16576 761 535 0 1 68 31
0 1 893480 12408 3004 1698096 0 0 8708 12432 457 254 0 1 4 95
0 1 893480 12408 3004 1698096 0 0 0 0 9 7 0 0 67 33
0 1 893480 11636 3004 1699120 0 0 1024 0 38 38 0 0 0 100
0 1 893480 10548 3004 1700420 0 0 1280 0 47 45 0 0 66 33
0 1 893480 10188 3004 1700756 0 0 3584 4144 195 110 0 0 0 100
0 1 893480 10120 2992 1697968 0 0 6404 8288 256 205 0 0 67 33
0 1 893480 12468 2992 1696864 0 0 8064 8288 343 250 0 0 2 98
0 1 893480 11720 2972 1696984 0 0 12420 12432 495 333 0 0 67 32
0 1 893480 10136 2976 1700800 0 0 6916 4144 321 190 0 0 0 100
0 1 893480 11972 2956 1698820 0 0 4096 4144 161 117 0 0 67 33
0 1 893480 11364 2960 1699480 0 0 3844 4144 200 126 0 0 1 99
0 1 893480 11432 2960 1699480 0 0 2944 4144 160 91 0 0 66 34
0 1 893480 11156 2960 1699820 0 0 256 0 18 12 0 0 0 100
0 1 893480 10884 2960 1700020 0 0 256 0 17 17 0 0 66 34
0 1 893480 10856 2960 1700076 0 0 0 0 9 8 0 0 0 100
0 1 893480 10856 2960 1700076 0 0 0 0 9 9 0 0 67 33
As you can see in the fifth sample, IO traffic resumed, roughly correlating with the moment the incoming traffic stopped. Seven seconds later, with the UDP traffic back on, the box tried to keep up for another quarter of a minute before giving up. The best time to check CloudWatch:

Nothing! Based on my notes, the first bonnie run occurred at 10:40, I switched on the UDP flood at 10:50, and started the second bonnie run at 10:52. My patience ran out before 11:30, where there’s a small peak caused by an interactive iptraf session.
At this point there was no reason to continue testing. All IO operations to/from the EBS volume seemed to be blocked by UDP traffic generated by a single instance!
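The vmstat samples shown earlier can also be checked programmatically rather than by eye. The sketch below assumes the usual 16-column vmstat layout (r, b, swpd, free, buff, cache, si, so, bi, bo, in, cs, us, sy, id, wa), since the column header row did not survive in the output above. The function names are illustrative, and `SAMPLE` holds samples three to six of that output.

```python
from __future__ import annotations

# Assumed column order of the 16-column vmstat output.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat(text: str) -> list[dict[str, int]]:
    """Parse vmstat data rows, skipping header lines and anything non-numeric."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(VMSTAT_COLUMNS):
            continue
        if not all(part.isdigit() for part in parts):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, parts))))
    return rows


def rows_with_io(rows: list[dict[str, int]]) -> list[int]:
    """1-based indexes of samples in which any block was read or written."""
    return [i for i, row in enumerate(rows, start=1) if row["bi"] or row["bo"]]


SAMPLE = """\
0 1 893480 11272 3112 1699240 0 0 0 0 10 9 0 0 67 33
0 1 893480 11272 3112 1699240 0 0 0 0 11 8 0 0 0 100
0 1 893480 8824 3100 1700052 0 0 23808 24864 962 697 0 1 68 31
0 1 893480 12284 3084 1697988 0 0 16384 16576 711 424 0 2 4 93
"""

if __name__ == "__main__":
    print(rows_with_io(parse_vmstat(SAMPLE)))
```

Run against the full output, the same check would show block IO starting at the fifth sample and tailing off again once the flood was switched back on.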
Conclusion
The BitBucket guys had every reason to be angry. Blocking UDP in the security group configuration only hides the problem. Contrary to Jesper Nøhr’s statement, no peaks were visible during this experiment in the paid monitoring service, Amazon CloudWatch (see above), which is probably all the information that was available to AWS first-line support.
This corresponds to the ‘black box’ described by Jesper. Looking back at the results, it’s obvious that
- on-demand network capacity backfired in this case
- security group configuration is most likely applied on the host system
- the host architecture seems to share the same network interface(s) for regular network traffic and for traffic to/from EBS volumes. Even though instances have only a single network interface, I would expect this separation to be implemented on the host system. Segregation of network traffic is one of the first lessons learned in highly exposed clustered environments.
- a week after the attack, there still isn’t any fix in place. Hello, Amazon?!?!
To be fair, it’s been the first incident of such magnitude. Let’s hope the Amazon AWS team will come up with an architectural fix before somebody uses the vulnerability in a much wider and more devastating attack. In the meantime, the only workaround we can apply is to hide our instances as much as we can. Load balancers and proxies in front of the worker instances should be enough, as long as you don’t share the same host machine.
Have a good weekend and good luck protecting your instances’ IPs!
PS: who had the same dark thought as I just had? What about S3?
[UPDATE 2009-10-11 7:00pm] c1.xlarge instances are able to generate a UDP flood at a rate of 800 Mbps. I guess Amazon AWS is running 1 Gbps network infrastructure.
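If the 1 Gbps guess is right, the two measured flood rates would account for a large share of a single link. The calculation below is only a sketch under that assumption. The link speed is not a confirmed figure and the function name is illustrative.

```python
def utilisation(rate_mbps: float, link_mbps: float = 1000.0) -> float:
    """Percentage of the link consumed by a flow of the given rate."""
    return 100.0 * rate_mbps / link_mbps


if __name__ == "__main__":
    # Flood rates measured in this post, per source instance type.
    for label, rate in (("c1.medium", 650), ("c1.xlarge", 800)):
        print(f"{label}: {utilisation(rate):.0f}% of an assumed 1 Gbps link")
```

That would be 65% and 80% of the link from a single source instance, although the IO stall observed in Test 3 was far worse than a proportional loss of bandwidth would suggest.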
6 Comments
Great write-up. My co-workers and I had a chat about the technical side of this outage in the office last week and pretty much arrived at the same conclusions – your #2 and #3.
Now waiting for AWS to offer details on what they are planning to do in order to mitigate such problems in the future.
Thank you for analysing the issue further.
I think it should be highlighted, that
1. the “attack” was performed within EC2
2. both instances were run in the same security group (i guess?!)
The question is, whether your test actually measures EC2’s vulnerability as reported by BitBucket’s owner, or whether it tests a different kind of security / availability problem.
Martin,
the simulation was performed within EC2, but using a different security group, a different availability zone, and a different account as well. I agree it might be related to a different kind of security / availability problem, but nevertheless the user experience and visibility are pretty much the same.
Radim
Hello Radim,
thank you for the corrections – and sorry for the confusion, I don’t know how I have missed that you actually used different sec. groups.
This post is no longer relevant. See the notice at the top.
I don’t see how “All that in less than 24 hours after publishing the link on Twitter. Good job!” applies.
Kudos to Amazon for finally fixing it, but this was over a week after the first occurrence, not merely a day after you posted your blog post.