FIXED: Amazon EC2 vulnerable to UDP flood attacks


UPDATE 2009-10-12: I’m happy to let you know this post is no longer relevant. The Amazon AWS team has successfully deployed a fix, and the scenario used to simulate a Denial of Service attack using a UDP flood no longer applies. All that in less than 24 hours after publishing the link on Twitter. Good job!

Original post follows.

Unfortunate events surrounding the DDoS attack against BitBucket kicked off heated discussions about the nature of this vulnerability. While Amazon officially acknowledged this to be a single isolated incident, many others started asking why it happened in the first place:

  • Was BitBucket’s security group configuration set to block UDP traffic?
  • How come they didn’t have better visibility of the ongoing attack?
  • Is this really Amazon’s fault?

Both personal and professional interest led me to find out more. Having designed a series of tests to replicate this scenario, I started the first instance and set up the target environment.

	instance : c1.medium (us-east-1d)
	EBS volume : 200 GB attached to (/dev/sdf)
	monitoring : vmstat, netstat, iptraf, Amazon CloudWatch
	security group : allowed SSH only (port 22/TCP)

The UDP flood was generated from a second instance (c1.medium) using a simple Perl script, which managed to produce a whopping 650 Mbit per second of traffic (according to iptraf) using 1KB packets sent to random ports on the target IP.
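To put the 650 Mbit/s figure in perspective, the packet rate can be estimated with a little arithmetic. This is only a rough sketch: it assumes "1KB" means a 1024-byte payload, treats a megabit as 1,000,000 bits, and ignores UDP/IP header overhead. None of that is confirmed by the iptraf reading itself.

```python
def packets_per_second(rate_mbit: float, payload_bytes: int) -> float:
    """Estimate packets per second from a bit rate and a per-packet size.

    Assumes decimal megabits and ignores protocol header overhead.
    """
    bits_per_packet = payload_bytes * 8
    return rate_mbit * 1_000_000 / bits_per_packet


# Figures from the test: 650 Mbit/s, 1KB packets (assumed to be 1024 bytes).
rate = packets_per_second(650, 1024)
print(f"{rate:,.0f} packets per second")
```

Under those assumptions that works out to roughly 79,000 packets per second arriving at the target.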

Test 1. Let it run. This was successful in the sense that nothing was visible on the target machine. Still surprised by the traffic level generated on the source box, I pointed the UDP flood at another machine, one with a security group allowing UDP traffic (ports 0 – 65535), to check whether the traffic was able to reach another box. And it was. Not only from the same availability zone, but even from different ones (tested us-east-1c and us-east-1b).

Test 2. Formatting the prepared EBS volume, with 5 samples for each scenario, with and without the UDP flood. Average times:

	No traffic (1m15s)
	UDP Flood (2m54s)

During the test there was only a moderate increase in IO waits (somewhere between 2 – 4%).
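The slowdown in Test 2 is easier to compare as a ratio. A minimal sketch, using the two average times reported above (the `to_seconds` helper is just an illustration for parsing durations written like `2m54s`):

```python
def to_seconds(duration: str) -> int:
    """Convert a duration written like '2m54s' into whole seconds."""
    minutes, _, rest = duration.partition("m")
    return int(minutes) * 60 + int(rest.rstrip("s"))


baseline = to_seconds("1m15s")  # average format time without the UDP flood
flooded = to_seconds("2m54s")   # average format time with the UDP flood
slowdown = flooded / baseline

print(f"{baseline}s -> {flooded}s, slowdown {slowdown:.2f}x")
```

In other words, formatting took a bit more than twice as long under the flood, even though IO wait barely moved.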

Test 3. Bonnie++ performance test of the EBS volume. Running with no incoming traffic, it took around 8 minutes to produce a quite reasonable report. Having switched on the UDP flood, I repeated the same test, expecting to see results in a similar time. Fifteen minutes later bonnie still hadn’t finished even the third step (rewriting). Another 10 minutes without any significant progress prompted me to look into what was going on. The box was performing virtually no IO operations, and time spent waiting for IO hit 100% on every second reading (1s delay). Bingo!

To verify whether the problem was really caused by the incoming UDP flood, I stopped the traffic for a brief interval (around 7 seconds) and monitored the box using vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa
0  1 893480  11272   3112 1699240    0    0     0     0   10   11  0  0 66 34
0  1 893480  11272   3112 1699240    0    0     0     0    9    6  0  0  0 100
0  1 893480  11272   3112 1699240    0    0     0     0   10    9  0  0 67 33
0  1 893480  11272   3112 1699240    0    0     0     0   11    8  0  0  0 100
0  1 893480   8824   3100 1700052    0    0 23808 24864  962  697  0  1 68 31
0  1 893480  12284   3084 1697988    0    0 16384 16576  711  424  0  2  4 93
0  1 893480   9020   3084 1700088    0    0 20480 20720  817  563  0  1 68 31
0  1 893480  10432   3072 1700192    0    0 20864 20720  907  612  0  4  5 90
0  1 893480  10976   3040 1699724    0    0 15620 12432  588  423  0  1 68 31
0  1 893480  10872   3044 1698556    0    0 12676 16576  600  350  0  2  2 96
0  1 893480  10328   3024 1700676    0    0 19976 16576  761  535  0  1 68 31
0  1 893480  12408   3004 1698096    0    0  8708 12432  457  254  0  1  4 95
0  1 893480  12408   3004 1698096    0    0     0     0    9    7  0  0 67 33
0  1 893480  11636   3004 1699120    0    0  1024     0   38   38  0  0  0 100
0  1 893480  10548   3004 1700420    0    0  1280     0   47   45  0  0 66 33
0  1 893480  10188   3004 1700756    0    0  3584  4144  195  110  0  0  0 100
0  1 893480  10120   2992 1697968    0    0  6404  8288  256  205  0  0 67 33
0  1 893480  12468   2992 1696864    0    0  8064  8288  343  250  0  0  2 98
0  1 893480  11720   2972 1696984    0    0 12420 12432  495  333  0  0 67 32
0  1 893480  10136   2976 1700800    0    0  6916  4144  321  190  0  0  0 100
0  1 893480  11972   2956 1698820    0    0  4096  4144  161  117  0  0 67 33
0  1 893480  11364   2960 1699480    0    0  3844  4144  200  126  0  0  1 99
0  1 893480  11432   2960 1699480    0    0  2944  4144  160   91  0  0 66 34
0  1 893480  11156   2960 1699820    0    0   256     0   18   12  0  0  0 100
0  1 893480  10884   2960 1700020    0    0   256     0   17   17  0  0 66 34
0  1 893480  10856   2960 1700076    0    0     0     0    9    8  0  0  0 100
0  1 893480  10856   2960 1700076    0    0     0     0    9    9  0  0 67 33

As you can see, on line 5 (the fifth row of readings) the IO traffic resumed, roughly correlating with the time the incoming traffic stopped. Seven seconds later, with the UDP traffic back on, the box tried to keep up for another quarter of a minute before giving up. A good time to check CloudWatch:
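The same observation can be pulled out of the vmstat readings programmatically. The sketch below assumes vmstat's default column order (r b swpd free buff cache si so bi bo in cs us sy id wa) and uses only the first eight rows of the output above as sample data; the function names are illustrative.

```python
# Column names assumed to follow vmstat's default layout.
COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()

# First eight rows of readings from the test above.
SAMPLE = """\
0  1 893480  11272   3112 1699240    0    0     0     0   10   11  0  0 66 34
0  1 893480  11272   3112 1699240    0    0     0     0    9    6  0  0  0 100
0  1 893480  11272   3112 1699240    0    0     0     0   10    9  0  0 67 33
0  1 893480  11272   3112 1699240    0    0     0     0   11    8  0  0  0 100
0  1 893480   8824   3100 1700052    0    0 23808 24864  962  697  0  1 68 31
0  1 893480  12284   3084 1697988    0    0 16384 16576  711  424  0  2  4 93
0  1 893480   9020   3084 1700088    0    0 20480 20720  817  563  0  1 68 31
0  1 893480  10432   3072 1700192    0    0 20864 20720  907  612  0  4  5 90
"""


def parse_vmstat(text):
    """Turn vmstat data rows into dicts keyed by column name; skip other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue  # header or malformed line
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def first_active_row(rows):
    """Return the 1-based number of the first row showing block IO, or None."""
    for number, row in enumerate(rows, start=1):
        if row["bi"] > 0 or row["bo"] > 0:
            return number
    return None


rows = parse_vmstat(SAMPLE)
resumed_at = first_active_row(rows)
print("IO resumes at row", resumed_at)
```

Feeding it the full output would also let you count how many rows show any block IO at all, which gives a rough measure of how long the volume stayed usable once the flood was switched back on.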

[Image: CloudWatch monitoring]

Nothing! Based on my notes, the first bonnie run occurred at 10:40, I switched on the UDP flood at 10:50, and I started the second bonnie run at 10:52. My patience ran out before 11:30, where there’s a small peak caused by an interactive iptraf session.

At this point there was no reason to continue testing. All IO operations to/from the EBS volume seemed to be blocked by UDP traffic generated by a single instance!

Conclusion

The BitBucket guys had every reason to be angry. Blocking UDP in the security group configuration only hides the problem. Contrary to Jesper Nøhr’s statement, during this experiment there were no peaks visible in the paid monitoring service, Amazon CloudWatch (see above), which was probably all the information available to AWS 1st line support.

This corresponds to the ‘black box’ described by Jesper. Looking back at the results, it seems clear that:

  • on-demand network capacity backfired in this case
  • security group configuration is most likely applied on the host system
  • the host architecture seems to share the same network interface(s) for regular network traffic as well as network traffic to/from EBS volumes. Even though instances get only a single network interface, I would expect this separation to be implemented on the host system. Segregation of network traffic is one of the first lessons learned in highly exposed clustered environments.
  • a week after the attack, there still isn’t any fix in place. Hello, Amazon?!?!

To be fair, it’s been the first incident of such magnitude. Let’s hope the Amazon AWS team will come up with an architecture fix before somebody uses the vulnerability in a much wider and more devastating attack. In the meantime, the only workaround we can apply is to hide our instances as much as we can. Load balancers and proxies in front of the worker instances should be enough, as long as you don’t share the same host machine.

Have a good weekend and good luck protecting your instances’ IPs!

PS: who had the same dark thought as I just had? What about S3?

[UPDATE 2009-10-11 7:00pm] c1.xlarge instances are able to generate a UDP flood at a rate of 800 Mbps. I guess Amazon AWS is running 1 Gbps network infrastructure.

6 Comments

  1. Posted October 11, 2009 at 5:01 pm | Permalink

    Great write-up. My co-workers and I had a chat about technical side of this outage in the office last week and pretty much arrived at the same conclusions – your #2 and #3.

    Now waiting for AWS to offer details what they are planning to do in order to mitigate such problems in the future.

  2. Martin
    Posted October 12, 2009 at 2:41 pm | Permalink

    Thank you for analysing the issue further.

    I think it should be highlighted, that

    1. the “attack” was performed within EC2
    2. both instances were run in the same security group (i guess?!)

    The question is, whether your test actually measures EC2’s vulnerability as reported by BitBucket’s owner, or whether it tests a different kind of security / availability problem.

    • Radim Marek
      Posted October 12, 2009 at 2:47 pm | Permalink

      Martin,

      the simulation was performed within EC2, but using a different security group and availability zone (and a different account as well). I agree it might be related to a different kind of security / availability problem, but nevertheless the user experience and visibility are pretty much the same.

      Radim

      • Martin
        Posted October 12, 2009 at 3:53 pm | Permalink

        Hello Radim,

        thank you for the corrections – and sorry for the confusion, I don’t know how I have missed that you actually used different sec. groups.

  3. Radim Marek
    Posted October 12, 2009 at 7:33 pm | Permalink

    This post is no longer relevant. See the notice at the top.

  4. Tom
    Posted October 13, 2009 at 2:00 am | Permalink

    I don’t see how “All that in less than 24 hours after publishing the link on Twitter. Good job!” applies.

    Kudos to Amazon for finally fixing it, but this was over a week after the first occurrence, not merely a day after you posted your blog post.


