Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 download_file does not retry on ConnectionErrors when receiving response #3876

Open
adrianliaw opened this issue Sep 28, 2023 · 5 comments
Labels
bug This issue is a confirmed bug. p2 This is a standard priority issue s3

Comments

@adrianliaw
Copy link

Describe the bug

When there's a ConnectionError during reading S3 get object response (perhaps because of network issues), boto3/s3transfer does not retry these requests and raises ResponseStreamingError immediately. We occasionally get ConnectionResetErrors from S3, which I'd assume would be retried based on boto3 docs about retries, because ConnectionResetError is a subclass of ConnectionError.

Expected Behavior

download_file should have automatic retries for ConnectionErrors if boto3 client has a retry strategy configured.

Current Behavior

If a ConnectionError happens during reading of response body during download_file, the entire download_file operation is failed with ResponseStreamingError without retries.

Reproduction Steps

I used the following unit test (note that this test is not hermetic as it actually uses boto3) to demonstrate that downloading is not retried, substituting <a bucket that actually exists> and <a key that actually exists> with actual bucket/key that I have:

import pytest
import botocore
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config
import http


def test_s3_does_not_retry_download_file():
    ### Start injecting fake HTTP connection errors during read ###
    real_http_read = http.client.HTTPResponse.read
    # Count the number of requests being made, I know it makes 3 requests, but retries should
    # bump this number up
    call_count = 0

    class MockedHTTPResponse(http.client.HTTPResponse):
        def read(self, *args, **kwargs):
            nonlocal call_count
            call_count += 1
            # Inject socket connection errors
            if call_count > 2:
                raise ConnectionResetError("Connection reset")
            return real_http_read(self, *args, **kwargs)

    http.client.HTTPResponse.read = MockedHTTPResponse.read
    ### End injecting fake HTTP connection errors during read ###

    # Use a very high retry attempt number to make sure that this is not retried
    s3 = boto3.resource("s3", config=Config(retries={"mode": "standard", "max_attempts": 100}))
    obj = s3.Object("<a bucket that actually exists>", "<a key that actually exists>")

    with pytest.raises(botocore.exceptions.ResponseStreamingError):
        # Use a very high retry attempt number to make sure that this is not retried
        obj.download_file("/tmp/output-file", Config=TransferConfig(num_download_attempts=100))
    
    # This test passes, meaning no retries are attempted
    assert call_count == 3

Possible Solution

Consider adding ResponseStreamingError to S3_RETRYABLE_DOWNLOAD_ERRORS list.

I think retrying this kind of network connection issues is suitable for download operation as these HEAD/GET requests should be idempotent.

Additional Information/Context

I confirmed that ConnectionResetErrors are not retried by walking through these breadcrumbs:

  1. urllib3 raises ProtocolError when there's a ConnectionError (which is a subclass of OSError)
  2. StreamingBody.read() raises ResponseStreamingError (the error we got) when a urllib3 ProtocolError is caught
  3. The above is called by DownloadChunkIterator for multi-part downloads
  4. The above is used in a retry loop
  5. The except clause looks for S3_RETRYABLE_DOWNLOAD_ERRORS, which includes socket.timeout, ConnectionError, ReadTimeoutError/IncompleteReadTimeout defined in botocore
  6. None of the above is in the inheritance path of ResponseStreamingError (ResponseStreamingError -> HTTPClientError -> BotoCoreError -> Exception)

My understanding is s3transfer "tries" to catch and retry ConnectionError (and other retryable errors), but urllib3 masks ConnectionError as ProtocolError, then s3transfer masks ProtocolError as ResponseStreamingError, which botocore also catches but errors out instead of retrying.

SDK version used

1.28.54

Environment details (OS name and version, etc.)

Amazon Linux 2

@adrianliaw adrianliaw added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Sep 28, 2023
@tim-finnigan tim-finnigan self-assigned this Oct 3, 2023
@tim-finnigan
Copy link
Contributor

Hi @adrianliaw thanks for reaching out. I brought this issue up for discussion with the team, and they may be open to considering the changes you suggested. However, they would first like to gather more information to support the points shared here. Can you provide logs that demonstrate the issue you are describing? (You can get the full logs boto3.set_stream_logger('') to your script — please redact any sensitive info.)

@tim-finnigan tim-finnigan added response-requested Waiting on additional information or feedback. and removed bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. needs-review labels Oct 3, 2023
@adrianliaw
Copy link
Author

@tim-finnigan here's the logs (and also the script content, taken out from unit test content): https://gist.github.com/adrianliaw/ca8a36eea127ea7e49ffcad4fbf07e48

@tim-finnigan
Copy link
Contributor

tim-finnigan commented Oct 4, 2023

Thanks @adrianliaw for following up. I can reproduce the issue with the sample you provided. I'll leave this issue open for tracking and others are welcome to 👍 the issue and share their thoughts here in the comments.

@tim-finnigan tim-finnigan added s3 p2 This is a standard priority issue bug This issue is a confirmed bug. and removed response-requested Waiting on additional information or feedback. labels Oct 4, 2023
@tim-finnigan tim-finnigan removed their assignment Oct 4, 2023
@HimanshuBarak
Copy link

Hey @tim-finnigan iis this issue up fpr grabs? If not can you assign me this , I would love to work on it.

@Ayushi-12
Copy link

boto/s3transfer#301
Does this PR address the bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a confirmed bug. p2 This is a standard priority issue s3
4 participants