Amazon S3 multipart uploads let us upload a large file to S3 in smaller, more manageable chunks. S3 then stitches the individual pieces together after we signal that all parts have been uploaded. The individual part uploads can even be done in parallel, and if a single part upload fails, only that part needs to be retried, which saves bandwidth.
We’re going to cover uploading a large file to AWS S3 using the official Python library.
To interact with AWS in Python, we will need the boto3 package. Install the package via pip as follows:
pip install boto3
Boto3 can read the credentials straight from the aws-cli config file. As long as we have a ‘default’ profile configured, boto3 will pick up those credentials automatically and we can call any of its functions without passing keys around in code.
Run aws configure
in a terminal and add a default profile for a new IAM user, entering its access key and secret. Make sure that the user has full permissions on S3.
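As a quick sanity check (not part of the upload flow itself), we can let boto3 pick up that default profile and list the buckets the user can see. Passing profile_name explicitly is optional for the ‘default’ profile; it is shown here only to make the connection to aws configure obvious:
import boto3

# boto3 reads credentials from the aws-cli config automatically.
session = boto3.Session(profile_name='default')
s3Client = session.client('s3')

# If the credentials are missing or wrong, this call will raise an error.
print([bucket['Name'] for bucket in s3Client.list_buckets()['Buckets']])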
There are 3 steps for Amazon S3 multipart uploads:
- Creating the upload using create_multipart_upload: This informs AWS that we are starting a new multipart upload and returns a unique UploadId that we will use in subsequent calls to refer to this batch.
- Uploading each part using MultipartUploadPart: Individual file pieces are uploaded using this. Each uploaded part generates a unique ETag that must be passed in the final request. We assign each part a part number, which determines the order in which the final file is assembled.
- Completing the upload using complete_multipart_upload: This signals to AWS S3 that all the parts of the multipart upload have been uploaded and it can begin stitching the file together.
First, we need to start a new multipart upload:
import boto3

# Client for starting and completing the upload; resource for uploading the parts.
s3Client = boto3.client('s3')
s3 = boto3.resource('s3')

multipart_upload = s3Client.create_multipart_upload(
    ACL='public-read',
    Bucket='multipart-using-boto',
    ContentType='video/mp4',
    Key='movie.mp4',
)
Then, we need to read the file we’re uploading in chunks of a manageable size. S3 requires every part except the last one to be at least 5 MB, so we will use parts of roughly 10 MB. For this, we will open the file in ‘rb’ mode, where the ‘b’ stands for binary. We don’t want to interpret the file data as text; we need to keep it as binary data to allow for non-text files.
parts = []
part_number = 1

with open('movie.mp4', 'rb') as f:
    while True:
        piece = f.read(10000000)  # roughly 10 MB parts
        if piece == b'':
            break
Then, for each part, we will upload it and keep a record of its ETag:
# This runs inside the while loop above, once per piece.
uploadPart = s3.MultipartUploadPart(
    'multipart-using-boto', 'movie.mp4', multipart_upload['UploadId'], part_number
)
uploadPartResponse = uploadPart.upload(
    Body=piece,
)
parts.append({
    'PartNumber': part_number,
    'ETag': uploadPartResponse['ETag']
})
part_number += 1
We will complete the upload with all the ETags and part numbers:
completeResult = s3Client.complete_multipart_upload(
    Bucket='multipart-using-boto',
    Key='movie.mp4',
    MultipartUpload={
        'Parts': parts
    },
    UploadId=multipart_upload['UploadId'],
)
Your file should now be visible on the S3 console. In this example, we have read the file in parts of about 10 MB each and uploaded each part sequentially. But we can also upload all the parts in parallel, and re-upload any failed parts individually (see the sketch after the list below). The advantages of uploading in such a multipart fashion are:
- Significant speedup: parts can be uploaded in parallel, depending on the resources available on the server.
- Fault tolerance: individual pieces can be re-uploaded with low bandwidth overhead.
- Lower memory footprint: large files don’t need to be present in server memory all at once. This really helps with very large files, which could otherwise cause the server to run out of RAM.
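As a rough sketch of the parallel approach, assuming the pieces fit in memory and reusing the bucket, key and UploadId from the example above, we could fan the part uploads out over a thread pool. The upload_one_part and upload_parts_in_parallel helpers below are illustrative names, not part of boto3, and the sketch uses the client-level upload_part call rather than the MultipartUploadPart resource:
from concurrent.futures import ThreadPoolExecutor
import boto3

def upload_one_part(s3Client, bucket, key, upload_id, part_number, piece):
    # Upload a single piece and return the entry complete_multipart_upload expects.
    response = s3Client.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_number,
        UploadId=upload_id,
        Body=piece,
    )
    return {'PartNumber': part_number, 'ETag': response['ETag']}

def upload_parts_in_parallel(bucket, key, upload_id, pieces, max_workers=4):
    s3Client = boto3.client('s3')
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_one_part, s3Client, bucket, key, upload_id, i + 1, piece)
            for i, piece in enumerate(pieces)
        ]
        # complete_multipart_upload expects the parts list ordered by PartNumber.
        return sorted((f.result() for f in futures), key=lambda p: p['PartNumber'])
A failed part would surface as an exception from f.result(), and only that part number would need to be uploaded again.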
Amazon S3 multipart uploads come with more utility functions, like list_multipart_uploads and abort_multipart_upload, that can help you manage the lifecycle of a multipart upload even in a stateless environment.
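For example, a small cleanup sketch along these lines could list the uploads that were started but never completed and abort them, so that their already-uploaded parts stop taking up storage. The bucket name is just the one from the example above; real code would probably filter by key or age instead of aborting everything:
import boto3

s3Client = boto3.client('s3')
bucket = 'multipart-using-boto'

# Multipart uploads that were created but never completed or aborted.
pending = s3Client.list_multipart_uploads(Bucket=bucket)
for upload in pending.get('Uploads', []):
    print(upload['Key'], upload['UploadId'], upload['Initiated'])
    # Abort the upload so S3 discards its parts.
    s3Client.abort_multipart_upload(
        Bucket=bucket,
        Key=upload['Key'],
        UploadId=upload['UploadId'],
    )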