Resources

S3 Transfer Engine

The S3 Transfer Engine is a quick and reliable tool created by Blueberry for Amazon S3 file transfer and archiving.

Amazon Simple Storage Service

Amazon S3 (Simple Storage Service) is an online storage service offered by Amazon, which can be used for a wide variety of uses, ranging from Web applications to media files. Amazon S3 can be used to store and retrieve any amount of data, at any time, from anywhere on the Web, and gives users the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of websites.

Although Amazon S3 storage is very reliable, the problem with using its Web interface for uploads/downloads is that it is very slow and can be unreliable. There are commercial tools that can be used as an alternative, but many fail to correctly upload and resume very large files with guaranteed reliability.

Blueberry realised what was needed was a quick and reliable tool that would work 24 hours a day, 7 days a week, in any Windows environment, and without the need for user intervention. Blueberry subsequently developed the S3 Transfer Engine.

What is Blueberry's S3 Transfer Engine?

The S3 Transfer Engine is a free program created by Blueberry for Amazon S3 file transfer and archiving, which can be downloaded using the link above.

A number of technical challenges had to be met in implementing the tool, such as the need for multi-threading to speed up transfers and to overcome HTTP round-trip delays.

After considering the technical requirements for the tool and the technologies available (such as Java, PHP, Python, Ruby, and Windows .NET APIs), it was decided the S3 Transfer Engine should be developed using .NET technology, mainly because:


  • .NET provides powerful multithreading capability, resulting in faster transfers and reduced Web interface round-trip time.

  • .NET has the advantage of being available on virtually every Windows desktop.


Blueberry’s S3 Transfer Engine is implemented as a windowless console application that monitors its queue folder for ‘Transfer Description Files’ (TDFs). Each TDF defines a file to be uploaded; the application performs the multipart upload and puts the results into predefined output folders.

Installation and Configuration

To use Blueberry's S3 Transfer Engine, download the correct file using the link above (the installation packs include both 32-bit and 64-bit versions), unzip the file and run Setup.exe. The standard MSI installation dialogue will appear; simply follow the setup wizard. The program is installed by default into the “\Program Files\Blueberry Consultants\S3 Upload\” folder. After installation, edit the “S3Uploader.exe.config” file and set the following values:
  • AWSAccessKey and AWSSecretKey - these values are provided to you by Amazon.
  • RootFolder – the root folder name in which all needed subfolders are located (see Folder Structure below).
  • MaxPartRetries – the number of attempts at transferring a single part before the program decides that the part transfer has failed. The default value is 10.
  • MaxFileRetries – the number of attempts to transfer the entire file before the program decides that the file transfer has failed. The default value is 5.
  • FileFailRetryInterval – the time period in seconds the program waits before a new attempt to transfer the entire file.
  • PartSizeBytes – the part size in bytes. The default value is 5242880 bytes (5 Mbytes). Do not set this value below 10240 bytes (10 Kbytes).
  • ThreadsAtOnce – the number of threads that work simultaneously during a file transfer; each part is transferred in a separate thread. The default value is 15. When this limit is reached, the program pauses new thread creation and waits until a running thread completes; it then creates a new thread and starts the next part transfer. A minimal sketch of this throttling follows the list.
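
The throttling described above can be sketched in a few lines of C#. This is only an illustration of the behaviour, not the tool's actual code; the uploadPartAsync delegate and the parameter names are placeholders for whatever performs a single part transfer.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class PartThrottleSketch
{
    // Illustrative only: cap the number of concurrent part uploads at ThreadsAtOnce.
    static async Task UploadAllPartsAsync(int partCount, int threadsAtOnce,
                                          Func<int, Task> uploadPartAsync)
    {
        using var gate = new SemaphoreSlim(threadsAtOnce);
        var tasks = new List<Task>();

        for (int part = 1; part <= partCount; part++)
        {
            await gate.WaitAsync();            // pause here once the thread limit is reached
            int current = part;
            tasks.Add(Task.Run(async () =>
            {
                try { await uploadPartAsync(current); }   // transfer one part
                finally { gate.Release(); }               // free a slot for the next part
            }));
        }

        await Task.WhenAll(tasks);             // wait until every part has finished
    }
}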

Multi-Part Upload

The multi-part upload is a very important feature implemented by the Amazon team.

If the file is uploaded in its entirety in one go, then any communication error or fault means the upload has to be repeated from scratch. For very large files the time and expense of such repetitions can be extremely large, and there is no way to be sure that your upload will finish in any reasonable time, or even at all.

Amazon S3 multi-part upload offers a solution: the file is transferred as separate parts of equal length. If the parts are too small, the risk is that the per-part overhead becomes too large; conversely, if the parts are too large, a transfer error means a large amount of data must be re-transferred. Our recommendation is 5 megabytes (5242880 bytes). At this size, the parts can be transferred simultaneously in separate threads, increasing transfer throughput.
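
As a rough illustration of the arithmetic, the sketch below computes the part layout for the 20,000,768-byte file shown in the example report later in this document, using the recommended 5 MB part size:

using System;

long fileSize = 20000768;     // example: the 20,000,768-byte file in the report below
long partSize = 5242880;      // the recommended 5 MB part size
long parts = (fileSize + partSize - 1) / partSize;        // ceiling division
long lastPart = fileSize - (parts - 1) * partSize;
Console.WriteLine($"{parts} parts of up to {partSize} bytes; the last part is {lastPart} bytes");
// Prints: 4 parts of up to 5242880 bytes; the last part is 4272128 bytes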

The Amazon S3 service collects all successfully transferred parts. This set of parts is called an incomplete upload and is stored on the server side; it is invisible and does not belong to your bucket. When the first attempt at the file upload finishes, the S3 Uploader re-transfers only those parts that failed. If any re-transferred parts fail again, they are re-transferred too, and so on until all the parts have been uploaded or the retry attempts are exceeded.

If all parts are successfully transferred the upload is complete.

The Amazon S3 service then gathers all the uploaded parts in the correct order and puts them into the bucket as a key (or file), and the incomplete upload disappears from the Amazon S3 service. If the retry counter is exceeded, the S3 Uploader puts the TDF in the ‘Failed’ folder; you can then either make a new attempt (only the failed parts are uploaded) or first find a more stable internet connection.
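
For readers who want to see what the underlying Amazon calls look like, the sketch below uses the AWS SDK for .NET low-level multipart API (InitiateMultipartUpload, UploadPart, CompleteMultipartUpload). It is not the S3 Transfer Engine's source code: the sequential loop, the fixed retry count and the bucket/key/file parameters are illustrative assumptions, and the method names are those of the current AWS SDK for .NET.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

class MultipartUploadSketch
{
    const long PartSize = 5242880;   // 5 MB, as recommended above
    const int MaxPartRetries = 10;   // mirrors the MaxPartRetries setting described earlier

    static async Task UploadAsync(IAmazonS3 s3, string bucket, string key, string filePath, long fileSize)
    {
        // Start an incomplete upload on the server side.
        var init = await s3.InitiateMultipartUploadAsync(
            new InitiateMultipartUploadRequest { BucketName = bucket, Key = key });

        var etags = new List<PartETag>();
        int partCount = (int)((fileSize + PartSize - 1) / PartSize);

        for (int part = 1; part <= partCount; part++)
        {
            long offset = (part - 1) * PartSize;
            UploadPartResponse response = null;

            // Retry a single part rather than restarting the whole file.
            for (int attempt = 1; attempt <= MaxPartRetries && response == null; attempt++)
            {
                try
                {
                    response = await s3.UploadPartAsync(new UploadPartRequest
                    {
                        BucketName = bucket,
                        Key = key,
                        UploadId = init.UploadId,
                        PartNumber = part,
                        FilePath = filePath,
                        FilePosition = offset,
                        PartSize = Math.Min(PartSize, fileSize - offset)
                    });
                }
                catch (AmazonS3Exception) when (attempt < MaxPartRetries)
                {
                    // This part failed; it will be attempted again on the next loop iteration.
                }
            }
            etags.Add(new PartETag(part, response.ETag));
        }

        // Once every part has succeeded, ask S3 to assemble them into the final key.
        await s3.CompleteMultipartUploadAsync(new CompleteMultipartUploadRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = init.UploadId,
            PartETags = etags
        });
    }
}

The real tool transfers the parts in parallel (up to ThreadsAtOnce at a time) rather than sequentially, and records each part's outcome in the TDF.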

The above process works well even if there is an interruption, such as an electrical outage or a lost internet connection, or if you have other urgent tasks and decide to kill the S3 Uploader to free PC resources. On the Amazon side the timeout between upload attempts is not limited, so you can return to complete a failed upload at any time.
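
Resuming depends on being able to ask S3 which parts of an incomplete upload it already holds. Below is a minimal sketch of that query, again using the current AWS SDK for .NET rather than the tool's own code (pagination of the part list is omitted for brevity); this is presumably where the "found in uploads" figure in the report's Status line comes from.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

class ResumeSketch
{
    // Returns the part numbers already stored in the incomplete upload,
    // so a new attempt only needs to transfer the missing parts.
    static async Task<HashSet<int>> GetUploadedPartNumbersAsync(
        IAmazonS3 s3, string bucket, string key, string uploadId)
    {
        var response = await s3.ListPartsAsync(new ListPartsRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = uploadId
        });
        return new HashSet<int>(response.Parts.Select(p => p.PartNumber));
    }
}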

Folder Structure

S3 Uploader uses the following folder structure:

Root Folder
|-- Queue
|-- In Progress
|-- Complete
|-- Failed
|-- [Copy here to pause]

‘Root Folder’ is the common part of the path to other important folders used by S3 Uploader. The Root Folder must be configured in the settings file.

The ‘Queue’ folder is the input queue. The customer should use an external tool to put the TDFs for all files to be transferred into this folder. The S3 Uploader does not move the transferred files themselves: they can be large and are commonly left in place after transfer, although the source file can be deleted if the transfer succeeds – see the TDF file description below. The S3 Uploader watches this folder and, when it has free upload resources (for example, when the current uploads use fewer threads than the configured ThreadsAtOnce value), it reads the TDF and moves it into the ‘In Progress’ folder, starting the new file transfer at the same time. If the S3 Uploader has no free resources to upload the file, the TDF is held in the ‘Queue’ folder.
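
As an illustration of how such a queue folder can be watched in .NET, the sketch below uses a FileSystemWatcher. The root path is an assumed example, and the real tool's logic (resource checks, error handling, TDF parsing) is not shown.

using System;
using System.IO;

class QueueWatcherSketch
{
    static void Main()
    {
        string root = @"C:\S3Upload";                        // assumed RootFolder value
        var watcher = new FileSystemWatcher(Path.Combine(root, "Queue"))
        {
            EnableRaisingEvents = true
        };

        // When a new TDF appears in the Queue, move it to 'In Progress' and start a transfer.
        watcher.Created += (sender, e) =>
        {
            string target = Path.Combine(root, "In Progress", Path.GetFileName(e.FullPath));
            File.Move(e.FullPath, target);
            Console.WriteLine("Starting transfer for " + target);
            // ... kick off the multipart upload described in the Multi-Part Upload section above ...
        };

        Console.ReadLine();                                   // keep the windowless console app running
    }
}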

The ‘In Progress’ folder is where the TDF of a file being uploaded is kept until the upload succeeds or fails. Whilst the file is being transferred the TDF remains here and the S3 Uploader adds reporting records to it. When the transfer succeeds, the TDF is moved into the 'Complete' folder.

If the transfer fails, the TDF is moved into the 'Failed' folder.

The ‘Complete’ folder stores the TDFs of all successful uploads.

The ‘Failed’ folder stores the TDFs of all failed uploads.

The ‘Copy here to pause’ folder has not been implemented yet, but it is intended to allow transfer pausing. Once implemented, the user will be able to move a TDF from the ‘In Progress’ folder into this folder to pause that file’s transfer. Note that the transfer is not stopped, only paused: the parts currently being transferred are allowed to finish, and no new parts are started. The transfer resources that become free (threads, for example) can be used by other files in the queue; in other words, current transfers can be accelerated or new files from the queue can start transferring. To resume the transfer, just move the TDF back into the ‘In Progress’ folder.

Transfer Definition File (TDF) Format

TDF file example:

// All parameter names are case insensitive
// Empty lines are ignored
// Lines starting with '//' or '#' are comments and are ignored too
// Parameters may appear in any order
// All spaces before and after '=' are ignored
// All values are used as-is. Do not add quote characters, even for paths.

// Mandatory parameters are below
Source = D:\_Distributives\TortoiseSVN\tortoisesvn-1.6.15.21042-win32-svn-1.6.16.msi
Destination Bucket=tomatindb
Destination Key=folder/tortoisesvn-1.6.15.21042-win32-svn-1.6.16.msi

// Optional parameter:
// Values yes or true mean delete file
// Values no or false mean do not delete it
Delete Transferred File On Success = yes

Notes:

  1. A TDF is a UTF-8 encoded file, to allow support for national characters.
  2. A double slash (‘//’) or hash (‘#’) at the beginning of a line marks a comment, which is ignored at processing time. Empty lines are also ignored.
  3. If a parameter name contains spaces, they are part of the name and cannot be omitted; multiple consecutive spaces are treated as a single space character.
  4. The equals sign (‘=’) separates the parameter name and its value. All spaces before and after the equals sign are ignored.
  5. File names that contain spaces may be double-quoted, but do not have to be.
  6. Boolean values can be written as yes/no or true/false; the two forms are equivalent.

The ‘Source’ parameter is the full path to the file to be transferred. During the transfer the file stays in place.

The ‘Destination Bucket’ parameter is the name of the Amazon S3 bucket the file must be transferred to.

The ‘Destination Key’ parameter is the file name on the Amazon server. The Destination Key may contain a path relative to the bucket, with the path parts separated by slash characters. So the example above transfers the local file D:\_Distributives\TortoiseSVN\tortoisesvn-1.6.15.21042-win32-svn-1.6.16.msi to the bucket ‘tomatindb’; the destination file is placed into the folder ‘folder’ and named ‘tortoisesvn-1.6.15.21042-win32-svn-1.6.16.msi’ on the server side.

The ‘Delete Transferred File On Success’ parameter tells the S3 Uploader to delete the source file after a successful transfer. Use with care: the source file will be deleted!
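
Based on the format rules above, a TDF can be parsed with a few lines of C#. This is a sketch of the documented format, not the tool's own parser; the dictionary keys are simply the parameter names listed above.

using System;
using System.Collections.Generic;
using System.IO;

class TdfParserSketch
{
    // Parse a TDF into name/value pairs following the rules above:
    // UTF-8, '//' or '#' comments, empty lines ignored, case-insensitive names,
    // '=' separating name and value, spaces around '=' ignored.
    static Dictionary<string, string> Parse(string path)
    {
        var parameters = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (string rawLine in File.ReadLines(path))
        {
            string line = rawLine.Trim();
            if (line.Length == 0 || line.StartsWith("//") || line.StartsWith("#"))
                continue;                                    // skip comments and empty lines

            int eq = line.IndexOf('=');
            if (eq < 0) continue;                            // not a parameter line

            string name = line.Substring(0, eq).Trim();
            string value = line.Substring(eq + 1).Trim();
            parameters[name] = value;                        // later lines overwrite earlier ones
        }
        return parameters;
    }
}

For the example TDF above, Parse(path)["Destination Bucket"] would return ‘tomatindb’.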

Usage

To start using the S3 Uploader you need to:
  • create the appropriate folder structure – see the Folder Structure section;
  • edit the config file – see the Installation and Configuration section;
  • start S3Uploader.exe;
  • put the TDFs for the files to be transferred into the Queue folder.

When a transfer succeeds, the S3 Uploader moves the TDF to the ‘Complete’ folder. You may analyse it to fine-tune your settings. When a transfer fails, the S3 Uploader moves the TDF to the ‘Failed’ folder. Check the TDF of the failed transfer and decide what to do. Perhaps your internet connection is poor and you need to increase the MaxFileRetries value, the MaxPartRetries value, or both. If your computer is very busy, you may try reducing the ThreadsAtOnce value. We do not recommend reducing the PartSizeBytes value.

If a transfer failed and you decide to repeat it, just copy or move the TDF from the ‘Failed’ folder into the ‘Queue’ folder and the transfer will be started again.

What happens if the S3 Uploader itself stops running? Causes include the PC being turned off, the S3 Uploader process being killed, or the S3 Uploader itself failing. If this happens, please notify us and send us the TDF file for which execution failed. In all cases, if a transfer was interrupted, simply restart the S3 Uploader: it knows that the TDFs in the ‘In Progress’ folder should be resumed.
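
Putting this together, a backup job could generate a TDF and drop it into the Queue folder as in the sketch below. All paths, the bucket name and the '.tdf' extension are placeholder assumptions; this document does not mandate a particular file extension for TDFs.

using System.IO;
using System.Text;

class EnqueueSketch
{
    static void Main()
    {
        string queue = @"C:\S3Upload\Queue";                        // assumed RootFolder\Queue path

        // A minimal TDF: the three mandatory parameters plus the optional delete flag.
        var tdf = new StringBuilder()
            .AppendLine(@"Source = D:\Backups\nightly.bak.zip")     // placeholder source file
            .AppendLine("Destination Bucket = my-backup-bucket")    // placeholder bucket
            .AppendLine("Destination Key = backups/nightly.bak.zip")
            .AppendLine("Delete Transferred File On Success = no");

        // Writing the file into the Queue folder is all that is needed to schedule the upload.
        File.WriteAllText(Path.Combine(queue, "nightly.bak.zip.tdf"),
                          tdf.ToString(), Encoding.UTF8);
    }
}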

TDF Example (after upload is finished)

If the upload succeeds, the TDF in the ‘Complete’ folder looks like this:

Source = D:\_Distributives\TortoiseSVN\tortoisesvn-1.6.15.21042-win32-svn-1.6.16.msi
Destination Bucket=tomatindb
Destination Key=folder/tortoisesvn-1.6.15.21042-win32-svn-1.6.16.msi
// ==========
Session = 13.07.2011 16:11:18
Threads at once = 15
Attempt = #1
File size = 20000768 bytes
Part size = 5242880 bytes
Number of parts = 4
UploadId = SI938Zl.0MKqCyH3NFLBbQ2jadey7LrjDJ0DKJrPUtkMFdo3oFuUcBubFfAxwJ0tjLchGl7g9CTw8gSsBgcXkw--
Part = #4 succeeded. Elapsed time 00:05:57.7184603, 14656 bytes per second
Part = #3 succeeded. Elapsed time 00:06:53.7566656, 12671 bytes per second
Part = #1 succeeded. Elapsed time 00:06:55.4037597, 12621 bytes per second
Part = #2 succeeded. Elapsed time 00:06:55.9857931, 12603 bytes per second
Status = Succeeded, elapsed time 00:07:01.5991141, 47440 bytes per second

If an upload attempt fails, the TDF (whether it ends up in the ‘Complete’ or the ‘Failed’ folder) looks more fragmented, with a section for each attempt:

Source=f:\mssql\backup\zip\WS_20110817143900.bak.zip
Destination Bucket=rubric-test
Destination Key=WS_20110817143900.bak.zip
Delete Transferred File On Success=yes
// =============
Session = 26/08/2011 11:16:12
Threads at once = 15
Attempt = #1
File size = 27082453322 bytes
Part size = 5242880 bytes
Number of parts = 5166
UploadId = rHvOPL4MwYW31vT3gS3p3i0FZLHMFIk3xYoQxFhS3S3FcB._QT12nNElVCIldAEN5U64wCpO8H9gHQTV7qLb6Q--
Part = #2 succeeded. Elapsed time 00:01:23.2505310, 62977 bytes per second
Part = #4 succeeded. Elapsed time 00:01:04.9636155, 80704 bytes per second
Part = #3 succeeded. Elapsed time 00:01:57.2395665, 44719 bytes per second
Part = #46 succeeded. Elapsed time 00:01:16.6415790, 68407 bytes per second
Part = #45 succeeded. Elapsed time 00:01:44.4005445, 50218 bytes per second
Part = #47 failed.
Part = #50 succeeded. Elapsed time 00:01:06.3834465, 78978 bytes per second
Part = #48 succeeded. Elapsed time 00:01:37.3570500, 53852 bytes per second
Part = #5165 succeeded. Elapsed time 00:01:26.7952260, 60405 bytes per second
Status = Failed. Parts succeeded 5159, failed 7, found in uploads 0 of 5166
// ==============
Session = 26/08/2011 23:14:19
Threads at once = 15
Attempt = #2
File size = 27082453322 bytes
Part size = 5242880 bytes
Number of parts = 5166
… Etc.

  • The first lines are the TDF itself, as placed into the ‘Queue’ folder. The rest of the file is a transfer report.
  • The Session line shows the date/time when the transfer attempt started.
  • The Threads at once line shows your ThreadsAtOnce setting value.
  • The Attempt line shows the retry number if the file transfer fails. Each attempt has a Status line at the end.
  • File size shows the entire file size in bytes.
  • Part size is the part size from the settings. If the transfer starts and an incomplete upload is found, this value shows the part size from that incomplete upload: it is impossible to continue an incomplete upload with a part size other than the one it was started with.
  • Number of parts is the total number of parts in the current upload session.
  • Each Part line shows the part number and its upload state (succeeded or failed). If the part upload succeeds, the elapsed time and calculated upload rate are shown.
  • Status is the status of the entire session. It also shows the number of parts uploaded successfully, the number of failed parts, and the number of parts found in the incomplete upload (see the small calculation below).
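
As an example of how the figures in the report relate to each other, the overall rate on the successful upload's Status line appears to be the file size divided by the total elapsed time. This derivation is an assumption based on the numbers shown, not a statement from the tool itself:

using System;

long fileSize = 20000768;                                   // "File size" line of the report above
var elapsed = TimeSpan.Parse("00:07:01.5991141");           // elapsed time on the Status line
Console.WriteLine((long)(fileSize / elapsed.TotalSeconds)); // prints 47440, matching "47440 bytes per second"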

Technical Notes on S3

The following technical points may be of interest:

(1) .NET limits the number of simultaneous connections; this is a known .NET behaviour called .NET throttling. By default only 2 concurrent connections to a remote address are allowed. The workaround is either to add the following to the code:

ServicePointManager.DefaultConnectionLimit = 100;

Or to add the following to your App.config file, if you have one (the standard .NET connectionManagement setting):

<configuration>
  <system.net>
    <connectionManagement>
      <add address="*" maxconnection="100" />
    </connectionManagement>
  </system.net>
</configuration>

(Forum: https://forums.aws.amazon.com/thread.jspa?threadID=56179)

(2) You may encounter the error: “The difference between the request time and the current time is too large”. In practice this means that the upload of a single part cannot take longer than 900 seconds (15 minutes); if a part’s upload takes longer than this, the upload fails. To work around this, decrease the ThreadsAtOnce value so that each part uploads faster. There is further discussion of this here - https://forums.aws.amazon.com/thread.jspa?threadID=61234
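
As a rough illustration of why fewer simultaneous threads make each part finish sooner, assume a link whose total throughput is shared evenly between the active threads (the 100,000 bytes/second figure is an arbitrary example, not a measurement):

using System;

long partSize = 5242880;             // 5 MB part
double linkBytesPerSecond = 100000;  // example total uplink throughput shared by all threads

foreach (int threads in new[] { 15, 5 })
{
    double secondsPerPart = partSize / (linkBytesPerSecond / threads);
    Console.WriteLine($"{threads} threads: about {secondsPerPart:F0} s per part (limit 900 s)");
}
// 15 threads: about 786 s per part -> uncomfortably close to the 900 s limit
//  5 threads: about 262 s per part -> well inside it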

Links

Amazon Web Services - http://aws.amazon.com/
Amazon Simple Storage Service (Amazon S3) - http://aws.amazon.com/s3/
Amazon Simple Storage Service Documentation - http://docs.amazonwebservices.com/AmazonS3/latest/dev/
Amazon API Reference - http://docs.amazonwebservices.com/AmazonS3/latest/API/
Amazon S3 Developer Guide - http://docs.amazonwebservices.com/AmazonS3/latest/dev/
Using AWS SDK for .NET - http://docs.amazonwebservices.com/AmazonS3/latest/dev/
AWS SDK for .NET - http://aws.amazon.com/sdkfornet/
AWS SDK for .NET API Reference - http://aws.amazon.com/sdkfornet/
REST API - http://docs.amazonwebservices.com/AmazonS3/latest/API/index.html?APIRest.html
Forums: UploadPart max number of transfers - https://forums.aws.amazon.com/thread.jspa?threadID=56179
Forums: The difference between the request time and the current time is too large - https://forums.aws.amazon.com/thread.jspa?messageID=200325