I've used Amazon S3 a little bit for backups for some time. Usually, after I upload a file I check that the MD5 sum matches to ensure I've made a good backup. S3 has the "etag" header, which used to give this sum.
However, when I uploaded a large file recently, the ETag no longer seemed to be an MD5 sum. It has extra digits and a hyphen: "696df35ad1161afbeb6ea667e5dd5dab-2861". I've checked using the S3 management console and with Cyberduck.
I can't find any documentation about this change. Any pointers?
You will always get this style of ETag when a file is uploaded as a multipart upload. If you upload the whole file in a single request, you will get an ETag without the -{xxxx} suffix.
Bucket Explorer will show the unsuffixed ETag for a multipart file up to 5 GB.
AWS:
The ETag for an object created using the multipart upload api will contain one or more non-hexadecimal characters and/or will consist of less than 16 or more than 16 hexadecimal digits.
Reference: https://forums.aws.amazon.com/thread.jspa?messageID=203510#203510
Amazon S3 calculates the ETag with a different algorithm (not the usual MD5 sum of the whole file) when you upload a file using multipart.
This algorithm is detailed here : http://permalink.gmane.org/gmane.comp.file-systems.s3.s3tools/583
"Calculate the MD5 hash for each uploaded part of the file, concatenate the hashes into a single binary string and calculate the MD5 hash of that result."
I just developed a tool in bash to calculate it, s3md5: https://github.com/Teachnova/s3md5
For example, to calculate the ETag of a file foo.bin that was uploaded using multipart with a chunk size of 15 MB, run:
# s3md5 15 foo.bin
Now you can check the integrity of a very big file (bigger than 5 GB), because you can calculate the ETag of the local file and compare it with the S3 ETag.
Also in python...
#!/usr/bin/env python3
import binascii
import hashlib
import os
# Max size in bytes before uploading in parts. 
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts
# note: 2022-01-27 bitnami-minio container uses 5 mib
AWS_UPLOAD_PART_SIZE = int(os.environ.get('AWS_UPLOAD_PART_SIZE', 5 * 1024 * 1024))
def md5sum(sourcePath):
    '''
    Function: md5sum
    Purpose: Get the md5 hash of a file stored in S3
    Returns: Returns the md5 hash that will match the ETag in S3    
    '''
    filesize = os.path.getsize(sourcePath)
    hash = hashlib.md5()
    if filesize > AWS_UPLOAD_MAX_SIZE:
        block_count = 0
        md5bytes = b""
        with open(sourcePath, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                hash = hashlib.md5()
                hash.update(block)
                block = f.read(AWS_UPLOAD_PART_SIZE)
                md5bytes += binascii.unhexlify(hash.hexdigest())
                block_count += 1
        hash = hashlib.md5()
        hash.update(md5bytes)
        hexdigest = hash.hexdigest() + "-" + str(block_count)
    else:
        with open(sourcePath, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                hash.update(block)
                block = f.read(AWS_UPLOAD_PART_SIZE)
        hexdigest = hash.hexdigest()
    return hexdigest
Here is an example in Go:
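import (
    "crypto/md5"
    "fmt"
    "io/ioutil"
)
// GetEtag returns the S3-style ETag of the file at path for the given part size in MiB.
func GetEtag(path string, partSizeMb int) string {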
    partSize := partSizeMb * 1024 * 1024
    content, _ := ioutil.ReadFile(path)
    size := len(content)
    contentToHash := content
    parts := 0
    if size > partSize {
        pos := 0
        contentToHash = make([]byte, 0)
        for size > pos {
            endpos := pos + partSize
            if endpos >= size {
                endpos = size
            }
            hash := md5.Sum(content[pos:endpos])
            contentToHash = append(contentToHash, hash[:]...)
            pos += partSize
            parts += 1
        }
    }
    hash := md5.Sum(contentToHash)
    etag := fmt.Sprintf("%x", hash)
    if parts > 0 {
        etag += fmt.Sprintf("-%d", parts)
    }
    return etag
}
This is just an example; you should add error handling (the error returned by ioutil.ReadFile is ignored here).
Here's a PowerShell function to calculate the Amazon ETag for a file:
$blocksize = (1024*1024*5)
$startblocks = (1024*1024*16)
function AmazonEtagHashForFile($filename) {
    $lines = 0
    [byte[]] $binHash = @()
    $md5 = [Security.Cryptography.HashAlgorithm]::Create("MD5")
    $reader = [System.IO.File]::Open($filename,"OPEN","READ")
    if ((Get-Item $filename).length -gt $startblocks) {
        $buf = new-object byte[] $blocksize
        while (($read_len = $reader.Read($buf,0,$buf.length)) -ne 0){
            $lines   += 1
            $binHash += $md5.ComputeHash($buf,0,$read_len)
        }
        $binHash=$md5.ComputeHash( $binHash )
    }
    else {
        $lines   = 1
        $binHash += $md5.ComputeHash($reader)
    }
    $reader.Close()
    $hash = [System.BitConverter]::ToString( $binHash )
    $hash = $hash.Replace("-","").ToLower()
    if ($lines -gt 1) {
        $hash = $hash + "-$lines"
    }
    return $hash
}
If you use multipart uploads, the "etag" is not the MD5 sum of the data (see What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?). One can identify this case by the etag containing a dash, "-".
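The dash check can be sketched in Python as below. This is only an illustration: parse_etag and is_multipart_etag are hypothetical helper names, not part of any AWS tool, and the sketch assumes the ETag is either a 32-digit hex string or that string followed by "-N".

```python
import re

# Assumed ETag shapes: 32 hex digits, optionally followed by "-<part count>".
_ETAG_RE = re.compile(r'^([0-9a-f]{32})(?:-(\d+))?$')

def parse_etag(etag):
    """Split an ETag into (hex_digest, part_count); part_count is None without a suffix."""
    # S3 returns the ETag wrapped in double quotes, so strip them first.
    value = etag.strip().strip('"').lower()
    match = _ETAG_RE.match(value)
    if match is None:
        raise ValueError("unrecognised ETag format: %r" % etag)
    digest, parts = match.groups()
    return digest, (int(parts) if parts is not None else None)

def is_multipart_etag(etag):
    """True when the ETag carries a "-N" suffix, i.e. it is not a plain MD5 sum."""
    return parse_etag(etag)[1] is not None
```

For the ETag from the question, parse_etag returns the hex part together with a part count of 2861.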
Now, the interesting question is how to get the actual MD5 sum of the data, without downloading? One easy way is to just "copy" the object onto itself, this requires no download:
s3cmd cp s3://bucket/key s3://bucket/key
This will cause S3 to recompute the MD5 sum and store it as "etag" of the just copied object. The "copy" command runs directly on S3, i.e., no object data is transferred to/from S3, so this requires little bandwidth! (Note: do not use s3cmd mv; this would delete your data.)
The underlying REST command is:
PUT /key HTTP/1.1
Host: bucket.s3.amazonaws.com
x-amz-copy-source: /bucket/key
x-amz-metadata-directive: COPY
Copying to s3 with aws s3 cp can use multipart uploads and the resulting etag will not be an md5, as others have written.
To upload files without multipart, use the lower level put-object command.
aws s3api put-object --bucket bucketname --key remote/file --body local/file
This AWS support page - How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3? - describes a more reliable way to verify the integrity of your s3 backups.
Firstly determine the base64 encoded md5sum of the file you wish to upload:
$ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"
Then use the s3api to upload the file:
$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64"
Note the use of the --content-md5 flag, the help for this flag states:
--content-md5  (string)  The  base64-encoded  128-bit MD5 digest of the part data.
This does not say much about why to use this flag, but we can find this information in the API documentation for put object:
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
Using this flag causes S3 to verify on the server side that the file hash matches the specified value. If the hashes match, S3 will return the ETag:
{
    "ETag": "\"599393a2c526c680119d84155d90f1e5\""
}
The ETag value will usually be the hexadecimal md5sum (see this question for some scenarios where this may not be the case).
If the hash does not match the one you specified you get an error.
A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.
In addition to this you can also add the file md5sum to the file metadata as an additional check:
$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"
After upload you can issue the head-object command to check the values.
$ aws s3api head-object --bucket my-bucket --key my-file
{
    "AcceptRanges": "bytes",
    "ContentType": "binary/octet-stream",
    "LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
    "ContentLength": 605,
    "ETag": "\"599393a2c526c680119d84155d90f1e5\"",
    "Metadata": {    
        "md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="    
    }    
}
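The ETag and the md5chksum metadata in the output above are the same 16-byte digest in two encodings (hex and base64). A small Python sketch for converting between them, so the two values can be compared directly; hex_to_base64 and base64_to_hex are hypothetical helper names, and this assumes the ETag is a plain MD5 sum without a "-N" suffix:

```python
import base64
import binascii

def hex_to_base64(hex_digest):
    """Convert a hex MD5 digest (e.g. a plain ETag) to the base64 form used by Content-MD5."""
    # Strip the double quotes that S3 puts around the ETag value.
    raw = binascii.unhexlify(hex_digest.strip('"'))
    return base64.b64encode(raw).decode("ascii")

def base64_to_hex(b64_digest):
    """Convert a base64 MD5 digest back to lowercase hex."""
    return binascii.hexlify(base64.b64decode(b64_digest)).decode("ascii")
```

With the values shown above, hex_to_base64('"599393a2c526c680119d84155d90f1e5"') gives WZOTosUmxoARnYQVXZDx5Q==.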
Here is a bash script that uses content md5 and adds metadata and then verifies that the values returned by S3 match the local hashes:
#!/bin/bash
set -euf -o pipefail
# assumes you have aws cli, jq installed
# change these if required
tmp_dir="$HOME/tmp"
s3_dir="foo"
s3_bucket="stack-overflow-example"
aws_region="ap-southeast-2"
aws_profile="my-profile"
test_dir="$tmp_dir/s3-md5sum-test"
file_name="MailHog_linux_amd64"
test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
s3_key="$s3_dir/$file_name"
return_dir="$( pwd )"
cd "$tmp_dir" || exit
mkdir "$test_dir"
cd "$test_dir" || exit
wget "$test_file_url"
md5_sum_hex="$( md5sum "$file_name" | awk '{ print $1 }' )"
md5_sum_base64="$( openssl md5 -binary "$file_name" | base64 )"
echo "$file_name hex    = $md5_sum_hex"
echo "$file_name base64 = $md5_sum_base64"
echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api put-object \
--bucket "$s3_bucket" \
--key "$s3_key" \
--body "$file_name" \
--metadata md5chksum="$md5_sum_base64" \
--content-md5 "$md5_sum_base64"
echo "Verifying sums match"
s3_md5_sum_hex=$( aws --profile "$aws_profile"  --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.ETag' | sed 's/"//'g )
s3_md5_sum_base64=$( aws --profile "$aws_profile"  --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.Metadata.md5chksum' )
if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
    echo "checksums match"
else
    echo "something is wrong checksums do not match:"
    cat <<EOM | column -t -s ' '
$file_name file hex:    $md5_sum_hex    s3 hex:    $s3_md5_sum_hex
$file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
EOM
fi
echo "Cleaning up"
cd "$return_dir"
rm -rf "$test_dir"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api delete-object \
--bucket "$s3_bucket" \
--key "$s3_key"
Here is a C# version. Usage:
    string etag = HashOf("file.txt",8);
Source code:
    private string HashOf(string filename,int chunkSizeInMb)
    {
        string returnMD5 = string.Empty;
        int chunkSize = chunkSizeInMb * 1024 * 1024;
        using (var crypto = new MD5CryptoServiceProvider())
        {
            int hashLength = crypto.HashSize/8;
            using (var stream = File.OpenRead(filename))
            {
                if (stream.Length > chunkSize)
                {
                    int chunkCount = (int)Math.Ceiling((double)stream.Length/(double)chunkSize);
                    byte[] hash = new byte[chunkCount*hashLength];
                    Stream hashStream = new MemoryStream(hash);
                    long nByteLeftToRead = stream.Length;
                    while (nByteLeftToRead > 0)
                    {
                        int nByteCurrentRead = (int)Math.Min(nByteLeftToRead, chunkSize);
                        byte[] buffer = new byte[nByteCurrentRead];
                        nByteLeftToRead -= stream.Read(buffer, 0, nByteCurrentRead);
                        byte[] tmpHash = crypto.ComputeHash(buffer);
                        hashStream.Write(tmpHash, 0, hashLength);
                    }
                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(hash)).Replace("-", string.Empty).ToLower()+"-"+ chunkCount;
                }
                else {
                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();
                }
                stream.Close();
            }
        }
        return returnMD5;
    }
To go one step beyond the OP's question: chances are, these chunked ETags are making your life difficult when you try to compare them client-side.
If you are publishing your artifacts to S3 using the awscli commands (cp, sync, etc), the default threshold at which multipart upload seems to be used is 10MB. Recent awscli releases allow you to configure this threshold, so you can disable multipart and get an easy to use MD5 ETag:
aws configure set default.s3.multipart_threshold 64MB
Full documentation here: http://docs.aws.amazon.com/cli/latest/topic/s3-config.html
A consequence of this could be downgraded upload performance (I honestly did not notice). But the result is that all files smaller than your configured threshold will now have normal MD5 hash ETags, making them much easier to delta client side.
This does require a somewhat recent awscli install. My previous version (1.2.9) did not support this option, so I had to upgrade to 1.10.x.
I was able to set my threshold up to 1024MB successfully.
Based on answers here, I wrote a Python implementation which correctly calculates both multi-part and single-part file ETags.
import hashlib
def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []
    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))
    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())
    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))
The default chunk_size of 8 MB is the one used by the official aws cli tool, which does a multipart upload for 2+ chunks. The function should work under both Python 2 and 3.
Improving on @Spedge's and @Rob's answer, here is a python3 md5 function that takes in a file-like and does not rely on being able to get the file size with os.path.getsize.
import hashlib
# Function : md5sum
# Purpose : Get the md5 hash of a file stored in S3
# Returns : Returns the md5 hash that will match the ETag in S3
# https://github.com/boto/boto3/blob/0cc6042615fd44c6822bd5be5a4019d0901e5dd2/boto3/s3/transfer.py#L169
def md5sum(file_like,
           multipart_threshold=8 * 1024 * 1024,
           multipart_chunksize=8 * 1024 * 1024):
    md5hash = hashlib.md5()
    file_like.seek(0)
    filesize = 0
    block_count = 0
    md5string = b''
    for block in iter(lambda: file_like.read(multipart_chunksize), b''):
        md5hash = hashlib.md5()
        md5hash.update(block)
        md5string += md5hash.digest()
        filesize += len(block)
        block_count += 1
    if filesize > multipart_threshold:
        md5hash = hashlib.md5()
        md5hash.update(md5string)
        md5hash = md5hash.hexdigest() + "-" + str(block_count)
    else:
        md5hash = md5hash.hexdigest()
    file_like.seek(0)
    return md5hash
I built on r03's answer and have a standalone Go utility for this here: https://github.com/lambfrier/calc_s3_etag
Example usage:
$ dd if=/dev/zero bs=1M count=10 of=10M_file
$ calc_s3_etag 10M_file
669fdad9e309b552f1e9cf7b489c1f73-2
$ calc_s3_etag -chunksize=15 10M_file
9fbaeee0ccc66f9a8e3d3641dca37281-1
Of course, multipart upload is a common cause of this. In my case, though, I was serving static files through S3, and the ETag of a .js file came out different from the local file's hash even though the content looked the same.
It turned out that the line endings were different. I fixed the line endings in my git repository, uploaded the changed files to S3, and it works fine now.
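To illustrate the line-ending effect, here is a minimal Python sketch. The sample bytes and the helper names md5_hex and normalize_newlines are made up for the example; it only shows that CRLF and LF versions of the same text hash differently until they are normalized.

```python
import hashlib

def md5_hex(data):
    """Hex MD5 of a bytes object, i.e. what a plain (non-multipart) ETag would show."""
    return hashlib.md5(data).hexdigest()

def normalize_newlines(data):
    """Convert Windows (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")

# Hypothetical file contents: same text, different line endings.
unix_text = b"console.log('hi');\n"
windows_text = b"console.log('hi');\r\n"

same_before = md5_hex(unix_text) == md5_hex(windows_text)
same_after = md5_hex(unix_text) == md5_hex(normalize_newlines(windows_text))
print(same_before, same_after)
```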
The python example works great, but when working with Bamboo, they set the part size to 5MB which is NON STANDARD!! (s3cmd is 15MB) Also adjusted to use 1024 to calculate bytes.
Revised to work for bamboo artifact s3 repos.
import binascii
import hashlib
import os
# Max size in bytes before uploading in parts.
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts
AWS_UPLOAD_PART_SIZE = 5 * 1024 * 1024
#
# Function : md5sum
# Purpose : Get the md5 hash of a file stored in S3
# Returns : Returns the md5 hash that will match the ETag in S3
def md5sum(sourcePath):
    filesize = os.path.getsize(sourcePath)
    hash = hashlib.md5()
    if filesize > AWS_UPLOAD_MAX_SIZE:
        block_count = 0
        # Concatenated binary digests of the parts (bytes, so this also runs on Python 3)
        md5string = b""
        with open(sourcePath, "rb") as f:
            # The sentinel must be b"": a binary read never returns the str ""
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), b""):
                hash = hashlib.md5()
                hash.update(block)
                md5string = md5string + binascii.unhexlify(hash.hexdigest())
                block_count += 1
        hash = hashlib.md5()
        hash.update(md5string)
        return hash.hexdigest() + "-" + str(block_count)
    else:
        with open(sourcePath, "rb") as f:
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), b""):
                hash.update(block)
        return hash.hexdigest()
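Since the part size depends on the uploading tool (5 MB for Bamboo, 15 MB for s3cmd, as noted above), it can help to work out which part sizes are even possible for a given object. A rough Python 3 sketch, assuming the part size is a whole number of MiB and every part except the last is full; candidate_part_sizes_mib is a hypothetical helper, and the upper bound of 5120 MiB is just a search limit chosen for the example:

```python
MIB = 1024 * 1024

def candidate_part_sizes_mib(file_size, part_count, max_mib=5120):
    """Whole-MiB part sizes that would split file_size bytes into exactly part_count parts."""
    if file_size < 1 or part_count < 1:
        return []
    sizes = []
    for mib in range(1, max_mib + 1):
        part_size = mib * MIB
        # Integer ceiling division: number of parts needed at this part size.
        parts_needed = -(-file_size // part_size)
        if parts_needed == part_count:
            sizes.append(mib)
    return sizes

# A 10 MiB file whose ETag ends in "-2" must have used a part size of 5 to 9 MiB.
print(candidate_part_sizes_mib(10 * MIB, 2))
```

Each candidate can then be tried as the part size in one of the functions above until the computed value matches the ETag reported by S3.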