
Is there a way to load a Gzipped file from Amazon S3 into Pentaho (PDI / Spoon / Kettle)?

Is there a way to load a Gzipped file from Amazon S3 into Pentaho Data Integration (Spoon)?

There is a "Text File Input" that has a Compression attribute that supports Gzip, but this module can't connect to S3 as a source.

There is an "S3 CSV Input" module, but no Compression attribute, so it can't decompress the Gzipped content 开发者_开发技巧into tabular form.

Also, there is no way to save the data from S3 to a local file. The downloaded content can only be "hopped" to another Step, but no Step can read gzipped data from a previous Step; the Gzip-compatible steps all read only from files.

So, I can get gzipped data from S3, but I can't send that data anywhere that can consume it.

Am I missing something? Is there a way to unzip zipped data from a non-file source?


Kettle uses VFS (Virtual File System) when working with files. Therefore, you can fetch a file through http, ssh, ftp, zip, ... and use it as a regular, local file in all the steps that read files. Just use the right URL. The Pentaho documentation covers this in more detail, and the VFS transformation examples that come with Kettle are worth checking out as well.
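
Under the hood, Kettle's VFS support is based on Apache Commons VFS, so the same idea can be tried outside Spoon as well. The sketch below is a minimal, illustrative Java example (not Kettle code): it assumes commons-vfs2 is on the classpath, and the URL and file name are placeholders.

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class VfsReadSketch {
    public static void main(String[] args) throws Exception {
        // One manager serves every registered scheme: file, http, ftp, sftp, zip, ...
        FileSystemManager fsManager = VFS.getManager();

        // The scheme in the URL decides how the file is fetched; the reading code
        // does not care whether the file is local or remote.
        FileObject remote = fsManager.resolveFile("http://example.com/data/sample.csv"); // placeholder URL

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(remote.getContent().getInputStream(),
                                      StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            remote.close();
        }
    }
}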

This is the URL template for S3: s3://<Access Key>:<Secret Access Key>@s3<file path>

In your case, you would use the "Text file input" step with the compression setting you mentioned, and the selected file would be:

s3://aCcEsSkEy:SecrEttAccceESSKeeey@s3/your-s3-bucket/your_file.gzip
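
For illustration, here is roughly what that "VFS URL plus gzip decompression" combination does, written as a plain Java sketch rather than a transformation. It assumes an S3 VFS provider (such as Pentaho's) is registered with Commons VFS, and the credentials, bucket and file name are the placeholders from the example above.

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.VFS;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class S3GzipReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials, bucket and key; the s3 scheme only resolves if an
        // S3 provider is available to VFS.
        String url = "s3://aCcEsSkEy:SecrEttAccceESSKeeey@s3/your-s3-bucket/your_file.gzip";

        FileObject s3File = VFS.getManager().resolveFile(url);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        // GZIPInputStream wraps any stream, not just local files;
                        // this is what the Compression setting does for you in Spoon.
                        new GZIPInputStream(s3File.getContent().getInputStream()),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            s3File.close();
        }
    }
}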


I don't know the exact steps, but if you really need this, you can look into using S3 through the VFS capabilities that Pentaho Data Integration provides. I can see a vfs-providers.xml with the following content in my PDI CE distribution, inside ../data-integration/libext/pentaho/pentaho-s3-vfs-1.0.1.jar:

<providers>
  <provider class-name="org.pentaho.s3.vfs.S3FileProvider">
    <scheme name="s3"/>
    <if-available class-name="org.jets3t.service.S3Service"/>
  </provider>
</providers>
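
If you want to verify that the s3 scheme is actually registered in a given VFS setup, Commons VFS can be queried for its known schemes. A minimal sketch, assuming commons-vfs2 is on the classpath; whether the Pentaho S3 provider from the jar above registers itself with a plain VFS.getManager() depends on your setup.

import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class VfsSchemeCheck {
    public static void main(String[] args) throws Exception {
        FileSystemManager fsManager = VFS.getManager();

        // hasProvider() reports whether a provider for the scheme was registered;
        // for "s3" that would be the S3FileProvider declared in vfs-providers.xml.
        System.out.println("s3 scheme available: " + fsManager.hasProvider("s3"));

        // List every scheme the manager currently knows about.
        for (String scheme : fsManager.getSchemes()) {
            System.out.println("registered scheme: " + scheme);
        }
    }
}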


You can also try the GZIP input control in Pentaho Kettle; it is there.
