Using components: File Storage Destination

Use the File Storage Destination component to write the output of a data flow to files in a designated directory on a file server (SFTP, HDFS) or in an object store (Amazon S3, Google Cloud Storage, Azure Blob Storage). Note: Kerberized HDFS is not supported.

Connection

Select an existing file storage connection or create a new one (for more information, see How to connect to your data sources).

Destination Properties 

  • Target bucket - the name of the target cloud storage bucket where the package output directory and files will be written. Only relevant for object stores such as Amazon S3 and Google Cloud Storage.

  • Target directory - the name of the target directory (within the bucket for object stores). One or more files will be created in the directory. By default, if the target directory already exists, the job will fail (see below).

  • Destination format - defines the target format to use.
    • Delimited values - produces CSV, TSV or any other delimited values format. The output files are UTF-8 encoded.
    • Line delimited JSON - produces one JSON object per record, one record per line of the output files. The output files are UTF-8 encoded.
    • Parquet - Apache Parquet is a columnar storage format popular with Impala, AWS Athena, Presto and other open source DWH solutions. You can control the Parquet output with the system variables: _PARQUET_COMPRESSION, _PARQUET_PAGE_SIZE, _PARQUET_BLOCK_SIZE. Note that datetime and complex data types are not supported with Parquet.

  • Delimited values options
    • Delimiter - select or type a single character to separate values in the output file (tab by default).
    • String Qualifier - if double quote or single quote is selected, fields that contain the delimiter are enclosed in the selected quote character.
    • Escape Character - the character used to escape string qualifiers that appear within field data. Only selectable if a string qualifier is selected (there is nothing to escape without one).
    • Line ending - the line ending for each line. Choose between Unix/Linux (LF) or Windows (CRLF).
  • Write field names in header - check to add a header line containing the field names in each output file.
  • Output compression - Select the type of compression for your data (Gzip, Bzip2 or none). Using Gzip or Bzip2 compression adds a .gz or .bz2 suffix to the output directory name.
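
To make the delimited values options and the line delimited JSON format concrete, here is a minimal Python sketch that produces comparable output using only the standard library. It is an illustration, not the component's implementation: the function names write_delimited and to_json_lines are hypothetical, and the exact quoting and escaping rules the component applies are assumed to behave like Python's csv module.

```python
import csv
import io
import json


def write_delimited(rows, delimiter="\t", qualifier='"', escape="\\", line_ending="\n"):
    """Render rows as delimited text (hypothetical helper, not a product API).

    delimiter   -> the Delimiter option (tab by default)
    qualifier   -> the String Qualifier option
    escape      -> the Escape Character option
    line_ending -> "\n" for Unix/Linux (LF), "\r\n" for Windows (CRLF)
    """
    buf = io.StringIO(newline="")  # newline="" keeps line endings untranslated
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quotechar=qualifier,
        escapechar=escape,
        doublequote=False,          # escape qualifiers instead of doubling them
        quoting=csv.QUOTE_MINIMAL,  # qualify only the fields that need it
        lineterminator=line_ending,
    )
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(records):
    """Render one JSON object per record, one record per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


if __name__ == "__main__":
    rows = [["id", "note"], [1, "plain"], [2, "has\ttab"]]
    text = write_delimited(rows)
    data = text.encode("utf-8")  # the output files are UTF-8 encoded
    print(data)
    print(to_json_lines([{"id": 1, "note": "plain"}]))
```

In this sketch the header row is simply the first row passed in, which corresponds to checking Write field names in header. Only the field that contains the delimiter is wrapped in the string qualifier.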

Destination Action 

    • Write all files directly and fail the job if target directory already exists - Files will be written directly to the target directory. By default, the job will fail if the destination directory already exists. You can use variables to dynamically generate new directory names every time a job is executed (e.g. /output/${_JOB_ID}).
    • Write all files directly and delete target directory if already exists - Files will be written directly to the target directory. During execution, the job checks if the directory exists and deletes it.
    • Write all files directly and replace files in directory if they already exist - Files will be written directly to the target directory. When selected, the job will not fail when the destination directory exists, and files will be written to the existing directory. If files with the same names exist in the directory, they will be overwritten. To add files to the directory instead, use a custom file pattern that is unique for every job execution (e.g. use the variable ${_JOB_ID} in the file prefix; see Target file names below).
      Warning: Target directory will be deleted on job failure.
    • Use intermediate storage and copy files to an existing directory in destination - Files will be written to intermediate storage and then copied to the target directory. Make sure that the target directory already exists.

  • Merge output to single file - check to ensure that only a single file is written to the destination directory. Note that file size limits on some platforms may cause the job to fail.
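
The difference between the destination actions is easiest to see as directory handling before any file is written. The following Python sketch mimics that handling on a local file system. It is an analogy under stated assumptions, not how the component is implemented: the action labels and the functions prepare_target and job_directory are hypothetical names introduced for this example.

```python
import shutil
import tempfile
from pathlib import Path
from string import Template

# Hypothetical labels for the four destination actions; not product identifiers.
FAIL_IF_EXISTS = "fail_if_exists"
DELETE_IF_EXISTS = "delete_if_exists"
REPLACE_FILES = "replace_files"
COPY_TO_EXISTING = "copy_to_existing"


def prepare_target(target: Path, action: str) -> None:
    """Apply one destination action to a local directory before writing files."""
    if action == FAIL_IF_EXISTS:
        if target.exists():
            raise FileExistsError(f"target directory already exists: {target}")
        target.mkdir(parents=True)
    elif action == DELETE_IF_EXISTS:
        if target.exists():
            shutil.rmtree(target)  # existing contents are discarded
        target.mkdir(parents=True)
    elif action == REPLACE_FILES:
        # Existing files stay; files with the same names are overwritten later.
        target.mkdir(parents=True, exist_ok=True)
    elif action == COPY_TO_EXISTING:
        if not target.is_dir():
            raise FileNotFoundError(f"target directory must already exist: {target}")
    else:
        raise ValueError(f"unknown action: {action}")


def job_directory(pattern: str, job_id: str) -> str:
    """Expand ${_JOB_ID} in a directory pattern such as /output/${_JOB_ID}."""
    return Template(pattern).substitute(_JOB_ID=job_id)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    target = root / "output"
    prepare_target(target, FAIL_IF_EXISTS)       # first run creates the directory
    try:
        prepare_target(target, FAIL_IF_EXISTS)   # second run fails
    except FileExistsError as err:
        print("job would fail:", err)
    print(job_directory("/output/${_JOB_ID}", "1234"))
    shutil.rmtree(root)
```

A directory name that includes the job ID is different for every execution, which is why the default action does not fail when the variable is used.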

Character Encoding 

  • Specify the character encoding of the output file. Only available when Destination action is set to Use intermediate storage and copy files to an existing directory in destination.

Target file names

The default file pattern is part-[mr]-[0-9]{5} (for example, part-m-00000). To change it, select a custom pattern:

  • File name prefix - leave empty to keep the default prefix (part-[mr]-) or change it to your custom prefix. Use variables to set the prefix dynamically. A common use case is a timestamp generated dynamically with a variable (for more information, see Using and setting variables).
  • File name suffix - Xplenty automatically suggests suffixes according to the file format and compression type you selected.
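
When downstream scripts need to pick up the output files, the default pattern can be checked with a regular expression. The Python sketch below shows one way to do that. The function names are hypothetical, the check applies to the bare file name without any suffix, and custom_file_name assumes that a custom prefix replaces part-[mr]- while the five-digit counter is kept, which the text above does not confirm.

```python
import re

# The default file pattern quoted above, matched against the bare file name.
DEFAULT_FILE_PATTERN = re.compile(r"part-[mr]-[0-9]{5}")


def is_default_file_name(name: str) -> bool:
    """Return True if name matches part-[mr]-[0-9]{5} exactly."""
    return DEFAULT_FILE_PATTERN.fullmatch(name) is not None


def custom_file_name(prefix: str, index: int, suffix: str = "") -> str:
    """Build a name from a custom prefix and suffix.

    Assumption for illustration only: the custom prefix replaces
    "part-[mr]-" and the five-digit counter is retained.
    """
    if not 0 <= index <= 99999:
        raise ValueError("index must fit in five digits")
    return f"{prefix}{index:05d}{suffix}"


if __name__ == "__main__":
    for name in ("part-m-00000", "part-r-00042", "part-x-00000", "part-m-0000"):
        print(name, is_default_file_name(name))
    print(custom_file_name("job1234-", 7, ".csv.gz"))
```

Including a per-job value such as the job ID in the prefix keeps names unique across executions, which matters when files are added to an existing directory.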