Packages define the data flow using components that specify the data to process, the data manipulation to perform, and the output destinations. Each package requires at least one source and one destination.
Once you define a package, you can validate it and, as in any development lifecycle, fix any errors and re-validate until the package is ready to run as a job on a cluster.
Click the following links for information on using packages:
- Creating a new package
- Creating a new package from a template
- Working in the package designer
- Using and setting variables in your packages
- Validating a package
- Using pattern matching in source component paths
- Using ISO 8601 date/time functions
- Using functions in components
| Component | Description |
| --- | --- |
| Amazon Redshift Source | Read data stored in an Amazon Redshift table or view, or returned by a query. |
| Bing Ads Source | Read Bing Ads report data. |
| Database Source | Read data stored in a database table or view, or returned by a query. |
| Facebook Ads Insights Source | Read Facebook Ads Insights report data. |
| File Storage Source | Read data stored in one or more files in object stores such as Amazon S3, Google Cloud Storage or Azure Blob Storage, or on file servers such as SFTP. |
| Google AdWords Source | Read Google AdWords report data. |
| Google Analytics Source | Read Google Analytics report data. |
| Google BigQuery Source | Read data stored in a Google BigQuery table or returned by a query. |
| Google Cloud Spanner Source | Read data stored in a Google Cloud Spanner table or returned by a query. |
| MongoDB Source | Read data stored in a MongoDB collection. |
| NetSuite Source | Read NetSuite standard and custom records (tables) using the NetSuite JDBC drivers (SuiteAnalytics Connect). |
| Salesforce Source | Read Salesforce Sales Cloud standard and custom objects using the Bulk API. |
| REST API Source | Read data from HTTP endpoints such as REST web services. Use the REST API source component to define the authentication method, request parameters and response fields to use in the package. |
| Aggregate Transformation | Use the Aggregate transformation to group the input dataset by one or more fields and apply aggregate functions such as Count, Average, Minimum, Maximum, etc. |
| Assert Transformation | Use the Assert transformation to make sure that all data in the source complies with the conditions you specify in the component. If a record does not comply, the job fails and a message is added to the error log. |
| Clone Transformation | Use the Clone transformation to split a data flow into two data flows in order to apply multiple transformations to the same data. |
| Cross Join Transformation | Use the Cross Join transformation to combine records from two different inputs. The cross join returns the Cartesian product of records from the two inputs. That is, it produces records that combine each record from the left input with each record from the right input. |
| Distinct Transformation | Use the Distinct transformation to filter out duplicate records that have the same values in all fields, leaving only unique records. For example, you might need to filter out users' double-clicks in events. |
| Filter Transformation | Use the Filter transformation to filter input data by defining conditions that must be met by records in the input. |
| Join Transformation | Use the Join transformation to combine records from two different inputs. The Join component can be used to add information from one data source to another, or to keep only data that exists in both data sources or in only one of them. |
| Limit Transformation | Use the Limit transformation to limit the number of records in the output, either for the entire dataset or per partition or group within the dataset. |
| Rank Transformation | Use the Rank transformation to sort input data by one or more fields, in ascending or descending order, and add a rank field that reflects the sort order. |
| Select Transformation | Use the Select transformation to choose which fields from the input will be available in the next component and to transform them using expressions in order to parse input data, enrich it, extract information from it or manipulate it. |
| Sort Transformation | Use the Sort transformation to sort input data by one or more fields, in ascending or descending order. |
| Union Transformation | Use the Union transformation to combine records from two inputs with the same schema (same fields and data types). |
| Window Transformation | Use the Window transformation to apply window functions to incoming data, similar to window functions in SQL. These functions let you rank or distribute data and calculate moving averages, running totals and similar values. The output of the Window component contains all records and fields from the input data flow, with the addition of the calculated window functions. |
| Sample Transformation | Use the Sample transformation to return a percentage of random records from the input. |
| Cube Transformation | Use the Cube and Rollup component to group the input dataset by combinations of fields and apply aggregate functions such as Count, Average, Minimum, Maximum, etc. |
| Amazon Redshift Destination | Use the Amazon Redshift destination component to store the output of a data flow in an Amazon Redshift table. |
| Database Destination | Use the Database destination component to store the output of a data flow in a relational database table. |
| File Storage Destination | Use the File Storage destination component to store the output of a data flow in files in a designated directory on a file server (SFTP, HDFS) or object store (Amazon S3, Google Cloud Storage, Azure Blob Storage). |
| Google BigQuery Destination | Use the Google BigQuery destination component to store the output of a data flow in a BigQuery table. |
| Google Spanner Destination | Use the Google Spanner destination component to store the output of a data flow in a Google Spanner table. |
| MongoDB Destination | Use the MongoDB destination component to store the output of a data flow in a MongoDB collection. |
| Salesforce Destination | Use the Salesforce destination component to store the output of a data flow in a Salesforce Sales Cloud object. |
| Snowflake Destination | Use the Snowflake destination component to store the output of a data flow in a Snowflake table. |
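To make a few of the transformation descriptions more concrete, the following is a minimal sketch in plain Python of what the Cross Join, Distinct and Window (running total) transformations do to a set of records. It is not how packages are defined or executed; the function names, field names and sample records are all hypothetical and exist only for illustration.

```python
from itertools import product

def cross_join(left, right):
    """Cartesian product: pair every left record with every right record."""
    return [{**l, **r} for l, r in product(left, right)]

def distinct(records):
    """Drop records whose values match an earlier record in all fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def running_total(records, field, out_field):
    """Window-style function: keep every input record and add a running total."""
    total = 0
    result = []
    for rec in records:
        total += rec[field]
        result.append({**rec, out_field: total})
    return result

# Hypothetical sample inputs (field names are assumptions, not product fields).
users = [{"user": "ann"}, {"user": "bob"}]
days = [{"day": "2024-01-01"}, {"day": "2024-01-02"}]
clicks = [
    {"user": "ann", "item": 7},
    {"user": "ann", "item": 7},  # a double-click: identical in all fields
    {"user": "bob", "item": 3},
]
sales = [{"day": 1, "amount": 10}, {"day": 2, "amount": 5}, {"day": 3, "amount": 20}]

pairs = cross_join(users, days)          # 2 x 2 = 4 combined records
unique_clicks = distinct(clicks)         # the duplicate click is removed
with_totals = running_total(sales, "amount", "running_total")
```

Note that the running total keeps every input record and only adds a field, which mirrors the description of the Window component's output, whereas Distinct reduces the number of records.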