Advanced Loading and Transformations
This guide presents examples of more advanced transformation topics, such as exploding data, loading multiple tables from a single topic, working with arrays, and accessing nested data.
The Load JSON Data from Kafka guide shows an example of loading each topic into its own table. However, a single topic can route data to multiple tables, and many topics can be loaded simultaneously into one or more tables in a single pipeline.
Here is an example pipeline.json file that will load data into two separate tables:
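The exact pipeline.json schema can vary by LAT version, and the database, table, and column names below (as well as the to_timestamp format string) are hypothetical, so treat this as an illustrative sketch rather than a copy-paste configuration:

```json
{
  "topics": [
    {
      "name": "products",
      "tables": [
        {
          "name": "mydb.public.products",
          "columns": {
            "product_id": "id",
            "product_name": "name",
            "price": "price",
            "created_at": "to_timestamp(created_at, 'YYYY-MM-DD HH:MI:SS')"
          }
        },
        {
          "name": "mydb.public.product_ratings",
          "columns": {
            "product_id": "id",
            "rating": "rating",
            "created_at": "now()"
          }
        }
      ]
    }
  ]
}
```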
A few key points about this example:
- You are loading data from a Kafka topic called products.
- That topic loads into tables referenced by fully qualified names of the form database.schema.table.
- Most columns are mapped directly from the source JSON document; however, some are transformed from the original data:
  - created_at is parsed from a string using the to_timestamp function.
  - created_at for product_ratings is set dynamically to the value of the now() function.
As you can see from these transformations, the syntax is similar to SQL, but the inner language used to traverse the JSON document and manipulate arrays can be unfamiliar. To select a property from the source JSON, follow the tree with dot notation. For example, given two documents like the following (the values are illustrative and are reused throughout this section):
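```json
{
  "country": "United States",
  "capital": { "name": "Washington", "region": "D.C." },
  "states": [
    { "name": "Alabama", "cities": ["Birmingham", "Montgomery"] },
    { "name": "Alaska", "cities": ["Anchorage"] }
  ],
  "languages": ["English", "Spanish"]
}
{
  "country": "Canada",
  "capital": { "name": "Ottawa", "region": "Ontario" },
  "provinces": [
    { "name": "Ontario", "cities": ["Toronto", "Ottawa"] },
    { "name": "Quebec", "cities": ["Montreal"] }
  ],
  "languages": ["English", "French"]
}
```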
You can access a top-level property, like country, simply by referencing it:
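```
country
```

For the sample documents, this returns "United States" and "Canada" respectively.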
You can access nested properties easily as well:
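```
capital.name
```

This returns "Washington" for the first document and "Ottawa" for the second.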
If you want to join the capital sub-properties together, you can easily do so:
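A sketch, assuming a concat-style string function is available (the exact function name in your LAT version may differ):

```
concat(capital.name, ', ', capital.region)
```

This would produce "Washington, D.C." and "Ottawa, Ontario".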
Note that these two documents are not fully cleansed: one carries a states array and the other a provinces array, and you want to capture either one and assign it to a common destination column, regions. You can do this with the coalesce operator ||:
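```
states || provinces
```

Because each document contains only one of the two arrays, the coalesce returns whichever array is present, and the result can be mapped to the common regions column.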
Where arrays are encountered, a special array notation is used. The benefit of this syntax is that properties of objects inside arrays can also be accessed, returning an array of the selected values.
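For example, assuming the [] notation is placed immediately after the array property (check the LAT reference for the exact syntax), you can select the name of every element:

```
states[].name
```

For the first sample document, this returns ["Alabama", "Alaska"].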
Another useful capability is the flattening of arrays. In the event that the values returned from a transformation include nested arrays, you can flatten them into a single-dimension array by appending [] to the end of the transformation expression. For example:
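With the sample documents:

```
states[].cities
states[].cities[]
```

The first expression returns the nested arrays [["Birmingham", "Montgomery"], ["Anchorage"]]; appending [] in the second expression flattens the result to ["Birmingham", "Montgomery", "Anchorage"].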
In some cases, a single record from a loading source should represent more than one row in the target table. This is referred to as exploding the data. It allows nested array data to be expanded into multiple rows, each associated with the corresponding values from the parent record.
For example, if you would like to "explode" the data in the languages array into a single row per value to populate a standalone "country_languages" table, you could do the following for the transform section of your pipeline.json file:
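A sketch that reuses the hypothetical pipeline.json layout from the first example; only the table entry for country_languages is shown:

```json
{
  "name": "mydb.public.country_languages",
  "columns": {
    "country": "country",
    "language": "EXPLODE(languages)"
  }
}
```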
This would result in the following output. Note that instead of the original two records, the transformation produces four rows in the new database table. The "exploded" values are unique to each row, but each row inherits parent values, such as country, from the record being exploded:
Example Rows:
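| country       | language |
|---------------|----------|
| United States | English  |
| United States | Spanish  |
| Canada        | English  |
| Canada        | French   |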
If multiple columns in the transformation include an EXPLODE, their results will be "zipped" together. If the arrays are different lengths, the number of rows will match the longest array, and rows beyond the end of the shorter array are populated with NULL. This is sometimes referred to as an "Outer Explode."
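For example, if one column explodes a three-element array while another column explodes a two-element array from the same record (the values here are hypothetical), the result is three rows:

| exploded_a | exploded_b |
|------------|------------|
| a          | 1          |
| b          | 2          |
| c          | NULL       |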
One common requirement is to split a string on a token to make an array. This is accomplished using transformation functions in the LAT.
Assuming a data input with a space-separated string:
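```json
{ "country": "United States", "states": "Alabama Alaska Arizona" }
```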
You can extract an array of state values in a few ways using the tokenize function. The tokenize function has a number of flags that control how the pattern is interpreted. For example, 'q' indicates that the pattern is a literal, and 'i' indicates a case-insensitive regular expression.
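A sketch, assuming tokenize takes the input expression, the pattern, and the flags in that order (check the LAT function reference for the exact signature); either expression splits the sample string into ["Alabama", "Alaska", "Arizona"]:

```
tokenize(states, ' ', 'q')
tokenize(states, '\\s+', 'i')
```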
For more complex situations, you can split on any regular expression, such as when values are wrapped in double quotes and might contain spaces. For this, you can use the expression (?=(([^\\'\"]*[\\'\"]){2})*[^\\'\"]*$), which leverages a zero-width positive lookahead assertion. Note that because this regex is presented in a JSON document, certain characters must be escaped within the JSON string. For example, in JSON:
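```json
"(?=(([^\\'\"]*[\\'\"]){2})*[^\\'\"]*$)"
```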
This would actually pass the following regular expression to the tokenize function when the backslash escape sequences are processed:
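```
(?=(([^\'"]*[\'"]){2})*[^\'"]*$)
```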
Similarly, note that double backslashes are required to provide escape characters in the regular expression. For example, '\\s' would actually pass the regular expression \s to the tokenize function.
Similar to splitting on a token, the LAT can easily join an array into a string with the join function. With source data that includes an array like the following:
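```json
{ "country": "Canada", "languages": ["English", "French"] }
```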
You can use the join function to produce this string:
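A sketch, assuming join takes the array and a delimiter string (check the LAT function reference for the exact signature):

```
join(languages, ', ')
```

Applied to the sample record above, this returns:

```
English, French
```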
For more information, see the LAT Overview.