Documentation Index
Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
Use this file to discover all available pages before exploring further.
Data Pipelines are now the preferred method for loading data into the System. For details, see Load Data.
This guide presents examples of some more advanced transformation topics such as exploding data, loading multiple tables from a single topic, working with arrays, and accessing nested data.
Loading Multiple Tables from a Topic
The LAT Load JSON Data from Kafka guide shows an example of loading each topic into its own table. However, each topic can route data to multiple tables and many topics can stream load simultaneously to one or more tables in a single pipeline.
Here is an example pipeline.json file that will load data into two separate tables:
{
"version": 2,
"pipeline_id": "pipeline-metabase",
"source": {
"type": "kafka",
"kafka": {
"bootstrap.servers": "127.0.0.1:9092",
}
},
"transform": {
"topics": {
"products": {
"tables": {
"metabase.public.products": {
"columns": {
"id": "id",
"ean": "ean",
"title": "title",
"category": "category",
"vendor": "vendor",
"price": "price",
"rating": "rating",
"created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')"
}
},
"metabase.public.product_ratings": {
"columns": {
"created_at": "now()",
"product_id": "id",
"rating": "rating",
}
}
}
}
}
}
}
A few key points here.
- You are loading data from a Kafka topic called
products.
- That topic is loading into a table called
database.schema.table.
- Most columns are simply mapped from the source JSON document into, however some are transformed from the original data.
created_at is parsed from a string using the to_timestamp function
created_at for product_ratings is set dynamically to the value of the now() function
Loading Nested JSON Data
As you can see from the transformation, the syntax is similar to SQL, but the inner language used to traverse the JSON document and manipulate arrays can be unfamiliar. The approach used to select a property from the source JSON is to follow the tree with dot notation. For example:
Given a series of two documents like:
{
"country" : "US",
"states" : ["AR", "VA", "MD", "IL"],
"languages" : [
{
"name" : "English",
"locale" : "en-US",
"dialects" : ["Northern American English", "Southeast Super-Regional English", "Western American English"]
},
{
"name" : "Spanish",
"locale" : "es-US",
"dialects" : ["Mexican Spanish", "Caribbean Spanish", "Central American Spanish", "South American Spanish", "Colonial Spanish"]
}
],
"capital" : {
"city" : "Washington",
"region" : "District of Columbia"
},
"founded_in" : 1776
}
{
"country" : "CA",
"provinces" : ["BC", "ON", "QC", "MB"],
"languages" : [
{
"name" : "English",
"locale" : "en-CA"
},
{
"name" : "French",
"locale" : "fr-CA"
}
],
"capital" : {
"city" : "Ottawa",
"region" : "Ontario"
},
"founded_in" : 1867
}
You can access a property like the country simply by referencing the property:
/* Ocient transformations are in this format */
{
"column_name" : "transformation"
}
/* to apply the country property to the country_code column */
{
"country_code" : "country"
}
You can access nested properties easily as well:
/* sets the column capital_city to values like "Ottawa" */
{
"capital_city" : "capital.city"
}
If you want to join the capital sub-properties together, you can easily do so:
/* sets the column capital_city to values like "Ottawa, Ontario" */
{
"capital_city" : "concat(capital.city, ',', capital.region)"
}
Note that these two documents are not fully cleansed, and you want to capture the states or provinces array and assign it to a common destination column, regions. You can do this with a coalesce operator ||:
/* sets the column regions to ["AR", "VA", "MD", "IL"] for the US and ["BC", "ON", "QC", "MB"] for Canada */
{
"regions" : "states || provinces"
}
Loading Array Data
Where arrays are encountered, special array notation is used. The benefit of this syntax is that properties in objects in arrays can also be accessed to return an array.
Another useful capability is the flattening of arrays. In the event that the values returned from a transformation includes nested arrays, you can flatten it to a single dimension array by appending [] to the end of the transformation expression. For example:
/* For the US, this transformation would return a structure like: [ ["", ""], ["", ""] ] */
{
"regions" : "languages.dialects"
}
/* You can flatten this to return this one-dimensional array with structure like: [ "", "", "", "" ] */
{
"regions" : "languages.dialects[]"
}
Exploding Array Data
In some cases, a single record from a loading source should represent more than one row in the target table. This is referred to as exploding the data. This allows nested array data to be expanded and to be associated with the corresponding values from the record.
For example, if you would like to “explode” the data in the languages array into a single row per value to populate a standalone “country_languages” table, you could do the following for the transform section of your pipeline.json file:
{
"transform": {
"tables": {
"my_database.my_schema.country_languages": {
"columns": {
"country_name": "country",
"language_name": "EXPLODE(languages[].name)",
"dialects": "EXPLODE(languages[].dialects)",
"locale": "EXPLODE(languages[].locale)"
}
}
}
}
}
This would result in the following output. Note that instead of the original 2 records, this result will produce four rows into the new database table. The “exploded” values are unique, but the records inherit their parent values like country from the record when the explode occurs:
Example Rows:
{
"country" : "US",
"language_name" : "English",
"dialects" : ["Northern American English", "Southeast Super-Regional English", "Western American English"],
"locale": "en-US"
}
{
"country" : "US",
"language_name" : "Spanish",
"dialects" : ["Mexican Spanish", "Caribbean Spanish", "Central American Spanish", "South American Spanish", "Colonial Spanish"],
"locale": "es-US"
}
{
"country" : "CA",
"language_name" : "English",
"dialects" : NULL,
"locale" : "en-CA"
}
{
"country" : "CA",
"language_name" : "French",
"dialects" : NULL,
"locale" : "fr-CA"
}
If multiple columns in the transformation include an EXPLODE, they results will be “zipped” together. If the arrays are different lengths, the number of rows will match the longest array. The shorter array will populate the end of its rows with NULL. This is sometimes referred to as an “Outer Explode.”
One common requirement is to split a string on a token to make an array. This is accomplished using transformation functions in the LAT.
Assuming a data input with a space separated string:
{ "states": "AL AK AZ AR CA CO CT DE" }
You can extract an array of state values in a few ways using the tokenize function. Tokenize has a number of flags to control how the pattern should be interpreted. For example 'q' indicates the pattern is a literal and 'i' indicates case insensitive regular expression.
/* You can use a literal space to split and load ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE"] */
{
"states_col" : "tokenize(states, ' ', 'q')"
}
/* You can use a regular expression for whitespace to split and load ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE"]
{
"states_col" : "tokenize(states, '\\s', '')"
}
For more complex situations, you can split on any regular expression such as cases where values are wrapped with double quotes and might contain spaces. For this, you can use the expression (?=(([^\\'\"]*[\\'\"]){2})*[^\\'\"]*$) that leverages a zero-width positive lookahead assertion. Note that because this regex will be presented in a JSON document, certain characters must be escaped within the JSON string. For example, in JSON:
{
"my_column" : "tokenize(states, ' (?=(([^\\'\"]*[\\'\"]){2})*[^\\'\"]*$)', '')"
}
This would actually pass the following regular expression to the tokenize function when the backslash escape sequences are processed:
(?=(([^'"]*['"]){2})*[^'"]*$)
Similarly, note that double backslashes are required to provide escape characters in the regular expression. For example, '\\s' would actually pass the regular expression \s to the tokenize function.
Similar to splitting on a token, the LAT can easily join an array into a string with the join function. With source data that includes an array like the following:
{ "state_names": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"] }
You can use the join function to produce this string:
/* join takes a string delimiter and will produce "Alabama, Alaska, Arizona, Arkansas, California"*/
{
"state_names_col": "join(', ', state_names)"
}
LAT Overview
LAT Data Types in Loading
LAT Advanced Topics