LAT Advanced Loading and Transformations

data pipelines are now the preferred method for loading data into the ocient system for details, see docid\ zncvnrhsf6fg1yvqk6mxt this guide presents examples of some more advanced transformation topics such as exploding data, loading multiple tables from a single topic, working with arrays, and accessing nested data loading multiple tables from a topic the docid\ lnba3gj5kjzbjlsp3nowt guide shows an example of loading each topic into its own table however, each topic can route data to multiple tables and many {{kafka}} topics can stream load simultaneously to one or more tables in a single pipeline here is an example pipeline json file that will load data into two separate tables { "version" 2, "pipeline id" "pipeline metabase", "source" { "type" "kafka", "kafka" { "bootstrap servers" "127 0 0 1 9092", } }, "transform" { "topics" { "products" { "tables" { "metabase public products" { "columns" { "id" "id", "ean" "ean", "title" "title", "category" "category", "vendor" "vendor", "price" "price", "rating" "rating", "created at" "to timestamp(created at, 'yyyy mm dd\\\\'t\\\\'hh\ mm\ ss\[ sss]x')" } }, "metabase public product ratings" { "columns" { "created at" "now()", "product id" "id", "rating" "rating", } } } } } } } a few key points here you are loading data from a kafka topic called products that topic is loading into a table called database schema table most columns are simply mapped from the source json document into , however some are transformed from the original data created at is parsed from a string using the to timestamp function created at for product ratings is set dynamically to the value of the now() function loading nested json data as you can see from the transformation, the syntax is similar to sql, but the inner language used to traverse the json document and manipulate arrays can be unfamiliar the approach used to select a property from the source json is to follow the tree with dot notation for example given a series of two documents like { "country" "us", "states" \["ar", "va", "md", "il"], "languages" \[ { "name" "english", "locale" "en us", "dialects" \["northern american english", "southeast super regional english", "western american english"] }, { "name" "spanish", "locale" "es us", "dialects" \["mexican spanish", "caribbean spanish", "central american spanish", "south american spanish", "colonial spanish"] } ], "capital" { "city" "washington", "region" "district of columbia" }, "founded in" 1776 } { "country" "ca", "provinces" \["bc", "on", "qc", "mb"], "languages" \[ { "name" "english", "locale" "en ca" }, { "name" "french", "locale" "fr ca" } ], "capital" { "city" "ottawa", "region" "ontario" }, "founded in" 1867 } you can access a property like the country simply by referencing the property / ocient transformations are in this format / { "column name" "transformation" } / to apply the country property to the country code column / { "country code" "country" } you can access nested properties easily as well / sets the column capital city to values like "ottawa" / { "capital city" "capital city" } if you want to join the capital sub properties together, you can easily do so / sets the column capital city to values like "ottawa, ontario" / { "capital city" "concat(capital city, ',', capital region)" } note that these two documents are not fully cleansed, and you want to capture the states or provinces array and assign it to a common destination column, regions you can do this with a coalesce operator || / sets the column regions to \["ar", "va", "md", "il"] for the us and \["bc", "on", "qc", "mb"] for canada / { "regions" "states || provinces" } loading array data where arrays are encountered, special array notation is used the benefit of this syntax is that properties in objects in arrays can also be accessed to return an array another useful capability is the flattening of arrays in the event that the values returned from a transformation includes nested arrays, you can flatten it to a single dimension array by appending \[] to the end of the transformation expression for example / for the us, this transformation would return a structure like \[ \["", ""], \["", ""] ] / { "regions" "languages dialects" } / you can flatten this to return this one dimensional array with structure like \[ "", "", "", "" ] / { "regions" "languages dialects\[]" } exploding array data in some cases, a single record from a loading source should represent more than one row in the target table this is referred to as exploding the data this allows nested array data to be expanded and to be associated with the corresponding values from the record for example, if you would like to "explode" the data in the languages array into a single row per value to populate a standalone "country languages" table, you could do the following for the transform section of your pipeline json file { "transform" { "tables" { "my database my schema country languages" { "columns" { "country name" "country", "language name" "explode(languages\[] name)", "dialects" "explode(languages\[] dialects)", "locale" "explode(languages\[] locale)" } } } } } this would result in the following output note that instead of the original 2 records, this result will produce four rows into the new database table the "exploded" values are unique, but the records inherit their parent values like country from the record when the explode occurs example rows { "country" "us", "language name" "english", "dialects" \["northern american english", "southeast super regional english", "western american english"], "locale" "en us" } { "country" "us", "language name" "spanish", "dialects" \["mexican spanish", "caribbean spanish", "central american spanish", "south american spanish", "colonial spanish"], "locale" "es us" } { "country" "ca", "language name" "english", "dialects" null, "locale" "en ca" } { "country" "ca", "language name" "french", "dialects" null, "locale" "fr ca" } if multiple columns in the transformation include an explode , they results will be "zipped" together if the arrays are different lengths, the number of rows will match the longest array the shorter array will populate the end of its rows with null this is sometimes referred to as an "outer explode " transforming and joining string data one common requirement is to split a string on a token to make an array this is accomplished using transformation functions in the lat assuming a data input with a space separated string { "states" "al ak az ar ca co ct de" } you can extract an array of state values in a few ways using the tokenize function tokenize has a number of flags to control how the pattern should be interpreted for example 'q' indicates the pattern is a literal and 'i' indicates case insensitive regular expression / you can use a literal space to split and load \["al", "ak", "az", "ar", "ca", "co", "ct", "de"] / { "states col" "tokenize(states, ' ', 'q')" } / you can use a regular expression for whitespace to split and load \["al", "ak", "az", "ar", "ca", "co", "ct", "de"] { "states col" "tokenize(states, '\\\s', '')" } for more complex situations, you can split on any regular expression such as cases where values are wrapped with double quotes and might contain spaces for this, you can use the expression (?=((\[^\\\\'\\"] \[\\\\'\\"]){2}) \[^\\\\'\\"] $) that leverages a zero width positive lookahead assertion note that because this regex will be presented in a json document, certain characters must be escaped within the json string for example, in json { "my column" "tokenize(states, ' (?=((\[^\\\\'\\"] \[\\\\'\\"]){2}) \[^\\\\'\\"] $)', '')" } this would actually pass the following regular expression to the tokenize function when the backslash escape sequences are processed (?=((\[^'"] \['"]){2}) \[^'"] $) similarly, note that double backslashes are required to provide escape characters in the regular expression for example, '\\\s' would actually pass the regular expression \s to the tokenize function similar to splitting on a token, the lat can easily join an array into a string with the join function with source data that includes an array like the following { "state names" \["alabama", "alaska", "arizona", "arkansas", "california"] } you can use the join function to produce this string / join takes a string delimiter and will produce "alabama, alaska, arizona, arkansas, california" / { "state names col" "join(', ', state names)" } related links docid\ pkkbn6a iuheai7nus2fk docid\ idbsoeuf1iaaszw s1j1k docid 8uxqvvqf bjb47xxr1y14