How-To Guides
Integration Guides
Databricks Integration

Databricks Unity Catalog Volume

You can configure the Databricks Unity Catalog Volume sync job in the Litmus Edge WebUI to send files directly to the Databricks Unity Catalog (supports AWS, Azure, and GCP).

Before You Begin

Make sure you have the following:

- Access to the Litmus Edge WebUI. See Access the Litmus Edge Web UI.
- Access to your Databricks workspace URL and an access token.

Step 1: Add Device

Follow the steps to Connect a Device. The device will be used to store tags that will eventually be used to create outbound topics in the connector. Make sure to select the Enable Data Store checkbox.

Step 2: Add Tags

After connecting the device in Litmus Edge, you can Add Tags to the device. Create the tags that you want to use to create outbound topics for the connector.

Step 3: Add Cloud Sync Job

To add the cloud storage sync job:

1. Navigate to Integration > Object.
2. Click Add Sync Job. The Add Cloud Sync Job dialog box displays.
3. In the Add Cloud Sync Job dialog box, enter the following details:
   - Name: Enter a name for the cloud sync job.
   - Provider: Select the Databricks Unity Catalog Volume provider from the drop-down list.

Step 4: Configure Databricks Unity Catalog Volume

Note: It is recommended to review the Run your first ETL workload on Databricks | Databricks on AWS guide before starting this section.

To configure the Databricks Unity Catalog Volume, enter the following details in the Add Cloud Sync Job dialog box:

- Name: Enter a friendly, user-defined name.
- Workspace URL: Enter the URL of your Databricks workspace. See Get identifiers for workspace objects | Databricks on AWS for more details.
- Access Token: Copy and paste the access token from your Databricks account. See Databricks SQL Driver for Go | Databricks on AWS for more details.
- Source: Enter the source path from which the files will be copied.
- Destination: Enter the path of the remote destination. For this scenario, it is the path of your Unity Catalog volume on Databricks, which must be created before setting up Litmus Edge. See Create and work with volumes | Databricks on AWS for more details.
- Transfer Mode: Select Copy to ensure files are copied from the source to the destination.

Click Save.

Note: To generate CSV, JSON, or Parquet files for syncing with the Databricks Unity Catalog, you can use the File Reading Processor in Litmus Edge.

Step 5: Enable the Cloud Storage Sync

Click the toggle button to enable the storage sync job and start transferring files from the source to the destination. Once a successful connection is established, the status changes from Transferring to Connected.

Step 6: Confirm Transfer Completion

To verify the files in the Databricks Unity Catalog volume:

1. Go to your Databricks Unity Catalog volume workspace.
2. Refresh the page to see the newly uploaded files.
3. Confirm that the test CSV has been uploaded successfully.
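If you prefer to confirm the transfer from a Databricks notebook rather than the workspace UI, the following minimal sketch lists the volume contents and previews the uploaded CSV. It assumes a Databricks notebook context (where spark, dbutils, and display are available); the catalog, schema, volume, and file name placeholders are assumptions to replace with your own values.

```python
# Minimal sketch for verifying the transfer from a Databricks notebook.
# The volume path and file name below are placeholders; substitute the
# Unity Catalog volume and test file used in your sync job.
volume_path = "/Volumes/<catalog>/<schema>/<volume>"

# List the files that Litmus Edge copied into the volume.
display(dbutils.fs.ls(volume_path))

# Preview the uploaded CSV to confirm its contents.
df = spark.read.option("header", "true").csv(f"{volume_path}/<test-file>.csv")
display(df)
```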
path>" @dlt table(table properties={"\<key>" "\<value>", "\<key>" "\<value>"}) def \<dlt name>() return ( spark readstream format('cloudfiles') option('cloudfiles format', '\<file format>') load(f'{file path}') ) 3\ after defining the variables, create and publish a pipeline see create and publish a pipeline https //docs databricks com/en/ingestion/onboard data html#step 4 create and publish a pipeline guide for detailed steps 4\ schedule the pipeline to run at desired intervals see schedule the pipeline https //docs databricks com/en/ingestion/onboard data html#step 5 schedule the pipeline guide for detailed steps 5\ monitor the pipeline to ensure data is being processed as expected you can query the table created in the unity catalog volume to confirm that the data is correctly ingested