Technical: How to create a table and load data in Databricks from a file (CSV or any other structured format) (Approach 1)

Step 1: Log in to your Databricks account.

URL: https://community.cloud.databricks.com/login.html

Step 2: You will see the “Welcome to Databricks” screen.

Step 3: Create a cluster if one is not already running.

(Please see the blog post “How to create a new Cluster in Databricks?” for further details.)

Step 4: Screenshot of sample Employee CSV file.

Step 5: Click “Data”, then click the “Add Data” button.

Step 6: Click the “Browse” link and choose the CSV file.

Step 7: You will see the screen below with the selected CSV file.

Step 8: Click the “Create Table in Notebook” button on the screen above. The notebook window below opens.

Step 9: Because the first row of the file is a header, set first_row_is_header = "true" in the cell, as shown below.

Step 10: Click “Run Cell” to load the data into Databricks.

Key points in the screen above are:

The first row is a header

The column delimiter is a comma
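
The cell edited in Steps 9 and 10 can be sketched roughly as follows. This is a minimal sketch, not the exact generated notebook code: the helper name load_employee_csv, the file path in the comment, and the inferSchema setting are assumptions, while the header and delimiter settings come from the key points above. The spark argument is the SparkSession that a Databricks notebook normally provides.

```python
# CSV options matching the key points above.
CSV_OPTIONS = {
    "inferSchema": "false",  # assumption: every column is read as a string
    "header": "true",        # the first row is a header
    "sep": ",",              # the column delimiter is a comma
}


def load_employee_csv(spark, file_location):
    """Read the uploaded CSV file into a Spark DataFrame.

    `spark` is the notebook's SparkSession; `file_location` is the path
    of the uploaded file (hypothetical in this sketch).
    """
    return spark.read.format("csv").options(**CSV_OPTIONS).load(file_location)


# In a notebook cell (hypothetical path):
# df = load_employee_csv(spark, "/FileStore/tables/EmployeeTable_Sample.csv")
# display(df)
```

With inferSchema left at "false", numeric columns arrive as strings, so setting it to "true" may be preferable if the columns will be used in calculations.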

Step 11: You will see the screen below with a list of records. The key point is that a Spark job is executed to load the data into the Databricks environment.

Step 12: Run the step below. It creates a temporary table from the CSV file.
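
As a rough illustration of this step, a temporary table can be registered from a DataFrame with createOrReplaceTempView. The helper name register_temp_view is hypothetical, and df is assumed to be the DataFrame loaded from the CSV file; the view name is the temp table name used in this post.

```python
def register_temp_view(df, temp_table_name="EmployeeTable_Sample_csv"):
    """Register the DataFrame as a temporary view so SQL cells can query it.

    The view only lives for the current Spark session; nothing is written
    to storage at this point.
    """
    df.createOrReplaceTempView(temp_table_name)
    return temp_table_name


# In a notebook cell, assuming `df` holds the loaded CSV data:
# register_temp_view(df)
```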

Step 13: In this step, you switch to “sql” mode and list all records from the temp table (EmployeeTable_Sample_csv).

Step 14: Create a permanent table. The format of this table is “parquet”.

Step 15: I changed the permanent table name to “EmployeeTable”. Please see the two screenshots below for reference.
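
Steps 14 and 15 might look roughly like this in a Python cell. It is a sketch under the assumption that df is the DataFrame loaded earlier; the helper name save_as_parquet_table is hypothetical, while the table name and the parquet format come from the steps above.

```python
def save_as_parquet_table(df, permanent_table_name="EmployeeTable"):
    """Persist the DataFrame as a permanent table stored in parquet format.

    Unlike a temporary view, this table remains available after the
    cluster restarts.
    """
    df.write.format("parquet").saveAsTable(permanent_table_name)
    return permanent_table_name


# In a notebook cell, assuming `df` holds the loaded CSV data:
# save_as_parquet_table(df)
```

Note that saveAsTable uses its default save mode here, so running the cell a second time with the same table name is expected to fail unless a mode such as overwrite is set.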

Step 16: You can use ordinary SQL queries to compute sums, averages and other aggregate functions.
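
For example, an aggregate query over the permanent table could be built as below. The column name Salary is an assumption about the sample file, and salary_summary_query is a hypothetical helper; the resulting string can be passed to spark.sql in a Python cell or pasted into a “sql” cell.

```python
def salary_summary_query(table_name="EmployeeTable", column="Salary"):
    """Build a query that returns the sum and the average of one column."""
    return (
        f"SELECT SUM({column}) AS TotalSalary, "
        f"AVG({column}) AS AverageSalary "
        f"FROM {table_name}"
    )


# In a notebook cell:
# display(spark.sql(salary_summary_query()))
```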

Step 17: The query below extracts employees whose tax is greater than or equal to 10% of their salary.
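
Since the screenshot of the query is not reproduced here, the following is only a guess at its shape. It assumes the table has columns named Tax and Salary, and high_tax_query is a hypothetical helper name.

```python
def high_tax_query(table_name="EmployeeTable"):
    """Build a query for employees whose tax is at least 10% of their salary."""
    return f"SELECT * FROM {table_name} WHERE Tax >= 0.10 * Salary"


# In a notebook cell:
# display(spark.sql(high_tax_query()))
```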

Step 18: Without writing code, you can group the number of employees by their vacation days. Click Graph => Bar Chart => Plot options, set “Series Grouping” to “NoofDaysVacation”, select “EmpNo” in “Values”, and choose “count” in “Aggregation”. Click the “Apply” button to view the plot in the notebook.
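
For comparison, the same grouping can be expressed as a query. This sketch uses the NoofDaysVacation and EmpNo columns named in the plot options; the helper name vacation_count_query and the EmployeeCount alias are hypothetical.

```python
def vacation_count_query(table_name="EmployeeTable"):
    """Build a query that counts employees per number of vacation days."""
    return (
        "SELECT NoofDaysVacation, COUNT(EmpNo) AS EmployeeCount "
        f"FROM {table_name} "
        "GROUP BY NoofDaysVacation "
        "ORDER BY NoofDaysVacation"
    )


# In a notebook cell:
# display(spark.sql(vacation_count_query()))
```

The result set has one row per distinct vacation value, which should correspond to the bars in the chart.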

Thursday, December 27, 2018