Prajwal S R on LinkedIn: Academy Accreditation - Azure Databricks Platform Architect • Prajwal S R… (2024)

Prajwal S R

Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


I am very glad to share that I have completed and received the Azure Databricks Platform Architect accreditation badge from the Databricks Academy. Databricks #azuredatabricks #platformarchitect

Academy Accreditation - Azure Databricks Platform Architect • Prajwal S R • Databricks Badges credentials.databricks.com


Tejaswini Paturi

Senior Manager, Agile Leadership, Product Vision, Strategic Planning, Operational Excellence and Customer Success

6d


Congratulations Prajwal S R


Smrithy C

Azure Developer@DATABEAT || Top DataEngineering Voice || Ex-Mindtree || Ex-Picktail || AZ-900/DP 900 /DP 600 Microsoft Certified

6d


Congrats! Prajwal S R


Padmaja Kuruba

Dr. Padmaja Kuruba

5d


Congrats!



More Relevant Posts

  • Prajwal S R


How can we add a new column to an existing DataFrame and populate it using the data already present in the other columns? For example, suppose we have employee details in a table with the columns F_Name, L_Name, Company and ID, and we want to add an Email column for all the employees. Consider the data below:

| F_Name  | L_Name    | Company | ID |
|---------|-----------|---------|----|
| Sachin  | Tendulkar | ABC     | 10 |
| Rahul   | Dravid    | BAC     | 19 |
| Virat   | Kohli     | XYZ     | 18 |
| Rohit   | Sharma    | ABC     | 45 |
| Jasprit | Bumrah    | BAC     | 93 |

To add a new Email column whose values are derived from the existing columns, we can use the concat function along with the df.withColumn() function available in PySpark. Below is an example code snippet:

from pyspark.sql.functions import lit, concat

df1 = df.withColumn("Email", concat("F_Name", lit("."), "ID", lit("@"), "Company", lit(".com")))

We have to import the lit and concat functions, and with the above command a new column is added to the DataFrame, with the email IDs populated from the data already available in it:

| F_Name  | L_Name    | Company | ID | Email              |
|---------|-----------|---------|----|--------------------|
| Sachin  | Tendulkar | ABC     | 10 | Sachin.10@ABC.com  |
| Rahul   | Dravid    | BAC     | 19 | Rahul.19@BAC.com   |
| Virat   | Kohli     | XYZ     | 18 | Virat.18@XYZ.com   |
| Rohit   | Sharma    | ABC     | 45 | Rohit.45@ABC.com   |
| Jasprit | Bumrah    | BAC     | 93 | Jasprit.93@BAC.com |

Please feel free to add any points about this in the comments, along with the methods you would use. A self-contained sketch follows. #azuredatabricks #dataengineer #databricks
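A minimal, runnable sketch of the approach above, assuming a Databricks/Spark environment; the SparkSession setup and sample data simply mirror the example table in the post.

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table in the post
data = [
    ("Sachin", "Tendulkar", "ABC", 10),
    ("Rahul", "Dravid", "BAC", 19),
    ("Virat", "Kohli", "XYZ", 18),
    ("Rohit", "Sharma", "ABC", 45),
    ("Jasprit", "Bumrah", "BAC", 93),
]
df = spark.createDataFrame(data, ["F_Name", "L_Name", "Company", "ID"])

# Build the Email column as <F_Name>.<ID>@<Company>.com
df1 = df.withColumn(
    "Email",
    concat("F_Name", lit("."), df["ID"].cast("string"), lit("@"), "Company", lit(".com")),
)
df1.show(truncate=False)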


  • Prajwal S R


In my previous post, I explained the different types of private endpoint sub-resources that we can create and their uses. In this post, we will discuss the different types of private endpoints we can create for Azure Databricks workspaces. Based on how the private endpoint is created, there are two types: the frontend endpoint and the backend endpoint. Even though there is no separate page or configuration for each of them, the type depends on which VNet we use to create the endpoint.

Frontend endpoint: This is the endpoint created for connections from users to the control plane. For example, the frontend endpoint ensures that requests from users (the Azure portal page, REST APIs, etc.) are connected securely. When we create a VNet-injected Databricks workspace, two subnets (private and public) are already created. Along with these two, we can create another subnet in the same VNet and use it to create a private endpoint. This is considered the frontend endpoint.

Backend endpoint: This is the endpoint created for connections between the data plane and the control plane, that is, from the workspace to the control plane. All cluster start-up requests, job run requests, etc. go through this private endpoint to connect securely. This endpoint can also be created to provide access to on-premises networks or other networks. For this, along with the VNet used to deploy the Databricks workspace, we can create another VNet, create a subnet in it, and use that subnet to create the endpoint. This is considered the backend endpoint. This VNet can be peered with the on-premises network or with other networks that should be allowed to connect securely.

Please feel free to add any points I may have missed. #azuredatabricks #privateendpoint #dataengineer #networking


  • Prajwal S R


When we get data in raw format, there is often a need to clean it and get it into the desired shape. It is important to find the null values and remove duplicate records, which also reduces the number of records fetched while querying the table. Below are sample queries in SQL and PySpark to find null records and to remove duplicate records.

Finding null values:

SELECT count_if(email IS NULL) FROM users;
SELECT count(*) FROM users WHERE email IS NULL;

from pyspark.sql.functions import col
usersDF = spark.read.table("users")
usersDF.selectExpr("count_if(email IS NULL)")
usersDF.where(col("email").isNull()).count()

Removing duplicate records:

CREATE OR REPLACE TEMP VIEW sample AS
SELECT user_id, timestamp, max(email) AS email_id, max(updated) AS max_updated
FROM users
WHERE user_id IS NOT NULL
GROUP BY user_id, timestamp;

SELECT count(*) FROM sample;

from pyspark.sql.functions import max
sampleDF = (usersDF
    .where(col("user_id").isNotNull())
    .groupBy("user_id", "timestamp")
    .agg(max("email").alias("email_id"),
         max("updated").alias("max_updated")))
sampleDF.count()

Let me know in the comments about the methods you have used to find null and duplicate records. A short, self-contained sketch with sample data follows. #azuredatabricks #sql #pyspark #dataengineer
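A minimal, self-contained sketch of the same checks on a tiny in-memory DataFrame; the sample data and column names are made up for illustration, and dropDuplicates is shown as one more common alternative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as max_

spark = SparkSession.builder.getOrCreate()

usersDF = spark.createDataFrame(
    [
        (1, "a@x.com", "2024-01-01"),
        (1, "a@x.com", "2024-02-01"),   # duplicate user_id
        (2, None, "2024-01-15"),        # null email
        (None, "c@x.com", "2024-01-20"),
    ],
    ["user_id", "email", "updated"],
)

# Count rows with a null email
print(usersDF.where(col("email").isNull()).count())

# Deduplicate: keep one row per non-null user_id, taking the max email/updated
dedupDF = (usersDF
    .where(col("user_id").isNotNull())
    .groupBy("user_id")
    .agg(max_("email").alias("email_id"), max_("updated").alias("max_updated")))

# Alternative: dropDuplicates keeps an arbitrary row per user_id
altDF = usersDF.where(col("user_id").isNotNull()).dropDuplicates(["user_id"])

dedupDF.show()
altDF.show()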


  • Prajwal S R


Once we start using Unity Catalog in our Azure Databricks account, we come across the different types of tables we can create: managed tables and external tables. It is important to know the difference between these types.

1. Managed tables: These tables are saved in the managed storage location, which is the location we provided while creating the metastore.

2. External tables: These tables are saved in an external location that we have registered. It can be either the exact external location or a nested folder under it.

In both cases, the metadata is stored in Unity Catalog, and there is no difference in the access permissions. When we drop a managed table, both the data and the metadata are deleted. With external tables, only the metadata in the workspace is deleted, and the data remains available in the external location.

Which type of table have you created, and which do you find better? Let me know your thoughts in the comments. A small SQL sketch of both is shown below. #azuredatabricks #databricks #tables #dataengineering
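A small sketch of both table types, assuming a Unity Catalog-enabled Databricks notebook where spark is predefined; the catalog, schema and storage path names (main.demo, the abfss URL) are placeholders, not real resources.

# Managed table: data lives in the metastore's managed storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.demo.employees_managed (
        emp_id INT,
        name   STRING
    )
""")

# External table: data lives in an external location we registered (or a folder under it).
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.demo.employees_external (
        emp_id INT,
        name   STRING
    )
    LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/tables/employees'
""")

# Dropping the managed table removes data and metadata; dropping the external table
# removes only the metadata, while the files stay at the LOCATION path.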


  • Prajwal S R


One of the recent features added to the Azure Databricks service is the ability to create a Private Link connection for workspaces, ensuring that the connection is secure and goes only through the approved network. We can also connect securely from our on-premises networks using a transit VNet.

Private Link is a feature where we create a private endpoint for a resource; a private IP is assigned to the endpoint and is used for all connections. We can create private endpoints for a wide range of resources, and the sub-resource types available on the endpoint vary depending on the service we are creating it for. For example, we can create a private endpoint for ADLS with sub-resource types such as dfs, blob and file. Similarly, there are two types we can select for Azure Databricks workspaces: databricks_ui_api and browser_authentication.

1. browser_authentication: This endpoint type can be selected when we have multiple workspaces in the same region, where we can create one endpoint per region. Once it is created and connected to the network, all the authentication (SSO) requests for all the workspaces in that region go through this endpoint.

2. databricks_ui_api: This is the endpoint used to connect to the Databricks control plane, and also for connections to the other Azure resources. Each workspace must have a separate endpoint of this type. The network traffic for a Private Link connection between a transit VNet and the workspace control plane always traverses the Microsoft backbone network.

Please feel free to add any points I may have missed. I will post later about the different types of private endpoints we can create for Azure Databricks workspaces. #azuredatabricks #networking #privatelink


  • Prajwal S R


How can we access an ADLS resource from Azure Databricks workspaces, and which method is better to use?

If we are not using Unity Catalog in our environment, we can still connect to ADLS from the workspace using different methods. First of all, there are three main types of authentication available:

1. Service principal authentication (also called the OAuth method).
2. SAS key method.
3. Account key method.

All of the above methods have their own advantages and disadvantages. Bearing in mind the effort of managing keys and key rotation, many teams go for service principal (SPN) authentication. Even this method involves creating a client secret with an expiry date: when the secret expires, we must generate a new one and update it in the Spark config commands or in the secrets stored in Key Vault.

Once we have decided which authentication method to use, there are two access methods we can use with any of the above three authentication types:

1. Mounting method.
2. Direct access method.

Even though the mounting method is used by most users, it is not the recommended method, as it has been deprecated by the Databricks team. There are several reasons for deprecating it, such as:

A. A mount point created from one cluster can be accessed from any other cluster if the user knows the mount point name.
B. A mount point can also be deleted by any user who knows the mount point name and has access to any cluster.

So it is recommended to use the direct access method, where we avoid creating a mount point. As we will be using Spark config commands to access ADLS, we can use the points below to ensure the credentials are not visible to everyone (a config sketch follows this list):

1. Store the credentials in Key Vault and access them using the dbutils secrets command.
2. Use notebook ACLs to give access to a limited set of people.
3. Pass the Spark configs through the Advanced options tab on the cluster and enable cluster ACLs.

Feel free to add more points on this. #azuredatabricks #dataengineer #adls #spark
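A minimal sketch of the direct access method with a service principal (OAuth), assuming a Databricks notebook where spark and dbutils are predefined; the secret scope name, secret keys, storage account and container names are placeholders.

# Direct access to ADLS Gen2 with a service principal, no mount point.
storage_account = "mystorageaccount"          # placeholder storage account name
tenant_id     = dbutils.secrets.get("adb-secrets", "tenant-id")
client_id     = dbutils.secrets.get("adb-secrets", "sp-client-id")
client_secret = dbutils.secrets.get("adb-secrets", "sp-client-secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Read directly from the container path instead of a mount point
df = spark.read.parquet(f"abfss://raw@{storage_account}.dfs.core.windows.net/employees/")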


  • Prajwal S R


To calculate the Databricks usage cost, here is the formula:

Total cost for the Databricks service = VM cost + DBU cost
VM cost  = [Total Hours] x [No. of Instances] x [Linux VM Price]
DBU cost = [Total Hours] x [No. of Instances] x [DBUs per node] x [DBU Price per hour - Standard / Premium tier]

Here is an example of how Azure Databricks billing works. Depending on the type of workload your cluster runs, you are charged for either the Jobs Compute or the All-Purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you are charged for the Jobs Compute workload. If your cluster runs interactive features such as ad-hoc commands, you are billed for the All-Purpose Compute workload.

If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing for an All-Purpose Compute workload would be:
VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
DBU cost for All-Purpose Compute: 100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100
Total cost: $598 (VM cost) + $1,100 (DBU cost) = $1,698

For the same cluster running a Jobs Compute workload:
VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
DBU cost for Jobs Compute: 100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600
Total cost: $598 (VM cost) + $600 (DBU cost) = $1,198

For the same cluster running a Jobs Light Compute workload:
VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
DBU cost for Jobs Light Compute: 100 hours x 10 instances x 2 DBU per node x $0.22/DBU = $440
Total cost: $598 (VM cost) + $440 (DBU cost) = $1,038

In addition to the VM and DBU charges, you may also be charged for bandwidth, managed disks, and storage. A small calculation sketch follows. #databricks #azuredatabricks #dataengineer #cost
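A small sketch of the formula above as a Python helper; the prices and DBU rates are just the example figures from this post, not authoritative pricing.

def databricks_cost(hours, instances, vm_price_per_hour, dbus_per_node, dbu_price):
    """Total cost = VM cost + DBU cost."""
    vm_cost = hours * instances * vm_price_per_hour
    dbu_cost = hours * instances * dbus_per_node * dbu_price
    return vm_cost + dbu_cost

# All-Purpose Compute example: 100 h, 10 x DS13v2, $0.598/h, 2 DBU/node, $0.55/DBU
print(databricks_cost(100, 10, 0.598, 2, 0.55))  # 1698.0
# Jobs Compute example at $0.30/DBU
print(databricks_cost(100, 10, 0.598, 2, 0.30))  # 1198.0
# Jobs Light Compute example at $0.22/DBU
print(databricks_cost(100, 10, 0.598, 2, 0.22))  # 1038.0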


  • Prajwal S R


    I’m happy to share that I’m starting a new position as Consultant at Capgemini!

