Connecting to big data Hadoop clusters from Excel is now relatively simple. But those who have tried would have noticed multiple ways to achieve what seems to be the same thing. This post aims to clear these muddy waters a bit.
I am using Microsoft Office 365 Pro Plus (Excel 2013) updated to 15.0.4719.1002
Power Query updated to 2.22.4007.242
All tests performed show relative and average results only and should not be taken as authoritative performance tests, but rather as observations, averaged across multiple tests.
Areas from where you can access HDInsight clusters
There are 3 areas from where you can access big data in HDInsight clusters from within Excel, with the aim of integrating it into your BI solution:
The first is the ‘old school’ Data tab:
If you use this ‘old school’ area:
- The variety of data source options are more limited, but it does include Hadoop by way of an ODBC connection.
- It is a more convoluted path via ODBC to Hadoop clusters (specifically compared the Power Query method).
- You will be limited to the row and column limitation in Excel (1048576 rows by 16384 columns).
The second is to use the data connection functionality inside Power Pivot:
- The variety of data source options are more plentiful when compared to the Data tab ODBC connection, and it also includes Hadoop by way of an ODBC connection.
- It is still a more convoluted path via ODBC to Hadoop clusters (specifically compared the Power Query method).
- You are not subject to the limitations posed by Excel.
The third is to use the Power Query functionality – it is this area which will be discussed in more detail in this post:
- You have the widest variety of data source options available via Power Query, including three methods for accessing Hadoop clusters (discussed below). New data source options are constantly added.
- Although it provides two methods for accessing Hadoop clusters beyond ODBC, it is important to understand the limitations of each (discussed below).
- You are not subject to the limitations imposed by Excel.
As you know HD Insights is a Hadoop cluster of compute nodes that provides the MapReduce compute power over data stored in HDFS which is not co-located with the compute nodes. By separating storage from compute, HD Insight allows for data stored in containers to be accessed through HDFS APIs or through Azure Blobs Storage REST APIs. This means that the data exists beyond the life of the cluster so that compute power can be achieved when required, yet data remains intact and available when the cluster no longer exists. This obviously holds a cost benefit as compute clusters are much more costly than storage. There are other benefits not discussed here as the focus here is less on HD Insight and more on its accessibility through Excel as a BI tool.
Power Query provides 3 methods for accessing the data stored in the HDFS. Two via the “From Azure” menu, and one via the “From Other Sources: menu.
In the From Azure menu, you have both “From Azure HD Insight” as well as “From Azure Blob Storage” methods available (as shown in the diagram above):
The Test 1 chart below shows relative and average results from multiple tests that accessed the same data set and then comparing the two methods against each other.
I see no substantial performance difference between the two methods both when the cluster is running, and after it has been deleted. Also note that the queries generated by the two methods are very similar.
From Azure Blob Storage – query
From Azure HD Insight – query
- Using From Azure Blob Storage simply means that the storage account is accessed directly and you do not leverage the compute power of the VM’s nor HIVE’s MapReduce available via the cluster.
- You would have thought that opting for the From Azure HD Insight method means that you can leverage the compute power of the VM’s and HIVE’s MapReduce available via the cluster whilst the cluster runs, but tests (see the test 1 chart) do not show the substantial improvements expected with HIVE’s MapReduce. The results of Azure HD Insight also remains largely unchanged irrespective of the existence of the HDInsight cluster.
- This method (as is the case with From Azure Blob Storage) is therefore limited to accessing blobs in the container and it provides no no way to access HIVE tables so it therefore renders the MapReduceprogramming model and its parallel, distributed compute power superfluous via this connection method. How then do we leverage the power provided by MapReduce? Read on…
From Other Sources:
This menu provides the third method for accessing HDInsight and that is through an ODBC connection after installing the Microsoft HIVE ODBC connection driver and setting up the appropriate System DSN. Set up the initial ODBC connection only after the cluster has been provisioned. The ODBC method will therefore only be applicable whilst the cluster runs, yet it is not necessary to reconfigure the ODBC connection if a deleted cluster is recreated, as long as the name (and therefore the OBDC host address), user and password is created the same as before.
Retesting shows a marked improvement in relative and average results when accessing the HIVE table version created over the blob that was tested previously in Test 1 via the ODBC method.
Also note that the query generated compared to the previous two methods is notably different.
Given that there are essentially 5 ways to access data associated with the Hadoop cluster and the various limitations listed above, how do you decide which method is the most appropriate? I would suggest decide base on your particular use case:
- If you do not have Power Pivot and Power Query enabled in your Excel deployment, and your number of records do not exceed Excel’s record limitation, then you can use the ‘old school’ ODBC DSN connection.
- If you have Power Pivot and Power Query enabled in your Excel deployment and you require uninterrupted access to your data in the HDFS without the need for the compute power brought by Hive’s MapReduce, then consider either of the two connection options in the From Azure menu of Power Query.
- If you do have Power Pivot and Power Query enabled in your Excel deployment, require ad hoc/ intermittent access to your data in the HDFS and support of the compute power brought by Hive’s MapReduce (for example during a monthly reporting cycle only), then consider the From ODBC connection option in the From Other Sources menu in Power Query, and a consistent and pre-defined configuration of the cluster each time you enter a period where data is to be accessed supported by superior compute power.
- If you do have Power Pivot and Power Query enabled in your Excel deployment, require uninterrupted access to your data in the HDFS supported by the compute power brought by Hive’s MapReduce, the then consider the From ODBC connection option in the From Other Sources menu in Power Query and a constantly running cluster. But please be aware of the cost implications of this.