Tag Archives: amazon

List paths to external s3 partitions in Hive external table

bigdatamarkIf you have external Apache Hive tables with partitions stored in Amazon S3, the easiest way to list the S3 file paths is to query the MySQL hive metastore directly.

When running a Hive query against our Amazon S3 backed table, I encountered this error:

java.lang.IllegalArgumentException: Can not create a Path from an empty string

It was suggested to me that perhaps one of the files in S3 had been moved or deleted. After saying “Crap!”, I decided that I needed to get a list of all of the hive table partition locations so that I can see which ones, if any were no longer there.

Using the DESCRIBE FORMATTED HQL command, one can get the Location value of a single partition, but this table had over 50,000 partitions, so that meant running over 50,000 queries and then parsing the results. This wasn’t going to do.

Instead, I found where in the MySQL Hive metastore these paths were stored and exported them in order to pipe into an aws s3 ls command.

Here is the query that I used to get the data that I needed out of a MySQL metastore database:

SELECT SDS.LOCATION
FROM TBLS
INNER JOIN PARTITIONS
ON(TBLS.TBL_ID=PARTITIONS.TBL_ID)
INNER JOIN SDS
ON(PARTITIONS.SD_ID=SDS.SD_ID)
WHERE TBLS.TBL_NAME='table_name'
INTO OUTFILE '/tmp/s3paths.csv';

Note that in my install of CDH, the default hive metastore user is hiveuser and the default password is password. However, by default, the hiveuser won’t have access to write to a file. You may need to grant it permissions or just use the MySQL root user.

mysql --user=root -p
use metastore;