When I try to create a table that is partitioned on a value (column) and the data contains a second field (column) with the same name, I receive an error.
A user can't have the same name for a schema definition field and the partition column.
When a user is creating a partitioned table, the field that is partitioned on must be given a different name. Otherwise, ODAS complains about duplicate field names. For instance:
- To partition on a column in the data AND on an s3 object key (directory name), one can't have the same name for the schema definition field and the partition column.
- Or if a parquet file is “col1, col2, col3, col4, col5” and the data is partitioned on col3, the partitioned statement has to do the “create table col1, col2, col3-donotusep, col4, col5 partitioned by col3…”
The cause of this limitation is that when a column name is specified in the "PARTITIONED BY" clause, that creates what the Hive docs call a pseudocolumn that can be queried over. If there is already a field with the same name, those two names collide, which must be prevented.
The Hive documentation suggests naming the colliding column in the schema "dtDontQuery" or something similar so that it is clear to the users that field could be ignored from the schema perspective.
The best way to clean this up from the users' perspective would be to create the table with the misnamed column and then create a view over that table that omits the misnamed column. Then, users only see usable column names and the partitioning works as desired.
Example data schema:
customerID int, customerName String, purchaseID int, purchaseAmount Float
Example table create statement:
CREATE TABLE purchaseHistory_internal
(customerIDDoNotUse int, customerName String, purchaseID int, purchaseAmount Float)
PARTITIONED BY (customerID int) STORED AS PARQUET;
Example View create statement:
CREATE VIEW purchaseHistory AS SELECT customerIDDoNotUse as customerID, customerName, purchaseID, purchaseAmount FROM