
Interconnecting Doris with the Hive Data Source

By connecting to a Hive Metastore, or a metadata service compatible with the Hive Metastore, Doris can automatically obtain Hive database and table information and query the data in place. This avoids the complex manual mapping and data migration that would otherwise be required when there are many external data catalogs.
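
For example, once a catalog has been created (see Step 2 below), Hive tables can be queried from Doris directly. A minimal sketch, where hive_catalog, demo_db, and demo_table are hypothetical placeholder names:

    -- hive_catalog, demo_db, and demo_table are placeholders for your own names.
    select * from hive_catalog.demo_db.demo_table limit 10;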

Prerequisites

  • You have created an MRS cluster. For details, see "Getting Started" > "Purchasing and Using an MRS Cluster" in the MapReduce Service User Guide.
  • You have created a Doris cluster.
  • To ensure network connectivity, the MRS cluster must be in the same VPC and subnet as the Doris cluster, and all node IP addresses of the Doris cluster must be added to the security group of the MRS cluster. A quick connectivity check is sketched below.
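
For a quick sanity check, you can log in to a Doris node and probe an MRS node. A hedged sketch; the IP address is a placeholder, and 9083 is the default Hive Metastore port:

    # 192.0.2.10 is a placeholder for an MRS node IP; 9083 is the default Hive Metastore port.
    ping -c 3 192.0.2.10
    timeout 3 bash -c 'cat < /dev/null > /dev/tcp/192.0.2.10/9083' && echo "Metastore port reachable"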

Step 1: Creating an MRS Hive Connection

  1. Log in to the CloudTable management console.
  2. On the Cluster Management page, click the target cluster to go to the cluster details page.
  3. Choose Connection Management > Create. The Create page is displayed.
  4. Set the connection type. The following four combinations are supported:

    • Set Authentication Type to SIMPLE and disable OBS Storage.
    • Set Authentication Type to KERBEROS and disable OBS Storage.
    • Set Authentication Type to SIMPLE and enable OBS Storage.
    • Set Authentication Type to KERBEROS and enable OBS Storage.
    Table 1 Parameter description

    • Connection Name: Name of the connection. You can enter a name that indicates the type of the data source to be connected.
      NOTE: A connection name must start with a letter, be 4 to 64 characters long, and contain only letters, digits, and hyphens (-). Other special characters are not allowed.
    • Connector: External data catalog component supported by Doris. Currently, only Hive is available.
    • Hadoop Type: Hadoop component supported by Doris. Currently, only MRS is available.
    • Cluster Name: Name of the MRS cluster.
      NOTE: Doris can connect to only one user in a Kerberos cluster.
    • Manager IP: Floating IP address of MRS Manager. After you select the cluster to be connected, this field is filled in automatically.
    • Authentication Type: Authentication type of the MRS cluster. The value is KERBEROS for secure clusters and SIMPLE for common clusters. After you select the cluster to be connected, this field is filled in automatically.
      NOTE: Only one KERBEROS connection can be created.
    • Connector Version: Version of the data source component of the MRS cluster. Hive 3.x is supported, so the value can be 3.X.
    • Username: Username of the MRS cluster. The user must have sufficient permissions to access the underlying Hive and HDFS data. If you want to create a user, see .
    • Password: Password of the MRS user mrs_user_name.
    • OBS Storage: When this toggle is on, data tables associated with the created catalog connection are stored in OBS. When it is off, they are stored in HDFS.

  5. After setting the parameters, click Test. If the test fails, check whether the username and password are correct.
  6. After the test succeeds, click OK. The new connection is displayed on the Connection Management page.

Step 2: Running SQL Commands to Create a Hive Catalog

  1. Use the SSH login tool to remotely log in to the Linux ECS through the EIP.

    For details, see "Instance" > "Logging In to a Linux ECS" > "Logging In to an ECS Using an SSH Password" in the Elastic Cloud Server User Guide.

  2. Access the Doris cluster. For details, see Using the MySQL Client to Connect to a Common Doris Cluster.
  3. Create a Hive catalog.

    • Create a catalog with Authentication Type set to SIMPLE to access Hive data stored in HDFS.
      CREATE CATALOG hive_catalog_simple PROPERTIES (
      'type'='hms',
      'hive.metastore.uris'='thrift://192.x.x.x:port,thrift://192.x.x.x:port',
      'hive.metastore.sasl.enabled' = 'false',
      'dfs.nameservices'='hacluster',
      'dfs.ha.namenodes.hacluster'='3,4',
      'dfs.namenode.rpc-address.hacluster.3'='192.x.x.x:port',
      'dfs.namenode.rpc-address.hacluster.4'='192.x.x.x:port',
      'dfs.client.failover.proxy.provider.hacluster'='***',
      'hive.version' = '3.1.0'
      );
    • Create a catalog with Authentication Type set to KERBEROS to access Hive data stored in HDFS.
      CREATE CATALOG hive_catalog PROPERTIES (
      'type'='hms',
      'hive.metastore.uris' = 'thrift://192.x.x.x:port,thrift://192.x.x.x:port',
      'hive.metastore.sasl.enabled' = 'true',
      'hive.server2.thrift.sasl.qop'='auth-conf',
      'hive.server2.authentication' = 'KERBEROS',
      'hive.server2.authentication.kerberos.principal' = '***',
      'hive.metastore.kerberos.principal' = '***',
      'dfs.nameservices'='hacluster',
      'dfs.ha.namenodes.hacluster'='3,4',
      'dfs.namenode.rpc-address.hacluster.3'='192.x.x.x:port',
      'dfs.namenode.rpc-address.hacluster.4'='192.x.x.x:port',
      'dfs.client.failover.proxy.provider.hacluster'='***',
      'hadoop.security.authentication'='kerberos',
      'hadoop.kerberos.principal' = '{kinit_result}', -- Result of kinit {USER_NAME}
      'hive.version' = '3.1.0',
      'fs.defaultFS'='hdfs://hacluster',
      'hadoop.rpc.protection'='privacy'
      );
    • Create a catalog with Authentication Type set to SIMPLE to access Hive data stored in OBS.
      CREATE CATALOG hive_catalog_simple_obs PROPERTIES (
      'type'='hms',
      'hive.metastore.uris'='thrift://192.x.x.x:port,thrift://192.x.x.x:port',
      'obs.access_key' = '***',
      'obs.secret_key' = '***',
      'obs.endpoint' = '***',
      'obs.region' = '***',
      'hive.metastore.sasl.enabled' = 'false',
      'hive.version' = '3.1.0'
      );
    • Create a catalog with Authentication Type set to KERBEROS to access Hive data stored in OBS.
      CREATE CATALOG hive_catalog_obs PROPERTIES (
      'type'='hms',
      'hive.metastore.uris' = 'thrift://192.x.x.x:port,thrift://192.x.x.x:port',
      'hive.metastore.sasl.enabled' = 'true',
      'hive.server2.thrift.sasl.qop'='auth-conf',
      'hive.server2.authentication' = 'KERBEROS',
      'hive.server2.authentication.kerberos.principal' = '***',
      'hive.metastore.kerberos.principal' = '***',
      'hadoop.security.authentication'='kerberos',
      'hadoop.kerberos.principal' = 'USER_NAME',
      'hive.version' = '3.1.0',
      'fs.defaultFS'='hdfs://hacluster',
      'hadoop.rpc.protection'='privacy',
      'obs.access_key' = '***',
      'obs.secret_key' = '***',
      'obs.endpoint' = '***',
      'obs.region' = '***'
      );
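
      After running one of the statements above, you can confirm that the catalog was registered and re-sync its metadata on demand. A minimal check, using the catalog name from the examples above:

      SHOW CATALOGS;                 -- the new catalog should appear in the list
      REFRESH CATALOG hive_catalog;  -- re-syncs metadata from the Hive Metastore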

      The following table describes the related parameters and how to obtain their values.

      Table 2 Parameter description

      • type: Type of the external data source to be connected.
      • hive.metastore.uris: Hive metadata URI, which can be viewed in the hive-site.xml configuration file.
      • hive.metastore.sasl.enabled: Obtain the value from the hive-site.xml configuration file.
      • dfs.nameservices: Obtain the value from the hdfs-site.xml configuration file.
      • dfs.ha.namenodes.hacluster: Obtain the value from the hdfs-site.xml configuration file.
      • dfs.namenode.rpc-address.hacluster.3: Obtain the value from the hdfs-site.xml configuration file.
        NOTE: Obtain the IP address for this parameter on FusionInsight Manager. Log in to FusionInsight Manager, choose the target component, and click the corresponding instance.
      • dfs.namenode.rpc-address.hacluster.4: Obtain the value from the hdfs-site.xml configuration file.
        NOTE: Obtain the IP address for this parameter on FusionInsight Manager in the same way as above.
      • dfs.client.failover.proxy.provider.hacluster: Obtain the value from the hdfs-site.xml configuration file.
      • hive.version: Hive version.
      • hive.server2.thrift.sasl.qop: Obtain the value from the hive-site.xml configuration file.
      • hive.server2.authentication: Authentication type, which can be viewed on the Connection Management page.
      • hive.server2.authentication.kerberos.principal: Obtain the value from the hive-site.xml configuration file.
      • hive.metastore.kerberos.principal: Obtain the value from the hive-site.xml configuration file.
      • hadoop.kerberos.principal: Obtain the value in either of the following ways:
        • Method 1:
          1. Log in to the MRS Manager page of the cluster.
          2. Choose System > Permission > User to access the user page and obtain the username.
          3. Click Domain and Mutual Trust to access the Domain and Mutual Trust page and obtain the local domain.
          4. The parameter value is in the format Username+Local domain.
        • Method 2:
          1. Log in to the node where the MRS client is located as user root.
          2. Go to the client installation path:
            cd /opt/Bigdata/client/
          3. Load the environment variables:
            source bigdata_env
          4. Run the kinit command to authenticate the user and obtain the parameter value:
            kinit Username
      • fs.defaultFS: Obtain the value from the core-site.xml configuration file.
      • hadoop.rpc.protection: Obtain the value from the core-site.xml configuration file.
      • obs.access_key: Access key. For details about how to obtain it, see section "Access Keys" in the My Credentials User Guide.
        NOTE: obs.access_key is used to access the underlying data files.
      • obs.secret_key: Secret key. For details about how to obtain it, see section "Access Keys" in the My Credentials User Guide.
      • obs.endpoint: OBS address. Contact technical support to obtain the address.
      • obs.region: OBS region, which can be viewed on the OBS console.

      To view the configuration files referenced in Table 2:

      1. Log in to the node (Master1) where the MRS client is deployed as user root.
      2. Go to the client directory. This directory contains a folder for each MRS component; currently, only the HDFS and Hive folders are displayed.
        cd /opt/Bigdata/client/
        Note

        The HDFS folder contains the core-site.xml and hdfs-site.xml configuration files and the Hive folder contains the hive-site.xml configuration file.

      3. Obtain the parameter values from these configuration files.
        1. Check the HDFS component. Go to the hadoop configuration directory.
          cd ./HDFS/hadoop/etc/hadoop

          Check the hdfs-site.xml configuration file and find the related parameters.

          cat hdfs-site.xml
        2. View the hive-site.xml configuration file of the Hive component and search for related parameters.
          cat /opt/Bigdata/client/Hive/config/hive-site.xml
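
      Alternatively, the values needed for the CREATE CATALOG statement can be pulled out of the configuration files directly. A hedged sketch, assuming the default client installation path /opt/Bigdata/client used above (grep -A1 prints each matching <name> element together with the <value> line that follows it):

      cd /opt/Bigdata/client/HDFS/hadoop/etc/hadoop
      # Property names below match the catalog parameters in Table 2.
      grep -A1 'dfs.nameservices' hdfs-site.xml
      grep -A1 'dfs.ha.namenodes' hdfs-site.xml
      grep -A1 'dfs.namenode.rpc-address' hdfs-site.xml
      grep -A1 'hive.metastore.uris' /opt/Bigdata/client/Hive/config/hive-site.xml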

Step 3: Querying the Data Mapping Table

  1. View catalogs.

    show catalogs;

  2. Run the following command to query the databases in a catalog:

    show databases from catalog_name;

  3. Switch to the specified catalog.

    switch catalog_name;

  4. View the tables in a specified database.

    show tables from catalog_name.database_name;

    Query the specified table.

    select * from catalog_name.database_name.table_name;
    Note

    The use database; command is intended only for test environments with a small number of data tables. Use it with extreme caution in production clusters: in environments with many data tables, it loads the metadata of all tables, which can sharply increase the load on the Hive Metastore and degrade the stability and performance of the MRS production environment.
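
    Putting the steps together, a complete example session; hive_catalog, demo_db, and demo_table are hypothetical placeholder names:

    show catalogs;
    switch hive_catalog;
    show databases;
    show tables from hive_catalog.demo_db;
    select * from hive_catalog.demo_db.demo_table limit 10;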

FAQs

  • If the following error message is displayed, perform an active/standby switchover for the HDFS NameNodes on FusionInsight Manager.
    mysql> select * from hive_hdfs_test2;
    ERROR 1105 (HY000): errCode = 2, detailMessage = get file split failed for table: hive_hdfs_test2, err: org.apache.hadoop.ipc.RemoteException: Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:108)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2200)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:
  • When creating a catalog, you can use the file.meta.cache.ttl-second parameter to set the automatic expiration time (in seconds) of the Hive partition file cache, or set it to 0 to disable the cache. The following is an example.
    CREATE CATALOG hive_catalog PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://127.x.x.x:port',
    'obs.access_key' = '***',
    'obs.secret_key' = '***',
    'obs.endpoint' = '***',
    'obs.region' = '***',
    'file.meta.cache.ttl-second' = '60'
    );