About Doris Multi-Source Data
Doris supports access to Hive external tables through CREATE CATALOG. By connecting to Hive Metastore, or to a metadata service compatible with Hive Metastore, Doris can automatically obtain Hive database and table information and query the data. This avoids complex manual mapping and data migration when there are a large number of conventional external data catalogs.
Multi-Catalog is designed to make it easier to connect to external data catalogs to enhance Doris's data lake analysis and federated data query capabilities.
In older versions of Doris, user data was organized in a two-tiered structure: database and table. Connections to external catalogs could therefore be made only at the database or table level. For example, you could create a mapping to a table in an external catalog via CREATE EXTERNAL TABLE, or to a database via CREATE EXTERNAL DATABASE. If the external catalog contained a large number of databases or tables, you had to create the mappings one by one, which could be a heavy workload.
With Multi-Catalog, Doris has a three-tiered metadata hierarchy: catalog, database, and table. A catalog can correspond directly to an external data catalog. Currently, the following external data catalogs are supported:
- Hive
- JDBC: JDBC Catalogs in Doris are connected to databases using the standard JDBC protocol, facilitating data access.
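With the three-tiered hierarchy, a table can be addressed by a fully qualified name in the form catalog.database.table. The following is a minimal sketch of that naming. All object names (hive_catalog, sales_db, orders, demo_db, customers) and column names are hypothetical placeholders, not objects defined in this document.

```sql
-- Hypothetical names: hive_catalog, sales_db.orders, and demo_db.customers
-- are placeholders used only for illustration.

-- Query a Hive table through its fully qualified three-part name.
SELECT order_id, amount
FROM hive_catalog.sales_db.orders
LIMIT 10;

-- Federated query: join a table in the Internal Catalog with an
-- external Hive table in a single statement.
SELECT c.customer_name, SUM(o.amount) AS total_amount
FROM internal.demo_db.customers AS c
JOIN hive_catalog.sales_db.orders AS o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
```

Using the full three-part name avoids having to switch catalogs first, which is convenient when one query touches more than one catalog.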
Background Information
Customer Hive table data is stored in either OBS or HDFS. Doris needs to connect to these Hive external tables. The MRS clusters are categorized into security and common clusters. Based on the specific environment, data can be queried using one of the following four methods:
- Set the authentication type to SIMPLE to access Hive data stored in HDFS.
- Set the authentication type to KERBEROS to access Hive data stored in HDFS.
- Set the authentication type to SIMPLE to access Hive data stored in OBS.
- Set the authentication type to KERBEROS to access Hive data stored in OBS.
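As a rough illustration of the two HDFS-based methods, the sketch below creates one Hive catalog with SIMPLE authentication and one with KERBEROS authentication. The property names follow the open-source Apache Doris Hive catalog documentation and are an assumption here; this service may require different or additional properties (for example, HDFS high-availability settings). The catalog names, addresses, principals, and paths are placeholders.

```sql
-- SIMPLE authentication, Hive data stored in HDFS.
-- The Metastore address and user name are placeholders.
CREATE CATALOG hive_simple PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://192.0.2.10:9083',
    'hadoop.username' = 'hive'
);

-- KERBEROS authentication, Hive data stored in HDFS.
-- Principals and the keytab path are placeholders; use the values
-- prepared for your cluster.
CREATE CATALOG hive_kerberos PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://192.0.2.10:9083',
    'hive.metastore.sasl.enabled' = 'true',
    'hive.metastore.kerberos.principal' = 'hive/hadoop.example.com@EXAMPLE.COM',
    'hadoop.security.authentication' = 'kerberos',
    'hadoop.kerberos.keytab' = '/path/to/user.keytab',
    'hadoop.kerberos.principal' = 'doris_user@EXAMPLE.COM'
);
```

The OBS variants additionally need the OBS endpoint and access credentials. They are not shown because the exact property names depend on the service version.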
Kerberos Introduction
The Hadoop community version provides two authentication modes: Kerberos authentication (security mode) and Simple authentication (normal mode). When creating a cluster, you can choose to enable or disable Kerberos authentication.
The clusters in security mode use the Kerberos authentication protocol for security authentication.
- Function
Kerberos adopts a client/server structure and encryption technologies such as AES, and supports mutual authentication (both the client and server can authenticate each other). Kerberos is used to prevent interception and replay attacks and protect data integrity. It is a system that manages keys by using a symmetric key mechanism.
- Prerequisites
The Kerberos client, keytab path, Kerberos authentication username, and client configuration file krb5.conf are prepared.
- Kerberos architecture
The following figure shows the Kerberos architecture. For details, see the MRS documentation.
Figure 1 Kerberos architecture
Table 1 Parameters

| Parameter | Description |
| --- | --- |
| Application Client | An application client, which is usually an application that submits tasks or jobs. |
| Application Server | An application server, which is usually an application that an application client accesses. |
| Kerberos | A service that provides security authentication. |
| KerberosAdmin | A process that provides authentication user management. |
| KerberosServer | A process that provides authentication ticket distribution. |
Basic Concepts
- Internal Catalog
Existing databases and tables in Doris are all under the Internal Catalog, which is the default catalog in Doris and cannot be modified or deleted.
- External Catalog
You can run the CREATE CATALOG command to create an External Catalog, and view the existing catalogs using the SHOW CATALOGS command.
- Switch Catalog
After login, you will enter the Internal Catalog by default. Then, you can view or switch to your target database via SHOW DATABASES and USE DB.
You can run the SWITCH command to switch the catalog. Example:
SWITCH internal;
SWITCH hive_catalog;
After switching the catalog, you can view or switch to your target database in that catalog via SHOW DATABASES and USE DB. Doris automatically passes through the databases and tables in the catalog. You can view and access data in External Catalogs in the same way as in the Internal Catalog.
Currently, Doris supports only read-only access to data in External Catalogs.
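A typical session might look like the following sketch. It assumes an External Catalog named hive_catalog already exists. The database sales_db and the table orders are hypothetical placeholders.

```sql
-- List all catalogs, including the default internal catalog.
SHOW CATALOGS;

-- Switch to the external catalog (assumed to exist already).
SWITCH hive_catalog;

-- Browse the databases and tables passed through from Hive.
SHOW DATABASES;
USE sales_db;      -- hypothetical database name
SHOW TABLES;

-- Read-only access: SELECT works on tables in the external catalog.
SELECT * FROM orders LIMIT 10;   -- hypothetical table name

-- Return to the Internal Catalog.
SWITCH internal;
```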
- Delete Catalog
You can run the DROP CATALOG command to delete an External Catalog. The Internal Catalog cannot be deleted.
This operation only deletes the mapping information of the Catalog in Doris, but does not modify or change the content of any external data catalog.
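For example, assuming an External Catalog with the placeholder name hive_catalog, the deletion can be sketched as:

```sql
-- hive_catalog is a placeholder name.
-- Removes only the catalog mapping stored in Doris; the Hive
-- metadata and the underlying data are left untouched.
DROP CATALOG IF EXISTS hive_catalog;

-- Confirm that the catalog no longer appears in the list.
SHOW CATALOGS;
```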
- Resource
Resource is a set of configurations. You can run the CREATE RESOURCE command to create a Resource. Then, you can use the Resource when creating a catalog.
A resource can be used by multiple catalogs to reuse the configuration of the resource.
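The sketch below shows one Resource shared by two catalogs. The WITH RESOURCE syntax follows the open-source Apache Doris 1.2 documentation and is an assumption here; check whether your version supports it. The resource name, catalog names, and Metastore address are placeholders.

```sql
-- Define the connection settings once as a Resource.
-- hms_resource and the Metastore address are placeholders.
CREATE RESOURCE hms_resource PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://192.0.2.10:9083'
);

-- Reuse the same Resource for more than one catalog.
CREATE CATALOG hive_a WITH RESOURCE hms_resource;
CREATE CATALOG hive_b WITH RESOURCE hms_resource;
```

Keeping the connection settings in a Resource means they are defined in one place rather than repeated in every CREATE CATALOG statement.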