What Is Data Lake Insight

DLI Introduction

Data Lake Insight (DLI) is a serverless data processing and analysis service fully compatible with Apache Spark and Apache Flink ecosystems. It frees you from managing any servers.

DLI supports multiple querying methods, including standard SQL, Spark SQL, and Flink SQL, and is compatible with mainstream data formats. You can query these formats directly with standard SQL or with Spark and Flink applications, without data ETL. DLI also supports SQL statements and Spark applications against heterogeneous data sources, including CloudTable, RDS, DWS, CSS, OBS, custom databases on ECSs, and offline databases.
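The "query without ETL" idea can be pictured with a toy example. In the sketch below, Python's sqlite3 merely stands in for the DLI SQL engine (an assumption purely for illustration; DLI itself reads the formats in place, for example on OBS): one standard SQL statement joins rows that originated as CSV and as JSON.

```python
# Illustration only: sqlite3 stands in for the DLI SQL engine.
# Rows originate in two mainstream formats (CSV and JSON) and are
# queried together with one standard SQL statement, no ETL pipeline.
import csv
import io
import json
import sqlite3

csv_rows = list(csv.DictReader(io.StringIO("id,amount\n1,10\n2,25\n")))
json_rows = json.loads('[{"id": 1, "region": "eu"}, {"id": 2, "region": "ap"}]')

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(int(r["id"]), int(r["amount"])) for r in csv_rows])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(r["id"], r["region"]) for r in json_rows])

# One standard SQL statement joins data from both original formats.
result = conn.execute(
    "SELECT c.region, SUM(o.amount) FROM orders o "
    "JOIN customers c ON o.id = c.id GROUP BY c.region ORDER BY c.region"
).fetchall()
print(result)  # [('ap', 25), ('eu', 10)]
```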

Core Functions

For details about DLI functions, see Features.

Table 1 DLI core functions

DLI is a data processing and analytics service built on the serverless architecture.

DLI is a serverless big data query and analytics service. With DLI, you only pay for the actual compute resources used, with no need to maintain or manage cloud servers.

  • Auto scaling: DLI automatically scales compute resources, ensuring you always have enough capacity to handle traffic spikes.

DLI supports multiple compute engines.

DLI is fully compatible with ecosystems like Apache Spark and Apache Flink, and supports standard SQL, Spark SQL, and Flink SQL. It is compatible with mainstream data formats such as CSV, JSON, Parquet, and ORC.

  • Spark is a unified analytics engine designed for large-scale data processing, focused on query, compute, and analysis. DLI adds extensive performance optimizations and service-oriented enhancements on top of open-source Spark while remaining compatible with the Apache Spark ecosystem and APIs, boosting performance by 2.5 times and enabling exabyte-scale data to be queried and analyzed within hours.
  • Flink is a distributed compute engine that handles both batch processing of static and historical datasets and stream processing of live data streams, generating results in real time. DLI enhances the features and security of open-source Flink and provides the Stream SQL capabilities needed for data processing.
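The windowed aggregation that stream processing provides can be sketched in a few lines. This is a minimal, pure-Python illustration of the tumbling-window concept (the event timestamps and window size are made-up values), not DLI or Flink code:

```python
# Minimal sketch of a tumbling-window aggregation, the concept that
# Flink SQL expresses declaratively. Events are (timestamp, value)
# pairs; each event falls into exactly one fixed-size window.
from collections import defaultdict

def tumble_sum(events, window_size):
    """Sum event values per fixed-size, non-overlapping window."""
    windows = defaultdict(int)
    for ts, value in events:
        window_start = ts - (ts % window_size)  # align to window boundary
        windows[window_start] += value
    return dict(sorted(windows.items()))

events = [(1, 10), (4, 5), (12, 7), (13, 3), (21, 1)]
print(tumble_sum(events, window_size=10))  # {0: 15, 10: 10, 20: 1}
```

In a real Flink SQL job the same grouping is written with a window clause over event time; the engine additionally handles out-of-order events and state management, which this sketch omits.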

DLI supports multiple connection methods.

DLI provides multiple connection methods to meet diverse user needs and scenarios.

Connection methods:

  • Web-based console
  • APIs
  • SDKs
  • Client tools
  • Submitting DLI jobs using DataArts Studio
  • Connecting to BI tools for visual analysis

DLI can connect to multiple data sources for cross-source data analysis.

  • Spark datasource connection: Data sources such as DWS, RDS, and CSS can be accessed through DLI.
  • Flink supports cross-source connectivity with various cloud services, forming a rich streaming ecosystem. The DLI streaming ecosystem consists of a cloud service ecosystem and an open-source ecosystem:
    • Cloud service ecosystem: DLI can connect to other cloud services in Flink SQL, so you can read data from and write data to these services directly with SQL.
    • Open-source ecosystem: After an enhanced datasource connection establishes a network link to another VPC, tenant-authorized DLI queues can access all sources and sinks supported by Flink and Spark, such as Kafka, HBase, and Elasticsearch.
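As a sketch of what "reading a source directly with SQL" looks like, the helper below renders the DDL for a Kafka source table. The option names ('connector', 'topic', 'properties.bootstrap.servers', 'format') follow the open-source Flink Kafka connector; the table name, topic, and broker address are made-up placeholders:

```python
# Renders Flink-SQL-style DDL declaring a Kafka source table.
# Connector option names follow the open-source Flink Kafka connector;
# all concrete values below are illustrative placeholders.
def kafka_source_ddl(table, topic, brokers, columns):
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n) WITH (\n"
        f"  'connector' = 'kafka',\n"
        f"  'topic' = '{topic}',\n"
        f"  'properties.bootstrap.servers' = '{brokers}',\n"
        f"  'format' = 'json'\n)"
    )

ddl = kafka_source_ddl("clicks", "click-events", "10.0.0.5:9092",
                       [("user_id", "STRING"), ("ts", "TIMESTAMP(3)")])
print(ddl)
```

Once such a table is declared, subsequent Flink SQL statements treat the stream like any other table.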

DLI supports three basic job types.

  • SQL jobs allow you to query data using standard SQL statements.
  • Flink jobs provide Flink SQL online analysis: capabilities such as windows and joins let you express service logic in SQL and implement services quickly and conveniently.
  • Spark jobs provide fully managed Spark compute: you can submit compute tasks through interactive sessions or batch processing and analyze data on fully managed Spark queues.

DLI supports decoupled storage and compute.

After storing data in OBS, you can connect DLI to OBS for data analysis. Under the decoupled storage and compute architecture, storage resources and compute resources can be requested and billed separately, reducing costs and improving resource utilization.
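The cost effect of billing storage and compute separately can be made concrete with a toy model. All prices below are made-up placeholders, not DLI list prices; the point is only that each dimension scales independently:

```python
# Toy cost model for decoupled storage and compute. Prices are
# invented placeholders: storage is billed per GB-month, compute per
# CU-hour, and each side can be sized independently of the other.
def monthly_cost(storage_gb, cu_hours,
                 storage_price_per_gb=0.02, compute_price_per_cu_hour=0.5):
    return (storage_gb * storage_price_per_gb
            + cu_hours * compute_price_per_cu_hour)

# A bursty workload: lots of stored data, but compute runs only
# during a nightly 2-hour batch on a 64-CU queue (30 days).
print(monthly_cost(storage_gb=10_000, cu_hours=2 * 30 * 64))
```

Under a coupled architecture the same workload would have to keep compute sized for the storage footprint all month; here the compute term shrinks to the hours actually used.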

When creating an OBS bucket on the DLI console, you can choose single-AZ or multi-AZ storage for data redundancy. The differences between the two storage policies are as follows:

  • Multi-AZ storage redundantly stores data across multiple AZs within the same region, offering higher reliability. If one AZ becomes unavailable, data can still be accessed normally from the other AZs, making this policy suitable for scenarios that require highly reliable data storage. You are advised to use it.
  • Single-AZ storage stores data in only one AZ, but is more cost-effective than multi-AZ storage.
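A back-of-the-envelope calculation shows why replicating across AZs raises reliability. This assumes AZ failures are independent and each AZ is available with probability p; the figure used is illustrative, not a published OBS number:

```python
# If each AZ is independently available with probability p, data
# replicated across n AZs is unavailable only when ALL n fail.
def multi_az_availability(p, azs):
    return 1 - (1 - p) ** azs

p = 0.999  # illustrative per-AZ availability, not an OBS figure
print(multi_az_availability(p, 1))  # single-AZ: just p
print(multi_az_availability(p, 3))  # three AZs: roughly nine nines
```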

DLI manages and schedules resources in a unified manner using elastic resource pools.

The backend of elastic resource pools adopts a CCE cluster architecture, supporting heterogeneous resources, so you can manage and schedule resources in a unified manner.

DLI Product Architecture

DLI includes the following core modules:

Table 2 DLI core modules

Ecosystem tools

DLI leverages its robust serverless architecture and multimodal engine support to fulfill the diverse needs of various industries, driving their digital transformation and fostering innovation.

Compute engine

  • Spark: supports batch processing and interactive analysis of large-scale data and provides high-performance distributed computing capabilities.
  • Flink: supports real-time stream processing, capable of handling large-scale real-time data streams, with support for event time processing and state management.
  • HetuEngine: supports interactive data analysis, swiftly handles complex SQL queries, and facilitates connections and queries across various data sources.

Unified resource management

  • Resource decoupling: DLI adopts a decoupled compute and storage architecture, decoupling compute resources from storage resources. This allows for flexible adjustment of the ratio between compute and storage resources based on actual needs, enhancing resource utilization and reducing costs.
  • Elastic scaling: DLI compute resources are built upon containerized Kubernetes and possess elastic scaling capabilities. Resources can be automatically adjusted based on job demands.
  • Multi-tenant support: Compute resources can be isolated by tenant to ensure independence among different tenants. Each tenant can independently manage their own compute resources, enabling fine-grained resource management and facilitating inter-departmental data sharing and permissions management within enterprises.
  • Pay-per-use compute resources: You only pay for the compute resources you actually use, with no need to pre-purchase or manage servers, enhancing usage efficiency.
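The elastic-scaling idea above can be sketched as a simple sizing rule. The thresholds and per-job cost below are made-up numbers chosen for illustration; real scaling decisions involve queue metrics the sketch does not model:

```python
# Toy version of elastic scaling: derive a target CU count from the
# pending-job load, clamped to the resource pool's min/max bounds.
# All numbers are illustrative placeholders, not DLI defaults.
def target_cus(pending_jobs, cus_per_job=4, min_cus=16, max_cus=256):
    return max(min_cus, min(max_cus, pending_jobs * cus_per_job))

print(target_cus(2))    # 16  (light load stays at the pool floor)
print(target_cus(10))   # 40  (scales with demand)
print(target_cus(100))  # 256 (capped at the pool ceiling)
```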

Unified metadata management

  • Multi-source metadata integration: DLI supports centralized management of metadata from various data sources, including cloud-based data sources (such as OBS, RDS, DWS, and CSS) and on-premises data sources (such as self-built databases and Redis). You can manage and analyze metadata across different data sources without the need to migrate data to a unified data lake.
  • Metadata synchronization: DLI provides metadata management to ensure the timeliness and consistency of metadata.
  • Metadata query and management: DLI offers standard SQL APIs, enabling you to query and manage metadata using SQL statements. You can add, delete, modify, and query metadata to facilitate data governance and analysis.
  • Data security and permission management: Permissions on data catalogs, databases, and tables can be managed. You can assign different permissions to various tenants and user groups to ensure data security and compliance.
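The idea of querying and managing metadata through SQL can be illustrated with a small stand-in. Here sqlite3 plays the role of the metadata catalog (an assumption for the sake of a runnable example; DLI's own catalog statements are documented separately):

```python
# Illustration: managing metadata through SQL statements, with
# sqlite3 standing in for a metadata catalog. The table definition
# is created with DDL, then the catalog itself is queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")

# List tables from the catalog, then inspect a table's columns.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
columns = [r[1] for r in conn.execute("PRAGMA table_info(sales)")]
print(tables)   # ['sales']
print(columns)  # ['id', 'region', 'amount']
```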

Storage service

OBS and databases are used to store structured or unstructured data for data analysis, providing persistent data storage services.

Data source connection

  • Cloud data sources can be connected. For example, OBS can store and manage unstructured data, Relational Database Service (RDS) can store and manage structured data, and DWS can efficiently query and analyze data.
  • On-premises data sources, such as self-built databases (MySQL and PostgreSQL) and HDFS, can be connected.

Data applications

DLI can connect to mainstream BI tools in the industry to flexibly meet data presentation needs.

Accessing DLI

A web-based service management platform is provided. You can access DLI using the management console or HTTPS-based APIs, or connect to the DLI server through the JDBC client.

  • Using the management console

    You can submit SQL, Spark, or Flink jobs on the DLI management console.

  • Using APIs

    If you need to integrate DLI into a third-party system for secondary development, you can call DLI APIs to use the service.

    For details, see Data Lake Insight API Reference.

  • DataArts Studio

    DataArts Studio is a one-stop data operations platform that provides intelligent data lifecycle management. It supports intelligent construction of industrial knowledge libraries and incorporates data foundations such as big data storage, computing, and analysis engines. With DataArts Studio, your company can easily construct end-to-end intelligent data systems. These systems can help eliminate data silos, unify data standards, accelerate data monetization, and promote digital transformation.

    Create a data connection on the DataArts Studio management console to access DLI for data analysis.
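For API access, assembling the HTTPS request looks roughly like the sketch below. The endpoint, project ID, URL path, and payload shape here are illustrative placeholders; consult the Data Lake Insight API Reference for the real values. Nothing is sent over the network:

```python
# Sketch of building (not sending) an HTTPS request to a DLI-style
# REST API. The host, path, project ID, and body fields are made-up
# placeholders; real ones are in the Data Lake Insight API Reference.
import json
from urllib.request import Request

def build_submit_request(endpoint, project_id, sql, token):
    # Hypothetical job-submission path, for illustration only.
    url = f"https://{endpoint}/v1.0/{project_id}/jobs/submit-job"
    body = json.dumps({"sql": sql}).encode("utf-8")
    return Request(url, data=body, method="POST", headers={
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # IAM token; placeholder value below
    })

req = build_submit_request("dli.example.com", "my-project",
                           "SELECT 1", token="<token>")
print(req.full_url)
```

The same request shape is what the SDKs and client tools construct on your behalf.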