diff --git a/content/en/docs/Developerguide/administration.md b/content/en/docs/Developerguide/administration.md deleted file mode 100644 index 01d1effd3446f45984f6ef5a2c5e238c4f419ced..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/administration.md +++ /dev/null @@ -1,14 +0,0 @@ -# Administration - -The following describes various MOT administration topics – - -- **Durability** -- [Logging – WAL Redo Log](logging-wal-redo-log.md) -- [Recovery](recovery.md) -- [Replication and High Availability](replication-and-high-availability.md) -- [Memory Management](memory-management.md) -- [Vacuum](vacuum.md) -- [Statistics](statistics.md) -- [Monitoring](monitoring.md) -- [Error Messages](error-messages.md) - diff --git a/content/en/docs/Developerguide/benchmarksql-an-open-source-tpc-c-tool.md b/content/en/docs/Developerguide/benchmarksql-an-open-source-tpc-c-tool.md deleted file mode 100644 index 59b612cb8be0d78b114e73651f2d29ca7c940653..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/benchmarksql-an-open-source-tpc-c-tool.md +++ /dev/null @@ -1,39 +0,0 @@ -# BenchmarkSQL – An Open-Source TPC-C Tool - -For example, to test TPCC, the **BenchmarkSQL** can****be used, as follows – - -- Download **benchmarksql** from the following link – [https://osdn.net/frs/g\_redir.php?m=kent&f=benchmarksql%2Fbenchmarksql-5.0.zip](https://osdn.net/frs/g_redir.php?m=kent&f=benchmarksql%2Fbenchmarksql-5.0.zip) -- Under run/sql.common, adjust the schema creation scripts to MOT syntax and avoid unsupported DDLs. Alternatively, the adjusted scripts can be directly downloaded from the following link – [https://opengauss.obs.cn-south-1.myhuaweicloud.com/1.0.0/sql.common.opengauss.mot.tar.gz](https://opengauss.obs.cn-south-1.myhuaweicloud.com/1.0.0/sql.common.opengauss.mot.tar.gz) - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->The benchmark test is executed using a standard interactive SQL mode without stored procedures. - -## Setting Up and Running BenchmarkSQL - -The following describes how to set up and run a BenchmarkSQL test: - -- TPCC Configuration - -Configure TPC-C as follows: - -- Full Transactions – 5 -- Standard Workload Distribution: - - newOrderWeight=45 - - paymentWeight=43 - - orderStatusWeight=4 - - deliveryWeight=4 - - stockLevelWeight=4 - -- Number of Warehouses – 512 Warehouses - -## Running the Benchmark - -Anyone can run the benchmark by starting up the server and running the **benchmarksql** scripts. - -To run the benchmark - -1. Go to the client folder and link **sql.common** to **sql.common.opengauss.mot** in order to test MOT. -2. Start up the database server. -3. Configure the **props.pg** file in the client. -4. Run the benchmark**.** - diff --git a/content/en/docs/Developerguide/calc-checkpoint-algorithm-low-overhead-in-memory-and-compute.md b/content/en/docs/Developerguide/calc-checkpoint-algorithm-low-overhead-in-memory-and-compute.md deleted file mode 100644 index df686ce3c47cca60704ecdd9f9aa385b796bda4a..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/calc-checkpoint-algorithm-low-overhead-in-memory-and-compute.md +++ /dev/null @@ -1,8 +0,0 @@ -# CALC Checkpoint algorithm: low overhead in memory and compute - -The checkpoint algorithm provides the following benefits - -- Reduced memory usage: At most two copies of each record are stored at any time. 
Memory usage is minimized by only storing one physical copy of a record when its live and stable versions are equal or when no checkpoint is actively being recorded. -- Low overhead. CALC's overhead is smaller than other asynchronous checkpointing algorithms. -- Uses virtual points of consistency. CALC does not require quiescing of the database in order to achieve a physical point of consistency. - diff --git a/content/en/docs/Developerguide/checkpoint-activation.md b/content/en/docs/Developerguide/checkpoint-activation.md deleted file mode 100644 index 296d6194fed1170f102bdf2856dc0d32b22e7708..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/checkpoint-activation.md +++ /dev/null @@ -1,6 +0,0 @@ -# Checkpoint Activation - -MOT checkpoints are integrated into the envelope's checkpoint mechanism. The process can be triggered manually by executing “**CHECKPOINT;**” command or by automatically considering the envelope's triggering settings \(time/size\). - -Checkpoint configuration is done in the mot.conf file – see the relevant [Default MOT.conf](default-mot-conf.md). ++ - diff --git a/content/en/docs/Developerguide/checkpoint.md b/content/en/docs/Developerguide/checkpoint.md deleted file mode 100644 index 7482d69a8bcfb7c01188288828dbc6f9363edb0d..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/checkpoint.md +++ /dev/null @@ -1,19 +0,0 @@ -# Checkpoint - -In openGauss the Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before the checkpoint. - -At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. - -The MOT does not store its data like openGauss does and there is no dirty pages concept. The data is stored directly in memory. - -For this reason we have researched and implemented the CALC algorithm described in the paper Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems, SIGMOND 2016 from Yale University. - -Reference to CALC footnote: - -K. Ren, T. Diamond, D. J. Abadi, and A. Thomson. Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, 2016. - -- **[CALC Checkpoint algorithm: low overhead in memory and compute](calc-checkpoint-algorithm-low-overhead-in-memory-and-compute.md)** - -- **[Checkpoint Activation](checkpoint-activation.md)** - - diff --git a/content/en/docs/Developerguide/checkpoint_mot.md b/content/en/docs/Developerguide/checkpoint_mot.md deleted file mode 100644 index a87914582824e087d2d34916876afaf5e7fb778b..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/checkpoint_mot.md +++ /dev/null @@ -1,24 +0,0 @@ -# CHECKPOINT \(MOT\) - -- enable\_checkpoint = true - - Specifies whether to use periodic checkpoint. - - -- checkpoint\_dir = - - Specifies the directory in which checkpoint data is to be stored. The default location is in the data folder of each data node. - - -- checkpoint\_segsize = 16 MB - - Specifies the segment size used during checkpoint. Checkpoint is performed in segments. When a segment is full, it is serialized to disk and a new segment is opened for the subsequent checkpoint data. - - -- checkpoint\_workers = 3 - - Specifies the number of workers to use during checkpoint. - - Checkpoint is performed in parallel by several MOT engine workers. 
The quantity of workers may substantially affect the overall performance of the entire checkpoint operation, as well as the operation of other running transactions. To achieve a shorter checkpoint duration, a larger number of workers should be used, up to the optimal number \(which varies based on the hardware and workload\). However, be aware that if this number is too large, it may negatively impact the execution time of other running transactions. Keep this number as low as possible to minimize the effect on the runtime of other running transactions. When this number is too high, longer checkpoint durations occur. - - diff --git a/content/en/docs/Developerguide/checkpoints.md b/content/en/docs/Developerguide/checkpoints.md deleted file mode 100644 index a0e4a667765488d695c7dc6e0d752f5a4cb03319..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/checkpoints.md +++ /dev/null @@ -1,19 +0,0 @@ -# Checkpoints - -A Checkpoint is the point in time at which all the data of a table's rows is saved in files on persistent storage in order to create a full durable database image. It is a snapshot of the data at a specific point in time. - -A Checkpoint is required in order to reduce a database's recovery time by shortening the quantity of WAL \(Redo Log\) entries that must be replayed in order to ensure durability. Checkpoint's also reduce the storage space required to keep all the log entries. - -If there were no Checkpoints, then in order to recover a database, all the WAL redo entries would have to be replayed from the beginning of time, which could take days/weeks depending on the quantity of records in the database. Checkpoints record the current state of the database and enable old redo entries to be discarded. - -Checkpoints are essential during recovery scenarios \(especially for a cold start\). First, the data is loaded from the last known or a specific Checkpoint; and then the WAL is used to complete the data changes that occurred since then. - -For example – If the same table row is modified 100 times, then 100 entries are recorded in the log. When Checkpoints are used, then even if a specific table row was modified 100 times, it is recorded in the Checkpoint a single time. After the recording of a Checkpoint, recovery can be performed on the basis of that Checkpoint and only the WAL Redo Log entries that occurred since the Checkpoint need be played. - -## Configuring Checkpoints - -Checkpoint configuration is performed in the CHECKPOINT; section of the mot.conf file. You may refer to the [CHECKPOINT \(MOT\)](checkpoint_mot.md)section of this user manual for a description of these configuration parameters. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->In a production deployment, the value must be TRUE \#enable\_Checkpoint = true. A FALSE value can only be used for testing. - diff --git a/content/en/docs/Developerguide/comparison-disk-vs-mot.md b/content/en/docs/Developerguide/comparison-disk-vs-mot.md index bf9415bc1ae254f78154acdc7a217c52a7e1eb65..193f3c0adc8712791bb3c94509f5ea38baca1fab 100644 --- a/content/en/docs/Developerguide/comparison-disk-vs-mot.md +++ b/content/en/docs/Developerguide/comparison-disk-vs-mot.md @@ -1,108 +1,400 @@ -# Comparison – Disk vs. MOT - -The following table briefly compares the various features of a openGauss Disk-based storage engine and a MOT storage engine. - -**Table 1** Comparison – Disk vs. MOT - - -

| Feature | openGauss Disk Store | openGauss MOT Engine |
|---------|----------------------|----------------------|
| Intel x86 + Kunpeng ARM | Yes | Yes |
| SQL and Feature-set Coverage | 100% | 98% |
| Scale-up (Many-cores, NUMA) | Low Efficiency | High Efficiency |
| Throughput | High | Extremely High |
| Latency | Low | Extremely Low |
| Distributed (Cluster Mode) | Yes | Yes |
| Isolation Levels | RC+SI, RR, Serializable | RC, RR, RC+SI (in V2 release) |
| Concurrency Control | Pessimistic | Optimistic |
| Data Capacity (Data + Index) | Unlimited | Limited to DRAM |
| Native Compilation | No | Yes |
| Replication, Recovery | Yes | Yes |
| Replication Options | 2 (sync, async) | 3 (sync, async, group-commit) |
- -**Legend** - -- RR = Repeatable Reads -- RC = Read Committed -- SI = Snapshot Isolation - +
+ +**Legend –** + +- RR = Repeatable Reads +- RC = Read Committed +- SI = Snapshot Isolation + +Appendices + +References + +\[1\] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In Proc. 7th ACM European Conference on Computer Systems \(EuroSys\), Apr. 2012. + +\[2\] K. Ren, T. Diamond, D. J. Abadi, and A. Thomson. Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, 2016. + +\[3\] https://e.huawei.com/en/products/servers/taishan-server/taishan-2280-v2. + +\[4\] [https://e.huawei.com/en/products/servers/taishan-server/taishan-2480-v2](https://e.huawei.com/en/products/servers/taishan-server/taishan-2480-v2). + +\[5\] Tu, S., Zheng, W., Kohler, E., Liskov, B., and Madden, S. Speedy transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles \(New York, NY, USA, 2013\), SOSP ’13, ACM, pp. 18–32. + +\[6\] H. Avni at al. Industrial-Strength OLTP Using Main Memory and Many-cores, VLDB 2020. + +\[7\] Bernstein, P. A., and Goodman, N. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 \(1981\), 185–221. + +\[8\] Felber, P., Fetzer, C., and Riegel, T. Dynamic performance tuning of word-based software transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2008, Salt Lake City, UT, USA, February 20-23, 2008 \(2008\), + +pp. 237–246. + +\[9\] Appuswamy, R., Anadiotis, A., Porobic, D., Iman, M., and Ailamaki, A. Analyzing the impact of system architecture on the scalability of OLTP engines for high-contention workloads. PVLDB 11, 2 \(2017\), + +121–134. + +\[10\] R. Sherkat, C. Florendo, M. Andrei, R. Blanco, A. Dragusanu, A. Pathak, P. Khadilkar, N. Kulkarni, C. Lemke, S. Seifert, S. Iyer, S. Gottapu, R. Schulze, C. Gottipati, N. Basak, Y. Wang, V. Kandiyanallur, S. Pendap, D. Gala, R. Almeida, and P. Ghosh. Native store extension for SAP HANA. PVLDB, 12\(12\): + +2047–2058, 2019. + +\[11\] X. Yu, A. Pavlo, D. Sanchez, and S. Devadas. Tictoc: Time traveling optimistic concurrency control. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 1629–1642, 2016. + +\[12\] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In C. S. Jensen, C. M. Jermaine, and X. Zhou, editors, 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 38–49. IEEE Computer Society, 2013. + +\[13\] S. K. Cha, S. Hwang, K. Kim, and K. Kwon. Cache-conscious concurrency control of main-memory indexes on shared-memory multiprocessor systems. In P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and R. T. Snodgrass, editors, VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 181–190. Morga Kaufmann, 2001. + +Glossary + +Table 11 – Glossary + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Acronym | Definition/Description |
|---------|------------------------|
| 2PL | 2-Phase Locking |
| ACID | Atomicity, Consistency, Isolation, Durability |
| AP | Analytical Processing |
| ARM | Advanced RISC Machine, a hardware architecture alternative to x86 |
| CC | Concurrency Control |
| CPU | Central Processing Unit |
| DB | Database |
| DBA | Database Administrator |
| DBMS | Database Management System |
| DDL | Data Definition Language, the database schema management language |
| DML | Data Modification Language |
| ETL | Extract, Transform, Load or Encounter Time Locking |
| FDW | Foreign Data Wrapper |
| GC | Garbage Collector |
| HA | High Availability |
| HTAP | Hybrid Transactional-Analytical Processing |
| IoT | Internet of Things |
| IM | In-Memory |
| IMDB | In-Memory Database |
| IR | Intermediate Representation of source code, used in compilation and optimization |
| JIT | Just In Time |
| JSON | JavaScript Object Notation |
| KV | Key Value |
| LLVM | Low-Level Virtual Machine, refers to the compilation of code or queries into IR |
| M2M | Machine-to-Machine |
| ML | Machine Learning |
| MM | Main Memory |
| MO | Memory Optimized |
| MOT | Memory Optimized Tables storage engine (SE), pronounced as /em/ /oh/ /tee/ |
| MVCC | Multi-Version Concurrency Control |
| NUMA | Non-Uniform Memory Access |
| OCC | Optimistic Concurrency Control |
| OLTP | Online Transaction Processing |
| PG | PostgreSQL |
| RAW | Reads-After-Writes |
| RC | Return Code |
| RTO | Recovery Time Objective |
| SE | Storage Engine |
| SQL | Structured Query Language |
| TCO | Total Cost of Ownership |
| TP | Transactional Processing |
| TPC-C | An On-Line Transaction Processing benchmark |
| Tpm-C | Transactions-per-minute-C, a performance metric for the TPC-C benchmark that counts new-order transactions |
| TVM | Tiny Virtual Machine |
| TSO | Time Sharing Option |
| UDT | User-Defined Type |
| WAL | Write Ahead Log |
| XLOG | A PostgreSQL implementation of transaction logging (WAL, described above) |
+ diff --git a/content/en/docs/Developerguide/concepts-of-mot.md b/content/en/docs/Developerguide/concepts-of-mot.md index 63537997dd382c91d7e735d1cbd85aeef0f02c8b..19db35a2550ccc306424519b8fc3e4b37f818fa3 100644 --- a/content/en/docs/Developerguide/concepts-of-mot.md +++ b/content/en/docs/Developerguide/concepts-of-mot.md @@ -1,23 +1,23 @@ -# Concepts of MOT - -This**** chapter describes how openGauss MOT is designed and how it works. It also sheds light on its advanced features and capabilities and how to use them. This chapter serves to educate the reader about various technical details of how MOT operates, details of important MOT features and innovative differentiators. The content of this chapter may be useful for decision-making regarding MOT’s suitability to specific application requirements and for using and managing it most efficiently. - -- **[Scale-up Architecture](scale-up-architecture.md)** - -- **[Concurrency Control Mechanism](concurrency-control-mechanism.md)** - -- **[Extended FDW and Other openGauss Features](extended-fdw-and-other-opengauss-features.md)** - -- **[NUMA Awareness Allocation and Affinity](numa-awareness-allocation-and-affinity.md)** - -- **[Indexes](indexes.md)** - -- **[Durability](durability-0.md)** - -- **[Recovery](recovery-1.md)** - -- **[Query Native Compilation \(JIT\)](query-native-compilation-(jit).md)** - -- **[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)** - - +# Concepts of MOT + +This chapter describes how openGauss MOT is designed and how it works. It also sheds light on its advanced features and capabilities and how to use them. This chapter serves to educate the reader about various technical details of how MOT operates, details of important MOT features and innovative differentiators. The content of this chapter may be useful for decision-making regarding MOT's suitability to specific application requirements and for using and managing it most efficiently. + +- **[MOT Scale-up Architecture](mot-scale-up-architecture.md)** + +- **[MOT Concurrency Control Mechanism](mot-concurrency-control-mechanism.md)** + +- **[Extended FDW and Other openGauss Features](extended-fdw-and-other-opengauss-features.md)** + +- **[NUMA Awareness Allocation and Affinity](numa-awareness-allocation-and-affinity.md)** + +- **[MOT Indexes](mot-indexes.md)** + +- **[MOT Durability Concepts](mot-durability-concepts.md)** + +- **[MOT Recovery Concepts](mot-recovery-concepts.md)** + +- **[MOT Query Native Compilation \(JIT\)](mot-query-native-compilation-(jit).md)** + +- **[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)** + + diff --git a/content/en/docs/Developerguide/concurrency-control-mechanism.md b/content/en/docs/Developerguide/concurrency-control-mechanism.md deleted file mode 100644 index 1516f467a94cc49ba5c228fccad9b13db256d5ae..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/concurrency-control-mechanism.md +++ /dev/null @@ -1,18 +0,0 @@ -# Concurrency Control Mechanism - -After investing extensive research to find the best concurrency control mechanism, we concluded that SILO[\[5\]](#_ftn5)-based on OCC is the best ACID-compliant OCC algorithm for MOT. SILO provides the best foundation for MOT’s challenging requirements. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->MOT is fully Atomicity, Consistency, Isolation, Durability \(ACID\)-compliant, as described in the ++ section. 
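For illustration only, the following sketch assumes the single-column MOT table **test** created in this guide (with one row inserted) and shows how optimistic validation typically surfaces to an application when two sessions update the same row concurrently; the exact statement at which the conflict is reported may vary.

```
-- Setup: one row in the MOT table "test" (created earlier in this guide).
insert into test values (0);

-- Session 1
begin;
update test set x = 10 where x = 0;

-- Session 2 (running concurrently, does not block under OCC)
begin;
update test set x = 20 where x = 0;

-- Session 1 commits first and passes optimistic validation.
commit;

-- Session 2 then fails validation and is rolled back, typically with
--   "Could not serialize access due to concurrent update"
commit;
```

The aborted transaction can simply be retried by the application, which is the usual pattern with OCC.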
- -- **[Local and Global MOT Memory](local-and-global-mot-memory.md)** - -- **[SILO Enhancements for MOT](silo-enhancements-for-mot.md)** - -- **[Isolation Levels](isolation-levels.md)** - -- **[Optimistic Concurrency Control](optimistic-concurrency-control.md)** - -- **[OCC vs 2PL Differences by Example](occ-vs-2pl-differences-by-example.md)** - - diff --git a/content/en/docs/Developerguide/configuring-durability.md b/content/en/docs/Developerguide/configuring-durability.md deleted file mode 100644 index 5acf279eec44e991f26941ee2ced7d3c28618736..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/configuring-durability.md +++ /dev/null @@ -1,4 +0,0 @@ -# Configuring Durability - -To ensure strict consistency, configure the **synchronous\_commit **parameter to **On** in the **postgres.conf **configuration file. - diff --git a/content/en/docs/Developerguide/configuring-logging.md b/content/en/docs/Developerguide/configuring-logging.md deleted file mode 100644 index b24335d247ae306af6ae39c4f76c875a4202bc58..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/configuring-logging.md +++ /dev/null @@ -1,15 +0,0 @@ -# Configuring Logging - -Two synchronous transaction logging options and one asynchronous transaction logging option are supported by the standard openGauss disk engine. - -The determination of whether synchronous or asynchronous transaction logging is performed is configured in the **synchronous\_commit \(On = Synchronous\)** parameters in the **postgres.conf** configuration file. - -Set the **enable\_redo\_log** parameter to **True** in the REDO LOG section of the **mot.conf **configuration file. - -If a synchronous mode of transaction logging has been selected \(**synchronous\_commit** = **On**, as described above\), then the **enable\_group\_commit** parameter in the mot.conf configuration file determines whether the **Group Synchronous Redo Logging** option or the **Synchronous Redo Logging** option is used. For **Group Synchronous Redo Logging**, you must also define in the **mot.conf** file which of the following thresholds determine when a group of transactions is recorded in the WAL - -- **group\_commit\_size** ** **The quantity of committed transactions in a group. For example, **16** means that when 16 transactions in the same group have been committed by a client application, then an entry is written to disk in the WAL Redo Log for all 16 transactions. -- **group\_commit\_timeout **A timeout period in ms. For example, **10** means that after 10 ms, an entry is written to disk in the WAL Redo Log for each of the transactions in the same group that have been committed by their client application in the last 10 ms. - -You may refer to ++ for more information. - diff --git a/content/en/docs/Developerguide/converting-a-disk-table-into-a-mot-table.md b/content/en/docs/Developerguide/converting-a-disk-table-into-a-mot-table.md deleted file mode 100644 index fb8757ce676c432fdc6eab4532a0b9685b38b186..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/converting-a-disk-table-into-a-mot-table.md +++ /dev/null @@ -1,15 +0,0 @@ -# Converting a Disk Table into a MOT Table - -The direct conversion of disk tables into MOT tables is not yet possible, meaning that no ALTER TABLE statement yet exists that converts a disk-based table into a MOT table. 
- -The following describes how to manually perform a few steps in order to convert a disk-based table into a MOT table, as well as how the **gs\_dump** tool is used to export data and the **gs\_restore **tool is used to import data. - -- **[Prerequisite Check](prerequisite-check.md)** - -- **[Converting](converting.md)** - -- **[Conversion Example](conversion-example.md)** - -- **[Query Native Compilation](query-native-compilation.md)** - - diff --git a/content/en/docs/Developerguide/conversion-example.md b/content/en/docs/Developerguide/converting-a-disk-table-into-an-mot-table.md similarity index 44% rename from content/en/docs/Developerguide/conversion-example.md rename to content/en/docs/Developerguide/converting-a-disk-table-into-an-mot-table.md index 1011b8e01f0a84236c1cf03417cd38fd3fab38e4..ab0560c86775c89a1f38d1eb77d1d5e92ad5dbfc 100644 --- a/content/en/docs/Developerguide/conversion-example.md +++ b/content/en/docs/Developerguide/converting-a-disk-table-into-an-mot-table.md @@ -1,84 +1,110 @@ -# Conversion Example - -Let's say that you have a database name **benchmarksql** and a table named **customer** \(which is a disk-based table\) to be migrated it into a MOT table. - -To migrate the customer table into a MOT table, perform the following - -1. Check your source table column types. Verify that all types are supported by MOT, refer to section _Unsupported Data Types_. - - ``` - benchmarksql-# \d+ customer - Table "public.customer" - Column | Type | Modifiers | Storage | Stats target | Description - --------+---------+-----------+---------+--------------+------------- - x | integer | | plain | | - y | integer | | plain | | - Has OIDs: no - Options: orientation=row, compression=no - ``` - -2. Check your source table data. - - ``` - benchmarksql=# select * from customer; - x | y - ---+--- - 1 | 2 - 3 | 4 - (2 rows) - ``` - -3. Dump table data only by using **gs\_dump**. - - ``` - $ gs_dump -Fc benchmarksql -a --table customer -f customer.dump - gs_dump[port='15500'][benchmarksql][2020-06-04 16:45:38]: dump database benchmarksql successfully - gs_dump[port='15500'][benchmarksql][2020-06-04 16:45:38]: total time: 332 ms - ``` - -4. Rename the source table name. - - ``` - benchmarksql=# alter table customer rename to customer_bk; - ALTER TABLE - ``` - -5. Create the MOT table to be exactly the same as the source table. - - ``` - benchmarksql=# create foreign table customer (x int, y int); - CREATE FOREIGN TABLE - benchmarksql=# select * from customer; - x | y - ---+--- - (0 rows) - ``` - -6. Import the source dump data into the new MOT table. - - ``` - $ gs_restore -C -d benchmarksql customer.dump - restore operation successful - total time: 24 ms - ``` - -7. Check that the data was imported successfully. - - ``` - benchmarksql=# select * from customer; - x | y - ---+--- - 1 | 2 - 3 | 4 - (2 rows) - - benchmarksql=# \d - List of relations - Schema | Name | Type | Owner | Storage - --------+-------------+---------------+--------+---------------------------------- - public | customer | foreign table | aharon | - public | customer_bk | table | aharon | {orientation=row,compression=no} - (2 rows) - ``` - - +# Converting a Disk Table into an MOT Table + +The direct conversion of disk tables into MOT tables is not yet possible, meaning that no ALTER TABLE statement yet exists that converts a disk-based table into an MOT table. 
+ +The following describes how to manually perform a few steps in order to convert a disk-based table into an MOT table, as well as how the **gs\_dump** tool is used to export data and the **gs\_restore** tool is used to import data. + +## Prerequisite Check + +Check that the schema of the disk table to be converted into an MOT table contains all required columns. + +Check whether the schema contains any unsupported column data types, as described in the Unsupported Data Types_ _section. + +If a specific column is not supported, then it is recommended to first create a secondary disk table with an updated schema. This schema is the same as the original table, except that all the unsupported types have been converted into supported types. + +Afterwards, use the following script to export this secondary disk table and then import it into an MOT table. + +## Converting + +To covert a disk-based table into an MOT table, perform the following – + +1. Suspend application activity. +2. Use **gs\_dump** tool to dump the table’s data into a physical file on disk. Make sure to use the **data only**. +3. Rename your original disk-based table. +4. Create an MOT table with the same table name and schema. Make sure to use the create FOREIGN keyword to specify that it will be an MOT table. +5. Use **gs\_restore** to load/restore data from the disk file into the database table. +6. Visually/manually verify that all the original data was imported correctly into the new MOT table. An example is provided below. +7. Resume application activity. + +**IMPORTANT Note** **–** In this way, since the table name remains the same, application queries and relevant database stored-procedures will be able to access the new MOT table seamlessly without code changes. Please note that MOT does not currently support cross-engine multi-table queries \(such as by using Join, Union and sub-query\) and cross-engine multi-table transactions. Therefore, if an original table is accessed somewhere in a multi-table query, stored procedure or transaction, you must either convert all related disk-tables into MOT tables or alter the relevant code in the application or the database. + +## Conversion Example + +Let's say that you have a database name **benchmarksql** and a table named **customer** \(which is a disk-based table\) to be migrated it into an MOT table. + +To migrate the customer table into an MOT table, perform the following – + +1. Check your source table column types. Verify that all types are supported by MOT, refer to section _Unsupported Data Types_. + + ``` + benchmarksql-# \d+ customer + Table "public.customer" + Column | Type | Modifiers | Storage | Stats target | Description + --------+---------+-----------+---------+--------------+------------- + x | integer | | plain | | + y | integer | | plain | | + Has OIDs: no + Options: orientation=row, compression=no + ``` + +2. Check your source table data. + + ``` + benchmarksql=# select * from customer; + x | y + ---+--- + 1 | 2 + 3 | 4 + (2 rows) + ``` + +3. Dump table data only by using **gs\_dump**. + + ``` + $ gs_dump -Fc benchmarksql -a --table customer -f customer.dump + gs_dump[port='15500'][benchmarksql][2020-06-04 16:45:38]: dump database benchmarksql successfully + gs_dump[port='15500'][benchmarksql][2020-06-04 16:45:38]: total time: 332 ms + ``` + +4. Rename the source table name. + + ``` + benchmarksql=# alter table customer rename to customer_bk; + ALTER TABLE + ``` + +5. Create the MOT table to be exactly the same as the source table. 
+ + ``` + benchmarksql=# create foreign table customer (x int, y int); + CREATE FOREIGN TABLE + benchmarksql=# select * from customer; + x | y + ---+--- + (0 rows) + ``` + +6. Import the source dump data into the new MOT table. + + ``` + $ gs_restore -C -d benchmarksql customer.dump + restore operation successful + total time: 24 ms + Check that the data was imported successfully. + benchmarksql=# select * from customer; + x | y + ---+--- + 1 | 2 + 3 | 4 + (2 rows) + + benchmarksql=# \d + List of relations + Schema | Name | Type | Owner | Storage + --------+-------------+---------------+--------+---------------------------------- + public | customer | foreign table | aharon | + public | customer_bk | table | aharon | {orientation=row,compression=no} + (2 rows) + ``` + + diff --git a/content/en/docs/Developerguide/converting.md b/content/en/docs/Developerguide/converting.md deleted file mode 100644 index 644a78dd05a14c8e3cf1678488960268282d4bcb..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/converting.md +++ /dev/null @@ -1,15 +0,0 @@ -# Converting - -To covert a disk-based table into a MOT table, perform the following: - -1. Suspend application activity. -2. Use **gs\_dump** tool to dump the table's data into a physical file on disk. Make sure to use the **data only**. -3. Rename your original disk-based table. -4. Create a MOT table with the same table name and schema. Make sure to use the create FOREIGN keyword to specify that it will be a MOT table. -5. Use** gs\_restore** to load/restore data from the disk file into the database table. -6. Visually/manually verify that all the original data was imported correctly into the new MOT table. An example is provided below. -7. Resume application activity. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->In this way, since the table name remains the same, application queries and relevant database stored-procedures will be able to access the new MOT table seamlessly without code changes. Please note that MOT does not currently support cross-engine multi-table queries \(such as by using Join, Union and sub-query\) and cross-engine multi-table transactions. Therefore, if an original table is accessed somewhere in a multi-table query, stored procedure or transaction, you must either convert all related disk-tables into MOT tables or alter the relevant code in the application or the database. - diff --git a/content/en/docs/Developerguide/creating-an-index-for-mot-table.md b/content/en/docs/Developerguide/creating-an-index-for-an-mot-table.md similarity index 75% rename from content/en/docs/Developerguide/creating-an-index-for-mot-table.md rename to content/en/docs/Developerguide/creating-an-index-for-an-mot-table.md index f69fce4861334b1ec708f0a2c119bbaa14684f8a..af987e5ee8c8481979327f36a4b094f80b550347 100644 --- a/content/en/docs/Developerguide/creating-an-index-for-mot-table.md +++ b/content/en/docs/Developerguide/creating-an-index-for-an-mot-table.md @@ -1,32 +1,33 @@ -# Creating an Index for MOT Table - -Standard PostgreSQL create and drop index statements are supported. 
- -For example - -``` -create index text_index1 on test(x) ; -The following is a complete example of creating an index for the ORDER table in a TPC-C workload – -create FOREIGN table bmsql_oorder ( - o_w_id integer not null, - o_d_id integer not null, - o_id integer not null, - o_c_id integer not null, - o_carrier_id integer, - o_ol_cnt integer, - o_all_local integer, - o_entry_d timestamp, - primary key (o_w_id, o_d_id, o_id) -); -create index bmsql_oorder_index1 on bmsql_oorder(o_w_id, o_d_id, o_c_id, o_id) ; -``` - -``` - -``` - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->There is no need to specify the **FOREIGN** keyword before the MOT table name, because it is only created for create and drop table commands. - -For MOT index limitations, see the Index subsection under the _SQL Coverage and Limitations _section. - +# Creating an Index for an MOT Table + +Standard PostgreSQL create and drop index statements are supported. + +For example – + +``` +create index text_index1 on test(x) ; +``` + +The following is a complete example of creating an index for the ORDER table in a TPC-C workload – + +``` +create FOREIGN table bmsql_oorder ( + o_w_id integer not null, + o_d_id integer not null, + o_id integer not null, + o_c_id integer not null, + o_carrier_id integer, + o_ol_cnt integer, + o_all_local integer, + o_entry_d timestamp, + primarykey (o_w_id, o_d_id, o_id) +); + +create index bmsql_oorder_index1 on bmsql_oorder(o_w_id, o_d_id, o_c_id, o_id) ; +``` + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>There is no need to specify the **FOREIGN** keyword before the MOT table name, because it is only created for create and drop table commands. + +For MOT index limitations, see the Index subsection under the _SQL Coverage and Limitations _section. + diff --git a/content/en/docs/Developerguide/creating-dropping-a-mot-table.md b/content/en/docs/Developerguide/creating-dropping-an-mot-table.md similarity index 55% rename from content/en/docs/Developerguide/creating-dropping-a-mot-table.md rename to content/en/docs/Developerguide/creating-dropping-an-mot-table.md index 0bd78752c6ed2126bff460f332ae649eb82cf132..3aa869a0f1bfef2e8314e63a55202f0a3572dc34 100644 --- a/content/en/docs/Developerguide/creating-dropping-a-mot-table.md +++ b/content/en/docs/Developerguide/creating-dropping-an-mot-table.md @@ -1,22 +1,22 @@ -# Creating/Dropping a MOT Table - -Creating a Memory Optimized Table \(MOT\) is very simple. Only the create and drop table statements in MOT differ from the statements for disk-based tables in openGauss. The syntax of **all other **commands for SELECT, DML and DDL are the same for MOT tables as for openGauss disk-based tables. - -- To create a MOT table - - ``` - create FOREIGN table test(x int) [server mot_server]; - ``` - -- Always use the FOREIGN keyword to refer to MOT tables. -- The \[server mot\_server\] part is optional when creating a MOT table because MOT is an integrated engine, not a separate server. -- The above is an extremely simple example creating a table named **test** with a single integer column named **x**. In the next section \(**Creating an Index**\) a more realistic example is provided. -- To drop a MOT table named test - - ``` - drop FOREIGN table test; - ``` - - -For a description of the limitations of supported features for MOT tables, such as data types, see the [SQL Coverage and Limitations](sql-coverage-and-limitations.md) section. - +# Creating/Dropping an MOT Table + +Creating a Memory Optimized Table \(MOT\) is very simple. 
Only the create and drop table statements in MOT differ from the statements for disk-based tables in openGauss. The syntax of **all other** commands for SELECT, DML and DDL are the same for MOT tables as for openGauss disk-based tables. + +- To create an MOT table – + + ``` + create FOREIGN table test(x int) [server mot_server]; + ``` + +- Always use the FOREIGN keyword to refer to MOT tables. +- The \[server mot\_server\] part is optional when creating an MOT table because MOT is an integrated engine, not a separate server. +- The above is an extremely simple example creating a table named **test** with a single integer column named **x**. In the next section \(**Creating an Index**\) a more realistic example is provided. +- To drop an MOT table named test – + + ``` + drop FOREIGN table test; + ``` + + +For a description of the limitations of supported features for MOT tables, such as data types, see the [MOT SQL Coverage and Limitations](mot-sql-coverage-and-limitations.md) section. + diff --git a/content/en/docs/Developerguide/default-mot-conf.md b/content/en/docs/Developerguide/default-mot-conf.md deleted file mode 100644 index 9137db45f0212a0b0fc10921f6ab2e59f4d28088..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/default-mot-conf.md +++ /dev/null @@ -1,13 +0,0 @@ -# Default MOT.conf - -The minimum settings and configuration specify to point the **Postgresql.conf** file to the location of the **MOT.conf** file. - -``` -Postgresql.conf -mot_config_file = '/tmp/gauss/ MOT.conf' -``` - -Ensure that the value of the max\_process\_memory setting is sufficient to include the global \(data and index\) and local \(sessions\) memory of MOT tables. - -The default content of** MOT.conf **is sufficient to get started. The settings can be optimized later. - diff --git a/content/en/docs/Developerguide/deployment.md b/content/en/docs/Developerguide/deployment.md deleted file mode 100644 index c41a4cc476d52669c2446e274a0e575bdf92d391..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/deployment.md +++ /dev/null @@ -1,11 +0,0 @@ -# Deployment - -The following sections describe various mandatory and optional settings for optimal deployment. - -- **[Server Optimization – x86](server-optimization-x86.md)** - -- **[Server Optimization – ARM Huawei Taishan 2P/4P](server-optimization-arm-huawei-taishan-2p-4p.md)** - -- **[MOT Configuration Settings](mot-configuration-settings.md)** - - diff --git a/content/en/docs/Developerguide/design-principles.md b/content/en/docs/Developerguide/design-principles.md deleted file mode 100644 index 742d38d9342da1d09d2cbc909ed8e5ae02f9f6ba..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/design-principles.md +++ /dev/null @@ -1,12 +0,0 @@ -# Design Principles - -To achieve the requirements described above \(especially in an environment with many-cores\), our storage engine's architecture implements the following techniques and strategies - -- Data and indexes only reside in memory. -- Data and indexes are **not** laid out with physical partitions \(because these might achieve lower performance for certain types of applications\). -- Transaction concurrency control is based on Optimistic Concurrency Control \(OCC\) without any centralized contention points. See the ++ section for more information about OCC. -- Parallel Redo Logs \(ultimately per core\) are used to efficiently avoid a central locking point. See the ++ section for more information about Parallel Redo Logs. 
-- Indexes are lock-free. See the ++ section for more information about lock-free indexes. -- NUMA-awareness memory allocation is used to avoid cross-socket access, especially for session lifecycle objects. See the ++ section for more information about NUMA‑awareness. -- A customized MOT memory management allocator with pre-cached object pools is used to avoid expensive runtime allocation and extra points of contention. This dedicated MOT memory allocator makes memory allocation more efficient by pre‑accessing relatively large chunks of memory from the operation system as needed and then divvying it out to the MOT as needed. - diff --git a/content/en/docs/Developerguide/durability-0.md b/content/en/docs/Developerguide/durability-0.md deleted file mode 100644 index 4f1e1c98a9af6131ad28c3bd51f5ac24da4cd32b..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/durability-0.md +++ /dev/null @@ -1,15 +0,0 @@ -# Durability - -Write-Ahead Logging \(WAL\) is a standard method for ensuring data durability. WAL's central concept is that changes to data files \(where tables and indexes reside\) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage. - -The MOT is fully integrated with the envelope openGauss logging facilities. Besides durability and additional benefit of this method is being able to use it for replication purposes as well. - -Three logging methods are supported, two standard “Synchronous” and “Asynchronous” which are also supported by the standard disk-engine. In addition, in the MOT a new “Group-Commit” with special NUMA-Awareness optimization is introduced. The Group-Commit provides the top performance while maintaining ACID properties. - -- **[Exception Handling](exception-handling.md)** - -- **[Logging](logging.md)** - -- **[Checkpoint](checkpoint.md)** - - diff --git a/content/en/docs/Developerguide/error-log_mot.md b/content/en/docs/Developerguide/error-log_mot.md deleted file mode 100644 index 7ff363355ca03885bda847295d900e42dc20eefa..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/error-log_mot.md +++ /dev/null @@ -1,30 +0,0 @@ -# ERROR LOG \(MOT\) - -- log\_level = INFO - - Configures the log level of messages issued by the MOT engine and recorded in the Error log of the database server. Valid values are PANIC, ERROR, WARN, INFO, TRACE, DEBUG, DIAG1 and DIAG2. - - -- Log/COMPONENT/LOGGER=LOG\_LEVEL - - Configures specific loggers using the syntax described below. - - For example, to configure the TRACE log level for the ThreadIdPool logger in system component, use the following syntax - - ``` - Log/System/ThreadIdPool=TRACE - ``` - - To configure the log level for all loggers under some component, use the following syntax - - ``` - Log/COMPONENT=LOG_LEVEL - ``` - - For example - - ``` - Log/System=DEBUG - ``` - - diff --git a/content/en/docs/Developerguide/error-messages.md b/content/en/docs/Developerguide/error-messages.md deleted file mode 100644 index 7ffb7d931176d8cc6f5cabf8f1f4f2ed383225fe..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/error-messages.md +++ /dev/null @@ -1,14 +0,0 @@ -# Error Messages - -Errors may be caused by a variety of scenarios. All errors are logged in the database server log file. In addition, user-related errors are returned to the user as part of the response to the query, transaction or stored procedure execution or to database administration action. 
- -- Errors reported in the Server log include – Function, Entity, Context, Error message, Error description and Severity. -- Errors reported to users are translated into standard PostgreSQL error codes and may consist of a MOT-specific message and description. - -The following lists the error messages, error descriptions and error codes. The error code is actually an internal code and not logged or returned to users. - -- **[Errors Written the Log File](errors-written-the-log-file.md)** - -- **[Errors Returned to the User](errors-returned-to-the-user.md)** - - diff --git a/content/en/docs/Developerguide/errors-returned-to-the-user.md b/content/en/docs/Developerguide/errors-returned-to-the-user.md deleted file mode 100644 index 4f44e21548c185a99a9ffa596cd8d7d37101b85e..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/errors-returned-to-the-user.md +++ /dev/null @@ -1,332 +0,0 @@ -# Errors Returned to the User - -The following lists the errors that are written to the database server log file and are returned to the user. - -MOT returns PG standard error codes to the envelope using a Return Code \(RC\). Some RCs cause the generation of an error message to the user who is interacting with the database. - -The PG code \(described below\) is returned internally by MOT to the database envelope, which reacts to it according to standard PG behavior. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->%s, %u and %lu in the message are replaced by relevant error information, such as query, table name or another information. ->- %s – String ->- %u – Number ->- %lu – Number - -**Table 1** Errors Returned to the User and Logged to the Log File - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

| Short and Long Description Returned to the User | PG Code | Internal Error Code |
|---|---|---|
| Success. Denotes success. | ERRCODE_SUCCESSFUL_COMPLETION | RC_OK = 0 |
| Failure. Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_ERROR = 1 |
| Unknown error has occurred. Denotes aborted operation. | ERRCODE_FDW_ERROR | RC_ABORT |
| Column definition of %s is not supported. Column type %s is not supported yet. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_UNSUPPORTED_COL_TYPE |
| Column definition of %s is not supported. Column type Array of %s is not supported yet. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_UNSUPPORTED_COL_TYPE_ARR |
| Column size %d exceeds max tuple size %u. Column definition of %s is not supported. | ERRCODE_FEATURE_NOT_SUPPORTED | RC_EXCEEDS_MAX_ROW_SIZE |
| Column name %s exceeds max name size %u. Column definition of %s is not supported. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_COL_NAME_EXCEEDS_MAX_SIZE |
| Column size %d exceeds max size %u. Column definition of %s is not supported. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_COL_SIZE_INVLALID |
| Cannot create table. Cannot add column %s; as the number of declared columns exceeds the maximum declared columns. | ERRCODE_FEATURE_NOT_SUPPORTED | RC_TABLE_EXCEEDS_MAX_DECLARED_COLS |
| Cannot create index. Total column size is greater than maximum index size %u. | ERRCODE_FDW_KEY_SIZE_EXCEEDS_MAX_ALLOWED | RC_INDEX_EXCEEDS_MAX_SIZE |
| Cannot create index. Total number of indexes for table %s is greater than the maximum number of indexes allowed %u. | ERRCODE_FDW_TOO_MANY_INDEXES | RC_TABLE_EXCEEDS_MAX_INDEXES |
| Cannot execute statement. Maximum number of DDLs per transaction reached the maximum %u. | ERRCODE_FDW_TOO_MANY_DDL_CHANGES_IN_TRANSACTION_NOT_ALLOWED | RC_TXN_EXCEEDS_MAX_DDLS |
| Unique constraint violation. Duplicate key value violates unique constraint "%s". Key %s already exists. | ERRCODE_UNIQUE_VIOLATION | RC_UNIQUE_VIOLATION |
| Table "%s" does not exist. | ERRCODE_UNDEFINED_TABLE | RC_TABLE_NOT_FOUND |
| Index "%s" does not exist. | ERRCODE_UNDEFINED_TABLE | RC_INDEX_NOT_FOUND |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_FOUND |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_NOT_FOUND |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_DELETED |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INSERT_ON_EXIST |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INDEX_RETRY_INSERT |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INDEX_DELETE |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_NOT_VISIBLE |
| Memory is temporarily unavailable. | ERRCODE_OUT_OF_LOGICAL_MEMORY | RC_MEMORY_ALLOCATION_ERROR |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_ILLEGAL_ROW_STATE |
| Null constraint violated. NULL value cannot be inserted into non-null column %s at table %s. | ERRCODE_FDW_ERROR | RC_NULL_VIOLATION |
| Critical error. Critical error: %s. | ERRCODE_FDW_ERROR | RC_PANIC |
| A checkpoint is in progress – cannot truncate table. | ERRCODE_FDW_OPERATION_NOT_SUPPORTED | RC_NA |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_MAX_VALUE |
| \<recovery message\> | ERRCODE_CONFIG_FILE_ERROR | |
| \<recovery message\> | ERRCODE_INVALID_TABLE_DEFINITION | |
| Memory engine – Failed to perform commit prepared. | ERRCODE_INVALID_TRANSACTION_STATE | |
| Invalid option \<option name\> | ERRCODE_FDW_INVALID_OPTION_NAME | |
| Invalid memory allocation request size. | ERRCODE_INVALID_PARAMETER_VALUE | |
| Memory is temporarily unavailable. | ERRCODE_OUT_OF_LOGICAL_MEMORY | |
| Could not serialize access due to concurrent update. | ERRCODE_T_R_SERIALIZATION_FAILURE | |
| Alter table operation is not supported for memory table. Cannot create MOT tables while incremental checkpoint is enabled. Re-index is not supported for memory tables. | ERRCODE_FDW_OPERATION_NOT_SUPPORTED | |
| Allocation of table metadata failed. | ERRCODE_OUT_OF_MEMORY | |
| Database with OID %u does not exist. | ERRCODE_UNDEFINED_DATABASE | |
| Value exceeds maximum precision: %d. | ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE | |
| You have reached a maximum logical capacity %lu of allowed %lu. | ERRCODE_OUT_OF_LOGICAL_MEMORY | |
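As a hypothetical illustration of how these codes surface, the sequence below reuses the **bmsql_oorder** MOT table defined earlier in this guide; the second insert repeats the primary key and is rejected with the unique-constraint messages listed above (internally RC_UNIQUE_VIOLATION, reported as ERRCODE_UNIQUE_VIOLATION).

```
insert into bmsql_oorder (o_w_id, o_d_id, o_id, o_c_id) values (1, 1, 1, 10);
-- Same primary key (o_w_id, o_d_id, o_id) as the previous row, so this insert
-- fails with "Duplicate key value violates unique constraint" / "Key ... already exists."
insert into bmsql_oorder (o_w_id, o_d_id, o_id, o_c_id) values (1, 1, 1, 20);
```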
- diff --git a/content/en/docs/Developerguide/errors-written-the-log-file.md b/content/en/docs/Developerguide/errors-written-the-log-file.md deleted file mode 100644 index 0249ff4b523e40d1583808d0a3379b86e36376b9..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/errors-written-the-log-file.md +++ /dev/null @@ -1,76 +0,0 @@ -# Errors Written the Log File - -All errors are logged in the database server log file. The following lists the errors that are written to the database server log file and are **not** returned to the user. The log is located in the data folder and named **postgresql-DATE-TIME.log**. - -**Table 1** Errors Written Only to the Log File - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

| Message in the Log | Error Internal Code |
|---|---|
| Error code denoting success | MOT_NO_ERROR 0 |
| Out of memory | MOT_ERROR_OOM 1 |
| Invalid configuration | MOT_ERROR_INVALID_CFG 2 |
| Invalid argument passed to function | MOT_ERROR_INVALID_ARG 3 |
| System call failed | MOT_ERROR_SYSTEM_FAILURE 4 |
| Resource limit reached | MOT_ERROR_RESOURCE_LIMIT 5 |
| Internal logic error | MOT_ERROR_INTERNAL 6 |
| Resource unavailable | MOT_ERROR_RESOURCE_UNAVAILABLE 7 |
| Unique violation | MOT_ERROR_UNIQUE_VIOLATION 8 |
| Invalid memory allocation size | MOT_ERROR_INVALID_MEMORY_SIZE 9 |
| Index out of range | MOT_ERROR_INDEX_OUT_OF_RANGE 10 |
| Error code unknown | MOT_ERROR_INVALID_STATE 11 |
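The following sketch is one possible way to inspect these messages; it assumes a typical deployment in which the server log resides in the data-node folder and follows the postgresql-DATE-TIME.log naming described above (the path below is a placeholder, adjust it to your installation).

```
# Placeholder path: replace with your actual data folder.
cd /path/to/data_node
# Show the newest server log, then scan it for error entries.
ls -t postgresql-*.log | head -n 1
grep -i "error" "$(ls -t postgresql-*.log | head -n 1)"
```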
- diff --git a/content/en/docs/Developerguide/extended-fdw-and-other-opengauss-features.md b/content/en/docs/Developerguide/extended-fdw-and-other-opengauss-features.md index ffd96e98236b946c2eec8770eb3722e2ee719cf8..4574a663f8dd7320de9095066b15d2c5419c01b1 100644 --- a/content/en/docs/Developerguide/extended-fdw-and-other-opengauss-features.md +++ b/content/en/docs/Developerguide/extended-fdw-and-other-opengauss-features.md @@ -1,50 +1,52 @@ -# Extended FDW and Other openGauss Features - -openGauss is based on PostgreSQL, which does not have a built-in storage engine adapter, such as MySQL handlerton. To enable the integration of the MOT storage engine into openGauss, we decided to make use the existing Foreign Data Wrapper \(FDW\) mechanism and to extend it. - -With the introduction of FDW into PostgreSQL 9.1 it became possible to access externally managed databases in a way that presented these foreign tables and data sources as united, locally accessible relations. - -In contrast, the MOT storage engine is embedded inside openGauss and its tables are managed by it. Access to tables is controlled by the openGauss planner and executor. MOT gets logging and checkpointing services from openGauss and participates in the recovery process of the openGauss and more processes. We refer to all the components that are in use or are accessing the MOT storage engine as the _Envelope_. - -The following figure shows how the MOT storage engine is embedded and its bi-directional access to database functionality. - -![](figures/en-us_image_0260488301.png) - -We have extended the capabilities of FDW by extending and modifying the FdwRoutine structure. Support for new features was also added, such as Add index, Drop index/table, Truncate, Vacuum and Table/Index Memory Statistics. A significant emphasis was put on integration with openGauss logging, replication and checkpointing mechanisms in order to provide consistency for cross-table transactions through failures. In this case, the MOT itself sometimes initiates calls to openGauss functionality through the FDW layer. Such calls were never required previous to the introduction of MOT. - -## Creating Tables and Indexes - -In order to support the creation of MOT tables, the standard FDW syntax was reused – **create FOREIGN table**. - -The MOT FDW mechanism passes the instruction to the MOT storage engine for actual table creation. Similarly we needed ability to support index creation \(create index …\), previously not available and unneeded in FDW, since its tables are managed externally. - -To support both, in the MOT FDW function ValidateTableDef actually creates the specified table, and in addition it handles index creation on that relation and DROP TABLE and DROP INDEX, as well as VACUUM and ALTER TABLE, which also previously were not supported in FDW. - -## Index Usage for Planning and Execution - -Query life breaks into two phases, the planning and the execution. In the planning phase, which may take place once per multiple executions, the best index for the scan is chosen. The choice is made based on matching query's WHERE clauses, JOIN clauses, and ORDER BY conditions. During execution a query iterates on the relevant table rows, and performs some work, e.g., update or delete, per iteration. An insert is a special case where the table adds the row to all indexes and no scanning is required. - -Planner: In the standard FDW the query is passed for execution to a foreign data source, thus indexes filtering and actual planning, i.e. 
choice of indexes, is not done locally in the database, rather in the external data source. Internally the FDW returns to the database planner a general plan. In case of MOT tables, similarly as for database Disk-tables, relevant MOT indexes are filtered, matched, and the one that will minimize the set of traversed rows is selected and added the plan. - -Executor: The query execution is using the chosen MOT index for iteration over the relevant rows of the table. Each row is inspected by openGauss envelope, and according to the query conditions, an update or delete is called for it. - -## Durability and HA - -While a storage engine is responsible to store, read, update and delete data in the underlying memory and storage systems. The Logging, checkpointing and recovery are not parts of the storage engine, especially as some transactions encompass multiple tables with different storage engines. Thus, to persist and replicate the data we use the high-availability facilities from the openGauss envelope as follows: - -To ensure Durability, the MOT persists data by write-ahead logging \(WAL\) records using the openGauss's XLOG interface. By doing that we also gain openGauss's replication capabilities that are using the same APIs. - -Checkpointing: MOT checkpoint is enabled by registering a callback to openGauss checkpointer. Whenever a general database checkpoint is performed, the MOT checkpoint process is called as well. MOT keeps the checkpoint's LSN \(Log Sequence Number\) in order to be aligned with openGauss recovery. The MOT checkpointing algorithm highly optimized, asynchronous and is not stopping concurrent transactions. More information on the matter can be found in section [CHECKPOINT \(MOT\)](checkpoint-(mot).md). - -Recovery: Upon startup, openGauss first calls an MOT callback that first recovers the MOT checkpoint by loading into memory rows and creating indexes, followed by execution of the WAL recovery by replaying records according to the checkpoint's LSN. MOT checkpoint is recovered in parallel using multiple threads, each one reading a different data segment. This makes MOT checkpoint recovery quite fast on many-core hardware, though still potentially slower compared to disk-based tables where only WAL records are replayed. - -## VACUUM and DROP - -To complete the functionality of MOT we added support for VACUUM, DROP TABLE, and DROP INDEX. All three execute with exclusive table lock, i.e. without concurrent transactions on the table. The system VACUUM calls a new FDW function to perform the MOT vacuuming, while DROP was added to the ValidateTableDef\(\) function. - -Deleting memory pools: Each index and table tracks all memory pools it uses. At DROP INDEX command its metadata is removed and memory pools are deleted as one consecutive block. The MOT VACUUM is only doing compaction of the used memory, as memory reclamation is done continuously in the background by the epoch based Garbage Collector \(GC\). To perform the compaction we switch the index or the table to new memory pools, traverse all the live data, delete each row and insert it using the new pools, and finally delete the pools as done for drop. - -## Query Native Compilation \(JIT\) - -The FDW adapter to MOT engine also contains a lite execution path that employs Just-In-Time \(JIT\) compiled query execution using the LLVM compiler. 
More information about MOT Query Native Compilation or can be found in section .[Query Native Compilation \(JIT\)](query-native-compilation-(jit).md) - +# Extended FDW and Other openGauss Features + +openGauss is based on PostgreSQL, which does not have a built-in storage engine adapter, such as MySQL handlerton. To enable the integration of the MOT storage engine into openGauss, we have leveraged and extended the existing Foreign Data Wrapper \(FDW\) mechanism. With the introduction of FDW into PostgreSQL 9.1, externally managed databases can now be accessed in a way that presents these foreign tables and data sources as united, locally accessible relations. + +In contrast, the MOT storage engine is embedded inside openGauss and its tables are managed by it. Access to tables is controlled by the openGauss planner and executor. MOT gets logging and checkpointing services from openGauss and participates in the openGauss recovery process, in addition to other processes. + +We refer to all the components that are in use or are accessing the MOT storage engine as the _Envelope_. + +The following figure shows how the MOT storage engine is embedded inside openGauss and its bi‑directional access to database functionality. + +**Figure 1** MOT Storage Engine Embedded inside openGauss – FDW Access to External Databases +![](figures/mot-architecture.png "mot-architecture") + +We have enhanced the capabilities of FDW by extending and modifying the FdwRoutine structure in order to introduce features and calls that were not required before the introduction of MOT. For example, support was added for the following new features – Add Index, Drop Index/Table, Truncate, Vacuum and Table/Index Memory Statistics. A significant emphasis was put on integration with the openGauss logging, replication and checkpointing mechanisms in order to provide consistency for cross-table transactions through failures. In some cases, MOT itself initiates calls to openGauss functionality through the FDW layer. + +## Creating Tables and Indexes + +In order to support the creation of MOT tables, standard FDW syntax was reused. + +For example – **create FOREIGN table**. + +The MOT FDW mechanism passes the instruction to the MOT storage engine for actual table creation. Similarly, we support index creation \(create index …\). This capability was not previously available in FDW, because it was not needed – FDW tables are managed externally. + +To support both in MOT FDW, the **ValidateTableDef** function actually creates the specified table. It also handles the index creation of that relation, as well as DROP TABLE and DROP INDEX, in addition to VACUUM and ALTER TABLE, which were not previously supported in FDW. + +## Index Usage for Planning and Execution + +A query has two phases – **Planning** and **Execution**. During the Planning phase \(which may take place once per multiple executions\), the best index for the scan is chosen. This choice is made by matching the query's WHERE clauses, JOIN clauses and ORDER BY conditions. During execution, a query iterates over the relevant table rows and performs various tasks, such as update or delete, per iteration. An insert is a special case where the table adds the row to all indexes and no scanning is required. + +- **Planner –** In standard FDW, a query is passed for execution to a foreign data source. 
This means that index filtering and the actual planning \(such as the choice of indexes\) is not performed locally in the database; rather, it is performed in the external data source. Internally, the FDW returns a general plan to the database planner. MOT tables are handled in a similar manner to disk-based tables. This means that relevant MOT indexes are filtered and matched, and the index that minimizes the set of traversed rows is selected and added to the plan. +- **Executor –** The Query Executor uses the chosen MOT index in order to iterate over the relevant rows of the table. Each row is inspected by the openGauss envelope and, according to the query conditions, an update or delete is called to handle the relevant row. + +## Durability, Replication and High Availability + +A storage engine is responsible for storing, reading, updating and deleting data in the underlying memory and storage systems. Logging, checkpointing and recovery are not handled by the storage engine, especially because some transactions encompass multiple tables with different storage engines. Therefore, in order to persist and replicate data, the high-availability facilities of the openGauss envelope are used as follows – + +- **Durability –** In order to ensure Durability, the MOT engine persists data by writing Write-Ahead Logging \(WAL\) records using openGauss's XLOG interface. This also provides the benefit of openGauss's replication capabilities, which use the same APIs. You may refer to [MOT Durability Concepts](mot-durability-concepts.md) for more information. +- **Checkpointing –** An MOT Checkpoint is enabled by registering a callback to the openGauss Checkpointer. Whenever a general database Checkpoint is performed, the MOT Checkpoint process is called as well. MOT keeps the Checkpoint's Log Sequence Number \(LSN\) in order to be aligned with openGauss recovery. The MOT Checkpointing algorithm is highly optimized and asynchronous and does not stop concurrent transactions. You may refer to [MOT Checkpoint Concepts](mot-checkpoint-concepts.md) for more information. +- **Recovery –** Upon startup, openGauss first calls an MOT callback that recovers the MOT Checkpoint by loading rows into memory and creating indexes, followed by the execution of WAL recovery, which replays records according to the Checkpoint's LSN. The MOT Checkpoint is recovered in parallel using multiple threads – each thread reads a different data segment. This makes MOT Checkpoint recovery quite fast on many-core hardware, though it is still potentially slower than the recovery of disk-based tables, where only WAL records are replayed. You may refer to [MOT Recovery Concepts](mot-recovery-concepts.md) for more information. + +## VACUUM and DROP + +In order to complete MOT functionality, we added support for VACUUM, DROP TABLE and DROP INDEX. All three execute with an exclusive table lock, meaning that no concurrent transactions are allowed on the table. The system VACUUM calls a new FDW function to perform the MOT vacuuming, while DROP was added to the ValidateTableDef\(\) function. + +## Deleting Memory Pools + +Each index and table tracks all the memory pools that it uses. When a DROP INDEX command is issued, the index metadata is removed and its memory pools are deleted as a single consecutive block. The MOT VACUUM only compacts used memory, because memory reclamation is performed continuously in the background by the epoch-based Garbage Collector \(GC\). 
In order to perform the compaction, we switch the index or the table to new memory pools, traverse all the live data, delete each row and insert it using the new pools and finally delete the pools as is done for a drop. + +## Query Native Compilation \(JIT\) + +The FDW adapter to MOT engine also contains a lite execution path that employs Just-In-Time \(JIT\) compiled query execution using the LLVM compiler. More information about MOT Query Native Compilation can be found in the **Query Native Compilation \(JIT\)** section. + diff --git a/content/en/docs/Developerguide/external-support-tools.md b/content/en/docs/Developerguide/external-support-tools.md deleted file mode 100644 index 17cf50cf08bc5dfb82ecf1dc68ec487cea3a86a4..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/external-support-tools.md +++ /dev/null @@ -1,13 +0,0 @@ -# External Support Tools - -The following external openGauss tools have been modified in order to support MOT. Make sure to use the most recent version of each. An overview describing MOT-related usage is provided below. For a full description of these tools and their usage, refer to the openGauss Tools Reference document. - -- **[gs\_ctl \(Full and Incremental\)](gs_ctl-(full-and-incremental).md)** - -- **[gs\_basebackup](gs_basebackup.md)** - -- **[gs\_dump](gs_dump.md)** - -- **[gs\_restore](gs_restore.md)** - - diff --git a/content/en/docs/Developerguide/figures/4-socket-96-cores-performance-benchmarks.png b/content/en/docs/Developerguide/figures/4-socket-96-cores-performance-benchmarks.png new file mode 100644 index 0000000000000000000000000000000000000000..aa95741527774d02de67066db83769fd12794427 Binary files /dev/null and b/content/en/docs/Developerguide/figures/4-socket-96-cores-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/arm-kunpeng-2-socket-128-cores-performance-benchmarks.png b/content/en/docs/Developerguide/figures/arm-kunpeng-2-socket-128-cores-performance-benchmarks.png index 6bc816ee56ee7cf74dc151cea5024bf129845dee..959f9c9385b5a086afa479b26623b659efb8c118 100644 Binary files a/content/en/docs/Developerguide/figures/arm-kunpeng-2-socket-128-cores-performance-benchmarks.png and b/content/en/docs/Developerguide/figures/arm-kunpeng-2-socket-128-cores-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/arm-kunpeng-2-socket-128-cores-performance-per-core-benchmarks.png b/content/en/docs/Developerguide/figures/arm-kunpeng-2-socket-128-cores-performance-per-core-benchmarks.png new file mode 100644 index 0000000000000000000000000000000000000000..5ec130e84a9c94340cf6619ee22438f73687aaf1 Binary files /dev/null and b/content/en/docs/Developerguide/figures/arm-kunpeng-2-socket-128-cores-performance-per-core-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/arm-kunpeng-4-socket-256-cores-performance-benchmarks.png b/content/en/docs/Developerguide/figures/arm-kunpeng-4-socket-256-cores-performance-benchmarks.png index a7d20ee215519aa886ee026b1d385fee74975bf0..5331dc903269adac2aad8a673d57c0f4ef09d2ad 100644 Binary files a/content/en/docs/Developerguide/figures/arm-kunpeng-4-socket-256-cores-performance-benchmarks.png and b/content/en/docs/Developerguide/figures/arm-kunpeng-4-socket-256-cores-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/asynchronous-logging.png b/content/en/docs/Developerguide/figures/asynchronous-logging.png new file mode 100644 index 
0000000000000000000000000000000000000000..f0115c4f34e6b53c6170650be489f8e4887c0162 Binary files /dev/null and b/content/en/docs/Developerguide/figures/asynchronous-logging.png differ diff --git a/content/en/docs/Developerguide/figures/cold-start-time-performance-benchmarks.png b/content/en/docs/Developerguide/figures/cold-start-time-performance-benchmarks.png new file mode 100644 index 0000000000000000000000000000000000000000..9f44fd4d9fddffaa2bb9743e6b659c7404abf091 Binary files /dev/null and b/content/en/docs/Developerguide/figures/cold-start-time-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/detailed-result-report.png b/content/en/docs/Developerguide/figures/detailed-result-report.png new file mode 100644 index 0000000000000000000000000000000000000000..ae93dca9942d92f00c63bcbea882cdfadcf70f24 Binary files /dev/null and b/content/en/docs/Developerguide/figures/detailed-result-report.png differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270171684.png b/content/en/docs/Developerguide/figures/en-us_image_0270171684.png new file mode 100644 index 0000000000000000000000000000000000000000..135af0baf5fc18e9d76a014e341ce6b87f11a08e Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270171684.png differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270171686.png b/content/en/docs/Developerguide/figures/en-us_image_0270171686.png new file mode 100644 index 0000000000000000000000000000000000000000..6c4717aab4a6a8a42d06352bb2776b35896c8c5f Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270171686.png differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270362942.png b/content/en/docs/Developerguide/figures/en-us_image_0270362942.png new file mode 100644 index 0000000000000000000000000000000000000000..184f2248d2e037ab50f46851273143c2fb95a904 Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270362942.png differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270362943.png b/content/en/docs/Developerguide/figures/en-us_image_0270362943.png new file mode 100644 index 0000000000000000000000000000000000000000..1e97aa4eca7e89b2c8194c30c30ecebed0fc1d8d Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270362943.png differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270362944.png b/content/en/docs/Developerguide/figures/en-us_image_0270362944.png new file mode 100644 index 0000000000000000000000000000000000000000..acd844a569e5efb89441376911e7d98de334a3c2 Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270362944.png differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270447139.jpg b/content/en/docs/Developerguide/figures/en-us_image_0270447139.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3c9b7ed4488edf0b386d937924439448981ccc83 Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270447139.jpg differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270447141.png b/content/en/docs/Developerguide/figures/en-us_image_0270447141.png new file mode 100644 index 0000000000000000000000000000000000000000..4a2ff4e441666446f154811420da438878011543 Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270447141.png differ diff --git a/content/en/docs/Developerguide/figures/en-us_image_0270643558.png 
b/content/en/docs/Developerguide/figures/en-us_image_0270643558.png new file mode 100644 index 0000000000000000000000000000000000000000..0c0f437cff9f3cc451d669957948c7f1d24edd84 Binary files /dev/null and b/content/en/docs/Developerguide/figures/en-us_image_0270643558.png differ diff --git a/content/en/docs/Developerguide/figures/group-commit-with-numa-awareness.png b/content/en/docs/Developerguide/figures/group-commit-with-numa-awareness.png new file mode 100644 index 0000000000000000000000000000000000000000..4c06ba273f713e12f46852e76bb257877d5ef47c Binary files /dev/null and b/content/en/docs/Developerguide/figures/group-commit-with-numa-awareness.png differ diff --git a/content/en/docs/Developerguide/figures/low-latency-(90th-)-performance-benchmarks.png b/content/en/docs/Developerguide/figures/low-latency-(90th-)-performance-benchmarks.png index 0e3977fb139f44a43703d1742022d64a3e6a545c..b2b12e8f62122853a7bf8c4df669fd8a8a9603b5 100644 Binary files a/content/en/docs/Developerguide/figures/low-latency-(90th-)-performance-benchmarks.png and b/content/en/docs/Developerguide/figures/low-latency-(90th-)-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/low-latency-(90th-transaction-average)-performance-benchmarks.png b/content/en/docs/Developerguide/figures/low-latency-(90th-transaction-average)-performance-benchmarks.png index 79f1cf75071a1fc5bf25fa920d03873f26f3c6f8..540d4b8b4b40646fc95159d0abd3940e6c99f80c 100644 Binary files a/content/en/docs/Developerguide/figures/low-latency-(90th-transaction-average)-performance-benchmarks.png and b/content/en/docs/Developerguide/figures/low-latency-(90th-transaction-average)-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/mot-architecture.png b/content/en/docs/Developerguide/figures/mot-architecture.png index 554751a51dee003d64c0322bfaa034525a0dd58a..cd1662927c13458ba3c121500601e7fb15846e94 100644 Binary files a/content/en/docs/Developerguide/figures/mot-architecture.png and b/content/en/docs/Developerguide/figures/mot-architecture.png differ diff --git a/content/en/docs/Developerguide/figures/per-transaction-logging.png b/content/en/docs/Developerguide/figures/per-transaction-logging.png new file mode 100644 index 0000000000000000000000000000000000000000..73cea5befa30afd409135a15c4b7092e879dace6 Binary files /dev/null and b/content/en/docs/Developerguide/figures/per-transaction-logging.png differ diff --git a/content/en/docs/Developerguide/figures/private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t.png b/content/en/docs/Developerguide/figures/private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t.png index 1ca76aab7819f0b8fb05a40cfc4c01e8ca28ab52..4e13f1571a81bd44aba77412d658b82602b21b1b 100644 Binary files a/content/en/docs/Developerguide/figures/private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t.png and b/content/en/docs/Developerguide/figures/private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t.png differ diff --git a/content/en/docs/Developerguide/figures/resource-utilization-performance-benchmarks.png b/content/en/docs/Developerguide/figures/resource-utilization-performance-benchmarks.png index 0e06d1b133cb424284bb8668d5ae96094b152462..0aa58dcc42a859a812e6cdc0c19276eb040a19e1 100644 Binary files a/content/en/docs/Developerguide/figures/resource-utilization-performance-benchmarks.png 
and b/content/en/docs/Developerguide/figures/resource-utilization-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/figures/synchronous-logging.png b/content/en/docs/Developerguide/figures/synchronous-logging.png new file mode 100644 index 0000000000000000000000000000000000000000..6186fcc64e78a3ba541c430549617f36f495d79c Binary files /dev/null and b/content/en/docs/Developerguide/figures/synchronous-logging.png differ diff --git a/content/en/docs/Developerguide/figures/three-logging-options.png b/content/en/docs/Developerguide/figures/three-logging-options.png new file mode 100644 index 0000000000000000000000000000000000000000..fe564340e56a54d4ff67a8274cd53fc4f6f16675 Binary files /dev/null and b/content/en/docs/Developerguide/figures/three-logging-options.png differ diff --git a/content/en/docs/Developerguide/figures/tpc-c-on-arm-(256-cores).png b/content/en/docs/Developerguide/figures/tpc-c-on-arm-(256-cores).png new file mode 100644 index 0000000000000000000000000000000000000000..5331dc903269adac2aad8a673d57c0f4ef09d2ad Binary files /dev/null and b/content/en/docs/Developerguide/figures/tpc-c-on-arm-(256-cores).png differ diff --git a/content/en/docs/Developerguide/figures/tpmc-vs-cpu-usage.png b/content/en/docs/Developerguide/figures/tpmc-vs-cpu-usage.png new file mode 100644 index 0000000000000000000000000000000000000000..0aa58dcc42a859a812e6cdc0c19276eb040a19e1 Binary files /dev/null and b/content/en/docs/Developerguide/figures/tpmc-vs-cpu-usage.png differ diff --git a/content/en/docs/Developerguide/figures/x86-8-socket-384-cores-performance-benchmarks.png b/content/en/docs/Developerguide/figures/x86-8-socket-384-cores-performance-benchmarks.png index be072db571e533f1ad6e84f1e7760fb6a1888f5a..9462c56835ccc4303808ae029529240c6d71acd9 100644 Binary files a/content/en/docs/Developerguide/figures/x86-8-socket-384-cores-performance-benchmarks.png and b/content/en/docs/Developerguide/figures/x86-8-socket-384-cores-performance-benchmarks.png differ diff --git a/content/en/docs/Developerguide/garbage-collection_mot.md b/content/en/docs/Developerguide/garbage-collection_mot.md deleted file mode 100644 index cd2566e8c00427cf342528fdcbe355df76e7540c..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/garbage-collection_mot.md +++ /dev/null @@ -1,27 +0,0 @@ -# GARBAGE COLLECTION \(MOT\) - -- enable\_gc = true - - Specifies whether to use the garbage collector. - -- reclaim\_threshold = 512 KB - - Configures the memory threshold for the garbage collector. - - Each session manages its own list of to-be-reclaimed objects and performs its own garbage collection during transaction commitment. This value determines the total memory threshold of objects waiting to be reclaimed, above which garbage collection is triggered for a session. - - In general, the trade-off here is between un-reclaimed objects vs garbage collection frequency. Setting a low value keeps low levels of un-reclaimed memory, but causes frequent garbage collection that may affect performance. Setting a high value triggers garbage collection less frequently, but results in higher levels of un-reclaimed memory. This setting is dependent upon the overall workload. - -- reclaim\_batch\_size = 8000 - - Configures the batch size for garbage collection. - - The garbage collector reclaims memory from objects in batches, in order to restrict the number of objects being reclaimed in a single garbage collection pass. 
The intent of this approach is to minimize the operation time of a single garbage collection pass. - -- high\_reclaim\_threshold = 8 MB - - Configures the high memory threshold for garbage collection. - - Because garbage collection works in batches, it is possible that a session may have many objects that can be reclaimed, but which were not. In such situations, in order to prevent garbage collection lists from becoming too bloated, this value is used to continue reclaiming objects within a single pass, even though that batch size limit has been reached, until the total size of the still-waiting-to-be-reclaimed objects is less than this threshold, or there are no more objects eligible for reclamation. - - diff --git a/content/en/docs/Developerguide/general-guidelines.md b/content/en/docs/Developerguide/general-guidelines.md deleted file mode 100644 index 4cc435b090d052fbb9622a6fcbb54e45f2a842f1..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/general-guidelines.md +++ /dev/null @@ -1,34 +0,0 @@ -# General Guidelines - -The following are general guidelines for editing the mot.conf file. - -- Each setting appears with its default value as follows - - ``` - # name = value - ``` - - -- Blank/white space is acceptable. -- Comments are indicated by placing a number sign \(\#\) anywhere on a line. -- The default values of each setting appear as a comment throughout this file. -- In case a parameter is uncommented and a new value is placed, the new setting is defined. -- Changes to the mot.conf file are applied only at a start or reload of the database server. - -Memory Units are represented as follows - -- KB – Kilobytes -- MB – Megabytes -- GB – Gigabytes -- TB – Terabytes -- Some memory units are represented as a percentage of the **max\_process\_memory** setting that is configured in **postgresql.conf**. For example – **20%**. - -Time units are represented as follows - -- us – Microseconds \(or micros\) -- ms – milliseconds \(or millis\) -- s – Seconds \(or secs\) -- min – Minutes \(or mins\) -- h – Hours -- d – Days - diff --git a/content/en/docs/Developerguide/granting-user-permissions.md b/content/en/docs/Developerguide/granting-user-permissions.md index f64f82afbf2ed60268fe5eb38cc01f8df6f75c5d..540581bfab8f6f2cba3555992c670b65056ee55f 100644 --- a/content/en/docs/Developerguide/granting-user-permissions.md +++ b/content/en/docs/Developerguide/granting-user-permissions.md @@ -1,19 +1,19 @@ -# Granting User Permissions - -The following describes how to assign a database user permission to access the MOT storage engine. This is performed only once per database user, and is usually done during the initial configuration phase. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->The granting of user permissions is required because MOT is integrated into the openGauss database by using and extending the Foreign Data Wrapper \(FDW\) mechanism, which requires granting user access permissions. - -To enable a specific user to create and access MOT tables \(DDL, DML, SELECT\) - -Run the following statement only once - -``` -GRANT USAGE ON FOREIGN SERVER mot_server TO ; -``` - -The red in this statement represents the special MOT part. - -All keywords are not case sensitive. - +# Granting User Permissions + +The following describes how to assign a database user permission to access the MOT storage engine. This is performed only once per database user, and is usually done during the initial configuration phase. 
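+ +Once the GRANT statement described below has been run for a user, that user can create and access MOT tables using standard foreign-table syntax. The following minimal sketch is illustrative only – the table name, index name and columns are hypothetical examples – + +``` +-- Assumes USAGE on the mot_server foreign server has already been granted to the current user: +CREATE FOREIGN TABLE mot_demo (id INT, name VARCHAR(64)); +CREATE INDEX mot_demo_idx ON mot_demo (id); +INSERT INTO mot_demo VALUES (1, 'example'); +SELECT name FROM mot_demo WHERE id = 1; +``` 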
+ +>![](public_sys-resources/icon-note.gif) **NOTE:** +>The granting of user permissions is required because MOT is integrated into the openGauss database by using and extending the Foreign Data Wrapper \(FDW\) mechanism, which requires granting user access permissions. + +To enable a specific user to create and access MOT tables \(DDL, DML, SELECT\) – + +Run the following statement only once – + +``` +GRANT USAGE ON FOREIGN SERVER mot_server TO <user>; +``` + +Replace **\<user\>** with the name of the database user being granted access. The **mot_server** foreign server name is the MOT-specific part of this statement. + +Keywords are not case-sensitive. + diff --git a/content/en/docs/Developerguide/gs_basebackup.md b/content/en/docs/Developerguide/gs_basebackup.md deleted file mode 100644 index abb682ee67370fd5be1b590d48cf0164d386f450..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/gs_basebackup.md +++ /dev/null @@ -1,6 +0,0 @@ -# gs\_basebackup - -gs\_basebackup is used to prepare base backups of a running server, without affecting other database clients. - -The MOT checkpoint is fetched at the end of the operation as well. However, the checkpoint's location is taken from **checkpoint\_dir** in the source server and is transferred to the data directory of the source in order to back it up correctly. - diff --git a/content/en/docs/Developerguide/gs_ctl-(full-and-incremental).md b/content/en/docs/Developerguide/gs_ctl-(full-and-incremental).md deleted file mode 100644 index 3916fb8abe3a6c16dd30cdf5782ad7ccafb8c0d4..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/gs_ctl-(full-and-incremental).md +++ /dev/null @@ -1,10 +0,0 @@ -# gs\_ctl \(Full and Incremental\) - -This tool is used to create a standby server from a primary server, as well as to synchronize a server with another copy of the same server after their timelines have diverged. - -At the end of the operation, the latest MOT checkpoint is fetched by the tool, taking into consideration the **checkpoint\_dir** configuration setting value. - -The checkpoint is fetched from the source server's **checkpoint\_dir** to the destination server's **checkpoint\_dir**. - -Currently, MOT does not support an incremental checkpoint. Therefore, the gs\_ctl****incremental****build does not work in an incremental manner for MOT, but rather in FULL mode. The Postgres \(disk-tables\) incremental build can still be done incrementally. - diff --git a/content/en/docs/Developerguide/gs_dump.md b/content/en/docs/Developerguide/gs_dump.md deleted file mode 100644 index 7fcd3dac11e3484ecb97242516168b37fc2912d2..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/gs_dump.md +++ /dev/null @@ -1,4 +0,0 @@ -# gs\_dump - -gs\_dump is used to export the database schema and data to a file. It also supports MOT tables. - diff --git a/content/en/docs/Developerguide/gs_restore.md b/content/en/docs/Developerguide/gs_restore.md deleted file mode 100644 index 502a856ae9bc40918a458fb129cb8757b4d32447..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/gs_restore.md +++ /dev/null @@ -1,4 +0,0 @@ -# gs\_restore - -gs\_restore is used to import the database schema and data from a file. It also supports MOT tables. 
- diff --git a/content/en/docs/Developerguide/indexes.md b/content/en/docs/Developerguide/indexes.md deleted file mode 100644 index 882e7f2e4ec0fdbaffb0e50ff256d5df835742c9..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/indexes.md +++ /dev/null @@ -1,22 +0,0 @@ -# Indexes - -MOT Index is a lock-free index based on state-of-the-art Masstree[\[13\]](#_ftn13) – a fast and scalable KV store for multicore systems, implemented as tries of B+ trees. It achieves excellent performance on many-core servers and high concurrent workloads. It uses some advanced techniques, such as an optimistic lock approach, cache-awareness and memory prefetching. - -Based on our empirical experiments, the combination of the mature Masstree[\[14\]](#_ftn14) lock-free implementation and our robust improvements to Silo[\[15\]](#_ftn15) provided us with exactly what we needed in that regard. For the index, we compared state-of-the-art solutions, such as[\[16\]](#_ftn16), [\[17\]](#_ftn17) and chose Masstree[\[18\]](#_ftn18) as it demonstrated the best overall performance for point queries, iteration and modifications. Masstree is a combination of tries and a B+ tree, implemented to carefully exploit caching, prefetching, optimistic navigation and fine-grained locking. It is optimized for high contention and adds various optimizations to its predecessors, such as OLFIT[\[19\]](#_ftn19). However, a Masstree index downside is its higher memory consumption. While row data consumes the same memory size, the memory per row per each index, primary or secondary, is higher on average by 16 bytes – 29 bytes in the lock‑based B-Tree used in disk-based tables vs. 45 bytes in MOT's Masstree. - -Another challenge is to make an optimistic insertion to a table with multiple indexes. - -The Masstree index is at the core of MOT memory layout for data and index management. Our team enhanced and significantly improved Masstree, and submitted some of the key contributions to the Masstree open source. These improvements include - -- Dedicated memory pools per index – Efficient allocation and fast index drop -- Global GC for Masstree – Fast, on-demand memory reclamation -- Masstree iterator implementation with access to an insertion key -- ARM architecture support - -We contributed our Masstree index improvements to the Masstree open-source implementation, which can be found here [https://github.com/kohler/masstree-beta](https://github.com/kohler/masstree-beta). - -- **[Secondary Index Support](secondary-index-support.md)** - -- **[Non-unique Indexes](non-unique-indexes.md)** - - diff --git a/content/en/docs/Developerguide/integration-using-foreign-data-wrappers-(fdw).md b/content/en/docs/Developerguide/integration-using-foreign-data-wrappers-(fdw).md deleted file mode 100644 index d69b86fbd7b3f0e78b1a1aa51c8ac25a5ed019c8..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/integration-using-foreign-data-wrappers-(fdw).md +++ /dev/null @@ -1,29 +0,0 @@ -# Integration using Foreign Data Wrappers \(FDW\) - -MOT complies with and leverages openGauss’s standard extensibility mechanism – Foreign Data Wrapper \(FDW\), as shown in the following diagram. - -The PostgreSQL Foreign Data Wrapper \(FDW\) feature enables the creation of foreign tables in a MOT database that are proxies for some other data source, such as MySQL, Redis, X3 and so on. 
When a query is made on a foreign table, the FDW queries the external data source and returns the results, as if they were coming from a table in your database. - -openGauss relies on the PostgreSQL Foreign Data Wrappers \(FDW\) and Index support so that SQL is entirely covered, including stored procedures, user defined functions, system functions calls. - -**Figure 1** MOT Architecture -![](figures/mot-architecture.png "mot-architecture") - -In the diagram above, the MOT engine is represented in green, while the existing openGauss \(based on Postgres\) components are represented in the top part of this diagram in blue. As you can see, the Foreign Data Wrapper \(FDW\) mediates between the MOT engine and the openGauss components. - -MOT-Related FDW Customizations - -Integrating MOT through FDW enables the reuse of the most upper layer openGauss functionality and therefore significantly shortened MOT’s time-to-market without compromising SQL coverage. - -However, the original FDW mechanism in openGauss was not designed for storage engine extensions, and therefore lackes the following essential functionalities. - -- Index awareness of foreign tables to be calculated in the query planning phase -- Complete DDL interfaces -- Complete transaction lifecycle interfaces -- Checkpoint interfaces -- Redo Log interface -- Recovery interfaces -- Vacuum interfaces - -In order to support all the missing functionalities, the SQL layer and FDW interface layer were extended to provide the necessary infrastructure in order to enable the plugging in of the MOT transactional storage engine. - diff --git a/content/en/docs/Developerguide/introducing-mot.md b/content/en/docs/Developerguide/introducing-mot.md index 0c941d2a19e7e61287e0f34138bc63f7f039f447..5a3dca55b07dd15256745e68c49f6ddd5cdb5e60 100644 --- a/content/en/docs/Developerguide/introducing-mot.md +++ b/content/en/docs/Developerguide/introducing-mot.md @@ -1,15 +1,15 @@ -# Introducing MOT - -This ****chapter introduces Huawei's openGauss Memory-Optimized Tables \(MOT\), describes its features and benefits, key technologies, applicable scenarios, performance benchmarks and its competitive advantages. - -- **[MOT Introduction](mot-introduction.md)** - -- **[Features and Benefits](features-and-benefits.md)** - -- **[MOT Key Technologies](mot-key-technologies.md)** - -- **[Usage Scenarios](usage-scenarios.md)** - -- **[Performance Benchmarks](performance-benchmarks.md)** - - +# Introducing MOT + +This chapter introduces openGauss Memory-Optimized Tables \(MOT\), describes its features and benefits, key technologies, applicable scenarios, performance benchmarks and its competitive advantages. + +- **[MOT Introduction](mot-introduction.md)** + +- **[MOT Features and Benefits](mot-features-and-benefits.md)** + +- **[MOT Key Technologies](mot-key-technologies.md)** + +- **[MOT Usage Scenarios](mot-usage-scenarios.md)** + +- **[MOT Performance Benchmarks](mot-performance-benchmarks.md)** + + diff --git a/content/en/docs/Developerguide/jit_mot.md b/content/en/docs/Developerguide/jit_mot.md deleted file mode 100644 index cafc077d8dd2961fa3f9581610c72b4fa4cbdd8c..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/jit_mot.md +++ /dev/null @@ -1,25 +0,0 @@ -# JIT \(MOT\) - -- enable\_mot\_codegen = true - - Specifies whether to use JIT query compilation and execution for planned queries. - - JIT query execution enables JIT-compiled code to be prepared for a prepared query during its planning phase. 
The resulting JIT-compiled function is executed whenever the prepared query is invoked. JIT compilation usually takes place in the form of LLVM. On platforms where LLVM is not natively supported, MOT provides a software-based fallback called Tiny Virtual Machine \(TVM\). - -- force\_mot\_pseudo\_codegen = false - - Specifies whether to use TVM \(pseudo-LLVM\) even though LLVM is supported on the current platform. - - On platforms where LLVM is not natively supported, MOT automatically defaults to TVM. - - On platforms where LLVM is natively supported, LLVM is used by default. This configuration item enables the use of TVM for JIT compilation and execution on platforms on which LLVM is supported. - -- enable\_mot\_codegen\_print = false - - Specifies whether to print emitted LLVM/TVM IR code for JIT-compiled queries. - -- mot\_codegen\_limit = 100 - - Limits the number of JIT queries allowed per user session. - - diff --git a/content/en/docs/Developerguide/logging-types.md b/content/en/docs/Developerguide/logging-types.md deleted file mode 100644 index 06abda06ea22f7d18ff809f36c320d5a0e089d8b..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/logging-types.md +++ /dev/null @@ -1,71 +0,0 @@ -# Logging Types - -Two synchronous transaction logging options and one asynchronous transaction logging option are supported \(these are also supported by the standard openGauss disk engine\). MOT also supports synchronous Group Commit logging with NUMA-awareness optimization, as described below. - -According to your configuration, one of the following types of logging is implemented – - -[Synchronous Redo Logging](#section13294122621816) - -[Group Synchronous Redo Logging](#section1544013621916) - -[Asynchronous Redo Logging](#section16446161112210) - -## Synchronous Redo Logging - -The **Synchronous Redo Logging** option is the simplest and most strict redo logger. When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL \(Redo Log\), as follows - -1. While a transaction is in progress, it is stored in the MOT’s memory. -2. After a transaction finishes and the client application sends a **Commit **command, the transaction is locked and then written to the WAL Redo Log on the disk. This means that while the transaction log entries are being written to the log, the client application is still waiting for a response. -3. As soon as the transaction’s entire buffer is written to the log, the changes to the data in memory take place and then the transaction is committed. After the transaction has been committed, the client application is notified that the transaction is complete. - -**Summary** - -The** Synchronous Redo Logging** option is the safest and most strict because it ensures total synchronization of the client application and the WAL Redo log entries for each transaction as it is committed; thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful, when it has not yet been persisted to disk. - -The downside of the** Synchronous Redo Logging** option is that it is the slowest logging mechanism of the three options. This is because a client application must wait until all data is written to disk and because of the frequent disk writes \(which typically slow down the database\). 
- -## Group Synchronous Redo Logging - -The **Group Synchronous Redo Logging** option is very similar to the **Synchronous Redo Logging** option, because it also ensures total durability with absolutely no data loss and total synchronization of the client application and the WAL \(Redo Log\) entries. The difference is that the** Group Synchronous Redo Logging** option writes _groups of transaction _redo entries to the WAL Redo Log on the disk at the same time, instead of writing each and every transaction as it is committed. Using Group Synchronous Redo Logging reduces the amount of disk I/Os and thus improves performance, especially when running a heavy workload. - -The MOT engine performs synchronous Group Commit logging with Non-Uniform Memory Access \(NUMA\)-awareness optimization by automatically grouping transactions according to the NUMA socket of the core on which the transaction is running. - -You may refer to the ++ section for more information about NUMA-aware memory access. - -When a transaction commits, a group of entries are recorded in the WAL Redo Log, as follows – - -1. While a transaction is in progress, it is stored in the memory. The MOT engine groups transactions in buckets according to the NUMA socket of the core on which the transaction is running. This means that all the transactions running on the same socket are grouped together and that multiple groups will be filling in parallel according to the core on which the transaction is running. - - Writing transactions to the WAL is more efficient in this manner because all the buffers from the same socket are written to disk together. - - >![](public_sys-resources/icon-note.gif) **NOTE:** - >Each thread runs on a single core/CPU which belongs to a single socket and each thread only writes to the socket of the core on which it is running. - -2. After a transaction finishes and the client application sends a Commit command, the transaction redo log entries are serialized together with other transactions that belong to the same group. -3. After the configured criteria are fulfilled for a specific group of transactions \(quantity of committed transactions or timeout period as describes on ++\), the transactions in this group are written to the WAL on the disk. This means that while these log entries are being written to the log, the client applications that issued the commit are waiting for a response. -4. As soon as all the transaction buffers in the NUMA-aware group have been written to the log, all the transactions in the group are performing the necessary changes to the memory store and the clients are notified that these transactions are complete. - -**Summary** - -The** Group Synchronous Redo Logging** option is a an extremely safe and strict logging option because it ensures total synchronization of the client application and the WAL Redo log entries; thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful, when it has not yet been persisted to disk. - -On one hand this option has fewer disk writes than the **Synchronous Redo Logging** option, which may mean that it is faster. The downside is that transactions are locked for longer, meaning that they are locked until after all the transactions in the same NUMA memory have been written to the WAL Redo Log on the disk. - -The benefits of using this option depend on the type of transactional workload. 
For example, this option benefits systems that have many transactions \(and less so for systems that have few transactions, because there are few disk writes anyway\). - -## Asynchronous Redo Logging - -The **Asynchronous Redo Logging** option is the fastest logging method, However, it does not ensure no data loss, meaning that some data that is still in the buffer and was not yet written to disk may get lost upon a power failure or database crash. When a transaction is committed by a client application, the transaction redo entries are recorded in internal buffers and written to disk at preconfigured intervals. The client application does not wait for the data being written to disk. It continues to the next transaction. This is what makes asynchronous redo logging the fastest logging method. - -When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL Redo Log, as follows: - -1. While a transaction is in progress, it is stored in the MOT's memory. -2. After a transaction finishes and the client application sends a **Commit **command, the transaction redo entries are written to internal buffers, but are not yet written to disk. Then changes to the data in memory takes place and the client application is notified that the transaction is committed. -3. At a preconfigured interval, a redo log thread running in the background collects all the buffered redo log entries and writes them to disk. - -**Summary** - -The Asynchronous Redo Logging option is the fastest logging option because it does not require the client application to wait for data being written to disk. In addition, it groups many transactions redo entries and writes them together, thus reducing the amount of disk I/Os that slow down the MOT engine. - -The downside of the Asynchronous Redo Logging option is that it does not ensure that data will not get lost upon a crash or failure. Data that was committed, but was not yet written to disk, is not durable on commit and thus cannot be recovered in case of a failure. The Asynchronous Redo Logging option is most relevant for applications that are willing to sacrifice data recovery \(consistency\) over performance. - diff --git a/content/en/docs/Developerguide/logging-wal-redo-log.md b/content/en/docs/Developerguide/logging-wal-redo-log.md deleted file mode 100644 index db9f319de46643975de1e13ff68a7e658ac18d27..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/logging-wal-redo-log.md +++ /dev/null @@ -1,18 +0,0 @@ -# Logging – WAL Redo Log - -To ensure Durability, MOT is fully integrated with the openGauss's Write-Ahead Logging \(WAL\) mechanism, so that MOT persists data in WAL records using openGauss's XLOG interface. This means that every addition, update, and deletion to a MOT table's record is recorded as an entry in the WAL. This ensures that the most current data state can be regenerated and recovered from this non-volatile log. For example, if three new rows were added to a table, two were deleted and one was updated, then six entries would be recorded in the log. - -- MOT log records are written to the same WAL as the other records of openGauss disk-based tables. -- MOT only logs an operation at the transaction commit phase. -- MOT only logs the updated delta record in order to minimize the amount of data written to disk. -- During recovery, data is loaded from the last known or a specific Checkpoint; and then the WAL Redo log is used to complete the data changes that occur from that point forward. 
-- The WAL \(Redo Log\) retains all the table row modifications until a Checkpoint is performed \(as described above\). The log can then be truncated in order to reduce recovery time and to save disk space. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->In order to ensure that the log IO device does not become a bottleneck, the log file must be placed on a drive that has low latency. - -- **[Logging Types](logging-types.md)** - -- **[Configuring Logging](configuring-logging.md)** - - diff --git a/content/en/docs/Developerguide/memory-and-storage-planning.md b/content/en/docs/Developerguide/memory-and-storage-planning.md deleted file mode 100644 index 8f335a9a657f7935ac5827bd81b82e5c361faa22..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/memory-and-storage-planning.md +++ /dev/null @@ -1,9 +0,0 @@ -# Memory and Storage Planning - -This section describes the considerations and guidelines for evaluating, estimating and planning the quantity of memory and storage capacity to suit your specific application needs. This section also describes the various data aspects that affect the quantity of required memory, such as the size of data and indexes for the planned tables, memory to sustain transaction management and how fast the data is growing. - -- **[Memory Planning](memory-planning.md)** - -- **[Storage IO](storage-io.md)** - - diff --git a/content/en/docs/Developerguide/memory-management.md b/content/en/docs/Developerguide/memory-management.md deleted file mode 100644 index 622f46aa0fa05db4bbf69fdc1f6ff2652c946e6d..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/memory-management.md +++ /dev/null @@ -1,4 +0,0 @@ -# Memory Management - -For planning and finetuning, see the [Memory and Storage Planning](memory-and-storage-planning.md) section and the [MOT Configuration Settings](mot-configuration-settings.md) sections. - diff --git a/content/en/docs/Developerguide/memory-planning.md b/content/en/docs/Developerguide/memory-planning.md deleted file mode 100644 index 1c90db2f7c256ca198f960aa2e05cc65d587ba44..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/memory-planning.md +++ /dev/null @@ -1,125 +0,0 @@ -# Memory Planning - -MOT belongs to the in-memory database class \(IMDB\) in which all tables and indexes reside entirely in memory. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->Memory storage is volatile, meaning that it requires power to maintain the stored information. Disk storage is persistent, meaning that it is written to disk, which is non-volatile storage. MOT uses both, having all data in memory, while persisting \(by WAL logging\) transactional changes to disk with strict consistency \(in synchronous logging mode\). - -Sufficient physical memory must exist on the server in order to maintain the tables in their initial state, as well as to accommodate the related workload and growth of data. All this is in addition to the memory that is required for the traditional disk-based engine, tables and sessions that support the workload of disk-based tables. Therefore, planning ahead for enough memory to contain them all is essential. - -Even so, you can get started with whatever amount of memory you have and perform basic tasks and evaluation tests. Later, when you are ready for production, the following issues should be addressed. 
- -## Memory Configuration Settings - -Similar to standard PG , the memory of the openGauss database process is controlled by the upper limit in its **max\_process\_memory** setting, which is defined in the** postgres.conf** file. The MOT engine and all its components and threads, reside within the openGauss process. Therefore, the memory allocated to MOT also operates within the upper boundary defined by **max\_process\_memory** for the entire openGauss database process. - -The amount of memory that MOT can reserve for itself is defined as a portion of **max\_process\_memory**. It is either a percentage of it or an absolute value that is less than it. This portion is defined in the **mot.conf** configuration file by the **\_mot\_\_memory** settings. - -The portion of **max\_process\_memory** that can be used by MOT must still leave at least 2 GB available for the PG \(openGauss\) envelope. Therefore, in order to ensure this, MOT verifies the following during database startup. - -``` -(max_mot_global_memory + max_mot_local_memory) + 2GB < max_process_memory -``` - -If this limit is breached, then MOT memory internal limits are adjusted in order to provide the maximum possible within the limitations described above. This adjustment is performed during startup and calculates the value of **MOT max memory** accordingly. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->**MOT max memory** is a logically calculated value of either the configured settings or their adjusted values of **\(max\_mot\_global\_memory + max\_mot\_local\_memory\)**. - -In this case, a warning is issued to the server log, as shown below. - -## Warning Examples - -Two messages are reported – the problem and the solution. - -The following is an example of a warning message reporting the problem. - -``` -[WARNING] MOT engine maximum memory definitions (global: 9830 MB, local: 1843 MB, session large store: 0 MB, total: 11673 MB) breach GaussDB maximum process memory restriction (12288 MB) and/or total system memory (64243 MB). MOT values shall be adjusted accordingly to preserve required gap (2048 MB). -``` - -The following is an example of a warning message indicating that MOT is automatically adjusting the memory limits. - -``` -[WARNING] Adjusting MOT memory limits: global = 8623 MB, local = 1617 MB, session large store = 0 MB, total = 10240 MB -``` - -This is the only place that shows the new memory limits. - -Additionally, MOT does not allow the insertion of additional data when the total memory usage approaches the chosen memory limits. The threshold for determining when additional data insertions are no longer allowed, is defined as a percentage of **MOT\_max\_memory** \(which is a calculated value, as described above\). The percentage of **MOT\_max\_memory** is configured in the **high\_red\_mark\_percent** setting of the **mot.conf** file. The default is 90, meaning 90%. Attempting to add additional data over this threshold returns an error to the user and is also registered in the database log file. - -## Minimum and Maximum - -In order to secure memory for future operations, MOT pre-allocates memory based on the minimum global and local settings. The database administrator should specify the minimum amount of memory required for the MOT tables and sessions to sustain their workload. This ensures that this minimal memory is allocated to MOT even if another excessive memory‑consuming application runs on the same server as the database and competes with the database for memory resources. 
The maximum values are used to limit memory growth. - -## Global and Local - -The memory used by MOT is comprised of two parts: - -- **Global Memory** – Global memory is a long-term memory pool that contains the data and indexes of MOT tables. It is evenly distributed across NUMA-nodes and is shared by all CPU cores. - -- **Local Memory** – Local memory is a memory pool used for short-term objects. Its primary consumers are sessions handling transactions. These sessions are storing data changes in the part of the memory dedicated to the relevant specific transaction \(known as _transaction private memory_\). Data changes are moved to the global memory at the commit phase. Memory object allocation is performed in NUMA-local manner in order to achieve the lowest possible latency. - -Deallocated objects are put back in the relevant memory pools. Minimal use of operating system memory allocation \(malloc\) functions during transactions circumvents unnecessary locks and latches. - -The allocation of these two memory parts is controlled by the dedicated **min/max\_mot\_global\_memory** and **min/max\_mot\_local\_memory** settings. If MOT global memory usage gets too close to this defined maximum, then MOT protects itself and does not accept new data. Attempts to allocate memory beyond this limit are denied and an error is reported to the user. - -**Minimum Memory Requirements** - -To get started and perform a minimal evaluation of MOT performance, there are a few requirements. - -Make sure that the **max\_process\_memory** \(as defined in **postgres.conf**\) has sufficient capacity for MOT tables and sessions \(configured by **mix/max\_mot\_global\_memory** and **mix/max\_mot\_local\_memory**\), in addition to the disk tables buffer and extra memory. For simple tests, the default **mot.conf** settings can be used. - -## Actual Memory Requirements During Production - -In a typical OLTP workload, with 80:20 read:write ratio on average, MOT memory usage per table is 60% higher than in disk-based tables \(this includes both the data and the indexes\). This is due to use of more optimal data structures and algorithms, allowing faster access, with CPU-cache awareness and memory-prefetching. - -The actual memory requirement for a specific application depends on the quantity of data, the expected workload and especially on the data growth. - -**Max Global Memory Planning – Data + Index Size** - -- To plan for maximum global memory – - -1. Determine the size of a specific disk table \(including both its data and all its indexes\). The following statistical query can be used to determine the data size of the **customer** table and the **customer\_pkey** index size: - - **Data size** - - ``` - select pg_relation_size(‘customer'); - ``` - - - **Index** - - ``` - select pg_relation_size('customer_pkey'); - ``` - -2. Add 60%, which is the common requirement in MOT relative to the current size of the disk-based data and index. -3. Add an additional percentage for the expected growth of data. For example: - - 5% monthly growth = 80% yearly growth \(1.05^12\). Thus, in order to sustain a year's growth, allocate 80% more memory than is currently used by the tables. - - -This completes the estimation and planning of the **max\_mot\_global\_memory** value. The actual setting can be defined either as an absolute value or a percentage of the Postgres **max\_process\_memory**. The exact value is typically finetuned during deployment. 
- -**Max Local Memory Planning – Concurrent Session Support** - -Local memory needs are primarily a function of the quantity of concurrent sessions. The typical OLTP workload of an average session uses up to 8 MB. This should be multiplied by the quantity of sessions and then a little bit extra should be added. - -A memory calculation can be performed in this manner and then finetuned, as follows: - -``` -SESSION_COUNT * SESSION_SIZE (8 MB) + SOME_EXTRA (100MB should be enough) -``` - -The default specifies 15% of Postgres's **max\_process\_memory**, which by default is 12 GB. This equals 1.8 GB, which is sufficient for 230 sessions, which is the requirement for the **max\_mot\_local** **memory**. The actual setting can be defined either in absolute values or as a percentage of the Postgres **max\_process\_memory**. The exact value is typically finetuned during deployment. - -**Unusually Large Transactions** - -Some transactions are unusually large because they apply changes to a large number of rows. This may increase a single session's local memory up to the maximum allowed limit, which is 1 GB. For example – - -``` -delete from SOME_VERY_LARGE_TABLE; -``` - -Take this scenario into consideration when configuring the **max\_mot\_local\_memory** setting, as well as during application development. - diff --git a/content/en/docs/Developerguide/memory_mot.md b/content/en/docs/Developerguide/memory_mot.md deleted file mode 100644 index 2436e69f82c8b03173298ab5af061b3fa5138cec..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/memory_mot.md +++ /dev/null @@ -1,150 +0,0 @@ -# MEMORY \(MOT\) - -- max\_threads = 1024 - - Configures the maximum number of threads allowed to run in the MOT engine. - - When not using a thread pool, this value restricts the number of sessions that can interact concurrently with MOT tables. This value does not restrict non-MOT sessions. - - When using a thread pool, this value restricts the number of worker threads that can interact concurrently with MOT tables. - -- max\_connections = 1024 - - Configures the maximum number of connections allowed to run in the MOT engine. - - This value restricts the number of sessions that can interact concurrently with MOT tables, regardless of the thread-pool configuration. This value does not restrict non-MOT sessions. - -- affinity\_mode = fill-physical-first - - Configures the affinity mode of threads for the user session and internal MOT tasks. - - When a thread pool is used, this value is ignored for user sessions, as their affinity is governed by the thread pool. However, it is still used for internal MOT tasks. - - Valid values are **fill-socket-first**, **equal-per-socket**, **fill-physical-first** and **none** - - - **Fill-socket-first** attaches threads to cores in the same socket until the socket is full and then moves to the next socket. - - **Equal-per-socket** spreads threads evenly among all sockets. - - **Fill-physical-first** attaches threads to physical cores in the same socket until all physical cores are employed and then moves to the next socket. When all physical cores are used, then the process begins again with hyper-threaded cores. - - **None** disables any affinity configuration and lets the system scheduler determine on which core each thread is scheduled to run. - -- lazy\_load\_chunk\_directory = true - - Configures the chunk directory mode that is used for memory chunk lookup. 
- - **Lazy** mode configures the chunk directory to load parts of it on demand, thus reducing the initial memory footprint \(from 1 GB to 1 MB approximately\). However, this may result in minor performance penalties and errors in extreme conditions of memory distress. In contrast, using a **non-lazy** chunk directory allocates an additional 1 GB of initial memory, produces slightly higher performance and ensures that chunk directory errors are avoided during memory distress. - -- reserve\_memory\_mode = virtual - - Configures the memory reservation mode \(either **physical** or **virtual**\). - - Whenever memory is allocated from the kernel, this configuration value is consulted to determine whether the allocated memory is to be resident \(**physical**\) or not \(**virtual**\). This relates primarily to preallocation, but may also affect runtime allocations. For **physical** reservation mode, the entire allocated memory region is made resident by forcing page faults on all pages spanned by the memory region. Configuring **virtual** memory reservation may result in faster memory allocation \(particularly during preallocation\), but may result in page faults during the initial access \(and thus may result in a slight performance hit\) and more sever errors when physical memory is unavailable. In contrast, physical memory allocation is slower, but later access is both faster and guaranteed. - -- store\_memory\_policy = compact - - Configures the memory storage policy \(**compact** or **expanding**\). - - When **compact** policy is defined, unused memory is released back to the kernel, until the lower memory limit is reached \(see **min\_mot\_memory** below\). In **expanding** policy, unused memory is stored in the MOT engine for later reuse. A **compact** storage policy reduces the memory footprint of the MOT engine, but may occasionally result in minor performance degradation. In addition, it may result in unavailable memory during memory distress. In contrast, **expanding** mode uses more memory, but results in faster memory allocation and provides a greater guarantee that memory can be re-allocated after being de-allocated. - -- chunk\_alloc\_policy = auto - - Configures the chunk allocation policy for global memory. - - MOT memory is organized in chunks of 2 MB each. The source NUMA node and the memory layout of each chunk affect the spread of table data among NUMA nodes, and therefore can significantly affect the data access time. When allocating a chunk on a specific NUMA node, the allocation policy is consulted. - - Available values are **auto**, **local**, **page-interleaved**, **chunk-interleaved** and **native** - - - **Auto** policy selects a chunk allocation policy based on the current hardware. - - **Local** policy allocates each chunk on its respective NUMA node. - - **Page-interleaved** policy allocates chunks that are composed of interleaved memory 4‑kilobyte pages from all NUMA nodes. - - **Chunk-interleaved** policy allocates chunks in a round robin fashion from all NUMA nodes. - - **Native** policy allocates chunks by calling the native system memory allocator. - -- chunk\_prealloc\_worker\_count = 8 - - Configures the number of workers per NUMA node participating in memory preallocation. - -- max\_mot\_global\_memory = 80% - - Configures the maximum memory limit for the global memory of the MOT engine. - - Specifying a percentage value relates to the total defined by **max\_process\_memory** configured in **postgresql.conf**. 
- - The MOT engine memory is divided into global \(long-term\) memory that is mainly used to store user data and local \(short-term\) memory that is mainly used by user sessions for local needs. - - Any attempt to allocate memory beyond this limit is denied and an error is reported to the user. Ensure that the sum of **max\_mot\_global\_memory** and **max\_mot\_local\_memory** do not exceed the **max\_process\_memory** configured in **postgresql.conf**. - -- min\_mot\_global\_memory = 0 MB - - Configures the minimum memory limit for the global memory of the MOT engine. - - Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. - - This value is used for the preallocation of memory during startup, as well as to ensure that a minimum amount of memory is available for the MOT engine during its normal operation. When using **compact** storage policy \(see **store\_memory\_policy** above\), this value designates the lower limit under which memory is not released back to the kernel, but rather kept in the MOT engine for later reuse. - -- max\_mot\_local\_memory = 15% - - Configures the maximum memory limit for the local memory of the MOT engine. - - Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. - - MOT engine memory is divided into global \(long-term\) memory that is mainly used to store user data and local \(short-term\) memory that is mainly used by user session for local needs. - - Any attempt to allocate memory beyond this limit is denied and an error is reported to the user. Ensure that the sum of **max\_mot\_global\_memory** and **max\_mot\_local\_memory** do not exceed the **max\_process\_memory** configured in **postgresql.conf**. - -- min\_mot\_local\_memory = 0 MB - - Configures the minimum memory limit for the local memory of the MOT engine. - - Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. - - This value is used for preallocation of memory during startup, as well as to ensure that a minimum amount of memory is available for the MOT engine during its normal operation. When using compact storage policy \(see **store\_memory\_policy** above\), this value designates the lower limit under which memory is not released back to the kernel, but rather kept in the MOT engine for later reuse. - -- max\_mot\_session\_memory = 0 MB - - Configures the maximum memory limit for a single session in the MOT engine. - - Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. - - Typically, sessions in the MOT engine can allocate as much local memory as needed, so long as the local memory limit is not exceeded. To prevent a single session from taking too much memory, and thereby denying memory from other sessions, this configuration item is used to restrict small session-local memory allocations \(up to 1,022 KB\). - - Make sure that this configuration item does not affect large or huge session-local memory allocations. - - A value of zero denotes no restriction on any session-local small allocations per session, except for the restriction arising from the local memory allocation limit configured by **max\_mot\_local\_memory**. - -- min\_mot\_session\_memory = 0 MB - - Configures the minimum memory reservation for a single session in the MOT engine. 
- - Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. - - This value is used to preallocate memory during session creation, as well as to ensure that a minimum amount of memory is available for the session to perform its normal operation. - -- high\_red\_mark\_percent = 90 - - Configures the high red mark for memory allocations. - - This is calculated as a percentage of the maximum for the MOT engine, as configured by **max\_mot\_memory**. The default is 90, meaning 90%. When total memory usage by MOT reaches this level, only destructive operations are permitted. All other operations report an error to the user. - -- session\_large\_buffer\_store\_size = 0 MB - - Configures the large buffer store for sessions. - - When a user session executes a query that requires a lot of memory \(for example, when using many rows\), the large buffer store is used to increase the certainty level that such memory is available and to serve this memory request more quickly. Any memory allocation for a session exceeding 1,022 KB is considered as a large memory allocation. If the large buffer store is not used or is depleted, such allocations are treated as huge allocations that are served directly from the kernel. - -- session\_large\_buffer\_store\_max\_object\_size = 0 MB - - Configures the maximum object size in the large allocation buffer store for sessions. - - Internally, the large buffer store is divided into objects of varying sizes. This value is used to set an upper limit on objects originating from the large buffer store, as well as to determine the internal division of the buffer store into objects of various size. - - This size cannot exceed 1/8 of the **session\_large\_buffer\_store\_size**. If it does, it is adjusted to the maximum possible. - -- session\_max\_huge\_object\_size = 1 GB - - Configures the maximum size of a single huge memory allocation made by a session. - - Huge allocations are served directly from the kernel and therefore are not guaranteed to succeed. - - This value also pertains to global \(meaning not session-related\) memory allocations. 
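Taken together, the memory limits above are typically tuned as a group. The following is a minimal sketch of the memory-related lines in mot.conf – the percentage values shown are the documented defaults, while any absolute values you substitute must be sized against **max\_process\_memory** as described above.

```
# MEMORY (MOT) – percentage values are the documented defaults
max_mot_global_memory = 80%      # long-term memory for MOT data and indexes
min_mot_global_memory = 0 MB     # preallocated at startup
max_mot_local_memory = 15%       # short-term memory for sessions
min_mot_local_memory = 0 MB
max_mot_session_memory = 0 MB    # 0 = no per-session restriction on small allocations
session_max_huge_object_size = 1 GB
```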
- -
diff --git a/content/en/docs/Developerguide/mot-administration.md b/content/en/docs/Developerguide/mot-administration.md new file mode 100644 index 0000000000000000000000000000000000000000..c52541b47205e17a1c8f07031be4bf79ca9656d6 --- /dev/null +++ b/content/en/docs/Developerguide/mot-administration.md @@ -0,0 +1,21 @@ +# MOT Administration + +The following describes various MOT administration topics – + +- **[MOT Durability](mot-durability.md)** + +- **[MOT Recovery](mot-recovery.md)** + +- **[MOT Replication and High Availability](mot-replication-and-high-availability.md)** + +- **[MOT Memory Management](mot-memory-management.md)** + +- **[MOT Vacuum](mot-vacuum.md)** + +- **[MOT Statistics](mot-statistics.md)** + +- **[MOT Monitoring](mot-monitoring.md)** + +- **[MOT Error Messages](mot-error-messages.md)** + + diff --git a/content/en/docs/Developerguide/mot-checkpoint-concepts.md b/content/en/docs/Developerguide/mot-checkpoint-concepts.md new file mode 100644 index 0000000000000000000000000000000000000000..d227428cb8d20a96963acc42fb5371e67f270466 --- /dev/null +++ b/content/en/docs/Developerguide/mot-checkpoint-concepts.md @@ -0,0 +1,26 @@ +# MOT Checkpoint Concepts + +In openGauss, a Checkpoint is a snapshot of a point in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before the Checkpoint. + +At the time of a Checkpoint, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. + +MOT does not store its data in the same way that openGauss does – the data is stored directly in memory – so the concept of dirty pages does not exist for MOT. + +For this reason, we have researched and implemented the CALC algorithm, which is described in the paper named Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems, SIGMOD 2016, from Yale University. + +Low-overhead asynchronous checkpointing in main-memory database systems\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]. + +## CALC Checkpoint Algorithm – Low Overhead in Memory and Compute + +The checkpoint algorithm provides the following benefits – + +- **Reduced Memory Usage –** At most two copies of each record are stored at any time. Memory usage is minimized by storing only a single physical copy of a record when its live and stable versions are equal or when no checkpoint is actively being recorded. +- **Low Overhead –** CALC's overhead is smaller than that of other asynchronous checkpointing algorithms. +- **Uses Virtual Points of Consistency –** CALC does not require quiescing of the database in order to achieve a physical point of consistency. + +## Checkpoint Activation + +MOT Checkpoints are integrated into the openGauss envelope's Checkpoint mechanism. The Checkpoint process can be triggered manually by executing the **CHECKPOINT;** command or automatically according to the envelope's Checkpoint triggering settings \(time/size\). + +Checkpoint configuration is performed in the mot.conf file – see the [CHECKPOINT \(MOT\)](mot-configuration-settings.md#section8719101152712) section.
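For example, an MOT Checkpoint can be triggered manually from any SQL client, while the periodic behavior is controlled by the CHECKPOINT section of mot.conf. The following sketch shows both; the configuration values shown are the documented defaults.

```
-- Trigger a checkpoint manually (also covers MOT tables):
CHECKPOINT;
```

```
# mot.conf – CHECKPOINT section
enable_checkpoint = true
#checkpoint_dir =                # default: the data folder of each data node
checkpoint_segsize = 16 MB
checkpoint_workers = 3
```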
+ diff --git a/content/en/docs/Developerguide/mot-concurrency-control-mechanism.md b/content/en/docs/Developerguide/mot-concurrency-control-mechanism.md new file mode 100644 index 0000000000000000000000000000000000000000..18cad981d3743c3bb77a097899cc207404d12485 --- /dev/null +++ b/content/en/docs/Developerguide/mot-concurrency-control-mechanism.md @@ -0,0 +1,18 @@ +# MOT Concurrency Control Mechanism + +After investing extensive research to find the best concurrency control mechanism, we concluded that SILO\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]-based on OCC is the best ACID-compliant OCC algorithm for MOT. SILO provides the best foundation for MOT's challenging requirements. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>MOT is fully Atomicity, Consistency, Isolation, Durability \(ACID\)-compliant, as described in the [MOT Introduction](mot-introduction.md) section. + +The following topics describe MOT's concurrency control mechanism – + +- **[MOT Local and Global Memory](mot-local-and-global-memory.md)** + +- **[MOT SILO Enhancements](mot-silo-enhancements.md)** + +- **[MOT Isolation Levels](mot-isolation-levels.md)** + +- **[MOT Optimistic Concurrency Control](mot-optimistic-concurrency-control.md)** + + diff --git a/content/en/docs/Developerguide/mot-configuration-settings.md b/content/en/docs/Developerguide/mot-configuration-settings.md index 52629b81f64aeb6679045efb953c9b43381ca5ac..e26ecf8514605784235d1703cc93011923eed7f6 100644 --- a/content/en/docs/Developerguide/mot-configuration-settings.md +++ b/content/en/docs/Developerguide/mot-configuration-settings.md @@ -1,37 +1,432 @@ -# MOT Configuration Settings - -MOT is provided preconfigured to creating working MOT Tables. For best results, it is recommended to customize the MOT configuration \(defined in the file named mot.conf\) according to your application's specific requirements and your preferences. - -This file is read-only upon server startup. If you edit this file while the system is running, then the server must be reloaded in order for the changes to take effect. - -The mot.conf file is located in the same folder as the postgres.conf configuration file. - -Read the [General Guidelines](general-guidelines.md)section and then review and configure the following sections of the mot.conf file, as needed. - -The following topics describe each section in the mot.conf file and the settings that it contains, as well as the default value of each. - -[REDO LOG \(MOT\)](redo-log_mot.md) - -[CHECKPOINT \(MOT\)](checkpoint_mot.md) - -[RECOVERY \(MOT\)](recovery_mot.md) - -[STATISTICS \(MOT\)](statistics_mot.md) - -[ERROR LOG \(MOT\)](error-log_mot.md) - -[MEMORY \(MOT\)](memory_mot.md) - -[GARBAGE COLLECTION \(MOT\)](garbage-collection_mot.md) - -[JIT \(MOT\)](jit_mot.md) - -[STORAGE \(MOT)\)](storage_mot.md) - -[Default MOT.conf](default-mot-conf.md) - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->The topics listed above describe each of the setting sections in the mot.conf file. In addition to the above topics, for an overview of all the aspects of a specific MOT feature \(such as Recovery\), you may refer to the relevant topic of this user manual. For example, the mot.conf file has a Recovery section that contains settings that affect MOT recovery and this is described in the [Recovery](recovery.md) section that is listed above. In addition, for a full description of all aspects of Recovery, you may refer to the [Recovery](recovery.md)section of the Administration chapter of this user manual. 
Reference links are also provided in each relevant section of the descriptions below. - - - +# MOT Configuration Settings + +MOT is provided preconfigured to creating working MOT Tables. For best results, it is recommended to customize the MOT configuration \(defined in the file named mot.conf\) according to your application's specific requirements and your preferences. + +This file is read-only upon server startup. If you edit this file while the system is running, then the server must be reloaded in order for the changes to take effect. + +The mot.conf file is located in the same folder as the postgres.conf configuration file. + +Read the [General Guidelines](#section14452102715206) section and then review and configure the following sections of the mot.conf file, as needed. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>The topics listed above describe each of the setting sections in the mot.conf file. In addition to the above topics, for an overview of all the aspects of a specific MOT feature \(such as Recovery\), you may refer to the relevant topic of this user manual. For example, the mot.conf file has a Recovery section that contains settings that affect MOT recovery and this is described in the [MOT Recovery](mot-recovery.md) section that is listed above. In addition, for a full description of all aspects of Recovery, you may refer to the [MOT Recovery](mot-recovery.md)section of the Administration chapter of this user manual. Reference links are also provided in each relevant section of the descriptions below. + +The following topics describe each section in the mot.conf file and the settings that it contains, as well as the default value of each. + +## General Guidelines + +The following are general guidelines for editing the mot.conf file. + +- Each setting appears with its default value as follows – + + ``` + # name = value + ``` + +- Blank/white space is acceptable. +- Comments are indicated by placing a number sign \(\#\) anywhere on a line. +- The default values of each setting appear as a comment throughout this file. +- In case a parameter is uncommented and a new value is placed, the new setting is defined. +- Changes to the mot.conf file are applied only at the start or reload of the database server. + +Memory Units are represented as follows – + +- KB – Kilobytes +- MB – Megabytes +- GB – Gigabytes +- TB – Terabytes + +Some memory units are represented as a percentage of the **max\_process\_memory** setting that is configured in **postgresql.conf**. For example – **20%**. + +Time units are represented as follows – + +- us – Microseconds \(or micros\) +- ms – milliseconds \(or millis\) +- s – Seconds \(or secs\) +- min – Minutes \(or mins\) +- h – Hours +- d – Days + +## REDO LOG \(MOT\) + +- **enable\_redo\_log = true** + + Specifies whether to use the Redo Log for durability. See the [MOT Logging – WAL Redo Log](mot-durability.md#section129831140121218)section for more information about redo logs. + +- **enable\_group\_commit = false** + + Specifies whether to use group commit. + + This option is only relevant when openGauss is configured to use synchronous commit, meaning only when the synchronous\_commit setting in postgresql.conf is configured to any value other than off. + + You may refer to [MOT Logging – WAL Redo Log](mot-durability.md#section129831140121218) for more information about the WAL Redo Log. 
+ +- **group\_commit\_size = 16** +- **group\_commit\_timeout = 10 ms** + + This option is only relevant when the MOT engine has been configured to **Synchronous Group Commit** logging. This means that the synchronous\_commit setting in postgresql.conf is configured to True and the enable\_group\_commit parameter in the mot.conf configuration file is configured to True. + + Defines which of the following determines when a group of transactions is recorded in the WAL Redo Log – + + group\_commit\_size **–** The quantity of committed transactions in a group. For example, **16** means that when 16 transactions in the same group have been committed by their client application, then an entry is written to disk in the WAL Redo Log for each of the 16 transactions. + + group\_commit\_timeout **–** A timeout period in ms. For example, **10** means that after 10 ms, an entry is written to disk in the WAL Redo Log for each of the transactions in the same group that have been committed by their client application in the last 10 ms. + + A commit group is closed after either the configured number of transactions has arrived or the configured timeout period since the group was opened has elapsed. After the group is closed, all the transactions in the group wait for a group flush to complete execution and then notify the client that each transaction has ended. + + You may refer to the [MOT Logging Types](mot-durability.md#section125771537134) section for more information about synchronous group commit logging. + + +## CHECKPOINT \(MOT\) + +- **enable\_checkpoint = true** + + Specifies whether to use periodic checkpoint. + + +- **checkpoint\_dir =** + + Specifies the directory in which checkpoint data is to be stored. The default location is in the data folder of each data node. + + +- **checkpoint\_segsize = 16 MB** + + Specifies the segment size used during checkpoint. Checkpoint is performed in segments. When a segment is full, it is serialized to disk and a new segment is opened for the subsequent checkpoint data. + + +- **checkpoint\_workers = 3** + + Specifies the number of workers to use during checkpoint. + + Checkpoint is performed in parallel by several MOT engine workers. The quantity of workers may substantially affect the overall performance of the entire checkpoint operation, as well as the operation of other running transactions. To achieve a shorter checkpoint duration, a larger number of workers should be used, up to the optimal number \(which varies based on the hardware and workload\). However, be aware that if this number is too large, it may negatively impact the execution time of other running transactions, and setting it beyond the optimal number can also lengthen the checkpoint itself. Keep this number as low as possible to minimize the effect on the runtime of other running transactions. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >You may refer to the [MOT Checkpoints](mot-durability.md#section182761535131617) section for more information about configuration settings. + + +## RECOVERY \(MOT\) + +- **checkpoint\_recovery\_workers = 3** + + Specifies the number of workers \(threads\) to use during checkpoint data recovery. Each MOT engine worker runs on its own core and can process a different table in parallel by reading it into memory. For example, while the default is three workers, you might prefer to set this parameter to the number of cores that are available for processing. After recovery, these threads are stopped and killed.
+ + >![](public_sys-resources/icon-note.gif) **NOTE:** + >You may refer to the [MOT Recovery](mot-recovery.md) section for more information about configuration settings. + + +## STATISTICS \(MOT\) + +- **enable\_stats = false** + + Configures periodic statistics for printing. + + +- **print\_stats\_period = 10 minute** + + Configures the time period for printing a summary statistics report. + + +- **print\_full\_stats\_period = 1 hours** + + Configures the time period for printing a full statistics report. + + The following settings configure the various sections included in the periodic statistics report. If none of them are configured, then the statistics report is suppressed. + + +- **enable\_log\_recovery\_stats = false** + + Log recovery statistics contain various Redo Log recovery metrics. + + +- **enable\_db\_session\_stats = false** + + Database session statistics contain transaction events, such commits, rollbacks and so on. + + +- **enable\_network\_stats = false** + + Network statistics contain connection/disconnection events. + + +- **enable\_log\_stats = false** + + Log statistics contain details regarding the Redo Log. + + +- **enable\_memory\_stats = false** + + Memory statistics contain memory-layer details. + + +- **enable\_process\_stats = false** + + Process statistics contain total memory and CPU consumption for the current process. + + +- **enable\_system\_stats = false** + + System statistics contain total memory and CPU consumption for the entire system. + + +- **enable\_jit\_stats = false** + + JIT statistics contain information regarding JIT query compilation and execution. + + +## ERROR LOG \(MOT\) + +- **log\_level = INFO** + + Configures the log level of messages issued by the MOT engine and recorded in the Error log of the database server. Valid values are PANIC, ERROR, WARN, INFO, TRACE, DEBUG, DIAG1 and DIAG2. + +- **Log/COMPONENT/LOGGER=LOG\_LEVEL** + + Configures specific loggers using the syntax described below. + + For example, to configure the TRACE log level for the ThreadIdPool logger in system component, use the following syntax – + + ``` + Log/System/ThreadIdPool=TRACE + ``` + + To configure the log level for all loggers under some component, use the following syntax – + + ``` + Log/COMPONENT=LOG_LEVEL + ``` + + For example – + + ``` + Log/System=DEBUG + ``` + + +## MEMORY \(MOT\) + +- **max\_threads = 1024** + + Configures the maximum number of threads allowed to run in the MOT engine. + + When not using a thread pool, this value restricts the number of sessions that can interact concurrently with MOT tables. This value does not restrict non-MOT sessions. + + When using a thread pool, this value restricts the number of worker threads that can interact concurrently with MOT tables. + + +- **max\_connections = 1024** + + Configures the maximum number of connections allowed to run in the MOT engine. + + This value restricts the number of sessions that can interact concurrently with MOT tables, regardless of the thread-pool configuration. This value does not restrict non-MOT sessions. + + +- **affinity\_mode = fill-physical-first** + + Configures the affinity mode of threads for the user session and internal MOT tasks. + + When a thread pool is used, this value is ignored for user sessions, as their affinity is governed by the thread pool. However, it is still used for internal MOT tasks. 
+ + Valid values are **fill-socket-first**, **equal-per-socket**, **fill-physical-first** and **none** – + + - **Fill-socket-first** attaches threads to cores in the same socket until the socket is full and then moves to the next socket. + - **Equal-per-socket** spreads threads evenly among all sockets. + - **Fill-physical-first** attaches threads to physical cores in the same socket until all physical cores are employed and then moves to the next socket. When all physical cores are used, then the process begins again with hyper-threaded cores. + - **None** disables any affinity configuration and lets the system scheduler determine on which core each thread is scheduled to run. + +- **lazy\_load\_chunk\_directory = true** + + Configures the chunk directory mode that is used for memory chunk lookup. + + **Lazy** mode configures the chunk directory to load parts of it on demand, thus reducing the initial memory footprint \(from 1 GB to 1 MB approximately\). However, this may result in minor performance penalties and errors in extreme conditions of memory distress. In contrast, using a **non-lazy** chunk directory allocates an additional 1 GB of initial memory, produces slightly higher performance and ensures that chunk directory errors are avoided during memory distress. + +- **reserve\_memory\_mode = virtual** + + Configures the memory reservation mode \(either **physical** or **virtual**\). + + Whenever memory is allocated from the kernel, this configuration value is consulted to determine whether the allocated memory is to be resident \(**physical**\) or not \(**virtual**\). This relates primarily to preallocation, but may also affect runtime allocations. For **physical** reservation mode, the entire allocated memory region is made resident by forcing page faults on all pages spanned by the memory region. Configuring **virtual** memory reservation may result in faster memory allocation \(particularly during preallocation\), but may result in page faults during the initial access \(and thus may result in a slight performance hit\) and more sever errors when physical memory is unavailable. In contrast, physical memory allocation is slower, but later access is both faster and guaranteed. + +- **store\_memory\_policy = compact** + + Configures the memory storage policy \(**compact** or **expanding**\). + + When **compact** policy is defined, unused memory is released back to the kernel, until the lower memory limit is reached \(see **min\_mot\_memory** below\). In **expanding** policy, unused memory is stored in the MOT engine for later reuse. A **compact** storage policy reduces the memory footprint of the MOT engine, but may occasionally result in minor performance degradation. In addition, it may result in unavailable memory during memory distress. In contrast, **expanding** mode uses more memory, but results in faster memory allocation and provides a greater guarantee that memory can be re-allocated after being de-allocated. + +- **chunk\_alloc\_policy = auto** + + Configures the chunk allocation policy for global memory. + + MOT memory is organized in chunks of 2 MB each. The source NUMA node and the memory layout of each chunk affect the spread of table data among NUMA nodes, and therefore can significantly affect the data access time. When allocating a chunk on a specific NUMA node, the allocation policy is consulted. 
+ + Available values are **auto**, **local**, **page-interleaved**, **chunk-interleaved** and **native** – + + - **Auto** policy selects a chunk allocation policy based on the current hardware. + - **Local** policy allocates each chunk on its respective NUMA node. + - **Page-interleaved** policy allocates chunks that are composed of interleaved memory 4‑kilobyte pages from all NUMA nodes. + - **Chunk-interleaved** policy allocates chunks in a round robin fashion from all NUMA nodes. + - **Native** policy allocates chunks by calling the native system memory allocator. + +- **chunk\_prealloc\_worker\_count = 8** + + Configures the number of workers per NUMA node participating in memory preallocation. + +- **max\_mot\_global\_memory = 80%** + + Configures the maximum memory limit for the global memory of the MOT engine. + + Specifying a percentage value relates to the total defined by **max\_process\_memory** configured in **postgresql.conf**. + + The MOT engine memory is divided into global \(long-term\) memory that is mainly used to store user data and local \(short-term\) memory that is mainly used by user sessions for local needs. + + Any attempt to allocate memory beyond this limit is denied and an error is reported to the user. Ensure that the sum of **max\_mot\_global\_memory** and **max\_mot\_local\_memory** do not exceed the **max\_process\_memory** configured in **postgresql.conf**. + +- **min\_mot\_global\_memory = 0 MB** + + Configures the minimum memory limit for the global memory of the MOT engine. + + Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. + + This value is used for the preallocation of memory during startup, as well as to ensure that a minimum amount of memory is available for the MOT engine during its normal operation. When using **compact** storage policy \(see **store\_memory\_policy** above\), this value designates the lower limit under which memory is not released back to the kernel, but rather kept in the MOT engine for later reuse. + +- **max\_mot\_local\_memory = 15%** + + Configures the maximum memory limit for the local memory of the MOT engine. + + Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. + + MOT engine memory is divided into global \(long-term\) memory that is mainly used to store user data and local \(short-term\) memory that is mainly used by user session for local needs. + + Any attempt to allocate memory beyond this limit is denied and an error is reported to the user. Ensure that the sum of **max\_mot\_global\_memory** and **max\_mot\_local\_memory** do not exceed the **max\_process\_memory** configured in **postgresql.conf**. + +- **min\_mot\_local\_memory = 0 MB** + + Configures the minimum memory limit for the local memory of the MOT engine. + + Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. + + This value is used for preallocation of memory during startup, as well as to ensure that a minimum amount of memory is available for the MOT engine during its normal operation. When using compact storage policy \(see **store\_memory\_policy** above\), this value designates the lower limit under which memory is not released back to the kernel, but rather kept in the MOT engine for later reuse. + +- **max\_mot\_session\_memory = 0 MB** + + Configures the maximum memory limit for a single session in the MOT engine. 
+ + Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. + + Typically, sessions in the MOT engine can allocate as much local memory as needed, so long as the local memory limit is not exceeded. To prevent a single session from taking too much memory, and thereby denying memory from other sessions, this configuration item is used to restrict small session-local memory allocations \(up to 1,022 KB\). + + Make sure that this configuration item does not affect large or huge session-local memory allocations. + + A value of zero denotes no restriction on any session-local small allocations per session, except for the restriction arising from the local memory allocation limit configured by **max\_mot\_local\_memory**. + +- **min\_mot\_session\_memory = 0 MB** + + Configures the minimum memory reservation for a single session in the MOT engine. + + Specifying a percentage value relates to the total defined by the **max\_process\_memory** configured in **postgresql.conf**. + + This value is used to preallocate memory during session creation, as well as to ensure that a minimum amount of memory is available for the session to perform its normal operation. + + +- **high\_red\_mark\_percent = 90** + + Configures the high red mark for memory allocations. + + This is calculated as a percentage of the maximum for the MOT engine, as configured by **max\_mot\_memory**. The default is 90, meaning 90%. When total memory usage by MOT reaches this level, only destructive operations are permitted. All other operations report an error to the user. + + +- **session\_large\_buffer\_store\_size = 0 MB** + + Configures the large buffer store for sessions. + + When a user session executes a query that requires a lot of memory \(for example, when using many rows\), the large buffer store is used to increase the certainty level that such memory is available and to serve this memory request more quickly. Any memory allocation for a session exceeding 1,022 KB is considered as a large memory allocation. If the large buffer store is not used or is depleted, such allocations are treated as huge allocations that are served directly from the kernel. + + +- **session\_large\_buffer\_store\_max\_object\_size = 0 MB** + + Configures the maximum object size in the large allocation buffer store for sessions. + + Internally, the large buffer store is divided into objects of varying sizes. This value is used to set an upper limit on objects originating from the large buffer store, as well as to determine the internal division of the buffer store into objects of various size. + + This size cannot exceed 1/8 of the **session\_large\_buffer\_store\_size**. If it does, it is adjusted to the maximum possible. + + +- **session\_max\_huge\_object\_size = 1 GB** + + Configures the maximum size of a single huge memory allocation made by a session. + + Huge allocations are served directly from the kernel and therefore are not guaranteed to succeed. + + This value also pertains to global \(meaning not session-related\) memory allocations. + + +## GARBAGE COLLECTION \(MOT\) + +- **enable\_gc = true** + + Specifies whether to use the Garbage Collector \(GC\). + +- **reclaim\_threshold = 512 KB** + + Configures the memory threshold for the garbage collector. + + Each session manages its own list of to-be-reclaimed objects and performs its own garbage collection during transaction commitment. 
This value determines the total memory threshold of objects waiting to be reclaimed, above which garbage collection is triggered for a session. + + In general, the trade-off here is between un-reclaimed objects vs garbage collection frequency. Setting a low value keeps low levels of un-reclaimed memory, but causes frequent garbage collection that may affect performance. Setting a high value triggers garbage collection less frequently, but results in higher levels of un-reclaimed memory. This setting is dependent upon the overall workload. + +- **reclaim\_batch\_size = 8000** + + Configures the batch size for garbage collection. + + The garbage collector reclaims memory from objects in batches, in order to restrict the number of objects being reclaimed in a single garbage collection pass. The intent of this approach is to minimize the operation time of a single garbage collection pass. + + +- **high\_reclaim\_threshold = 8 MB** + + Configures the high memory threshold for garbage collection. + + Because garbage collection works in batches, it is possible that a session may have many objects that can be reclaimed, but which were not. In such situations, in order to prevent garbage collection lists from becoming too bloated, this value is used to continue reclaiming objects within a single pass, even though that batch size limit has been reached, until the total size of the still-waiting-to-be-reclaimed objects is less than this threshold, or there are no more objects eligible for reclamation. + + +## JIT \(MOT\) + +- **enable\_mot\_codegen = true** + + Specifies whether to use JIT query compilation and execution for planned queries. + + JIT query execution enables JIT-compiled code to be prepared for a prepared query during its planning phase. The resulting JIT-compiled function is executed whenever the prepared query is invoked. JIT compilation usually takes place in the form of LLVM. On platforms where LLVM is not natively supported, MOT provides a software-based fallback called Tiny Virtual Machine \(TVM\). + +- **force\_mot\_pseudo\_codegen = false** + + Specifies whether to use TVM \(pseudo-LLVM\) even though LLVM is supported on the current platform. + + On platforms where LLVM is not natively supported, MOT automatically defaults to TVM. + + On platforms where LLVM is natively supported, LLVM is used by default. This configuration item enables the use of TVM for JIT compilation and execution on platforms on which LLVM is supported. + +- **enable\_mot\_codegen\_print = false** + + Specifies whether to print emitted LLVM/TVM IR code for JIT-compiled queries. + +- **mot\_codegen\_limit = 100** + + Limits the number of JIT queries allowed per user session. + + +## STORAGE \(MOT\) + +**allow\_index\_on\_nullable\_column = true** + +Specifies whether it is permitted to define an index over a nullable column. + +## Default MOT.conf + +The minimum settings and configuration specify to point the **Postgresql.conf** file to the location of the **MOT.conf** file – + +``` +Postgresql.conf +mot_config_file = '/tmp/gauss/ MOT.conf' +``` + +Ensure that the value of the max\_process\_memory setting is sufficient to include the global \(data and index\) and local \(sessions\) memory of MOT tables. + +The default content of **MOT.conf** is sufficient to get started. The settings can be optimized later. 
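As a rough budgeting aid, **max\_process\_memory** in postgresql.conf can be sized from the MOT global and local limits plus the regular disk tables buffer and some headroom. The following sketch is illustrative only – the absolute numbers and the file path are examples, not recommendations.

```
# postgresql.conf – illustrative memory budget
# max_process_memory >= MOT global + MOT local + disk tables buffer + headroom
# e.g. 8 GB (MOT data + indexes) + 2 GB (MOT sessions) + 4 GB (buffers) + 2 GB => 16 GB
max_process_memory = 16GB
mot_config_file = '/opt/gauss/mot.conf'
```

Alternatively, the MOT limits can be left as percentages \(for example, the default max\_mot\_global\_memory = 80% and max\_mot\_local\_memory = 15%\), in which case increasing max\_process\_memory scales them proportionally.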
+ diff --git a/content/en/docs/Developerguide/mot-data-ingestion-speed.md b/content/en/docs/Developerguide/mot-data-ingestion-speed.md new file mode 100644 index 0000000000000000000000000000000000000000..adbdf3927c0a961c550a390905425624f3fe7e6d --- /dev/null +++ b/content/en/docs/Developerguide/mot-data-ingestion-speed.md @@ -0,0 +1,18 @@ +# MOT Data Ingestion Speed + +This test simulates realtime data streams arriving from massive IoT, cloud or mobile devices that need to be quickly and continuously ingested into the database on a massive scale. + +- The test involved ingesting large quantities of data, as follows – + - 10 million rows were sent by 500 threads, 2000 rounds, 10 records \(rows\) in each insert command, each record was 200 bytes. + - The client and database were on different machines. Database server – x86 2-socket, 72 cores. + +- Performance Results + + - **Throughput – 10,000** Records/Core or **2** MB/Core. + - **Latency – 2.8ms per a 10 records** bulk insert \(includes client-server networking\) + + >![](public_sys-resources/icon-caution.gif) **CAUTION:** + >We are projecting that multiple additional, and even significant, performance improvements will be made by MOT for this scenario. + >Click [MOT Usage Scenarios](mot-usage-scenarios.md)for more information about large-scale data streaming and data ingestion. + + diff --git a/content/en/docs/Developerguide/mot-deployment.md b/content/en/docs/Developerguide/mot-deployment.md new file mode 100644 index 0000000000000000000000000000000000000000..184b7abd2371db727dc358af646565bf4309497f --- /dev/null +++ b/content/en/docs/Developerguide/mot-deployment.md @@ -0,0 +1,11 @@ +# MOT Deployment + +- The following sections describe various mandatory and optional settings for optimal deployment. + +- **[MOT Server Optimization – x86](mot-server-optimization-x86.md)** + +- **[MOT Server Optimization – ARM Huawei Taishan 2P/4P](mot-server-optimization-arm-huawei-taishan-2p-4p.md)** + +- **[MOT Configuration Settings](mot-configuration-settings.md)** + + diff --git a/content/en/docs/Developerguide/durability.md b/content/en/docs/Developerguide/mot-durability-concepts.md similarity index 51% rename from content/en/docs/Developerguide/durability.md rename to content/en/docs/Developerguide/mot-durability-concepts.md index 0c64fd16399071c73f00d970cec1d7c7eb71b7e9..b7727d75be9e6935e35e147d5451f3984d312ce3 100644 --- a/content/en/docs/Developerguide/durability.md +++ b/content/en/docs/Developerguide/mot-durability-concepts.md @@ -1,31 +1,17 @@ -# Durability - -Durability refers to long-term data protection \(also known as _disk persistence_\). Durability means that stored data does not suffer from any kind of degradation or corruption, so that data is never lost or compromised. Durability ensures that data and the MOT engine are restored to a consistent state after a planned shutdown \(for example, for maintenance\) or an unplanned crash \(for example, a power failure\). - -Memory storage is volatile, meaning that it requires power to maintain the stored information. Disk storage, on the other hand, is non-volatile, meaning that it does not require power to maintain stored information, thus, it can survive a power shutdown. MOT uses both types of storage – it has all data in memory, while persisting transactional changes to disk [by WAL Redo Logging](logging-wal-redo-log.md) and by maintaining frequent periodic [Checkpoints](checkpoints.md) in order to ensure data recovery in case of shutdown. 
- -The user must ensure sufficient disk space for the logging and Checkpointing operations. A separated drive can be used for the Checkpoint to improve performance by reducing disk I/O load. - -You may refer to _MOT Key Technologies__ _for an overview of how durability is implemented in the MOT engine. - -- **[Configuring Durability](configuring-durability.md)** - -- **[Logging – WAL Redo Log](logging-wal-redo-log.md)** - -- **[Checkpoints](checkpoints.md)** - -- **[Recovery](recovery.md)** - -- **[Replication and High Availability](replication-and-high-availability.md)** - -- **[Memory Management](memory-management.md)** - -- **[Vacuum](vacuum.md)** - -- **[Statistics](statistics.md)** - -- **[Monitoring](monitoring.md)** - -- **[Error Messages](error-messages.md)** - - +# MOT Durability Concepts + +Durability refers to long-term data protection \(also known as _disk persistence_\). Durability means that stored data does not suffer from any kind of degradation or corruption, so that data is never lost or compromised. Durability ensures that data and the MOT engine are restored to a consistent state after a planned shutdown \(for example, for maintenance\) or an unplanned crash \(for example, a power failure\). + +Memory storage is volatile, meaning that it requires power to maintain the stored information. Disk storage, on the other hand, is non-volatile, meaning that it does not require power to maintain stored information, thus, it can survive a power shutdown. MOT uses both types of storage – it has all data in memory, while persisting transactional changes to disk [MOT Durability](mot-durability.md) and by maintaining frequent periodic [MOT Checkpoints](mot-durability.md#section182761535131617)in order to ensure data recovery in case of shutdown. + +The user must ensure sufficient disk space for the logging and Checkpointing operations. A separated drive can be used for the Checkpoint to improve performance by reducing disk I/O load. + +You may refer to [MOT Key Technologies](mot-key-technologies.md) section__for an overview of how durability is implemented in the MOT engine. + +MOTs WAL Redo Log and checkpoints enabled durability, as described below – + +- **[MOT Logging – WAL Redo Log Concepts](mot-logging-wal-redo-log-concepts.md)** + +- **[MOT Checkpoint Concepts](mot-checkpoint-concepts.md)** + + diff --git a/content/en/docs/Developerguide/mot-durability.md b/content/en/docs/Developerguide/mot-durability.md new file mode 100644 index 0000000000000000000000000000000000000000..70a686ede505da313a9567332a871b0929088ed6 --- /dev/null +++ b/content/en/docs/Developerguide/mot-durability.md @@ -0,0 +1,136 @@ +# MOT Durability + +Durability refers to long-term data protection \(also known as _disk persistence_\). Durability means that stored data does not suffer from any kind of degradation or corruption, so that data is never lost or compromised. Durability ensures that data and the MOT engine are restored to a consistent state after a planned shutdown \(for example, for maintenance\) or an unplanned crash \(for example, a power failure\). + +Memory storage is volatile, meaning that it requires power to maintain the stored information. Disk storage, on the other hand, is non-volatile, meaning that it does not require power to maintain stored information, thus, it can survive a power shutdown. 
MOT uses both types of storage – it has all data in memory, while persisting transactional changes to disk [MOT Durability](mot-durability.md) and by maintaining frequent periodic [MOT Checkpoints](#section182761535131617)in order to ensure data recovery in case of shutdown. + +The user must ensure sufficient disk space for the logging and Checkpointing operations. A separated drive can be used for the Checkpoint to improve performance by reducing disk I/O load. + +You may refer to the [MOT Key Technologies](mot-key-technologies.md) section__for an overview of how durability is implemented in the MOT engine. + +To configure durability – + +To ensure strict consistency, configure the synchronous\_commit parameter to **On** in the postgres.conf configuration file. + +MOTs WAL Redo Log and Checkpoints enable durability, as described below – + +## MOT Logging – WAL Redo Log + +To ensure Durability, MOT is fully integrated with the openGauss's Write-Ahead Logging \(WAL\) mechanism, so that MOT persists data in WAL records using openGauss's XLOG interface. This means that every addition, update, and deletion to an MOT table’s record is recorded as an entry in the WAL. This ensures that the most current data state can be regenerated and recovered from this non-volatile log. For example, if three new rows were added to a table, two were deleted and one was updated, then six entries would be recorded in the log. + +MOT log records are written to the same WAL as the other records of openGauss disk-based tables. + +MOT only logs an operation at the transaction commit phase. + +MOT only logs the updated delta record in order to minimize the amount of data written to disk. + +During recovery, data is loaded from the last known or a specific Checkpoint; and then the WAL Redo log is used to complete the data changes that occur from that point forward. + +The WAL \(Redo Log\) retains all the table row modifications until a Checkpoint is performed \(as described above\). The log can then be truncated in order to reduce recovery time and to save disk space. + +**Note** – In order to ensure that the log IO device does not become a bottleneck, the log file must be placed on a drive that has low latency. + +## MOT Logging Types + +Two synchronous transaction logging options and one asynchronous transaction logging option are supported \(these are also supported by the standard openGauss disk engine\). MOT also supports synchronous Group Commit logging with NUMA-awareness optimization, as described below. + +According to your configuration, one of the following types of logging is implemented – + +- **Synchronous Redo Logging** + + The **Synchronous Redo Logging** option is the simplest and most strict redo logger. When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL \(Redo Log\), as follows – + + 1. While a transaction is in progress, it is stored in the MOT's memory. + 2. After a transaction finishes and the client application sends a Commit command, the transaction is locked and then written to the WAL Redo Log on the disk. This means that while the transaction log entries are being written to the log, the client application is still waiting for a response. + 3. As soon as the transaction's entire buffer is written to the log, the changes to the data in memory take place and then the transaction is committed. After the transaction has been committed, the client application is notified that the transaction is complete. 
+ + **Summary** + + The **Synchronous Redo Logging** option is the safest and most strict because it ensures total synchronization of the client application and the WAL Redo log entries for each transaction as it is committed; thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful, when it has not yet been persisted to disk. + + The downside of the **Synchronous Redo Logging** option is that it is the slowest logging mechanism of the three options. This is because a client application must wait until all data is written to disk and because of the frequent disk writes \(which typically slow down the database\). + + +- **Group Synchronous Redo Logging** + + The **Group Synchronous Redo Logging** option is very similar to the **Synchronous Redo Logging** option, because it also ensures total durability with absolutely no data loss and total synchronization of the client application and the WAL \(Redo Log\) entries. The difference is that the **Group Synchronous Redo Logging** option writes _groups of transaction _redo entries to the WAL Redo Log on the disk at the same time, instead of writing each and every transaction as it is committed. Using Group Synchronous Redo Logging reduces the amount of disk I/Os and thus improves performance, especially when running a heavy workload. + + The MOT engine performs synchronous Group Commit logging with Non-Uniform Memory Access \(NUMA\)-awareness optimization by automatically grouping transactions according to the NUMA socket of the core on which the transaction is running. + + You may refer to the [NUMA Awareness Allocation and Affinity](numa-awareness-allocation-and-affinity.md) section for more information about NUMA-aware memory access. + + When a transaction commits, a group of entries are recorded in the WAL Redo Log, as follows – + + 1. While a transaction is in progress, it is stored in the memory. The MOT engine groups transactions in buckets according to the NUMA socket of the core on which the transaction is running. This means that all the transactions running on the same socket are grouped together and that multiple groups will be filling in parallel according to the core on which the transaction is running. + + Writing transactions to the WAL is more efficient in this manner because all the buffers from the same socket are written to disk together. + + **Note** – Each thread runs on a single core/CPU which belongs to a single socket and each thread only writes to the socket of the core on which it is running. + + 2. After a transaction finishes and the client application sends a Commit command, the transaction redo log entries are serialized together with other transactions that belong to the same group. + 3. After the configured criteria are fulfilled for a specific group of transactions \(quantity of committed transactions or timeout period as describes in the [REDO LOG \(MOT\)](mot-configuration-settings.md#section361563811235) section\), the transactions in this group are written to the WAL on the disk. This means that while these log entries are being written to the log, the client applications that issued the commit are waiting for a response. + 4. As soon as all the transaction buffers in the NUMA-aware group have been written to the log, all the transactions in the group are performing the necessary changes to the memory store and the clients are notified that these transactions are complete. 
+ + **Summary** + + The **Group Synchronous Redo Logging** option is an extremely safe and strict logging option, because it ensures total synchronization of the client application and the WAL Redo Log entries, thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful when it has not yet been persisted to disk. + + On the one hand, this option has fewer disk writes than the **Synchronous Redo Logging** option, which may mean that it is faster. The downside is that transactions are locked for longer, meaning that they are locked until all the transactions in the same NUMA-aware group have been written to the WAL Redo Log on the disk. + + The benefits of using this option depend on the type of transactional workload. For example, this option benefits systems that have many transactions \(and less so systems that have few transactions, because there are few disk writes anyway\). + + +- **Asynchronous Redo Logging** + + The **Asynchronous Redo Logging** option is the fastest logging method. However, it does not guarantee that no data is lost, meaning that some data that is still in the buffer and has not yet been written to disk may be lost upon a power failure or database crash. When a transaction is committed by a client application, the transaction redo entries are recorded in internal buffers and written to disk at preconfigured intervals. The client application does not wait for the data to be written to disk. It continues to the next transaction. This is what makes asynchronous redo logging the fastest logging method. + + When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL Redo Log, as follows – + + 1. While a transaction is in progress, it is stored in the MOT's memory. + 2. After a transaction finishes and the client application sends a Commit command, the transaction redo entries are written to internal buffers, but are not yet written to disk. Then changes to the MOT data memory take place and the client application is notified that the transaction is committed. + 3. At a preconfigured interval, a redo log thread running in the background collects all the buffered redo log entries and writes them to disk. + + **Summary** + + The Asynchronous Redo Logging option is the fastest logging option because it does not require the client application to wait for data to be written to disk. In addition, it groups the redo entries of many transactions and writes them together, thus reducing the amount of disk I/O that slows down the MOT engine. + + The downside of the Asynchronous Redo Logging option is that it does not ensure that data will not be lost upon a crash or failure. Data that was committed, but was not yet written to disk, is not durable on commit and thus cannot be recovered in case of a failure. The Asynchronous Redo Logging option is most relevant for applications that are willing to sacrifice data recovery \(consistency\) in favor of performance. + + +## Configuring Logging + +Two synchronous transaction logging options and one asynchronous transaction logging option are supported by the standard openGauss disk engine. + +To configure logging – + +1. The determination of whether synchronous or asynchronous transaction logging is performed is configured by the synchronous\_commit **\(On = Synchronous\)** parameter in the postgres.conf configuration file. +2.
Set the enable\_redo\_log parameter to **True** in the REDO LOG section of the mot.conf configuration file. + +If a synchronous mode of transaction logging has been selected \(synchronous\_commit = **On**, as described above\), then the enable\_group\_commit parameter in the mot.conf configuration file determines whether the **Group Synchronous Redo Logging** option or the **Synchronous Redo Logging** option is used. For **Group Synchronous Redo Logging**, you must also define in the mot.conf file which of the following thresholds determine when a group of transactions is recorded in the WAL + +- group\_commit\_size **–** The quantity of committed transactions in a group. For example, **16** means that when 16 transactions in the same group have been committed by a client application, then an entry is written to disk in the WAL Redo Log for all 16 transactions. +- group\_commit\_timeout – A timeout period in ms. For example, **10** means that after 10 ms, an entry is written to disk in the WAL Redo Log for each of the transactions in the same group that have been committed by their client application in the last 10 ms. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >You may refer to the [REDO LOG \(MOT\)](mot-configuration-settings.md#section361563811235) for more information about configuration settings. + + +## MOT Checkpoints + +A Checkpoint is the point in time at which all the data of a table's rows is saved in files on persistent storage in order to create a full durable database image. It is a snapshot of the data at a specific point in time. + +A Checkpoint is required in order to reduce a database's recovery time by shortening the quantity of WAL \(Redo Log\) entries that must be replayed in order to ensure durability. Checkpoint's also reduce the storage space required to keep all the log entries. + +If there were no Checkpoints, then in order to recover a database, all the WAL redo entries would have to be replayed from the beginning of time, which could take days/weeks depending on the quantity of records in the database. Checkpoints record the current state of the database and enable old redo entries to be discarded. + +Checkpoints are essential during recovery scenarios \(especially for a cold start\). First, the data is loaded from the last known or a specific Checkpoint; and then the WAL is used to complete the data changes that occurred since then. + +For example – If the same table row is modified 100 times, then 100 entries are recorded in the log. When Checkpoints are used, then even if a specific table row was modified 100 times, it is recorded in the Checkpoint a single time. After the recording of a Checkpoint, recovery can be performed on the basis of that Checkpoint and only the WAL Redo Log entries that occurred since the Checkpoint need be played. + +**To configure Checkpoints** + +Checkpoint configuration is performed in the CHECKPOINT; section of the mot.conf file. You may refer to the [MOT Checkpoints](#section182761535131617) section of this user manual for a description of these configuration parameters. + +>![](public_sys-resources/icon-caution.gif) **CAUTION:** +>In a production deployment, the value must be TRUE \#enable\_Checkpoint = true. A FALSE value can only be used for testing. 
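
The settings described above can be combined as in the following minimal sketch. The parameter names are the ones documented in the REDO LOG \(MOT\) and CHECKPOINT \(MOT\) configuration sections; the values and section comments are illustrative examples only, not recommendations, and the exact layout of your mot.conf may differ.

```
# Envelope configuration file – synchronous (on) vs. asynchronous (off) transaction logging
synchronous_commit = on

# mot.conf – REDO LOG settings (illustrative values)
enable_redo_log = true
enable_group_commit = true      # true selects Group Synchronous Redo Logging
group_commit_size = 16          # write the group after 16 committed transactions
group_commit_timeout = 10       # or after 10 ms

# mot.conf – CHECKPOINT settings (illustrative values)
enable_checkpoint = true        # must remain true in a production deployment
checkpoint_dir =                # empty: use the data folder of each data node
checkpoint_segsize = 16 MB
checkpoint_workers = 3
```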
+ diff --git a/content/en/docs/Developerguide/mot-error-messages.md b/content/en/docs/Developerguide/mot-error-messages.md index ba30127c7bcfc46647e70812312cf3ede76f3e54..caba01f921b884ce5ac7c2bbdce396d29c45aa2f 100644 --- a/content/en/docs/Developerguide/mot-error-messages.md +++ b/content/en/docs/Developerguide/mot-error-messages.md @@ -1,14 +1,417 @@ -# MOT Error Messages - -Errors may be caused by a variety of scenarios. All errors are logged in the database server log file. In addition, user-related errors are returned to the user as part of the response to the query or another action. UNCLEAR GGG - -- Errors reported in the Server log include – Function, Entity, Context, Error message, Error description and Severity. -- Errors reported to users are translated into standard PG error codes and reported to the user. IS THIS CORRECT? GGG - -The following lists the error codes, error messages and their descriptions. Only error messages and IR WHAT IS THIS? GGG present error descriptions, which are written to the log and returned to the user. - -- **[Errors Written the Log File](errors-written-the-log-file.md)** - -- **[Errors Returned to the User](errors-returned-to-the-user.md)** - - +# MOT Error Messages + +Errors may be caused by a variety of scenarios. All errors are logged in the database server log file. In addition, user-related errors are returned to the user as part of the response to the query, transaction or stored procedure execution or to database administration action. + +- Errors reported in the Server log include – Function, Entity, Context, Error message, Error description and Severity. +- Errors reported to users are translated into standard PostgreSQL error codes and may consist of an MOT-specific message and description. + +The following lists the error messages, error descriptions and error codes. The error code is actually an internal code and not logged or returned to users. + +## Errors Written the Log File + +All errors are logged in the database server log file. The following lists the errors that are written to the database server log file and are **not** returned to the user. The log is located in the data folder and named **postgresql-DATE-TIME.log**. + +**Table 1** Errors Written Only to the Log File + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Message in the Log | Error Internal Code |
| :--- | :--- |
| Error code denoting success | MOT_NO_ERROR 0 |
| Out of memory | MOT_ERROR_OOM 1 |
| Invalid configuration | MOT_ERROR_INVALID_CFG 2 |
| Invalid argument passed to function | MOT_ERROR_INVALID_ARG 3 |
| System call failed | MOT_ERROR_SYSTEM_FAILURE 4 |
| Resource limit reached | MOT_ERROR_RESOURCE_LIMIT 5 |
| Internal logic error | MOT_ERROR_INTERNAL 6 |
| Resource unavailable | MOT_ERROR_RESOURCE_UNAVAILABLE 7 |
| Unique violation | MOT_ERROR_UNIQUE_VIOLATION 8 |
| Invalid memory allocation size | MOT_ERROR_INVALID_MEMORY_SIZE 9 |
| Index out of range | MOT_ERROR_INDEX_OUT_OF_RANGE 10 |
| Error code unknown | MOT_ERROR_INVALID_STATE 11 |

## Errors Returned to the User

The following lists the errors that are written to the database server log file and are returned to the user.

MOT returns PG standard error codes to the envelope using a Return Code \(RC\). Some RCs cause the generation of an error message to the user who is interacting with the database.

The PG code \(described below\) is returned internally by MOT to the database envelope, which reacts to it according to standard PG behavior.

>![](public_sys-resources/icon-note.gif) **NOTE:**
>%s, %u and %lu in the message are replaced by relevant error information, such as the query, the table name or other information.
>- %s – String
>- %u – Number
>- %lu – Number

**Table 2** Errors Returned to the User and Logged to the Log File

| Short and Long Description Returned to the User | PG Code | Internal Error Code |
| :--- | :--- | :--- |
| Success. Denotes success. | ERRCODE_SUCCESSFUL_COMPLETION | RC_OK = 0 |
| Failure. Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_ERROR = 1 |
| Unknown error has occurred. Denotes aborted operation. | ERRCODE_FDW_ERROR | RC_ABORT |
| Column definition of %s is not supported. Column type %s is not supported yet. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_UNSUPPORTED_COL_TYPE |
| Column definition of %s is not supported. Column type Array of %s is not supported yet. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_UNSUPPORTED_COL_TYPE_ARR |
| Column size %d exceeds max tuple size %u. Column definition of %s is not supported. | ERRCODE_FEATURE_NOT_SUPPORTED | RC_EXCEEDS_MAX_ROW_SIZE |
| Column name %s exceeds max name size %u. Column definition of %s is not supported. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_COL_NAME_EXCEEDS_MAX_SIZE |
| Column size %d exceeds max size %u. Column definition of %s is not supported. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_COL_SIZE_INVLALID |
| Cannot create table. Cannot add column %s, as the number of declared columns exceeds the maximum declared columns. | ERRCODE_FEATURE_NOT_SUPPORTED | RC_TABLE_EXCEEDS_MAX_DECLARED_COLS |
| Cannot create index. Total column size is greater than maximum index size %u. | ERRCODE_FDW_KEY_SIZE_EXCEEDS_MAX_ALLOWED | RC_INDEX_EXCEEDS_MAX_SIZE |
| Cannot create index. Total number of indexes for table %s is greater than the maximum number of indexes allowed %u. | ERRCODE_FDW_TOO_MANY_INDEXES | RC_TABLE_EXCEEDS_MAX_INDEXES |
| Cannot execute statement. Maximum number of DDLs per transaction reached the maximum %u. | ERRCODE_FDW_TOO_MANY_DDL_CHANGES_IN_TRANSACTION_NOT_ALLOWED | RC_TXN_EXCEEDS_MAX_DDLS |
| Unique constraint violation. Duplicate key value violates unique constraint \"%s\". Key %s already exists. | ERRCODE_UNIQUE_VIOLATION | RC_UNIQUE_VIOLATION |
| Table \"%s\" does not exist. | ERRCODE_UNDEFINED_TABLE | RC_TABLE_NOT_FOUND |
| Index \"%s\" does not exist. | ERRCODE_UNDEFINED_TABLE | RC_INDEX_NOT_FOUND |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_FOUND |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_NOT_FOUND |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_DELETED |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INSERT_ON_EXIST |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INDEX_RETRY_INSERT |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INDEX_DELETE |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_NOT_VISIBLE |
| Memory is temporarily unavailable. | ERRCODE_OUT_OF_LOGICAL_MEMORY | RC_MEMORY_ALLOCATION_ERROR |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_ILLEGAL_ROW_STATE |
| Null constraint violated. NULL value cannot be inserted into non-null column %s at table %s. | ERRCODE_FDW_ERROR | RC_NULL_VIOLATION |
| Critical error. Critical error: %s. | ERRCODE_FDW_ERROR | RC_PANIC |
| A checkpoint is in progress – cannot truncate table. | ERRCODE_FDW_OPERATION_NOT_SUPPORTED | RC_NA |
| Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_MAX_VALUE |
| \<recovery message\> | ERRCODE_CONFIG_FILE_ERROR | |
| \<recovery message\> | ERRCODE_INVALID_TABLE_DEFINITION | |
| Memory engine – Failed to perform commit prepared. | ERRCODE_INVALID_TRANSACTION_STATE | |
| Invalid option \<option name\> | ERRCODE_FDW_INVALID_OPTION_NAME | |
| Invalid memory allocation request size. | ERRCODE_INVALID_PARAMETER_VALUE | |
| Memory is temporarily unavailable. | ERRCODE_OUT_OF_LOGICAL_MEMORY | |
| Could not serialize access due to concurrent update. | ERRCODE_T_R_SERIALIZATION_FAILURE | |
| Alter table operation is not supported for memory table. Cannot create MOT tables while incremental checkpoint is enabled. Re-index is not supported for memory tables. | ERRCODE_FDW_OPERATION_NOT_SUPPORTED | |
| Allocation of table metadata failed. | ERRCODE_OUT_OF_MEMORY | |
| Database with OID %u does not exist. | ERRCODE_UNDEFINED_DATABASE | |
| Value exceeds maximum precision: %d. | ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE | |
| You have reached a maximum logical capacity %lu of allowed %lu. | ERRCODE_OUT_OF_LOGICAL_MEMORY | |

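
As an illustration of how one of these errors reaches a client application, the following sketch inserts a duplicate primary key into a hypothetical MOT table. The foreign server name \(mot\_server\), the table definition and the exact wording of the returned message are assumptions and may differ in your installation.

```
-- Hypothetical MOT table (MOT tables are created through the FDW interface)
CREATE FOREIGN TABLE customer (id INT PRIMARY KEY, name VARCHAR(64)) SERVER mot_server;

INSERT INTO customer VALUES (1, 'first');
INSERT INTO customer VALUES (1, 'second');
-- Expected failure, reported with the standard PG code ERRCODE_UNIQUE_VIOLATION
-- (internal code RC_UNIQUE_VIOLATION):
--   "Duplicate key value violates unique constraint ... Key ... already exists."
```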
+ diff --git a/content/en/docs/Developerguide/mot-external-support-tools.md b/content/en/docs/Developerguide/mot-external-support-tools.md new file mode 100644 index 0000000000000000000000000000000000000000..b88e74f5b41bd68a4f8867314f6d7941a1821890 --- /dev/null +++ b/content/en/docs/Developerguide/mot-external-support-tools.md @@ -0,0 +1,28 @@ +# MOT External Support Tools + +The following external openGauss tools have been modified in order to support MOT. Make sure to use the most recent version of each. An overview describing MOT-related usage is provided below. For a full description of these tools and their usage, refer to the openGauss Tools Reference document. + +## gs\_ctl \(Full and Incremental\) + +This tool is used to create a standby server from a primary server, as well as to synchronize a server with another copy of the same server after their timelines have diverged. + +At the end of the operation, the latest MOT checkpoint is fetched by the tool, taking into consideration the **checkpoint\_dir** configuration setting value. + +The checkpoint is fetched from the source server's **checkpoint\_dir** to the destination server's **checkpoint\_dir**. + +Currently, MOT does not support an incremental checkpoint. Therefore, the gs\_ctl incremental build does not work in an incremental manner for MOT, but rather in FULL mode. The Postgres \(disk-tables\) incremental build can still be done incrementally. + +## gs\_basebackup + +gs\_basebackup is used to prepare base backups of a running server, without affecting other database clients. + +The MOT checkpoint is fetched at the end of the operation as well. However, the checkpoint's location is taken from **checkpoint\_dir** in the source server and is transferred to the data directory of the source in order to back it up correctly. + +## gs\_dump + +gs\_dump is used to export the database schema and data to a file. It also supports MOT tables. + +## gs\_restore + +gs\_restore is used to import the database schema and data from a file. It also supports MOT tables. + diff --git a/content/en/docs/Developerguide/mot-features-and-benefits.md b/content/en/docs/Developerguide/mot-features-and-benefits.md new file mode 100644 index 0000000000000000000000000000000000000000..b1adb4a0cb564d9ecb89e2aefaa4bc7c517457bb --- /dev/null +++ b/content/en/docs/Developerguide/mot-features-and-benefits.md @@ -0,0 +1,16 @@ +# MOT Features and Benefits + +MOT provide users with significant benefits in performance \(query and transaction latency\), scalability \(throughput and concurrency\) and in some cases cost \(high resource utilization\) – + +- **Low Latency –** Provides fast query and transaction response time +- **High Throughput –** Supports spikes and constantly high user concurrency +- **High Resource Utilization –** Utilizes hardware to its full extent + +Using MOT, applications are able to achieve more 2.5 to 4 times \(2.5x – 4x\) higher throughput. For example, in our TPC-C benchmarks \(interactive transactions and synchronous logging\) performed both on Huawei Taishan Kunpeng-based \(ARM\) servers and on Dell x86 Intel Xeon-based servers, MOT provides throughput gains that vary from 2.5x on a 2-socket server to 3.7x on a 4-socket server, reaching 4.8M \(million\) tpmC on an ARM 4-socket 256-cores server. + +The lower latency provided by MOT reduces transaction speed by 3x to 5.5x, as observed in TPC-C benchmarks. 
+ +Additionally, MOT enables extremely high utilization of server resources when running under high load and contention, which is a well-known problem for all leading industry databases. Using MOT, utilization reaches 99% on 4-socket server, compared with much lower utilization observed when testing other industry leading databases. + +This abilities are especially evident and important on modern many-core servers. + diff --git a/content/en/docs/Developerguide/mot-hardware.md b/content/en/docs/Developerguide/mot-hardware.md new file mode 100644 index 0000000000000000000000000000000000000000..82c1f54f68fe357e13d3c731cec942f152c5b6d0 --- /dev/null +++ b/content/en/docs/Developerguide/mot-hardware.md @@ -0,0 +1,10 @@ +# MOT Hardware + +The tests were performed on servers with the following configuration and with 10Gbe networking – + +- ARM64/Kunpeng 920-based 2-socket servers, model Taishan 2280 v2 \(total 128 Cores\), 800GB RAM, 1TB NVMe disk. For a detailed server specification, see – [https://e.huawei.com/en/products/servers/taishan-server/taishan-2480-v2](https://e.huawei.com/en/products/servers/taishan-server/taishan-2480-v2) OS: openEuler +- ARM64/Kunpeng 960-based 4-socket servers, model Taishan 2480 v2 \(total 256 Cores\), 512GB RAM, 1TB NVMe disk. For a detailed server specification, see – [https://e.huawei.com/en/products/servers/taishan-server/taishan-2480-v2](https://e.huawei.com/en/products/servers/taishan-server/taishan-2480-v2) OS: openEuler +- x86-based Dell servers, with 2-sockets of Intel Xeon Gold 6154 CPU @ 3GHz with 18 Cores \(72 Cores, with hyper-threading=on\), 1TB RAM, 1TB SSD OS: CentOS 7.6 +- x86-based SuperMicro server, with 8-sockets of Intel\(R\) Xeon\(R\) CPU E7-8890 v4 @ 2.20GHz 24 cores \(total 384 Cores, with hyper-threading=on\), 1TB RAM, 1.2TB SSD \(Seagate 1200 SSD 200GB, SAS 12Gb/s\). OS: Ubuntu 16.04.2 LTS +- x86-based Huawei server, with 4-sockets of Intel\(R\) Xeon\(R\) CPU E7-8890 v4 2.2Ghz \(total 96 Cores, with hyper-threading=on\), 512GB RAM, SSD 2TD OS: CentOS 7.6 + diff --git a/content/en/docs/Developerguide/mot-high-throughput.md b/content/en/docs/Developerguide/mot-high-throughput.md new file mode 100644 index 0000000000000000000000000000000000000000..0a86059c91a53658ffd53fa06e870053f4a5a41c --- /dev/null +++ b/content/en/docs/Developerguide/mot-high-throughput.md @@ -0,0 +1,64 @@ +# MOT High Throughput + +The following shows the results of various MOT table high throughput tests. + +## ARM/Kunpeng 2-Socket 128 Cores + +**Performance** + +The following figure shows the results of testing the TPC-C benchmark on a Huawei ARM/Kunpeng server that has two sockets and 128 cores. + +Four types of tests were performed – + +- Two tests were performed on MOT tables and another two tests were performed on openGauss disk-based tables. +- Two of the tests were performed on a Single node \(without high availability\), meaning that no replication was performed to a secondary node. The other two tests were performed on Primary/Secondary nodes \(with high availability\), meaning that data written to the primary node was replicated to a secondary node. + +MOT tables are represented in orange and disk-based tables are represented in blue. 
+ +**Figure 1** ARM/Kunpeng 2-Socket 128 Cores – Performance Benchmarks +![](figures/arm-kunpeng-2-socket-128-cores-performance-benchmarks.png "arm-kunpeng-2-socket-128-cores-performance-benchmarks") + +The results showed that: + +- As expected, the performance of MOT tables is significantly greater than of disk-based tables in all cases. +- For a Single Node – 3.8M tpmC for MOT tables versus 1.5M tpmC for disk-based tables +- For a Primary/Secondary Node – 3.5M tpmC for MOT tables versus 1.2M tpmC for disk-based tables +- For production grade \(high-availability\) servers \(Primary/Secondary Node\) that require replication, the benefit of using MOT tables is even more significant than for a Single Node \(without high-availability, meaning no replication\). +- The MOT replication overhead of a Primary/Secondary High Availability scenario is 7% on ARM/Kunpeng and 2% on x86 servers, as opposed to the overhead of disk tables of 20% on ARM/Kunpeng and 15% on x86 servers. + +**Performance per CPU core** + +The following figure shows the TPC-C benchmark performance/throughput results per core of the tests performed on a Huawei ARM/Kunpeng server that has two sockets and 128 cores. The same four types of tests were performed \(as described above\). + +**Figure 2** ARM/Kunpeng 2-Socket 128 Cores – Performance per Core Benchmarks +![](figures/arm-kunpeng-2-socket-128-cores-performance-per-core-benchmarks.png "arm-kunpeng-2-socket-128-cores-performance-per-core-benchmarks") + +The results showed that as expected, the performance of MOT tables is significantly greater per core than of disk-based tables in all cases. It also shows that for production grade \(high-availability\) servers \(Primary/Secondary Node\) that require replication, the benefit of using MOT tables is even more significant than for a Single Node \(without high-availability, meaning no replication\). + +## ARM/Kunpeng 4-Socket 256 Cores + +The following demonstrates MOT's excellent concurrency control performance by showing the tpmC per quantity of connections. + +**Figure 3** ARM/Kunpeng 4-Socket 256 Cores – Performance Benchmarks +![](figures/arm-kunpeng-4-socket-256-cores-performance-benchmarks.png "arm-kunpeng-4-socket-256-cores-performance-benchmarks") + +The results show that performance increases significantly even when there are many cores and that peak performance of 4.8M tpmC is achieved at 768 cores. + +## x86-based Servers + +- **8-Socket 384 Cores** + +The following demonstrates MOT’s excellent concurrency control performance by comparing the tpmC per quantity of connections between disk-based tables and MOT. This test was performed on an X86 server with eight sockets and 384 cores. The orange represents the results of the MOT table. + +**Figure 4** x86 8-Socket 384 Cores – Performance Benchmarks +![](figures/x86-8-socket-384-cores-performance-benchmarks.png "x86-8-socket-384-cores-performance-benchmarks") + +The results show that MOT tables significantly outperform disk-based tables and have very highly efficient performance per core on a 386 core server, reaching over 3M tpmC / core. + +- **4-Socket 96 Cores** + +3.9 million tpmC was achieved by MOT on this 4-socket 96 cores server. The following figure shows a highly efficient MOT table performance per core reaching 40,000 tpmC / core. 
+ +**Figure 5** 4-Socket 96 Cores – Performance Benchmarks +![](figures/4-socket-96-cores-performance-benchmarks.png "4-socket-96-cores-performance-benchmarks") + diff --git a/content/en/docs/Developerguide/mot-indexes.md b/content/en/docs/Developerguide/mot-indexes.md new file mode 100644 index 0000000000000000000000000000000000000000..d6b5241a6c2d127707678771d0124004e22c4b09 --- /dev/null +++ b/content/en/docs/Developerguide/mot-indexes.md @@ -0,0 +1,34 @@ +# MOT Indexes + +MOT Index is a lock-free index based on state-of-the-art Masstree\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\], which is a fast and scalable Key Value \(KV\) store for multicore systems, implemented as tries of B+ trees. It achieves excellent performance on many-core servers and high concurrent workloads. It uses various advanced techniques, such as an optimistic lock approach, cache-awareness and memory prefetching. + +After comparing various state-of-the-art solutions, such as \[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\],\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\], we chose Masstree\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] for the index because it demonstrated the best overall performance for point queries, iterations and modifications. Masstree is a combination of tries and a B+ tree that is implemented to carefully exploit caching, prefetching, optimistic navigation and fine-grained locking. It is optimized for high contention and adds various optimizations to its predecessors, such as OLFIT\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]. However, the downside of a Masstree index is its higher memory consumption. While row data consumes the same memory size, the memory per row per each index \(primary or secondary\) is higher on average by 16 bytes – 29 bytes in the lock‑based B-Tree used in disk-based tables vs. 45 bytes in MOT's Masstree. + +Our empirical experiments showed that the combination of the mature lock-free Masstree\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] implementation and our robust improvements to Silo\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] have provided exactly what we needed in that regard. + +Another challenge was making an optimistic insertion into a table with multiple indexes. + +The Masstree index is at the core of MOT memory layout for data and index management. Our team enhanced and significantly improved Masstree and submitted some of the key contributions to the Masstree open source. These improvements include – + +- Dedicated memory pools per index – Efficient allocation and fast index drop +- Global GC for Masstree – Fast, on-demand memory reclamation +- Masstree iterator implementation with access to an insertion key +- ARM architecture support + +We contributed our Masstree index improvements to the Masstree open-source implementation, which can be found here – [https://github.com/kohler/masstree-beta](https://github.com/kohler/masstree-beta). + +MOT's main innovation was to enhance the original Masstree data structure and algorithm, which did not support Non-Unique Indexes \(as a Secondary index\). You may refer to the [Non-unique Indexes](#section12297174320129) section for the design details. + +MOT supports both Primary, Secondary and Keyless indexes \(subject to the limitations specified in the **Unsupported Index DDLs and Index**section\). + +## Non-unique Indexes + +A non-unique index may contain multiple rows with the same key. 
Non-unique indexes are used solely to improve query performance by maintaining a sorted order of data values that are used frequently. For example, a database may use a non-unique index to group all people from the same family. However, the Masstree\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] data structure implementation does not allow the mapping of multiple objects to the same key. Our solution for enabling the creation of non-unique indexes \(as shown in the figure below\) is to add a symmetry-breaking suffix to the key, which maps the row. This added suffix is the pointer to the row itself, which has a constant size of 8 bytes and a value that is unique to the row. When inserting into a non-unique index, the insertion of the sentinel always succeeds, which enables the row allocated by the executing transaction to be used. This approach also enable MOT to have a fast, reliable, order-based iterator for a non-unique index. + +**Figure 1** Non-unique Indexes +![](figures/non-unique-indexes.png "non-unique-indexes") + +The structure of an MOT table T that has three rows and two indexes is depicted in the figure above. The rectangles represent data rows, and the indexes point to sentinels \(the elliptic shapes\) which point to the rows. The sentinels are inserted into unique indexes with a key and into non-unique indexes with a key + a suffix. The sentinels facilitate maintenance operations so that the rows can be replaced without touching the index data structure. In addition, there are various flags and a reference count embedded in the sentinel in order to facilitate optimistic inserts. + +When searching a non-unique secondary index, the required key \(for example, the family name\) is used. The fully concatenated key is only used for insert and delete operations. Insert and delete operations always get a row as a parameter, thereby making it possible to create the entire key and to use it in the execution of the deletion or the insertion of the specific row for the index. + diff --git a/content/en/docs/Developerguide/mot-introduction.md b/content/en/docs/Developerguide/mot-introduction.md index c7531109a6432aa7b873227ba71523520abf0404..c62f804273c552bb0df7e6dec3c3d39529b0a757 100644 --- a/content/en/docs/Developerguide/mot-introduction.md +++ b/content/en/docs/Developerguide/mot-introduction.md @@ -1,22 +1,22 @@ -# MOT Introduction - -openGauss introduces Memory-Optimized Tables \(MOT\) storage engine – a transactional row-based store \(rowstore\), that is optimized for many-core and large memory servers. MOT is a state-of-the-art production‑grade feature \(beta release\) of openGauss database that provides greater performance for transactional workloads. MOT is fully ACID compliant and includes strict durability and high availability support. Businesses can leverage MOT for mission-critical, performance-sensitive Online Transaction Processing \(OLTP\) applications in order to achieve high performance, high throughput, low and predictable latency and high utilization of many‑core servers. MOT is especially suited to leverage and scale-up when run on modern servers with multiple sockets and many-core processors, such as Huawei Taishan servers with ARM/Kunpeng processors and x86-based Dell or similar servers. 
- -**Figure 1** Memory-Optimized Storage Engine Within openGauss -![](figures/memory-optimized-storage-engine-within-opengauss.png "memory-optimized-storage-engine-within-opengauss") - -[Figure 1](#fig136991747161510) presents the Memory-Optimized Storage Engine component \(in green\) of openGauss database and is responsible for managing MOT and transactions. - -MOT tables are created side-by-side regular disk-based tables. MOT’s effective design enables almost full SQL coverage and support for a full database feature-set, such as stored procedures and user-defined functions \(excluding the features listed in [SQL Coverage and Limitations](sql-coverage-and-limitations.md) section\). - -With data and indexes stored totally in-memory, a Non-Uniform Memory Access \(NUMA\)-aware design, algorithms that eliminate lock and latch contention and query native compilation, MOT provides faster data access and more efficient transaction execution. - -MOT’s effective almost lock-free design and highly tuned implementation enable exceptional near-linear throughput scale-up on many-core servers – probably the best in the industry. - -Memory-Optimized Tables are fully ACID compliant, as follows - -- **Atomicity –** An atomic transaction is an indivisible series of database operations that either all occur or none occur after a transaction has been completed \(committed or aborted, respectively\). -- **Consistency –** Every transaction leaves the database in a consistent \(data integrity\) state. -- **Isolation –**Transactions cannot interfere with each other. MOT supports repeatable‑reads and read-committed isolation levels. In the next release, MOT will also support serializable isolation. See the [OCC vs 2PL Differences by Example](occ-vs-2pl-differences-by-example.md) section for more information. -- **Durability –** The effects of successfully completed \(committed\) transactions must persist despite crashes and failures. MOT is fully integrated with the WAL-based logging of openGauss. Both synchronous and asynchronous logging options are supported. MOT also uniquely supports synchronous + group commit with NUMA-awareness optimization. See the [Durability](durability-0.md) section for more information. - +# MOT Introduction + +OpenGauss introduces Memory-Optimized Tables \(MOT\) storage engine – a transactional row-based store \(rowstore\), that is optimized for many-core and large memory servers. MOT is a state-of-the-art production-grade feature \(Beta release\) of the openGauss database that provides greater performance for transactional workloads. MOT is fully ACID compliant and includes strict durability and high availability support. Businesses can leverage MOT for mission-critical, performance-sensitive Online Transaction Processing \(OLTP\) applications in order to achieve high performance, high throughput, low and predictable latency and high utilization of many‑core servers. MOT is especially suited to leverage and scale-up when run on modern servers with multiple sockets and many-core processors, such as Huawei Taishan servers with ARM/Kunpeng processors and x86-based Dell or similar servers. + +**Figure 1** Memory-Optimized Storage Engine Within openGauss +![](figures/memory-optimized-storage-engine-within-opengauss.png "memory-optimized-storage-engine-within-opengauss") + +[Figure 1](#fig16939193016363) presents the Memory-Optimized Storage Engine component \(in green\) of openGauss database and is responsible for managing MOT and transactions. 
+ +MOT tables are created side-by-side regular disk-based tables. MOT's effective design enables almost full SQL coverage and support for a full database feature-set, such as stored procedures and user‑defined functions \(excluding the features listed in [MOT SQL Coverage and Limitations](mot-sql-coverage-and-limitations.md) section\). + +With data and indexes stored totally in-memory, a Non-Uniform Memory Access \(NUMA\)-aware design, algorithms that eliminate lock and latch contention and query native compilation, MOT provides faster data access and more efficient transaction execution. + +MOT's effective almost lock-free design and highly tuned implementation enable exceptional near-linear throughput scale-up on many-core servers – probably the best in the industry. + +Memory-Optimized Tables are fully ACID compliant, as follows: + +- **Atomicity –** An atomic transaction is an indivisible series of database operations that either all occur or none occur after a transaction has been completed \(committed or aborted, respectively\). +- **Consistency –** Every transaction leaves the database in a consistent \(data integrity\) state. +- **Isolation –** Transactions cannot interfere with each other. MOT supports repeatable‑reads and read-committed isolation levels. In the next release, MOT will also support serializable isolation. See the [MOT Isolation Levels](mot-isolation-levels.md) section for more information. +- **Durability –** The effects of successfully completed \(committed\) transactions must persist despite crashes and failures. MOT is fully integrated with the WAL-based logging of openGauss. Both synchronous and asynchronous logging options are supported. MOT also uniquely supports synchronous + group commit with NUMA-awareness optimization. See the [MOT Durability Concepts](mot-durability-concepts.md) section for more information. + diff --git a/content/en/docs/Developerguide/isolation-levels.md b/content/en/docs/Developerguide/mot-isolation-levels.md similarity index 36% rename from content/en/docs/Developerguide/isolation-levels.md rename to content/en/docs/Developerguide/mot-isolation-levels.md index cc022205ee19b63024a5e13f629c8829b7a88b76..cfdb0dd3974d5dd6f5fbdba5ebc0c2682771d948 100644 --- a/content/en/docs/Developerguide/isolation-levels.md +++ b/content/en/docs/Developerguide/mot-isolation-levels.md @@ -1,114 +1,112 @@ -# Isolation Levels - -Even though MOT is fully ACID-compliant \(as described in the section\), not all isolation levels are supported in this first release. - -The following table describes all isolation levels, as well as what is and what is not supported by MOT. - -**Table 1** Isolation Levels - - -

| Isolation Level | Description |
| :--- | :--- |
| READ UNCOMMITTED | Not supported by MOT. |
| READ COMMITTED | Supported by MOT. The READ COMMITTED isolation level guarantees that any data that is read was already committed when it was read. It simply restricts the reader from seeing any intermediate, uncommitted or dirty reads. Data is free to be changed after it has been read, so READ COMMITTED does not guarantee that if the transaction re-issues the read, the same data will be found. |
| SNAPSHOT | Not supported by MOT. The SNAPSHOT isolation level makes the same guarantees as SERIALIZABLE, except that concurrent transactions can modify the data. Instead, it forces every reader to see its own version of the world (its own snapshot). This makes it very easy to program, plus it is very scalable, because it does not block concurrent updates. However, in many implementations this isolation level requires higher server resources. |
| REPEATABLE READ | Supported by MOT. REPEATABLE READ is a higher isolation level that (in addition to the guarantees of the READ COMMITTED isolation level) guarantees that any data that is read cannot change. If a transaction reads the same data again, it will find the same previously read data in place, unchanged and available to be read. Because of the optimistic model, concurrent transactions are not prevented from updating rows read by this transaction. Instead, at commit time this transaction validates that the REPEATABLE READ isolation level has not been violated. If it has, this transaction is rolled back and must be retried. |
| SERIALIZABLE | Not supported by MOT. Serializable isolation makes an even stronger guarantee. In addition to everything that the REPEATABLE READ isolation level guarantees, it also guarantees that no new data can be seen by a subsequent read. It is named SERIALIZABLE because the isolation is so strict that it is almost a bit like having the transactions run in series rather than concurrently. |

The following table shows the concurrency side effects enabled by the different isolation levels.

**Table 2** Concurrency Side Effects Enabled by Isolation Levels

| Isolation Level | Dirty Read | Non-repeatable Read | Phantom |
| :--- | :--- | :--- | :--- |
| READ UNCOMMITTED | Yes | Yes | Yes |
| READ COMMITTED | No | Yes | Yes |
| REPEATABLE READ | No | No | Yes |
| SNAPSHOT | No | No | No |
| SERIALIZABLE | No | No | No |
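
For example, a client application can explicitly request one of the supported levels when working with MOT tables. The statement below uses the standard openGauss transaction syntax; the table name is hypothetical.

```
-- Run a transaction at a supported isolation level (REPEATABLE READ)
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM account WHERE id = 1;                 -- 'account' is a hypothetical MOT table
UPDATE account SET balance = balance - 10 WHERE id = 1;
COMMIT;  -- with the optimistic model, commit-time validation may abort and require a retry
```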

- -In the near future release, openGauss MOT will also support both SNAPSHOT and SERIALIZABLE isolation levels. - +
+ +In the near future release, openGauss MOT will also support both SNAPSHOT and SERIALIZABLE isolation levels. + diff --git a/content/en/docs/Developerguide/mot-key-technologies.md b/content/en/docs/Developerguide/mot-key-technologies.md index dbf27e2e1941d27039557164f59ada8f5edb3225..8251d85415dace767f915ed5a5ee2b5a6ca731fb 100644 --- a/content/en/docs/Developerguide/mot-key-technologies.md +++ b/content/en/docs/Developerguide/mot-key-technologies.md @@ -1,19 +1,20 @@ -# MOT Key Technologies - -The following key MOT technologies enable its benefits - -- **Memory Optimized Data Structures** – With the objective of achieving optimal high concurrent throughput and predictable low latency, all data and indexes are in memory, no intermediate page buffers are used and minimal, short-duration locks are used. Data structures and all algorithms have been specialized and optimized for in-memory design. -- **Lock-free Transaction Management –** The MOT storage engine applies an optimistic approach to achieving data integrity versus concurrency and high throughput. During a transaction, a MOT table does not place locks on any version of the data rows being updated, thus significantly reducing contention in some high-volume systems. Optimistic Concurrency Control \(OCC\) statements within a transaction are implemented without locks, and all data modifications are performed in a part of the memory that is dedicated to private transactions \(also called _Private Transaction Memory_\). This means that during a transaction, the relevant data is updated in the Private Transaction Memory, thus enabling lock-less reads and writes; and a very short duration lock is only placed at the Commit phase. For more details, see the [Concurrency Control Mechanism](concurrency-control-mechanism.md)__section. -- **Lock-free Index** – Because database data and indexes stored totally in-memory, having an efficient index data structure and algorithm is essential. The MOT Index is based on state-of-the-art Masstree[\[1\]](#_ftn1), a fast and scalable Key Value \(KV\) store for multi-core systems, implemented as a Trie of B+ trees. In this way, excellent performance is achieved on many-core servers and during high concurrent workloads. This index applies various advanced techniques in order to optimize performance, such as an optimistic lock approach, cache-line awareness and memory prefetching. -- **NUMA-aware Memory Management **– MOT memory access is designed with Non-Uniform Memory Access \(NUMA\) awareness. NUMA-aware algorithms enhance the performance of a data layout in memory so that threads access the memory that is physically attached to the core on which the thread is running. This is handled by the memory controller without requiring an extra hop by using an interconnect, such as Intel QPI. MOT’s smart****memory control module with pre‑allocated memory pools for various memory objects improves performance, reduces locks and ensures stability. Allocation of a transaction’s memory objects is always NUMA-local. Deallocated objects are returned to the pool. Minimal usage of OS malloc during transactions circumvents unnecessary locks. -- **Efficient Durability – Logging and Checkpoint** – Achieving disk persistence \(also known as _durability_\) is a crucial requirement for being ACID compliant \(the **D** stands for Durability\). All current disks \(including the SSD and NVMe\) are significantly slower than memory and thus are always the bottleneck of a memory-based database. 
As an in-memory storage engine with full durability support, MOT’s durability design must implement a wide variety of algorithmic optimizations in order to ensure durability, while still achieving the speed and throughput objectives for which it was designed. These optimizations include - - Parallel logging, which is also available in all openGauss disk tables - - Log buffering per transaction and lock-less transaction preparation - - Updating delta records, meaning only logging changes - - In addition to synchronous and asynchronous, innovative NUMA-aware group commit logging - - State-of-the-art database checkpoints \(CALC[\[2\]](#_ftn2)\) enable the lowest memory and computational overhead. - -- **High SQL Coverage and Feature Set – **By extending and relying on the PostgreSQL Foreign Data Wrappers \(FDW\) + Index support, the entire range of SQL is covered, including stored procedures, user-defined functions and system function calls. You may refer to the [SQL Coverage and Limitations](sql-coverage-and-limitations.md)__section for a list of the features that are not supported. -- **Queries Native Compilation using PREPARE Statements –** Queries and transaction statements can be executed in an interactive manner by using PREPARE client commands that have been precompiled into a native execution format \(which are also known as _Code‑Gen_ or _Just-in-Time \[JIT\]_ compilation\). This achieves an average of 30% higher performance. Compilation and Lite Execution are applied when possible, and if not, applicable queries are processed using the standard execution path. A Cache Plan module \(that has been optimized for OLTP\) re-uses compilation results throughout an entire session \(even using different bind settings\), as well as across different sessions. -- **Seamless Integration of MOT and openGauss Database – **The MOT operates side by side the disk-based storage engine within an integrated envelope. MOT’s main memory engine and disk-based storage engines co-exist side by side in order to support multiple application scenarios, while internally reusing database auxiliary services, such as a Write-Ahead Logging \(WAL\) Redo Log, Replication, Checkpointing, Recovery, High Availability and so on. Users benefit from the unified deployment, configuration and access of both disk-based tables and MOT tables. This provides a flexible and cost-efficient choice of which storage engine to use according to specific requirements. For example, to place highly performance-sensitive data that causes bottlenecks into memory. - +# MOT Key Technologies + +The following key MOT technologies enable its benefits: + +- **Memory Optimized Data Structures –** With the objective of achieving optimal high concurrent throughput and predictable low latency, all data and indexes are in memory, no intermediate page buffers are used and minimal, short-duration locks are used. Data structures and all algorithms have been specialized and optimized for in-memory design. +- **Lock-free Transaction Management –** The MOT storage engine applies an optimistic approach to achieving data integrity versus concurrency and high throughput. During a transaction, an MOT table does not place locks on any version of the data rows being updated, thus significantly reducing contention in some high-volume systems. 
Optimistic Concurrency Control \(OCC\) statements within a transaction are implemented without locks, and all data modifications are performed in a part of the memory that is dedicated to private transactions \(also called _Private Transaction Memory_\). This means that during a transaction, the relevant data is updated in the Private Transaction Memory, thus enabling lock-less reads and writes; and a very short duration lock is only placed at the Commit phase. For more details, see the [MOT Concurrency Control Mechanism](mot-concurrency-control-mechanism.md)__section. +- **Lock-free Index –** Because database data and indexes stored totally in-memory, having an efficient index data structure and algorithm is essential. The MOT Index is based on state-of-the-art Masstree\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\], a fast and scalable Key Value \(KV\) store for multi-core systems, implemented as a Trie of B+ trees. In this way, excellent performance is achieved on many-core servers and during high concurrent workloads. This index applies various advanced techniques in order to optimize performance, such as an optimistic lock approach, cache-line awareness and memory prefetching. +- **NUMA-aware Memory Management –** MOT memory access is designed with Non-Uniform Memory Access \(NUMA\) awareness. NUMA-aware algorithms enhance the performance of a data layout in memory so that threads access the memory that is physically attached to the core on which the thread is running. This is handled by the memory controller without requiring an extra hop by using an interconnect, such as Intel QPI. MOT's smart memory control module with pre‑allocated memory pools for various memory objects improves performance, reduces locks and ensures stability. Allocation of a transaction's memory objects is always NUMA-local. Deallocated objects are returned to the pool. Minimal usage of OS malloc during transactions circumvents unnecessary locks. +- **Efficient Durability – Logging and Checkpoint –** Achieving disk persistence \(also known as _durability_\) is a crucial requirement for being ACID compliant \(the **D** stands for Durability\). All current disks \(including the SSD and NVMe\) are significantly slower than memory and thus are always the bottleneck of a memory-based database. As an in-memory storage engine with full durability support, MOT's durability design must implement a wide variety of algorithmic optimizations in order to ensure durability, while still achieving the speed and throughput objectives for which it was designed. These optimizations include – + - Parallel logging, which is also available in all openGauss disk tables + - Log buffering per transaction and lock-less transaction preparation + - Updating delta records, meaning only logging changes + - In addition to synchronous and asynchronous, innovative NUMA-aware group commit logging + - State-of-the-art database checkpoints \(CALC\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]\) enable the lowest memory and computational overhead. + + +- **High SQL Coverage and Feature Set –** By extending and relying on the PostgreSQL Foreign Data Wrappers \(FDW\) + Index support, the entire range of SQL is covered, including stored procedures, user-defined functions and system function calls. You may refer to the [MOT SQL Coverage and Limitations](mot-sql-coverage-and-limitations.md)__section for a list of the features that are not supported. 
+- **Queries Native Compilation using PREPARE Statements –** Queries and transaction statements can be executed in an interactive manner by using PREPARE client commands that have been precompiled into a native execution format \(which are also known as _Code‑Gen_ or _Just-in-Time \[JIT\]_ compilation\). This achieves an average of 30% higher performance. Compilation and Lite Execution are applied when possible, and if not, applicable queries are processed using the standard execution path. A Cache Plan module \(that has been optimized for OLTP\) re-uses compilation results throughout an entire session \(even using different bind settings\), as well as across different sessions. +- **Seamless Integration of MOT and openGauss Database –** The MOT operates side by side the disk‑based storage engine within an integrated envelope. MOT's main memory engine and disk‑based storage engines co-exist side by side in order to support multiple application scenarios, while internally reusing database auxiliary services, such as a Write-Ahead Logging \(WAL\) Redo Log, Replication, Checkpointing, Recovery, High Availability and so on. Users benefit from the unified deployment, configuration and access of both disk-based tables and MOT tables. This provides a flexible and cost-efficient choice of which storage engine to use according to specific requirements. For example, to place highly performance-sensitive data that causes bottlenecks into memory. + diff --git a/content/en/docs/Developerguide/local-and-global-mot-memory.md b/content/en/docs/Developerguide/mot-local-and-global-memory.md similarity index 73% rename from content/en/docs/Developerguide/local-and-global-mot-memory.md rename to content/en/docs/Developerguide/mot-local-and-global-memory.md index a036b3d9bf203b423162af99fcf2a523ce30fa6f..ed667be5f77d5dfd029f606e96c5fe6998176f6e 100644 --- a/content/en/docs/Developerguide/local-and-global-mot-memory.md +++ b/content/en/docs/Developerguide/mot-local-and-global-memory.md @@ -1,16 +1,17 @@ -# Local and Global MOT Memory - -SILO manages both a local memory and a global memory, as shown in. - -- **Global** memory is long-term shared memory is shared by all cores and is used primarily to store all the table data and indexes -- **Local** memory is short-term memory that is used primarily by sessions for handling transactions and store data changes in a primate to transaction memory until the commit phase. - -When a transaction change is required, SILO handles the copying of all that transaction's data from the global memory into the local memory. Minimal locks are placed on the global memory according to the OCC approach, so that the contention time in the global shared memory is extremely minimal. After the transaction’ change has been completed, this data is pushed back from the local memory to the global memory. - -The basic interactive transactional flow with our SILO-enhanced concurrency control is shown in the figure below - -**Figure 1** Private \(Local\) Memory \(for each transaction\) and a Global Memory \(for all the transactions of all the cores\) -![](figures/private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t.png "private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t") - -For more details, refer to the Industrial-Strength OLTP Using Main Memory and Many-cores document[\[6\]](#_ftn6). - +# MOT Local and Global Memory + +SILO\[[Comparison – Disk vs. 
MOT](comparison-disk-vs-mot.md)\] manages both a local memory and a global memory, as shown in. + +- **Global** memory is long-term shared memory is shared by all cores and is used primarily to store all the table data and indexes + +- **Local** memory is short-term memory that is used primarily by sessions for handling transactions and store data changes in a primate to transaction memory until the commit phase. + +When a transaction change is required, SILO handles the copying of all that transaction's data from the global memory into the local memory. Minimal locks are placed on the global memory according to the OCC approach, so that the contention time in the global shared memory is extremely minimal. After the transaction’ change has been completed, this data is pushed back from the local memory to the global memory. + +The basic interactive transactional flow with our SILO-enhanced concurrency control is shown in the figure below – + +**Figure 1** Private \(Local\) Memory \(for each transaction\) and a Global Memory \(for all the transactions of all the cores\) +![](figures/private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t.png "private-(local)-memory-(for-each-transaction)-and-a-global-memory-(for-all-the-transactions-of-all-t") + +For more details, refer to the Industrial-Strength OLTP Using Main Memory and Many-cores document\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]. + diff --git a/content/en/docs/Developerguide/mot-logging-wal-redo-log-concepts.md b/content/en/docs/Developerguide/mot-logging-wal-redo-log-concepts.md new file mode 100644 index 0000000000000000000000000000000000000000..34d12b02366ab1a7d92fd9753a8bdc42b67b110e --- /dev/null +++ b/content/en/docs/Developerguide/mot-logging-wal-redo-log-concepts.md @@ -0,0 +1,192 @@ +# MOT Logging – WAL Redo Log Concepts + +## Overview + +Write-Ahead Logging \(WAL\) is a standard method for ensuring data durability. The main concept of WAL is that changes to data files \(where tables and indexes reside\) are only written after those changes have been logged, meaning only after the log records that describe the changes have been flushed to permanent storage. + +The MOT is fully integrated with the openGauss envelope logging facilities. In addition to durability, another benefit of this method is the ability to use the WAL for replication purposes. + +Three logging methods are supported, two standard Synchronous and Asynchronous, which are also supported by the standard openGauss disk-engine. In addition, in the MOT a Group-Commit option is provided with special NUMA-Awareness optimization. The Group-Commit provides the top performance while maintaining ACID properties. + +To ensure Durability, MOT is fully integrated with the openGauss's Write-Ahead Logging \(WAL\) mechanism, so that MOT persists data in WAL records using openGauss's XLOG interface. This means that every addition, update, and deletion to an MOT table's record is recorded as an entry in the WAL. This ensures that the most current data state can be regenerated and recovered from this non-volatile log. For example, if three new rows were added to a table, two were deleted and one was updated, then six entries would be recorded in the log. + +- MOT log records are written to the same WAL as the other records of openGauss disk-based tables. +- MOT only logs an operation at the transaction commit phase. +- MOT only logs the updated delta record in order to minimize the amount of data written to disk. 
+- During recovery, data is loaded from the last known or a specific Checkpoint; and then the WAL Redo log is used to complete the data changes that occurred from that point forward. +- The WAL \(Redo Log\) retains all the table row modifications until a Checkpoint is performed \(as described above\). The log can then be truncated in order to reduce recovery time and to save disk space. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >In order to ensure that the log IO device does not become a bottleneck, the log file must be placed on a drive that has low latency. + + +## Logging Types + +Two synchronous transaction logging options and one asynchronous transaction logging option are supported \(these are also supported by the standard openGauss disk engine\). MOT also supports synchronous Group Commit logging with NUMA-awareness optimization, as described below. + +According to your configuration, one of the following types of logging is implemented – + +- **Synchronous Redo Logging** + + The **Synchronous Redo Logging** option is the simplest and most strict redo logger. When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL \(Redo Log\), as follows – + + 1. While a transaction is in progress, it is stored in the MOT's memory. + 2. After a transaction finishes and the client application sends a **Commit** command, the transaction is locked and then written to the WAL Redo Log on the disk. This means that while the transaction log entries are being written to the log, the client application is still waiting for a response. + 3. As soon as the transaction's entire buffer is written to the log, the changes to the data in memory take place and then the transaction is committed. After the transaction has been committed, the client application is notified that the transaction is complete. + + **Technical Description** + + When a transaction ends, the SynchronousRedoLogHandler serializes its transaction buffer and writes it to the XLOG iLogger implementation. + + **Figure 1** Synchronous Logging + ![](figures/synchronous-logging.png "synchronous-logging") + + **Summary** + + The **Synchronous Redo Logging** option is the safest and most strict because it ensures total synchronization of the client application and the WAL Redo log entries for each transaction as it is committed, thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful when it has not yet been persisted to disk. + + The downside of the **Synchronous Redo Logging** option is that it is the slowest logging mechanism of the three options. This is because a client application must wait until all data is written to disk and because of the frequent disk writes \(which typically slow down the database\). + +- **Group Synchronous Redo Logging** + + The **Group Synchronous Redo Logging** option is very similar to the **Synchronous Redo Logging** option, because it also ensures total durability with absolutely no data loss and total synchronization of the client application and the WAL \(Redo Log\) entries. The difference is that the **Group Synchronous Redo Logging** option writes _groups of transaction redo entries_ to the WAL Redo Log on the disk at the same time, instead of writing each and every transaction as it is committed.
Using Group Synchronous Redo Logging reduces the amount of disk I/Os and thus improves performance, especially when running a heavy workload. + + The MOT engine performs synchronous Group Commit logging with Non-Uniform Memory Access \(NUMA\)-awareness optimization by automatically grouping transactions according to the NUMA socket of the core on which the transaction is running. + + You may refer to the [NUMA Awareness Allocation and Affinity](numa-awareness-allocation-and-affinity.md) section for more information about NUMA-aware memory access. + + When a transaction commits, a group of entries is recorded in the WAL Redo Log, as follows – + + 1. While a transaction is in progress, it is stored in the memory. The MOT engine groups transactions in buckets according to the NUMA socket of the core on which the transaction is running. This means that all the transactions running on the same socket are grouped together and that multiple groups are filled in parallel according to the core on which each transaction is running. + + Writing transactions to the WAL is more efficient in this manner because all the buffers from the same socket are written to disk together. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >- Each thread runs on a single core/CPU which belongs to a single socket and each thread only writes to the socket of the core on which it is running. + + 2. After a transaction finishes and the client application sends a Commit command, the transaction redo log entries are serialized together with other transactions that belong to the same group. + 3. After the configured criteria are fulfilled for a specific group of transactions \(quantity of committed transactions or timeout period, as described in the [REDO LOG \(MOT\)](mot-configuration-settings.md#section361563811235) section\), the transactions in this group are written to the WAL on the disk. This means that while these log entries are being written to the log, the client applications that issued the commit are waiting for a response. + 4. As soon as all the transaction buffers in the NUMA-aware group have been written to the log, all the transactions in the group perform the necessary changes to the memory store and the clients are notified that these transactions are complete. + + **Technical Description** + + The four colors in the figure below represent four NUMA nodes. Each NUMA node has its own memory log, enabling a group commit of multiple connections. + + **Figure 2** Group Commit – with NUMA-awareness + ![](figures/group-commit-with-numa-awareness.png "group-commit-with-numa-awareness") + + **Summary** + + The **Group Synchronous Redo Logging** option is an extremely safe and strict logging option because it ensures total synchronization of the client application and the WAL Redo log entries, thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful when it has not yet been persisted to disk. + + On one hand, this option has fewer disk writes than the **Synchronous Redo Logging** option, which may mean that it is faster. The downside is that transactions are locked for longer, meaning that they are locked until after all the transactions in the same NUMA memory have been written to the WAL Redo Log on the disk. + + The benefits of using this option depend on the type of transactional workload.
For example, this option benefits systems that have many transactions \(and less so for systems that have few transactions, because there are few disk writes anyway\). + +- **Asynchronous Redo Logging** + + The **Asynchronous Redo Logging** option is the fastest logging method. However, it does not guarantee against data loss, meaning that some data that is still in the buffer and was not yet written to disk may get lost upon a power failure or database crash. When a transaction is committed by a client application, the transaction redo entries are recorded in internal buffers and written to disk at preconfigured intervals. The client application does not wait for the data to be written to disk. It continues to the next transaction. This is what makes asynchronous redo logging the fastest logging method. + + When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL Redo Log, as follows – + + 1. While a transaction is in progress, it is stored in the MOT's memory. + 2. After a transaction finishes and the client application sends a Commit command, the transaction redo entries are written to internal buffers, but are not yet written to disk. Then changes to the MOT data memory take place and the client application is notified that the transaction is committed. + 3. At a preconfigured interval, a redo log thread running in the background collects all the buffered redo log entries and writes them to disk. + + **Technical Description** + + Upon transaction commit, the transaction buffer is moved \(pointer assignment – not a data copy\) to a centralized buffer and a new transaction buffer is allocated for the transaction. The transaction is released as soon as its buffer is moved to the centralized buffer and the transaction thread is not blocked. The actual write to the log uses the Postgres walwriter thread. When the walwriter timer elapses, it first calls the AsynchronousRedoLogHandler \(via a registered callback\) to write its buffers and then continues with its logic and flushes the data to the XLOG. + + **Figure 3** Asynchronous Logging + ![](figures/asynchronous-logging.png "asynchronous-logging") + + **Summary** + + The Asynchronous Redo Logging option is the fastest logging option because it does not require the client application to wait for data to be written to disk. In addition, it groups the redo entries of many transactions and writes them together, thus reducing the amount of disk I/Os that slow down the MOT engine. + + The downside of the Asynchronous Redo Logging option is that it does not ensure that data will not get lost upon a crash or failure. Data that was committed, but was not yet written to disk, is not durable on commit and thus cannot be recovered in case of a failure. The Asynchronous Redo Logging option is most relevant for applications that are willing to sacrifice data recovery \(consistency\) for performance. + + **Logging Design Details** + + The following describes the design details of each persistence-related component in the In-Memory Engine Module. + + **Figure 4** Three Logging Options + ![](figures/three-logging-options.png "three-logging-options") + + The RedoLog component is used both by backend threads that use the In-Memory Engine and by the WAL writer in order to persist their data. Checkpoints are performed using the Checkpoint Manager, which is triggered by the Postgres checkpointer. + +- **Logging Design Overview** + + Write-Ahead Logging \(WAL\) is a standard method for ensuring data durability.
WAL's central concept is that changes to data files \(where tables and indexes reside\) are only written after those changes have been logged, meaning after the log records that describe these changes have been flushed to permanent storage. + + In the In-Memory Engine, we use the existing openGauss logging facilities rather than developing a low-level logging API from scratch, in order to reduce development time and to enable the log to be used for replication purposes as well. + +- **Per-transaction Logging** + + In the In-Memory Engine, the transaction log records are stored in a transaction buffer which is part of the transaction object \(TXN\). The transaction buffer is logged during the calls to addToLog\(\) – if the buffer exceeds a threshold, it is then flushed and reused. When a transaction commits and passes the validation phase \(OCC SILO\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] validation\) or aborts for some reason, the appropriate message is saved in the log as well, in order to make it possible to determine the transaction's state during a recovery. + + **Figure 5** Per-transaction Logging + ![](figures/per-transaction-logging.png "per-transaction-logging") + + Parallel Logging is performed by both the MOT and disk engines. However, the MOT engine enhances this design with a log buffer per transaction, lockless preparation and a single log record. + +- **Exception Handling** + + The persistence module handles exceptions by using the Postgres error reporting infrastructure \(ereport\). An error message is recorded in the system log for each error condition. In addition, the error is reported to the envelope using Postgres's built-in error reporting infrastructure. + + The following exceptions are reported by this module – + + **Table 1** Exception Handling +

+| Exception Condition | Exception Code | Scenario | Resulting Outcome |
+| :--- | :--- | :--- | :--- |
+| WAL write failure | ERRCODE_FDW_ERROR | Any case in which the WAL write fails | Transaction terminates |
+| File IO error: write, open and so on | ERRCODE_IO_ERROR | Checkpoint – called on any file access error | FATAL – process exits |
+| Out of memory | ERRCODE_INSUFFICIENT_RESOURCES | Checkpoint – local memory allocation failures | FATAL – process exits |
+| Logic, DB errors | ERRCODE_INTERNAL_ERROR | Checkpoint – the algorithm fails, or table data or indexes cannot be retrieved | FATAL – process exits |
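+
+To make the commit-time logging behavior described in the sections above concrete, the following minimal SQL sketch \(the **orders** table, its columns and the values are hypothetical\) marks where the Synchronous Redo Logging wait occurs – no redo entry reaches the WAL until **COMMIT**, and only the updated delta record is logged –
+
+```
+BEGIN;
+-- The change is kept in the transaction's private MOT memory;
+-- nothing is written to the WAL Redo Log at this point.
+UPDATE orders SET status = 'shipped' WHERE order_id = 42;
+-- Under Synchronous Redo Logging, COMMIT serializes the transaction buffer
+-- (the delta record only) and returns to the client only after the entry
+-- has been flushed to the WAL Redo Log on disk.
+COMMIT;
+```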
+ + diff --git a/content/en/docs/Developerguide/mot-low-latency.md b/content/en/docs/Developerguide/mot-low-latency.md new file mode 100644 index 0000000000000000000000000000000000000000..9b0a93c5d16b24747586d8e1aa2b98f12cd2027d --- /dev/null +++ b/content/en/docs/Developerguide/mot-low-latency.md @@ -0,0 +1,15 @@ +# MOT Low Latency + +The following was measured on ARM/Kunpeng 2-socket server \(128 cores\). The numbers scale is milliseconds \(ms\). + +**Figure 1** Low Latency \(90th%\) – Performance Benchmarks +![](figures/low-latency-(90th-)-performance-benchmarks.png "low-latency-(90th-)-performance-benchmarks") + +MOT's average transaction speed is 2.5x, with MOT latency of 10.5 ms, compared to 23-25ms for disk tables. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>The average was calculated by taking into account all TPC-C 5 transaction percentage distributions. For more information, you may refer to the description of TPC-C transactions in the [MOT Sample TPC-C Benchmark](mot-sample-tpc-c-benchmark.md) section. + +**Figure 2** Low Latency \(90th%, Transaction Average\) – Performance Benchmarks +![](figures/low-latency-(90th-transaction-average)-performance-benchmarks.png "low-latency-(90th-transaction-average)-performance-benchmarks") + diff --git a/content/en/docs/Developerguide/mot-memory-and-storage-planning.md b/content/en/docs/Developerguide/mot-memory-and-storage-planning.md new file mode 100644 index 0000000000000000000000000000000000000000..0926a2c0016964b34f0746113e1d2ccb5c975b4c --- /dev/null +++ b/content/en/docs/Developerguide/mot-memory-and-storage-planning.md @@ -0,0 +1,168 @@ +# MOT Memory and Storage Planning + +This section describes the considerations and guidelines for evaluating, estimating and planning the quantity of memory and storage capacity to suit your specific application needs. This section also describes the various data aspects that affect the quantity of required memory, such as the size of data and indexes for the planned tables, memory to sustain transaction management and how fast the data is growing. + +## MOT Memory Planning + +MOT belongs to the in-memory database class \(IMDB\) in which all tables and indexes reside entirely in memory. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>Memory storage is volatile, meaning that it requires power to maintain the stored information. Disk storage is persistent, meaning that it is written to disk, which is non-volatile storage. MOT uses both, having all data in memory, while persisting \(by WAL logging\) transactional changes to disk with strict consistency \(in synchronous logging mode\). + +Sufficient physical memory must exist on the server in order to maintain the tables in their initial state, as well as to accommodate the related workload and growth of data. All this is in addition to the memory that is required for the traditional disk-based engine, tables and sessions that support the workload of disk-based tables. Therefore, planning ahead for enough memory to contain them all is essential. + +Even so, you can get started with whatever amount of memory you have and perform basic tasks and evaluation tests. Later, when you are ready for production, the following issues should be addressed. + +- **Memory Configuration Settings** + + Similar to standard PG , the memory of the openGauss database process is controlled by the upper limit in its max\_process\_memory setting, which is defined in the postgres.conf file. 
The MOT engine and all its components and threads reside within the openGauss process. Therefore, the memory allocated to MOT also operates within the upper boundary defined by max\_process\_memory for the entire openGauss database process. + + The amount of memory that MOT can reserve for itself is defined as a portion of max\_process\_memory. It is either a percentage of it or an absolute value that is less than it. This portion is defined in the mot.conf configuration file by the min/max\_mot\_global\_memory and min/max\_mot\_local\_memory settings. + + The portion of max\_process\_memory that can be used by MOT must still leave at least 2 GB available for the PG \(openGauss\) envelope. Therefore, in order to ensure this, MOT verifies the following during database startup – + + ``` + (max_mot_global_memory + max_mot_local_memory) + 2GB < max_process_memory + ``` + + If this limit is breached, then MOT internal memory limits are adjusted in order to provide the maximum possible within the limitations described above. This adjustment is performed during startup and calculates the value of MOT max memory accordingly. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >MOT max memory is a logically calculated value of either the configured settings or their adjusted values of \(max\_mot\_global\_memory + max\_mot\_local\_memory\). + + In this case, a warning is issued to the server log, as shown below – + + **Warning Examples** + + Two messages are reported – the problem and the solution. + + The following is an example of a warning message reporting the problem – + + ``` + [WARNING] MOT engine maximum memory definitions (global: 9830 MB, local: 1843 MB, session large store: 0 MB, total: 11673 MB) breach GaussDB maximum process memory restriction (12288 MB) and/or total system memory (64243 MB). MOT values shall be adjusted accordingly to preserve required gap (2048 MB). + ``` + + The following is an example of a warning message indicating that MOT is automatically adjusting the memory limits – + + ``` + [WARNING] Adjusting MOT memory limits: global = 8623 MB, local = 1617 MB, session large store = 0 MB, total = 10240 MB + ``` + + This is the only place that shows the new memory limits. + + Additionally, MOT does not allow the insertion of additional data when the total memory usage approaches the chosen memory limits. The threshold for determining when additional data insertions are no longer allowed is defined as a percentage of MOT max memory \(which is a calculated value, as described above\). The percentage of MOT max memory is configured in the high\_red\_mark\_percent setting of the mot.conf file. The default is 90, meaning 90%. Attempting to add additional data over this threshold returns an error to the user and is also registered in the database log file. + +- **Minimum and Maximum** + + In order to secure memory for future operations, MOT pre-allocates memory based on the minimum global and local settings. The database administrator should specify the minimum amount of memory required for the MOT tables and sessions to sustain their workload. This ensures that this minimal memory is allocated to MOT even if another excessive memory‑consuming application runs on the same server as the database and competes with the database for memory resources. The maximum values are used to limit memory growth. + + +- **Global and Local** + + The memory used by MOT is comprised of two parts – + + - **Global Memory –** Global memory is a long-term memory pool that contains the data and indexes of MOT tables.
It is evenly distributed across NUMA nodes and is shared by all CPU cores. + + - **Local Memory –** Local memory is a memory pool used for short-term objects. Its primary consumers are sessions handling transactions. These sessions store data changes in the part of the memory dedicated to the relevant specific transaction \(known as _transaction private memory_\). Data changes are moved to the global memory at the commit phase. Memory object allocation is performed in a NUMA-local manner in order to achieve the lowest possible latency. + + Deallocated objects are put back in the relevant memory pools. Minimal use of operating system memory allocation \(malloc\) functions during transactions circumvents unnecessary locks and latches. + + The allocation of these two memory parts is controlled by the dedicated **min/max\_mot\_global\_memory** and **min/max\_mot\_local\_memory** settings. If MOT global memory usage gets too close to this defined maximum, then MOT protects itself and does not accept new data. Attempts to allocate memory beyond this limit are denied and an error is reported to the user. + +- **Minimum Memory Requirements** + + To get started and perform a minimal evaluation of MOT performance, there are a few requirements. + + Make sure that **max\_process\_memory** \(as defined in **postgres.conf**\) has sufficient capacity for MOT tables and sessions \(configured by **min/max\_mot\_global\_memory** and **min/max\_mot\_local\_memory**\), in addition to the disk tables buffer and extra memory. For simple tests, the default **mot.conf** settings can be used. + + +- **Actual Memory Requirements During Production** + + In a typical OLTP workload, with an 80:20 read:write ratio on average, MOT memory usage per table is 60% higher than in disk-based tables \(this includes both the data and the indexes\). This is due to the use of more optimal data structures and algorithms that enable faster access, with CPU-cache awareness and memory prefetching. + + The actual memory requirement for a specific application depends on the quantity of data, the expected workload and especially on the data growth. + + +- **Max Global Memory Planning – Data + Index Size** + + To plan for maximum global memory – + + 1. Determine the size of a specific disk table \(including both its data and all its indexes\). The following statistical queries can be used to determine the data size of the **customer** table and the **customer\_pkey** index size – + - **Data size –** select pg\_relation\_size\('customer'\); + - **Index –** select pg\_relation\_size\('customer\_pkey'\); + + 2. Add 60%, which is the common requirement in MOT relative to the current size of the disk-based data and index. + 3. Add an additional percentage for the expected growth of data. For example – + + 5% monthly growth = 80% yearly growth \(1.05^12\). Thus, in order to sustain a year's growth, allocate 80% more memory than is currently used by the tables. + + This completes the estimation and planning of the max\_mot\_global\_memory value. The actual setting can be defined either as an absolute value or as a percentage of the Postgres max\_process\_memory. The exact value is typically finetuned during deployment. + +- **Max Local Memory Planning – Concurrent Session Support** + + Local memory needs are primarily a function of the quantity of concurrent sessions. The typical OLTP workload of an average session uses up to 8 MB. This should be multiplied by the quantity of sessions and then a little bit extra should be added.
+ + A memory calculation can be performed in this manner and then finetuned, as follows – + + ``` + SESSION_COUNT * SESSION_SIZE (8 MB) + SOME_EXTRA (100MB should be enough) + ``` + + The default specifies 15% of Postgres's max\_process\_memory, which by default is 12 GB. This equals 1.8 GB, which is sufficient for 230 sessions and is the requirement for the max\_mot\_local\_memory setting. The actual setting can be defined either in absolute values or as a percentage of the Postgres max\_process\_memory. The exact value is typically finetuned during deployment. + + **Unusually Large Transactions** + + Some transactions are unusually large because they apply changes to a large number of rows. This may increase a single session's local memory up to the maximum allowed limit, which is 1 GB. For example – + + ``` + delete from SOME_VERY_LARGE_TABLE; + ``` + + Take this scenario into consideration when configuring the max\_mot\_local\_memory setting, as well as during application development. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >You may refer to the [MEMORY \(MOT\)](mot-configuration-settings.md#section1223551495) section for more information about configuration settings. + + +## Storage IO + +MOT is a memory-optimized, persistent database storage engine. One or more disk drives are required for storing the Redo Log \(WAL\) and periodic checkpoints. + +It is recommended to use a storage device with low latency, such as an SSD with a RAID-1 configuration, NVMe or any enterprise-grade storage system. When appropriate hardware is used, the database transaction processing and contention are the bottleneck, not the IO. + +Since persistent storage is much slower than RAM memory, the IO operations \(logging and checkpoint\) can create a bottleneck for both in-memory and memory-optimized databases. However, MOT has a highly efficient durability design and implementation that is optimized for modern hardware \(such as SSD and NVMe\). In addition, MOT has minimized and optimized writing points \(for example, by using parallel logging, a single log record per transaction and NUMA-aware transaction group writing\) and has minimized the data written to disk \(for example, only logging the delta or updated columns of the changed records and only logging a transaction at the commit phase\). + +## Required Capacity + +The required capacity is determined by the requirements of checkpointing and logging, as described below – + +- **Checkpointing** + + A checkpoint saves a snapshot of all the data to disk. + + Twice the size of all data should be allocated for checkpointing. There is no need to allocate space for the indexes for checkpointing. + + Checkpointing = 2x the MOT Data Size \(rows only, index is not persistent\). + + Twice the size is required because a snapshot of the entire size of the data is saved to disk, and in addition, the same amount of space should be allocated for the checkpoint that is in progress. When a checkpoint process finishes, the previous checkpoint files are deleted. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >In the next openGauss release, MOT will have an incremental checkpoint feature, which will significantly reduce this storage capacity requirement. + + +- **Logging** + + MOT table log records are written to the same database transaction log as the other records of disk-based tables.
+ + The size of the log depends on the transactional throughput, the size of the data changes and the time between checkpoints \(at each time checkpoint the Redo Log is truncated and starts to expand again\). + + MOT tables use less log bandwidth and have lower IO contention than disk‑based tables. This is enabled by multiple mechanisms. + + For example, MOT does not log every operation before a transaction has been completed. It is only logged at the commit phase and only the updated delta record is logged \(not full records like for disk‑based tables\). + + In order to ensure that the log IO device does not become a bottleneck, the log file must be placed on a drive that has low latency. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >You may refer to the [STORAGE \(MOT\)](mot-configuration-settings.md#section16572933681) section for more information about configuration settings. + + diff --git a/content/en/docs/Developerguide/mot-memory-management.md b/content/en/docs/Developerguide/mot-memory-management.md new file mode 100644 index 0000000000000000000000000000000000000000..1e4d5dc8483733911c2d7cbb8fa72325c05e84ef --- /dev/null +++ b/content/en/docs/Developerguide/mot-memory-management.md @@ -0,0 +1,4 @@ +# MOT Memory Management + +For planning and finetuning, see the [MOT Memory and Storage Planning](mot-memory-and-storage-planning.md) and [MOT Configuration Settings](mot-configuration-settings.md) sections. + diff --git a/content/en/docs/Developerguide/monitoring.md b/content/en/docs/Developerguide/mot-monitoring.md similarity index 61% rename from content/en/docs/Developerguide/monitoring.md rename to content/en/docs/Developerguide/mot-monitoring.md index 186a3737dad3bbe09458245746f55bdadc1956cf..6cbd1f8f881c7169b938bcdc409e4f8613bd99b0 100644 --- a/content/en/docs/Developerguide/monitoring.md +++ b/content/en/docs/Developerguide/mot-monitoring.md @@ -1,123 +1,117 @@ -# Monitoring - -All syntax for monitoring of PG-based FDW tables is supported. This includes Table or Index sizes \(as described below\). In addition, special functions exist for monitoring MOT memory consumption, including MOT Global Memory, MOT Local Memory and a single client session. - -## Table and Index Sizes - -The size of tables and indexes can be monitored by querying****pg\_relation\_size. - -For example - -- **Data size** - - ``` - select pg_relation_size('customer'); - ``` - -- **Index** - - ``` - select pg_relation_size('customer_pkey'); - ``` - - -## MOT GLOBAL Memory Details - -Check the size of MOT global memory, which includes primarily the data and indexes. - -``` -select * from mot_global_memory_detail(); -``` - -Result - -``` -numa_node | reserved_size | used_size -----------------+----------------+------------- --1 | 194716368896 | 25908215808 -0 | 446693376 | 446693376 -1 | 452984832 | 452984832 -2 | 452984832 | 452984832 -3 | 452984832 | 452984832 -4 | 452984832 | 452984832 -5 | 364904448 | 364904448 -6 | 301989888 | 301989888 -7 | 301989888 | 301989888 -``` - -``` - -``` - -Where - -- -1 is the total memory. -- 0..7 are NUMA memory nodes. - -## MOT LOCAL Memory Details - -Check the size of MOT local memory, which includes session memory. 
- -``` -select * from mot_local_memory_detail(); -``` - -Result - -``` -numa_node | reserved_size | used_size -----------------+----------------+------------- --1 | 144703488 | 144703488 -0 | 25165824 | 25165824 -1 | 25165824 | 25165824 -2 | 18874368 | 18874368 -3 | 18874368 | 18874368 -4 | 18874368 | 18874368 -5 | 12582912 | 12582912 -6 | 12582912 | 12582912 -7 | 12582912 | 12582912 -``` - -Where - -- -1 is the total memory. -- 0..7 are NUMA memory nodes. - -## Session Memory - -Memory for session management is taken from the MOT local memory. - -Memory usage by all active sessions \(connections\) is possible using the following query - -``` -select * from mot_session_memory_detail(); -``` - -Result - -``` -sessid | total_size | free_size | used_size ---------------------------------––––––-+-----------+----------+---------- -1591175063.139755603855104 | 6291456 | 1800704 | 4490752 - -``` - -Legend - -**total\_size –** is allocated for the session - -**free\_size –** not in use - -**used\_size –** In actual use - -The following query enables a DBA to determine the state of local memory used by the current session - -``` -select * from mot_session_memory_detail() - where sessid = pg_current_sessionid; -``` - -Result - -![](figures/en-us_image_0260591116.png) - +# MOT Monitoring + +All syntax for monitoring of PG-based FDW tables is supported. This includes Table or Index sizes \(as described below\). In addition, special functions exist for monitoring MOT memory consumption, including MOT Global Memory, MOT Local Memory and a single client session. + +## Table and Index Sizes + +The size of tables and indexes can be monitored by querying pg\_relation\_size. + +For example – + +**Data Size** + +``` +select pg_relation_size('customer'); +``` + +**Index** + +``` +select pg_relation_size('customer_pkey'); +``` + +## MOT GLOBAL Memory Details + +Check the size of MOT global memory, which includes primarily the data and indexes. + +``` +select * from mot_global_memory_detail(); +``` + +Result – + +``` +numa_node | reserved_size | used_size +----------------+----------------+------------- +-1 | 194716368896 | 25908215808 +0 | 446693376 | 446693376 +1 | 452984832 | 452984832 +2 | 452984832 | 452984832 +3 | 452984832 | 452984832 +4 | 452984832 | 452984832 +5 | 364904448 | 364904448 +6 | 301989888 | 301989888 +7 | 301989888 | 301989888 +``` + +Where – + +- -1 is the total memory. + +- 0..7 are NUMA memory nodes. + +## MOT LOCAL Memory Details + +Check the size of MOT local memory, which includes session memory. + +``` +select * from mot_local_memory_detail(); +``` + +Result – + +``` +numa_node | reserved_size | used_size +----------------+----------------+------------- +-1 | 144703488 | 144703488 +0 | 25165824 | 25165824 +1 | 25165824 | 25165824 +2 | 18874368 | 18874368 +3 | 18874368 | 18874368 +4 | 18874368 | 18874368 +5 | 12582912 | 12582912 +6 | 12582912 | 12582912 +7 | 12582912 | 12582912 +``` + +Where – + +- -1 is the total memory. +- 0..7 are NUMA memory nodes. + +## Session Memory + +Memory for session management is taken from the MOT local memory.
+ +Memory usage by all active sessions \(connections\) is possible using the following query – + +``` +select * from mot_session_memory_detail(); +``` + +Result – + +``` +sessid | total_size | free_size | used_size +---------------------------------––––––-+-----------+----------+---------- +1591175063.139755603855104 | 6291456 | 1800704 | 4490752 + +``` + +Legend – + +- **total\_size –** allocated for the session +- **free\_size –** not in use +- **used\_size –** in actual use + +The following query enables a DBA to determine the state of local memory used by the current session – + +``` +select * from mot_session_memory_detail() + where sessid = pg_current_sessionid(); +``` + +Result – + +![](figures/en-us_image_0270643558.png) + diff --git a/content/en/docs/Developerguide/mot-optimistic-concurrency-control.md b/content/en/docs/Developerguide/mot-optimistic-concurrency-control.md new file mode 100644 index 0000000000000000000000000000000000000000..0c52758a67692b577f825b389f9dc049f67500d2 --- /dev/null +++ b/content/en/docs/Developerguide/mot-optimistic-concurrency-control.md @@ -0,0 +1,192 @@ +# MOT Optimistic Concurrency Control + +The Concurrency Control Module \(CC Module for short\) provides all the transactional requirements for the Main Memory Engine. The primary objective of the CC Module is to provide the Main Memory Engine with support for various isolation levels. + +## Optimistic OCC vs. Pessimistic 2PL + +The functional difference between Pessimistic 2PL \(2-Phase Locking\) and Optimistic Concurrency Control \(OCC\) lies in their pessimistic versus optimistic approaches to transaction integrity. + +Disk-based tables use a pessimistic approach, which is the most commonly used database method. The MOT Engine uses an optimistic approach. + +The primary functional difference between the pessimistic approach and the optimistic approach is that if a conflict occurs – + +- The pessimistic approach causes the client to wait. + +- The optimistic approach causes one of the transactions to fail, so that the failed transaction must be retried by the client. + +**Optimistic Concurrency Control Approach \(Used by MOT\)** + +The **Optimistic Concurrency Control \(OCC\)** approach detects conflicts as they occur and performs validation checks at commit time. + +The optimistic approach has less overhead and is usually more efficient, partly because transaction conflicts are uncommon in most applications. + +The functional difference between the optimistic and pessimistic approaches is larger when the REPEATABLE READ isolation level is enforced, and is largest for the SERIALIZABLE isolation level. + +**Pessimistic Approaches \(Not used by MOT\)** + +The **Pessimistic Concurrency Control** \(2PL or 2-Phase Locking\) approach uses locks to block potential conflicts before they occur. A lock is applied when a statement is executed and released when the transaction is committed. Disk-based row‑stores use this approach \(with the addition of Multi-version Concurrency Control \[MVCC\]\). + +In 2PL algorithms, while a transaction is writing a row, no other transaction can access it; and while a row is being read, no other transaction can overwrite it. Each row is locked at access time for both reading and writing; and the lock is released at commit time. These algorithms require a scheme for handling and avoiding deadlock. Deadlock can be detected by calculating cycles in a wait-for graph. Deadlock can be avoided by keeping time ordering using TSO\[[Comparison – Disk vs. 
MOT](comparison-disk-vs-mot.md)\] or by some kind of back-off scheme. + +**Encounter Time Locking \(ETL\)** + +Another approach is Encounter Time Locking \(ETL\), where reads are handled in an optimistic manner, but writes lock the data that they access. As a result, writes from different ETL transactions are aware of each other and can decide to abort. It has been empirically verified\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] that ETL improves the performance of OCC in two ways – + +- First, ETL detects conflicts early on and often increases transaction throughput. This is because transactions do not perform useless operations, because conflicts discovered at commit time \(in general\) cannot be solved without aborting at least one transaction. +- Second, encounter-time locking Reads-After-Writes \(RAW\) are handled efficiently without requiring expensive or complex mechanisms. + +**Conclusion** + +OCC is the fastest option for most workloads\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]. This finding has also been observed in our preliminary research phase. + +One of the reasons is that when every core executes multiple threads, a lock is likely to be held by a swapped thread, especially in interactive mode. Another reason is that pessimistic algorithms involve deadlock detection \(which introduces overhead\) and usually uses read-write locks \(which are less efficient than standard spin-locks\). + +We have chosen Silo\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] because it was simpler than other existing options, such as TicToc\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\], while maintaining the same performance for most workloads. ETL is sometimes faster than OCC, but it introduces spurious aborts which may confuse a user, in contrast to OCC which aborts only at commit. + +## OCC vs 2PL Differences by Example + +The following shows the differences between two user experiences – Pessimistic \(for disk-based tables\) and Optimistic \(MOT tables\) when sessions update the same table simultaneously. + +In this example, the following table test command is run – + +``` +table “TEST” – create table test (x int, y int, z int, primary key(x)); +``` + +This example describes two aspects of the same test – user experience \(operations in the example\) and retry requirements. + +**Example Pessimistic Approach – Used in Disk-based Tables** + +The following is an example of the Pessimistic approach \(which is not Mot\). Any Isolation Level may apply. + +The following two sessions perform a transaction that attempts to update a single table. + +A WAIT LOCK action occurs and the client experience is that session \#2 is _stuck_ until Session \#1 has completed a COMMIT. Only afterwards, is Session \#2 able to progress. + +However, when this approach is used, both sessions succeed and no abort occurs \(unless SERIALIZABLE or REPEATABLE-READ isolation level is applied\), which results in the entire transaction needing to be retried. + +**Table 1** Pessimistic Approach Code Example + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
  

+|    | Session 1 | Session 2 |
+| :- | :-------- | :-------- |
+| t0 | Begin | Begin |
+| t1 | update test set y=200 where x=1; |  |
+| t2 | y=200 | Update test set y=300 where x=1; -- Wait on lock |
+| t4 | Commit |  |
+|    |  | Unlock |
+|    |  | Commit \(in READ-COMMITTED this will succeed, in SERIALIZABLE it will fail\) |
+|    |  | y = 300 |
+
+ +**Example Optimistic Approach – Used in MOT** + +The following is an example of the Optimistic approach. + +It describes the situation of creating an MOT table and then having two concurrent sessions updating that same MOT table simultaneously – + +``` +create foreign table test (x int, y int, z int, primary key(x)); +``` + +- The advantage of OCC is that there are no locks until COMMIT. +- The disadvantage of using OCC is that the update may fail if another session updates the same record. If the update fails \(in all supported isolation levels\), an entire SESSION \#2 transaction must be retried. +- Update conflicts are detected by the kernel at commit time by using a version checking mechanism. +- SESSION \#2 will not wait in its update operation and will be aborted because of conflict detection at commit phase. + +**Table 2** Optimistic Approach Code Example – Used in MOT + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
  

+|    | Session 1 | Session 2 |
+| :- | :-------- | :-------- |
+| t0 | Begin | Begin |
+| t1 | update test set y=200 where x=1; |  |
+| t2 | y=200 | Update test set y=300 where x=1; |
+| t4 | Commit | y = 300 |
+|    |  | Commit |
+|    |  | ABORT |
+|    |  | y = 200 |
+
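+
+Because an OCC conflict only surfaces at commit time, the client application is expected to retry the aborted transaction, as with the ABORT shown for Session 2 above. The following minimal JDBC sketch \(assuming the **test** foreign table from the example above and an open connection **conn**; the retry limit and error handling are simplified illustrations\) shows one way to structure such a retry –
+
+```
+// Retry loop for an MOT (OCC) transaction that may abort at commit time.
+int maxRetries = 3;
+for (int attempt = 1; attempt <= maxRetries; attempt++) {
+    try {
+        conn.setAutoCommit(false);
+        try (PreparedStatement ps =
+                 conn.prepareStatement("UPDATE test SET y = ? WHERE x = ?")) {
+            ps.setInt(1, 300);
+            ps.setInt(2, 1);
+            ps.executeUpdate();
+        }
+        conn.commit();      // a conflict with another session is detected here
+        break;              // commit succeeded - leave the retry loop
+    } catch (SQLException e) {
+        conn.rollback();    // the transaction was aborted - roll back and retry
+        if (attempt == maxRetries) {
+            throw e;        // give up after the configured number of attempts
+        }
+    }
+}
+```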
+ + diff --git a/content/en/docs/Developerguide/mot-performance-benchmarks.md b/content/en/docs/Developerguide/mot-performance-benchmarks.md new file mode 100644 index 0000000000000000000000000000000000000000..89e3da2a4d80c953701cd0f25ba931dd488eecee --- /dev/null +++ b/content/en/docs/Developerguide/mot-performance-benchmarks.md @@ -0,0 +1,32 @@ +# MOT Performance Benchmarks + +Our performance tests are based on the TPC-C Benchmark that is commonly used both by industry and academia. + +Our tests used BenchmarkSQL \(see [MOT Sample TPC-C Benchmark](mot-sample-tpc-c-benchmark.md)\) and generated the workload using interactive SQL commands, as opposed to stored procedures. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>Using the stored procedures approach may produce even higher performance results because it involves significantly fewer networking roundtrips and database envelope SQL processing cycles. + +All tests that evaluated the performance of openGauss MOT vs DISK used synchronous logging and its optimized **group-commit=on** version in MOT. + +Finally, we performed an additional test in order to evaluate MOT's ability to quickly ingest massive quantities of data and to serve as an alternative to mid-tier data ingestion solutions. + +All tests were performed in June 2020. + +The following shows various types of MOT performance benchmarks – + +- **[MOT Hardware](mot-hardware.md)** + +- **[MOT Results – Summary](mot-results-summary.md)** + +- **[MOT High Throughput](mot-high-throughput.md)** + +- **[MOT Low Latency](mot-low-latency.md)** + +- **[MOT RTO and Cold-Start Time](mot-rto-and-cold-start-time.md)** + +- **[MOT Resource Utilization](mot-resource-utilization.md)** + +- **[MOT Data Ingestion Speed](mot-data-ingestion-speed.md)** + + diff --git a/content/en/docs/Developerguide/mot-preparation.md b/content/en/docs/Developerguide/mot-preparation.md new file mode 100644 index 0000000000000000000000000000000000000000..a8fffa52a658a141a4df4b2d06918601b701ae09 --- /dev/null +++ b/content/en/docs/Developerguide/mot-preparation.md @@ -0,0 +1,9 @@ +# MOT Preparation + +The following describes the prerequisites and the memory and storage planning to perform in order to prepare to use MOT. + +- **[MOT Prerequisites](mot-prerequisites.md)** + +- **[MOT Memory and Storage Planning](mot-memory-and-storage-planning.md)** + + diff --git a/content/en/docs/Developerguide/prerequisites.md b/content/en/docs/Developerguide/mot-prerequisites.md similarity index 55% rename from content/en/docs/Developerguide/prerequisites.md rename to content/en/docs/Developerguide/mot-prerequisites.md index 1d371416ff7ce460b5fc4c2bbe28d6c6f14e2ba1..d053fa8a22b17a231fd552ff978980b53870a86d 100644 --- a/content/en/docs/Developerguide/prerequisites.md +++ b/content/en/docs/Developerguide/mot-prerequisites.md @@ -1,41 +1,41 @@ -# Prerequisites - -The following specifies the hardware and software prerequisites for using openGauss MOT. - -## Supported Hardware - -MOT can utilize state-of-the-art hardware, as well as support existing hardware platforms. Both x86 architecture and ARM by Huawei Kunpeng architecture are supported. - -MOT is fully aligned with the hardware supported by the openGauss database. For more information, see the _openGauss Installation Guide_. - -## CPU - -MOT delivers exceptional performance on many-core servers \(scale-up\). MOT significantly outperforms the competition in these environments and provides near-linear scaling and extremely high resource utilization.
- -Even so, users can already start realizing MOT’s performance benefits on both low-end, mid-range and high-end servers, starting from one or two CPU sockets, as well as four and even eight CPU sockets. Very high performance and resource utilization are also expected on very high-end servers that have 16 or even 32 sockets \(for such cases, we recommend contacting Huawei support\). - -## Memory - -MOT supports standard RAM/DRAM for its data and transaction management. All MOT tables’ data and indexes reside in-memory; therefore, the memory capacity must support the data capacity and still have space for further growth. For detailed information about memory requirements and planning, see the [Memory and Storage Planning](memory-and-storage-planning.md)__section. - -## Storage IO - -MOT is a durable database and uses persistent storage \(disk/SSD/NVMe drive\[s\]\) for transaction log operations and periodic checkpoints. - -We recommend using a storage device with low latency, such as SSD with a RAID-1 configuration, NVMe or any enterprise-grade storage system. When appropriate hardware is used, the database transaction processing and contention are the bottleneck, not the IO. - -For detailed memory requirements and planning, see the [Memory and Storage Planning](memory-and-storage-planning.md)section_._ - -## Supported Operating Systems - -MOT is fully aligned with the operating systems supported by openGauss. - -MOT supports both bare-metal and virtualized environments that run the following operating systems on a bare-metal server or virtual machine - -- **x86 –** CentOS 7.6 and EulerOS 2.0 -- **ARM –** OpenEuler and EulerOS - -**OS Optimization** - -MOT does not require any special modifications or the installation of new software. However, several optional optimizations can enhance performance. You may refer to the _Server Optimization x86__ _and _Server Optimization ARM Huawei Taishan 4P_ sections for a description of the optimizations that enable maximal performance. - +# MOT Prerequisites + +The following specifies the hardware and software prerequisites for using openGauss MOT. + +## Supported Hardware + +MOT can utilize state-of-the-art hardware, as well as support existing hardware platforms. Both x86 architecture and ARM by Huawei Kunpeng architecture are supported. + +MOT is fully aligned with the hardware supported by the openGauss database. For more information, see the _openGauss Installation Guide_. + +## CPU + +MOT delivers exceptional performance on many-core servers \(scale-up\). MOT significantly outperforms the competition in these environments and provides near-linear scaling and extremely high resource utilization. + +Even so, users can already start realizing MOT's performance benefits on both low-end, mid-range and high-end servers, starting from one or two CPU sockets, as well as four and even eight CPU sockets. Very high performance and resource utilization are also expected on very high-end servers that have 16 or even 32 sockets \(for such cases, we recommend contacting Huawei support\). + +## Memory + +MOT supports standard RAM/DRAM for its data and transaction management. All MOT tables’ data and indexes reside in-memory; therefore, the memory capacity must support the data capacity and still have space for further growth. For detailed information about memory requirements and planning, see the [MOT Memory and Storage Planning](mot-memory-and-storage-planning.md)__section. 
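+
+As a rough, illustrative sizing sketch \(the 10 GB starting point is hypothetical; the 60% overhead and the yearly growth factor follow the guidance in that planning section\) –
+
+```
+disk_table_size  = 10 GB                   -- current data + indexes of the disk-based table
+mot_size         = 10 GB * 1.6 = 16 GB     -- add ~60% for MOT data structures and indexes
+planned_capacity = 16 GB * 1.8 = 28.8 GB   -- add ~80% for one year of growth (5% per month)
+```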
+ +## Storage IO + +MOT is a durable database and uses persistent storage \(disk/SSD/NVMe drive\[s\]\) for transaction log operations and periodic checkpoints. + +We recommend using a storage device with low latency, such as SSD with a RAID-1 configuration, NVMe or any enterprise-grade storage system. When appropriate hardware is used, the database transaction processing and contention are the bottleneck, not the IO. + +For detailed memory requirements and planning, see the [MOT Memory and Storage Planning](mot-memory-and-storage-planning.md) section. + +## Supported Operating Systems + +MOT is fully aligned with the operating systems supported by openGauss. + +MOT supports both bare-metal and virtualized environments that run the following operating systems on a bare-metal server or virtual machine – + +- **x86 –** CentOS 7.6 and EulerOS 2.0 +- **ARM –** OpenEuler and EulerOS + +## OS Optimization + +MOT does not require any special modifications or the installation of new software. However, several optional optimizations can enhance performance. You may refer to the [MOT Server Optimization – x86](mot-server-optimization-x86.md) and [MOT Server Optimization – ARM Huawei Taishan 2P/4P](mot-server-optimization-arm-huawei-taishan-2p-4p.md) sections for a description of the optimizations that enable maximal performance. + diff --git a/content/en/docs/Developerguide/mot-query-native-compilation-jit.md b/content/en/docs/Developerguide/mot-query-native-compilation-jit.md new file mode 100644 index 0000000000000000000000000000000000000000..84e6e2191a3e2a1dccc50e93d3abd10434ec452e --- /dev/null +++ b/content/en/docs/Developerguide/mot-query-native-compilation-jit.md @@ -0,0 +1,67 @@ +# MOT Query Native Compilation \(JIT\) + +MOT enables you to prepare and parse _pre-compiled full queries_ in a native format \(using a **PREPARE** statement\) before they are needed for execution. + +This native format can later be executed \(using an **EXECUTE** command\) more efficiently. This type of execution is much more efficient because during execution the native format bypasses multiple database processing layers. This division of labor avoids repetitive parse analysis operations. The Lite Executor module is responsible for executing **prepared** queries and has a much faster execution path than the regular generic plan performed by the envelope. This is achieved using Just-In-Time \(JIT\) compilation via LLVM. In addition, a similar solution that has potentially similar performance is provided in the form of pseudo-LLVM. + +The following is an example of the **PREPARE** syntax in SQL – + +``` +PREPARE name [ ( data_type [, ...] ) ] AS statement +``` + +The following is an example of how to invoke a PREPARE and then an EXECUTE statement in a Java application – + +``` +conn = DriverManager.getConnection(connectionUrl, connectionUser, connectionPassword); + +// Example 1: PREPARE without bind settings +String query = "SELECT * FROM getusers"; +PreparedStatement prepStmt1 = conn.prepareStatement(query); +ResultSet rs1 = prepStmt1.executeQuery(); +while (rs1.next()) {…} + +// Example 2: PREPARE with bind settings +String sqlStmt = "SELECT * FROM employees where first_name=? 
and last_name like ?"; +PreparedStatement prepStmt2 = conn.prepareStatement(sqlStmt); +prepStmt2.setString(1, "Mark"); // first name "Mark" +prepStmt2.setString(2, "%n%"); // last name contains a letter "n" +ResultSet rs2 = prepStmt2.executeQuery(); +while (rs2.next()) {…} +``` + +## Prepare + +**PREPARE** creates a prepared statement. A prepared statement is a server-side object that can be used to optimize performance. When the **PREPARE** statement is executed, the specified statement is parsed, analyzed and rewritten. + +If the tables mentioned in the query statement are MOT tables, the MOT compilation takes charge of the object preparation and performs a special optimization by compiling the query into IR byte code based on LLVM. + +Whenever a new query compilation is required, the query is analyzed and properly tailored IR byte code is generated for the query using the utility GsCodeGen object and the standard LLVM JIT API \(IRBuilder\). After byte-code generation is completed, the code is JIT‑compiled into a separate LLVM module. The compiled code results in a C function pointer that can later be invoked for direct execution. Note that this C function can be invoked concurrently by many threads, as long as each thread provides a distinct execution context \(details are provided below\). Each such execution context is referred to as a _JIT Context_. + +To improve performance further, MOT JIT applies a caching policy for its LLVM code results, enabling them to be reused for the same queries across different sessions. + +## Execute + +When an EXECUTE command is issued, the prepared statement \(described above\) is planned and executed. This division of labor avoids repetitive parse analysis work, while enabling the execution plan to depend on the specific setting values supplied. + +When the resulting execute query command reaches the database, it uses the corresponding IR byte code, which is executed directly and more efficiently within the MOT engine. This is referred to as _Lite Execution_. + +In addition, for availability, the Lite Executor maintains a preallocated pool of JIT sources. Each session preallocates its own session-local pool of JIT context objects \(used for repeated executions of precompiled queries\). + +For more details you may refer to the Supported Queries for Lite Execution and Unsupported Queries for Lite Execution sections. + +## JIT Compilation Comparison – openGauss Disk-based vs. MOT Tables + +Currently, openGauss contains two main forms of JIT / CodeGen query optimizations for its disk-based tables – + +- Accelerating expression evaluation, such as in WHERE clauses, target lists, aggregates and projections. +- Inlining small function invocations. + +These optimizations are partial \(in the sense that they do not optimize the entire interpreted operator tree or replace it altogether\) and are targeted mostly at CPU-bound complex queries, typically seen in OLAP use cases. The execution of queries is performed in a pull model \(Volcano-style processing\) using an interpreted operator tree. When activated, the compilation is performed at each query execution. At the moment, caching of the generated LLVM code and its reuse across sessions and queries is not yet provided. + +In contrast, MOT JIT optimization provides LLVM code for entire queries that qualify for JIT optimization by MOT. The resulting code is used for direct execution over MOT tables, while the interpreted operator model is abandoned completely.
The result is _practically_ handwritten LLVM code that has been generated for an entire specific query execution. + +Another significant conceptual difference is that MOT LLVM code is only generated for prepared queries during the PREPARE phase of the query, rather than at query execution. This is especially important for OLTP scenarios due to the rather short runtime of OLTP queries, which cannot allow for code generation and relatively long query compilation time to be performed during each query execution. + +Finally, in PostgreSQL the activation of a PREPARE implies the reuse of the resulting plan across executions with different parameters in the same session. Similarly, the MOT JIT applies a caching policy for its LLVM code results, and extends it for reuse across different sessions. Thus, a single query may be compiled just once and its LLVM code may be reused across many sessions, which again is beneficial for OLTP scenarios. + diff --git a/content/en/docs/Developerguide/recovery-1.md b/content/en/docs/Developerguide/mot-recovery-concepts.md similarity index 73% rename from content/en/docs/Developerguide/recovery-1.md rename to content/en/docs/Developerguide/mot-recovery-concepts.md index 27446935157c85f282082b704f9c5df97981f853..c15b0107b65725451999548a7d38399c62e40763 100644 --- a/content/en/docs/Developerguide/recovery-1.md +++ b/content/en/docs/Developerguide/mot-recovery-concepts.md @@ -1,18 +1,18 @@ -# Recovery - -The MOT Recovery Module provides all the required functionality for recovering the MOT tables data. The main objective of the Recovery module is to restore the data and the MOT engine to a consistent state after planned \(maintenance for example\) shut down or unplanned \(power failure for example\) crash. - -Database recovery, which is also sometimes called a Cold Start, includes MOT tables and is performed automatically with the recovery of the rest of the database. The MOT Recovery Module is fully and seamlessly integrated into openGauss recovery process. - -MOT recovery has two main stages – Checkpoint recovery and WAL recovery \(Redo Log\). - -MOT checkpoint recovery is performed before the envelope's recovery takes place. This is done only at cold-start events \(start of a PG process\). It recovers the metadata first \(schema\) and then inserts all the rows from the current valid checkpoint, which is done in parallel by checkpoint\_recovery\_workers, each working on a different table. The indexes are created during the insert process. - -When checkpointing a table, it is divided into 16MB chunks, so that multiple recovery workers can recover the table in parallel. This is done in order to speed-up the checkpoint recovery, it is implemented as a multi-threaded procedure where each thread is responsible for recovering a different segment. There are no dependencies between different segments therefore there is no contention between the threads and there is no need to use locks when updating table or inserting new rows. - -WAL records are recovered as part of the envelope's WAL recovery. openGauss envelope iterates through the XLOG and performs the necessary operation based on the xlog record type. In case of entry with record type MOT, the envelope forwards it to MOT RecoveryManager for handling. The xlog entry will be ignored by MOT recovery, if it is ‘too ‘old’ – its LSN is older than the checkpoint's LSN \(Log Sequence Number\). - -In an active-standby deployment, the standby server is always in a Recovery state for an automatic WAL recovery process. 
- -The MOT recovery parameters are set in the mot.conf file explained in the [Recovery-Configuration](#_#_recovery) section. - +# MOT Recovery Concepts + +The MOT Recovery Module provides all the required functionality for recovering the MOT tables data. The main objective of the Recovery module is to restore the data and the MOT engine to a consistent state after a planned \(maintenance for example\) shut down or an unplanned \(power failure for example\) crash. + +OpenGauss database recovery, which is also sometimes called a _Cold Start_, includes MOT tables and is performed automatically with the recovery of the rest of the database. The MOT Recovery Module is seamlessly and fully integrated into the openGauss recovery process. + +MOT recovery has two main stages – Checkpoint Recovery and WAL Recovery \(Redo Log\). + +MOT checkpoint recovery is performed before the envelope's recovery takes place. This is done only at cold-start events \(start of a PG process\). It recovers the metadata first \(schema\) and then inserts all the rows from the current valid checkpoint, which is done in parallel by checkpoint\_recovery\_workers, each working on a different table. The indexes are created during the insert process. + +When checkpointing a table, it is divided into 16MB chunks, so that multiple recovery workers can recover the table in parallel. This is done in order to speed-up the checkpoint recovery, it is implemented as a multi-threaded procedure where each thread is responsible for recovering a different segment. There are no dependencies between different segments therefore there is no contention between the threads and there is no need to use locks when updating table or inserting new rows. + +WAL records are recovered as part of the envelope's WAL recovery. openGauss envelope iterates through the XLOG and performs the necessary operation based on the xlog record type. In case of entry with record type MOT, the envelope forwards it to MOT RecoveryManager for handling. The xlog entry will be ignored by MOT recovery, if it is ‘too ‘old’ – its LSN is older than the checkpoint's LSN \(Log Sequence Number\). + +In an active-standby deployment, the standby server is always in a Recovery state for an automatic WAL recovery process. + +The MOT recovery parameters are set in the mot.conf file explained in the [MOT Recovery](mot-recovery.md) section. + diff --git a/content/en/docs/Developerguide/recovery.md b/content/en/docs/Developerguide/mot-recovery.md similarity index 36% rename from content/en/docs/Developerguide/recovery.md rename to content/en/docs/Developerguide/mot-recovery.md index 46f897f4c108aaa5cb7be16b30f6f0d5e08fdd45..5b0f08011c06f2ab83a1df7d1ec5d67713df4d4e 100644 --- a/content/en/docs/Developerguide/recovery.md +++ b/content/en/docs/Developerguide/mot-recovery.md @@ -1,19 +1,18 @@ -# Recovery - -The main objective of MOT Recovery is to restore the data and the MOT engine to a consistent state after a planned shutdown \(for example, for maintenance\) or an unplanned crash \(for example, after a power failure\). - -MOT recovery is performed automatically with the recovery of the rest of the openGauss database and is fully integrated into openGauss recovery process \(also called a _Cold Start_\). - -MOT recovery consists of two stages – - -- **Checkpoint Recovery –** First, data must be recovered from the latest Checkpoint file on disk by loading it into memory rows and creating indexes. 
-- **WAL Redo Log Recovery –** Afterwards, the recent data \(which was not captured in the Checkpoint\) must be recovered from the WAL Redo Log by replaying records that were added to the log since the Checkpoint that was used in the Checkpoint Recovery \(described above\). - -The WAL Redo Log recovery is managed and triggered by openGauss. - -## Configuring Recovery - -While WAL recovery is performed in a serial manner, the Checkpoint recovery can be configured to run in a multi-threaded manner \(meaning in parallel by multiple workers\). - -This process is controlled by the **Checkpoint\_recovery\_workers** parameter in the **mot.conf** file, which is described in the [RECOVERY ](#_#_recovery)section. - +# MOT Recovery + +The main objective of MOT Recovery is to restore the data and the MOT engine to a consistent state after a planned shutdown \(for example, for maintenance\) or an unplanned crash \(for example, after a power failure\). + +MOT recovery is performed automatically with the recovery of the rest of the openGauss database and is fully integrated into openGauss recovery process \(also called a _Cold Start_\). + +MOT recovery consists of two stages – + +**Checkpoint Recovery –** First, data must be recovered from the latest Checkpoint file on disk by loading it into memory rows and creating indexes. + +**WAL Redo Log Recovery –** Afterwards, the recent data \(which was not captured in the Checkpoint\) must be recovered from the WAL Redo Log by replaying records that were added to the log since the Checkpoint that was used in the Checkpoint Recovery \(described above\). + +The WAL Redo Log recovery is managed and triggered by openGauss. + +- To configure recovery – +- While WAL recovery is performed in a serial manner, the Checkpoint recovery can be configured to run in a multi-threaded manner \(meaning in parallel by multiple workers\). +- Configure the **Checkpoint\_recovery\_workers** parameter in the **mot.conf** file, which is described in the [RECOVERY \(MOT\)](mot-configuration-settings.md#section7442447103115) section. + diff --git a/content/en/docs/Developerguide/replication-and-high-availability.md b/content/en/docs/Developerguide/mot-replication-and-high-availability.md similarity index 62% rename from content/en/docs/Developerguide/replication-and-high-availability.md rename to content/en/docs/Developerguide/mot-replication-and-high-availability.md index c6db53def66a75aa0b291b261781d47b8b52883f..c4595adb3b75ef172de083a7fa398a7aa64f97e5 100644 --- a/content/en/docs/Developerguide/replication-and-high-availability.md +++ b/content/en/docs/Developerguide/mot-replication-and-high-availability.md @@ -1,12 +1,11 @@ -# Replication and High Availability - -Since MOT is integrated into openGauss and uses/supports its replication and high availability, both synchronous and asynchronous replication are supported out of the box. - -The openGauss gs\_ctl tool is used for availability control and to operate the cluster. This includes gs\_ctl switchover, gs\_ctl failover, gs\_ctl build and so on. - -You may refer to the openGauss Tools Reference document for more information. - -## Configuring Replication and High Availability - -To configure the replication and high availability, refer to the relevant openGauss documentation. - +# MOT Replication and High Availability + +Since MOT is integrated into openGauss and uses/supports its replication and high availability, both synchronous and asynchronous replication are supported out of the box. 
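Replication itself is configured at the openGauss \(envelope\) level rather than inside MOT. As a minimal sketch only, assuming the standard PostgreSQL-style parameters that openGauss inherits \(verify the exact names and recommended values in the openGauss documentation\), the synchronous/asynchronous behavior is typically controlled in **postgresql.conf** –

```
# Illustrative sketch - confirm parameter names and values against the openGauss documentation.
# 'on' waits for the standby to acknowledge the WAL before commit returns (synchronous);
# 'local' or 'off' makes the commit independent of the standby (asynchronous).
synchronous_commit = on

# Lists the standby name(s) whose acknowledgement is required for synchronous commit.
synchronous_standby_names = '*'
```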
+ +The openGauss gs\_ctl tool is used for availability control and to operate the cluster. This includes gs\_ctl switchover, gs\_ctl failover, gs\_ctl build and so on. + +You may refer to the openGauss Tools Reference document for more information. + +- To configure replication and high availability – +- Refer to the relevant openGauss documentation. + diff --git a/content/en/docs/Developerguide/mot-resource-utilization.md b/content/en/docs/Developerguide/mot-resource-utilization.md new file mode 100644 index 0000000000000000000000000000000000000000..c1e2561b74468a7f9a236e11a31dc2b01b470214 --- /dev/null +++ b/content/en/docs/Developerguide/mot-resource-utilization.md @@ -0,0 +1,10 @@ +# MOT Resource Utilization + +The following figure shows the resource utilization of the test performed on a x86 server with four sockets, 96 cores and 512GB RAM server. It demonstrates that a MOT table is able to efficiently and consistently consume almost all available CPU resources. For example, it shows that almost 100% CPU percentage utilization is achieved for 192 cores and 3.9M tpmC. + +- **tmpC –** The total amount of time to load the entire database \(per database GB\) is represented by the blue line and the **TIME \(sec\)** Y axis on the left. +- **CPU % Utilization –** The quantity of database GB throughput per second is represented by the orange line and the **Throughput GB/sec** Y axis on the right. + +**Figure 1** Resource Utilization – Performance Benchmarks +![](figures/resource-utilization-performance-benchmarks.png "resource-utilization-performance-benchmarks") + diff --git a/content/en/docs/Developerguide/mot-results-summary.md b/content/en/docs/Developerguide/mot-results-summary.md new file mode 100644 index 0000000000000000000000000000000000000000..24212441bd2a9e85eab4e3a2c67efda91cc80267 --- /dev/null +++ b/content/en/docs/Developerguide/mot-results-summary.md @@ -0,0 +1,10 @@ +# MOT Results – Summary + +MOT provides higher performance than disk-tables by a factor of 2.5x to 4.1x and reaches 4.8 million tpmC on ARM/Kunpeng-based servers with 256 cores. The results clearly demonstrate MOT's exceptional ability to scale-up and utilize all hardware resources. Performance jumps as the quantity of CPU sockets and server cores increases. + +MOT delivers up to 30,000 tpmC/core on ARM/Kunpeng-based servers and up to 40,000 tpmC/core on x86-based servers. + +Due to a more efficient durability mechanism, in MOT the replication overhead of a Primary/Secondary High Availability scenario is 7% on ARM/Kunpeng and 2% on x86 servers, as opposed to the overhead in disk tables of 20% on ARM/Kunpeng and 15% on x86 servers. + +Finally, MOT delivers 2.5x lower latency, with TPC-C transaction response times of 2 to 7 times faster. + diff --git a/content/en/docs/Developerguide/mot-rto-and-cold-start-time.md b/content/en/docs/Developerguide/mot-rto-and-cold-start-time.md new file mode 100644 index 0000000000000000000000000000000000000000..929ac841896ddd9ee5b538e5b4ca24d86ea11e88 --- /dev/null +++ b/content/en/docs/Developerguide/mot-rto-and-cold-start-time.md @@ -0,0 +1,32 @@ +# MOT RTO and Cold-Start Time + +High Availability Recovery Time Objective \(RTO\) + +MOT is fully integrated into openGauss, including support for high-availability scenarios consisting of primary and secondary deployments. The WAL Redo Log's replication mechanism replicates changes into the secondary database node and uses it for replay. 
+ +If a Failover event occurs, whether it is due to an unplanned primary node failure or due to a planned maintenance event, the secondary node quickly becomes active. The amount of time that it takes to recover and replay the WAL Redo Log and to enable connections is also referred to as the Recovery Time Objective \(RTO\). + +**The RTO of openGauss, including the MOT, is less than 10 seconds.** + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>The Recovery Time Objective \(RTO\) is the duration of time and a service level within which a business process must be restored after a disaster in order to avoid unacceptable consequences associated with a break in continuity. In other words, the RTO is the answer to the question: “How much time did it take to recover after notification of a business process disruption?“ + +In addition, as shown in the [MOT High Throughput](mot-high-throughput.md) section in MOT the replication overhead of a Primary/Secondary High Availability scenario is only 7% on ARM/Kunpeng servers and 2% on x86 servers, as opposed to the replication overhead of disk-tables, which is 20% on ARM/Kunpeng and 15% on x86 servers. + +Cold-Start Recovery Time + +Cold-start Recovery time is the amount of time it takes for a system to become fully operational after a stopped mode. In memory databases, this includes the loading of all data and indexes into memory, thus it depends on data size, hardware bandwidth, and on software algorithms to process it efficiently. + +Our MOT tests used x86 servers and SSD disks to demonstrate the ability to load 100 GB of data in 65 seconds. Because MOT does not persist indexes and therefore they are created at cold-start, the actual size of the loaded data + indexes is approximately 50% more. Therefore, MOT Data + Index cold-start time can be converted into **138 GB in 60 seconds**. + +The following figure demonstrates cold-start process and how long it takes to load data into a MOT table from the disk after a cold start. + +**Figure 1** Cold-Start Time – Performance Benchmarks +![](figures/cold-start-time-performance-benchmarks.png "cold-start-time-performance-benchmarks") + +- **Database Size –** The total amount of time to load the entire database \(per database GB\) is represented by the blue line and the **TIME \(sec\)** Y axis on the left. +- **Throughput –** The quantity of database GB throughput per second is represented by the orange line and the **Throughput GB/sec** Y axis on the right. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>The performance demonstrated during the test is very close to the bandwidth of the SSD hardware. Therefore, it is feasible that higher \(or lower\) performance may be achieved on a different platform. + diff --git a/content/en/docs/Developerguide/tpc-c-introduction.md b/content/en/docs/Developerguide/mot-sample-tpc-c-benchmark.md similarity index 31% rename from content/en/docs/Developerguide/tpc-c-introduction.md rename to content/en/docs/Developerguide/mot-sample-tpc-c-benchmark.md index e739df189d9319bf7d57f7cf511d5b82a78348a9..fbd912baae5634fc6831469108dcc9f6fc1f589e 100644 --- a/content/en/docs/Developerguide/tpc-c-introduction.md +++ b/content/en/docs/Developerguide/mot-sample-tpc-c-benchmark.md @@ -1,113 +1,181 @@ -# TPC-C Introduction - -The TPC-C Benchmark is an industry standard benchmark for measuring the performance of Online Transaction Processing \(OLTP\) systems. It is based on a complex database and a number of different transaction types that are executed on it. 
TPC-C is both a hardware‑independent and a software-independent benchmark and can thus be run on every test platform. An official overview of the benchmark model can be found at the tpc.org website here [http://www.tpc.org/tpcc/detail.asp](http://www.tpc.org/tpcc/detail.asp). - -The database consists of nine tables of various structures and thus also nine types of data records. The size and quantity of the data records varies per table. A mix of five concurrent transactions of varying types and complexities is executed on the database, which are largely online or in part queued for deferred batch processing. Because these tables compete for limited system resources, many system components are stressed and data changes are executed in a variety of ways. - -**Table 1** TPC-C Database Structure - - -
+# MOT Sample TPC-C Benchmark + +## TPC-C Introduction + +The TPC-C Benchmark is an industry standard benchmark for measuring the performance of Online Transaction Processing \(OLTP\) systems. It is based on a complex database and a number of different transaction types that are executed on it. TPC-C is both a hardware‑independent and a software-independent benchmark and can thus be run on every test platform. An official overview of the benchmark model can be found at the tpc.org website here – [http://www.tpc.org/default5.asp](http://www.tpc.org/default5.asp). + +The database consists of nine tables of various structures and thus also nine types of data records. The size and quantity of the data records varies per table. A mix of five concurrent transactions of varying types and complexities is executed on the database, which are largely online or in part queued for deferred batch processing. Because these tables compete for limited system resources, many system components are stressed and data changes are executed in a variety of ways. + +**Table 1** TPC-C Database Structure + + + - - - - - - - - - - - - - - - - - - - -
| Table | Number of Entries |
| --- | --- |
| Warehouse | n |
| Item | 100,000 |
| Stock | n x 100,000 |
| District | n x 10 |
| Customer | 3,000 per district, 30,000 per warehouse |
| Order | Number of customers (initial value) |
| New order | 30% of the orders (initial value) |
| Order line | ~ 10 per order |
| History | Number of customers (initial value) |
- -The transaction mix represents the complete business processing of an order – from its entry through to its delivery. More specifically, the provided mix is designed to produce an equal number of new-order transactions and payment transactions and to produce a single delivery transaction, a single order-status transaction and a single stock-level transaction for every ten new-order transactions. - -**Table 2** TPC-C Transactions Ratio - - -
+ +The transaction mix represents the complete business processing of an order – from its entry through to its delivery. More specifically, the provided mix is designed to produce an equal number of new-order transactions and payment transactions and to produce a single delivery transaction, a single order-status transaction and a single stock-level transaction for every ten new-order transactions. + +**Table 2** TPC-C Transactions Ratio + + + - - - - - - - - - - - -
| Transaction Level ≥ 4% | Share of All Transactions |
| --- | --- |
| New order | ≤ 45% |
| Payment | ≥ 43% |
| Order status | ≥ 4% |
| Delivery | ≥ 4% (batch) |
| Stock level | ≥ 4% |
- -There are two ways to execute the transactions – **as stored procedures** \(which allow higher throughput\) and in **standard interactive SQL mode**. - -- Performance Metric – tpm-C - -The tpm-C metric is the number of new-order transactions executed per minute. Given the required mix and a wide range of complexity and types among the transactions, this metric most closely simulates a comprehensive business activity, not just one or two transactions or computer operations. For this reason, the tpm-C metric is considered to be a measure of business throughput. - -The tpm-C unit of measure is expressed as transactions-per-minute-C, whereas "C" stands for TPC-C specific benchmark. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->The official TPC-C Benchmark specification can be accessed at – [http://www.tpc.org/tpc\_documents\_current\_versions/pdf/tpc-c\_v5.11.0.pdf](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf). Some of the rules of this specification are generally not fulfilled in the industry, because they are too strict for industry reality. For example, Scaling rules – \(a\) tpm-C / Warehouse must be \>9 and <12.86 \(implying that a very high warehouses rate is required in order to achieve a high tpm-C rate, which also means that an extremely large database and memory capacity are required\); and \(b\) 10x terminals x Warehouses \(implying a huge quantity of simulated clients\). - +
+ +There are two ways to execute the transactions – **as stored procedures** \(which allow higher throughput\) and in **standard interactive SQL mode**. + +**Performance Metric – tpm-C** + +The tpm-C metric is the number of new-order transactions executed per minute. Given the required mix and a wide range of complexity and types among the transactions, this metric most closely simulates a comprehensive business activity, not just one or two transactions or computer operations. For this reason, the tpm-C metric is considered to be a measure of business throughput. + +The tpm-C unit of measure is expressed as transactions-per-minute-C, whereas "C" stands for TPC-C specific benchmark. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>The official TPC-C Benchmark specification can be accessed at – [http://www.tpc.org/tpc\_documents\_current\_versions/pdf/tpc-c\_v5.11.0.pdf](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf). Some of the rules of this specification are generally not fulfilled in the industry, because they are too strict for industry reality. For example, Scaling rules – \(a\) tpm-C / Warehouse must be \>9 and <12.86 \(implying that a very high warehouses rate is required in order to achieve a high tpm-C rate, which also means that an extremely large database and memory capacity are required\); and \(b\) 10x terminals x Warehouses \(implying a huge quantity of simulated clients\). + +## System-Level Optimization + +Follow the instructions in the [MOT Server Optimization – x86](mot-server-optimization-x86.md) section. The following section describes the key system-level optimizations for deploying the openGauss database on a Huawei Taishan server and on a Euler 2.8 operating system for ultimate performance. + +## BenchmarkSQL – An Open-Source TPC-C Tool + +For example, to test TPCC, the **BenchmarkSQL** can be used, as follows – + +- Download **benchmarksql** from the following link – [https://osdn.net/frs/g\_redir.php?m=kent&f=benchmarksql%2Fbenchmarksql-5.0.zip](https://osdn.net/frs/g_redir.php?m=kent&f=benchmarksql%2Fbenchmarksql-5.0.zip) +- The schema creation scripts in the **benchmarksql** tool need to be adjusted to MOT syntax and unsupported DDLs need to be avoided. The adjusted scripts can be directly downloaded from the following link – [https://opengauss.obs.cn-south-1.myhuaweicloud.com/1.0.0/MOT-TPCC-Benchmark.tar.gz](https://opengauss.obs.cn-south-1.myhuaweicloud.com/1.0.0/MOT-TPCC-Benchmark.tar.gz). The contents of this tar file includes sql.common.opengauss.mot folder and jTPCCTData.java file as well as a sample configuration file postgresql.conf and a TPCC properties file props.mot for reference. +- Place the sql.common.opengauss.mot folder in the same level as sql.common under run folder and replace the file src/client/jTPCCTData.java with the downloaded java file. +- Edit the file runDatabaseBuild.sh under run folder to remove **extraHistID** from **AFTER\_LOAD** list to avoid unsupported alter table DDL. +- Replace the JDBC driver under lib/postgres folder with the openGauss JDBC driver available from the following link – [https://opengauss.org/en/download.html](https://opengauss.org/en/download.html). + +The only change done in the downloaded java file \(compared to the original one\) was to comment the error log printing for serialization and duplicate key errors. These errors are normal in case of MOT, since it uses Optimistic Concurrency Control \(OCC\) mechanism. 
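As a rough illustration of the file layout described in the bullets above \(the paths and archive locations are assumptions, not part of the official procedure\), the preparation steps might look like this –

```
# Illustrative sketch only - adjust paths to your environment.
cd benchmarksql-5.0/run

# Place the MOT-adjusted schema scripts alongside the original sql.common.
cp -r /path/to/MOT-TPCC-Benchmark/sql.common.opengauss.mot .

# Replace the TPC-C transaction class with the MOT-adjusted version.
cp /path/to/MOT-TPCC-Benchmark/jTPCCTData.java ../src/client/jTPCCTData.java

# Manually remove extraHistID from the AFTER_LOAD list in runDatabaseBuild.sh.
vi runDatabaseBuild.sh

# Use the openGauss JDBC driver instead of the bundled PostgreSQL driver.
cp /path/to/opengauss-jdbc.jar ../lib/postgres/
```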
+ +>![](public_sys-resources/icon-note.gif) **NOTE:** +>The benchmark test is executed using a standard interactive SQL mode without stored procedures. + +## Running the Benchmark + +Anyone can run the benchmark by starting up the server and running the **benchmarksql** scripts. + +To run the benchmark – + +1. Go to the **benchmarksql** run folder and rename sql.common to sql.common.orig. +2. Create a link sql.common to sql.common.opengauss.mot in order to test MOT. +3. Start up the database server. +4. Configure the props.pg file in the client. +5. Run the benchmark. + +## Results Report + +- Results in CLI + + BenchmarkSQL results should appear as follows – + + ![](figures/en-us_image_0270447139.jpg) + + Over time, the benchmark measures and averages the committed transactions. The example above benchmarks for two minutes. + + The score is **2.71M tmp-C** \(new-orders per-minute\), which is 45% of the total committed transactions, meaning the **tpmTOTAL**. + +- Detailed Result Report + + The following is an example of a detailed result report – + + **Figure 1** Detailed Result Report + ![](figures/detailed-result-report.png "detailed-result-report") + + ![](figures/en-us_image_0270447141.png) + + BenchmarkSQL collects detailed performance statistics and operating system performance data \(if configured\). + + This information can show the latency of the queries, and thus expose bottlenecks related to storage/network/CPU. + +- Results of TPC-C of MOT on Huawei Taishan 2480 + + Our TPC-C benchmark dated 01-May-2020 with an openGauss database installed on Taishan 2480 server \(a 4-socket ARM/Kunpeng server\), achieved a throughput of 4.79M tpm‑C. + + A near linear scalability was demonstrated, as shown below – + + **Figure 2** Results of TPC-C of MOT on Huawei Taishan 2480 + ![](figures/results-of-tpc-c-of-mot-on-huawei-taishan-2480.png "results-of-tpc-c-of-mot-on-huawei-taishan-2480") + + diff --git a/content/en/docs/Developerguide/mot-scale-up-architecture.md b/content/en/docs/Developerguide/mot-scale-up-architecture.md new file mode 100644 index 0000000000000000000000000000000000000000..b1ce0748421dca009f2ba5c1159aa4327fedb2aa --- /dev/null +++ b/content/en/docs/Developerguide/mot-scale-up-architecture.md @@ -0,0 +1,79 @@ +# MOT Scale-up Architecture + +To **scale up** means to add additional cores to the _same machine_ in order to add computing power. To scale up refers to the most common traditional form of adding computing power in a machine that has a single pair of controllers and multiple cores. Scale-up architecture is limited by the scalability limits of a machine’s controller. + +## Technical Requirements + +MOT has been designed to achieve the following – + +- **Linear Scale-up –** MOT delivers a transactional storage engine that utilizes all the cores of a single NUMA architecture server in order to provide near-linear scale-up performance. This means that MOT is targeted to achieve a direct, near-linear relationship between the quantity of cores in a machine and the multiples of performance increase. + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >The near-linear scale-up results achieved by MOT significantly outperform all other existing solutions, and come as close as possible to achieving optimal results, which are limited by the physical restrictions and limitations of hardware, such as wires. + +- **No Maximum Number of Cores Limitation –** MOT does not place any limits on the maximum quantity of cores. 
This means that MOT is scalable from a single core up to 1,000s of cores, with minimal degradation per additional core, even when crossing NUMA socket boundaries. +- **Extremely High Transactional Throughput –** MOT delivers a transactional storage engine that can achieve extremely high transactional throughput compared with any other OLTP vendor on the market. +- **Extremely Low Transactional Latency –** MOT delivers a transactional storage engine that can reach extremely low transactional latency compared with any other OLTP vendor on the market. +- **Seamless Integration and Leveraging with/of openGauss –** MOT integrates its transactional engine in a standard and seamless manner with the openGauss product. In this way, MOT reuses maximum functionality from the openGauss layers that are situated on top of its transactional storage engine. + +## Design Principles + +To achieve the requirements described above \(especially in an environment with many-cores\), our storage engine's architecture implements the following techniques and strategies – + +- Data and indexes only reside in memory. +- Data and indexes are **not** laid out with physical partitions \(because these might achieve lower performance for certain types of applications\). +- Transaction concurrency control is based on Optimistic Concurrency Control \(OCC\) without any centralized contention points. See the [MOT Concurrency Control Mechanism](mot-concurrency-control-mechanism.md) section for more information about OCC. +- Parallel Redo Logs \(ultimately per core\) are used to efficiently avoid a central locking point. +- Indexes are lock-free. See the [MOT Indexes](mot-indexes.md) section for more information about lock-free indexes. +- NUMA-awareness memory allocation is used to avoid cross-socket access, especially for session lifecycle objects. See the [NUMA Awareness Allocation and Affinity](numa-awareness-allocation-and-affinity.md) section for more information about NUMA‑awareness. +- A customized MOT memory management allocator with pre-cached object pools is used to avoid expensive runtime allocation and extra points of contention. This dedicated MOT memory allocator makes memory allocation more efficient by pre‑accessing relatively large chunks of memory from the operating system as needed and then divvying it out to the MOT as needed. + +## Integration using Foreign Data Wrappers \(FDW\) + +MOT complies with and leverages openGauss's standard extensibility mechanism – Foreign Data Wrapper \(FDW\), as shown in the following diagram. + +The PostgreSQL Foreign Data Wrapper \(FDW\) feature enables the creation of foreign tables in an MOT database that are proxies for some other data source, such as MySQL, Redis, X3 and so on. When a query is made on a foreign table, the FDW queries the external data source and returns the results, as if they were coming from a table in your database. + +openGauss relies on the PostgreSQL Foreign Data Wrappers \(FDW\) and Index support so that SQL is entirely covered, including stored procedures, user-defined functions and system function calls. + +**Figure 1** MOT Architecture +![](figures/mot-architecture.png "mot-architecture") + +In the diagram above, the MOT engine is represented in green, while the existing openGauss \(based on Postgres\) components are represented in the top part of this diagram in blue. As you can see, the Foreign Data Wrapper \(FDW\) mediates between the MOT engine and the openGauss components. 
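In practical terms, this is why an MOT table is created and queried as a foreign table: the FDW routes the SQL to the MOT engine while the rest of the statement processing stays in the standard openGauss layers. A minimal illustration \(the table and column names are examples only\) –

```
-- MOT tables are defined through the FDW as foreign tables;
-- subsequent DML and queries use the same SQL layer as disk tables.
CREATE FOREIGN TABLE customer_hot (
    id   INT NOT NULL,
    name VARCHAR(64)
);

INSERT INTO customer_hot VALUES (1, 'Mark');
SELECT name FROM customer_hot WHERE id = 1;
```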
+ +**MOT-Related FDW Customizations** + +Integrating MOT through FDW enables the reuse of the most upper layer openGauss functionality and therefore significantly shortened MOT's time-to-market without compromising SQL coverage. + +However, the original FDW mechanism in openGauss was not designed for storage engine extensions, and therefore lacks the following essential functionalities – + +- Index awareness of foreign tables to be calculated in the query planning phase +- Complete DDL interfaces +- Complete transaction lifecycle interfaces +- Checkpoint interfaces +- Redo Log interface +- Recovery interfaces +- Vacuum interfaces + +In order to support all the missing functionalities, the SQL layer and FDW interface layer were extended to provide the necessary infrastructure in order to enable the plugging in of the MOT transactional storage engine. + +## Result – Linear Scale-up + +The following shows the results achieved by the MOT design principles and implementation described above. + +To the best of our knowledge, MOT outperforms all existing industry-grade OLTP databases in transactional throughput of ACID-compliant workloads. + +openGauss and MOT have been tested on the following many-core systems with excellent performance scalability results. The tests were performed both on x86 Intel-based and ARM/Kunpeng-based many-core servers. You may refer to the [MOT Performance Benchmarks](mot-performance-benchmarks.md) section for more detailed performance review. + +Our TPC-C benchmark dated June 2020 tested an openGauss MOT database on a Taishan 2480 server. A 4-socket ARM/Kunpeng server, achieved throughput of 4.8 M tpmC. The following graph shows the near-linear nature of the results, meaning that it shows a significant increase in performance correlating to the increase of the quantity of cores – + +**Figure 2** TPC-C on ARM \(256 Cores\) +![](figures/tpc-c-on-arm-(256-cores).png "tpc-c-on-arm-(256-cores)") + +The following is an additional example that shows a test on an x86-based server also showing CPU utilization. + +**Figure 3** tpmC vs CPU Usage +![](figures/tpmc-vs-cpu-usage.png "tpmc-vs-cpu-usage") + +The chart shows that MOT demonstrates a significant performance increase correlation with an increase of the quantity of cores. MOT consumes more and more of the CPU correlating to the increase of the quantity of cores. Other industry solutions do not increase and sometimes show slightly degraded performance, which is a well-known problem in the database industry that affects customers’ CAPEX and OPEX expenses and operational efficiency. + diff --git a/content/en/docs/Developerguide/mot-server-optimization-arm-huawei-taishan-2p-4p.md b/content/en/docs/Developerguide/mot-server-optimization-arm-huawei-taishan-2p-4p.md new file mode 100644 index 0000000000000000000000000000000000000000..85545411decb9f7ab877c894eb4362d2ca874963 --- /dev/null +++ b/content/en/docs/Developerguide/mot-server-optimization-arm-huawei-taishan-2p-4p.md @@ -0,0 +1,108 @@ +# MOT Server Optimization – ARM Huawei Taishan 2P/4P + +The following are optional settings for optimizing MOT database performance running on an ARM/Kunpeng-based Huawei Taishan 2280 v2 server powered by 2-sockets with a total of 256 Cores\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] and Taishan 2480 v2 server powered by 4-sockets with a total of 256 Cores\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\]. 
+ +Unless indicated otherwise, the following settings are for both client and server machines – + +## BIOS + +Modify related BIOS settings, as follows – + +1. Select **BIOS** è **Advanced** è **MISC Config**. Set **Support Smmu** to **Disabled**. +2. Select **BIOS** è **Advanced** è **MISC Config**. Set **CPU Prefetching Configuration** to **Disabled**. + + ![](figures/en-us_image_0270362942.png) + +3. Select **BIOS** è **Advanced** è **Memory Config**. Set **Die Interleaving** to **Disabled**. + + ![](figures/en-us_image_0270362943.png) + +4. Select **BIOS** è **Advanced** è **Performance Config**. Set **Power Policy** to **Performance**. + + ![](figures/en-us_image_0270362944.png) + + +## OS – Kernel and Boot + +- The following operating system kernel and boot parameters are usually configured by a sysadmin. + + Configure the kernel parameters, as follows – + + ``` + net.ipv4.ip_local_port_range = 9000 65535 + kernel.sysrq = 1 + kernel.panic_on_oops = 1 + kernel.panic = 5 + kernel.hung_task_timeout_secs = 3600 + kernel.hung_task_panic = 1 + vm.oom_dump_tasks = 1 + kernel.softlockup_panic = 1 + fs.file-max = 640000 + kernel.msgmnb = 7000000 + kernel.sched_min_granularity_ns = 10000000 + kernel.sched_wakeup_granularity_ns = 15000000 + kernel.numa_balancing=0 + vm.max_map_count = 1048576 + net.ipv4.tcp_max_tw_buckets = 10000 + net.ipv4.tcp_tw_reuse = 1 + net.ipv4.tcp_tw_recycle = 1 + net.ipv4.tcp_keepalive_time = 30 + net.ipv4.tcp_keepalive_probes = 9 + net.ipv4.tcp_keepalive_intvl = 30 + net.ipv4.tcp_retries2 = 80 + kernel.sem = 32000 1024000000 500 32000 + kernel.shmall = 52805669 + kernel.shmmax = 18446744073692774399 + sys.fs.file-max = 6536438 + net.core.wmem_max = 21299200 + net.core.rmem_max = 21299200 + net.core.wmem_default = 21299200 + net.core.rmem_default = 21299200 + net.ipv4.tcp_rmem = 8192 250000 16777216 + net.ipv4.tcp_wmem = 8192 250000 16777216 + net.core.somaxconn = 65535 + vm.min_free_kbytes = 5270325 + net.core.netdev_max_backlog = 65535 + net.ipv4.tcp_max_syn_backlog = 65535 + net.ipv4.tcp_syncookies = 1 + vm.overcommit_memory = 0 + net.ipv4.tcp_retries1 = 5 + net.ipv4.tcp_syn_retries = 5 + ##NEW + kernel.sched_autogroup_enabled=0 + kernel.sched_min_granularity_ns=2000000 + kernel.sched_latency_ns=10000000 + kernel.sched_wakeup_granularity_ns=5000000 + kernel.sched_migration_cost_ns=500000 + vm.dirty_background_bytes=33554432 + kernel.shmmax=21474836480 + net.ipv4.tcp_timestamps = 0 + net.ipv6.conf.all.disable_ipv6=1 + net.ipv6.conf.default.disable_ipv6=1 + net.ipv4.tcp_keepalive_time=600 + net.ipv4.tcp_keepalive_probes=3 + kernel.core_uses_pid=1 + ``` + + +- Tuned Service + + The following section is mandatory. + + The server must run a throughput-performance profile – + + ``` + [...]$ tuned-adm profile throughput-performance + ``` + + The **throughput-performance** profile is broadly applicable tuning that provides excellent performance across a variety of common server workloads. + + Other less suitable profiles for openGauss and MOT server that may affect MOT's overall performance are – balanced, desktop, latency-performance, network-latency, network-throughput and powersave. + +- Boot Tuning + + Add **iommu.passthrough=1** to the **kernel boot arguments**. + + When operating in **pass-through** mode, the adapter does require **DMA translation to the memory,** which improves performance. 
+ + diff --git a/content/en/docs/Developerguide/server-optimization-x86.md b/content/en/docs/Developerguide/mot-server-optimization-x86.md similarity index 67% rename from content/en/docs/Developerguide/server-optimization-x86.md rename to content/en/docs/Developerguide/mot-server-optimization-x86.md index 1acbe81accaa7d1fa99ddb07aa2421e0046a1ae9..8b6ea24e15437489bb5f751c259c44441c2c5ba0 100644 --- a/content/en/docs/Developerguide/server-optimization-x86.md +++ b/content/en/docs/Developerguide/mot-server-optimization-x86.md @@ -1,163 +1,159 @@ -# Server Optimization – x86 - -The following are optional settings for optimizing MOT database performance running on an Intel x86 server. These settings are optimal for high throughput workloads - -- [BIOS](#section13218630941) -- [OS Environment Settings](#section850904112414) -- [Disk/SSD](#section11607173101020) -- [Network](#section1528763861114) - -Generally, databases are bounded by the following components: - -- **CPU –** A faster CPU speeds up any CPU-bound database. -- **Disk –** High-speed SSD/NVME speeds up any I/O-bound database. -- **Network **– A faster network speeds up any **SQL\*Net**-bound database. - -In addition to the above, the following general-purpose server settings are used by default and may significantly affect a database’s performance. - -MOT performance tuning is a crucial step for ensuring fast application functionality and data retrieval. MOT can utilize state-of-the-art hardware, and therefore it is extremely important to tune each system in order to achieve maximum throughput. - -## BIOS - -**Hyper Threading – ON** - -Activation \(HT=ON\) is highly recommended. - -We recommend turning hyper threading ON while running OLTP workloads on MOT. When hyper-threading is used, some OLTP workloads demonstrate performance gains of up to40%. - -## OS Environment Settings - -- NUMA - - Disable NUMA balancing, as described below. MOT performs its own memory management with extremely efficient NUMA-awareness, much more than the default methods used by the operating system. - - ``` - echo 0 > /proc/sys/kernel/numa_balancing - ``` - -- Services - - Disable Services, as described below. - - ``` - service irqbalance stop # MANADATORY - service sysmonitor stop # OPTIONAL, performance - service rsyslog stop # OPTIONAL, performance - ``` - -- Tuned Service - - The following section is mandatory. - - The server must run the throughput-performance profile. - - ``` - [...]$ tuned-adm profile throughput-performance - ``` - - The **throughput-performance** profile is broadly applicable tuning that provides excellent performance across a variety of common server workloads. - - Other less suitable profiles for openGauss and MOT server that may affect MOT’s overall performance are – balanced, desktop, latency-performance, network-latency, network-throughput and powersave. - -- Sysctl - - The following lists the recommended operating system settings for best performance. 
- - - Add the following settings to /etc/sysctl.conf and run sysctl -p - - ``` - net.ipv4.ip_local_port_range = 9000 65535 - kernel.sysrq = 1 - kernel.panic_on_oops = 1 - kernel.panic = 5 - kernel.hung_task_timeout_secs = 3600 - kernel.hung_task_panic = 1 - vm.oom_dump_tasks = 1 - kernel.softlockup_panic = 1 - fs.file-max = 640000 - kernel.msgmnb = 7000000 - kernel.sched_min_granularity_ns = 10000000 - kernel.sched_wakeup_granularity_ns = 15000000 - kernel.numa_balancing=0 - vm.max_map_count = 1048576 - net.ipv4.tcp_max_tw_buckets = 10000 - net.ipv4.tcp_tw_reuse = 1 - net.ipv4.tcp_tw_recycle = 1 - net.ipv4.tcp_keepalive_time = 30 - net.ipv4.tcp_keepalive_probes = 9 - net.ipv4.tcp_keepalive_intvl = 30 - net.ipv4.tcp_retries2 = 80 - kernel.sem = 250 6400000 1000 25600 - net.core.wmem_max = 21299200 - net.core.rmem_max = 21299200 - net.core.wmem_default = 21299200 - net.core.rmem_default = 21299200 - #net.sctp.sctp_mem = 94500000 915000000 927000000 - #net.sctp.sctp_rmem = 8192 250000 16777216 - #net.sctp.sctp_wmem = 8192 250000 16777216 - net.ipv4.tcp_rmem = 8192 250000 16777216 - net.ipv4.tcp_wmem = 8192 250000 16777216 - net.core.somaxconn = 65535 - vm.min_free_kbytes = 26351629 - net.core.netdev_max_backlog = 65535 - net.ipv4.tcp_max_syn_backlog = 65535 - #net.sctp.addip_enable = 0 - net.ipv4.tcp_syncookies = 1 - vm.overcommit_memory = 0 - net.ipv4.tcp_retries1 = 5 - net.ipv4.tcp_syn_retries = 5 - ``` - - - Update the section of /etc/security/limits.conf to the following - - ``` - soft nofile 100000 - hard nofile 100000 - ``` - - The **soft** and a **hard** limit settings specify the quantity of files that a process may have opened at once. The soft limit may be changed by each process running these limits up to the hard limit value. - - -## Disk/SSD - -The following describes how to ensure that disk R/W performance is suitable for database synchronous commit mode. - -To do so, test your disk bandwidth using the following - -``` -[...]$ sync; dd if=/dev/zero of=testfile bs=1M count=1024; sync -1024+0 records in -1024+0 records out -1073741824 bytes (1.1 GB) copied, 1.36034 s, 789 MB/s -``` - -In case the disk bandwidth is significantly below the above number \(789 MB/s\), it may create a performance bottleneck for openGauss, and especially for MOT. - -## Network - -Use a 10Gbps network or higher. - -- To verify, use iperf, as follows. - - ``` - Server side: iperf -s - Client side: iperf -c - ``` - -- rc.local – Network Card tuning - - The following optional settings will have a significant effect on performance: - - 1. Copy set\_irq\_affinity.sh from [https://gist.github.com/SaveTheRbtz/8875474](https://gist.github.com/SaveTheRbtz/8875474) to /var/scripts/. - 2. Put in /etc/rc.d/rc.local and run chmod in order to ensure that the following script is executed during boot - - ``` - 'chmod +x /etc/rc.d/rc.local' - var/scripts/set_irq_affinity.sh -x all - ethtool -K gro off - ethtool -C adaptive-rx on adaptive-tx on - Replace with the network card, i.e. ens5f1 - ``` - - - +# MOT Server Optimization – x86 + +Generally, databases are bounded by the following components – + +- **CPU –** A faster CPU speeds up any CPU-bound database. +- **Disk –** High-speed SSD/NVME speeds up any I/O-bound database. +- **Network –** A faster network speeds up any **SQL\*Net**-bound database. + +In addition to the above, the following general-purpose server settings are used by default and may significantly affect a database's performance. 
+ +MOT performance tuning is a crucial step for ensuring fast application functionality and data retrieval. MOT can utilize state-of-the-art hardware, and therefore it is extremely important to tune each system in order to achieve maximum throughput. + +The following are optional settings for optimizing MOT database performance running on an Intel x86 server. These settings are optimal for high throughput workloads – + +## BIOS + +- Hyper Threading – ON + + Activation \(HT=ON\) is highly recommended. + + We recommend turning hyper threading ON while running OLTP workloads on MOT. When hyper‑threading is used, some OLTP workloads demonstrate performance gains of up to40%. + + +## OS Environment Settings + +- NUMA + + Disable NUMA balancing, as described below. MOT performs its own memory management with extremely efficient NUMA-awareness, much more than the default methods used by the operating system. + + ``` + echo 0 > /proc/sys/kernel/numa_balancing + ``` + +- Services + + Disable Services, as described below – + + ``` + service irqbalance stop # MANADATORY + service sysmonitor stop # OPTIONAL, performance + service rsyslog stop # OPTIONAL, performance + ``` + +- Tuned Service + + The following section is mandatory. + + The server must run the throughput-performance profile – + + ``` + [...]$ tuned-adm profile throughput-performance + ``` + + The **throughput-performance** profile is broadly applicable tuning that provides excellent performance across a variety of common server workloads. + + Other less suitable profiles for openGauss and MOT server that may affect MOT's overall performance are – balanced, desktop, latency-performance, network-latency, network-throughput and powersave. + +- Sysctl + + The following lists the recommended operating system settings for best performance. + + - Add the following settings to /etc/sysctl.conf and run sysctl -p + + ``` + net.ipv4.ip_local_port_range = 9000 65535 + kernel.sysrq = 1 + kernel.panic_on_oops = 1 + kernel.panic = 5 + kernel.hung_task_timeout_secs = 3600 + kernel.hung_task_panic = 1 + vm.oom_dump_tasks = 1 + kernel.softlockup_panic = 1 + fs.file-max = 640000 + kernel.msgmnb = 7000000 + kernel.sched_min_granularity_ns = 10000000 + kernel.sched_wakeup_granularity_ns = 15000000 + kernel.numa_balancing=0 + vm.max_map_count = 1048576 + net.ipv4.tcp_max_tw_buckets = 10000 + net.ipv4.tcp_tw_reuse = 1 + net.ipv4.tcp_tw_recycle = 1 + net.ipv4.tcp_keepalive_time = 30 + net.ipv4.tcp_keepalive_probes = 9 + net.ipv4.tcp_keepalive_intvl = 30 + net.ipv4.tcp_retries2 = 80 + kernel.sem = 250 6400000 1000 25600 + net.core.wmem_max = 21299200 + net.core.rmem_max = 21299200 + net.core.wmem_default = 21299200 + net.core.rmem_default = 21299200 + #net.sctp.sctp_mem = 94500000 915000000 927000000 + #net.sctp.sctp_rmem = 8192 250000 16777216 + #net.sctp.sctp_wmem = 8192 250000 16777216 + net.ipv4.tcp_rmem = 8192 250000 16777216 + net.ipv4.tcp_wmem = 8192 250000 16777216 + net.core.somaxconn = 65535 + vm.min_free_kbytes = 26351629 + net.core.netdev_max_backlog = 65535 + net.ipv4.tcp_max_syn_backlog = 65535 + #net.sctp.addip_enable = 0 + net.ipv4.tcp_syncookies = 1 + vm.overcommit_memory = 0 + net.ipv4.tcp_retries1 = 5 + net.ipv4.tcp_syn_retries = 5 + ``` + + - Update the section of /etc/security/limits.conf to the following – + + ``` + soft nofile 100000 + hard nofile 100000 + ``` + + The **soft** and a **hard** limit settings specify the quantity of files that a process may have opened at once. 
The soft limit may be changed by each process running these limits up to the hard limit value. + +- Disk/SSD + + The following describes how to ensure that disk R/W performance is suitable for database synchronous commit mode. + + To do so, test your disk bandwidth using the following + + ``` + [...]$ sync; dd if=/dev/zero of=testfile bs=1M count=1024; sync + 1024+0 records in + 1024+0 records out + 1073741824 bytes (1.1 GB) copied, 1.36034 s, 789 MB/s + ``` + + In case the disk bandwidth is significantly below the above number \(789 MB/s\), it may create a performance bottleneck for openGauss, and especially for MOT. + + +## Network + +Use a 10Gbps network or higher. + +To verify, use iperf, as follows – + +``` +Server side: iperf -s +Client side: iperf -c +``` + +- rc.local – Network Card Tuning + + The following optional settings have a significant effect on performance – + + 1. Copy set\_irq\_affinity.sh from [https://gist.github.com/SaveTheRbtz/8875474](https://gist.github.com/SaveTheRbtz/8875474) to /var/scripts/. + 2. Put in /etc/rc.d/rc.local and run chmod in order to ensure that the following script is executed during boot – + + ``` + 'chmod +x /etc/rc.d/rc.local' + var/scripts/set_irq_affinity.sh -x all + ethtool -K gro off + ethtool -C adaptive-rx on adaptive-tx on + Replace with the network card, i.e. ens5f1 + ``` + + + diff --git a/content/en/docs/Developerguide/silo-enhancements-for-mot.md b/content/en/docs/Developerguide/mot-silo-enhancements.md similarity index 55% rename from content/en/docs/Developerguide/silo-enhancements-for-mot.md rename to content/en/docs/Developerguide/mot-silo-enhancements.md index ec6c9b9b53b50e80dac27b75c83493598944645c..3166b6259a4ad1a942a627d17a3b6eaa0e71788b 100644 --- a/content/en/docs/Developerguide/silo-enhancements-for-mot.md +++ b/content/en/docs/Developerguide/mot-silo-enhancements.md @@ -1,15 +1,15 @@ -# SILO Enhancements for MOT - -SILO in its basic algorithm flow outperformed many other ACID-compliant OCCs that we tested in our research experiments. However, in order to make it a product-grade mechanism, we had to enhance it with many essential functionalities that were missing in the original design, such as - -- Added support for interactive mode transactions, where transactions are running SQL by SQL from the client side and not as a single step on the server side -- Added optimistic inserts -- Added support for non-unique indexes -- Added support for read-after-write in transactions so that users can see their own changes before they are committed -- Added support for lockless cooperative garbage collection -- Added support for lockless checkpoints -- Added support for fast recovery -- Added support for two-phase commit in a distributed deployment - -Adding these enhancements without breaking the scalable characteristic of the original SILO was very challenging. - +# MOT SILO Enhancements + +SILO\[[Comparison – Disk vs. MOT](comparison-disk-vs-mot.md)\] in its basic algorithm flow outperformed many other ACID-compliant OCCs that we tested in our research experiments. 
However, in order to make it a product-grade mechanism, we had to enhance it with many essential functionalities that were missing in the original design, such as – + +- Added support for interactive mode transactions, where transactions are running SQL by SQL from the client side and not as a single step on the server side +- Added optimistic inserts +- Added support for non-unique indexes +- Added support for read-after-write in transactions so that users can see their own changes before they are committed +- Added support for lockless cooperative garbage collection +- Added support for lockless checkpoints +- Added support for fast recovery +- Added support for two-phase commit in a distributed deployment + +Adding these enhancements without breaking the scalable characteristic of the original SILO was very challenging. + diff --git a/content/en/docs/Developerguide/mot-sql-coverage-and-limitations.md b/content/en/docs/Developerguide/mot-sql-coverage-and-limitations.md new file mode 100644 index 0000000000000000000000000000000000000000..0e37a852b8b8ccf4f937069d8d2c99a079b207be --- /dev/null +++ b/content/en/docs/Developerguide/mot-sql-coverage-and-limitations.md @@ -0,0 +1,195 @@ +# MOT SQL Coverage and Limitations + +MOT design enables almost complete coverage of SQL and future feature sets. For example, standard Postgres SQL is mostly supported, as well common database features, such as stored procedures and user defined functions. + +The following describes the various types of SQL coverages and limitations – + +## Unsupported Features + +The following features are not supported by MOT – + +- **Engine Interop –** No cross-engine \(Disk+MOT\) queries, views or transactions. Planned for 2021. +- **MVCC, Isolation –** No snapshot/serializable isolation. Planned for 2021. +- **Native Compilation **\(JIT\) **–** Limited SQL coverage. Also, JIT compilation of stored procedures is not supported. +- LOCAL memory is limited to 1 GB. A transaction can only change data of less than 1 GB. +- Capacity \(Data+Index\) is limited to available memory. Anti-caching + Data Tiering will be available in the future. +- No full-text search index. + +In addition, the following are detailed lists of various general limitations of MOT tables, MOT indexes, Query and DML syntax and the features and limitations of Query Native Compilation. + +## MOT Table Limitations + +The following lists the functionality limitations of MOT tables – + +- Partition by range +- AES encryption +- Stream operations +- User-defined types +- Sub-transactions +- DML triggers +- DDL triggers + +## Unsupported Table DDLs + +- Alter table +- Create table, like including +- Create table as select +- Partition by range +- Create table with no-logging clause +- DEFERRABLE primary key + +## Unsupported Data Types + +- UUID +- User-Defined Type \(UDF\) +- Array data type +- NVARCHAR2\(n\) +- Clob +- Name +- Blob +- Raw +- Path +- Circle +- Reltime +- Bit varying\(10\) +- Tsvector +- Tsquery +- JSON +- HSTORE +- Box +- Text +- Line +- Point +- LSEG +- POLYGON +- INET +- CIDR +- MACADDR +- Smalldatetime +- BYTEA +- Bit +- Varbit +- OID +- Money +- Any unlimited varchar/char + +## UnsupportedIndex DDLs and Index + +- Create index on decimal/numeric +- Create index, index per table \> 9 +- Create index on key size \> 256 + + The key size includes the column size in bytes + a column additional size, which is an overhead required to maintain the index. The below table lists the column additional size for different column types. 
+ + Additionally, in case of non-unique indexes an extra 8 bytes is required. + + Thus, the following pseudo code calculates the **key size**: + + ``` + keySize =0; + + for each (column in index){ + keySize += (columnSize + columnAddSize); + } + if (index is non_unique) { + keySize += 8; + } + ``` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    | Column Type | Column Size | Column Additional Size |
    | --- | --- | --- |
    | Varchar | N | 4 |
    | Tiny | 1 | 1 |
    | Short | 2 | 1 |
    | Int | 4 | 1 |
    | Long | 8 | 1 |
    | Float | 4 | 2 |
    | Double | 8 | 3 |
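    Applying the pseudo code above to a hypothetical non-unique index over a Varchar\(10\) column and an Int column \(for illustration only\) gives –

    ```
    // varchar(10): 10 + 4, int: 4 + 1, non-unique index: + 8
    keySize = (10 + 4) + (4 + 1) + 8 = 27 bytes   // well below the 256 limit
    ```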
+ + Types that are not specified in above table, the column additional size is zero \(for instance timestamp\). + + +## Unsupported DMLs + +- Merge into +- Delete on conflict +- Insert on conflict +- Select into +- Update on conflict +- Update from + +## Unsupported Queries for Native Compilation and Lite Execution + +- The query refers to more than two tables +- The query has any one of the following attributes – + - Aggregation on non-primitive types + - Window functions + - Sub-query sub-links + - Distinct-ON modifier \(distinct clause is from DISTINCT ON\) + - Recursive \(WITH RECURSIVE was specified\) + - Modifying CTE \(has INSERT/UPDATE/DELETE in WITH\) + + +In addition, the following clauses disqualify a query from lite execution – + +- Returning list +- Group By clause +- Grouping sets +- Having clause +- Windows clause +- Distinct clause +- Sort clause that does not conform to native index order +- Set operations +- Constraint dependencies + diff --git a/content/en/docs/Developerguide/mot-statistics.md b/content/en/docs/Developerguide/mot-statistics.md index 07dafc058daeed7a90d188d9d052b273ace01bf6..ce119c15c9436157274252760802cbee33f6298d 100644 --- a/content/en/docs/Developerguide/mot-statistics.md +++ b/content/en/docs/Developerguide/mot-statistics.md @@ -1,10 +1,10 @@ -# MOT Statistics - -Statistics are intended for performance analysis or debugging. It is uncommon to turn them ON in a production environment \(by default, they are OFF\). Statistics are primarily used by database developers and to a lesser degree by database users. - -Performance Impact – Some impact exists, particularly the impact on the server. Impact on the user is negligible. - -The statistics are saved in database server log. The log is located in the data folder and named **postgresql-DATE-TIME.log**. - -Refer to [Configuration Settings à Statistics](en-us_topic_0257713257.md#_#_statistics) ++ fix reference for detailed configuration options. - +# MOT Statistics + +Statistics are intended for performance analysis or debugging. It is uncommon to turn them ON in a production environment \(by default, they are OFF\). Statistics are primarily used by database developers and to a lesser degree by database users. + +There is some impact on performance, particularly on the server. Impact on the user is negligible. + +The statistics are saved in the database server log. The log is located in the data folder and named **postgresql-DATE-TIME.log**. + +Refer to [STATISTICS \(MOT\)](mot-configuration-settings.md#section659861612477) for detailed configuration options. 
+ diff --git a/content/en/docs/Developerguide/mot-table-limitations.md b/content/en/docs/Developerguide/mot-table-limitations.md deleted file mode 100644 index 0c57b2af6338ada7a0b9b4ad01955791a5a72ff3..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/mot-table-limitations.md +++ /dev/null @@ -1,12 +0,0 @@ -# MOT Table Limitations - -The following lists the functionality limitations of MOT tables: - -- Partition by range -- AES encryption -- Stream operations -- User-defined types -- Sub-transactions -- DML triggers -- DDL triggers - diff --git a/content/en/docs/Developerguide/mot-usage-scenarios.md b/content/en/docs/Developerguide/mot-usage-scenarios.md new file mode 100644 index 0000000000000000000000000000000000000000..a817012aceaf77b769ba154145f2dd463b384258 --- /dev/null +++ b/content/en/docs/Developerguide/mot-usage-scenarios.md @@ -0,0 +1,17 @@ +# MOT Usage Scenarios + +MOT can significantly speed up an application's overall performance, depending on the characteristics of the workload. MOT improves the performance of transaction processing by making data access and transaction execution more efficient and minimizing redirections by removing lock and latch contention between concurrently executing transactions. + +MOT's extreme speed stems from the fact that it is optimized around concurrent in‑memory usage management \(not just because it is in memory\). Data storage, access and processing algorithms were designed from the ground up to take advantage of the latest state of the art enhancements in in-memory and high-concurrency computing. + +openGauss enables an application to use any combination of MOT tables and standard disk-based tables. MOT is especially beneficial for enabling your most active, high-contention and performance-sensitive application tables that have proven to be bottlenecks and for tables that require a predictable low-latency access and high throughput. + +MOT tables can be used for a variety of application use cases, which include: + +- **High-throughput Transactions Processing –** This is the primary scenario for using MOT, because it supports large transaction volume that requires consistently low latency for individual transactions. Examples of such applications are real-time decision systems, payment systems, financial instrument trading, sports betting, mobile gaming, ad delivery and so on. +- **Acceleration of Performance Bottlenecks –** High contention tables can significantly benefit from using MOT, even when other tables are on disk. The conversion of such tables \(in addition to related tables and tables that are referenced together in queries and transactions\) result in a significant performance boost as the result of lower latencies, less contention and locks, and increased server throughput ability. +- **Elimination of Mid-Tier Cache –** Cloud and Mobile applications tend to have periodic or spikes of massive workload. Additionally, many of these applications have 80% or above read-workload, with frequent repetitive queries. To sustain the workload spikes, as well to provide optimal user experience by low-latency response time, applications sometimes deploy a mid-tier caching layer. Such additional layers increase development complexity and time, and also increase operational costs. MOT provides a great alternative, simplifying the application architecture with a consistent and high performance data store, while shortening development cycles and reducing CAPEX and OPEX costs. 
+-   **Large-scale Data Streaming and Data Ingestion –** MOT tables enable large-scale streamlined data processing in the Cloud \(for Mobile, M2M and IoT\), Transactional Processing \(TP\), Analytical Processing \(AP\) and Machine Learning \(ML\). MOT tables are especially good at consistently and quickly ingesting large volumes of data from many different sources at the same time. The data can later be processed, transformed and moved into slower disk-based tables. Alternatively, MOT enables the querying of consistent and up-to-date data that enables real-time conclusions. In IoT and cloud applications with many real-time data streams, it is common to have special data ingestion and processing tiers. For instance, an Apache Kafka cluster can be used to ingest data at a rate of 100,000 events/sec with a 10 msec latency. A periodic batch processing task enriches and converts the collected data into an alternative format to be placed into a relational database for further analysis. MOT can support such scenarios \(while eliminating the separate data ingestion tier\) by ingesting data streams directly into MOT relational tables, ready for analysis and decisions. This enables faster data collection and processing: MOT eliminates costly tiers and slow batch processing, increases consistency and the freshness of the analyzed data, and lowers the Total Cost of Ownership \(TCO\).
+
+-   **Lower TCO –** Higher resource efficiency and mid-tier elimination can save 30% to 90%. Competitors reported similar case studies \([MemSQL](https://www.memsql.com/blog/epigen-powers-facial-recognition-in-the-cloud-with-memsql-case-study/), [Azure](https://azure.microsoft.com/es-es/blog/azure-sql-database-in-memory-performance/)\).
+
diff --git a/content/en/docs/Developerguide/usage.md b/content/en/docs/Developerguide/mot-usage.md
similarity index 42%
rename from content/en/docs/Developerguide/usage.md
rename to content/en/docs/Developerguide/mot-usage.md
index b07689f049875946c5bdc16209e461f706162401..0f00f75fd0a3a03357d80b394c180a719f7770b5 100644
--- a/content/en/docs/Developerguide/usage.md
+++ b/content/en/docs/Developerguide/mot-usage.md
@@ -1,25 +1,36 @@
-# Usage
-
-Using MOT tables is quite simple and is described in the few short sections below.
-
-openGauss enables an application to use of MOT tables and standard disk-based tables. You can use MOT tables for your most active, high-contention and throughput-sensitive application tables or you can use MOT tables for all your application's tables.
-
-The following commands describe how to create MOT tables and how to convert existing disk-based tables into MOT tables in order to accelerate an application's database-related performance. MOT is especially beneficial when applied to tables that have proven to be bottlenecks.
-
--   **[Workflow Overview](workflow-overview.md)**
-
--   **[Granting User Permissions](granting-user-permissions.md)**
-
--   **[Creating/Dropping a MOT Table](creating-dropping-a-mot-table.md)**
-
--   **[Creating an Index for MOT Table](creating-an-index-for-mot-table.md)**
-
--   **[Converting a Disk Table into a MOT Table](converting-a-disk-table-into-a-mot-table.md)**
-
--   **[Retrying an Aborted Transaction](retrying-an-aborted-transaction.md)**
-
--   **[External Support Tools](external-support-tools.md)**
-
--   **[SQL Coverage and Limitations](sql-coverage-and-limitations.md)**
-
-
+# MOT Usage
+
+Using MOT tables is quite simple and is described in the few short sections below.
+
+openGauss enables an application to use MOT tables and standard disk-based tables. You can use MOT tables for your most active, high-contention and throughput-sensitive application tables or you can use MOT tables for all your application's tables.
+
+The following commands describe how to create MOT tables and how to convert existing disk-based tables into MOT tables in order to accelerate an application's database-related performance. MOT is especially beneficial when applied to tables that have proven to be bottlenecks.
+
+**Workflow Overview**
+
+The following is a simple overview of the tasks related to working with MOT tables –
+
+![](figures/en-us_image_0270171684.png)
+
+-   [Granting User Permissions](granting-user-permissions.md)
+-   [Creating/Dropping an MOT Table](creating-dropping-an-mot-table.md)
+-   [Creating an Index for an MOT Table](creating-an-index-for-an-mot-table.md)
+
+This section also describes how to perform various additional MOT-related tasks, as well as [MOT SQL Coverage and Limitations](mot-sql-coverage-and-limitations.md) –
+
+-   **[Granting User Permissions](granting-user-permissions.md)**
+
+-   **[Creating/Dropping an MOT Table](creating-dropping-an-mot-table.md)**
+
+-   **[Creating an Index for an MOT Table](creating-an-index-for-an-mot-table.md)**
+
+-   **[Converting a Disk Table into an MOT Table](converting-a-disk-table-into-an-mot-table.md)**
+
+-   **[Query Native Compilation](query-native-compilation.md)**
+
+-   **[Retrying an Aborted Transaction](retrying-an-aborted-transaction.md)**
+
+-   **[MOT External Support Tools](mot-external-support-tools.md)**
+
+-   **[MOT SQL Coverage and Limitations](mot-sql-coverage-and-limitations.md)**
+
+
diff --git a/content/en/docs/Developerguide/mot-vacuum.md b/content/en/docs/Developerguide/mot-vacuum.md
new file mode 100644
index 0000000000000000000000000000000000000000..e64dfeb0eff5a3b52394c6a03cf9ddedb236c163
--- /dev/null
+++ b/content/en/docs/Developerguide/mot-vacuum.md
@@ -0,0 +1,39 @@
+# MOT Vacuum
+
+Use VACUUM for garbage collection and, optionally, to analyze a database, as follows –
+
+-   \[PG\]
+
+    In PostgreSQL \(PG\), VACUUM reclaims storage occupied by dead tuples. In normal PG operation, tuples that are deleted or that are made obsolete by an update are not physically removed from their table. They remain present until a VACUUM is done. Therefore, it is necessary to perform a VACUUM periodically, especially on frequently updated tables.
+
+-   \[MOT Extension\]
+
+    MOT tables do not need a periodic VACUUM operation, since dead/empty tuples are re‑used by new ones. MOT tables require VACUUM operations only when their size is significantly reduced and they are not expected to grow back to their original size in the near future.
+
+    For example, consider an application that periodically \(for example, once a week\) performs a large deletion of the data of a table or tables, while inserting the new data takes days and does not necessarily require the same quantity of rows. In such cases, it makes sense to activate VACUUM.
+
+    The VACUUM operation on MOT tables is always transformed into a VACUUM FULL with an exclusive table lock.
+
+
+-   Supported Syntax and Limitations
+
+    Activation of the VACUUM operation is performed in a standard manner.
+
+    ```
+    VACUUM [FULL | ANALYZE] [ table ];
+    ```
+
+    Only the FULL and ANALYZE VACUUM options are supported. The VACUUM operation can only be performed on an entire MOT table.
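+
+    For example, a minimal sketch \(the table name orders is used here only for illustration\) –
+
+    ```
+    -- Reclaim the memory of an MOT table after a large deletion;
+    -- on MOT tables this always runs as a VACUUM FULL with an exclusive table lock.
+    VACUUM FULL orders;
+    ```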
+ + The following PG vacuum options are not supported: + + - FREEZE + - VERBOSE + - Column specification + - LAZY mode \(partial table scan\) + + Additionally, the following functionality is not supported – + + - AUTOVACUUM + + diff --git a/content/en/docs/Developerguide/numa-awareness-allocation-and-affinity.md b/content/en/docs/Developerguide/numa-awareness-allocation-and-affinity.md index 638af4679e00c0f366de42ae81b3fc2a17212557..dd456543de0042a4ddd3f66e7d022591386bec1e 100644 --- a/content/en/docs/Developerguide/numa-awareness-allocation-and-affinity.md +++ b/content/en/docs/Developerguide/numa-awareness-allocation-and-affinity.md @@ -1,14 +1,16 @@ -# NUMA Awareness Allocation and Affinity - -Non-uniform memory access \(NUMA\) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can take advantage of NUMA by preferring to access its own local memory \(which is faster\), rather than accessing non-local memory \(meaning that it will prefer not to access the local memory of another processor or memory shared between processors\). MOT memory access has been designed with NUMA awareness. This means that MOT is aware that memory is not uniform and achieves best performance by accessing the quickest and most local memory. - -The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users. - -In-memory database systems running on NUMA platforms face several issues, such as the increased latency and the decreased bandwidth when accessing remote main memory. To cope with these NUMA-related issues, NUMA awareness must be considered as a major design principle for the fundamental architecture of a database system. - -To facilitate fast operation and make efficient use of NUMA nodes, MOT allocates a designated memory pool for rows per table and for nodes per index. Each memory pool is composed from chunks of 2 MB. There is an API to allocate these chunks from a local NUMA node, from pages coming from all nodes or in a round-robin fashion where every chunk is allocated on the next node. By default, pools of shared data are allocated in round robin to balance access while not splitting rows between different NUMA nodes. However, thread private memory is allocated from a local node. It must also be verified that a thread always operates in the same NUMA node. - -**Summary** - -MOT has a smart****memory control module with preallocated memory pools for different memory objects. This improves performance, reduces locks and ensures stability. Allocation of memory objects for a transaction is always NUMA-local, ensuring optimal performance for CPU memory access and resulting in low latency and reduced contention. Deallocated objects go back to the memory pool. Minimized use of OS malloc functions during transactions avoids unnecessary locks. - +# NUMA Awareness Allocation and Affinity + +Non-Uniform Memory Access \(NUMA\) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can take advantage of NUMA by preferring to access its own local memory \(which is faster\), rather than accessing non-local memory \(meaning that it will prefer **not** to access the local memory of another processor or memory shared between processors\). + +MOT memory access has been designed with NUMA awareness. 
This means that MOT is aware that memory is not uniform and achieves best performance by accessing the quickest and most local memory. + +The benefits of NUMA are limited to certain types of workloads, particularly on servers where the data is often strongly associated with certain tasks or users. + +In-memory database systems running on NUMA platforms face several issues, such as the increased latency and the decreased bandwidth when accessing remote main memory. To cope with these NUMA-related issues, NUMA awareness must be considered as a major design principle for the fundamental architecture of a database system. + +To facilitate quick operation and make efficient use of NUMA nodes, MOT allocates a designated memory pool for rows per table and for nodes per index. Each memory pool is composed from 2 MB chunks. A designated API allocates these chunks from a local NUMA node, from pages coming from all nodes or in a round-robin fashion, where each chunk is allocated on the next node. By default, pools of shared data are allocated in a round robin fashion in order to balance access, while not splitting rows between different NUMA nodes. However, thread private memory is allocated from a local node. It must also be verified that a thread always operates in the same NUMA node. + +**Summary** + +MOT has a smart memory control module that has preallocated memory pools intended for various types of memory objects. This smart memory control improves performance, reduces locks and ensures stability. The allocation of the memory objects of a transaction is always NUMA-local, ensuring optimal performance for CPU memory access and resulting in low latency and reduced contention. Deallocated objects go back to the memory pool. Minimized use of OS malloc functions during transactions circumvents unnecessary locks. + diff --git a/content/en/docs/Developerguide/occ-vs-2pl-differences-by-example.md b/content/en/docs/Developerguide/occ-vs-2pl-differences-by-example.md deleted file mode 100644 index e9c6285b06e1c33770e5e4b1794c169ec0d2e814..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/occ-vs-2pl-differences-by-example.md +++ /dev/null @@ -1,17 +0,0 @@ -# OCC vs 2PL Differences by Example - -The following shows the differences between two different user experiences – Pessimistic \(such as disk-based tables\) and Optimistic \(MOT tables\) when sessions update the same table simultaneously. - -In this example, the following table test command is run - -``` -table “TEST” – create table test (x int, y int, z int, primary key(x)); -``` - -This example describes two aspects of the test, user experience \(operations in the example\) and retry requirement. - -- **[Pessimistic Approach – Used in Disk-based Tables](pessimistic-approach-used-in-disk-based-tables.md)** - -- **[Optimistic Approach – Used in MOT](optimistic-approach-used-in-mot.md)** - - diff --git a/content/en/docs/Developerguide/optimistic-approach-used-in-mot.md b/content/en/docs/Developerguide/optimistic-approach-used-in-mot.md deleted file mode 100644 index 124ab89b13b3a41bf8853a17e551a202abc26fb6..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/optimistic-approach-used-in-mot.md +++ /dev/null @@ -1,70 +0,0 @@ -# Optimistic Approach – Used in MOT - -The following is an example of the Optimistic approach. 
-
-It describes the situation of creating a MOT table and then having two concurrent sessions updating that same MOT table simultaneously –
-
-```
-create foreign table test (x int, y int, z int, primary key(x));
-```
-
--   The advantage and user experience of OCC is that there are no locks until COMMIT.
--   The disadvantage of using OCC is that the update may fail if another session updates the same record. If the update fails \(in all supported isolation levels\), the entire SESSION \#2 transaction must be retried.
--   Update conflicts are detected by the kernel at commit time by using a version checking mechanism.
--   SESSION \#2 will not wait in its update operation and will be aborted because of conflict detection at the commit phase.
-
-**Table 1** Optimistic Approach Code Example – Used in MOT
  

|      | Session 1                        | Session 2                        |
|------|----------------------------------|----------------------------------|
| t0   | Begin                            | Begin                            |
| t1   | update test set y=200 where x=1; |                                  |
| t2   | y=200                            | Update test set y=300 where x=1; |
| t4   | Commit                           | y = 300                          |
|      |                                  | Commit                           |
|      |                                  | ABORT                            |
|      |                                  | y = 200                          |
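
The same scenario can be sketched directly in SQL. This is only an illustrative sketch of Session \#2's experience in the table above; the error text shown is the one handled by the retry example later in this guide –

```
-- Session #2 (sketch) – the conflicting update does not wait,
-- but its COMMIT is aborted by the version check at commit time:
BEGIN;
UPDATE test SET y=300 WHERE x=1;
COMMIT;
-- ERROR: could not serialize access due to concurrent update
-- The entire Session #2 transaction must then be retried.
```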
-
diff --git a/content/en/docs/Developerguide/optimistic-concurrency-control.md b/content/en/docs/Developerguide/optimistic-concurrency-control.md
deleted file mode 100644
index c40740398dc640d22431c1a8bd19c12fcbc39cf9..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/optimistic-concurrency-control.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Optimistic Concurrency Control
-
-The Concurrency Control Module \(CC Module for short\) provides all the transactional requirements for the Main Memory Engine. The primary objective of the CC Module is to provide the Main Memory Engine with support for various isolation levels.
-
--   **[Optimistic OCC vs. Pessimistic 2PL](optimistic-occ-vs-pessimistic-2pl.md)**
-
-
diff --git a/content/en/docs/Developerguide/optimistic-occ-vs-pessimistic-2pl.md b/content/en/docs/Developerguide/optimistic-occ-vs-pessimistic-2pl.md
deleted file mode 100644
index 960041fac62ed18b8f400bec543b4720737cdc0d..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/optimistic-occ-vs-pessimistic-2pl.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Optimistic OCC vs. Pessimistic 2PL
-
-The functional differences of Pessimistic 2PL \(2-Phase Locking\) vs. Optimistic Concurrency Control \(OCC\) involve pessimistic versus optimistic approaches to transaction integrity.
-
-Disk-based tables use a pessimistic approach, which is the most commonly used database method. The MOT Engine uses the optimistic approach.
-
-The primary functional difference between the pessimistic approach and the optimistic approach is that if a conflict occurs –
-
--   The pessimistic approach causes the client to wait.
--   The optimistic approach causes one of the transactions to fail, so that the transaction must be retried by the client.
-
-## Optimistic Concurrency Control Approach
-
-The **Optimistic Concurrency Control \(OCC\)** approach detects conflicts as they occur, and performs validation checks at commit time.
-
-The optimistic approach has less overhead and is usually more efficient, partly because transaction conflicts are uncommon in most applications.
-
-The functional difference between these two approaches is larger when the REPEATABLE READ isolation level is enforced and is largest for the SERIALIZABLE isolation level.
-
-## Pessimistic Approaches \(Not used by MOT\)
-
-The **Pessimistic Concurrency Control** \(2PL or 2-Phase Locking\) approach uses locks to block potential conflicts before they occur. A lock is applied when a statement is executed and released when the transaction is committed. Disk-based row‑store uses this approach \(with the addition of Multi-version Concurrency Control \(MVCC\)\).
-
-In 2PL algorithms, if one transaction is writing a row, no other transaction can access it, and if a row is being read, no transaction is allowed to write it. A row is locked at access time for read and for write, and the lock is released at commit time. These algorithms require a scheme for handling and avoiding deadlock. Deadlock can be detected by calculating cycles in a wait-for graph. Deadlock can be avoided by keeping time ordering in TSO[\[7\]](#_ftn7) or by some kind of back-off scheme.
-
-Another approach is **Encounter Time Locking \(ETL\)**. In ETL, reads are handled in an optimistic manner, but writers lock the data that they access. As a result, writers from different ETL transactions see each other and can decide to abort. It has been empirically verified[\[8\]](#_ftn8) that ETL improves the performance of OCC in two ways.
-
--   First, conflicts are detected early, which often increases transaction throughput, because transactions do not perform useless operations; conflicts discovered at commit time \(in general\) cannot be resolved without aborting at least one transaction.
--   Second, encounter-time locking enables Reads-After-Writes \(RAW\) to be handled efficiently without requiring expensive or complex mechanisms.
-
-OCC is the fastest option for most workloads[\[9\]](#_ftn9)[\[10\]](#_ftn10). This finding has also been observed in our preliminary research phase. One reason is that when every core executes multiple threads, a lock is likely to be held by a swapped thread, especially in interactive mode. In addition, pessimistic algorithms involve deadlock detection \(which introduces overhead\) and usually use read-write locks \(which are less efficient than standard spin-locks\).
-
-We have chosen Silo[\[11\]](#_ftn11) because it was simpler than other existing options, such as TicToc[\[12\]](#_ftn12), while maintaining the same performance for most workloads. ETL is sometimes faster than OCC, but it introduces spurious aborts which may confuse a user, in contrast to OCC, which aborts only at commit.
-
diff --git a/content/en/docs/Developerguide/overall-architecture.md b/content/en/docs/Developerguide/overall-architecture.md
deleted file mode 100644
index 89d40e014fb1763bb6a5b873114df8190e03d7f2..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/overall-architecture.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# Overall Architecture
-
-![](figures/en-us_image_0260488320.png)
-
-As this diagram depicts, the RedoLog component is used both by backend threads that use the In-Memory Engine and by the WAL writer to persist their data.
-
-Checkpoints are performed using the Checkpoint Manager \(triggered by the postgres checkpointer\).
-
-## Logging overview
-
-Write-Ahead Logging \(WAL\) is a standard method for ensuring data durability. WAL's central concept is that changes to data files \(where tables and indexes reside\) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage.
-
-In the In-Memory Engine, we decided to use the existing openGauss logging facilities, rather than developing a low-level logging API from scratch, in order to reduce development time and to be able to use them for replication purposes as well.
-
-## Per-transaction logging
-
-In the In-Memory Engine, the transaction log records are stored in a transaction buffer which is part of the whole transaction object \(TXN\). The transaction buffer is logged during the calls to addToLog\(\) – if the buffer exceeds a threshold, it is flushed and reused. When a transaction commits and passes the validation phase \(OCC SILO validation\) or aborts for some reason, the appropriate message is saved in the log as well, in order to make it possible to determine the transaction's state during recovery.
-
-![](figures/en-us_image_0260574156.png)
-
-Parallel logging is done both for MOT and the disk engine.
-
-However, MOT enhances the design with a log buffer per transaction, lockless preparation and a single log record.
-
-## Synchronous logging
-
-Synchronous logging is the simplest redo logger. While a transaction is ongoing, or when it ends, the SynchronousRedoLogHandler serializes its transaction buffer and writes it to the XLOG iLogger implementation.
-
-During the write to the log, the thread is blocked and it completes as soon as its buffer is written to the log.
-
-![](figures/en-us_image_0260574154.png)
-
-## Asynchronous Logging
-
-Upon transaction commit, the transaction buffer is moved \(pointer assignment and not data copy\) to a centralized buffer and a new transaction buffer is allocated for the transaction. The transaction is released as soon as its buffer has been moved to the centralized buffer. The transaction thread is not blocked. The actual write to the log uses the Postgres walwriter thread. When the walwriter timer elapses, it first calls the AsynchronousRedoLogHandler \(via a registered callback\) to write its buffers and then continues with its own logic and flushes the data to the XLOG.
-
-![](figures/en-us_image_0260574155.png)
-
-## Group Commit – with NUMA-awareness
-
-In the figure below, the four colors represent four NUMA nodes. Each NUMA node has its own memory log, enabling a group commit of multiple connections.
-
-![](figures/en-us_image_0260574153.png)
-
diff --git a/content/en/docs/Developerguide/overview.md b/content/en/docs/Developerguide/overview.md
deleted file mode 100644
index 4b4b5b1ef05eec9e13cb58148b7ccda61dd8cf32..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/overview.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# Overview
-
-MOT is automatically deployed as part of openGauss. You may refer to the [Preparation](preparation.md) section for a description of how to estimate and plan required memory and storage resources in order to sustain your workload. The [Deployment](deployment.md) section describes all the configuration settings in MOT, as well as non-mandatory options for server optimization.
-
-Using MOT tables is quite simple. The syntax of all MOT commands is the same as for disk-based tables and includes support for most standard PostgreSQL SQL, DDL and DML commands and features, such as Stored Procedures. Only the create and drop table statements in MOT differ from the statements for disk-based tables in openGauss. You may refer to the [Usage](usage.md) section for a description of these two simple commands, to learn how to convert a disk-based table into a MOT table, to get higher performance using Query Native Compilation and PREPARE statements and for a description of external tool support and the limitations of the MOT engine.
-
-The [Administration](administration.md) section describes how to perform database maintenance, monitoring and analysis of logs and reported errors. Lastly, the [Sample Workloads](sample-workloads.md) section describes how to perform a standard TPC-C benchmark.
-
-Read the following topics to learn how to use MOT –

-   [Preparation](preparation.md)
-   [Deployment](deployment.md)
-   [Administration](administration.md)
-
diff --git a/content/en/docs/Developerguide/pessimistic-approach-used-in-disk-based-tables.md b/content/en/docs/Developerguide/pessimistic-approach-used-in-disk-based-tables.md
deleted file mode 100644
index 7170d56f4edb5e8ae72553e9ce8fd1fd8063bcfd..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/pessimistic-approach-used-in-disk-based-tables.md
+++ /dev/null
@@ -1,65 +0,0 @@
-# Pessimistic Approach – Used in Disk-based Tables
-
-The following is an example of the Pessimistic approach \(which is not MOT\). Any isolation level may apply.
-
-The following two sessions perform a transaction with an attempt to update a single table.
-
-A WAIT LOCK action occurs and the client experience is that session \#2 is _stuck_ until Session \#1 finishes a COMMIT. Only afterwards can Session \#2 progress.
-
-However, in this approach both sessions succeed and there is no abort \(unless the SERIALIZABLE or REPEATABLE-READ isolation level is applied, in which case the entire transaction must be retried\).
-
-**Table 1** Pessimistic Approach Code Example
  

|      | Session 1                        | Session 2                                                                    |
|------|----------------------------------|------------------------------------------------------------------------------|
| t0   | Begin                            | Begin                                                                        |
| t1   | update test set y=200 where x=1; |                                                                              |
| t2   | y=200                            | Update test set y=300 where x=1; -- Wait on lock                             |
| t4   | Commit                           |                                                                              |
|      |                                  | Unlock                                                                       |
|      |                                  | Commit (in READ-COMMITTED this will succeed, in SERIALIZABLE it will fail)   |
|      |                                  | y = 300                                                                      |
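
For comparison, a minimal SQL sketch of Session \#2's experience in the pessimistic case \(illustrative only, matching the table above\) –

```
-- Session #2 (sketch) – the update waits on the row lock instead of aborting:
BEGIN;
UPDATE test SET y=300 WHERE x=1;   -- blocks until Session #1 commits and releases the lock
COMMIT;                            -- succeeds under READ COMMITTED (fails under SERIALIZABLE)
SELECT y FROM test WHERE x=1;      -- returns 300
```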
- diff --git a/content/en/docs/Developerguide/preparation.md b/content/en/docs/Developerguide/preparation.md deleted file mode 100644 index 856cde4561f65bddefd326ce80857cb2684bbe3b..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/preparation.md +++ /dev/null @@ -1,7 +0,0 @@ -# Preparation - -- **[Prerequisites](prerequisites.md)** - -- **[Memory and Storage Planning](memory-and-storage-planning.md)** - - diff --git a/content/en/docs/Developerguide/prerequisite-check.md b/content/en/docs/Developerguide/prerequisite-check.md deleted file mode 100644 index aa7e6f6bd64c6c70cb68f7089b3489d02644f8fd..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/prerequisite-check.md +++ /dev/null @@ -1,10 +0,0 @@ -# Prerequisite Check - -Check that the schema of the disk table to be converted into a MOT table contains all required columns. - -Check whether the schema contains any unsupported column data types, as described in the [Unsupported Data Types](unsupported-data-types.md)__section. - -If a specific column is not supported, then it is recommended to first create a secondary disk table with an updated schema. This schema is the same as the original table, except that all the unsupported types have been converted into supported types. - -Afterwards, use the following script to export this secondary disk table and then import it into a MOT table. - diff --git a/content/en/docs/Developerguide/query-native-compilation.md b/content/en/docs/Developerguide/query-native-compilation.md index 9910dc19f4922c0fa4d6c808ef2669ebde936ef0..34e243414bccdc7174517d2d159dfabedfbb870e 100644 --- a/content/en/docs/Developerguide/query-native-compilation.md +++ b/content/en/docs/Developerguide/query-native-compilation.md @@ -1,63 +1,71 @@ -# Query Native Compilation -An additional feature of MOT is the ability to **compile full queries** into a more native format, which then bypass multiple database processing layers and perform significantly better. This feature is sometimes called Just-In-Time \(JIT\) query compilation. - -Users can benefit from MOT query compilation by calling the **PREPARE **statement before the query is executed. In this way, queries and transaction statements are executed in an interactive manner. This is accomplished by first by using the PREPARE client command \(which instructs MOT to compile the query or to load already pre-compiled code from a cache\) and then by executing the statement. - -The following is an example of **PREPARE** syntax in SQL - -``` -PREPARE name [ ( data_type [, ...] ) ] AS statement -``` - -**PREPARE** creates a prepared statement in the database server, which is a server-side object that can be used to optimize performance. Upon its execution, the specified query statement is parsed, analyzed and rewritten. When an **EXECUTE** command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific setting values supplied. - -If the tables mentioned in the query statement are MOT tables, the MOT compilation takes charge of the object preparation and performs a special optimization by compiling the query into LLVM IR byte code. To improve performance further, MOT JIT applies a caching policy for its LLVM code results, enabling them to be reused for the same queries across different sessions. 
- -When the resulting execute query command reaches the database, it uses the corresponding IR byte code which is executed directly and more efficiently within the MOT engine. This is referred to as _Lite Execution_. - -The following is an example of how to invoke a **PREPARE** statement in a Java application - -``` -conn = DriverManager.getConnection(connectionUrl, connectionUser, connectionPassword); - -// Example 1: PREPARE without bind settings -String query = "SELECT * FROM getusers"; -PreparedStatement prepStmt1 = conn.prepareStatement(query); -ResultSet rs1 = pstatement.executeQuery()) -while (rs1.next()) {…} - -// Example 2: PREPARE with bind settings -String sqlStmt = "SELECT * FROM employees where first_name=? and last_name like ?"; -PreparedStatement prepStmt2 = conn.prepareStatement(sqlStmt); -prepStmt2.setString(1, "Mark"); // first name “Mark” -prepStmt2.setString(2, "%n%"); // last name contains a letter “n” -ResultSet rs2 = prepStmt2.executeQuery()) -while (rs2.next()) {…} -``` - -The following describes the supported and unsupported features of MOT compilation. - -## Supported Queries for Lite Execution - -The following query types are suitable for lite execution – - -- Simple point queries – -- SELECT \(including SELECT for UPDATE\) -- UPDATE -- DELETE -- INSERT query -- Range UPDATE queries that refer to a full prefix of the primary key -- Range SELECT queries that refer to a full prefix of the primary key -- JOIN queries where one or both parts collapse to a point query -- JOIN queries that refer to a full prefix of the primary key in each joined table - -## Unsupported Queries for Lite Execution - -Any special query attribute disqualifies it for Lite Execution. In particular, if any of the following conditions apply, then the query is declared as unsuitable for Lite Execution. You may refer to the [Unsupported Queries for Native Compilation and Lite Execution](unsupported-queries-for-native-compilation-and-lite-execution.md) section for more information. - -It is important to emphasize that in case a query statement does not fit - -native compilation and lite execution, no error is reported to the client and the query will still be executed in a normal and standard manner. - -For more information about MOT native compilation capabilities, see either the section about **Using Query Native Compilation** or a more detailed information in the **Query Native Compilation \(JIT\)** section. - +# Query Native Compilation + +An additional feature of MOT is the ability to prepare and parse _pre-compiled full queries_ in a native format \(using a PREPARE statement\) before they are needed for execution. + +This native format can later be executed \(using an EXECUTE command\) more efficiently. This type of execution is much quicker because the native format bypasses multiple database processing layers during execution and thus enables better performance. + +This division of labor avoids repetitive parse analysis operations. In this way, queries and transaction statements are executed in an interactive manner. This feature is sometimes called _Just-In-Time \(JIT\)_ query compilation. + +## Query Compilation – PREPARE Statement + +To use MOT’s native query compilation, call the PREPARE client statement before the query is executed. This instructs MOT to pre-compile the query and/or to pre-load previously pre-compiled code from a cache. + +The following is an example of PREPARE syntax in SQL – + +``` +PREPARE name [ ( data_type [, ...] 
) ] AS statement
+```
+
+PREPARE creates a prepared statement in the database server, which is a server-side object that can be used to optimize performance.
+
+## Execute Command
+
+When an EXECUTE command is subsequently issued, the prepared statement is parsed, analyzed, rewritten and executed. This division of labor avoids repetitive parse analysis operations, while enabling the execution plan to depend on specific provided setting values.
+
+The following is an example of how to invoke a PREPARE and then an EXECUTE statement in a Java application.
+
+```
+conn = DriverManager.getConnection(connectionUrl, connectionUser, connectionPassword);
+
+// Example 1: PREPARE without bind settings
+String query = "SELECT * FROM getusers";
+PreparedStatement prepStmt1 = conn.prepareStatement(query);
+ResultSet rs1 = prepStmt1.executeQuery();
+while (rs1.next()) {…}
+
+// Example 2: PREPARE with bind settings
+String sqlStmt = "SELECT * FROM employees where first_name=? and last_name like ?";
+PreparedStatement prepStmt2 = conn.prepareStatement(sqlStmt);
+prepStmt2.setString(1, "Mark"); // first name “Mark”
+prepStmt2.setString(2, "%n%"); // last name contains a letter “n”
+ResultSet rs2 = prepStmt2.executeQuery();
+while (rs2.next()) {…}
+```
+
+The following describes the supported and unsupported features of MOT compilation.
+
+## Supported Queries for Lite Execution
+
+The following query types are suitable for lite execution –
+
+-   Simple point queries –
+    -   SELECT \(including SELECT for UPDATE\)
+    -   UPDATE
+    -   DELETE
+
+-   INSERT query
+-   Range UPDATE queries that refer to a full prefix of the primary key
+-   Range SELECT queries that refer to a full prefix of the primary key
+-   JOIN queries where one or both parts collapse to a point query
+-   JOIN queries that refer to a full prefix of the primary key in each joined table
+
+## Unsupported Queries for Lite Execution
+
+Any special query attribute disqualifies it from Lite Execution. In particular, if any of the following conditions apply, then the query is declared as unsuitable for Lite Execution. You may refer to the Unsupported Queries for Native Compilation and Lite Execution section for more information.
+
+It is important to emphasize that if a query statement does not fit native compilation and lite execution, no error is reported to the client and the query is still executed in a normal and standard manner.
+
+For more information about MOT native compilation capabilities, see either the section about Query Native Compilation or the more detailed information in the Query Native Compilation \(JIT\) section.
+
diff --git a/content/en/docs/Developerguide/query-native-compilation_jit.md b/content/en/docs/Developerguide/query-native-compilation_jit.md
deleted file mode 100644
index 2869f377aa8b96673761bb7d1bc2209fe03b4eec..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/query-native-compilation_jit.md
+++ /dev/null
@@ -1,48 +0,0 @@
-# Query Native Compilation \(JIT\)
-
-The Lite Executor module provides the ability to execute simple **prepared** queries in a much faster execution path than the regular generic plan made by the envelope. This is achieved using Just-In-Time \(JIT\) compilation via LLVM. In addition, a similar solution that has potentially similar performance is provided in the form of pseudo-LLVM.
-
-To benefit from the JIT compilation in MOT, you should invoke a PREPARE statement and only then execute the query.
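
For instance \(before the formal syntax shown next\), a minimal SQL-level sketch of this PREPARE/EXECUTE pattern – the statement name get_test and the test table from the OCC example earlier in this guide are used only for illustration –

```
-- Prepare once; MOT compiles the query to native code during the PREPARE phase
PREPARE get_test (int) AS SELECT * FROM test WHERE x = $1;

-- Subsequent executions reuse the cached, compiled code
EXECUTE get_test(1);
```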
-
-The following is an example of PREPARE syntax in SQL -
-
-```
-PREPARE name [ ( data_type [, ...] ) ] AS statement
-```
-
-The following is an example of PREPARE syntax in Java -
-
-```
-conn = DriverManager.getConnection(connectionUrl, connectionUser, connectionPassword);
-
-// Example 1: PREPARE without bind settings
-String query = "SELECT * FROM getusers";
-PreparedStatement prepStmt1 = conn.prepareStatement(query);
-ResultSet rs1 = prepStmt1.executeQuery();
-while (rs1.next()) {…}
-
-// Example 2: PREPARE with bind settings
-String sqlStmt = "SELECT * FROM employees where first_name=? and last_name like ?";
-PreparedStatement prepStmt2 = conn.prepareStatement(sqlStmt);
-prepStmt2.setString(1, "Mark"); // first name “Mark”
-prepStmt2.setString(2, "%n%"); // last name contains a letter “n”
-ResultSet rs2 = prepStmt2.executeQuery();
-while (rs2.next()) {…}
-```
-
-**PREPARE** creates a prepared statement. A prepared statement is a server-side object that can be used to optimize performance. When the **PREPARE** statement is executed, the specified statement is parsed, analyzed and rewritten. When an **EXECUTE** command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific setting values supplied.
-
-MOT performs a special optimization by compiling the query into IR byte code based on LLVM. Whenever a new query compilation is required, the query is analyzed and properly tailored IR byte code is generated for the query using the utility GsCodeGen object and the standard LLVM JIT API \(IRBuilder\). After byte-code generation is completed, the code is JIT‑compiled into a separate LLVM module. The compiled code results in a C function pointer that can later be invoked for direct execution. Note that this C function can be invoked concurrently by many threads, as long as each thread provides a distinct execution context \(see the details below\). Each such execution context is referred to as a JIT Context.
-
-In addition, for availability, the Lite Executor maintains a preallocated pool of JIT sources. Each session preallocates its own session-local pool of JIT context objects \(used for repeated execution of precompiled queries\).
-
-See more details in the [Supported Queries for Lite Execution](query-native-compilation.md#section23679317151) and [Unsupported Queries for Lite Execution](query-native-compilation.md#section1477313021514) sections.
-
-## JIT Compilation Comparison: openGauss Disk-based vs. MOT tables
-
-Current openGauss contains two main forms of JIT / CodeGen query optimizations for its disk-based tables: \(1\) accelerating expression evaluation \(such as in WHERE clauses, target lists, aggregates and projections\), and \(2\) inlining small function invocations. These optimizations are partial \(in the sense that they do not optimize the entire interpreted operator tree or replace it altogether\), and are targeted mostly at CPU-bound complex queries, typically seen in OLAP use cases. The execution of queries is performed in a pull model \(Volcano-style processing\) using an interpreted operator tree. When activated, the compilation is performed at each query execution. At the moment, caching of the generated LLVM code and its reuse across sessions and queries is not present.
-
-In contrast, MOT JIT optimization provides LLVM code for entire queries that qualify for JIT optimization by MOT.
-
-The resulting code is used for direct execution over MOT tables, while the interpreted operator model is abandoned completely. The result is practically hand-written LLVM code generated for a specific, entire query execution. Another significant conceptual difference is that MOT LLVM code is generated only for prepared queries, during the PREPARE phase of the query, rather than at query execution. This is especially important for OLTP scenarios due to the rather short runtime of OLTP queries, which cannot accommodate code generation and a relatively long query compilation time during each query execution. Finally, in PostgreSQL the activation of PREPARE implies reusing the resulting plan across executions with different parameters in the same session. Similarly, the MOT JIT applies a caching policy for its LLVM code results, and extends it for reuse across different sessions. Thus a single query may be compiled just once and its LLVM code reused across many sessions, which is again favorable for OLTP scenarios.
-
diff --git a/content/en/docs/Developerguide/recovery_mot.md b/content/en/docs/Developerguide/recovery_mot.md
deleted file mode 100644
index 033fc47b6813060516059a7c143a7a3697f78a00..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/recovery_mot.md
+++ /dev/null
@@ -1,6 +0,0 @@
-# RECOVERY \(MOT\)
-
-checkpoint\_recovery\_workers = 3
-
-Specifies the number of workers \(threads\) to use during checkpoint data recovery. Each MOT engine worker runs on its own core and can process a different table in parallel by reading it into memory. For example, while the default is three, you might prefer to set this parameter to the number of cores that are available for processing. After recovery, these threads are stopped and killed. See Administrator Recovery for more details.
-
diff --git a/content/en/docs/Developerguide/redo-log_mot.md b/content/en/docs/Developerguide/redo-log_mot.md
deleted file mode 100644
index 347ea796ef7a9fbc8a7d8eea122f76f464eab2c2..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/redo-log_mot.md
+++ /dev/null
@@ -1,30 +0,0 @@
-# REDO LOG \(MOT\)
-
--   enable\_redo\_log = true
-
-    Specifies whether to use the Redo Log for durability. See the ++ section for more information about redo logs.
-
--   enable\_group\_commit = false
-
-    Specifies whether to use group commit.
-
-    This option is only relevant when openGauss is configured to use synchronous commit, meaning only when the synchronous\_commit setting in postgresql.conf is configured to any value other than off.
-
-    You may refer to ++ for more information.
-
--   group\_commit\_size = 16
--   group\_commit\_timeout = 10 ms
-
-    This option is only relevant when the MOT engine has been configured to use **Synchronous Group Commit** logging. This means that the synchronous\_commit setting in postgresql.conf is configured to True and the enable\_group\_commit parameter in the mot.conf configuration file is configured to True.
-
-    Defines which of the following determines when a group of transactions is recorded in the WAL Redo Log –
-
-    -   group\_commit\_size – The quantity of committed transactions in a group. For example, **16** means that when 16 transactions in the same group have been committed by their client application, then an entry is written to disk in the WAL Redo Log for each of the 16 transactions.
-    -   group\_commit\_timeout – A timeout period in ms.
For example, **10** means that after 10 ms, an entry is written to disk in the WAL Redo Log for each of the transactions in the same group that have been committed by their client application in the last 10 ms.
-
-    A commit group is closed after either the configured number of transactions has arrived or after the configured timeout period since the group was opened. After the group is closed, all the transactions in the group wait for a group flush to complete execution and then notify the client that each transaction has ended.
-
-    You may refer to ++ for more information about synchronous group commit logging.
-
-
diff --git a/content/en/docs/Developerguide/result-linear-scale-up.md b/content/en/docs/Developerguide/result-linear-scale-up.md
deleted file mode 100644
index 8af644c876da07cbc38295ec6421b1ffaca268d0..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/result-linear-scale-up.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Result – Linear Scale-up
-
-The following shows the results achieved by the MOT design principles and implementation described above.
-
-To the best of our knowledge, MOT outperforms all existing industry-grade OLTP databases in transactional throughput of ACID-compliant workloads.
-
-openGauss and MOT have been tested on the following many-core systems with excellent performance scalability results. The tests were performed both on x86 Intel-based and ARM/Kunpeng-based many-core servers. You may refer to the [Performance Benchmarks](performance-benchmarks.md) section for a more detailed performance review.
-
-Our TPC-C benchmark dated June 2020 tested an openGauss MOT database on a Taishan 2480 server, a 4-socket ARM/Kunpeng server, and achieved a throughput of 4.8 M tpmC. The following graph shows the near-linear nature of the results, meaning that it shows a significant increase in performance correlating with the increase in the quantity of cores –
-
-**Figure 1** TPC-C on ARM \(256 Cores\)
-![](figures/tpc-c-on-arm-(256-cores)the-following-is-an-additional-example-that-shows-a-test-on-an-x86-based-ser.png)
-
-The following is an additional example that shows a test on an x86-based server, also showing CPU utilization –
-
-**Figure 2** tpmC vs CPU Usage
-![](figures/resource-utilization-performance-benchmarks.png "resource-utilization-performance-benchmarks")
-
-The chart shows that MOT demonstrates a significant performance increase that correlates with the increase in the quantity of cores. MOT consumes more and more of the CPU as the quantity of cores increases. Other industry solutions do not improve and sometimes show slightly degraded performance, which is a well-known problem in the database industry that affects customers’ CAPEX and OPEX expenses and operational efficiency.
-
diff --git a/content/en/docs/Developerguide/retrying-an-aborted-transaction.md b/content/en/docs/Developerguide/retrying-an-aborted-transaction.md
index 4ea393e62bdc2a547f87d088f0b0b6cc9f3fb173..fda413e30ce345f7e0388439c05979e048d4fc99 100644
--- a/content/en/docs/Developerguide/retrying-an-aborted-transaction.md
+++ b/content/en/docs/Developerguide/retrying-an-aborted-transaction.md
@@ -1,41 +1,41 @@
-# Retrying an Aborted Transaction
-
-In Optimistic Concurrency Control \(OCC\) \(such as the one used by MOT\) during a transaction \(using any isolation level\) no locks are placed on a record until the COMMIT phase. This is a powerful advantage that significantly increases performance.
Its drawback is that an update may fail if another session attempts to update the same record. This results in an entire transaction that must be aborted. These so called _Update Conflicts_ are detected by MOT at the commit time by a version checking mechanism. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->A similar abort happens on engines using pessimistic concurrency control, such as standard PG and the openGauss disk-based tables, when SERIALIZABLE or REPEATABLE-READ isolation level are used. - -Such update conflicts are quite rare in common OLTP scenarios and are especially rare in our experience with MOT. However, because there is still a chance that they may happen, developers should consider resolving this issue using transaction retry code. - -The following describes how to retry a table command after multiple sessions attempt to update the same table simultaneously. You may refer to the Retrying T-SQL Code section for more detailed information. The following example is taken from TPC-C payment transaction. - -``` -int commitAborts = 0; - -while (commitAborts < RETRY_LIMIT) { - - try { - stmt = db.stmtPaymentUpdateDistrict; - stmt.setDouble(1, 100); - stmt.setInt(2, 1); - stmt.setInt(3, 1); - stmt.executeUpdate(); - - db.commit(); - - break; - } - catch (SQLException se) { - if (se != null && se.getMessage().contains("could not serialize access due to concurrent update")) { - log.error("commmit abort = " + se.getMessage()); - commitAborts++; - continue; - } else { - db.rollback(); - } - - break; - } -} -``` - +# Retrying an Aborted Transaction + +In Optimistic Concurrency Control \(OCC\) \(such as the one used by MOT\) during a transaction \(using any isolation level\) no locks are placed on a record until the COMMIT phase. This is a powerful advantage that significantly increases performance. Its drawback is that an update may fail if another session attempts to update the same record. This results in an entire transaction that must be aborted. These so called _Update Conflicts_ are detected by MOT at the commit time by a version checking mechanism. + +>![](public_sys-resources/icon-note.gif) **NOTE:** +>A similar abort happens on engines using pessimistic concurrency control, such as standard PG and the openGauss disk-based tables, when SERIALIZABLE or REPEATABLE-READ isolation level are used. + +Such update conflicts are quite rare in common OLTP scenarios and are especially rare in our experience with MOT. However, because there is still a chance that they may happen, developers should consider resolving this issue using transaction retry code. + +The following describes how to retry a table command after multiple sessions attempt to update the same table simultaneously. You may refer to the OCC vs 2PL Differences by Example section for more detailed information. The following example is taken from TPC-C payment transaction. 
+
+```
+int commitAborts = 0;
+
+while (commitAborts < RETRY_LIMIT) {
+
+    try {
+        stmt = db.stmtPaymentUpdateDistrict;
+        stmt.setDouble(1, 100);
+        stmt.setInt(2, 1);
+        stmt.setInt(3, 1);
+        stmt.executeUpdate();
+
+        db.commit();
+
+        break;
+    }
+    catch (SQLException se) {
+        if (se != null && se.getMessage().contains("could not serialize access due to concurrent update")) {
+            log.error("commit abort = " + se.getMessage());
+            commitAborts++;
+            continue;
+        } else {
+            db.rollback();
+        }
+
+        break;
+    }
+}
+```
+
diff --git a/content/en/docs/Developerguide/sample-workloads.md b/content/en/docs/Developerguide/sample-workloads.md
deleted file mode 100644
index f659830c1001c0aa21e39d88e7ee373f9ea47698..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/sample-workloads.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Sample Workloads
-
--   **[TPC-C Benchmark](tpc-c-benchmark.md)**
-
-
diff --git a/content/en/docs/Developerguide/scale-up-architecture.md b/content/en/docs/Developerguide/scale-up-architecture.md
deleted file mode 100644
index d1f7ffb1940bb660da37927dafe409e3d77b5d82..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/scale-up-architecture.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# Scale-up Architecture
-
-To **scale up** means to add additional cores to the _same machine_ in order to add computing power. To scale up refers to the most common traditional form of adding computing power in a machine that has a single pair of controllers and multiple cores. Scale-up architecture is limited by the scalability limits of a machine's controller.
-
->![](public_sys-resources/icon-note.gif) **NOTE:**
->A **scale-out** solution combines _many machines_ \(each with multiple cores\) into a single combined computing entity. To scale out means to add additional machines/cores to this single computing pool, which is used as if it were a single combined computing entity. Object software is managed on each core and all the cores participate in this solution so that the cores of multiple machines become a single computing cluster.
-
--   **[Technical Requirements](technical-requirements.md)**
-
--   **[Design Principles](design-principles.md)**
-
--   **[Integration using Foreign Data Wrappers \(FDW\)](integration-using-foreign-data-wrappers-(fdw).md)**
-
--   **[Result – Linear Scale-up](result-linear-scale-up.md)**
-
-
diff --git a/content/en/docs/Developerguide/server-optimization-arm-huawei-taishan-2p-4p.md b/content/en/docs/Developerguide/server-optimization-arm-huawei-taishan-2p-4p.md
deleted file mode 100644
index 9c490a7def481cb8190bc2b7351783c8142c8d5f..0000000000000000000000000000000000000000
--- a/content/en/docs/Developerguide/server-optimization-arm-huawei-taishan-2p-4p.md
+++ /dev/null
@@ -1,213 +0,0 @@
-# Server Optimization – ARM Huawei Taishan 2P/4P
-
-The following are optional settings for optimizing MOT database performance running on an ARM/Kunpeng-based Huawei Taishan 2280 v2 server powered by 2 sockets with a total of 256 Cores[\[3\]](#_ftn3) and a Taishan 2480 v2 server powered by 4 sockets with a total of 256 Cores[\[4\]](#_ftn4).
-
-Unless indicated otherwise, the following settings are for both client and server machines:
-
--   [BIOS](#section12321235121710)
--   [OS – Kernel and Boot](#section784441614224)
--   [NVMe Disk](#section1254857132614)
--   [Network](#section17901635182813)
-
-## BIOS
-
-Modify related BIOS settings, as follows:
-
-1.  Select **BIOS - Advanced** **-** **MISC Config**. Set **Support Smmu** to **Disabled**.
Select **BIOS -** **Advanced - MISC Config**. Set **CPU Prefetching Configuration** to **Disabled**. - - ![](figures/en-us_image_0260574086.png) - -3. Select **BIOS - Advanced** **-** **Memory Config**. Set **Die Interleaving** to **Disabled**. - - ![](figures/en-us_image_0260574085.png) - -4. Select **BIOS** **-** **Advanced** **-** **Performance Config**. Set **Power Policy** to **Performance**. - - ![](figures/en-us_image_0260574087.png) - - -## OS – Kernel and Boot - -The following operating system kernel and boot parameters are usually configured by a sysadmin. - -Configure the kernel parameters, as follows: - -``` -net.ipv4.ip_local_port_range = 9000 65535 -kernel.sysrq = 1 -kernel.panic_on_oops = 1 -kernel.panic = 5 -kernel.hung_task_timeout_secs = 3600 -kernel.hung_task_panic = 1 -vm.oom_dump_tasks = 1 -kernel.softlockup_panic = 1 -fs.file-max = 640000 -kernel.msgmnb = 7000000 -kernel.sched_min_granularity_ns = 10000000 -kernel.sched_wakeup_granularity_ns = 15000000 -kernel.numa_balancing=0 -vm.max_map_count = 1048576 -net.ipv4.tcp_max_tw_buckets = 10000 -net.ipv4.tcp_tw_reuse = 1 -net.ipv4.tcp_tw_recycle = 1 -net.ipv4.tcp_keepalive_time = 30 -net.ipv4.tcp_keepalive_probes = 9 -net.ipv4.tcp_keepalive_intvl = 30 -net.ipv4.tcp_retries2 = 80 -kernel.sem = 32000 1024000000 500 32000 -kernel.shmall = 52805669 -kernel.shmmax = 18446744073692774399 -sys.fs.file-max = 6536438 -net.core.wmem_max = 21299200 -net.core.rmem_max = 21299200 -net.core.wmem_default = 21299200 -net.core.rmem_default = 21299200 -net.ipv4.tcp_rmem = 8192 250000 16777216 -net.ipv4.tcp_wmem = 8192 250000 16777216 -net.core.somaxconn = 65535 -vm.min_free_kbytes = 5270325 -net.core.netdev_max_backlog = 65535 -net.ipv4.tcp_max_syn_backlog = 65535 -net.ipv4.tcp_syncookies = 1 -vm.overcommit_memory = 0 -net.ipv4.tcp_retries1 = 5 -net.ipv4.tcp_syn_retries = 5 -##NEW -kernel.sched_autogroup_enabled=0 -kernel.sched_min_granularity_ns=2000000 -kernel.sched_latency_ns=10000000 -kernel.sched_wakeup_granularity_ns=5000000 -kernel.sched_migration_cost_ns=500000 -vm.dirty_background_bytes=33554432 -kernel.shmmax=21474836480 -net.ipv4.tcp_timestamps = 0 -net.ipv6.conf.all.disable_ipv6=1 -net.ipv6.conf.default.disable_ipv6=1 -net.ipv4.tcp_keepalive_time=600 -net.ipv4.tcp_keepalive_probes=3 -kernel.core_uses_pid=1 -``` - -- Tuned Service - - The following section is mandatory. - - The server must run a throughput-performance profile - - ``` - [...]$ tuned-adm profile throughput-performance - ``` - - The **throughput-performance** profile is broadly applicable tuning that provides excellent performance across a variety of common server workloads. - - Other less suitable profiles for openGauss and MOT server that may affect MOT’s overall performance are – balanced, desktop, latency-performance, network-latency, network-throughput and powersave. - -- Boot Tuning - - Add **iommu.passthrough=1** to the **kernel boot arguments**. - - When operating in **pass-through **mode, the adapter does require** DMA translation to the memory,** which improves performance. - - -## NVMe Disk - -**Settings** - -1. Format a partition, as follows - - ``` - mkfs.xfs -f -b size=8192 -s size=512 /dev/nvme0n1p1. - ``` - -2. Set the mount options - - ``` - UUID= /data1 xfs rw,noatime,inode64,allocsize=16m 0 0 - ``` - -3. Add the following to the kernel – cmdline **scsi\_mod.use\_blk\_mq=1** in order to enable BFQ. BFQ is a blk-mq \(Multi-Queue Block IO Queueing Mechanism\) scheduler. 
- -## Network - -- SmartNIC Driver Installation - - Perform the following if the server has Smart NIC installed, such as Hi1822 / Huawei IN200 NIC: - - 1. Verify that a NIC driver is installed. - 2. Verify that the NIC is configured as 64 queues. - - For detailed instructions, you may refer to [https://support.huawei.com/enterprise/de/doc/EDOC1100063073/42928ba6/configuring-64-queues](https://support.huawei.com/enterprise/de/doc/EDOC1100063073/42928ba6/configuring-64-queues) - - Quick instruction are as follows: - - - Configure the NIC as 64 queues \(queue pairs\). The default settings may already exist in the following **ini **file – ./hinicconfig hinic0 -f std\_sh\_4x25ge\_dpdk\_cfg\_template0.ini. - - Reboot and verify. - -- IRQ Configuration - - IRQ enables network-to-core balancing and interrupt management. The following is an optimized manual configuration for two server types – 256 cores and 128 cores. - - - For ARM Server 256, perform the following: - - 1. Disable the **irqbalance** service. - 2. For a single NIC \(recommended setting\), map four cores per queue – /var/scripts/set\_irq\_affinity\_256.sh -x all enp4s0. - - For a double NIC, map the last 16 cores per half-socket to the NIC queues, as follows – - - - /var/scripts/set\_irq\_affinity.sh -x 16-32,48-64,80-95,112-128 enp4s0 - - /var/scripts/set\_irq\_affinity.sh -x 146-161,176-191,208-224,240-256 enp5s0 - - 3. Configure the interrupt moderation, which changes the 1:1 ratio between packets and interrupts, as follows – ethtool -C enp4s0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 62. - - - For ARM Server 128, perform the following: - - 1. Disable the **irqbalance** service. - 2. For a single NIC \(recommended setting\), map two cores per queue – /var/scripts/set\_irq\_affinity\_128.sh -x all enp3s0. - - For a double NIC, map one NIC per socket, as follows: - - - ./set\_irq\_affinity.sh -x 0-63 enp3s0 - - ./set\_irq\_affinity.sh -x 64-128 enp4s0 - - 3. Configure the interrupt moderation, which changes the 1:1 ratio between packets and interrupts, as follows – ethtool-C enp4s0 adaptive-rx off adaptive-tx off rx-usecs 50 tx-usecs 50. - -- NIC Tune - - Configure the network buffer settings, as follows: - - - Increase the RX/TX buffer size – ethtool -G enp4s0 rx 4096 tx 4096. - - Modify NIC parameters \(client action=off , server action=on\), as follows – - - ethtool -K $net\_dev tso $action - - ethtool -K $net\_dev lro $action - - ethtool -K $net\_dev gro $action - - ethtool -K $net\_dev gso $action - -- rc.local - - Set the following in order to ensure that the above settings persist through future reboots. - - Replace **<256/128\>** with 256 or 128 according to whether the server has 256 or 128 cores. 
- - Configure the server machine, as follows - - ``` - killall -9 polkitd - - service sysmonitor stop - service irqbalance stop - service rsyslog stop - service firewalld stop - - echo madvise > /sys/kernel/mm/transparent_hugepage/enabled - - /var/scripts/net_tune.sh enp4s0 - /var/scripts/net_tune.sh enp5s0 - - ethtool -G enp4s0 rx 4096 tx 4096 - ethtool -G enp5s0 rx 4096 tx 4096 - /var/scripts/set_irq_affinity_<256/128>.sh -x all enp4s0 - ethtool -C enp4s0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 62 - ``` - - diff --git a/content/en/docs/Developerguide/sql-coverage-and-limitations.md b/content/en/docs/Developerguide/sql-coverage-and-limitations.md deleted file mode 100644 index 37b0b0808a5ba3f79b835c78d4a3c9946e48e30b..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/sql-coverage-and-limitations.md +++ /dev/null @@ -1,19 +0,0 @@ -# SQL Coverage and Limitations - -MOT design enables almost complete coverage of SQL and future feature sets. For example, standard Postgres SQL is mostly supported, as well common database features, such as stored procedures and user defined functions. - -- **[Unsupported Features](unsupported-features.md)** - -- **[MOT Table Limitations](mot-table-limitations.md)** - -- **[Unsupported Table DDLs](unsupported-table-ddls.md)** - -- **[Unsupported Data Types](unsupported-data-types.md)** - -- **[Unsupported Index DDLs and Index](unsupported-index-ddls-and-index.md)** - -- **[Unsupported DMLs](unsupported-dmls.md)** - -- **[Unsupported Queries for Native Compilation and Lite Execution](unsupported-queries-for-native-compilation-and-lite-execution.md)** - - diff --git a/content/en/docs/Developerguide/statistics.md b/content/en/docs/Developerguide/statistics.md deleted file mode 100644 index aab16154352931de4fb3b5c3fd817f180d3bbffa..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/statistics.md +++ /dev/null @@ -1,10 +0,0 @@ -# Statistics - -Statistics are intended for performance analysis or debugging. It is uncommon to turn them ON in a production environment \(by default, they are OFF\). Statistics are primarily used by database developers and to a lesser degree by database users. - -There is some impact on performance, particularly on the server. Impact on the user is negligible. - -The statistics are saved in the database server log. The log is located in the data folder and named **postgresql-DATE-TIME.log**. - -Refer to [Configuration Settings à Statistics](#_#_statistics) for detailed configuration options. - diff --git a/content/en/docs/Developerguide/statistics_mot.md b/content/en/docs/Developerguide/statistics_mot.md deleted file mode 100644 index 072433c97100e8c7e20cfa3b9389a4204e24c470..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/statistics_mot.md +++ /dev/null @@ -1,59 +0,0 @@ -# STATISTICS \(MOT\) - -- enable\_stats = false - - Configures periodic statistics for printing. - - -- print\_stats\_period = 10 minute - - Configures the time period for printing a summary statistics report. - - -- print\_full\_stats\_period = 1 hours - - Configures the time period for printing a full statistics report. - - The following settings configure the various sections included in the periodic statistics report. If none of them are configured, then the statistics report is suppressed. - - -- enable\_log\_recovery\_stats = false - - Log recovery statistics contain various Redo Log recovery metrics. 
- - -- enable\_db\_session\_stats = false - - Database session statistics contain transaction events, such commits, rollbacks and so on. - - -- enable\_network\_stats = false - - Network statistics contain connection/disconnection events. - - -- enable\_log\_stats = false - - Log statistics contain details regarding the Redo Log. - - -- enable\_memory\_stats = false - - Memory statistics contain memory-layer details. - - -- enable\_process\_stats = false - - Process statistics contain total memory and CPU consumption for the current process. - - -- enable\_system\_stats = false - - System statistics contain total memory and CPU consumption for the entire system. - - -- enable\_jit\_stats = false - - JIT statistics contain information regarding JIT query compilation and execution. - - diff --git a/content/en/docs/Developerguide/storage-io.md b/content/en/docs/Developerguide/storage-io.md deleted file mode 100644 index 7f4676b8a550c582174a7a7450ed0c0e8fba3202..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/storage-io.md +++ /dev/null @@ -1,39 +0,0 @@ -# Storage IO - -MOT is a memory-optimized, persistent database storage engine. A disk drive\(s\) is required for storing the Redo Log \(WAL\) and a periodic checkpoint. - -It is recommended to use a storage device with low latency, such as SSD with a RAID-1 configuration, NVMe or any enterprise-grade storage system. When appropriate hardware is used, the database transaction processing and contention are the bottleneck, not the IO. - -Since the persistent storage is much slower than RAM memory, the IO operations \(logging and checkpoint\) can create a bottleneck for both an in-memory and memory-optimized databases. However, MOT has a highly efficient durability design and implementation that is optimized for modern hardware \(such as SSD and NVMe\). In addition, MOT has minimized and optimized writing points \(for example, by using parallel logging, a single log record per transaction and NUMA-aware transaction group writing\) and has minimized the data written to disk \(for example, only logging the delta or updated columns of the changed records and only logging a transaction at the commit phase\). - -## Required Capacity - -The required capacity is determined by the requirements of checkpointing and logging, as described below: - -- **Checkpointing** - - A checkpoint saves a snapshot of all the data to disk. - - Twice the size of all data should be allocated for checkpointing. There is no need to allocate space for the indexes for checkpointing - - Checkpointing = 2x the MOT Data Size \(rows only, index is not persistent\). - - Twice the size is required because a snapshot is saved to disk of the entire size of the data, and in addition, the same amount of space should be allocated for the checkpoint that is in progress. When a checkpoint process finishes, the previous checkpoint files are deleted. - - >![](public_sys-resources/icon-note.gif) **NOTE:** - >In the next openGauss release, MOT will have an incremental checkpoint feature, which will significantly reduce this storage capacity requirement. - - -- **Logging** - - MOT table log records are written to the same database transaction log as the other records of disk-based tables. - - The size of the log depends on the transactional throughput, the size of the data changes and the time between checkpoints \(at each time checkpoint the Redo Log is truncated and starts to expanding again\). 
- - MOT tables use less log bandwidth and have lower IO contention than disk‑based tables. This is enabled by multiple mechanisms. - - For example, MOT does not log every operation before a transaction has been completed. It is only logged at the commit phase and only the updated delta record is logged \(not full records like for disk-based tables\). - - In order to ensure that the log IO device does not become a bottleneck, the log file must be placed on a drive that has low latency. - - diff --git a/content/en/docs/Developerguide/storage_mot.md b/content/en/docs/Developerguide/storage_mot.md deleted file mode 100644 index 3edc853548ff60b52fcf0ccb87fba63fb715c161..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/storage_mot.md +++ /dev/null @@ -1,6 +0,0 @@ -# STORAGE \(MOT\) - -allow\_index\_on\_nullable\_column = true - -Specifies whether it is permitted to define an index over a nullable column. - diff --git a/content/en/docs/Developerguide/system-level-optimization.md b/content/en/docs/Developerguide/system-level-optimization.md deleted file mode 100644 index fe34ee2e04e1b95b8fdf107a2c4cd4b4e305a2a8..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/system-level-optimization.md +++ /dev/null @@ -1,4 +0,0 @@ -# System-Level Optimization - -Follow the instructions in the [Server Optimization – ARM Huawei Taishan 2P/4P](server-optimization-arm-huawei-taishan-2p-4p.md) section. The following section describes the key system-level optimizations for deploying the openGauss database on a Huawei Taishan server and an Euler 2.8 operating system for ultimate performance. - diff --git a/content/en/docs/Developerguide/technical-requirements.md b/content/en/docs/Developerguide/technical-requirements.md deleted file mode 100644 index e470a088262abe6027b19fbb46d4e7296ba846ea..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/technical-requirements.md +++ /dev/null @@ -1,14 +0,0 @@ -# Technical Requirements - -MOT has been designed to achieve the following - -- **Linear Scale-up – **MOT delivers a transactional storage engine that utilizes all the cores of a single NUMA architecture server in order to provide near-linear scale-up performance. This means that MOT is targeted to achieve a direct, near-linear relationship between the quantity of cores in a machine and the multiples of performance increase. - - >![](public_sys-resources/icon-note.gif) **NOTE:** - >The near-linear scale-up results achieved by MOT significantly outperform all other existing solutions, and come as close as possible to achieving optimal results, which are limited by the physical restrictions and limitations of hardware, such as wires. - -- **No Maximum Number of Cores Limitation –** MOT does not place any limits on the maximum quantity of cores. This means that MOT is scalable from a single core up to 1,000s of cores, with minimal degradation per additional core, even when crossing NUMA socket boundaries. -- **Extremely High Transactional Throughout – **MOT delivers a transactional storage engine that can achieve extremely high transactional throughout compared with any other OLTP vendor on the market. -- **Extremely Low Transactional Latency –** MOT delivers a transactional storage engine that can reach extremely low transactional latency compared with any other OLTP vendor on the market. 
-- **Seamless Integration and Leveraging with/of openGauss –** MOT integrates its transactional engine in a standard and seamless manner with the openGauss product. In this way, MOT reuses maximum functionality from the openGauss layers that are situated on top of its transactional storage engine. - diff --git a/content/en/docs/Developerguide/tpc-c-benchmark.md b/content/en/docs/Developerguide/tpc-c-benchmark.md deleted file mode 100644 index fb6dc32f8d5e6a8022327dd968d96de76c8496f8..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/tpc-c-benchmark.md +++ /dev/null @@ -1,11 +0,0 @@ -# TPC-C Benchmark - -- **[TPC-C Introduction](tpc-c-introduction.md)** - -- **[System-Level Optimization](system-level-optimization.md)** - -- **[BenchmarkSQL – An Open-Source TPC-C Tool](benchmarksql-an-open-source-tpc-c-tool.md)** - -- **[Results Report](results-report.md)** - - diff --git a/content/en/docs/Developerguide/unsupported-data-types.md b/content/en/docs/Developerguide/unsupported-data-types.md deleted file mode 100644 index 37e37038653af7bcc66ddbc4c2ad6aa87cc09035..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/unsupported-data-types.md +++ /dev/null @@ -1,35 +0,0 @@ -# Unsupported Data Types - -- UUID -- User-Defined Type \(UDF\) -- Array data type -- NVARCHAR2\(n\) -- Clob -- Name -- Blob -- Raw -- Path -- Circle -- Reltime -- Bit varying\(10\) -- Tsvector -- Tsquery -- JSON -- HSTORE -- Box -- Text -- Line -- Point -- LSEG -- POLYGON -- INET -- CIDR -- MACADDR -- Smalldatetime -- BYTEA -- Bit -- Varbit -- OID -- Money -- Any unlimited varchar/char - diff --git a/content/en/docs/Developerguide/unsupported-dmls.md b/content/en/docs/Developerguide/unsupported-dmls.md deleted file mode 100644 index 351ea6fdcaf6a141af8523797f1fdc82f61b2f82..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/unsupported-dmls.md +++ /dev/null @@ -1,9 +0,0 @@ -# Unsupported DMLs - -- Merge into -- Delete on conflict -- Insert on conflict -- Select into -- Update on conflict -- Update from - diff --git a/content/en/docs/Developerguide/unsupported-features.md b/content/en/docs/Developerguide/unsupported-features.md deleted file mode 100644 index cc859e0c2d5529decc19197e1a04d8f9a5e623da..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/unsupported-features.md +++ /dev/null @@ -1,13 +0,0 @@ -# Unsupported Features - -The following features are not supported by MOT: - -- **Engine Interop –** No cross-engine \(Disk+MOT\) queries, views or transactions. Planned for 2021. -- **MVCC, Isolation –** No snapshot/serializable isolation. Planned for 2021. -- **Native Compilation **\(JIT\)**–** Limited SQL coverage. Also, JIT compilation of stored procedures is not supported. -- LOCAL memory is limited to 1 GB. A transaction can only change data of less than 1 GB. -- Capacity \(Data+Index\) is limited to available memory. Anti-caching + Data Tiering will be available in the future. -- No full-text search index. - -In addition, the following are detailed lists of various general limitations of MOT tables, MOT indexes, Query and DML syntax and the features and limitations of Query Native Compilation. 
- diff --git a/content/en/docs/Developerguide/unsupported-queries-for-native-compilation-and-lite-execution.md b/content/en/docs/Developerguide/unsupported-queries-for-native-compilation-and-lite-execution.md deleted file mode 100644 index c63c1c50f9c82147f23b0c5f4a9d6943c8cbdbda..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/unsupported-queries-for-native-compilation-and-lite-execution.md +++ /dev/null @@ -1,24 +0,0 @@ -# Unsupported Queries for Native Compilation and Lite Execution - -- The query refers to more than two tables -- The query has any one of the following attributes: - - Aggregation on non-primitive types - - Window functions - - Sub-query sub-links - - Distinct-ON modifier \(distinct clause is from DISTINCT ON\) - - Recursive \(WITH RECURSIVE was specified\) - - Modifying CTE \(has INSERT/UPDATE/DELETE in WITH\) - - -In addition, the following clauses disqualify a query from lite execution: - -- Returning list -- Group By clause -- Grouping sets -- Having clause -- Windows clause -- Distinct clause -- Sort clause that does not conform to native index order -- Set operations -- Constraint dependencies - diff --git a/content/en/docs/Developerguide/unsupported-table-ddls.md b/content/en/docs/Developerguide/unsupported-table-ddls.md deleted file mode 100644 index ec75c8da0fb6661b520b24853e8e4e424349ff20..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/unsupported-table-ddls.md +++ /dev/null @@ -1,9 +0,0 @@ -# Unsupported Table DDLs - -- Alter table -- Create table, like including -- Create table as select -- Partition by range -- Create table with no-logging clause -- DEFERRABLE primary key - diff --git a/content/en/docs/Developerguide/using-mot-overview.md b/content/en/docs/Developerguide/using-mot-overview.md new file mode 100644 index 0000000000000000000000000000000000000000..e71bc0be01c5d856a873bfe30ff869504dfe3003 --- /dev/null +++ b/content/en/docs/Developerguide/using-mot-overview.md @@ -0,0 +1,18 @@ +# Using MOT Overview + +MOT is automatically deployed as part of openGauss. You may refer to the [MOT Preparation](mot-preparation.md) section for a description of how to estimate and plan required memory and storage resources in order to sustain your workload. The [MOT Deployment](mot-deployment.md) section describes all the configuration settings in MOT, as well as non-mandatory options for server optimization. + +Using MOT tables is quite simple. The syntax of all MOT commands is the same as for disk-based tables and includes support for most of standard PostgreSQL SQL, DDL and DML commands and features, such as Stored Procedures. Only the create and drop table statements in MOT differ from the statements for disk-based tables in openGauss. You may refer to the [MOT Usage](mot-usage.md) section for a description of these two simple commands, to learn how to convert a disk-based table into an MOT table, to get higher performance using Query Native Compilation and PREPARE statements and for a description of external tool support and the limitations of the MOT engine. + +The [MOT Administration](mot-administration.md) section describes how to perform database maintenance, monitoring and analysis of logs and reported errors. Lastly, the [MOT Sample TPC-C Benchmark](mot-sample-tpc-c-benchmark.md) section describes how to perform a standard TPC-C benchmark. + +- Read the following topics to learn how to use MOT – + + + + + +
+- **[MOT Preparation](mot-preparation.md)**
+- **[MOT Deployment](mot-deployment.md)**
+- **[MOT Usage](mot-usage.md)**
+- **[MOT Administration](mot-administration.md)**
+- **[MOT Sample TPC-C Benchmark](mot-sample-tpc-c-benchmark.md)**
+ + diff --git a/content/en/docs/Developerguide/using-mot.md b/content/en/docs/Developerguide/using-mot.md index 388f4413c6e8cc911d68ef234b59098a744755cc..9c82975e48a096075fcfbd5c3db002f653bb6dda 100644 --- a/content/en/docs/Developerguide/using-mot.md +++ b/content/en/docs/Developerguide/using-mot.md @@ -1,19 +1,17 @@ -# Using MOT - -This**** chapter describes how to deploy, use and manage openGauss MOT. Using MOT tables is quite simple. The syntax of all MOT commands is the same as for openGauss disk‑based tables. Only the create and drop table statements in MOT differ from the statements for disk-based tables in openGauss. You may refer to this chapter in order to learn how to get started, how to convert a disk‑based table into a MOT table, how to use MOT’s Query Native Compilation feature and about MOT’s limitations and coverage. MOT administration options are also described here. This chapter also describes how to perform a TPC-C benchmark. - -- **[Overview](overview.md)** - -- **[Preparation](preparation.md)** - -- **[Deployment](deployment.md)** - -- **[Usage](usage.md)** - -- **[Administration](administration.md)** - -- **[Durability](durability.md)** - -- **[Sample Workloads](sample-workloads.md)** - - +# Using MOT + +This chapter describes how to deploy, use and manage openGauss MOT. Using MOT tables is quite simple. The syntax of all MOT commands is the same as for openGauss disk‑based tables. Only the create and drop table statements in MOT differ from the statements for disk-based tables in openGauss. You may refer to this chapter in order to learn how to get started, how to convert a disk‑based table into an MOT table, how to use MOT's Query Native Compilation feature and about MOT's limitations and coverage. MOT administration options are also described here. This chapter also describes how to perform a TPC-C benchmark. + +- **[Using MOT Overview](using-mot-overview.md)** + +- **[MOT Preparation](mot-preparation.md)** + +- **[MOT Deployment](mot-deployment.md)** + +- **[MOT Usage](mot-usage.md)** + +- **[MOT Administration](mot-administration.md)** + +- **[MOT Sample TPC-C Benchmark](mot-sample-tpc-c-benchmark.md)** + + diff --git a/content/en/docs/Developerguide/vacuum.md b/content/en/docs/Developerguide/vacuum.md deleted file mode 100644 index cb9d24fa62b1ac40b8f2cb9f85863ebe690ac85f..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/vacuum.md +++ /dev/null @@ -1,37 +0,0 @@ -# Vacuum - -Use VACUUM for garbage collection and optionally to analyze a database. - -## \[PG\] - -In Postgress \(PG\), the VACUUM reclaims storage occupied by dead tuples. In normal PG operation, tuples that are deleted or that are made obsolete by an update are not physically removed from their table. They remain present until a VACUUM is done. Therefore, it is necessary to perform a VACUUM periodically, especially on frequently updated tables. - -## \[MOT Extension\] - -MOT tables do not need a periodic VACUUM operation, since dead/empty tuples are re‑used by new ones. MOT tables require VACUUM operations only when their size is significantly reduced and they do not expect to grow to their original size in the near future. - -For example, an application that periodically \(for example, once in a week\) performs a large deletion of a table/tables data while inserting new data takes days and does not necessarily require the same quantity of rows. In such cases, it makes sense to activate the VACUUM. 
- -The VACUUM operation on MOT tables is always transformed into a VACUUM FULL with an exclusive table lock. - -## Supported Syntax and Limitations - -Activation of the VACUUM operation is performed in a standard manner. - -``` -VACUUM [FULL | ANALYZE] [ table ]; -``` - -Only the FULL and ANALYZE VACUUM options are supported. The VACUUM operation can only be performed on an entire MOT table. - -The following PG vacuum options are not supported - -- FREEZE -- VERBOSE -- Column specification -- LAZY mode \(partial table scan\) - -Additionally, the following functionality is not supported - -- AUTOVACUUM - diff --git a/content/en/docs/Developerguide/workflow-overview.md b/content/en/docs/Developerguide/workflow-overview.md deleted file mode 100644 index ba093b1171654fc1bf1d21d4f954ff6862751dba..0000000000000000000000000000000000000000 --- a/content/en/docs/Developerguide/workflow-overview.md +++ /dev/null @@ -1,19 +0,0 @@ -# Workflow Overview - -The following is a simple overview of the tasks related to working with MOT tables – - -![](figures/en-us_image_0260488312.png) - -- [Granting User Permissions](granting-user-permissions.md) -- [Creating/Dropping a MOT Table](creating-dropping-a-mot-table.md) -- **Creating an Index** - -This section also describes how to perform various other MOT-related tasks, such as – - -- [Converting a Disk Table into a MOT Table](converting-a-disk-table-into-a-mot-table.md) -- **Using Query Native Compilation** -- **Retrying an aborted transaction** -- [External Support Tools](external-support-tools.md) - -This section also describes [SQL Coverage and Limitations](sql-coverage-and-limitations.md). - diff --git a/content/en/menu/index.md b/content/en/menu/index.md index 220d2712d7766b215203f8ba231632a6b06c79a5..a316de6da5ce22bbb8b6a76554b6403a71dd3844 100644 --- a/content/en/menu/index.md +++ b/content/en/menu/index.md @@ -525,123 +525,65 @@ headless: true - [Usage Guide]({{< relref "./docs/Developerguide/usage-guide-11.md" >}}) - [Obtaining Help Information]({{< relref "./docs/Developerguide/obtaining-help-information-12.md" >}}) - [Command Reference]({{< relref "./docs/Developerguide/command-reference-13.md" >}}) - - [Troubleshooting]({{< relref "./docs/Developerguide/troubleshooting-14.md" >}}) - + - [Troubleshooting]({{< relref "./docs/Developerguide/troubleshooting-14.md" >}}) - [MOT Engine]({{< relref "./docs/Developerguide/mot.md" >}}) - - [Introducing MOT]({{< relref "./docs/Developerguide/introducing-mot.md" >}}) - - [MOT Introduction]({{< relref "./docs/Developerguide/mot-introduction.md" >}}) - - [Features and Benefits]({{< relref "./docs/Developerguide/features-and-benefits.md" >}}) - - [MOT Key Technologies]({{< relref "./docs/Developerguide/mot-key-technologies.md" >}}) - - [Usage Scenarios]({{< relref "./docs/Developerguide/usage-scenarios.md" >}}) - - [Performance Benchmarks]({{< relref "./docs/Developerguide/performance-benchmarks.md" >}}) - - [Hardware]({{< relref "./docs/Developerguide/hardware.md" >}}) - - [Results – Summary]({{< relref "./docs/Developerguide/results-summary.md" >}}) - - [High Throughput]({{< relref "./docs/Developerguide/high-throughput.md" >}}) - - [ARM/Kunpeng 2-Socket 128 Cores]({{< relref "./docs/Developerguide/arm-kunpeng-2-socket-128-cores.md" >}}) - - [ARM/Kunpeng 4-Socket 256 Cores]({{< relref "./docs/Developerguide/arm-kunpeng-4-socket-256-cores.md" >}}) - - [x86-based servers]({{< relref "./docs/Developerguide/x86-based-servers.md" >}}) - - [Low Latency]({{< relref "./docs/Developerguide/low-latency.md" >}}) - - 
[Recovery / Cold-Start Time]({{< relref "./docs/Developerguide/recovery-cold-start-time.md" >}}) - - [Resource Utilization]({{< relref "./docs/Developerguide/resource-utilization.md" >}}) - - [Data Ingestion Speed]({{< relref "./docs/Developerguide/data-ingestion-speed.md" >}}) - - [Using MOT]({{< relref "./docs/Developerguide/using-mot.md" >}}) - - [Overview]({{< relref "./docs/Developerguide/overview.md" >}}) - - [Preparation]({{< relref "./docs/Developerguide/preparation.md" >}}) - - [Prerequisites]({{< relref "./docs/Developerguide/prerequisites.md" >}}) - - [Memory and Storage Planning]({{< relref "./docs/Developerguide/memory-and-storage-planning.md" >}}) - - [Memory Planning]({{< relref "./docs/Developerguide/memory-planning.md" >}}) - - [Storage IO]({{< relref "./docs/Developerguide/storage-io.md" >}}) - - [Deployment]({{< relref "./docs/Developerguide/deployment.md" >}}) - - [Server Optimization – x86]({{< relref "./docs/Developerguide/server-optimization-x86.md" >}}) - - [Server Optimization – ARM Huawei Taishan 2P/4P]({{< relref "./docs/Developerguide/server-optimization-arm-huawei-taishan-2p-4p.md" >}}) - - [MOT Configuration Settings]({{< relref "./docs/Developerguide/mot-configuration-settings.md" >}}) - - [General Guidelines]({{< relref "./docs/Developerguide/general-guidelines.md" >}}) - - [REDO LOG \(MOT\)]({{< relref "./docs/Developerguide/redo-log_mot.md" >}}) - - [CHECKPOINT \(MOT\)]({{< relref "./docs/Developerguide/checkpoint_mot.md" >}}) - - [RECOVERY \(MOT\)]({{< relref "./docs/Developerguide/recovery_mot.md" >}}) - - [STATISTICS \(MOT\)]({{< relref "./docs/Developerguide/statistics_mot.md" >}}) - - [ERROR LOG \(MOT\)]({{< relref "./docs/Developerguide/error-log_mot.md" >}}) - - [MEMORY \(MOT\)]({{< relref "./docs/Developerguide/memory_mot.md" >}}) - - [GARBAGE COLLECTION \(MOT\)]({{< relref "./docs/Developerguide/garbage-collection_mot.md" >}}) - - [JIT \(MOT\)]({{< relref "./docs/Developerguide/jit_mot.md" >}}) - - [STORAGE \(MOT\)]({{< relref "./docs/Developerguide/storage_mot.md" >}}) - - [Default MOT.conf]({{< relref "./docs/Developerguide/default-mot-conf.md" >}}) - - [Usage]({{< relref "./docs/Developerguide/usage.md" >}}) - - [Workflow Overview]({{< relref "./docs/Developerguide/workflow-overview.md" >}}) - - [Granting User Permissions]({{< relref "./docs/Developerguide/granting-user-permissions.md" >}}) - - [Creating/Dropping a MOT Table]({{< relref "./docs/Developerguide/creating-dropping-a-mot-table.md" >}}) - - [Creating an Index for MOT Table]({{< relref "./docs/Developerguide/creating-an-index-for-mot-table.md" >}}) - - [Converting a Disk Table into a MOT Table]({{< relref "./docs/Developerguide/converting-a-disk-table-into-a-mot-table.md" >}}) - - [Prerequisite Check]({{< relref "./docs/Developerguide/prerequisite-check.md" >}}) - - [Converting]({{< relref "./docs/Developerguide/converting.md" >}}) - - [Conversion Example]({{< relref "./docs/Developerguide/conversion-example.md" >}}) - - [Query Native Compilation]({{< relref "./docs/Developerguide/query-native-compilation.md" >}}) - - [Retrying an Aborted Transaction]({{< relref "./docs/Developerguide/retrying-an-aborted-transaction.md" >}}) - - [External Support Tools]({{< relref "./docs/Developerguide/external-support-tools.md" >}}) - - [gs\_ctl \(Full and Incremental\)]({{< relref "./docs/Developerguide/gs_ctl-(full-and-incremental).md" >}}) - - [gs\_basebackup]({{< relref "./docs/Developerguide/gs_basebackup.md" >}}) - - [gs\_dump]({{< relref "./docs/Developerguide/gs_dump.md" >}}) - - 
[gs\_restore]({{< relref "./docs/Developerguide/gs_restore.md" >}}) - - [SQL Coverage and Limitations]({{< relref "./docs/Developerguide/sql-coverage-and-limitations.md" >}}) - - [Unsupported Features]({{< relref "./docs/Developerguide/unsupported-features.md" >}}) - - [MOT Table Limitations]({{< relref "./docs/Developerguide/mot-table-limitations.md" >}}) - - [Unsupported Table DDLs]({{< relref "./docs/Developerguide/unsupported-table-ddls.md" >}}) - - [Unsupported Data Types]({{< relref "./docs/Developerguide/unsupported-data-types.md" >}}) - - [Unsupported Index DDLs and Index]({{< relref "./docs/Developerguide/unsupported-index-ddls-and-index.md" >}}) - - [Unsupported DMLs]({{< relref "./docs/Developerguide/unsupported-dmls.md" >}}) - - [Unsupported Queries for Native Compilation and Lite Execution]({{< relref "./docs/Developerguide/unsupported-queries-for-native-compilation-and-lite-execution.md" >}}) - - [Administration]({{< relref "./docs/Developerguide/administration.md" >}}) - - [Durability]({{< relref "./docs/Developerguide/durability.md" >}}) - - [Configuring Durability]({{< relref "./docs/Developerguide/configuring-durability.md" >}}) - - [Logging – WAL Redo Log]({{< relref "./docs/Developerguide/logging-wal-redo-log.md" >}}) - - [Logging Types]({{< relref "./docs/Developerguide/logging-types.md" >}}) - - [Configuring Logging]({{< relref "./docs/Developerguide/configuring-logging.md" >}}) - - [Checkpoints]({{< relref "./docs/Developerguide/checkpoints.md" >}}) - - [Recovery]({{< relref "./docs/Developerguide/recovery.md" >}}) - - [Replication and High Availability]({{< relref "./docs/Developerguide/replication-and-high-availability.md" >}}) - - [Memory Management]({{< relref "./docs/Developerguide/memory-management.md" >}}) - - [Vacuum]({{< relref "./docs/Developerguide/vacuum.md" >}}) - - [Statistics]({{< relref "./docs/Developerguide/statistics.md" >}}) - - [Monitoring]({{< relref "./docs/Developerguide/monitoring.md" >}}) - - [Error Messages]({{< relref "./docs/Developerguide/error-messages.md" >}}) - - [Errors Written the Log File]({{< relref "./docs/Developerguide/errors-written-the-log-file.md" >}}) - - [Errors Returned to the User]({{< relref "./docs/Developerguide/errors-returned-to-the-user.md" >}}) - - [Sample Workloads]({{< relref "./docs/Developerguide/sample-workloads.md" >}}) - - [TPC-C Benchmark]({{< relref "./docs/Developerguide/tpc-c-benchmark.md" >}}) - - [TPC-C Introduction]({{< relref "./docs/Developerguide/tpc-c-introduction.md" >}}) - - [System-Level Optimization]({{< relref "./docs/Developerguide/system-level-optimization.md" >}}) - - [BenchmarkSQL – An Open-Source TPC-C Tool]({{< relref "./docs/Developerguide/benchmarksql-an-open-source-tpc-c-tool.md" >}}) - - [Results Report]({{< relref "./docs/Developerguide/results-report.md" >}}) - - [Concepts of MOT]({{< relref "./docs/Developerguide/concepts-of-mot.md" >}}) - - [Scale-up Architecture]({{< relref "./docs/Developerguide/scale-up-architecture.md" >}}) - - [Technical Requirements]({{< relref "./docs/Developerguide/technical-requirements.md" >}}) - - [Design Principles]({{< relref "./docs/Developerguide/design-principles.md" >}}) - - [Integration using Foreign Data Wrappers \(FDW\)]({{< relref "./docs/Developerguide/integration-using-foreign-data-wrappers-(fdw).md" >}}) - - [Result – Linear Scale-up]({{< relref "./docs/Developerguide/result-linear-scale-up.md" >}}) - - [Concurrency Control Mechanism]({{< relref "./docs/Developerguide/concurrency-control-mechanism.md" >}}) - - [Local and Global MOT 
Memory]({{< relref "./docs/Developerguide/local-and-global-mot-memory.md" >}}) - - [SILO Enhancements for MOT]({{< relref "./docs/Developerguide/silo-enhancements-for-mot.md" >}}) - - [Isolation Levels]({{< relref "./docs/Developerguide/isolation-levels.md" >}}) - - [Optimistic Concurrency Control]({{< relref "./docs/Developerguide/optimistic-concurrency-control.md" >}}) - - [Optimistic OCC vs. Pessimistic 2PL]({{< relref "./docs/Developerguide/optimistic-occ-vs-pessimistic-2pl.md" >}}) - - [OCC vs 2PL Differences by Example]({{< relref "./docs/Developerguide/occ-vs-2pl-differences-by-example.md" >}}) - - [Pessimistic Approach – Used in Disk-based Tables]({{< relref "./docs/Developerguide/pessimistic-approach-used-in-disk-based-tables.md" >}}) - - [Optimistic Approach – Used in MOT]({{< relref "./docs/Developerguide/optimistic-approach-used-in-mot.md" >}}) - - [Extended FDW and Other openGauss Features]({{< relref "./docs/Developerguide/extended-fdw-and-other-opengauss-features.md" >}}) - - [NUMA Awareness Allocation and Affinity]({{< relref "./docs/Developerguide/numa-awareness-allocation-and-affinity.md" >}}) - - [Indexes]({{< relref "./docs/Developerguide/indexes.md" >}}) - - [Secondary Index Support]({{< relref "./docs/Developerguide/secondary-index-support.md" >}}) - - [Non-unique Indexes]({{< relref "./docs/Developerguide/non-unique-indexes.md" >}}) - - [Durability]({{< relref "./docs/Developerguide/durability-0.md" >}}) - - [Exception Handling]({{< relref "./docs/Developerguide/exception-handling.md" >}}) - - [Logging]({{< relref "./docs/Developerguide/logging.md" >}}) - - [Overall Architecture]({{< relref "./docs/Developerguide/overall-architecture.md" >}}) - - [Checkpoint]({{< relref "./docs/Developerguide/checkpoint.md" >}}) - - [CALC Checkpoint algorithm: low overhead in memory and compute]({{< relref "./docs/Developerguide/calc-checkpoint-algorithm-low-overhead-in-memory-and-compute.md" >}}) - - [Checkpoint Activation]({{< relref "./docs/Developerguide/checkpoint-activation.md" >}}) - - [Recovery]({{< relref "./docs/Developerguide/recovery-1.md" >}}) - - [Query Native Compilation \(JIT\)]({{< relref "./docs/Developerguide/query-native-compilation_jit.md" >}}) - - [Comparison – Disk vs. 
MOT]({{< relref "./docs/Developerguide/comparison-disk-vs-mot.md" >}}) + - [Introducing MOT]({{< relref "./docs/Developerguide/introducing-mot.md" >}}) + - [MOT Introduction]({{< relref "./docs/Developerguide/mot-introduction.md" >}}) + - [MOT Features and Benefits]({{< relref "./docs/Developerguide/mot-features-and-benefits.md" >}}) + - [MOT Key Technologies]({{< relref "./docs/Developerguide/mot-key-technologies.md" >}}) + - [MOT Usage Scenarios]({{< relref "./docs/Developerguide/mot-usage-scenarios.md" >}}) + - [MOT Performance Benchmarks]({{< relref "./docs/Developerguide/mot-performance-benchmarks.md" >}}) + - [MOT Hardware]({{< relref "./docs/Developerguide/mot-hardware.md" >}}) + - [MOT Results – Summary]({{< relref "./docs/Developerguide/mot-results-summary.md" >}}) + - [MOT High Throughput]({{< relref "./docs/Developerguide/mot-high-throughput.md" >}}) + - [MOT Low Latency]({{< relref "./docs/Developerguide/mot-low-latency.md" >}}) + - [MOT RTO and Cold-Start Time]({{< relref "./docs/Developerguide/mot-rto-and-cold-start-time.md" >}}) + - [MOT Resource Utilization]({{< relref "./docs/Developerguide/mot-resource-utilization.md" >}}) + - [MOT Data Ingestion Speed]({{< relref "./docs/Developerguide/mot-data-ingestion-speed.md" >}}) + - [Using MOT]({{< relref "./docs/Developerguide/using-mot.md" >}}) + - [Using MOT Overview]({{< relref "./docs/Developerguide/using-mot-overview.md" >}}) + - [MOT Preparation]({{< relref "./docs/Developerguide/mot-preparation.md" >}}) + - [MOT Prerequisites]({{< relref "./docs/Developerguide/mot-prerequisites.md" >}}) + - [MOT Memory and Storage Planning]({{< relref "./docs/Developerguide/mot-memory-and-storage-planning.md" >}}) + - [MOT Deployment]({{< relref "./docs/Developerguide/mot-deployment.md" >}}) + - [MOT Server Optimization – x86]({{< relref "./docs/Developerguide/mot-server-optimization-x86.md" >}}) + - [MOT Server Optimization – ARM Huawei Taishan 2P/4P]({{< relref "./docs/Developerguide/mot-server-optimization-arm-huawei-taishan-2p-4p.md" >}}) + - [MOT Configuration Settings]({{< relref "./docs/Developerguide/mot-configuration-settings.md" >}}) + - [MOT Usage]({{< relref "./docs/Developerguide/mot-usage.md" >}}) + - [Granting User Permissions]({{< relref "./docs/Developerguide/granting-user-permissions.md" >}}) + - [Creating/Dropping an MOT Table]({{< relref "./docs/Developerguide/creating-dropping-an-mot-table.md" >}}) + - [Creating an Index for an MOT Table]({{< relref "./docs/Developerguide/creating-an-index-for-an-mot-table.md" >}}) + - [Converting a Disk Table into an MOT Table]({{< relref "./docs/Developerguide/converting-a-disk-table-into-an-mot-table.md" >}}) + - [Query Native Compilation]({{< relref "./docs/Developerguide/query-native-compilation.md" >}}) + - [Retrying an Aborted Transaction]({{< relref "./docs/Developerguide/retrying-an-aborted-transaction.md" >}}) + - [MOT External Support Tools]({{< relref "./docs/Developerguide/mot-external-support-tools.md" >}}) + - [MOT SQL Coverage and Limitations]({{< relref "./docs/Developerguide/mot-sql-coverage-and-limitations.md" >}}) + - [MOT Administration]({{< relref "./docs/Developerguide/mot-administration.md" >}}) + - [MOT Durability]({{< relref "./docs/Developerguide/mot-durability.md" >}}) + - [MOT Recovery]({{< relref "./docs/Developerguide/mot-recovery.md" >}}) + - [MOT Replication and High Availability]({{< relref "./docs/Developerguide/mot-replication-and-high-availability.md" >}}) + - [MOT Memory Management]({{< relref "./docs/Developerguide/mot-memory-management.md" >}}) + 
- [MOT Vacuum]({{< relref "./docs/Developerguide/mot-vacuum.md" >}}) + - [MOT Statistics]({{< relref "./docs/Developerguide/mot-statistics.md" >}}) + - [MOT Monitoring]({{< relref "./docs/Developerguide/mot-monitoring.md" >}}) + - [MOT Error Messages]({{< relref "./docs/Developerguide/mot-error-messages.md" >}}) + - [MOT Sample TPC-C Benchmark]({{< relref "./docs/Developerguide/mot-sample-tpc-c-benchmark.md" >}}) + - [Concepts of MOT]({{< relref "./docs/Developerguide/concepts-of-mot.md" >}}) + - [MOT Scale-up Architecture]({{< relref "./docs/Developerguide/mot-scale-up-architecture.md" >}}) + - [MOT Concurrency Control Mechanism]({{< relref "./docs/Developerguide/mot-concurrency-control-mechanism.md" >}}) + - [MOT Local and Global Memory]({{< relref "./docs/Developerguide/mot-local-and-global-memory.md" >}}) + - [MOT SILO Enhancements]({{< relref "./docs/Developerguide/mot-silo-enhancements.md" >}}) + - [MOT Isolation Levels]({{< relref "./docs/Developerguide/mot-isolation-levels.md" >}}) + - [MOT Optimistic Concurrency Control]({{< relref "./docs/Developerguide/mot-optimistic-concurrency-control.md" >}}) + - [Extended FDW and Other openGauss Features]({{< relref "./docs/Developerguide/extended-fdw-and-other-opengauss-features.md" >}}) + - [NUMA Awareness Allocation and Affinity]({{< relref "./docs/Developerguide/numa-awareness-allocation-and-affinity.md" >}}) + - [MOT Indexes]({{< relref "./docs/Developerguide/mot-indexes.md" >}}) + - [MOT Durability Concepts]({{< relref "./docs/Developerguide/mot-durability-concepts.md" >}}) + - [MOT Logging – WAL Redo Log Concepts]({{< relref "./docs/Developerguide/mot-logging-wal-redo-log-concepts.md" >}}) + - [MOT Checkpoint Concepts]({{< relref "./docs/Developerguide/mot-checkpoint-concepts.md" >}}) + - [MOT Recovery Concepts]({{< relref "./docs/Developerguide/mot-recovery-concepts.md" >}}) + - [MOT Query Native Compilation \(JIT\)]({{< relref "./docs/Developerguide/mot-query-native-compilation-jit.md" >}}) + - [Comparison – Disk vs. 
MOT]({{< relref "./docs/Developerguide/comparison-disk-vs-mot.md" >}}) - [Performance Tuning]({{< relref "./docs/Developerguide/performance-tuning.md" >}}) - [Overview]({{< relref "./docs/Developerguide/overview-15.md" >}}) - [Determining the Scope of Performance Tuning]({{< relref "./docs/Developerguide/determining-the-scope-of-performance-tuning.md" >}}) @@ -829,7 +771,7 @@ headless: true - [ANALYZE | ANALYSE]({{< relref "./docs/Developerguide/analyze-analyse.md" >}}) - [BEGIN]({{< relref "./docs/Developerguide/begin.md" >}}) - [CALL]({{< relref "./docs/Developerguide/call.md" >}}) - - [CHECKPOINT]({{< relref "./docs/Developerguide/checkpoint.md" >}}) + - [CHECKPOINT]({{< relref "./docs/Developerguide/checkpoint-32.md" >}}) - [CLOSE]({{< relref "./docs/Developerguide/close.md" >}}) - [CLUSTER]({{< relref "./docs/Developerguide/cluster.md" >}}) - [COMMENT]({{< relref "./docs/Developerguide/comment.md" >}}) @@ -915,7 +857,7 @@ headless: true - [START TRANSACTION]({{< relref "./docs/Developerguide/start-transaction.md" >}}) - [TRUNCATE]({{< relref "./docs/Developerguide/truncate.md" >}}) - [UPDATE]({{< relref "./docs/Developerguide/update.md" >}}) - - [VACUUM]({{< relref "./docs/Developerguide/vacuum.md" >}}) + - [VACUUM]({{< relref "./docs/Developerguide/vacuum-33.md" >}}) - [VALUES]({{< relref "./docs/Developerguide/values.md" >}}) - [Appendix]({{< relref "./docs/Developerguide/appendix.md" >}}) - [GIN Indexes]({{< relref "./docs/Developerguide/gin-indexes.md" >}}) @@ -1328,7 +1270,7 @@ headless: true - [Parallel Data Import]({{< relref "./docs/Developerguide/parallel-data-import.md" >}}) - [Write Ahead Log]({{< relref "./docs/Developerguide/write-ahead-log.md" >}}) - [Settings]({{< relref "./docs/Developerguide/settings.md" >}}) - - [Checkpoints]({{< relref "./docs/Developerguide/checkpoints.md" >}}) + - [Checkpoints]({{< relref "./docs/Developerguide/checkpoints-41.md" >}}) - [Log Replay]({{< relref "./docs/Developerguide/log-replay.md" >}}) - [Archiving]({{< relref "./docs/Developerguide/archiving.md" >}}) - [HA Replication]({{< relref "./docs/Developerguide/ha-replication.md" >}})