CMU15445Lab3查询执行引擎

整体架构

image-20230827204310822

  1. 输入一个SQL语句,通过SQL语句,走到解析层
  2. 走到binder,执行查询重写
  3. 走到planner,生成一个执行计划
  4. 到optimizer,执行一个优化
  5. 到查询执行这一层
  6. 后面就是并发控制和存储层等,先不介绍

信息介绍

schema模式

注意到,在解释计划节点时,每个节点后面都有一长串列描述。它是这个计划节点的预期输出模式。考虑下面的例子:

1
Projection { exprs=[#0.0, #0.1] } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

这表示代表此计划节点的执行器将生成两列。它们都是整数类型。输出模式是在规划器中推断的。对于这个项目,您的执行器实现必须按计划节点中指定的模式精确地生成元组,否则它将无法通过检查输出的测试。

Catalog系统表/元数据/数据字典

数据库维护一个内部目录以跟踪关于数据库的元数据信息。在这个项目中,可以与系统目录进行交互,以查询关于表、索引及其模式的信息。

Index Updates 索引更新

对于表修改执行器(InsertExecutor和DeleteExecutor),您必须修改操作目标表的所有索引。您可以使用Catalog :: GetTableIndexes()函数查询特定表定义的所有索引。一旦您获得了表的每个索引的IndexInfo实例,就可以在底层索引结构上调用索引修改操作

火山模型

  • 简介简单理解就是实现一个next() 方法,是查询执行中最常用的模型

  • 特点 火山模型将整个SQL构建成一颗运算符树,每一个运算符节点实现一个Next()函数,每次调用这个函数将会返回算子产生的一行数据tuple。根节点循环向子节点调用Next()方法,直到数据拉取完毕

  • 在本项目中主要是init和next方法
  • 调用模式:从父结点往调用子节点的方法调用,每次返回弹出一个tuple传到上层

C++基础知识总结

  • 共享指针和普通指针如何转换

    用get()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <iostream>
#include <memory>
using namespace std;

int main(){
shared_ptr<int> sp_1;
int *a;
int c = 100;
a = &c;
// 普通指针转换为智能指针
sp_1 = make_shared<int>(c);
sp_2 = make_shared<int>(*a);
cout<<"sp_1 = "<<sp_1<<endl;
// 智能指针转换为普通指针
int *ord_ptr_1 = sp_1.get();
cout<<"ord_ptr_1 = "<<ord_ptr_1<<endl;
}
  • 独占指针与普通指针之间的转换
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#include <iostream>
#include <memory>
using namespace std;

int main(){
int *a;
int c = 100;
a = &c;
// 通过指针构造
unique_ptr<int> up_1(new int(999));
// 通过指向类型的构造函数赋值,注意这里是数值不是指针,非常常用
unique_ptr<int> up_4 = std::make_unique<int>(10);

// 普通指针转换为独占指针
unique_ptr<int> up_5;
up_5.reset(a);
cout<<"up_5 = "<<up_5<<endl;

// 普通转为独占
int* rawPtr = new int(42);
std::unique_ptr<int> smartPtr(rawPtr);
std::cout << *smartPtr << std::endl; // 输出 42

// 独占指针转换为普通指针
int *ord_ptr_1 = up_1.get();
cout<<"ord_ptr_1 = "<<ord_ptr_1<<endl;
int *ord_ptr_5 = up_5.release();
cout<<"ord_ptr_5 = "<<ord_ptr_5<<endl;
}

第一个任务 Seq_Scan

  • SeqScanPlanNode可以计划使用SELECT * FROM table语句。
  • SeqScanExecutor遍历表并逐个返回其元组

是一个入门算子,只需要实现Init函数和next函数

分析任务

遍历一整个表,只需要两个信息:1. 当前遍历到哪个位置 2. 遍历到哪个表

数据库图形表示

image-20230827212207528

image-20230827212229605

实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
SeqScanExecutor::SeqScanExecutor(ExecutorContext *exec_ctx, const SeqScanPlanNode *plan)
: AbstractExecutor(exec_ctx), plan_(plan) {
table_info_ = exec_ctx_->GetCatalog()->GetTable(plan_->GetTableOid());
table_iterator_ = table_info_->table_->Begin(exec_ctx_->GetTransaction());
}

void SeqScanExecutor::Init() {}

auto SeqScanExecutor::Next(Tuple *tuple, RID *rid) -> bool {
if (table_iterator_ != table_info_->table_->End()) {
*tuple = *table_iterator_;
*rid = table_iterator_->GetRid();
++table_iterator_;
return true;
}
return false;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
class SeqScanExecutor : public AbstractExecutor {
public:
/**
* Construct a new SeqScanExecutor instance.
* @param exec_ctx The executor context
* @param plan The sequential scan plan to be executed
*/
SeqScanExecutor(ExecutorContext *exec_ctx, const SeqScanPlanNode *plan);

/** Initialize the sequential scan */
void Init() override;

/**
* Yield the next tuple from the sequential scan.
* @param[out] tuple The next tuple produced by the scan
* @param[out] rid The next tuple RID produced by the scan
* @return `true` if a tuple was produced, `false` if there are no more tuples
*/
auto Next(Tuple *tuple, RID *rid) -> bool override;

/** @return The output schema for the sequential scan */
auto GetOutputSchema() const -> const Schema & override { return plan_->OutputSchema(); }

private:
/** The sequential scan plan node to be executed */
const SeqScanPlanNode *plan_;
TableIterator table_iterator_ = {nullptr, RID(), nullptr};
TableInfo *table_info_;
};

输入参数:

  • ExecutorContext 执行器上下文
    • GetBufferPoolManager 缓冲池管理器
    • GetCatalog 系统表
    • GetTransaction 事务
    • GetLogManager 日志管理器
    • GetLockManager 锁管理器
    • GetTransactionManager 事务管理器
  • SeqScanPlanNode 顺序扫描执行节点
    • GetType 类型
    • GetTableOid 表的唯一ID

第二个任务Insert

可以根据INSERT语句来规划InsertPlanNode

实现思路

  • 目标:把一个tuple插入到table中
  • 分解需求:
    • 哪个tuple
    • 哪张表 tableInfo
    • 插到哪儿 rid

实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
InsertExecutor::InsertExecutor(ExecutorContext *exec_ctx, const InsertPlanNode *plan,
std::unique_ptr<AbstractExecutor> &&child_executor)
: AbstractExecutor(exec_ctx), plan_{plan}, child_executor_{std::move(child_executor)} {
this->table_info_ = this->exec_ctx_->GetCatalog()->GetTable(plan_->table_oid_);
}

void InsertExecutor::Init() {
child_executor_->Init();
table_indexes_ = exec_ctx_->GetCatalog()->GetTableIndexes(table_info_->name_);
}

auto InsertExecutor::Next([[maybe_unused]] Tuple *tuple, RID *rid) -> bool {
if (is_end_) {
return false;
}
Tuple to_insert_tuple{};
RID emit_rid;
int32_t insert_count = 0;

while (child_executor_->Next(&to_insert_tuple, &emit_rid)) {
bool inserted = table_info_->table_->InsertTuple(to_insert_tuple, rid, exec_ctx_->GetTransaction());

if (inserted) {
std::for_each(table_indexes_.begin(), table_indexes_.end(),
[&to_insert_tuple, &rid, &table_info = table_info_, &exec_ctx = exec_ctx_](IndexInfo *index) {
index->index_->InsertEntry(to_insert_tuple.KeyFromTuple(table_info->schema_, index->key_schema_,
index->index_->GetKeyAttrs()),
*rid, exec_ctx->GetTransaction());
});
insert_count++;
}
}
std::vector<Value> values{};
values.reserve(GetOutputSchema().GetColumnCount());
values.emplace_back(TypeId::INTEGER, insert_count);
*tuple = Tuple{values, &GetOutputSchema()};
is_end_ = true;
return true;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
class InsertExecutor : public AbstractExecutor {
public:
/**
* Construct a new InsertExecutor instance.
* @param exec_ctx The executor context
* @param plan The insert plan to be executed
* @param child_executor The child executor from which inserted tuples are pulled
*/
InsertExecutor(ExecutorContext *exec_ctx, const InsertPlanNode *plan,
std::unique_ptr<AbstractExecutor> &&child_executor);

/** Initialize the insert */
void Init() override;

/**
* Yield the number of rows inserted into the table.
* @param[out] tuple The integer tuple indicating the number of rows inserted into the table
* @param[out] rid The next tuple RID produced by the insert (ignore, not used)
* @return `true` if a tuple was produced, `false` if there are no more tuples
*
* NOTE: InsertExecutor::Next() does not use the `rid` out-parameter.
* NOTE: InsertExecutor::Next() returns true with number of inserted rows produced only once.
*/
auto Next([[maybe_unused]] Tuple *tuple, RID *rid) -> bool override;

/** @return The output schema for the insert */
auto GetOutputSchema() const -> const Schema & override { return plan_->OutputSchema(); };

private:
/** The insert plan node to be executed*/
const InsertPlanNode *plan_;

const TableInfo *table_info_;

std::unique_ptr<AbstractExecutor> child_executor_;
std::vector<IndexInfo *> table_indexes_;
bool is_end_{false};

第三个任务Delete

使用delete来规划函数

实现思路

  • 目标在表中删除几个tuple
    • tuple获取:从子节点next()获取
    • 要对哪个表 :从Tableinfo里面查找
    • 表相关联的索引怎么找:系统表中有这些信息

实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
class DeleteExecutor : public AbstractExecutor {
public:
/**
* Construct a new DeleteExecutor instance.
* @param exec_ctx The executor context
* @param plan The delete plan to be executed
* @param child_executor The child executor that feeds the delete
*/
DeleteExecutor(ExecutorContext *exec_ctx, const DeletePlanNode *plan,
std::unique_ptr<AbstractExecutor> &&child_executor);

/** Initialize the delete */
void Init() override;

/**
* Yield the number of rows deleted from the table.
* @param[out] tuple The integer tuple indicating the number of rows deleted from the table
* @param[out] rid The next tuple RID produced by the delete (ignore, not used)
* @return `true` if a tuple was produced, `false` if there are no more tuples
*
* NOTE: DeleteExecutor::Next() does not use the `rid` out-parameter.
* NOTE: DeleteExecutor::Next() returns true with the number of deleted rows produced only once.
*/
auto Next([[maybe_unused]] Tuple *tuple, RID *rid) -> bool override;

/** @return The output schema for the delete */
auto GetOutputSchema() const -> const Schema & override { return plan_->OutputSchema(); };

private:
/** The delete plan node to be executed */
const DeletePlanNode *plan_;
/** The child executor from which RIDs for deleted tuples are pulled */
std::unique_ptr<AbstractExecutor> child_executor_;

const TableInfo *table_info_;

std::vector<IndexInfo *> table_indexes_;
bool is_end_{false};
};
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
DeleteExecutor::DeleteExecutor(ExecutorContext *exec_ctx, const DeletePlanNode *plan,
std::unique_ptr<AbstractExecutor> &&child_executor)
: AbstractExecutor(exec_ctx), plan_{plan}, child_executor_{std::move(child_executor)} {
this->table_info_ = this->exec_ctx_->GetCatalog()->GetTable(plan_->table_oid_);
}

void DeleteExecutor::Init() {
child_executor_->Init();
table_indexes_ = exec_ctx_->GetCatalog()->GetTableIndexes(table_info_->name_);
}

auto DeleteExecutor::Next([[maybe_unused]] Tuple *tuple, RID *rid) -> bool {
if (is_end_) {
return false;
}
Tuple to_delete_tuple{};
RID emit_rid;
int32_t delete_count = 0;

while (child_executor_->Next(&to_delete_tuple, &emit_rid)) {
bool deleted = table_info_->table_->MarkDelete(emit_rid, exec_ctx_->GetTransaction());

if (deleted) {
std::for_each(table_indexes_.begin(), table_indexes_.end(),
[&to_delete_tuple, &rid, &table_info = table_info_, &exec_ctx = exec_ctx_](IndexInfo *index) {
index->index_->DeleteEntry(to_delete_tuple.KeyFromTuple(table_info->schema_, index->key_schema_,
index->index_->GetKeyAttrs()),
*rid, exec_ctx->GetTransaction());
});
delete_count++;
}
}
std::vector<Value> values{};
values.reserve(GetOutputSchema().GetColumnCount());
values.emplace_back(TypeId::INTEGER, delete_count);
*tuple = Tuple{values, &GetOutputSchema()};
is_end_ = true;
return true;
}

第四个任务 索引扫描 index_scan

迭代索引以检索元组的RID,使用这些RID在相应的表中检索它们的元组,最后逐个发出这些元组

实现思路

  • 目标 按照索引顺序输出所有tuple数据

  • 索引的迭代器:rid

    表的信息:tuple

实现步骤

  • 获取索引的迭代器
  • 从迭代器中获取rid
  • 通过rid拿到对应的tuple数据
  • 返回tuple和rid

再着重介绍一下火山模型

火山模型是数据库界已经很成熟的解释计算模型,该计算模型将关系代数中每一种操作抽象为一个 Operator,将整个 SQL 构建成一个 Operator 树,从根节点到叶子结点自上而下地递归调用 next() 函数。

在SQL中,有如下例子:

1
2
3
SELECT Id, Name, Age, (Age - 30) * 50 AS Bonus
FROM People
WHERE Age > 30

image-20230828215624962

对应火山模型如上所示,其中:

User:客户端;

Project:垂直分割(投影),选择字段;

Select(或 Filter):水平分割(选择),用于过滤行,也称为谓词;

Scan:扫描数据。

这里包含了 3 个 Operator,首先 User 调用最上方的 Operator(Project)希望得到 next tuple,Project 调用子节点(Select),而 Select 又调用子节点(Scan),Scan 获得表中的 tuple 返回给 Select,Select 会检查是否满足过滤条件,如果满足则返回给 Project,如果不满足则请求 Scan 获取 next tuple。Project 会对每一个 tuple 选择需要的字段或者计算新字段并返回新的 tuple 给 User。当 Scan 发现没有数据可以获取时,则返回一个结束标记告诉上游已结束。

聚集

针对group by、count等函数

任务目标

  • group by
  • 聚集
    • min()
    • max()
    • sum()
    • count(列名)
    • count(*)

先分组再聚集

样例

  • 普通表
1
select sum(age) from table group by grade;
姓名name 年龄age 等级grade
张三 18
李四 19
王二 20
麻子 18
张麻子 45
  • group by
    • 【上】
      • 张三 18
      • 张麻子 45
    • 【中】
      • 李四 19
    • 【下】
      • 王二 20
      • 麻子 18
  • 新插入一条数据
    • 孔乙己 50 上
  • 聚集SUM(age)
    • 累加对应的属性列18+45+50

sort limit topn

不想写了