
Research Group of Distributed Computing and Systems

Who are we and what do we do?

DISCOS is a leading research lab in the Department of Electrical & Computer Engineering at Concordia University. Our key areas of research include consensus algorithms, consistency models, transactions, coordination, synchronization, resource allocation, and fault tolerance.

At DISCOS, we are devoted to designing core components of distributed computing that advance performance, scalability, and availability. We rigorously prove the correctness of these components and implement them in real-world distributed systems (e.g., physically distributed servers or cloud environments). We then evaluate their performance using standard benchmarks and real-world workloads.

Our research is driven by a commitment to impact both academia and industry. By tackling fundamental challenges in distributed systems, our work supports a wide range of applications, including distributed machine learning, blockchains, cloud computing, and data management.​​

The "ABCDs" of Our Research Directions

Our research contributes to both the theoretical foundations and practical applications of distributed computing and systems.

 

Our primary research directions include, but are not limited to:

  • AI-supporting DS

  • Blockchains

  • Cloud computing

  • Data management​​​

AI-supporting DS

The astronomical growth in the complexity of data and models drives the need for distributed training across multiple machines. Our research in AI-supporting distributed systems focuses on the efficiency and robustness of distributed training processes. For example, in large-scale data-parallel training, where data and computation are spread across multiple nodes, we aim to address the following key questions:

  • How can we design and implement more efficient synchronization mechanisms to minimize communication overhead and enable faster model convergence?

  • How can we improve fault tolerance and availability for general distributed training processes, mitigating the impact of single points of failure?
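To make the first question above concrete, below is a minimal, hypothetical sketch of one well-known way to trade communication for computation: periodic ("local SGD"-style) model averaging, simulated with NumPy on a toy least-squares problem. The worker layout, learning rate, and synchronization interval are illustrative assumptions, not a DISCOS mechanism.

```python
# Minimal sketch (hypothetical): data-parallel training where workers run
# local SGD steps and synchronize (average models) only every few steps,
# reducing the number of communication rounds.
import numpy as np

def local_gradient(w, shard):
    # Gradient of a toy least-squares objective on this worker's data shard.
    X, y = shard
    return X.T @ (X @ w - y) / len(y)

def train(shards, dim, steps=100, lr=0.1, sync_interval=5):
    """Each worker keeps a local model copy; copies are averaged (an
    all-reduce in a real system) once every `sync_interval` steps."""
    models = [np.zeros(dim) for _ in shards]
    for step in range(1, steps + 1):
        for i, shard in enumerate(shards):      # local computation, no communication
            models[i] -= lr * local_gradient(models[i], shard)
        if step % sync_interval == 0:           # periodic synchronization round
            avg = np.mean(models, axis=0)
            models = [avg.copy() for _ in models]
    return np.mean(models, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=4)
    shards = []
    for _ in range(3):                          # three simulated workers
        X = rng.normal(size=(50, 4))
        shards.append((X, X @ w_true + 0.01 * rng.normal(size=50)))
    print("recovered weights:", np.round(train(shards, dim=4), 2))
```

Increasing the synchronization interval cuts communication rounds but can slow or bias convergence; quantifying and shrinking that gap is exactly the kind of trade-off the first question targets.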

Blockchains

Blockchain systems, where multiple nodes collectively agree on system-wide operations, are fundamentally distributed applications. Their performance is closely tied to their consensus mechanisms. Following Lampson's principle for system design, "the normal case must be fast; the worst case must make some progress," our research has extensively studied and will continue to investigate the following key aspects:

  • (Make the normal case faster): Improve the throughput and reduce the latency of consensus algorithms (e.g., weighted consensus, DAG-based consensus).

  • (Make the worst case progress): Optimize the coordination process, including leader election, to increase system availability, especially under Byzantine failures (e.g., reputation-based leader election). A small sketch of these ideas follows this list.
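As a toy illustration of both ideas, the sketch below elects a leader by reputation and commits a proposal once a reputation-weighted quorum of votes is collected. All values, thresholds, and names are hypothetical; this is not the mechanism used in Prosecutor, ESCAPE, or PrestigeBFT, and a real protocol must size its quorums carefully to stay safe under Byzantine faults.

```python
# Hypothetical sketch: reputation-based leader election plus a
# reputation-weighted quorum check (instead of a plain vote count).
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    reputation: float  # e.g., decays after suspected faulty behavior

def elect_leader(replicas):
    """Pick the candidate with the highest reputation (ties broken by name)."""
    return max(replicas, key=lambda r: (r.reputation, r.name)).name

def weighted_quorum(votes, replicas, threshold=2 / 3):
    """Commit once the voters' combined reputation exceeds `threshold`
    of the total reputation weight in the system."""
    total = sum(r.reputation for r in replicas)
    voted = sum(r.reputation for r in replicas if r.name in votes)
    return voted > threshold * total

replicas = [Replica("a", 1.0), Replica("b", 0.9), Replica("c", 0.2), Replica("d", 1.0)]
print(elect_leader(replicas))                      # "d" (ties broken by name)
print(weighted_quorum({"a", "b", "d"}, replicas))  # True: 2.9 > 2/3 * 3.1
```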

Read our work on distributed systems and blockchains

  1. Reputation-based consensus: Prosecutor [Middleware'21], PrestigeBFT [ICDE'24]

  2. Weighted consensus: ESCAPE [ICDCS'22]

  3. Blockchains for Vehicles: V-Guard

  4. A survey on state-of-the-art BFT algorithms

Cloud computing

Cloud computing, which provides an abstraction of underlying computation, storage, and networking resources, is a key application of distributed systems. In today's cloud-native applications, resource sharing and management are pivotal factors influencing performance, scalability, and availability. In both traditional cloud computing and the emerging field of sky computing, which involves multiple clouds, our research focuses on the following key areas:

  • Cost-efficient resource management for cloud computing: Exploring hybrid scaling strategies for containerized clouds to handle diverse workloads, global-scale scalability, fault tolerance, and failure recovery (a small sketch of hybrid scaling follows this list).

  • Cost optimization for sky computing: Facilitating the interconnection and interoperability of multiple clouds in a heterogeneous architecture (e.g., where computation and storage are separated), including developing cost-aware exchange services between clouds.
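As a rough illustration of the hybrid-scaling idea in the first bullet, the sketch below prefers vertical scaling while a replica still has CPU headroom and falls back to horizontal scaling otherwise. The resource model, thresholds, and names are assumptions made for illustration; they are not taken from Drone or from any cloud provider's API.

```python
# Hypothetical sketch: hybrid (vertical-then-horizontal) scaling decision
# for a containerized service, driven by observed CPU utilization.
from dataclasses import dataclass

@dataclass
class ServiceState:
    replicas: int
    cpu_per_replica: float        # cores currently allocated per replica
    cpu_limit_per_replica: float  # vertical headroom per replica
    utilization: float            # observed CPU utilization, 0.0 - 1.0

def scale_decision(s: ServiceState, target_util: float = 0.6) -> str:
    """Scale vertically while headroom remains (no new instances to warm up),
    then fall back to adding replicas."""
    if s.utilization <= target_util:
        return "no-op"
    needed = s.cpu_per_replica * s.utilization / target_util
    if needed <= s.cpu_limit_per_replica:
        return f"vertical: raise CPU to {needed:.2f} cores per replica"
    extra = s.replicas * (s.utilization / target_util - 1)
    return f"horizontal: add {max(1, round(extra))} replica(s)"

print(scale_decision(ServiceState(replicas=4, cpu_per_replica=1.0,
                                  cpu_limit_per_replica=2.0, utilization=0.9)))
```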

Read our work on cloud computing

  1. Cost-efficient resource allocation: Drone [SoCC'23]

Data management

The exponential growth of big data has fueled the rise of distributed databases. Ensuring more efficient and reliable replication and transaction management in these systems is crucial. Our research focuses on, but is not limited to, two key areas:

  • On-demand consistency for distributed transactions: Developing a versatile consistency model that adapts to the specific needs of each transaction. This model will allow transactions to operate under varying consistency guarantees based on their requirements, ranging from linearizability under crash fault tolerance (CFT) to causal consistency under Byzantine fault tolerance (BFT).

  • Scalable data structures for distributed DBMS: Designing optimal data structures tailored for distributed DBMS that maintain high performance as the system scales. These data structures also enhance fault tolerance and failure recovery, particularly in large-scale systems. A small sketch of one such structure follows this list.
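As a small, self-contained illustration of the second bullet, here is a sketch of one classic scalable structure, a consistent-hash ring with virtual nodes, which keeps key movement small when database servers join or leave. It is a textbook construction chosen for brevity, not a DISCOS design; the node names and parameters are arbitrary.

```python
# Illustrative sketch: a consistent-hash ring with virtual nodes for
# partitioning keys across database servers.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((self._h(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def owner(self, key: str) -> str:
        """A key is owned by the first virtual node clockwise on the ring,
        so adding or removing a server remaps only a small share of keys."""
        idx = bisect.bisect(self._keys, self._h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["db-1", "db-2", "db-3"])
print({k: ring.owner(k) for k in ["user:42", "order:7", "cart:9"]})
```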

The ABCDs are not isolated topics. Their distributed nature leads us to exciting challenges when connecting the dots.

Decentralized AI

​Decentralized AI is a promising and emerging field at the intersection of machine learning and blockchain technology. It moves away from centralized control of data, models, and governance, fostering transparency, trust, and accountability in AI systems.​​

​Research directions:​

  • Fairness assurance: As AI models increasingly influence critical decision-making processes, ensuring fairness is critical in numerous applications. Our research aims to leverage blockchain technology to collaboratively detect and mitigate biases in data and models. By redistributing data using blockchain, we strive to reduce biases and enhance the fairness of AI systems.
     

  • Federated learning: While federated learning has been widely explored, current approaches still rely on centralized coordination during the training process. Our future work seeks to advance this field by exploring the possibility of fully decentralized training, potentially through the development of leaderless training models, to further minimize data sharing and enhance decentralization.
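As a toy illustration of what leaderless training could look like, the sketch below runs gossip-style averaging in which every peer repeatedly averages its parameters with a randomly chosen neighbour, so no central coordinator is involved. The setup is entirely hypothetical and ignores the privacy, fairness, and convergence questions that the directions above actually target.

```python
# Hypothetical sketch: leaderless (gossip-style) parameter averaging.
# Repeated pairwise averaging drives all peers toward the global mean
# without any central server.
import random
import numpy as np

def gossip_round(models, rng):
    """Each peer pairs with one random other peer; both adopt the average."""
    peers = list(models)
    rng.shuffle(peers)
    for p in peers:
        q = rng.choice([x for x in peers if x != p])
        avg = (models[p] + models[q]) / 2
        models[p] = models[q] = avg

rng = random.Random(0)
models = {f"peer{i}": np.array([float(i), float(-i)]) for i in range(4)}
for _ in range(10):
    gossip_round(models, rng)
print({p: np.round(m, 3) for p, m in models.items()})  # all close to [1.5, -1.5]
```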

Distributed Ledger Technologies (DLT)

​Distributed ledger technologies (DLT) bridge blockchains and DBMS, sharing many core technologies. The advancement of DLT can enhance key components of both systems, including coordination, consistency, fault tolerance, and failure recovery.

  • Cross-chain transactions: As the ecosystem of blockchains grows, enabling seamless transactions across different chains opens up new challenges. Our research will develop and implement protocols that facilitate secure and efficient cross-chain transactions. This includes addressing challenges related to interoperability, consistency, and atomicity, ensuring that assets and data can move freely and safely between distinct ledgers.​
     

  • Database technologies for enhancing DLT performance: Achieving high performance in DLT involves applying advanced database technologies, such as using LSM trees in DAG-based replication and in other aspects of blockchain data management. This research aims to improve data replication, query, and storage efficiency.
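As a concrete illustration of the atomicity challenge behind cross-chain transactions (first bullet above), the sketch below mimics the hashed-timelock (HTLC) pattern often used for atomic swaps: the locked value on each chain can be claimed only by revealing the preimage of a shared hash, or refunded after a timeout. It is a simplified, chain-agnostic simulation, not a protocol we have proposed.

```python
# Hypothetical sketch: a hashed timelock, the building block behind many
# atomic cross-chain swaps. Revealing the secret on one chain lets the
# counterparty claim on the other, which is what makes the swap atomic.
import hashlib
import time

class HTLC:
    def __init__(self, hashlock: bytes, timeout: float):
        self.hashlock = hashlock              # sha256(secret)
        self.deadline = time.time() + timeout
        self.state = "locked"

    def claim(self, preimage: bytes) -> bool:
        """Receiver claims before the deadline by revealing the secret."""
        if (self.state == "locked" and time.time() < self.deadline
                and hashlib.sha256(preimage).digest() == self.hashlock):
            self.state = "claimed"
            return True
        return False

    def refund(self) -> bool:
        """Sender recovers the locked value once the deadline has passed."""
        if self.state == "locked" and time.time() >= self.deadline:
            self.state = "refunded"
            return True
        return False

secret = b"opaque-preimage"
lock = hashlib.sha256(secret).digest()
htlc_chain_a = HTLC(lock, timeout=3600)   # same hashlock on both chains
htlc_chain_b = HTLC(lock, timeout=1800)
print(htlc_chain_b.claim(secret), htlc_chain_a.claim(secret))  # True True
```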

Vector databases

Vector databases are in the spotlight due to the rising volume of vectorized data in AI model training. Handling and optimizing large-scale vector storage and queries are crucial for machine learning applications, particularly for LLMs.

  • Efficient partitioning and vector search: Our research focuses on optimizing partitioning algorithms that better support vector search. Each vector search effectively acts as a classifier over high-dimensional partitions, so partitioning quality directly determines search speed and accuracy. This work will enable faster and more accurate searches (a small sketch of partitioned search follows this list).
    ​

  • More scalable architecture: Vector databases are inherently column-based, which provides advantages in handling vectorized data. Our research aims to develop more scalable architectures to enhance the ability of vector databases to manage increasing volumes of data without significant degradation in performance.​
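As a small illustration of partition-based vector search (first bullet above), the sketch below clusters vectors into coarse partitions with k-means and answers a query by scanning only the few partitions closest to it, in the spirit of IVF-style indexes. The dimensions, cluster count, and probe setting are arbitrary assumptions, and this is not one of our systems.

```python
# Illustrative sketch: IVF-style partitioned vector search. Vectors are
# grouped into coarse partitions; a query scans only the nearest partitions
# instead of the whole collection.
import numpy as np

def build_partitions(vectors, k=8, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):  # plain k-means
        assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids, axis=2), axis=1)
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def search(query, vectors, centroids, assign, nprobe=2, topk=3):
    """Scan only the `nprobe` partitions whose centroids are closest to the query."""
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.where(np.isin(assign, nearest))[0]
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:topk]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 16)).astype(np.float32)
centroids, assign = build_partitions(vectors)
print(search(vectors[0], vectors, centroids, assign))  # nearest neighbours of vector 0
```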

Coordination-as-a-Service

​Coordination-as-a-Service standardizes and streamlines the processes of replicating and executing distributed transactions, supporting transaction-level execution preferences, such as consistency, synchronization, and fault tolerance guarantees. While these functions are widely used across various applications (e.g., blockchains and cloud databases), they are developed in isolation and do not allow for customized execution controls. For instance, some transactions may prioritize strong consistency over fault tolerance, while others may do the reverse. Current solutions lack the adaptability to cater to each transaction's specific execution preferences. Our research addresses this challenge by enabling Coordination-as-a-Service with the following features:

  • The "Lego" of execution services: This approach will enable transaction-level execution preferences. The execution plan can be customized by assembling services provided in the standard library, offering fine-grained control over each transaction. This modular design provides versatile guarantees while enhancing performance.​
    ​

  • Automated verification: After assembling the execution services, automated verification will be applied to guarantee the correctness of each customized execution and show what properties can be obtained. This feature safeguards the integrity and reliability of distributed transactions.
     

  • Cloud-native and easy migration: The standardized library is designed to support services developed in containerized environments, facilitating easy migration of distributed transactions across different cloud computing platforms.
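To make the "Lego" idea above more tangible, here is a hypothetical sketch in which an execution plan is assembled from modular services, each advertising the guarantees it provides, and a simple coverage check stands in for the automated-verification step by rejecting plans that cannot satisfy the requested properties. The service names and property labels are invented for illustration only.

```python
# Hypothetical sketch: assembling a per-transaction execution plan from a
# library of coordination services, then checking that the plan covers all
# requested guarantees (a stand-in for automated verification).
from dataclasses import dataclass

@dataclass(frozen=True)
class Service:
    name: str
    provides: frozenset  # guarantees contributed by this building block

LIBRARY = [
    Service("bft-broadcast", frozenset({"total-order", "byzantine-fault-tolerance"})),
    Service("raft-log", frozenset({"total-order", "crash-fault-tolerance"})),
    Service("causal-layer", frozenset({"causal-order"})),
    Service("durable-store", frozenset({"durability"})),
]

def assemble(required: set) -> list:
    """Greedily pick services until every requested property is covered."""
    plan, covered = [], set()
    for svc in LIBRARY:
        gained = (svc.provides & required) - covered
        if gained:
            plan.append(svc)
            covered |= gained
    missing = required - covered
    if missing:  # the verification step would reject this plan
        raise ValueError(f"no service provides: {missing}")
    return plan

plan = assemble({"total-order", "byzantine-fault-tolerance", "durability"})
print([svc.name for svc in plan])  # ['bft-broadcast', 'durable-store']
```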

Selected publications

Refereed Conference

​PrestigeBFT: Revolutionizing View Changes in BFT Consensus Algorithms with Reputation Mechanisms.​

Gengrui Zhang, Fei Pan, Sofia Tijanic, and Hans-Arno Jacobsen.

 ICDE'24 

​

Lifting the fog of uncertainties: Dynamic resource orchestration for the containerized cloud.​

Yuqiu Zhang, Tongkun Zhang, Gengrui Zhang, and Hans-Arno Jacobsen.

 SoCC'23 

 

ESCAPE to Precaution against Leader Failures.​

Gengrui Zhang and Hans-Arno Jacobsen.

 ICDCS'22 

​

Prosecutor: An Efficient BFT Consensus Algorithm with Behavior-Aware Penalization against Byzantine Attacks.

Gengrui Zhang and Hans-Arno Jacobsen.

 Middleware'21 ​

Refereed Journal

Reaching Consensus in the Byzantine Empire: A Comprehensive Review of BFT Consensus Algorithms.

Gengrui Zhang, Pan Fei, Yunhao Mao, Sofia Tijanic, Michael Dang'ana, Shashank Motepalli, Shiquan Zhang, and Hans-Arno Jacobsen.

 ACM Computing Surveys 

​

Blockchain for V2X: Applications and Architectures.

James Meijers, Panagiotis Michalopoulos, Shashank Motepalli, Gengrui Zhang, Shiquan Zhang, Andreas Veneris, and Hans-Arno Jacobsen.

 IEEE Open Journal of Vehicular Technology 

​
