November 30th, 2020

Meeting the New Challenges of Distributed Storage Systems
-- Part 2 of 2


In our October newsletter, we covered the background behind the rise of distributed storage and the application of software-defined storage in distributed storage. In this newsletter, we introduce three types of data storage architecture and some open-source software used in distributed storage, to help readers better understand the challenges of distributed storage and how to plan and manage a distributed storage system.

Distributed Storage Architecture: Blocks, Files and Objects

Based on the way data is formatted, organized, and presented, there are three types of data storage architecture used by distributed storage: block, file, and object storage. Not surprisingly, they differ significantly in their capabilities and limitations. To help you understand these differences, this section highlights the advantages and disadvantages of each, which should be weighed when planning a more complete distributed storage system.

Block storage primarily maps raw (bare) disk space to the host, which then partitions and formats it for use. The advantage is that multiple inexpensive hard disks can be combined into one large-capacity logical disk that serves external hosts. In addition to increasing total capacity, the hard disks that make up the logical disk can be written to in parallel, which improves efficiency. When block storage is deployed in a SAN architecture, the high transmission rate and efficient transmission protocol of the SAN further increase transfer and read/write speeds. The disadvantage of block storage is that when the hosts use different file systems, or the host servers are not clustered, data cannot be shared among those hosts.
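To make the parallel-write idea concrete, here is a minimal sketch of round-robin striping, one common way to spread a logical disk across several physical disks. The function names (locate_block, plan_write) and the one-block stripe size are assumptions made for illustration, not a description of any particular product.

```python
from typing import List, Tuple


def locate_block(logical_block: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)."""
    if num_disks <= 0:
        raise ValueError("num_disks must be positive")
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    # Round-robin striping: consecutive logical blocks land on consecutive disks.
    return logical_block % num_disks, logical_block // num_disks


def plan_write(first_block: int, block_count: int, num_disks: int) -> List[List[int]]:
    """Group the logical blocks of one write request by target disk.

    Each inner list can be sent to its disk independently, which is
    where the parallelism comes from.
    """
    per_disk: List[List[int]] = [[] for _ in range(num_disks)]
    for block in range(first_block, first_block + block_count):
        disk, _ = locate_block(block, num_disks)
        per_disk[disk].append(block)
    return per_disk
```

With four disks, a write of eight consecutive blocks touches every disk exactly twice, so in this simplified model each disk handles only a quarter of the request.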

File storage, unlike block storage, has its own file management system. The host does not need to format the storage; it uploads and downloads files directly. The file system also manages user permissions, file locking, and other security measures. It allows multiple users to access files at the same time, which avoids block storage’s inability to share data. Its main advantages are the use of standard file transfer protocols, good scalability, low price, and ease of management. But because it runs over Ethernet, upload and download speeds are slower. Compared with block storage, where dozens or even a hundred hard drives work in parallel, file storage is much slower. The poor performance caused by low network bandwidth and high latency also makes it undesirable for high-performance clusters.

Today, most data used in businesses and even government is unstructured. Given that both block storage and file storage have prominent limitations when processing large data volumes, object storage has emerged to address these shortcomings. Unlike the other two methods of data organization, object storage separates metadata from the application data. When accessing data in object storage, the metadata server is consulted first to identify the server on which the target object is located, which shortens data access time. Compared with the other two storage methods, its main advantage is that object storage lets the application manage data objects itself. This means that object storage does not require a conventional file system, which is why it is easier to expand than the other two; it overcomes the shortcomings of block storage and file storage while combining the advantages of both.
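The two-step access path described above (look up the metadata first, then fetch the object from the server it names) can be sketched as follows. This is a simplified in-memory model; the class names, the put/get methods, and the node names are all hypothetical and do not correspond to any specific object storage API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ObjectMetadata:
    """Metadata kept apart from the object's payload."""
    server: str                      # which storage server holds the data
    size: int                        # payload size in bytes
    attributes: Dict[str, str] = field(default_factory=dict)


class MetadataServer:
    """Hypothetical index from object ID to metadata."""

    def __init__(self) -> None:
        self._index: Dict[str, ObjectMetadata] = {}

    def register(self, object_id: str, meta: ObjectMetadata) -> None:
        self._index[object_id] = meta

    def lookup(self, object_id: str) -> ObjectMetadata:
        return self._index[object_id]


class ObjectStore:
    """Toy object store: a flat namespace of IDs, no directory tree."""

    def __init__(self, server_names: List[str]) -> None:
        self.metadata = MetadataServer()
        self.servers: Dict[str, Dict[str, bytes]] = {n: {} for n in server_names}

    def put(self, object_id: str, data: bytes, server: str, **attributes: str) -> None:
        # Store the payload on the chosen server, then record where it lives.
        self.servers[server][object_id] = data
        self.metadata.register(
            object_id, ObjectMetadata(server, len(data), dict(attributes))
        )

    def get(self, object_id: str) -> bytes:
        # Step 1: ask the metadata server where the object is.
        meta = self.metadata.lookup(object_id)
        # Step 2: read the payload directly from that server.
        return self.servers[meta.server][object_id]
```

Because objects are addressed by ID in a flat namespace, adding capacity in this model only means adding another entry to the server map, which hints at why object storage tends to be easier to expand.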

Since object storage offers the benefits of both block storage and file storage, why use block storage or file storage in distributed storage at all?

1) Some applications require storage in the form of a mapped bare disk. Take a database, for example: storage disks are first mapped to the database, and the database then formats them for its exclusive use; disks that have already been formatted by another file system cannot be used. Block storage is more suitable for this type of application.

2) Object storage requires special object storage software and large-capacity hard drives, which makes it more expensive than file storage. Therefore, for file-sharing applications that do not involve massive volumes of data, file storage is more cost-effective.

From this discussion, we see that block storage, file storage, and object storage each have their own characteristics. A distributed storage system can be designed flexibly, taking the deployment scenario and the suitability of each method into consideration, to find the combination that best serves the applications.

The following table compares the previous three storage methods:

[Table: Distributed Storage Architecture: Blocks, Files and Objects]

We now turn to a few commonly used open-source software packages for distributed storage: Ceph, GlusterFS, and DRBD.

Ceph. This is an open-source, unified distributed storage system that provides object, block (through RBD), and file storage in a single system, and it can store and manage large amounts of data within an IT infrastructure. Internally, Ceph stores data in the form of objects regardless of the type of data. It can perform data replication, fault detection and recovery, and data migration and rebalancing across cluster nodes, and it protects data across different nodes to help ensure data consistency. It is characterized by high performance, high scalability, and high availability.

GlusterFS. Like Ceph, this is an open-source distributed file system. It uses the concept of "translators" to allow the creation of file systems with various functions, including mirroring and replication, striping, load balancing, disk caching, read-ahead, write-behind, and self-healing. One of its advantages is that it does not require a metadata server; it uses a hashing mechanism to locate data instead. Compared with Ceph, GlusterFS provides faster storage expansion and is easier to scale.
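The idea of locating data by hashing, without a metadata server, can be illustrated with a short sketch. This is not GlusterFS's actual placement algorithm; it is a deliberately simple modulo scheme, and the function name locate_file and the storage-unit names are assumptions made for this example.

```python
import hashlib
from typing import List


def locate_file(path: str, storage_units: List[str]) -> str:
    """Pick the storage unit for a file purely from its path.

    Any client that knows the list of storage units computes the same
    answer, so no central metadata lookup is needed.
    """
    if not storage_units:
        raise ValueError("at least one storage unit is required")
    # Use a stable hash (Python's built-in hash() varies between runs).
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(storage_units)
    return storage_units[index]
```

A plain modulo scheme like this one would relocate most files whenever a storage unit is added or removed, so real systems generally use more careful schemes, such as assigning hash ranges to storage units or consistent hashing, to limit data movement.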

Distributed Replicated Block Device (DRBD). This is a software-based, shared-nothing, replicated block storage solution for Linux. It mirrors block devices (hard disks, partitions, logical volumes, etc.) between servers across a network, similar to a network RAID 1. Using DRBD as a component in a high availability (HA) solution can eliminate the need for a shared disk array. The data on the local host (primary node) and the remote host (standby node) can be synchronized in real time. When the local host fails, the application can continue running using the copy of the same data on the remote host.
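The mirroring and failover behavior described above can be modeled with a small in-memory sketch. DRBD itself works at the block-device level between real servers; the class below only imitates the idea, and its name, methods, and node labels are hypothetical.

```python
from typing import Dict, List, Set


class MirroredBlockDevice:
    """Toy model of a network RAID 1: every write goes to both nodes."""

    def __init__(self, num_blocks: int, block_size: int = 512) -> None:
        self.block_size = block_size
        empty = bytes(block_size)
        # Each node keeps its own full copy of the device (shared-nothing).
        self.replicas: Dict[str, List[bytes]] = {
            "primary": [empty] * num_blocks,
            "standby": [empty] * num_blocks,
        }
        self.healthy: Set[str] = {"primary", "standby"}
        self.active = "primary"

    def write(self, block: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        # Mirror the write to every node that is still reachable.
        for node in self.healthy:
            self.replicas[node][block] = data

    def read(self, block: int) -> bytes:
        # Reads are served by whichever node is currently active.
        return self.replicas[self.active][block]

    def fail_over(self) -> None:
        """Simulate loss of the primary: the standby takes over."""
        self.healthy.discard("primary")
        self.active = "standby"
```

After fail_over() is called, reads return the data written before the failure, because the standby already holds an identical copy. Resynchronizing the failed node once it comes back is left out of this sketch.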

Shown below are the characteristics of these three distributed storage software applications.

Want more information?

If you are interested in learning more about distributed storage or constructing innovative storage solutions after reading this article, please contact our company through the contact information below in this newsletter.

