Can HBase store unstructured data?

Software infrastructure

HBase

HBase is the Java-based database of Hadoop, an open source project of the Apache Software Foundation. The Hadoop project comprises several interrelated technologies for the robust processing of large amounts of data on computer clusters. HBase's data model is similar to Google's BigTable, which is why HBase is also known as a BigTable clone. Like BigTable, HBase provides strong consistency, so that, as with a relational database, all users see the same data at all times. As in relational databases, the data model is based on tables: an HBase table is a sorted list of rows which, in contrast to the relational model, can each have a variable number of columns. Functionally related columns are grouped into column families for performance reasons, and each family is saved in separate files. This model is much simpler than the relational model but more powerful than a key-value store, so complex data models can also be mapped without sacrificing too much performance.
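The table model described above can be illustrated with a small in-memory sketch. This is a toy stand-in, not the HBase client API: the class `ToyTable`, its methods, and the family and column names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase-style table (NOT the HBase client API)."""

    def __init__(self, families):
        # Column families are declared up front; columns inside them are not.
        self.families = set(families)
        # row key -> {(family, qualifier): value}
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key):
        # Columns are returned sorted by (family, qualifier) within the row.
        return dict(sorted(self.rows.get(row_key, {}).items()))

    def scan(self):
        # Rows are returned in row-key order.
        for key in sorted(self.rows):
            yield key, self.get(key)


table = ToyTable(families=["info", "stats"])
table.put("article-002", "info", "title", "HBase basics")
table.put("article-001", "info", "title", "Hadoop overview")
table.put("article-001", "info", "author", "A. Writer")  # column on this row only
table.put("article-001", "stats", "views", 17)

row_keys = [key for key, _ in table.scan()]
```

The two rows hold different numbers of columns, and the scan returns them in row-key order regardless of the order in which they were written.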

Strictly speaking, HBase does not store its data itself but relies on the distributed file system HDFS (Hadoop Distributed File System). Because HDFS stores the data redundantly and distributes it across different servers, HBase tolerates hardware failures. HBase keeps a configurable number of versions of each record. Each write operation adds a new version instead of overwriting the old one; this is known as an incremental update. Versions that are too old, or that exceed the configured number, can be deleted automatically. HBase is a sorted data store: the rows in a table are sorted by row key, the columns within a row by name, and the values within a column by version. This sorting allows very fast index-based read and write operations.
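Versioning and sorted row keys, as described above, can be sketched in a few lines. This is a simplified model under stated assumptions, not HBase's storage engine: `VersionedCell`, `range_scan`, and the key format `user#0001` are hypothetical names used only for illustration.

```python
import bisect


class VersionedCell:
    """Keeps at most max_versions values, newest first (toy model only)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        # A write adds a new version instead of overwriting the old one.
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        # Surplus versions are dropped automatically.
        del self.versions[self.max_versions:]

    def get(self, n=1):
        return self.versions[:n]


def range_scan(sorted_keys, start, stop):
    """Return all keys in [start, stop) using binary search on sorted keys."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


cell = VersionedCell(max_versions=2)
cell.put(100, "draft")
cell.put(200, "reviewed")
cell.put(300, "published")

keys = sorted(["user#0003", "order#0001", "user#0001", "user#0002"])
# "$" sorts directly after "#", so this range covers every key starting with "user#".
users = range_scan(keys, "user#", "user$")
```

Because the keys are sorted, a range of related rows is located with two binary searches and then read sequentially, which is the reason scans over a key range are cheap in a sorted store.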

HBase is good for storing counters

HBase makes sense when the columns are relatively small and many incremental updates have to be performed. For storing large binary data, using HDFS directly is more suitable. While HBase is optimized for low-latency queries, HDFS is designed for high data throughput. Because the keys are sorted, scanning and sequential reading are more efficient than random point queries. Write operations are the fastest of all. This is why HBase is suitable for storing counters, and real-time analyses can be built on top of these counters. In relational databases, incrementing a counter is expensive because the counter has to be read, modified and written back. Because incremental updates are so efficient in HBase, a large number of counters can be used at the same time. For example, the views of individual articles in an online magazine can be counted in order to find out in real time which articles were most popular in the past hour.

The columns of a table are not fixed and can differ from row to row. This schema-less data model is well suited to unstructured or semi-structured data. HBase is particularly suitable for records in which not all fields have values, because fields without values do not require any storage space.
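The article-view example above can be sketched as follows. This is a minimal in-memory illustration, not HBase code: the class `ViewCounters`, the per-minute bucketing, and the article names are assumptions made for the example.

```python
from collections import Counter


class ViewCounters:
    """Toy per-minute view counters (illustrative only, no HBase involved)."""

    def __init__(self):
        self.buckets = {}  # (article_id, minute) -> count

    def increment(self, article_id, ts_seconds, amount=1):
        # One counter per article and minute; an increment only touches that cell.
        key = (article_id, ts_seconds // 60)
        self.buckets[key] = self.buckets.get(key, 0) + amount
        return self.buckets[key]

    def top_articles(self, now_seconds, window_seconds=3600, n=3):
        first_minute = (now_seconds - window_seconds) // 60
        last_minute = now_seconds // 60
        totals = Counter()
        for (article_id, minute), count in self.buckets.items():
            if first_minute <= minute <= last_minute:
                totals[article_id] += count
        return totals.most_common(n)


counters = ViewCounters()
now = 10_000
for _ in range(3):
    counters.increment("article-a", 9_900)
counters.increment("article-b", 9_950)
counters.increment("article-c", 1_000, amount=5)  # older than one hour

top = counters.top_articles(now)
```

Only counters that received views exist at all, which mirrors the point that fields without values take up no storage.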

Hadoop MapReduce and Hive can be used in combination with HBase to perform analyses and complex queries, which are then distributed across all nodes in the cluster. Configuration can be carried out on the command line with the HBase Shell. HBase can be accessed either through the Thrift RPC API, which supports various programming languages, or through the JSON RESTful API over HTTP. HBase is part of the active Hadoop community and is developing rapidly. Companies such as Facebook, eBay and NAVTEQ use HBase successfully. Cloudera offers a dedicated Hadoop distribution as well as support for companies that want to run Hadoop in production. Further information on HBase can be found on the project website http://hbase.apache.org/.
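As a rough illustration of the JSON access path mentioned above, the sketch below decodes a row payload. It assumes the REST gateway returns rows as a `Row` list whose keys, column names and values (under `$`) are Base64-encoded; that shape should be checked against the REST documentation of the HBase version in use. The sample payload is built locally, so no server is contacted, and the helper names `b64` and `decode_rows` are hypothetical.

```python
import base64
import json


def b64(text):
    """Base64-encode a string, as the assumed payload format requires."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_rows(payload):
    """Turn an assumed REST row payload into {row_key: {column: (timestamp, value)}}."""
    rows = {}
    for row in payload["Row"]:
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row["Cell"]:
            column = base64.b64decode(cell["column"]).decode("utf-8")
            value = base64.b64decode(cell["$"]).decode("utf-8")
            cells[column] = (cell["timestamp"], value)
        rows[key] = cells
    return rows


# Locally constructed sample; the field names are an assumption, not server output.
sample = json.dumps({
    "Row": [{
        "key": b64("article-001"),
        "Cell": [{
            "column": b64("info:title"),
            "timestamp": 1700000000000,
            "$": b64("Hadoop overview"),
        }],
    }]
})

decoded = decode_rows(json.loads(sample))
```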

Advantages

+ Works with strong consistency, like an RDBMS

+ Can perform incremental updates very efficiently

+ Supports analyses of huge amounts of data with MapReduce and Hive

Disadvantages

- Unsuitable for large binary data

- Without MapReduce or Hive, data can only be queried directly by its primary key (the row key)
