Basic

  • Transforming HBase into a Relational Database
  • Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications (typically backed by HBase) by combining:
    • the power of standard SQL and JDBC APIs with full ACID transaction capabilities and
    • the flexibility of late-bound, schema-on-read capabilities from the NoSQL world by leveraging HBase as its backing store


Phoenix has no standalone server process: on the server side you only add a JAR to each HBase RegionServer and restart HBase; on the client side, adding the Phoenix driver JAR is all that is needed to connect.
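As a minimal sketch of that client-side setup (the ZooKeeper quorum and the table name are placeholders), a connection is just a JDBC URL pointing at the HBase ZooKeeper quorum:

```scala
import java.sql.DriverManager

object PhoenixConnect {
  def main(args: Array[String]): Unit = {
    // The URL names the ZooKeeper quorum of the HBase cluster; there is no
    // separate Phoenix server to point at. "zk1,zk2,zk3:2181" is a placeholder.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    try {
      // Standard JDBC from here on; WEB_STAT is a hypothetical Phoenix table.
      val rs = conn.createStatement().executeQuery(
        "SELECT HOST, DOMAIN FROM WEB_STAT LIMIT 10")
      while (rs.next()) println(s"${rs.getString(1)}  ${rs.getString(2)}")
    } finally conn.close()
  }
}
```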

Phoenix follows the philosophy of bringing the computation to the data by using:

  • coprocessors to perform operations on the server side, thus minimizing client/server data transfer
  • custom filters to prune data as close to the source as possible

In addition, to minimize any startup costs, Phoenix uses native HBase APIs rather than going through the map/reduce framework.
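A hedged illustration of what this means for a query (table and column names are hypothetical): the WHERE clause is evaluated as an HBase filter on the RegionServers, and the GROUP BY aggregation runs inside coprocessors, so only the aggregated rows travel back to the client:

```scala
import java.sql.DriverManager

object ServerSideAggregation {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    try {
      // The predicate on DOMAIN is pushed down as a server-side filter, and
      // COUNT(*) per HOST is computed by the aggregation coprocessor on each
      // RegionServer; the client only merges the partial results.
      val rs = conn.createStatement().executeQuery(
        """SELECT HOST, COUNT(*) AS HITS
          |FROM WEB_STAT
          |WHERE DOMAIN LIKE '%.example.com'
          |GROUP BY HOST""".stripMargin)
      while (rs.next()) println(s"${rs.getString("HOST")}: ${rs.getLong("HITS")}")
    } finally conn.close()
  }
}
```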

Procedure

  • compiling your SQL queries to native HBase scans (a sample EXPLAIN sketch follows this list)
  • determining the optimal start and stop for your scan key
  • orchestrating the parallel execution of your scans
  • bringing the computation to the data by
    • pushing the predicates in your where clause to a server-side filter
    • executing aggregate queries through server-side hooks (called co-processors)
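To see how a statement is compiled into scans, filters, and aggregations, Phoenix's EXPLAIN can be run through plain JDBC. A small sketch (same placeholder table and quorum as above); each returned row is one line of the query plan:

```scala
import java.sql.DriverManager

object ExplainPlan {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    try {
      // The plan lines describe the chosen scan ranges (start/stop keys),
      // any server-side filters, and whether aggregation happens server-side.
      val rs = conn.createStatement().executeQuery(
        "EXPLAIN SELECT HOST, COUNT(*) FROM WEB_STAT " +
          "WHERE DOMAIN = 'foo.example.com' GROUP BY HOST")
      while (rs.next()) println(rs.getString(1))
    } finally conn.close()
  }
}
```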

Enhancements

  • secondary indexes to improve performance for queries on non-row-key columns
  • stats gathering to improve parallelization and guide choices between optimizations
  • skip scan filter to optimize IN, LIKE, and OR queries
  • optional salting of row keys to evenly distribute write load
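Two of these enhancements (salting and secondary indexes) are plain DDL choices. A sketch with made-up table and index names; the bucket count and column layout are arbitrary:

```scala
import java.sql.DriverManager

object DdlExamples {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    try {
      val stmt = conn.createStatement()
      // Salting spreads writes for monotonically increasing row keys across
      // 8 buckets instead of hammering a single region.
      stmt.execute(
        """CREATE TABLE IF NOT EXISTS METRICS (
          |  HOST VARCHAR NOT NULL,
          |  TS   DATE    NOT NULL,
          |  VAL  DOUBLE
          |  CONSTRAINT PK PRIMARY KEY (HOST, TS)
          |) SALT_BUCKETS = 8""".stripMargin)
      // A secondary index lets queries filtering on VAL avoid a full scan
      // of the data table (VAL is not part of the row key).
      stmt.execute(
        "CREATE INDEX IF NOT EXISTS METRICS_VAL_IDX ON METRICS (VAL) INCLUDE (HOST)")
    } finally conn.close()
  }
}
```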

Spark-Phoenix

  • Although Spark supports connecting directly to JDBC databases, it is only able to parallelize queries by partitioning on a numeric column. It also requires a known lower bound, upper bound, and partition count in order to create split queries (a comparison sketch follows this list).

  • (It seems that only queries split on a numeric column can be parallelized?)
  • So a plain Spark JDBC read of Phoenix usually runs as a single partition?
  • In contrast, the phoenix-spark integration is able to leverage the underlying splits provided by Phoenix in order to retrieve and save data across multiple workers. All that’s required is a database URL and a table name. Optional SELECT columns can be given, as well as pushdown predicates for efficient filtering.
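For comparison, this is what parallel reads through Spark's generic JDBC source require (all option values and the ID column are made up); without partitionColumn, lowerBound, upperBound, and numPartitions, Spark issues a single query, i.e. one partition:

```scala
import org.apache.spark.sql.SparkSession

object JdbcPartitionedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

    // Generic JDBC source: parallelism only comes from splitting a numeric
    // column into numPartitions ranges between lowerBound and upperBound.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:phoenix:zk1,zk2,zk3:2181")
      .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
      .option("dbtable", "WEB_STAT")
      .option("partitionColumn", "ID")   // must be numeric
      .option("lowerBound", "0")
      .option("upperBound", "1000000")
      .option("numPartitions", "16")
      .load()

    println(df.rdd.getNumPartitions)  // 16 split queries instead of 1
    spark.stop()
  }
}
```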

Apache Spark Plugin

  • The phoenix-spark plugin extends Phoenix’s MapReduce support to allow Spark to load Phoenix tables as DataFrames, and enables persisting them back to Phoenix.
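A hedged sketch of the plugin usage (option names follow the phoenix-spark 4.x DataFrame examples; newer phoenix5-spark connectors use .format("phoenix") and slightly different options; table names and the zkUrl are placeholders):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PhoenixSparkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phoenix-spark").getOrCreate()

    // Load a Phoenix table as a DataFrame; splits come from Phoenix itself,
    // so no partition column or bounds are needed.
    val df = spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "INPUT_TABLE")
      .option("zkUrl", "zk1,zk2,zk3:2181")
      .load()

    // Persist the (possibly transformed) DataFrame back to another Phoenix
    // table; the target table must already exist.
    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .option("table", "OUTPUT_TABLE")
      .option("zkUrl", "zk1,zk2,zk3:2181")
      .save()

    spark.stop()
  }
}
```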