HBase RowKey for Hierarchical data
I need a recommendation on designing an HBase table/rowkey for efficient search.
Here is a sample data set:
| column families | column qualifiers | row 1 | row 2 | row 3 |
| --------------- | ----------------- | ----- | ----- | ----- |
| country | code | | | uk |
| | full name | united states of america | | united kingdom |
| | capital | washington, dc | | london |
| | leader | president | president | prime minister |
| state | | texas | | |
| district | | houston | la | |
| county | | harris | harris | cambridge |
| city | | houston city | duke city | |
| road | | bellare | | downing street |
| family | | doe | wade | |
| person | name | john doe | doo | smith |
| | location | 35.00 n, 99.00 w | 31.00 n, 100.00 w | |
| | gender | | female | male |
| | religion | atheist | maya | christian |
Note: in the above sample data set, column qualifiers are detailed only for the first and last column families, to avoid cluttering the data.
Data overview: this is a hierarchical data set, and information at any level can be missing. For example, row 2 doesn't have country information, and in row 3, the UK doesn't have the concept of a state, so the state level of the hierarchy is missing for UK records.
The requirement is to search for the following scenarios:
| search criteria | returned records |
| --------------- | ---------------- |
| all records | 3 records: row 1, row 2 & row 3 |
| country = usa | 1 record: row 1 |
| gender = male | 1 record: row 3 |
| county = harris | 2 records: row 1 & row 2 |
| latitude > 30 and longitude < 101 | 2 records: row 1 & row 2 |
| all atheists in usa | 1 record: row 1 |
Proposed design solution:
1. Create 8 column families, one per level, since there can be additional information that needs to be searched at each level (for example, a leader's name, position, timezone, and area at the city, county, district, and country levels, etc.):
| level 1 | country |
| level 2 | state |
| level 3 | district |
| level 4 | county |
| level 5 | city |
| level 6 | road |
| level 7 | family |
| level 8 | person |
2. Design the rowkey as a composite key, a combination of all the level codes, since a search can happen at any level: `level1code:level2code:level3code:level4code:level5code:level6code:level7code:level8code`. For example: `us:101:102:103:104:105:106:107`.
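As a sketch in plain Java, the composite key could be assembled like this (the level codes are the placeholder values from the question; leaving missing levels empty so the delimiter positions stay stable is just one possible choice, not something settled above):

```java
import java.util.StringJoiner;

public class CompositeRowKey {
    // Join the codes of all 8 levels into one rowkey string.
    // A missing level (e.g. "state" for UK records) becomes an empty slot,
    // so every rowkey keeps the same number of ":" delimiters.
    static String rowKey(String... levelCodes) {
        StringJoiner joiner = new StringJoiner(":");
        for (String code : levelCodes) {
            joiner.add(code == null ? "" : code);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Example from the question:
        System.out.println(rowKey("us", "101", "102", "103", "104", "105", "106", "107"));
        // → us:101:102:103:104:105:106:107
        // A UK record with no state level:
        System.out.println(rowKey("uk", null, "301", "302", "303", "304", "305", "306"));
        // → uk::301:302:303:304:305:306
    }
}
```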
The other alternative I'm thinking of is to create a secondary index, which I have to explore further on the performance end, since I plan to use HBase as the backend for a web application.
Thanks in advance for sharing your expertise!
About the number of column families: I'd stick to the lowest number of families possible (see the HBase docs). Given your case, I'd try 2 families: one for location data and one for person-related data; that way you can search location data without reading person data, and vice versa.

Let me remind you that while scanning just one family speeds things up, you can only retrieve data from that family. In your case, if you scan the people within a location you won't be able to get their names unless you also add the other family, and if that's the case there's no gain in having multiple families. Multiple families are worth using if the data needs different configurations (TTL, versions, compression...) or if it can be queried independently, without needing data from the other families.
In the end it depends on your most common queries and on where you need to reduce the amount of data read: let's say you perform a lot of queries based on geolocation; in that case I'd move the lat & long columns to their own family, to avoid reading everything else.
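As a minimal sketch of what "reading only one family" looks like with the HBase 2.x Java client (the `records` table and `loc` family names are made up for illustration, and this assumes a running cluster):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class LocationScan {
    // Build a scan that touches only the hypothetical "loc" family,
    // so person-related data is never read.
    static Scan locationOnlyScan() {
        return new Scan().addFamily(Bytes.toBytes("loc"));
    }

    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("records"));
             ResultScanner scanner = table.getScanner(locationOnlyScan())) {
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow()));
            }
        }
    }
}
```

Because each family is stored in its own files, restricting the scan this way means the person-related files are never touched at all.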
About composite rowkeys: I'm not sure where the levelcode IDs come from; are they normalized values stored elsewhere? Your rowkeys seem pretty large if you store them as strings within the rowkey (each char takes 1 byte, so if you have large levelcode IDs you'll end up with 60+ byte rowkeys). Please notice that in HBase every cell is stored along with its full row key + column family + column qualifier + timestamp, plus the value, so large rowkeys mean a pretty big overhead in terms of storage, without any performance gain when querying.
If the levelcode IDs are normalized integers and you want to use them to filter data, perhaps you should consider adding the IDs to their own family (with the values as integer columns); that way you can scan that family and filter out what you don't need with a SingleColumnValueFilter.
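A minimal sketch of such a filter with the HBase 2.x Java client, assuming a hypothetical `ids` family holding integer level codes (the `records` table name and the id value for harris are made up):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CountyFilterScan {
    // Keep only rows whose ids:county integer column equals the given code.
    static SingleColumnValueFilter countyFilter(int countyCode) {
        SingleColumnValueFilter f = new SingleColumnValueFilter(
                Bytes.toBytes("ids"),       // hypothetical family of integer level codes
                Bytes.toBytes("county"),
                CompareOperator.EQUAL,
                Bytes.toBytes(countyCode));
        f.setFilterIfMissing(true);         // skip rows that lack the county column
        return f;
    }

    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("records"));
             ResultScanner scanner = table.getScanner(new Scan().setFilter(countyFilter(103)))) {
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow()));
            }
        }
    }
}
```

Keep in mind the filter only limits what is returned to the client; the region servers still read every row, which is why this alone isn't enough for real-time queries.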
If you have some spare time, take a look at this great introduction to HBase schema design by Amandeep Khurana.
About real-time answers
If you need to provide real-time answers to your queries, both approaches won't work given the amount of data. HBase is a data store, not a search engine: you can run searches and analysis jobs on huge datasets, but they won't be real-time.
With a proper design, HBase can work as a real-time backend for simple searches, but you should follow these mantras when designing your tables:
- Avoid full scans at all costs, even if you have filters. Full scan = read every row = not suitable for real-time.
- Writes are fast. Denormalize and write the data the way you need to read it, in order to speed up data retrieval.
This means you'll need secondary indexes for each query you expect to run, if you need to retrieve results in real-time. All of your scan queries should have at least a start row key, and that requires writing the same data to multiple tables under different rowkeys (or to the same table using prefixes for each index type, a practice I wouldn't recommend because it makes splitting hard and you can have hotspotting issues).
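A sketch of what such a denormalized index write could look like with the HBase 2.x Java client; the `records` and `records_by_county` tables, the `d` family, and the rowkeys are all made up for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedWrite {
    // The index rowkey starts with the queried field, so "county = harris"
    // becomes a prefix scan with a start row instead of a full scan.
    static String indexKey(String county, String recordKey) {
        return county + ":" + recordKey;
    }

    public static void main(String[] args) throws IOException {
        byte[] fam = Bytes.toBytes("d");            // hypothetical data family
        String recordKey = "us:tx:harris:doe:john"; // made-up main-table rowkey
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table records = conn.getTable(TableName.valueOf("records"));
             Table byCounty = conn.getTable(TableName.valueOf("records_by_county"))) {
            // 1) Write the record to the main table.
            Put record = new Put(Bytes.toBytes(recordKey));
            record.addColumn(fam, Bytes.toBytes("name"), Bytes.toBytes("john doe"));
            records.put(record);
            // 2) Denormalize: write the same data again under the index rowkey.
            Put index = new Put(Bytes.toBytes(indexKey("harris", recordKey)));
            index.addColumn(fam, Bytes.toBytes("name"), Bytes.toBytes("john doe"));
            byCounty.put(index);
        }
    }
}
```

The price of this pattern is that every record is written once per index, and the application is responsible for keeping all the copies in sync.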
Please notice that some of your queries combine multiple fields ("all atheists in usa", or "latitude > 30 and longitude < 101"); in those cases you'll need a secondary index for each of them, and for every combination of them, which will make things a lot more complex if you want to handle them with HBase.
I tend not to recommend switching to other systems just because some things can be done with more or less effort, but, based on your use case, I think you'd be better off opting for a search engine that takes care of the indexing itself.
Perhaps you'll find Elasticsearch useful for this task: it's fast, easy to learn, flexible, reliable, scalable, and has nice APIs for a lot of languages.