TAGS :Viewed: 7 - Published at: a few seconds ago

[ How to index a wide table of booleans ]

My question is on building indexes when your client is using a lot of little fields.

Consider a search of the following: (can't change it, this is what the client is providing)

SKU zone1   zone2   zone3   zone4   zone5   zone6   zone7   zone8   zone9   zone10  zone11
A123                1       1       1       1       1       1       1       1   
B234                1       1       1       1       1       1       1       
C345                        1       1       1       1       1       1   

But it is much wider, and there are many more categories than just Zone. The user will be looking for skus that match at least one of the selected zones. I intend to query this with (if the user checked "zone2, zone4, zone6")

select SKU from TABLE1 where (1 IN (zone2,zone4,zone6))

Is there any advantage to indexing with a multi tiered index like so:

create index zones on table1 (zone1,zone2,zone3,zone4,zone5,zone6,zone7,zone8,zone9,zone10,zone11)

Or will that only be beneficial when the user checked zone1?

Thanks, Rob

Answer 1


You should structure the data as:

create table SKuZones (
    Sku int not null,
    zone varchar(255)
)

It would be populated with the places where a SKU has a 1. This can then take great advantage of an index on SKUZones(zone) for an index. A query such as:

select SKU
from SKUZones
where zone in ('zone2', 'zone4', 'zone6');

will readily take advantage of an index. However, if the data is not structured in a way appropriate for a relational database, then it is much harder to make queries efficient.

One approach you could take if you can add a column to the table is the following:

  • Add a new column called zones or something similar.
  • Use a trigger to populate it with values for each "1" in the columns (so "zone3 zone4 zone5 . . ." for the first row in your data).
  • Build a full text index on the column.
  • Run your query using match against

Answer 2


Indexing boolean values is almost always useless.

What if you use a SET datatype? Or BIGINT UNSIGNED?

Let's talk through how to do it with some sized INT, named zones

zone1 is the bottom bit (1<<0 = 1) zone2 is the next bit (1<<1 = 2) zone3 is the next bit (1<<2 = 4) zone4 is the next bit (1<<3 = 8) etc.

where (1 IN (zone2,zone4,zone6)) becomes where (zones & 42) != 0.

To check for all 3 zones being set: where (zones & 42) = 42.

As for indexing, no index will help this design; there will still be a table scan.

If there are 11 zones, then SMALLINT UNSIGNED (2 bytes) will suffice. This will be considerably more compact than other designs, hence possibly faster.

For this query, you can have a "covering" index, which helps some: select SKU from TABLE1 where (zones & 42) != 0 .. INDEX(zones, SKU)

(Edit)

42 = 32 | 8 | 2 = zone6 | zone4 | zone2 -- where | is the bitwise OR operator.

& is the bitwise AND operator. See http://dev.mysql.com/doc/refman/5.6/en/non-typed-operators.html

(zones & 42) = 42 effectively checks that all 3 of those bits are "on".
(zones & 42) = 0 effectively checks that all 3 of those bits are "off".
In both cases, it is ignoring the other bits.

42 could be represented as ((1<<5) | (1<<3) | (1<<1)). Because of precedence rules, I recommend using more parentheses than you might think necessary.

1 << 5 means "shift" 1 by 5 bits.