TAGS :Viewed: 16 - Published at: a few seconds ago

[ Is it bad to store calculated columns in Postgres for a Tableau connection? ]

I recently bought a Professional license (coming from Personal) and I loaded some data into postgreSQL. This is my first time using a database but I have read a lot about them (SQL Demystified, Data Warehouse)

My habit with .csv files was to create calculated data within the raw data. I use pandas (python analysis library) to clean my files, add columns, do joins etc. This helped me simplify some calculated fields in Tableau, and my end-users would sometimes open these files and it was nice having the extra fields.

Example: In pandas I can create a column which is based on some logic and math on other columns. This would create a column in the raw data which is 'redundant', because it can be calculated.

Or I can calculate that with IF statements or LOD calculations in Tableau. I have read that creating columns which can be calculated and storing them in a database is a no-no. On the other hand it seems a bit silly having Tableau do extra work/calculations for values which are persistent and common.

(Example, inbound_interaction = 1 if a certain column = 'inbound' and another column = not null) (Example 2, total_time = sum of several columns)

Any thoughts or best practices on this subject? Since I am literally just starting fresh this week, I might as well get off to a good start.

Answer 1

The question is a bit broad and there is no golden bullet answer to it. Certain costs you have to consider are

  • Having stale vs fresh data
  • Space used by the materialized columns
  • The cost of computation on the server vs the client
  • Cost of maintaining the materialized columns

Now we have 4 solutions:

  • Raw data in the tables, extra calculations done by the client
  • Views that present calculated data to the client but which are done on the server on the fly, these can be aided by indexes on expressions. Also the views make it easier to choose what data is accessible by given user and you can have many without taking space.
  • Materialized views which are like views but store calculated data, are not fresh but are easier to maintain than some self-made materialization. Indexes are also possible on them
  • Tables that contain data calculated by applications outside of the database, for the things that your database just can't handle

I'd recommend to go with indexes and if that doesn't perform well switch to materialized indexes. Of course you may adopt a partial solution as some data is needed always up to date, other data might take a lot of time on the server but is not accessed often enough to store it materialized and some other might be really fast to compute but accessed so often that materializing it will make the whole system perform much better.