
[ Charting massive amounts of data ]

We are currently using ZedGraph to draw a line chart of some data. The input data comes from a file of arbitrary size, so we do not know the maximum number of data points in advance. However, by opening the file and reading the header, we can find out how many data points it contains.

The file format is essentially [time (double), value (double)]. However, the entries are not uniformly spaced on the time axis. There may not be any points between, say, t = 0 sec and t = 10 sec, but there might be 100K entries between t = 10 sec and t = 11 sec, and so on.

As an example, our test dataset file is ~2.6 GB and it has 324M points. We'd like to show the entire graph to the user and let her navigate through the chart. However, loading 324M points into ZedGraph is not only impossible (we're on a 32-bit machine), but also not useful, since there is no point in having so many points on the screen.

Using the FilteredPointList feature of ZedGraph also appears to be out of the question, since it requires loading the entire dataset first and then filtering it.

So, unless we're missing something, it appears that our only solution is to somehow decimate the data. However, as we keep working on it, we're running into a lot of problems:

1- How do we decimate data that is not arriving uniformly in time?

2- Since the entire data can't be loaded into memory, any algorithm needs to work on the disk and so needs to be designed carefully.

3- How do we handle zooming in and out, especially when the data is not uniform on the x-axis?

If the data were uniform, then upon initial load of the graph we could Seek() by a predefined number of entries in the file, pick every Nth sample, and feed those to ZedGraph. However, since the data is not uniform, we have to be more intelligent in choosing the samples to display, and we can't come up with any intelligent algorithm that would not have to read the entire file.
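For reference, the seek-every-Nth approach that works for uniform data might look like the sketch below. Python is used only for illustration (our actual code is C#/.NET); the record layout of two little-endian doubles is taken from the file format above, and the sketch assumes the header has already been skipped:

```python
import struct

RECORD = struct.Struct("<dd")  # [time (double), value (double)], 16 bytes per record

def read_every_nth(path, n):
    """Naive uniform decimation: seek to every Nth record and read it.
    Only gives a representative picture when samples are evenly spaced in time."""
    points = []
    with open(path, "rb") as f:
        f.seek(0, 2)                       # jump to end of file
        total = f.tell() // RECORD.size    # number of fixed-size records
        for i in range(0, total, n):
            f.seek(i * RECORD.size)        # constant-time jump to record i
            t, v = RECORD.unpack(f.read(RECORD.size))
            points.append((t, v))
    return points
```

This is exactly what breaks down on non-uniform data: every Nth *record* is not every Nth *second*, so dense bursts dominate the picture.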

I apologize since the question does not have razor-sharp specificity, but I hope I could explain the nature and scope of our problem.

We're on Windows 32-bit, .NET 4.0.

Answer 1

I've needed this before, and it's not easy to do. I ended up writing my own graph component because of this requirement. It turned out better in the end, because I put in all the features we needed.

Basically you need to get the range of data (min and max possible/needed index values), subdivide into segments (let's say 100 segments), and then determine a value for each segment by some algorithm (average value, median value, etc.). Then you plot based on those summarized 100 elements. This is much faster than trying to plot millions of points :-).

So what I am saying is similar to what you are saying. You mention you do not want to plot every Xth element because there might be a long stretch of time (index values on the x axis) between elements. What I am saying is: for each subdivision of the data, determine the best representative value and take that as the data point. My method is index-value based, so in your example of no data between the 0 sec and 10 sec index values, I would still put data points there; they would simply repeat the same value.

The point is to summarize the data before you plot it. Think through your algorithms to do that carefully, there are lots of ways to do so, choose the one that works for your application.

You might get away with not writing your own graph component and just write the data summarization algorithm.

Answer 2

1- How do we decimate data that is not arriving uniformly in time?

(Note - I'm assuming your loader datafile is in text format.)

On a similar project, I had to read datafiles that were more than 5 GB in size. The only way I could parse them was by reading them into an RDBMS table. We chose MySQL because it makes importing text files into datatables drop-dead simple. (An interesting aside: I was on a 32-bit Windows machine and couldn't even open the text file for viewing, but MySQL read it no problem.) The other perk was that MySQL is screaming, screaming fast.

Once the data was in the database, we could easily sort it and condense large amounts of data into single summary queries (using built-in SQL aggregate functions like SUM). MySQL could even write its query results back out to a text file for use as loader data.
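A minimal sketch of that summarize-in-the-database idea is below. The answer used MySQL; SQLite (via Python's `sqlite3`) is substituted here purely so the snippet is self-contained, and the table/column names are made up for illustration. The same `GROUP BY` plus aggregate-function pattern applies in MySQL:

```python
import sqlite3

# In-memory stand-in for the MySQL table the answer describes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (t REAL, value REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(0.1, 1.0), (0.2, 2.0), (10.4, 5.0), (10.9, 7.0)])

# One summary row per 1-second bucket: count, min, max, average.
rows = conn.execute("""
    SELECT CAST(t AS INTEGER) AS sec,
           COUNT(*), MIN(value), MAX(value), AVG(value)
    FROM samples
    GROUP BY sec
    ORDER BY sec
""").fetchall()
```

The database does the heavy lifting on disk, and only the handful of summary rows ever needs to reach the charting code.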

Long story short, consuming that much data mandates the use of a tool that can summarize the data. MySQL fits the bill (pun intended...it's free).

Answer 3

I would approach this in two steps:

  1. Pre-processing the data
  2. Displaying the data

Step 1 The file should be preprocessed into a fixed-format binary file. Adding an index to the format, each record would be int, double, double. See this article for speed comparisons:


You can then break the file up into time intervals, say one file per hour or day, which gives you an easy way to access different time ranges. Alternatively, you can keep one big file and maintain an index file that tells you where to find specific times:

1,1/27/2011 8:30:00
13456,1/27/2011 9:30:00

By using one of these methods you will be able to quickly find any block of data, either by time (via an index or file name) or by number of entries (thanks to the fixed byte format).
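The fixed-record idea can be sketched like this. Python's `struct` is used for illustration (the actual app is .NET, where `BinaryWriter`/`BinaryReader` would play the same role); `write_records` and `read_block` are hypothetical helper names, and the `<idd>` layout matches the int, double, double record described above:

```python
import struct

REC = struct.Struct("<idd")   # index (int32), time (double), value (double): 20 bytes

def write_records(path, records):
    """Write (time, value) pairs as fixed-size indexed binary records."""
    with open(path, "wb") as f:
        for i, (t, v) in enumerate(records):
            f.write(REC.pack(i, t, v))

def read_block(path, first, count):
    """Fixed record size means record k lives at byte k * REC.size,
    so any block of records is one seek away."""
    with open(path, "rb") as f:
        f.seek(first * REC.size)
        data = f.read(count * REC.size)
    return [REC.unpack_from(data, j * REC.size) for j in range(count)]
```

An index file mapping times to record numbers (as in the example above) then turns "show me 9:30 to 10:30" into a single `read_block` call.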

Step 2 Ways to show the data:

  1. Just display each record by index.
  2. Normalize the data and create aggregate data bars with open, high, low, close values:
     a. By time
     b. By record count
     c. By difference between values

For more possible ways to aggregate non-uniform data sets, you may want to look at the methods used to aggregate trade data in the financial markets. Of course, for speed in real-time rendering you would want to create files with this data already aggregated.
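Option 2b above (aggregate bars by record count) can be sketched as follows, again in Python purely for illustration; `ohlc_bars` is a hypothetical helper name, and the same open/high/low/close idea applies to the by-time and by-value-difference variants:

```python
def ohlc_bars(points, per_bar):
    """Aggregate every `per_bar` consecutive (time, value) records into an
    open/high/low/close bar, as is done with financial trade data."""
    bars = []
    for i in range(0, len(points), per_bar):
        values = [v for _, v in points[i:i + per_bar]]
        t_open = points[i][0]  # bar is stamped with its first record's time
        bars.append((t_open, values[0], max(values), min(values), values[-1]))
    return bars
```

Because each bar keeps the extremes as well as the endpoints, spikes survive the decimation instead of being averaged away.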