Simulating First/Last aggregate functions in MySQL

I came across an interesting post at Xaprb.com on selecting the first and last row of a group in MySQL. This solution builds on a comment in that post about using GROUP_CONCAT to achieve this.

As a practical example of the usage of the first and last aggregate functions, consider an application converting financial data between different timeframes.

The table structure and data might look like:

CREATE TABLE EURUSD_3600 (
    datetime datetime NOT NULL,
    open float NOT NULL,
    low float NOT NULL,
    high float NOT NULL,
    close float NOT NULL,
    PRIMARY KEY (datetime)
);
INSERT INTO EURUSD_3600 SELECT '2001-01-03 00:00:00',0.9507,0.9505,0.9509,0.9506;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 01:00:00',0.9506,0.9492,0.951,0.9496;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 02:00:00',0.9496,0.9495,0.9509,0.9505;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 03:00:00',0.9504,0.9498,0.9508,0.9499;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 04:00:00',0.9499,0.9499,0.9507,0.9503;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 05:00:00',0.9503,0.9503,0.9506,0.9506;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 06:00:00',0.9506,0.9505,0.9507,0.9505;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 07:00:00',0.9505,0.9503,0.9513,0.9509;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 08:00:00',0.951,0.9468,0.951,0.9489;
INSERT INTO EURUSD_3600 SELECT '2001-01-03 09:00:00',0.949,0.9489,0.9543,0.9539;

SQL is a suitable tool to group this data into a more compact timeframe, for instance, the 2-hour timeframe. By using a formatting function on the datetime field, a standard GROUP BY SQL statement can be used to achieve this:

CONCAT(
    year(datetime), '-', 
    month(datetime), '-',
    day(datetime), ' ',  
    floor(hour(datetime) / 2) * 2, ':00:00'
)

Calculating the low and high of the period is easy, using the standard min/max aggregate functions:

SELECT
    CAST(
        CONCAT(
            year(datetime), '-',
            month(datetime), '-',
            day(datetime), ' ',
            floor(hour(datetime) / 2) * 2, ':00:00'
        ) AS DATETIME) AS datetime,
        MIN(low) AS low,
        MAX(high) AS high
FROM EURUSD_3600
GROUP BY CAST(
        CONCAT(
            year(datetime), '-',
            month(datetime), '-',
            day(datetime), ' ',
            floor(hour(datetime) / 2) * 2, ':00:00'
        ) AS DATETIME)

GROUP_CONCAT can be used for the opening(first) and close(last), the trick is to be sure to use the ORDER BY clause in the group concatenation, otherwise the returned first and last records might not be as expected:

SUBSTRING_INDEX(
    GROUP_CONCAT(CAST(open AS CHAR) ORDER BY datetime),
    ',',
    1
) AS open

SUBSTRING_INDEX(
    GROUP_CONCAT(CAST(close AS CHAR) ORDER BY datetime),
    ',',
    -1
) AS close

GROUP_CONCAT returns a comma separated string of all groups values, then SUBSTRING_INDEX is used to return the first and last values of the group. This works as long as data is properly sorted and the resulting string fits in the GROUP_CONCAT buffer.

Note however that there is a problem with the preceding code when calculating the close(last) value. SUBSTRING_INDEX looks from the end of the string, but the GROUP_CONCAT buffer has a limited size, and it will be truncated if the concatenated rows exceed the buffer size, in which case the returned value will not be the last in the group as desired.

An easy solution is to reverse the sort order and fetch the first element instead:

SUBSTRING_INDEX(
    GROUP_CONCAT(CAST(close AS CHAR) ORDER BY datetime DESC),
    ',',
    1
) AS close

MySQL will still throw a warning about truncated group_concat when there are too many rows in the source table, but this can be safely ignored for the purpose of this example.

The final query turns out as:

SELECT
    CAST(
        CONCAT(
            year(datetime), '-',
            month(datetime), '-',
            day(datetime), ' ',
            floor(hour(datetime) / 2) * 2, ':00:00'
        ) AS DATETIME) AS datetime,
    SUBSTRING_INDEX(
        GROUP_CONCAT(CAST(open AS CHAR) ORDER BY datetime),
        ',',
        1
    ) AS open,
    MIN(low) AS low,
    MAX(high) AS high,
    SUBSTRING_INDEX(
        GROUP_CONCAT(CAST(close AS CHAR) ORDER BY datetime DESC),
        ',',
        1
    ) AS close
FROM EURUSD_3600
GROUP BY CAST(
        CONCAT(
            year(datetime), '-',
            month(datetime), '-',
            day(datetime), ' ',
            floor(hour(datetime) / 2) * 2, ':00:00'
        ) AS DATETIME);
Simulating First/Last aggregate functions in MySQL

14 thoughts on “Simulating First/Last aggregate functions in MySQL

  1. stephen says:

    I am trying to turn this into a function, but it won’t work. Please, correct me.

    DELIMITER $$

    DROP FUNCTION IF EXISTS `test`.`last`$$
    CREATE FUNCTION `test`.`last`(f1 varchar(12),f2 varchar(12)) RETURNS varchar(12) CHARSET latin1
    return substring_index(group_concat(f1 order by f2 desc),’,’,1);

    $$

    DELIMITER ;

    Like

  2. Joao says:

    Not sure you can do this as a function using the same underlying logic.

    My first approach was to create an aggregate UDF, but that didn’t work because you can’t guarantee the order of data going into the udf.

    Like

  3. Nice little hack :).

    You could do this with a MySQL UDF though. Like the above function you need to pass the field with the value as well as the field to order by.

    SELECT agg_last(close, datetime AS `desc`) FROM …

    Like

  4. Joao says:

    But how do you guarantee sort order ? That’s the bit I couldn’t figure out. Does “datetime as `desc`” somehow make it happen, or is it something that can be implemented inside the UDF ?

    Like

  5. Paolo says:

    Hi Joao,
    thanks for this great post. How would you change the query if I wanted to group days instead of hours? For example, looking at weekly data using daily data.

    Like

  6. Paolo says:

    Hi Joao,
    thanks for responding on Sep 30th dated post. I also used the format for the different time frames you have listed on the linked site. If I may trouble you a bit more, I have minute data on my database and if I wanted to measure a 65 min timeframe instead of the typical 60 min. How would you modify your query. My reason is a 65 min timeframe would divide the trading day in 6 equal parts.
    Again, thanks for you response.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s