Simplifying

Data-agnostic Data Analysis

by Hao Sun

Why?

Because life could be so much better!! 😎

What is Data-agnostic? 🤔

logic NOT changing when data changes!

(across iterations, features, projects, teams, data-storage-medium)

Subjective 😜

Scenario

Calculate

7-day

average

price

for

stock A

Scenario

Calculate

1-month

average

price

for

stock A

Scenario

Calculate

1-month

average

volume

for

stock A

Scenario

Calculate

1-month

average

volume

for

stock B

Scenario

Calculate

7-day, 14-day, 1-month

average

price, volume

for

stock A, B

More complex

Calculate max, avg, min of length of streaks of

0s 1s

in column of 0s and 1s

Goal

  • Separation of data and logic
  • Logic is modular and reusable
  • Distill essence of desired subset of pandas

7-day sum,avg of stock A's price

The logic (a json config) 🤓


          {
            "tranformation": "agg",  # "col_ops", "filter", "merge"
            "agg_type": "disjoint",  # "rolling"
            "func": ["sum", "avg"],
          }
          

The data (another json config) 🤓


        { 
          "data_cols": ["price"],
          "date_col_name": "date",
          "lookback_periods": ["7d"],
          "groupby_cols": []
        }
        

7-day,1-month sum,avg of stock A,B's price,volume

The logic (a json config) 🤓


          {
            "tranformation": "agg",  # "col_ops", "filter", "merge"
            "agg_type": "disjoint",  # "rolling"
            "func": ["sum", "avg"],
          }
          

The data (another json config) 🤓


        { 
          "data_cols": ["price", "volume"],
          "date_col_name": "date",
          "lookback_periods": ["7d", "1m"],
          "groupby_cols": ["stock_name"]
        }
        

max, avg, min of length of streaks of 0s in column "01_col" of 0s and 1s

The logic


          [ { # step 1
              "tranformation": "agg",
              "agg_type": "rolling"
              "func": ["cumsum"], }, 
            { # step 2
              "tranformation": "agg",
              "agg_type": "disjoint",
              "func": ["count"], }, 
            { # step 3
              "tranformation": "agg",
              "agg_type": "disjoint",
              "func": ["max", "avg", "min"], },
            { # step 4 
              "tranformation": "filter",
              "func": ["{col}=0"] }]
          

The data


        [ { # step 1
            "data_cols": ["01_col"],
            # "orderby_cols": ["index_col"], }, 
          { # step 2
            # "data_cols": ["{result}"],
            "groupby_cols": 
              ["01_col", "{result}"], }, 
          { # step 3
            "groupby_cols": [], },
          { # step 4 
            "data_cols": ["01_col"] } ]
        

max, avg, min of length of streaks of 1s in column "01_col" of 0s and 1s

The logic


          [ { # step 0 
              "tranformation": "col_ops",
              "func": ["1-{col}"] },
            { # step 1
              "tranformation": "agg",
              "agg_type": "rolling"
              "func": ["cumsum"], }, 
            { # step 2
              "tranformation": "agg",
              "agg_type": "disjoint",
              "func": ["count"], }, 
            { # step 3
              "tranformation": "agg",
              "agg_type": "disjoint",
              "func": ["max", "avg", "min"], },
            { # step 4 
              "tranformation": "filter",
              "func": ["{col}=0"]
             }]
          

The data


        [ { # step 0
            "data_cols": ["01_col"], }, 
          { # step 1
            # "data_cols": ["{result}"],
            # "orderby_cols": ["index_col"], }, 
          { # step 2
            "groupby_cols": 
              ["01_col", "{result}"], }, 
          { # step 3
            "groupby_cols": [], },
          { # step 4 
            "data_cols": ["01_col"] } ]
        
01_col cumsum
1 1
0 1
0 1
1 2
1 3
0 3
0 3
0 3
1 4
0 4

🤔 Where is

🤯

Can also works for


...

if you write adapters

Config vs Code

Thank you!