blog

some random stuff I've written about

Graphormer: Transformers for Graph-Structured Data
Understanding how Graphormer adapts the transformer architecture for graph representation learning

Transformers have revolutionized NLP and computer vision, but applying them to graph-structured data presents unique challenges. Unlike sequences or grids, graphs have irregular topology without a natural ordering of nodes.

Graphormer addresses this by introducing novel structural encodings that capture graph properties within the transformer framework:

  • Centrality Encoding: Captures node importance using degree information
  • Spatial Encoding: Encodes pairwise distances between nodes in the attention mechanism
  • Edge Encoding: Incorporates edge features along shortest paths between node pairs

These encodings allow the standard transformer architecture to reason about graph structure while leveraging the powerful attention mechanism for learning node representations.
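To make these concrete, here is a rough single-head sketch of how the centrality and spatial encodings slot into standard attention (edge encoding omitted; the names, shapes, and random inputs are illustrative assumptions, not the paper's exact configuration):

```python
import torch

torch.manual_seed(0)
n, d, max_dist, max_deg = 5, 16, 8, 10  # toy sizes, not the paper's config

# Learnable tables (illustrative; Graphormer uses per-head biases)
deg_embed = torch.nn.Embedding(max_deg + 1, d)                # centrality encoding
spatial_bias = torch.nn.Parameter(torch.zeros(max_dist + 1))  # spatial encoding
wq, wk, wv = (torch.randn(d, d) for _ in range(3))

def encode_and_attend(x, deg, spd):
    """Single-head attention with centrality and spatial encodings.

    x   : (n, d) node features
    deg : (n,)   node degrees, clamped to max_deg
    spd : (n, n) pairwise shortest-path distances, clamped to max_dist
    """
    x = x + deg_embed(deg)               # centrality: add a degree embedding to each node
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = (q @ k.T) / d ** 0.5        # standard scaled dot-product scores
    logits = logits + spatial_bias[spd]  # spatial: learnable scalar bias per distance
    return torch.softmax(logits, dim=-1) @ v

x = torch.randn(n, d)
deg = torch.randint(0, max_deg + 1, (n,))
spd = torch.randint(0, max_dist + 1, (n, n))
out = encode_and_attend(x, deg, spd)     # (n, d) updated node representations
```

Because the bias is added before the softmax, nodes that are close in the graph can attend to each other more strongly, while the transformer machinery itself stays unchanged.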

Published as part of the GRAM workshop's blogpost track @ ICML.

Paper Presentations
Slides from paper reading sessions on databases and ML systems

A collection of presentations I've prepared for paper reading sessions, covering topics in database optimization and machine learning systems.


DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

A structural pruning approach for LLMs that achieves dimension-independent compression. Covers the DISP-LLM framework, pruning strategies, and efficiency-accuracy tradeoffs.

Download Slides (PDF)


Optimizing Queries Using Materialized Views

A practical, scalable solution for query optimization using materialized views. Covers view matching algorithms, cost-based optimization, and real-world implementation considerations.

Download Slides (PDF)


Cache-Efficient Top-k Aggregation

Techniques for efficient top-k aggregation over high-cardinality datasets. Explores cache-aware algorithms, memory hierarchy optimization, and performance benchmarks.

Download Slides (PDF)


German Strings for Faster Analytics
How modern analytical databases optimize string storage with inline prefix techniques

String handling is a significant performance bottleneck in analytical databases. Traditional approaches store strings as pointers to heap-allocated memory, causing cache misses and memory indirection overhead during scans and comparisons.

In this article, I explore "German strings", an optimization technique used by modern analytical databases like DuckDB. The key insight is storing a prefix of the string inline within the pointer structure itself, enabling:

  • Fast prefix comparisons without dereferencing pointers
  • Better cache utilization by keeping frequently-accessed data together
  • Reduced memory indirection for short strings that fit entirely inline
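To make the layout concrete, here is a minimal Python model of the idea. The field names, the 12-byte inline capacity, and the comparison paths follow the commonly described Umbra/DuckDB scheme; this is a sketch, not either engine's actual implementation:

```python
class GermanString:
    """Model of the Umbra-style 16-byte string layout.

    A real implementation packs these fields into 16 bytes:
    4-byte length | 4-byte prefix | 8 bytes of inline data or a pointer.
    Strings up to 12 bytes live entirely inline; longer ones keep only
    the prefix inline plus a pointer to the full payload.
    """
    INLINE_CAP = 12
    PREFIX_LEN = 4

    def __init__(self, s: bytes):
        self.length = len(s)
        self.prefix = s[:self.PREFIX_LEN]
        # Stand-in for the 8-byte slot: inline payload for short strings,
        # a "heap pointer" (here just the bytes) for long ones.
        self.inline = s if self.length <= self.INLINE_CAP else None
        self.heap = None if self.inline is not None else s

    def __lt__(self, other: "GermanString") -> bool:
        # Fast path: the inline prefix decides most comparisons
        # without dereferencing any pointer.
        if self.prefix != other.prefix:
            return self.prefix < other.prefix
        return self._materialize() < other._materialize()

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, GermanString):
            return NotImplemented
        # Length or prefix mismatches rule out equality immediately.
        if self.length != other.length or self.prefix != other.prefix:
            return False
        return self._materialize() == other._materialize()

    def _materialize(self) -> bytes:
        # Slow path: follow the "pointer" for long strings.
        return self.inline if self.inline is not None else self.heap

a, b = GermanString(b"duckdb"), GermanString(b"umbra database system")
assert a < b and a != b  # both checks resolve without touching b's payload
```

Since the length and prefix travel with the pointer, ordering and equality checks on mismatched strings never touch the heap, which is where most of the scan-time savings come from.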

The technique gets its name from the Umbra database system developed in Germany, which pioneered this approach.

Originally published on the e6data engineering blog.
