
My dissertation link: https://www.proquest.com/openview/ef36efb91090fb39e1ef6081bc6b76b3/1?pq-origsite=gscholar&cbl=18750&diss=y
 

Interference-aware workload colocation in hyperscale datacenters (Research Intern), [Facebook Inc, Summer 2019-Summer 2020] (Link: https://arxiv.org/abs/2207.12499)
 

Datacenters/server farms have a vast amount of compute, memory, network, and power resources, which incur a high total cost of ownership. There are many challenges in managing a hyperscale datacenter that runs thousands of jobs and machines and incurs large capital and operating expenditures (CapEx). Among them, improving resource efficiency, which can reduce CapEx while maintaining service level objectives (SLOs), is a non-trivial problem to solve. Resource efficiency at scale requires simple, scalable, and feasible solutions. In this endeavor, we investigate mechanisms for better resource efficiency in Facebook/Meta's private datacenters. Using workload characterization, knowledge of server heterogeneity, and an understanding of workload bottlenecks, we propose a tunable technique for better bin packing.

Without careful management, over-provisioning can result in varying degrees of under-utilization of the allocated hardware resources (compute, memory, network, flash, etc.) over time. Multi-tenancy or colocation (running two or more jobs on the same server or host) offers opportunities to reduce this under-utilization. Additionally, contingencies in planning for right-sized datacenters, such as disaster recovery, host failures, and service growth, also contribute to the over-provisioning of resources.

I designed and implemented a technique that characterizes workloads by their 99th-percentile (p99) resource utilization using K-Means clustering. I used interfering microbenchmarks to run experiments on applications and determine their sensitivity to specific shared resources. This sensitivity is normalized across heterogeneous architectures based on scores tied to the resource capabilities of each server type. Once determined, the sensitivity is used together with p99 resource utilization to bin pack jobs onto hosts; a simplified sketch of this pipeline follows.
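Purely as an illustration of the pipeline, here is a minimal Python sketch. The feature set, cluster count, sensitivity scores, and the alpha knob are hypothetical placeholders rather than the production values, and scikit-learn is assumed for the clustering step.

```python
# Illustrative sketch: cluster jobs by 99th-percentile utilization, then
# greedily bin pack using sensitivity-weighted resource demands.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-job p99 utilization vectors: [cpu, memory, network].
p99_util = np.array([
    [0.62, 0.30, 0.10],
    [0.15, 0.70, 0.05],
    [0.55, 0.28, 0.12],
    [0.10, 0.65, 0.08],
])

# Group jobs with similar p99 footprints; 2 clusters is just for the example.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(p99_util)

# Hypothetical sensitivity scores per job (0 = insensitive, 1 = very sensitive),
# already normalized for the target server type's resource capabilities.
sensitivity = np.array([0.8, 0.2, 0.7, 0.1])

def effective_demand(util, sens, alpha=0.5):
    """Inflate a job's p99 demand by its sensitivity; alpha tunes the
    cost-vs-SLO trade-off (alpha=0 packs tightest, larger alpha is safer)."""
    return util * (1.0 + alpha * sens)

# Greedy first-fit bin packing onto hosts with unit capacity per resource.
hosts = []  # each host tracks remaining [cpu, mem, net] capacity
for util, sens in zip(p99_util, sensitivity):
    demand = effective_demand(util, sens)
    for host in hosts:
        if np.all(host >= demand):
            host -= demand
            break
    else:
        hosts.append(np.ones(3) - demand)

print(f"clusters: {labels}, hosts used: {len(hosts)}")
```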

Our experimental setup uses ∼37,000 tasks and ∼18,000 hosts, showing the scalability of our approach. Our methodology does not rely on classifying workloads into latency-critical and batch or on their specific service-level objective (SLO) metrics, and it can be applied to colocate two or more jobs without violating their SLOs. Our objective function can be tuned to achieve up to a ∼50% reduction in hosts required, reduce TCO by ∼40%, and reduce CPU and memory fragmentation by ∼60% when using p99 resource utilization-based limits, but this increases SLO violations by ∼25% relative to the baseline. By incorporating sensitivity scores, SLO violations are reduced by ∼22% relative to the baseline, but at the cost of a ∼30% increase in host count. These techniques allow platform owners to tune the cost vs. SLO trade-off depending on their objectives.



Managing application performance and system efficiency via resource management (Research Assistant), [University of Rochester, Summer 2018] (Paper link: https://dl.acm.org/doi/10.1145/3501767)

As servers have grown in resources such as compute and memory, it has become necessary to run multiple workloads simultaneously to use those resources efficiently. In multi-socket systems specifically, where each compute socket has its own memory node attached, the access latency of a memory location varies with where the data sits relative to the computing core. Poor data placement can thus lead to avoidable cross-socket communication and suboptimal performance.

Another point of contention is that shared resources like memory bandwidth and the last-level cache are not easily partitionable. System efficiency of individual nodes can be better managed by understanding application performance bottlenecks dynamically and making thread-to-core placement decisions that alleviate those bottlenecks, allocating and managing resources based on performance incentives. We use hardware performance counters to quantify application performance and to determine the degree of parallelism afforded to each application (the number of active threads and hardware contexts provided), using a hill-climbing algorithm with random disturbance, sketched below.
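A toy sketch of the hill-climbing idea follows; measure_performance is a hypothetical stand-in for reading hardware performance counters over an interval, and the step count and disturbance probability are illustrative.

```python
# Sketch of hill climbing over core counts with occasional random
# disturbance to escape local maxima.
import random

def measure_performance(cores: int) -> float:
    # Placeholder: in the real system this would be derived from hardware
    # performance counters (e.g., instructions retired per second).
    return min(cores, 6) - 0.05 * cores  # synthetic curve that saturates past 6 cores

def hill_climb(max_cores: int, steps: int = 20, disturb_prob: float = 0.1) -> int:
    cores = max_cores // 2
    best_perf = measure_performance(cores)
    for _ in range(steps):
        if random.random() < disturb_prob:
            candidate = random.randint(1, max_cores)    # random disturbance
        else:
            candidate = max(1, min(max_cores, cores + random.choice([-1, 1])))
        perf = measure_performance(candidate)
        if perf > best_perf:
            cores, best_perf = candidate, perf
    return cores

print(hill_climb(max_cores=16))
```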

We propose MAPPER (Managing Application Performance via Parallel Efficiency Regulation), which makes two important decisions for all participating applications: how much parallelism to afford each application in terms of the number of CPU cores, and which specific CPU cores to schedule applications on (thread-to-core mapping). My contribution to this project was designing and implementing the module that programs and accesses hardware performance counters on Intel CPUs. To achieve the goals mentioned above, we use data collected from hardware performance counters to categorize applications and understand their resource requirements and bottlenecks. The collected metrics are aggregated per application based on cgroup information available in /proc/cgroups. I also designed a way to multiplex performance counters for events like Remote HITM, Snoop HITM, and Snoop HIT so that we use five performance counters (two of which are fixed) for six performance events, reducing the overhead of our mechanism; a simplified multiplexing sketch follows.
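A simplified illustration of counter multiplexing: the event names, grouping, and "readings" below are synthetic placeholders, but scaling each event's count by the fraction of intervals during which it was scheduled is the standard way multiplexed counts are estimated.

```python
# Illustrative time multiplexing: rotate event groups across sampling
# intervals and scale counts by how often each group was scheduled.
from collections import defaultdict
from itertools import cycle

# Hypothetical grouping: the multiplexed events share programmable counters.
event_groups = cycle([("remote_hitm", "snoop_hitm"), ("snoop_hit", "llc_miss")])

raw = defaultdict(int)        # counts accumulated while the event was scheduled
scheduled = defaultdict(int)  # intervals during which the event was scheduled
NUM_INTERVALS = 100

for interval in range(NUM_INTERVALS):
    group = next(event_groups)
    for event in group:
        raw[event] += 1000 + interval          # stand-in for a real counter reading
        scheduled[event] += 1

# Scale each event's count up to the full measurement window.
estimated = {e: raw[e] * NUM_INTERVALS / scheduled[e] for e in raw}
print(estimated)
```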

I helped design the exploratory algorithm and motivated the need to normalize performance. I configured and ran CloudSuite benchmarks with co-running (PARSEC/GraphChi) applications to evaluate performance. With our approach, service-oriented CloudSuite benchmarks such as Memcached, in-memory analytics, and media streaming achieved an 18% performance improvement over Linux, and the workload mixes improved by 11% overall.

MAPPER starts by allocating a fair share of resources to each application and then uses an exploratory phase to determine an application's maximum performance with a core count less than or equal to its fair share. We normalize performance across applications to determine which resources can be spared and donated. Once the number of resources afforded to each application is determined, we use performance events to determine bottlenecks: if the bottleneck is cross-socket communication, the application's threads are confined to a single socket to benefit from data locality; if the bottleneck is memory bandwidth, threads are distributed across sockets. I found that setting the minimum quality-of-service guarantee at 85% of the IPC achieved with a fair share of compute resources works best empirically. The decision logic is sketched below.
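A compact sketch of those decisions, assuming abstracted inputs (a detected bottleneck label and measured IPC values); only the 85% threshold comes from the text above, the rest is illustrative.

```python
# Sketch of MAPPER-style decisions: explore up to the fair share, then place
# threads based on the detected bottleneck and enforce a QoS floor.
def fair_share(total_cores: int, num_apps: int) -> int:
    return max(1, total_cores // num_apps)

def choose_placement(bottleneck: str) -> str:
    # Cross-socket traffic -> pack onto one socket for locality;
    # memory-bandwidth bound -> spread across sockets for more bandwidth.
    return "pack_single_socket" if bottleneck == "cross_socket" else "spread_sockets"

def meets_qos(current_ipc: float, fair_share_ipc: float, threshold: float = 0.85) -> bool:
    # The 85% threshold mirrors the empirically chosen QoS floor above.
    return current_ipc >= threshold * fair_share_ipc

print(fair_share(32, 3), choose_placement("memory_bandwidth"), meets_qos(1.6, 1.8))
```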


Cache replacement policy based on expected reuse intervals (Research Assistant), [University of Rochester, Fall 2019] (Paper link: https://dl.acm.org/doi/10.1145/3591195.3595267)

Caches are small, dense, volatile SRAM-based memories placed close to the core (for hardware caches) to reduce memory access latency and improve performance. Since caches are of finite size, they are designed to keep a temporal working set of memory for the threads/processes running on CPU cores. Set-associative caches are divided into sets and ways that hold cache lines, which are generally 64 bytes long for private caches.
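For illustration, a tiny example of how an address maps to a set and tag, assuming typical L1 parameters (32 KiB, 8-way, 64-byte lines) rather than any specific processor.

```python
# Example: mapping an address to a set in an 8-way, 32 KiB cache with
# 64-byte lines (illustrative parameters only).
CACHE_SIZE = 32 * 1024
LINE_SIZE = 64
WAYS = 8
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)   # 64 sets

def set_index(addr: int) -> int:
    return (addr // LINE_SIZE) % NUM_SETS

def tag(addr: int) -> int:
    return addr // (LINE_SIZE * NUM_SETS)

addr = 0x1F4C0
print(set_index(addr), hex(tag(addr)))
```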

When all ways of a set are filled with cache lines, a replacement policy determines which cache block to evict in order to bring in a needed line that missed in the cache. Most replacement policies use history, often in statistical form, to determine the victim block. For example, the most commonly used policy, Least Recently Used (LRU), tracks the least recently used cache block and victimizes it for replacement. Keeping track of the reuse interval (RI), the number of accesses to unique cache blocks between two accesses to the same block, on a per-reference and per-block basis makes it possible to store arbitrarily long history.
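A small sketch of computing reuse intervals from a toy block-address trace, counting the distinct blocks touched between consecutive accesses to the same block:

```python
# Computing reuse intervals (number of distinct blocks accessed between
# two accesses to the same block) from a toy block-address trace.
def reuse_intervals(trace):
    last_seen = {}            # block -> position of its previous access
    intervals = []
    for pos, block in enumerate(trace):
        if block in last_seen:
            between = trace[last_seen[block] + 1 : pos]
            intervals.append(len(set(between)))   # distinct blocks in between
        last_seen[block] = pos
    return intervals

print(reuse_intervals(["A", "B", "C", "A", "B", "B"]))  # [2, 2, 0]
```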

We use this information to identify the Least Expected Use (LEU) block for replacement. Since we store statistical history using sampling, we can predict the next access distance, which lets us replace the cache line with the longest expected reuse in the future.
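A simplified sketch of LEU victim selection: the per-block histograms and ages below are made up, and the real design keeps this state in hardware structures rather than per-block dictionaries.

```python
# Sketch: pick the Least Expected Use (LEU) victim in a set by estimating
# each block's expected next-reuse time from a sampled RI histogram.
def expected_reuse(histogram):
    """histogram: {reuse_interval: count} collected by sampling."""
    total = sum(histogram.values())
    return sum(ri * cnt for ri, cnt in histogram.items()) / total

def leu_victim(set_blocks):
    """set_blocks: {block: (age_since_last_access, histogram)}.
    Expected time to next use = expected reuse interval minus time elapsed."""
    def time_to_next_use(entry):
        age, hist = entry[1]
        return expected_reuse(hist) - age
    return max(set_blocks.items(), key=time_to_next_use)[0]

blocks = {
    "A": (4, {8: 3, 16: 1}),     # reused often at short intervals
    "B": (2, {128: 2, 256: 2}),  # long reuse intervals -> likely victim
}
print(leu_victim(blocks))  # "B"
```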

LEU is implemented using dynamic sampling and a proposed novel hardware design that maintains the required statistical record as a reuse interval histogram. I conceptualized and provided the hardware design for LEU, along with the update and victimization policy itself. I also proposed and implemented a sandbox technique to choose among the LRU, LEU, and MRU replacement policies using trend statistics collected from the system. I evaluated the LEU mechanism, using a heavily modified Cache Replacement Championship simulator, on relevant benchmark suites and against relevant policies, including the state-of-the-art replacement policy Hawkeye, which approximates optimal caching using a finite-length history. We show that LEU outperforms previous techniques for a broad range of scientific kernels, whose data reuses are longer than those in traces traditionally used in computer architecture studies.

Prefetcher throttling mechanism using Bloom filters (Research Co-Op Engineer), [AMD Research, Santa Clara, US, Fall 2018]

A prefetcher uses historical access patterns to fetch cache lines into the cache before they are accessed, thus increasing the hit ratio and improving performance. There are different kinds of prefetchers; the most common are the stream prefetcher, which detects ascending or descending access patterns; the stride prefetcher, which detects accesses linked by a single stride; and the region prefetcher, which tracks accesses within a narrow memory region triggered by different program counters (PCs). The project aimed to design low-area, low-power-overhead structures that estimate prefetcher accuracy and pollution in order to throttle L1 prefetchers. Three primary metrics determine the performance of prefetches: accuracy, the fraction of prefetches that are useful; coverage, the fraction of total misses eliminated because of prefetching; and timeliness of the prefetches.
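Just to make the first two metrics concrete, a toy calculation (the counter values are invented):

```python
# Illustrative computation of prefetch accuracy and coverage from counters.
def prefetch_metrics(useful_prefetches, total_prefetches, total_demand_misses):
    accuracy = useful_prefetches / total_prefetches        # fraction of prefetches used
    coverage = useful_prefetches / total_demand_misses     # fraction of misses eliminated
    return accuracy, coverage

# e.g., 800 of 1000 prefetches were used, against 2000 misses the
# prefetcher could have covered:
print(prefetch_metrics(800, 1000, 2000))  # (0.8, 0.4)
```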

In my project with AMD Research, I worked as an independent researcher on optimizing prefetcher performance via prefetcher throttling using probabilistic data structures such as Bloom filters.

In this project, I proposed and implemented a mechanism that uses probabilistic data structures (Bloom filters) to detect cache pollution and estimate accuracy and coverage (adaptive prefetching). A set-membership query on a Bloom filter is answered with a definite no or a maybe yes (allowing false positives). I performed an empirical study to determine filter sizes that give a much lower false-positive rate at a reasonable filter density. I found that computing the accuracy, coverage, and timeliness of all prefetchers at periodic intervals and throttling the prefetcher that is not performing well improves performance. For the benchmarks tested, most of the prefetches made by the stream and (slightly less accurate) region prefetchers are accurate and timely. A minimal Bloom filter sketch follows.
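A minimal Bloom filter of the kind used to remember recently prefetched lines; the bit-array size, hash count, and use of SHA-256 here are illustrative software choices, not the hardware design.

```python
# Minimal Bloom filter used to remember recently prefetched lines; a demand
# hit on a member counts toward accuracy, an eviction of an unused member
# toward pollution. Sizes and hash functions are illustrative only.
import hashlib

class BloomFilter:
    def __init__(self, bits=4096, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "little") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, key):
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

prefetched = BloomFilter()
prefetched.add(0x4000)                 # line brought in by the prefetcher
print(prefetched.maybe_contains(0x4000), prefetched.maybe_contains(0x8000))
```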

In the second half of this project, we implemented state-of-the-art pattern-based prefetchers to understand their performance on real-world computer systems running scientific computation kernels. I compared the Variable Length Delta Prefetcher (VLDP), which stores a history of deltas to predict blocks to prefetch, against a page-local delta prefetcher based on page-local references. I found that VLDP outperforms the page-local delta prefetcher on our benchmarks because of their variable access strides. In conclusion, using space-optimized Bloom filters to throttle prefetchers helps improve performance, and a VLDP prefetcher is helpful for boosting performance on scientific kernels, such as those used in DoE research.
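A toy, VLDP-flavored delta predictor to illustrate delta-history prefetching; the single-level table and block-granularity addresses are simplifications, not the actual multi-table design.

```python
# Toy delta-history predictor: index a table by the last delta seen and
# predict the delta most often observed to follow it.
from collections import defaultdict, Counter

class DeltaPredictor:
    def __init__(self):
        self.next_delta = defaultdict(Counter)  # last_delta -> Counter(next_delta)
        self.last_block = None
        self.last_delta = None

    def access(self, block):
        prediction = None
        if self.last_block is not None:
            delta = block - self.last_block
            if self.last_delta is not None:
                self.next_delta[self.last_delta][delta] += 1
            if self.next_delta[delta]:
                predicted_delta = self.next_delta[delta].most_common(1)[0][0]
                prediction = block + predicted_delta   # candidate prefetch target
            self.last_delta = delta
        self.last_block = block
        return prediction

p = DeltaPredictor()
for b in [10, 11, 13, 14, 16, 17]:      # repeating +1, +2 pattern
    print(b, "->", p.access(b))
```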

BDD-based synthesis of Boolean functions using memristor IMPLY logic (Research Intern), [Indian Institute of Technology (IIT), Kharagpur, Summer 2014] (Paper link: https://ieeexplore.ieee.org/document/7038601)

Modern computers are built from transistors that implement logic, and from capacitor/resistance-based memory, such as SRAM/DRAM, for volatile storage like caches and main memory. Non-volatile memory, where information is retained even after power is turned off, helps save state for future computation; storage devices like hard disks (HDDs/SSDs) fall into this category. Traditionally, the two have been kept separate because of the difference in the technologies used to implement computing and storage. In 2008, HP research labs first fabricated a two-terminal element, a few nanometers in size, possessing both memory (resistance) and logic properties. They termed this element a memristor, as it relates electric flux and charge. Changing the voltage applied across a memristor changes its resistance, which then remains the same until a new voltage is applied; this gives the memristor its non-volatile property.

Logic is deciphered from the resistance the memristor holds when it is read. Ron and Roff are the resistance values of the memristor in the closed (logic 1) and open (logic 0) states. Three different voltage levels, namely Vcond, Vset, and Vclear, are selectively applied to memristors to carry out logic operations. Logic synthesis using memristors relies on implementing logic gates with the memristor IMPLY operation. The material implication operation (A → B = ~A + B) can be implemented using two memristors and one resistor by applying appropriate voltages across the terminals; a Boolean-level sketch of composing gates from IMPLY follows.
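A Boolean-level simulation of composing gates from the IMPLY primitive plus a clear (FALSE) step; these are the textbook IMPLY constructions of NOT and OR, not the exact netlists or voltage sequences from the paper.

```python
# Simulating memristor IMPLY logic at the Boolean level: the primitive is
# q <- (p IMPLY q) = (NOT p) OR q, plus FALSE (clearing a memristor to 0).
def imply(p: int, q: int) -> int:
    return int((not p) or q)

# NOT a: clear a work memristor s to 0, then s <- (a IMPLY s) gives NOT a.
def not_gate(a: int) -> int:
    s = 0                 # FALSE: initialize the work memristor
    return imply(a, s)    # a IMPLY 0 = NOT a

# a OR b via two IMPLY steps: (NOT a) IMPLY b = a OR b.
def or_gate(a: int, b: int) -> int:
    return imply(not_gate(a), b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NOT a =", not_gate(a), "a OR b =", or_gate(a, b))
```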

My contribution to this project was proposing a synthesis mechanism to implement Boolean functions as logic circuits using memristors, with an improved design of a 2-to-1 multiplexer. I proposed two new implementations, devising both the crossbar design and the netlist operations: the first uses three memristors and eight implication steps, and the second uses five memristors and six implication steps.

I also proposed alternatives to support serial and parallel loading operations and discussed the implications of each for comparison. I was responsible for implementing a novel method of representing a Boolean function as a Reduced Ordered Binary Decision Diagram (ROBDD) of 2-to-1 multiplexers that maps onto the proposed memristor implementation: the generated ROBDD is mapped to a MUX netlist in which a 2-to-1 MUX replaces each node, showing that Boolean functions can be synthesized this way. A small sketch follows.
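A small sketch of the node-to-MUX view of an ROBDD, hardcoding a tiny diagram for f(x, y) = x AND y purely as an example; a real flow would obtain the reduced, ordered diagram from a BDD package before mapping it to the memristor netlist.

```python
# Sketch: evaluate a small ROBDD by treating every internal node as a
# 2-to-1 MUX whose select line is the node's variable.
# Node = (variable, low_child, high_child); leaves are 0 or 1.
# Example ROBDD for f(x, y) = x AND y.
robdd = ("x", 0, ("y", 0, 1))

def mux(select: int, in0, in1):
    return in1 if select else in0

def evaluate(node, assignment):
    if node in (0, 1):                      # terminal node
        return node
    var, low, high = node
    return mux(assignment[var], evaluate(low, assignment), evaluate(high, assignment))

for x in (0, 1):
    for y in (0, 1):
        print(x, y, evaluate(robdd, {"x": x, "y": y}))
```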

To see more or discuss possible work, let's talk >>