Cloud Computing:



Straggler Management with Coding and Queueing Strategies:

The performance of large-scale distributed computing systems is adversely affected by stragglers when the execution time of a job is uncertain. Managing stragglers requires redundancy in execution. However, to control server cost, not all servers can be started at the outset; instead, a tiered strategy (also called multi-forking) is needed. The figure alongside shows the tradeoff between service completion time and server utilization cost for shifted-exponential service times with different forking times and different numbers of initial servers. We study such tradeoffs to determine the parameters at which the system should be operated. These parameters in turn influence the task sizes in the split of tasks over the servers, since straggler guarantees must still be met. Multi-forking provides the flexibility that the results of some tasks can be observed before new tasks are launched, which decreases execution times. Thus, efficient straggler management requires jointly splitting tasks (using coding-theoretic schemes) and determining system parameters (using queueing-theoretic techniques) that govern the tradeoff between service completion time and server utilization cost.
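
To make the latency/cost tradeoff concrete, the following Monte-Carlo sketch simulates a two-stage (multi-forking) launch policy with shifted-exponential service times. The policy, parameter values, and cost accounting (servers billed from start until the job finishes) are illustrative assumptions, not the schemes analyzed in this work:

```python
import random

def shifted_exp(shift, rate, rng):
    # Shifted-exponential service time: constant startup plus exponential tail.
    return shift + rng.expovariate(rate)

def simulate_two_fork(n0, n1, t_fork, k, shift=1.0, rate=1.0, seed=0, trials=2000):
    """Monte-Carlo sketch of a two-stage (multi-forking) launch policy.

    n0 servers start at time 0; n1 more are forked at t_fork.  The job
    finishes when the k earliest task completions are collected.  Returns
    (mean completion time, mean total server-busy time) -- a proxy for
    the latency / server-utilization-cost tradeoff discussed above.
    """
    rng = random.Random(seed)
    total_T, total_cost = 0.0, 0.0
    for _ in range(trials):
        finish = [shifted_exp(shift, rate, rng) for _ in range(n0)]
        finish += [t_fork + shifted_exp(shift, rate, rng) for _ in range(n1)]
        T = sorted(finish)[k - 1]          # time when the k-th task completes
        # Each server is billed from its start until min(its finish, T).
        starts = [0.0] * n0 + [t_fork] * n1
        cost = sum(max(0.0, min(f, T) - s) for f, s in zip(finish, starts))
        total_T += T
        total_cost += cost
    return total_T / trials, total_cost / trials
```

Sweeping `t_fork` and `n0` reproduces the qualitative tradeoff: forking later lowers utilization cost but raises completion time.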



Framework for Interdependent Task Scheduling:

Distributed computing applications are becoming increasingly sophisticated and heterogeneous, often involving collections of highly interdependent data-processing tasks and network flows that must work in concert to achieve mission-critical objectives. Traditional techniques that are agnostic to such interdependence do not optimize these collections well, because resource management is largely framed as an optimization of individual task- or flow-level metrics. In this work, we model different dependence constraints between tasks, including precedence, fractional precedence, and approximate computing. As an example, the figure alongside shows the different forms of constraints for a MapReduce job. With these models of task dependence, we have proposed efficient scheduling strategies along with performance guarantees.
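
As a concrete illustration of hard precedence constraints, the sketch below greedily list-schedules a DAG of tasks on identical machines. This is an illustrative heuristic, not the schedulers proposed in this work, and the task names and durations are made up:

```python
from collections import defaultdict, deque

def schedule_dag(durations, precedences, num_machines):
    """List-scheduling sketch for tasks with precedence constraints.

    durations: {task: processing time}; precedences: list of (u, v) pairs
    meaning u must finish before v starts.  Greedily assigns each ready
    task to the earliest-free machine.  Returns {task: (start, finish)}.
    """
    succ, preds = defaultdict(list), defaultdict(list)
    indeg = {t: 0 for t in durations}
    for u, v in precedences:
        succ[u].append(v)
        preds[v].append(u)
        indeg[v] += 1
    ready = deque(sorted(t for t in durations if indeg[t] == 0))
    free = [0.0] * num_machines            # next free time of each machine
    finish, schedule = {}, {}
    while ready:
        t = ready.popleft()
        m = min(range(num_machines), key=lambda i: free[i])
        # A task starts when its machine is free AND all predecessors are done.
        start = max([free[m]] + [finish[p] for p in preds[t]])
        end = start + durations[t]
        free[m], finish[t], schedule[t] = end, end, (start, end)
        for v in succ[t]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return schedule
```

For a MapReduce-like job with two maps feeding one reduce, the reduce correctly waits for both maps to finish before starting.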


Erasure Codes in Distributed Storage for Effective Latency:

Modern distributed storage systems offer large capacity to satisfy the exponentially increasing need for storage space. They often use erasure codes to protect against disk and node failures and increase reliability, while trying to meet the latency requirements of applications and clients. To meet these requirements, the code design, chunk placement, and chunk access (when a client issues a request) need to be jointly optimized. We propose new solutions for the joint latency-cost tradeoff and implement these solutions on the Tahoe public grid and AT&T Cloud Systems. The figure alongside depicts the key design attributes of a general cloud system that we are working towards, for a vision of an optimized dynamic storage system.
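
To illustrate why code design affects latency, the sketch below estimates the mean read latency of an object stored under an (n, k) MDS erasure code when the client fetches all n chunks in parallel and decodes as soon as the fastest k arrive. The i.i.d. exponential chunk-fetch times are a simplifying assumption for illustration, not a measured model:

```python
import random

def mds_read_latency(n, k, rate=1.0, seed=1, trials=5000):
    """Sketch: mean download latency under an (n, k) MDS erasure code.

    The client requests all n chunks in parallel and decodes once any k
    arrive, so latency is the k-th order statistic of the n chunk-fetch
    times (modeled here as i.i.d. exponential with the given rate).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        times = sorted(rng.expovariate(rate) for _ in range(n))
        total += times[k - 1]              # k-th fastest chunk completes decode
    return total / trials
```

Comparing, e.g., a (7, 4) code against plain (4, 4) striping shows the latency benefit of the extra parity chunks: the slowest fetches no longer matter.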


Erasure Codes in Distributed Storage for Effective Repair Bandwidth:

The reliability of erasure-coded distributed storage systems, as measured by the mean time to data loss (MTTDL), depends on the repair bandwidth of the code. Thus, efficient coding schemes that reduce repair bandwidth when a limited number of disks fail are important. We developed the first erasure-code designs that achieve intermediate points between the two extremes of the repair-bandwidth versus storage tradeoff and are asymptotically optimal. Further, we characterized the impact of optimal erasure-code designs on the mean time to data loss of the system. The figure alongside shows the state transitions of a serial-repair system, accounting for repair in erasure-coded systems.
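
The serial-repair state diagram can be modeled as a birth-death Markov chain on the number of failed nodes; the sketch below solves that chain for the MTTDL. The single-repair-server model and the failure/repair rates are illustrative assumptions, not the analysis in this work:

```python
def mttdl_serial_repair(n, k, fail_rate, repair_rate):
    """Sketch: MTTDL of an (n, k) code under serial repair.

    State i = number of failed nodes; state i fails at rate
    (n - i) * fail_rate and (for i > 0) repairs one node at rate
    repair_rate.  Data is lost once n - k + 1 nodes have failed.
    Solves the expected time to absorption starting from state 0.
    """
    m = n - k + 1                           # index of the absorbing (loss) state
    # Equations: (lam_i + mu_i) T_i - lam_i T_{i+1} - mu_i T_{i-1} = 1
    A = [[0.0] * m for _ in range(m)]
    b = [1.0] * m
    for i in range(m):
        lam = (n - i) * fail_rate
        mu = repair_rate if i > 0 else 0.0
        A[i][i] = lam + mu
        if i + 1 < m:
            A[i][i + 1] = -lam              # T_{m} = 0 at the absorbing state
        if i > 0:
            A[i][i - 1] = -mu
    # Gauss-Jordan elimination (the system is tiny and diagonally dominant).
    for i in range(m):
        piv = A[i][i]
        for j in range(i, m):
            A[i][j] /= piv
        b[i] /= piv
        for r in range(m):
            if r != i and A[r][i]:
                f = A[r][i]
                for j in range(i, m):
                    A[r][j] -= f * A[i][j]
                b[r] -= f * b[i]
    return b[0]                             # T_0 = MTTDL from the all-healthy state
```

For mirroring (n = 2, k = 1) this reproduces the classical closed form (3λ + μ) / (2λ²), a useful sanity check on the chain.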


Delivering Deadline-Based Services through Virtualization:

Virtualized cloud-based services can take advantage of statistical multiplexing across applications to yield significant cost savings. However, achieving similar savings with real-time services is a challenge. In these works, we seek to lower a provider's costs for real-time IPTV services through a virtualized IPTV architecture and through intelligent time-shifting of selected services. We provide a generalized framework for computing the amount of resources needed to support multiple services without missing the deadline of any service. We formulate the problem as an optimization with a generic cost function, and consider multiple forms of the cost function (e.g., maximum, convex, and concave functions) reflecting the cost of providing the service. The solution to this formulation gives the number of servers needed at different time instants to support these services. Our results show about a 31% improvement in costs for the deployed IPTV services. Further, Video on Demand (VoD) traffic can be delivered using multicast: we showed how IP multicast can be used to reduce the load on the VoD server, by modeling the on-demand service and designing an optimal algorithm that minimizes server transmissions. Trace data from a deployed AT&T VoD service suggests that our approach reduces server bandwidth by as much as 65% compared to the standard architecture. The graphic shows AT&T trace data with the distribution of customers receiving different short video segments in a 10-minute interval.
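
As a simplified instance of the "maximum" cost form, the sketch below computes the smallest constant server pool that meets every deadline when work arrives in discrete slots and each unit is due a fixed lag later. The slot granularity, unit-rate servers, and fixed lag are assumptions for illustration, not the paper's model:

```python
def min_peak_servers(arrivals, lag):
    """Sketch: minimum constant server pool so no request misses its deadline.

    arrivals[i] = work arriving in slot i + 1, due by the end of slot
    i + 1 + lag; each server completes one unit of work per slot.
    Feasibility requires that by every slot T the pool has served all
    work due by T, i.e.  c * T >= (total work due by T),  so the minimum
    pool is the maximum over T of ceil(work_due_by_T / T).
    """
    horizon = len(arrivals) + lag
    best = 0
    for T in range(1, horizon + 1):
        due = sum(a for i, a in enumerate(arrivals) if i + 1 + lag <= T)
        if due > 0:
            best = max(best, (due + T - 1) // T)   # ceil(due / T)
    return best
```

The effect of time-shifting is visible directly: a larger lag lets the same burst of arrivals be spread over more slots, shrinking the peak pool.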
