ETR-RT14

IEEE Communications Society (ComSoc)
Technical Committee on Communications Quality & Reliability (CQR)

Emerging Technology Reliability Roundtable

Monday, May 12, 2014

In Conjunction with IEEE CQR 2014 International Workshop
Westward Look Wyndham Grand Resort & Spa
245 E. Ina Rd
Tucson, AZ 85704, USA
http://westwardlook.com/

Scope of the Roundtable

Discuss and identify the RAS (Reliability, Availability and Serviceability) challenges, requirements and methodologies in emerging technology areas such as Cloud Computing, Wireless/Mobility, NFV (Network Functions Virtualization), SDN (Software Defined Networking), and similar large-scale distributed and virtualized systems.
Discuss the RAS requirements and technologies for mission-critical industries (e.g., airborne systems, railway communication systems, and banking and financial communication systems), with the goal of promoting the inter-industry sharing of related ideas and experiences.
Identify potential directions for resolving identified issues and propose possible solutions.

From left to right: Spilios Makris, David Lu, Ying Chin (Bob) Yeh, Chunming Qiao, Mike Tortorella

(above) Gengui Xie, Brian Levy, Mehmet Ulema, Chi-Ming Chen

Monday, May 12 Presentations (All 9 Files Zipped):

Chair's Introduction, Spilios Makris
Reliability Challenges for Emerging Technologies Based Networks: A Long Road to Standardization, Spilios Makris
Reliability Aspects of Wireless/Mobility Network, David Lu
Ultra-Reliable Fly-By-Wire Computers for Commercial Airplanes’ Flight Controls Systems, Ying Chin (Bob) Yeh
Challenges and Opportunities in Improving Cloud Service Reliability and Availability, Chunming Qiao
Demystifying the Reliability of Cloud Services, Mike Tortorella
The RAS challenges of NFV, Gengui Xie
Reliability and NFV, Brian Levy
Vulnerabilities and Opportunities in SDN, NFV, and NGSON, Mehmet Ulema

Tuesday, May 13 Roundtable Readout to CQR 2014 Workshop

Related CQR 2014 Presentations

ETE Reliability Challenges and Opportunities in User Defined Network Cloud, David Lu, AT&T (Tuesday, May 13 Keynote)
Resiliency Challenges in Future Communications Infrastructure, Hang Nguyen, Intel (Thursday, May 15 Keynote)

Roundtable Chair: Spilios Makris, PhD, Palindrome Technologies, USA

Spilios Makris is currently the Director of Network Resilience and Business Continuity Management (BCM) at Palindrome Technologies. Spilios has extensive experience in BCM and network resilience, having served as Director and Senior Consultant at Telcordia Technologies (formerly Bellcore) for over 28 years, conducting studies and developing methodologies along with industry Best Practices for over 50 Tier 1 and 2 telecom companies, telecom vendors, and Telecom Regulatory Authorities (TRAs) worldwide. Spilios served as Chair, Vice-Chair, and Lead Contributor of the Standards T1A1.2 WG on "Network Survivability Performance" (later renamed PRQC Reliability Task Force) for 20 years. He successfully managed the development and regular update of Telcordia Generic Reliability Requirements documents, establishing them as the "de facto" industry standards (e.g., SR-332 on Reliability Prediction Procedure for Electronic Equipment).

Spilios received his PhD in Industrial Engineering & Operations Research from the University of Massachusetts at Amherst, Mass., an MS in Engineering Management from Northeastern University, Boston, Mass., and a Diploma (equiv. to MS) in Electrical & Mechanical Engineering from the National Technical University of Athens, Greece.

He is a Certified Business Continuity Professional (CBCP) by the Disaster Recovery Institute International (DRII) and a Senior Member of IEEE.

Topic: Reliability Challenges for Emerging Technologies Based Networks - A Long Road to Standardization

Abstract

There is growing concern in the telecom community about the reliability of Emerging Technologies based (IP-based) telecommunications networks, including the services provided under failure conditions. This presentation will give a brief historical perspective on the standardization efforts regarding the reliability of telecom networks from the conceptual perspective (from component to system to network level). Then, it will discuss the reliability issues for networks based on Emerging Technologies (e.g., SDN, NFV, Cloud Computing) and the industry's challenge to maintain momentum by avoiding a protracted reliability standards effort.

David Lu, VP, AT&T, USA

David is currently Vice President of Business Network & Corporate Solution IT, responsible for Global Service Assurance, Network Capacity Management, Field Operation Dispatching, and Business Billing Solutions at AT&T. He leads an organization of over 3,500 people across the globe.

David is a well-respected leader in software architecture and engineering, network performance and traffic management, large data DB implementation/mining/analytics, software reliability and quality, and network operations process engineering.

Since joining AT&T Bell Labs in 1987, he has served in various lead positions at AT&T. He holds 21 patents and has frequently appeared as a guest speaker at technical and leadership seminars and conferences throughout the world.

David is married with two children and currently lives in Dallas, Texas.

Country of Birth: Shanghai, China (中国, 上海)

Personal Interests: Classical music (performs and teaches cello), history, traveling, arts, and table tennis.

Topic: Reliability Challenges in the Wireless/Mobility Networks

Abstract

The following areas will be covered:

Radio Access Network (RAN) Optimization & Self-Organizing Network (SON)
Small Cell Deployment and Coverage
User Experience & Network Performance Correlation
Big Data Analytics for End-To-End (ETE) Trouble Isolation and Preventive Maintenance
ETE Service Management
LTE QoS and Monetization

Dr. Ying Chin (Bob) Yeh, IEEE Fellow; Technical Fellow, Boeing Commercial Airplanes

Ying Chin (Bob) Yeh joined Boeing Commercial Airplanes in 1981 and has worked on Boeing Fly-By-Wire (FBW) computer development programs (FFM, 7J7, 777, 7E7, 787, 777X) since 1984, specifically on computer architectures and redundancy management schemes to achieve and certify safety-critical electronic systems with functional integrity and availability at the 10^-10 per hour level.

He received his Ph.D. from the University of Ottawa in 1973; M.S. from National Taiwan University, Taiwan, in 1970; B.S. from National Cheng Kung University, Taiwan, in 1967, all in Electrical Engineering.

Bob is an IEEE Fellow for contributions to ultra-reliable real-time embedded systems, and a member of IFIP (International Federation for Information Processing) Working Group 10.4: Dependability and Fault Tolerance.

Topic: Ultra-Reliable Fly-By-Wire Computers for Commercial Airplanes’ Flight Controls Systems

Abstract

This presentation begins with the fundamental concept of dependability developed by the dependability and fault-tolerance technical community. It then covers the 777 FBW design philosophy for safety, including common-mode failures, single-point failures, and dissimilarity.

The command/monitoring concept for the self-monitoring computing channel of the 777 Primary Flight Computer (PFC) is elaborated. The closed-loop (monitoring) concept of the 777 Actuation Control Electronics (ACE) is described. The presentation concludes with the concept and features of fail-passiveness in the 777 PFC-ACE data paths.

With building blocks of fail-passive electronics, systems with extremely high functional integrity and availability can be configured via hardware redundancy.

Professor Chunming Qiao, IEEE Fellow, the State University of New York (SUNY) at Buffalo

Professor Chunming Qiao directs the Lab for Advanced Network Design, Analysis, and Research (LANDER), Department of Computer Science & Engineering at SUNY Buffalo with current foci on survivability/availability issues in cloud computing, cyber transportation systems, and smartphone systems.

He pioneered research on optical burst switching (OBS) in 1997. His work on integrated cellular and ad hoc relaying systems (iCAR) in 1999 is recognized as a harbinger of today's push toward convergence among heterogeneous wireless technologies, and has been featured in BusinessWeek, Wireless Europe, and other outlets. He has published extensively, with an h-index of about 60 (according to Google Scholar), and has given more than a dozen keynotes and numerous invited talks on the above research topics. He also has 7 US patents and has served as a consultant for several IT and telecommunications companies since 2000. His research has been funded by several major IT and telecommunications companies including Alcatel Research, Fujitsu Labs, Cisco, Google, NEC Labs, Nokia Research, Nortel Networks, Sprint Advanced Technology Lab, and Telcordia.

He has chaired and co-chaired a dozen international conferences and workshops, and served on the editorial boards of several leading IEEE journals. He was the chair of the IEEE Technical Committee on High Speed Networks (HSN) and of the IEEE Subcommittee on Integrated Fiber and Wireless Technologies (FiWi), which he founded. He was elected an IEEE Fellow for his contributions to optical and wireless network architectures and protocols.

Topic: Challenges and Opportunities in Improving Cloud Service Reliability and Availability

Abstract

Cloud services may be disrupted by various failures ranging from very frequent small-scale failures (such as a few isolated individual server/switch failures) to less frequent, yet non-negligible, large-scale failures (such as rack or cluster failures). With our growing dependence on cloud services for both commercial and personal use, their reliability and availability have become increasingly critical. Despite existing (mostly ad hoc) approaches to improving cloud service reliability and availability, a recent report found that, on average, a service outage lasts about 134 minutes, and that these service outages cost about $426 billion in losses worldwide annually. In addition, existing SLAs are often loosely defined, and the lack of reliability/availability guarantees was cited as the top concern over cloud services among IT professionals in a 2012 global survey. In this talk, I will discuss both the challenges and opportunities related to service availability prediction, resource provisioning, and SLA contract design from the perspective of cloud service providers, and present our work on cost-effective solutions to problems ranging from creating survivable virtual infrastructures in a distributed multi-datacenter environment to availability-aware VM placement/allocation.

Dr. Michael Tortorella, Research Professor, Rutgers, the State University of New Jersey

Dr. Tortorella is a leading communications industry expert in reliability management, engineering, modeling, and life data analysis. Over a 26-year career at Bell Laboratories he was responsible for research and implementations in fundamental system, network, and service reliability engineering methodologies as well as for management of reliability in such critical projects as the SL-280 undersea cable system, the world's first application of fiber-optic technology in an intercontinental, undersea system. He played a major role in many AT&T and Lucent product reliability studies, culminating in the creation of CADRE, a reliability modeling system for circuit packs that encompasses circuit simulation, thermal analysis, and uncertainty modeling in a single package that is fully integrated with computer-aided design systems used for circuit pack creation.

Formerly a Distinguished Member of Technical Staff in the Design for Reliability Processes and Technologies Group in Bell Laboratories, Dr. Tortorella is now Managing Director of Assured Networks, LLC, a consultancy focused on network and service reliability and performance improvement. He is concurrently a research professor of industrial and systems engineering at Rutgers University and an adjunct professor at Stevens Institute of Technology. In addition to teaching courses in industrial engineering, operations research, and statistics, he maintains a robust research program that has direct impact on the concerns of the CQR. This program includes investigations into how stochastic flows in an IP network determine the performance and reliability of services carried on those networks, modeling frameworks for control of IP networks under stressed conditions, and foundational issues in queueing theory. Additional current research interests include stochastic flows, network performance, management, and control, stochastic processes and their applications to reliability, life data analysis, and next-generation networks, as well as design for reliability methods and technologies. Dr. Tortorella has published extensively in these areas. At Bell Labs, his responsibilities included systems and reliability engineering for next-generation networks. He is Advisory Editor for Quality Technology and Quantitative Management, where he has worked to increase the number of publications pertaining to the communications industry. He was formerly Area Editor for Reliability Modeling and Optimization for the IIE Transactions on Reliability and Quality Engineering and was Guest Editor of a recent issue on Reliability Economics. He also served as an Associate Editor of Naval Research Logistics and was Guest Editor for a recent issue on Computations in Networks.

Dr. Tortorella received his Ph.D. in mathematics from Purdue University.

Topic: Demystifying Reliability Engineering and Management for Cloud Services

Abstract

Cloud services, including computing, data backup, file sharing, etc., are a major source of revenue for telecom providers. They are often advertised as higher-reliability, safer, and more-secure alternatives to local storage and computing. Service providers need to understand how their customers perceive the reliability of these services so that they can back up their claims and sensibly offer SLAs. This presentation offers a practical perspective on how reliability of cloud services can be evaluated and managed through straightforward quantitative models. We also point out some challenging problems that remain.

Gengui Xie, VP of R&D Competence Center, Huawei Technologies, China

Gengui Xie joined Huawei in 1996 and has more than 20 years of experience in the telecom area. He is a leading expert on network management systems and design for RAS (Reliability, Availability and Serviceability). He is now VP of the Huawei R&D Competence Center, with particular responsibility for product architecture design, reliability, serviceability, energy saving and emission reduction, and technical planning and solutions.

Gengui graduated from South-East University in China.

Topic: Design for the RAS challenges of NFV

Abstract

The telecom industry is transitioning from legacy networks to NFV (Network Functions Virtualization), which will bring profound changes to the industry. Among the challenges raised by NFV, RAS (Reliability, Availability, Serviceability) issues are critical and should be solved before NFV's commercial deployment. This presentation will introduce the new RAS challenges raised by NFV, the new technologies we are working on, and the latest industry standardization progress.

Brian Levy, CTO, SP Sector EMEA, Juniper Networks, UK

Brian has forty years of experience in the Communications, Media and Entertainment Industry. He began his career with British Telecom in 1970 where he became an executive engineer and received a sponsorship to attend Salford University where he obtained an honours degree in electronic communication in 1978.

From there Brian moved to the BBC and spent nine years with them working in all aspects of Broadcasting. He then joined AT&T and became the Director of Network Services for the EMEA region. At AT&T Brian led the deployment of the EMEA Interspan frame relay network and deployed AT&T's first IP backbones in EMEA and major Web hosting services in the region.

Brian was one of the co-founders of Aduronet, a $67 million start-up company that developed a transformational network architecture.

In 2002 he rejoined BT as the Group Technology Officer for Service Strategy and Innovation and initiated BT Vision (IP TV), BT FON (Wi-Fi sharing) and many of BT's leading services of today.

In 2006 he joined HP as the CTO for their worldwide $1bn software business in the Communications, Media and Entertainment sector.

Brian joined Juniper as VP & CTO for the Junos Applications Software Business in Jan 2012 and now holds the position of CTO for Juniper’s $1bn Service Provider business in EMEA.

Topic: Reliability and NFV

Abstract

In this presentation we explain the basic architecture and lifecycle associated with the ETSI NFV model and then, using this context, look at the implications from a reliability perspective:

The splitting of the management plane
The overall system reliability
The VNF runtime issues
Resiliency at different levels of abstraction
Infrastructure resiliency

We conclude with a short summary.

Professor Mehmet Ulema, Computer Information Systems, Manhattan College, New York

Previously, Dr. Ulema held management and technical positions at AT&T Bell Labs, Bellcore, Daewoo Telecom, and Hazeltine. Dr. Mehmet Ulema has more than 30 years of experience in the telecommunications field as a professor, director, project manager, researcher, systems engineer, network architect, and software developer. Dr. Ulema has been involved in a variety of telecom projects including Software Defined Networks, IP based Multimedia Services (IMS), network management, wireless networks, service overlay networks and architectures, intelligent networking, and multimedia communications.

He has been on the editorial board of a number of journals including the IEEE Transactions on Network and Service Management, Elsevier Journal of Computer Networks, ACM/Springer Journal of Wireless Network, and the Springer Journal of Network and Services Management.

He has been an active member of IEEE. Currently he is the Director of Standards Development in IEEE ComSoc. Dr. Mehmet Ulema has been involved in many international conferences and workshops. He served as the Technical Program Chair for two premier ComSoc conferences: GLOBECOM 2009 and ICC 2006. He also served as the General Co-chair of NOMS 2008 and ISCC 2012. Currently, he is serving as a General Co-chair of IEEE BlackSeaCom 2014.

He received his MS and Ph.D. in Computer Science from Polytechnic University (now called Polytechnic Institute of New York University), Brooklyn, New York, U.S.A. He also received BS and MS degrees from Istanbul Technical University, Turkey.

Topic: Vulnerabilities and opportunities in the emerging network and service technologies such as SDN, NFV, and NGSON

Abstract

After a brief, general discussion of emerging technologies and trends in networking and services such as SDN, NFV, and NGSON, the talk will focus on the RAS-related areas of these new technologies. Topics to be discussed include the implications of OpenFlow in the face of existing routing protocols, the insertion of RAS-related features into OpenFlow, the design and building of secure platforms with automatic failure recovery, fault-tolerant features, and self-organizing capabilities, as well as vulnerabilities observed in the design and implementation of these technologies.

Advisory Board, IEEE ComSoc CQR Technical Committee: Chi-Ming Chen PhD, AT&T Labs, USA

Chi-Ming Chen joined AT&T in 1995. His current responsibility is the operations support system (OSS) architecture. Prior to joining AT&T, Chi-Ming was with Bell Communications Research (Bellcore) from 1985 to 1995. He was a faculty member at Tsing Hua University, Hsinchu, Taiwan from 1975 to 1979.

He received his Ph.D. in Computer and Information Science from the University of Pennsylvania in 1985; M.S. in Computer Science from the Pennsylvania State University in 1981; M.S. and B.S. in Physics from Tsing Hua University, Taiwan, in 1973 and 1971 respectively.

Chi-Ming Chen is a senior member of IEEE and ACM. He is an Advisory Board Member of IEEE Communications Society (ComSoc) Technical Committee on Communications Quality & Reliability (CQR) and a member of the IEEE GLOBECOM & ICC Management & Strategy (GIMS) Standing Committee. He has chaired several GLOBECOM and ICC Industry Forums.

Last updated on Tuesday, October 13, 2015