My Web Markups - Hao Wei
a generator and a discriminator
GANs are a framework for teaching a DL model to capture the training data’s distribution so we can generate new data from that same distribution.
DCGAN Tutorial — PyTorch Tutorials 1.9.0+cu102 documentation
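A minimal, non-convolutional stand-in for the two-network setup described above (layer sizes and the 100-dimensional latent vector are my own choices, not the tutorial's DCGAN architecture):

    import torch
    import torch.nn as nn

    latent_dim = 100  # assumed size of the random noise vector

    # Generator: maps a latent vector to a flattened 28x28 image.
    generator = nn.Sequential(
        nn.Linear(latent_dim, 256), nn.ReLU(),
        nn.Linear(256, 28 * 28), nn.Tanh(),
    )

    # Discriminator: maps an image to a real/fake probability.
    discriminator = nn.Sequential(
        nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 1), nn.Sigmoid(),
    )

    noise = torch.randn(16, latent_dim)   # batch of latent vectors
    fake_images = generator(noise)        # samples from the learned distribution
    scores = discriminator(fake_images)   # probability that each sample is real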
The returned tensor shares the underlying data with the original tensor.
contiguous(), which will return a new contiguous tensor. In plain words, it will create a new memory space for the new tensor and copy the value from the non-contiguous tensor to the new tensor.
Difference between view, reshape, transpose and permute in PyTorch - jdhao's blog
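A small sketch of the contiguity point above (my own illustration, not from the blog post):

    import torch

    x = torch.arange(6).reshape(2, 3)
    t = x.t()                   # transpose: shares the underlying data, but is non-contiguous
    print(t.is_contiguous())    # False
    # t.view(-1) would raise an error because view() requires contiguous memory;
    # contiguous() first copies the values into a new, contiguous chunk of memory.
    flat = t.contiguous().view(-1)
    print(flat)                 # tensor([0, 3, 1, 4, 2, 5])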
Batch normalization is done separately on every mini-batch rather than on the global batch, so the results are not completely equivalent to running the same model with the global batch size.
The main idea here is that we should play around with different batch sizes until we find one that would be optimal for the specific neural network and dataset we are using.
Small batch sizes may lead to slow convergence of the learning algorithm.
Large batch sizes may cause bad generalization
Different neural networks and different datasets may have different optimal batch sizes.
Batch size and GPU memory limitations in neural networks | Towards Data Science
Gradient accumulation means running a configured number of steps without updating the model variables while accumulating the gradients of those steps and then using the accumulated gradients to compute the variable updates.
What is Gradient Accumulation in Deep Learning? | by Raz Haleva | Towards Data Science
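A minimal, self-contained sketch of that loop (the toy model and random data stand in for a real network and dataset):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)                       # toy model, stands in for a real network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    accumulation_steps = 4                         # mini-batches accumulated per update

    optimizer.zero_grad()
    for step in range(16):
        inputs = torch.randn(8, 10)                # mini-batch of 8 samples
        targets = torch.randint(0, 2, (8,))
        loss = loss_fn(model(inputs), targets)
        # Scale so the accumulated gradient matches one batch of 8 * accumulation_steps.
        (loss / accumulation_steps).backward()     # gradients add up in param.grad
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                       # apply the accumulated gradients
            optimizer.zero_grad()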
A custom Dataset class must implement three functions: __init__, __len__, and __getitem__.
we ideally want our dataset code to be decoupled from our model training code for better readability and modularity.
Datasets & Dataloaders — PyTorch Tutorials 1.9.0+cu102 documentation
Parameters are never broadcast between processes. The module performs an all-reduce step on gradients and assumes that they will be modified by the optimizer in all processes in the same way.
DistributedDataParallel — PyTorch master documentation
Besides, when loading the module, you need to provide an appropriate map_location argument to prevent a process from stepping into other processes' devices.
When using DDP, one optimization is to save the model in only one process and then load it to all processes, reducing write overhead.
users are responsible for balancing workload distribution across processes
DDP registers an autograd hook for each parameter given by model.parameters() and the hook will fire when the corresponding gradient is computed in the backward pass
Gradient synchronization communications take place during the backward pass and overlap with the backward computation.
DDP broadcasts model states from the rank 0 process to all other processes in the DDP constructor, so you don't need to worry about different DDP processes starting from different initial parameter values.
DataParallel is single-process, multi-thread, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi- machine training.
Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.9.0+cu102 documentation
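A condensed sketch combining several of the points above (rank-0 broadcast in the constructor, saving from one process, loading with map_location). Rendezvous setup (MASTER_ADDR/MASTER_PORT) is omitted and the checkpoint path is hypothetical:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def demo(rank, world_size):
        # One process per GPU; "nccl" is the usual backend for GPU training.
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        model = torch.nn.Linear(10, 10).to(rank)
        # The DDP constructor broadcasts rank 0's parameters to every other rank.
        ddp_model = DDP(model, device_ids=[rank])

        ckpt_path = "model_checkpoint.pt"   # hypothetical path
        if rank == 0:
            torch.save(ddp_model.state_dict(), ckpt_path)   # save from one process only
        dist.barrier()   # wait until the checkpoint exists before other ranks load it
        # map_location keeps each process on its own GPU instead of rank 0's device.
        map_location = {"cuda:0": f"cuda:{rank}"}
        ddp_model.load_state_dict(torch.load(ckpt_path, map_location=map_location))
        dist.destroy_process_group()

Each per-rank process would typically be launched with torch.multiprocessing.spawn or the torch.distributed.launch utility.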
The main costs are queuing delays for GPU kernels and contention for PCIe bus bandwidth.
Most models do not occupy much GPU memory by themselves; most of the memory used comes from the mini-batch process, allocated and released within a single mini-batch.
OSDI'20 paper review: ANTMAN: DYNAMIC SCALING ON GPU CLUSTERS FOR DEEP LEARNING | 高策
GPU Sharing for Deep Learning - Zhihu
Fault tolerance means a job is not affected by changes in the number of its processes; during elastic scheduling, the number of processes in a job grows or shrinks with the cluster's workload.
ElasticDL: Ant Financial open-sources an elastic distributed deep learning system based on TensorFlow · SOFAStack
application code can define and configure a parent logger in one module and create (but not configure) a child logger in a separate module, and all logger calls to the child will pass up to the parent.
Logging Cookbook — Python 3.9.6 documentation
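A minimal two-module sketch of that parent/child pattern (module and logger names are my own, not the cookbook's example):

    # main_module.py  (configures the parent logger)
    import logging
    import worker_module                       # hypothetical child module shown below

    logger = logging.getLogger("app")          # parent logger
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    worker_module.do_work()                    # child's messages propagate to "app"'s handler

    # worker_module.py  (creates, but never configures, a child logger)
    import logging

    logger = logging.getLogger("app.worker")   # child of "app"

    def do_work():
        logger.info("doing work")              # handled by the parent's handler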
even a tiny amount of progress every day adds up to a huge difference.
Our Favorite Stories About Battling Distraction | by Shaq Cheris | Creators Hub | Jun, 2021 | Medium
But when the PS and the Worker are on the same machine, issues such as mutual interference at the PCIe level need to be considered.
Tiresias: A GPU Cluster Manager for Distributed Deep Learning · Issue #133 · dyweb/papers-notebook
DDP uses multi-process parallelism, and hence there is no GIL contention across model replicas.
This is because the implementation of DataParallel replicates the model in every forward pass, and its single-process multi-thread parallelism naturally suffers from GIL contentions.
single-program multiple-data training paradigm.
PyTorch Distributed Overview — PyTorch Tutorials 1.9.0+cu102 documentation
always running allreduce in the bucket index order instead of actual bucket ready order.
So after the backward pass, the grad field on the same corresponding parameter across different DDP processes should be the same.
an asynchronous allreduce
This mode allows running backward on a subgraph of the model, and DDP finds out which parameters are involved in the backward pass by traversing the autograd graph from the model output and marking all unused parameters as ready for reduction.
registers autograd hooks
The reason for using the reverse order is because DDP expects gradients to become ready during the backward pass in approximately that order.
Then, each DDP process creates a local Reducer, which later will take care of the gradients synchronization during the backward pass.
Distributed Data Parallel — PyTorch 1.9.0 documentation
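Since synchronization happens inside backward(), skipping it is also exposed as an API; a small sketch using no_sync() to accumulate gradients locally before a synchronized step (ddp_model is a DDP-wrapped model as above; data_loader, loss_fn, optimizer, and accumulation_steps are placeholders):

    # Accumulate gradients locally for several backward passes, then synchronize once.
    for step, (inputs, targets) in enumerate(data_loader):
        if (step + 1) % accumulation_steps != 0:
            with ddp_model.no_sync():              # autograd hooks skip the allreduce
                loss_fn(ddp_model(inputs), targets).backward()
        else:
            loss_fn(ddp_model(inputs), targets).backward()  # allreduce fires here
            optimizer.step()
            optimizer.zero_grad()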
place linear layers and tensors on proper devices.
The high-level idea of model parallel is to place different sub-networks of a model onto different devices, and implement the forward method accordingly to move intermediate outputs across devices.
where the model is too large to fit into a single GPU
Single-Machine Model Parallel Best Practices — PyTorch Tutorials 1.9.0+cu102 documentation
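A toy sketch of the idea, along the lines of the tutorial (two GPUs assumed; layer sizes are made up):

    import torch
    import torch.nn as nn

    class TwoGPUModel(nn.Module):
        """Splits a network across cuda:0 and cuda:1 and moves activations between them."""
        def __init__(self):
            super().__init__()
            self.part1 = nn.Linear(128, 64).to("cuda:0")
            self.part2 = nn.Linear(64, 10).to("cuda:1")

        def forward(self, x):
            x = torch.relu(self.part1(x.to("cuda:0")))
            # Intermediate output must be moved to the device holding the next sub-network.
            return self.part2(x.to("cuda:1"))

    model = TwoGPUModel()
    out = model(torch.randn(32, 128))
    # Labels must live on the same device as the output when computing the loss.
    loss = nn.functional.cross_entropy(out, torch.randint(0, 10, (32,), device="cuda:1"))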
1. ensure that all of the workers are synchronized in their training and 2. do so in a manner that minimizes the overhead.
In many cases the framework’s performance is dependent on the model architecture, model hyperparameters, and other details of the distributed training implementation.
Before even considering scaling to multiple workers you should first make sure you have exhausted all potential for getting the most performance out of your single worker.
A priori, it is not immediately obvious how this goal should be achievable given the fact that data-distributed training includes the overhead of an additional step that does not exist when training on a single worker, i.e. the gradient sharing.
As we will see below, reaching this goal often requires tuning some of the elements of the training algorithm.
how to adjust the optimizer settings, how to minimize the overhead of sharing gradients, and how the training data is processed.
A Guide to (Highly) Distributed DNN Training | by Chaim Rand | Towards Data Science
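One common (though not universal) optimizer adjustment is the linear scaling rule heuristic: grow the learning rate with the number of workers because the effective batch size grows. A sketch, with the model, base learning rate, and worker count all assumed:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)            # placeholder for the real network
    base_lr = 0.1                       # learning rate tuned on a single worker
    world_size = 8                      # number of data-parallel workers (assumed)
    scaled_lr = base_lr * world_size    # linear scaling rule heuristic
    optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
    # A warmup schedule is commonly paired with this to avoid early instability.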
always tries to keep the scheduled resource(s) busy, if there are submitted jobs ready to be scheduled
Work-conserving scheduler - Wikipedia
since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
Programming Guide :: CUDA Toolkit Documentation
the graph of type cudaGraph_t contains the information defining the structure and content of the graph; and the instance of type cudaGraphExec_t is an “executable graph”: a representation of the graph in a form that can be launched and executed in a similar way to a single kernel.
The kernels will still execute in order (since they are in the same stream), but this change allows a kernel to be launched before the previous kernel completes, allowing launch overhead to be hidden behind kernel execution.
They address the above issue by providing a mechanism to launch multiple GPU operations through a single CPU operation, and hence reduce overheads.
If each of these operations is launched to the GPU separately, and completes quickly, then overheads can combine to form a significant overall degradation to performance.
there are overheads associated with the submission of each operation to the GPU – also at the microsecond scale
Getting Started with CUDA Graphs | NVIDIA Developer Blog
A stream is a sequence of commands (possibly issued by different host threads) that execute in order.
Programming Guide :: CUDA Toolkit Documentation
all future work submitted to stream
CUDA Runtime API :: CUDA Toolkit Documentation
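The same ordering semantics are visible from Python; a sketch using PyTorch's stream API, assuming a CUDA device is available (my illustration, not from the CUDA docs):

    import torch

    s = torch.cuda.Stream()              # a new stream, independent of the default stream
    a = torch.randn(1024, 1024, device="cuda")

    with torch.cuda.stream(s):
        # These two kernels go to the same stream, so they execute in issue order;
        # work queued on other streams may overlap with them.
        b = a @ a
        c = b @ a

    torch.cuda.current_stream().wait_stream(s)   # order the default stream after s
    torch.cuda.synchronize()                     # wait for all queued GPU work to finish
    print(c.shape)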
Also, when the tile size matches the hardware warp size, the compiler can elide the synchronization while still ensuring correct memory instruction ordering to avoid race conditions.
a template parameter
pass a group as an explicit parameter to a function and depend on a consistent interface across a variety of thread group sizes.
32 is a common choice, because it corresponds to a warp
Collective operations, or simply collectives, are operations that need to synchronize or otherwise communicate amongst a specified set of threads.
The fundamental type in Cooperative Groups is thread_group, which is a handle to a group of threads.
collective functions can take an explicit argument representing the group of participating threads.
allow kernels to dynamically organize groups of threads.
Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Developer Blog
The Nsight Systems CLI supports concurrent analysis by using sessions.
The actual profiling commands and data are transferred through a raw, unencrypted socket.
Nsight Systems User Guide :: Nsight Systems Documentation
when it is called enough times, it will interrupt program execution (interpretation), and send that code to be opt-compiled
not all optimization is actually good.
the union of the compiler and the interpreter
A Most Perfect Union: Just-In-Time Compilers | by Vaidehi Joshi | basecs | Medium
Redis is an extendable in-memory data structures server
Announcing RedisAI 1.0: AI Serving Engine for Real-Time Applications | Redis Labs
AAAI 2019 paper interpretation: convolutional neural networks keep advancing - Zhihu
the L2 set-aside cache portion is shared among all these concurrent CUDA kernels
Which specific memory accesses are classified as persisting (the hitProp) is random, with a probability of approximately hitRatio
A portion of the L2 cache can be set aside to be used for persisting data accesses to global memory.
When a CUDA kernel accesses a data region in the global memory repeatedly, such data accesses can be considered to be persisting. On the other hand, if the data is only accessed once, such data accesses can be considered to be streaming.
Programming Guide :: CUDA Toolkit Documentation
Neural network embeddings are learned low-dimensional representations of discrete data as continuous vectors.
t-Distributed Stochastic Neighbor Embedding (TSNE).
The operation of one-hot encoding categorical variables is actually a simple embedding where each category is mapped to a different vector.
An embedding is a mapping of a discrete — categorical — variable to a vector of continuous numbers.
a method used to represent discrete variables as continuous vectors
Neural Network Embeddings Explained | by Will Koehrsen | Towards Data Science
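A tiny sketch contrasting one-hot vectors with a learned embedding table (vocabulary size and dimension are arbitrary):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_categories, embed_dim = 1000, 16
    ids = torch.tensor([3, 17, 256])

    one_hot = F.one_hot(ids, num_classes=num_categories).float()  # 1000-dim, sparse, fixed
    embedding = nn.Embedding(num_categories, embed_dim)            # 16-dim, dense, learned
    vectors = embedding(ids)                                       # shape (3, 16)
    # The embedding weights are trained with the rest of the model, so nearby vectors
    # come to represent categories that behave similarly for the task.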
Containers are similar to VMs, but they have relaxed isolation properties to share the Operating System (OS) among the applications.
Automatic bin packing: You provide Kubernetes with a cluster of nodes that it can use to run containerized tasks. You tell Kubernetes how much CPU and memory (RAM) each container needs. Kubernetes can fit containers onto your nodes to make the best use of your resources.
Containers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the applications and ensure that there is no downtime
What is Kubernetes? | Kubernetes
The double dash (--) separates the arguments you want to pass to the command from the kubectl arguments.
Get a Shell to a Running Container | Kubernetes
Install Docker Engine on Ubuntu | Docker Documentation
Could not find "AddResource" method when pulling docker images - Stack Overflow
L1 transactions are 128 bytes, and L2 and texture transactions are 32 bytes.
Memory Statistics - Caches
When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
Programming Guide :: CUDA Toolkit Documentation
Scalability is the measure of a system’s ability to increase or decrease in performance and cost in response to changes in application and system processing demands.
Definition of Scalability - Gartner Information Technology Glossary
learning on directed or relational graphs, and how one can use learned graph embeddings for further tasks down the line, etc.
meaningful smooth embeddings
GCN model as a generalized, differentiable version of the well-known Weisfeiler-Lehman algorithm
multiplication with A means that, for every node, we sum up all the feature vectors of all neighboring nodes but not the node itself (unless there are self-loops in the graph). We can "fix" this by enforcing self-loops in the graph: we simply add the identity matrix to A.
filter parameters are typically shared over all locations in the graph
Graph Convolutional Networks | Thomas Kipf | University of Amsterdam
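A small numeric sketch of that fix (the adjacency matrix and features are made up); the normalization follows the GCN propagation rule's D^-1/2 (A + I) D^-1/2 form:

    import torch

    A = torch.tensor([[0., 1., 0.],      # toy adjacency matrix of a 3-node path graph
                      [1., 0., 1.],
                      [0., 1., 0.]])
    X = torch.eye(3)                     # toy node features

    print(A @ X)                         # each row sums neighbour features only, not the node's own

    A_hat = A + torch.eye(3)             # add the identity: enforce self-loops
    D_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    print(A_norm @ X)                    # now each node's own features are included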
[GNN] A long-form introduction to GCN - Zhihu
matrix representation of a graph.
Laplacian matrix - Wikipedia
How to understand Graph Convolutional Network (GCN)? - Zhihu
GNN survey: Review of Methods and Applications - Zhihu
Creating a cluster with kubeadm | Kubernetes
text encodings are independent of each other
In short, ZSL is trying to learn to classify a class given NO training samples
One successful employment of GNN in CV is using graphs to model the relationships between objects detected by a CNN based detector. After objects are detected from the images, they are then fed into a GNN inference for relationship prediction.
In the citation network, the task is to predict the label of each paper in the network, given the paper's citation relationships and the words cited in other papers.
In graph classification, the task is to classify the whole graph into different categories.
In link prediction, the task is to understand the relationship between entities in graphs and predict if two entities have a connection in between
In node classification, the task is to predict the node embedding for every node in a graph.
They all try to learn a function to pass the node information around and update node state by this message passing process.
A Spatial Convolutional Network adopts the same idea by aggregating the features of neighboring nodes into the center node.
The intuition of GNN is that nodes are naturally defined by their neighbors and connections.
a graph is in general hard to visualize for human interpretation
a graph does not have a fixed form
An Introduction to Graph Neural Network(GNN) For Analysing Structured Data | by Shanon Hong | Towards Data Science
Zero-shot learning aims at predicting a large number of unseen classes using only labeled data from a small set of classes and external knowledge about class relations.
existing language knowledge base
Applications of Zero-Shot Learning | by Alexandre Gonfalonieri | Towards Data Science
Systems for training and serving GNN models at scale
The core idea is to explore the relationships among data samples to learn high-quality node, edge, and graph representations.
GNNSys’21 – Workshop on Graph Neural Networks and Systems
A third characteristic, the use of Network Address Translation in addition to a software bridge, creates additional overhead which can affect network performance throughput and latency as well as potentially increases the consumption of CPU and memory resources.
This means that external systems and their networking components have no knowledge of, or way to route network traffic directly to, a KVM guest OS on a separate KVM host.
interfaces associated to the NAT are not, by default, visible outside of the KVM host running the NAT.
the bridge used with NAT-based connectivity is typically configured to use private IP addresses from a 192.168.x.x subnet.
NAT-based networking allows KVM guests sharing the same bridge to communicate together
KVM default NAT-based networking - IBM Documentation
However, there are requirements to run trusted Pods (i.e. Kubernetes plugin) in a native container like runc, and to run untrusted workloads with isolated sandboxes
documentation/containerd-kata.md at master · kata-containers/documentation · GitHub
the source and target domains are the same, yet the source and target tasks are different from each other.
A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning | by Dipanjan (DJ) Sarkar | Towards Data Science
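A common sketch of this in PyTorch: reuse a pretrained backbone and retrain only a new head for the target task (torchvision's ResNet-18 and the class count are used purely as an example):

    import torch
    import torch.nn as nn
    from torchvision import models

    num_target_classes = 5                         # classes in the new (target) task
    model = models.resnet18(pretrained=True)       # source task: ImageNet classification

    for param in model.parameters():               # freeze the pretrained backbone
        param.requires_grad = False

    model.fc = nn.Linear(model.fc.in_features, num_target_classes)  # new trainable head
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)    # only the head updates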
generating random variables from a given distribution
Generative Adversarial Networks belong to the set of generative models.
Understanding Generative Adversarial Networks (GANs) | by Joseph Rocca | Towards Data Science
An accessible explanation of Generative Adversarial Networks (GANs) - Zhihu
One of the biggest advantages of Kata Containers over traditional VMs is that it seamlessly plugs to existing container orchestration platforms like Kubernetes.
Contrary to the runC runtime, the Kata Containers runtime uses a hypervisor to provide isolation when spawning containers.
What is Kata Containers and why should I care? | Ubuntu
A Pod models an application-specific "logical host": it contains one or more application containers which are relatively tightly coupled.
Pods | Kubernetes
Consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.
Kubernetes Components | Kubernetes