Introduction¶
This work was done at Jio AI-CoE in collaboration with Mr. Divy Kala, a master's student in Computer Science at IIIT Hyderabad who interned with my team in the summer of 2020.
Project Idea¶
In this documentation we describe our work on an ambitious project. It is ambitious in the sense that it embraces the current shift in computer vision from 2D to 3D. The project was aimed at researching ways of reconstructing accurate 3D faces from pictures and/or videos. This work would open up avenues in a wide variety of domains including bio-metrics, augmented and virtual reality, medical procedures, animation, genetics, etc.
Work¶
The project started with looking at existing methods of reconstructing 3D faces. No discussion of 3D face reconstruction would be complete without discussing the 3D Morphable Models. This is where we first started.
3D Morphable Models¶
3D Morphable Models or 3DMMs are a traditional way of using a low dimensional parametric model to represent detailed faces. The model takes a few parameters as input and outputs a 3D face. A 3DMM can describe the 3D face space with PCA as
$S = \bar S + A_{id}\alpha_{id} + A_{exp}\alpha_{exp}$
where $S$ is a 3D face, $\bar S$ is the mean shape, $A_{id}$ is the principal axes trained on the 3D face scans with neutral expression and $\alpha_{id}$ is the shape parameter. $A_{exp}$ is the principal axes trained on the offset between expression scans and neutral scans and $\alpha_{exp}$ is the expression parameter. While reading about 3DMMs, we anticipated two problems that could eventually arise from using 3DMMs to reconstruct faces. Firstly, a low dimensional parametric model may not be able to capture the variations of real faces. Secondly, most of the 3DMM models already available are trained on foreigners. One such example is the 2003 paper which scans the faces of 100 males and 100 females, all Caucasian save for one Asian participant. This could potentially lead to poor performance when used on Indian faces.
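To make this concrete, here is a minimal numpy sketch of sampling a face from such a model. The array names, dimensions, and random values are illustrative assumptions, not those of any particular 3DMM release.
import numpy as np

# Illustrative dimensions (assumptions), not those of any particular 3DMM
n_vertices, n_id, n_exp = 1000, 199, 29

S_bar = np.zeros(3 * n_vertices)                 # mean shape, flattened as (x1, y1, z1, x2, ...)
A_id = np.random.randn(3 * n_vertices, n_id)     # identity principal axes (placeholder values)
A_exp = np.random.randn(3 * n_vertices, n_exp)   # expression principal axes (placeholder values)

alpha_id = np.random.randn(n_id)                 # shape parameters
alpha_exp = np.random.randn(n_exp)               # expression parameters

# S = S_bar + A_id * alpha_id + A_exp * alpha_exp
S = S_bar + A_id @ alpha_id + A_exp @ alpha_exp
vertices = S.reshape(-1, 3)                      # back to an (n_vertices, 3) array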
We decided to create our own dataset to replicate and analyze the existing methods of 3D face reconstruction.
Creating Synthetic Face Dataset¶
Using Unity¶
Divy created a small synthetic face dataset featuring 3D scans of 9 different people. It was created by animating a camera around each scan and rendering video at 120fps, 1920x1080. The videos were rendered in a scene lit by multiple directional lights. The background of the scene was a sandy textured wall lacking any strong features or characteristics. Additionally, two objects of interest, namely a blackboard and a poster of a dog, can be added to the background wall. Two snapshots from the dataset are shown below. Variations on this dataset can be created by changing the background texture, lighting, camera settings, etc. as needed.
Using FaceBlender¶
We also tried creating a synthetic dataset using FaceBlender, a plugin for Blender that lets us create 3D faces manually using images of a person with pose variations. A 30 minute interview of Shah Rukh Khan on YouTube was used to extract snapshots that were shot with the same focal length and sensor size, had a neutral expression, and were well lit. The manual fitting process looked a little something like this:
%%html
<img src='https://9tdp4q.bn.files.1drv.com/y4mF6ZsJ7q8p77YgfHnLQ7bMHDOZIX_zoNjY2PwSAoRqh1YkQjekLTBajF03OIBfF2C60IrUwlYrja3v06mHnb-eBW0Fx4vKz1BlgnF10Jm6hrMOGnrXOnsT-4qFD6eB0oCPoxlqOXKMIFi0GjREoKDf9DOTeqqmWR6ktAIhhOpNgkf4-QK9B8jms5MP7sUtynYq7MtKzJ-FFKJqVLr9ZUOOg?width=937&height=664&cropmode=none' width='35%'/>
<img src='https://9zdp4q.bn.files.1drv.com/y4mbsWUF2YuS5Y81uTqaEkALaXe93CHSFyiT6zZ3ah2rCgDo8oc64E5Sw2krCL4kR01kD6S9eu7lsd-8fJhNqUDx8tfkInIikp3mB8SoDQdVvhYUHgRdZSdVBthpeHkaSQZ2idS6kDVe8zC1Fsd_HRMHU7BojNPgxEjNZFixa_BrsVHoBpNCb08O8qV7j1A_VafNsgPcyjCbzC6zx9eZOtZtA?width=960&height=679&cropmode=none' width='35%'/>
<img src='https://9jdp4q.bn.files.1drv.com/y4mpr-BNbdoXGEavGNeS1zK2X-kn-nFtzE-yMzRJeNeKQwkYLsxLpScgKYXoF10WSyUgyZRvr4l0hl_EX-VZ19nJ3nUXS9hDGLPIonJ2WUVzJIKJ_0L32x5_--ArJENECId-zCrWjN89LyB5dj8u8HPqOIWFmr4p7AiPefkbKGUxO2TA8jC4csEfla8f6VHl9pYlOb6WCZVxXaejcFd3UNNCg?width=1097&height=791&cropmode=none' width='35%'/>
<img src='https://pde7eq.bn.files.1drv.com/y4mPk7MMvvxaUkO751xsRtB5Eu5EJtizd86y5AVeTgMKyCFRYRoKPI9evXoq4HbzjzSUgH8nokv-4AezZuHsHkoHSvTQooMrFiOSJG1lapX7j0Rf_4Q6NWUB3fIT7-CABuUPNrffXRNfL6gaAWdtgl3qFbs7N5JmJTOFQ2B7SkXgG2zTV3GBJZX10LopMPH2DzMYP9kM1NDzMAtpN_6_y78Eg?width=932&height=685&cropmode=none' width='35%'/>
The results were not encouraging; the reconstructed model did not resemble Shah Rukh Khan. FaceBlender works decently well on foreign faces, so this poor result could be a manifestation of using foreign faces as a prior in the reconstruction.
%%html
<img src='https://pje7eq.bn.files.1drv.com/y4mgexihR6Ay0k_d-7Dc4WgN35lU-XmGJf16qaqbNLxYCxBKngY46U9wKsprdTHSHD78y6MN1PGsqS9njy8dts8oNQ14b7YOLxsFJKIob1PxpjXJ-vS28CZATC1lfyNEJFbhuHg6jpb7dhj6qmdOMu47yp_YzgvtN6ANZ07ghehD2JpGiSgfrE7TWpBqkyKP8iLvR_tW3HluUOMN3iKdh0mbw?width=341&height=360&cropmode=none' />
<img src='https://9te7eq.bn.files.1drv.com/y4mFnPVYU_P4AMMoMRmgUzcWsdBSlkUeW5O9ZvtMqv3aVuyz7KLWI-e5WzwFNNoum5ivcIkBvqtcO_4qSfzbGMVyfDaxHMdzrA-_sjCieT-hfs4f-gCphwiRnHe_eooYQaxiyn5n6ZmQST6B0K3mBJkWwRQf3TcARIRlk-O-nySUu8ThkcK_AK7P140dGMP4ADxLyVWCbI7llVustCgEvx5rA?width=339&height=358&cropmode=none' />
To test this theory out, we used FaceBlender on Divy's face. The results were even worse: the reconstruction was asymmetric and very distorted. This reveals the holes in current methods of 3D face reconstruction, namely that the methods do not work well with faces of all races.
%%html
<img src='https://9ze7eq.bn.files.1drv.com/y4mztcKYJW5RHtfwelcXnbXHC8Q5dl9S58HkIPBCrS-h-FwJOADvXlWPr0YYwzsKasilMxsmkNxRMPJEaIbmbRDygUvQ1w8706iHXry1AXOsxsrQiLAx9qGSmQ_qye7xRw_TNC5_hD9h-ujqFJre-29pg3HTcAxkXAH6p3NdqPCiYYfEg-TAWbOck9bNO2c9sDCnCG2LxSJUCn3SXnAXUz9BQ?width=584&height=539&cropmode=none' width='500'/>
<img src='https://adi7eq.bn.files.1drv.com/y4mujUy8BoH9TO7nnfMXXOd9SUcWt6k6iVY2J9_Bs0XNqULiwWm2LXnDRBOgHoyLLkT-QOy-oACHO_DULPERW80mCS8nPu2KP1VF5UfV6y8Ph0Ohi7uuD349g9VQWw_YVigbfT7FAi_aoUgyazydEq_owSxYGhdusijGnYk6vPlisXpZ9QZP7FWt4we8crE8vNkvp0PsQS_Qavb9Hp5XitMgw?width=744&height=648&cropmode=none' width='500' />
<img src='https://ptcgnq.bn.files.1drv.com/y4m3_v264uirl5DBuYdbl_DrH77koxq6L5gcbEuu0RR1S4RvtIQqGgNzCkfhLf7_wQsVHut0nHBKiQXMswLnaGY_clg9MVMv0OpIHo6XFFnbFyMC2osNCg2jVJ4_5ZMIRtYN7vk0BvGYVzv96KcG1vRHdWbGz9VtZCvFEXynchqllQCq8hrNXFp2jNNnPUuUpWlAJsCB_1s7Fao0Djje6SiTQ?width=699&height=620&cropmode=none' width='500' />
<img src='https://pzcgnq.bn.files.1drv.com/y4mRLWa1z8x4_mskD5vAvk3A1g0gOBCkQq6i6oQ-3gxCmc6xXNrMVuBEflkFjQUyqOuKO2W4dQmRUmv_Dj57QYqPj3zMSe-vjPlKyUcDSC2nh-ds9GeN3LSRksrF5ZE-46ziVoKn7GfhHTwzsW64LaM5TO2WO-ntAuFJ03Hmb01AumXRvTge7SgbSO88_S-T1WRezQCd0I_Q_PycGWoy2Toxg?width=669&height=628&cropmode=none' width='500' />
<img src='https://9tcgnq.bn.files.1drv.com/y4mIqEn-X4MGavk_C1dEk2P4Y7VQ6ykqQ3eXsIKp7jQPEembO9ILH4u1_oF-SKmmJfwFiQK_-IDfJFhJAMHVG83Lxjrqi4P8TP7Z-mb5aytkvd4x7AbNBrCPc2v7tfzPlgczaV9boO3wm7izAq84cMCTGIBEeQv3j4Jsv5Ifd0OMx5Vgh-y55XD2yb1nFvbQbINHNRu_tFOLHEXWhizeAcdjg?width=788&height=576&cropmode=none' width='500' />
<img src='https://9jcgnq.bn.files.1drv.com/y4mNAUpgNlw_Psfohi4eQNimsN3XPbSki63omtGB11cwKrnxJIja3s9sLnXThICEucm4BmmVzojt92hvKIoSIbdB-mlxSZm8AInJOYMVxybvJix27ioRTL5SXbzqH1fWHAIXDkE6bbPWcwhaSotDHEa8C4uZ-UZsiAlCUz4IY6XYaIum7j2YBRGgbbD_VtKyMbi9a5JVaCFd1GkqCTBYKO57A?width=962&height=728&cropmode=none' width='500' />
3DDFA¶
A popular method of 3D face reconstruction is 3DDFA. 3DDFA, or 3D Dense Face Alignment, is a dense face alignment framework used to obtain the pose, depth map, 3D model, and facial landmarks from a single face image. 3DDFA is a robust method that can work with pictures of people taken from different angles. It can also estimate the position of facial features that have been occluded because of the angle from which a picture was taken.
The key idea behind 3DDFA is that it works by optimizing the model parameters, namely the scale, rotation, translation, shape, and expression of a 3DMM face, to give a good fit on an input image.
Cascaded Regression: 3DDFA uses cascaded regression with CNNs as the regressor, given by
$p^{k+1} = p^k + Net^k(Fea( I,p^k))$
where $p$ is the estimated model parameters, $Fea$ are the image features and $Net$ is the CNN structure and cost function.
3DDFA uses the 3DMM model described above, given by
$S = \bar S + A_{id}\alpha_{id} + A_{exp}\alpha_{exp}$
The 3D face can be projected orthographically on the image plane with
$V(p) = f * Pr * R * (\bar S + A_{id}\alpha_{id} + A_{exp}\alpha_{exp}) + t_{2d}$
where $V(p)$ is the construction-and-projection function, which builds the 3D face from the model parameters and then projects it onto the image plane. $f$ is the scale factor, $Pr$ is the orthographic projection matrix, $R$ is the rotation matrix and $t_{2d}$ is the translation vector. Together they form the model parameter vector to be regressed, $p = [f,R,t_{2d},\alpha_{id},\alpha_{exp}]^T$. The scale factor and rotation can be encoded together in a quaternion $[q_0,q_1,q_2,q_3]$, hence the final fitting objective becomes $p = [q_0,q_1,q_2,q_3,t_{2d},\alpha_{id},\alpha_{exp}]^T$.
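As a rough numpy illustration of $V(p)$ (a sketch under assumed array shapes, not the repository's actual implementation):
import numpy as np

def V(f, R, t2d, S_bar, A_id, alpha_id, A_exp, alpha_exp):
    # Construct the 3D shape (3 x N), mirroring the 3DMM equation above.
    S = (S_bar + A_id @ alpha_id + A_exp @ alpha_exp).reshape(3, -1, order='F')
    Pr = np.array([[1., 0., 0.],
                   [0., 1., 0.]])              # orthographic projection drops the z axis
    return f * (Pr @ R @ S) + t2d.reshape(2, 1)  # 2 x N projected vertices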
In 3D face reconstructions, there are two popular kinds of features, namely, image-view and model-view. In 3DDFA, both image-view and model-view features are used. The former sends the original image directly to the regressor so that the information provided by the image isn't lost, whereas the latter rearranges the image pixels according to the current best fitting model, making the feature free from pose variation. The model-view feature Pose Adaptive Feature (PAF), along with the image-view feature Projected Normalized Coordinate Code (PNCC) are used in 3DDFA.
Pose Adaptive Features (PAF): The idea behind PAF is to convolve along semantically consistent locations on the face. For the current model parameter p, the 3DMM face model is projected onto the image by
anchor = p @ (u_filter + w_filter @ alpha_shp + w_exp_filter @ alpha_exp).reshape(3, -1, order='F') + offset
(in 3DDFA/utils/paf.py) to get the 64x64x2 projected feature anchors. Now we crop a dxd patch (d=5 in this case) at each feature anchor and concatenate them into a $(64*5)\times(64*5)$ patch map, as shown below.
# code snippet from /utils/paf.py (inside the PAF generation function)
img_paf = np.zeros((64 * kernel_size, 64 * kernel_size, 3), dtype=np.uint8)
offsets = gen_offsets(kernel_size)
for i in range(kernel_size * kernel_size):
    ox, oy = offsets[:, i]                    # offset of this position within the dxd patch
    index0 = anchor[0] + ox
    index1 = anchor[1] + oy
    p = img_crop[index1, index0].reshape(64, 64, 3).transpose(1, 0, 2)
    img_paf[oy + delta::kernel_size, ox + delta::kernel_size] = p
return img_paf
This image is now free of pose variation under the current model parameter p, and is therefore ready for convolution at semantically consistent locations on the face. dxd convolutions are performed with stride d on the patch map to generate 64x64 feature maps.
Projected Normalized Coordinate Code: This image-view feature is obtained as follows. Firstly we normalize the 3D mean face in all three axes as given by
$NCC_d = \dfrac {\bar S_d - min(\bar S_d)} {max(\bar S_d) - min(\bar S_d)} $
where $d = x, y, z$. This distributes the 3D coordinate of each vertex uniquely between $[0,0,0]$ and $[1,1,1]$. Since NCC (Normalized Coordinate Code) has 3 channels, it can be used as the face texture when interpreted as RGB.
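A one-liner per axis computes the NCC; as a sketch (assuming `S_bar` holds the mean shape as an N x 3 array):
import numpy as np

def ncc(S_bar):
    """Normalize the mean shape to [0, 1] per axis; S_bar is (N, 3) with columns x, y, z."""
    return (S_bar - S_bar.min(axis=0)) / (S_bar.max(axis=0) - S_bar.min(axis=0))

# The result has one value in [0, 1] per vertex and axis, so it can be read as a
# per-vertex RGB colour, which is exactly the "texture" used to render the PNCC.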
Then, the face estimated with the current model parameter p is rendered using Z-buffer, colored by NCC as
$PNCC = \text{Z-Buffer}(V_{3d}(p), NCC)$
$V_{3d} = R * (\bar S + A_{id}\alpha_{id} + A_{exp}\alpha_{exp}) + [t_{2d},0]^T$
$\text{Z-Buffer}(a,b)$ renders the 3D mesh $a$ colored by $b$. The rendered image is called the Projected Normalized Coordinate Code (PNCC). This is stacked with the input image and sent to the CNN.
In the implementation, $NCC_d$ is a constant stored in train.configs/pncc_code.npy.
# code snippet from /utils/render.py
def cpncc(img, vertices_lst, tri):
    """cython version for PNCC render: original paper"""
    h, w = img.shape[:2]
    c = 3
    pnccs_img = np.zeros((h, w, c))
    for i in range(len(vertices_lst)):
        vertices = vertices_lst[i]
        # crender_colors(vertices, triangles, colors, h, w, c=3, BG=None) renders a mesh with
        # per-vertex colors: vertices [nver, 3], triangles [ntri, 3], colors [nver, 3];
        # it returns the rendered image [h, w, c].
        pncc_img = crender_colors(vertices.T, tri.T, pncc_code.T, h, w, c)
        pnccs_img[pncc_img > 0] = pncc_img[pncc_img > 0]
    pnccs_img = pnccs_img.squeeze() * 255
    return pnccs_img
Network: A cascaded network of CNNs is used. At iteration $k$, the model parameters $p^k$ are used to construct PNCC and PAF from the image $I$ and to train a two-stream CNN $Net^k$. The output of the network predicts the update to the model parameters, and this process is repeated iteratively, i.e.
$\Delta p^k = Net^k(PAF(p^k,I), PNCC(p^k,I)) $
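In pseudocode, the cascade amounts to the loop below; `construct_paf`, `construct_pncc` and `nets` are stand-ins for the feature construction and the trained two-stream CNNs, so this is a schematic sketch rather than the repository's code.
def cascaded_fit(image, p0, nets, construct_paf, construct_pncc):
    """Schematic 3DDFA-style cascade: each stage predicts an update to the parameters."""
    p = p0
    for net_k in nets:
        paf = construct_paf(p, image)    # model-view feature for the current parameters
        pncc = construct_pncc(p, image)  # image-view feature for the current parameters
        p = p + net_k(paf, pncc)         # delta p predicted by this stage
    return p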
Cost Function: Different parameters in the 3DMM create different impacts on the fitting accuracy. To obtain the best possible fit, different priorities must be assigned to each 3DMM parameter.
The author allows for three cost functions as described below.
Parameter Distance Cost (PDC): This is the simplest cost function which moves the initial model parameter $p^0$ closer and closer to the groundtruth model parameters $p^g$.
$E_{pdc} = ||\Delta p - (p^g - p^0)||^2 $
The disadvantage of this cost function is that it does not assign an internal priority to the model parameters. For example, updating the pose may improve the alignment more than updating the face shape, yet both are given equal priority.
Vertex Distance Cost (VDC): This cost function minimizes the vertex distances between the current and groundtruth 3D face.
$ E_{vdc} = ||V(p^0 + \Delta p) - V(p^g)||^2$ where $V(\cdot)$ is the face construction and projection. This cost function is not convex, so convergence to a good optimum is not guaranteed.
Weighted Parameter Distance Cost (WPDC): The parameters are explicitly weighted by their importance. $E_{wpdc} = (\Delta p - (p^g - p^0))^T diag(w) (\Delta p - (p^g - p^0))$,
where $w$ is the parameter importance vector $w=(w_1, w_2, ..., w_i, ..., w_p)$ with $w_i = ||V(p^{de,i}) - V(p^g)||/Z$ and
$p^{de,i} = (p_1^g, ..., p_{i-1}^g, (p^0 + \Delta p)_i, p_{i+1}^g, ..., p_p^g)$, where $p$ is the number of parameters, $p^{de,i}$ is the $i$-degraded parameter vector whose $i$th element comes from the predicted parameter $(p^0 + \Delta p)$ while the others come from the ground truth parameter $p^g$, and $Z$ is a normalizing term equal to the maximum of $w$.
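As a hedged numpy sketch of how the WPDC weights could be computed (here `V` is any callable that constructs and projects a face from a full parameter vector; all names are placeholders):
import numpy as np

def wpdc_weights(p_pred, p_gt, V):
    """Importance of each parameter: error caused when only that parameter is wrong."""
    w = np.empty(len(p_gt))
    V_gt = V(p_gt)
    for i in range(len(p_gt)):
        p_deg = p_gt.copy()
        p_deg[i] = p_pred[i]                    # i-degraded parameter vector
        w[i] = np.linalg.norm(V(p_deg) - V_gt)
    return w / w.max()                          # Z = max(w)

def wpdc_loss(delta_p, p0, p_gt, w):
    r = delta_p - (p_gt - p0)
    return r @ (np.diag(w) @ r)                 # (Δp - (p^g - p^0))^T diag(w) (Δp - (p^g - p^0))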
The problem with WPDC is that it models importance rather than priority; in practice the parameters become important sequentially during fitting, which WPDC does not capture.
Optimized Weighted Parameter Distance Cost: The author talks about another cost function which models the priority, but does not implement it in the code, saying it is not effective for this repo. The issue can be looked at here.
# While training, an appropriate argument to "--loss" can be given to choose the cost function. The default loss function is vdc.
# wpdc and pdc can also be chosen if needed; see /3DDFA/train.py for more details.
Results
The 3DDFA has been used on several different images. Some results are shown below.
Shah Rukh Khan's model created using FaceBlender and results of running 3DDFA on it.
%%html
<img src='https://knq2mw.bn.files.1drv.com/y4mBTwafCZEXyLkW_8AgQ74_w-WgPE7VXfLnhQXYt_KLJO06bVCEp6TkcdiBFqQ7mr19tQpPLy3qftN0c7akLMy0ULcAStfbtzUqJUqLWGiF8n520FDJH43u8OSIRbr-I5uVkYOuojEPyc58g2ZjK5RzQoaBkYOhQUpfifXB02cVuWoRR2LxZt2FyvaO8vDZywtpoJvAJoopud-JUeevk6J7w?width=210&height=298&cropmode=none'/>
<img src='https://ktq2mw.bn.files.1drv.com/y4mw5-ShvU3LpMf7tvwQuu9u_JOt2w8IxDNxM7qIgG2Gx3DbW41W0Bn9WE61I3pF1Pq_mRzqLTfCxLCBOei4v4S5fzOuAQW68ZmhBvCrWYMhciijI24rZrGrAgloRD7-u_dDDcnwc88MSzHFosDQ2JoM-FoUd9GjmX_vxyqNdVvA5XCQ3HFooOjTdjOpzX8SFf2xihC9EVN0VChubKlrC7Fow?width=206&height=293&cropmode=none'/>
<img src='https://jnq2mw.bn.files.1drv.com/y4mi9HcGv4DwPHmCs0synwus2fw9byFKxJyhDuQ4rhjyaOVkbcvSlqxZzrx3vUz3G-WH8g12RaHPfA5bXHHtOLxFSA2_-pHw_UlxfrcIyh-B_rQ4sPyG5cypIcWgOb1S804UkY6blJwvRIKOEn8aERR5cX74MGeImxQx-dGIMvCFSDtB0n6XdhxOISHH-AziX2OoxPxC0q-3rCThiJLebeGFA?width=215&height=304&cropmode=none'/>
<img src='https://mnprea.bn.files.1drv.com/y4mwkxzt2fRiIzq5vwtYs34oHr0kHYyRziYm3VH49PKFCtcADxBeuFUCGu-X9XXvJd3YfWHPXJaU7hES5vFmgsnkdNHgYt9HgSoROBBIe1p45hKrgtkMexvswXJ3JGd3u6oZ4gNjCm17OedG80ZEp5OgN7UH97KvwNgZXwk7g9_OlnXvG--3ts513txviBlhLfZFx3fEQAXaO-IRrcfoTeblw?width=201&height=284&cropmode=none'/>
Results of 3DDFA on synthetic dataset
%%html
<img src='https://jtq2mw.bn.files.1drv.com/y4mKJ21Eyk1KUDjoada2x0zNNZqz3NupAu5lYT8c04tNf-QuNVJRqtzWY7ulZFE_aUoWSlkMXiIguoh_n5Zl3Omcbm5atxY9PWwBVb55_8idG8Axdn59kmLIF6AXfh-0ZbaL-ljDLuitngAPc9IxL23hsl6s83KeQ88BK37ICXNV0-_AeZFtbMezR5yl3h4wiO2pNwcWgGnRZo9wskADU2DTg?width=199&height=283&cropmode=none'/>
<img src='https://ktprea.bn.files.1drv.com/y4mBzdm80XL2unkENoXHxITGxCaINFeh6rcxs9ejciOQ6mjopW3yAGxlT6hoZsN-r_VP0KPwZ8aUHGy020UeYm2m7Y6ppE7xhweGHCvzk4TU3TtgsJLc9gGBiPPLlidfSZnd-fbq4DBslnMgoKMmb_JRHbDZLhsJ99xfV-NChBBPYQPvlK_BoDiKmu56XMeXz727keIe-ZtwFY4Sk2BvUUxPw?width=200&height=282&cropmode=none'/>
<img src='https://kdprea.bn.files.1drv.com/y4mYFZlkMSsWl378hagjj4B2SseFTmZqtqbp5Rp2K7bxVJ_d-GqAMMlYeiEDKlVgzbnY75bgimLFuB2qx0BmkHM4gwsgxgyInxb19O_BD-3sP94zlgqtw-ZDYcumvR34rjfQV2Aaa8NcjDRrlZswn-xPmmLxnTMLmqkNgPFbemBC4B46mG_rStmKb6smm1G0hGhlDv9kPGRBAwezVZ6SKYRzw?width=200&height=282&cropmode=none'/>
<img src='https://jnprea.bn.files.1drv.com/y4mHNplVzWJy1GMp3puXoAt7qV_Rn5Q9uSH1lpbqLDBOzVvzeBV8mQQ1k7L3BjDaLCYYCocweE0z6bYJivFjIxXdWyJQrWAd2bD9Ah9nEplukN3j-U5Xptu-jixXP_SSgXS_o9zTDtSa-NdSFWRGZpFRxtCXnL3ZlIxgKF_QrRXtaVsmI3phaYvufMkGXWv9W23nB6Bi25gMq4HR5sn5gjucQ?width=200&height=283&cropmode=none'/>
%%html
<img src='https://jdprea.bn.files.1drv.com/y4mlmhRvckSYiZ5Vc7oA4vd58tavTLz7bCF8eiB9uI3NQxl3ExdCavX2WbnWXIlIbXgxqO81FV1wOhBMVkU4nxKmfBkciYt-VbCeJ5MBWpp-gQk9QYNIpsxEvTXIN-mOEknlTmLB7SwPwJnvYpynxiFdltuoO3igRC55hRCf7Gras_r-GXcmT_EayijUVuYEPanWiDMdFSOjz8Osc9RRPTgAQ?width=196&height=277&cropmode=none'/>
<img src='https://ktobjw.bn.files.1drv.com/y4mRgSRfn0-SYIslAYNZahaLbmIgkgOKcOS9smwqzCrZ597lvCOH_p8c0XFQaI2GBC3rmblslEnjlMgmfP7Pzi-TGJeQEdLteevC4dbEVco5jC37PDhzFGpMCA-02ZgTITwKg59klfqjPUAVYaY7IvC1vkhL5S8PAyKmxDitrpXbIRhZW-aoFl6EC65ZgyzO9c7FygmFjenMEeLiKmFB7vyHw?width=201&height=286&cropmode=none'/>
<img src='https://jdq2mw.bn.files.1drv.com/y4mVukaLl3w2_dFLmzA4cXkPyVOVvLU-M9cRAKjkcIwvooqU1AFhNIWcY67GpkwRtrv99rIQ5TDyBCkQBixU2oh6eRIwwQK0kvI2gu_fde6AfYSTUfyesraPhOOlEK3DS1dDK0P5JpxqmtnW6ytahw9ntMnJn8G9GeKwXbAVswtWNYzw1wz8DAJwWfgz4yAm3ohA9Se5zKRaxngSz8K0KAKzQ?width=202&height=286&cropmode=none'/>
<img src='https://knobjw.bn.files.1drv.com/y4mReKhImlMxKyLH2ZKnmb-DmYCWffWUW9tngCq9BtnMDkWDRqMSZK8WSCI5oWlu0LP_pQPAeyWUrFPEpNbSX-5fh-NpIC3KQIJ04TOul3fgdxB81IP-NG3axG9zCmniSIN1YmDe1KXTQgV6-wl3sUm9WUMil37Xr-B2lFpWZzyWeUdPNT7vzTkhCyptBQSVMbzb4cQ9EuS85_ao3MS_I2XHQ?width=202&height=286&cropmode=none'/>
Comparison of Shah Rukh Khan's FaceBlender 3D model (left) with 3DDFA output for the same.
%%html
<img src='https://kdobjw.bn.files.1drv.com/y4m3ZlbkN-d_l2gzMKp8jMtdpihY5qAbqdbigJRv0yLd1sNvJI5nkY7CCQvlsc35tW1pHND21x8rmgojRzFpkhWfEY_4GBXreJwSG_qDkB_yfLrT7RLLWjHyK1lem5LVDyGS86NYiSutGVRWNNIf_VPi3cXj3ipUcVhKmYASgNkO-pJmHslb6tCJTV_Vks7ALGoM_7c8TDowiVqJ5Mr-HQauQ?width=1382&height=675&cropmode=none' width='75%'/>
Analysis of the results The 3DDFA fits a 3DMM to the face. This means it is heavily dependent on the implementation of the 3DMM used. The number and variety of faces used in training the 3DMM model is expected to play a huge role in the results given by 3DDFA. In this case, since we already know that 3DMMs have not been trained on Indian faces, and the number of people used for training is possibly too few to capture the large variety of human faces in existence, we expect this method to be limited by the current 3DMM implementations.
- All reconstructed 3D faces looked similar and did not resemble the groundtruth faces, possibly owing to the little variety 3DMMs offer.
- Poses appeared to be estimated fairly accurately; the optimization function seems to work well for pose estimation.
- The facial landmark detection was fairly accurate. 68 landmarks have been shown in the images which appear to be well aligned. More landmarks can be extracted as 3DDFA is a dense alignment method.
The results were good for landmark and pose estimation, but the reconstructed face needs a lot of improvement. The method leaves a lot to be desired even for face reconstructions of foreigners. The quality of the reconstructions is not adequate for applications such as bio-metrics, genetics, medical procedures, etc.
EOS¶
In contrast with 3DDFA, EOS regresses sparse facial features to estimate the pose, and then tries to fit a 3DMM model by minimizing the differences in detected sparse facial landmarks and sparse facial landmarks obtained from 3DMM reprojection.
The paper “3D Face Tracking and Texture Fusion in the Wild” uses single monocular images. It utilizes a cascaded regressor based face tracking followed by a 3D Morphable Face Model shape fitting. It can also be used to generate texture information if multiple poses are provided as input.
This method uses 3DMM as prior knowledge about faces. For tracking faces, cascaded-regression is used to regress sparse facial landmark points. Specifically, a series of linear regressors are used defined by,
$R_{n}: \delta \theta=\mathbf{A}_{n} \mathbf{f}(\mathbf{I}, \theta)+\mathbf{b}_{n}$
where $\mathbf{A}_{n}$ is the projection matrix and $\mathbf{b}_{n}$ is the offset (bias) of the $n$-th regressor, and $\mathbf{f}(\mathbf{I}, \theta)$ extracts HOG features from the image around the current landmark estimate $\theta$.
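A minimal sketch of one pass through such a cascade; `hog_features` is an assumed helper that extracts HOG descriptors around the current landmark estimate, so this is illustrative rather than the paper's code.
import numpy as np

def run_cascade(image, theta0, regressors, hog_features):
    """regressors: list of (A_n, b_n) pairs; theta holds the current landmark estimate."""
    theta = theta0
    for A_n, b_n in regressors:
        f = hog_features(image, theta)   # features extracted around the current landmarks
        theta = theta + A_n @ f + b_n    # linear update predicted by this stage
    return theta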
The landmarks are initialized at the locations from the previous frame if a video stream is available, using the model's mean landmarks as a regularization.
The pose of the camera can be known from the correspondence between the regressed sparse facial landmarks and the 3DMM model. The Gold Standard Algorithm of Hartley and Zisserman can be used to obtain a camera matrix.
The 3DMM shape parameters are estimated by minimizing the following cost function.
$\mathbb{E}=\sum_{i=1}^{3 N} \frac{\left(y_{m 2 D, i}-y_{i}\right)^{2}}{2 \sigma_{2 D}^{2}}+\|\boldsymbol{\alpha}\|_{2}^{2}$
where $N$ is the number of landmarks, $y$ are detected or labelled 2D landmarks in homogeneous coordinates, $\sigma_{2 D}^{2}$ is an optional variance for these landmark points, and $y_{m 2 D}$ is the projection of the 3D Morphable Model shape to 2D using the estimated camera matrix.
Expressions can be fit using a set of expression blendshapes $B$ that have been 3D scanned. These blendshapes are added to the PCA model as
$ S=\overline{\mathbf{v}}+\sum_{i}^{M} \alpha_{i} \sigma_{i} \mathbf{v}_{i}+\sum_{j}^{L} \psi_{j} \mathbf{B}_{j} $
where $\mathbf{B}_{j}$ is the $j$-th column of $\mathbf{B}$.
To solve for the blendshape coefficients, a solver that minimizes the distance between the current model projection and the sparse 2D facial landmarks is used.
Further, contour refinement is done by using the semi-fixed 2D-3D correspondences. A set of vertices $V$ along the outline of the 3D face is shortlisted to give the closest vertex for each detected 2D contour point according to
$\hat{v}=\underset{v \in V}{\arg \min }\|P v-y\|^{2}$
where $y$ is a 2D contour landmark, $\hat v$ is the optimal corresponding 3D vertex, and $P$ is the currently estimated projection matrix.
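A brute-force numpy sketch of this correspondence search (the array shapes are assumptions):
import numpy as np

def closest_contour_vertex(y, V_candidates, P):
    """y: (2,) detected 2D contour landmark; V_candidates: (M, 4) homogeneous 3D outline
    vertices; P: (3, 4) currently estimated camera matrix. Returns the best vertex index."""
    proj = (P @ V_candidates.T).T
    proj = proj[:, :2] / proj[:, 2:3]     # dehomogenize to 2D image coordinates
    return int(np.argmin(np.linalg.norm(proj - y, axis=1)))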
Results:
The result when this technique was applied to the sample image provided in the paper is shown below.
On our synthetic dataset, this paper failed to give any useful result as shown below.
Analysis of the results
The model also depends on 3DMMs, so we expect the same drawbacks as seen in 3DDFA. Moreover, the method failed to work on our synthetic dataset at all, showing that the cascaded regression failed to regress sparse features on our data. On real faces, the method gave good pose estimation thanks to cascaded regression, but failed to give a highly accurate face reconstruction. This is to be expected because the 3DMM is fitted using only sparse landmarks, as opposed to the dense alignment in 3DDFA. And because the method is again based on 3DMMs, we again observe how all the faces look similar due to the relatively little variety in faces that 3DMMs provide. This shows that there is still a lot of improvement to be made in 3D face reconstruction.
Deep 3D Portrait¶
Deep 3D Portrait is a CVPR2020 method which can produce high-fidelity 3D head geometry and head pose manipulation results. This method can also identify and extract hair and ears. Although even this method uses 3DMMs, we decided to evaluate it on our synthetic dataset and analyze the results.
Results of this method on some face portraits are shown below.
Technical Details:
Preprocessing The input portrait image is rescaled and centered before being segmented to obtain the head region (denoted as $\mathcal{S}$), which includes the face, hair, and ear regions.
3D Head Reconstruction The method creates a 3DMM face as well as a depth map for the other head regions. A 3DMM model can represent the face shape $\mathbf{F}$ and texture $\mathbf{T}$ by
$ \mathbf{F}=\mathbf{F}(\boldsymbol{\alpha}, \boldsymbol{\beta})=\overline{\mathbf{F}}+\mathbf{B}_{i d} \boldsymbol{\alpha}+\mathbf{B}_{e x p} \boldsymbol{\beta}$
$\mathbf{T}=\mathbf{T}(\boldsymbol{\delta})=\overline{\mathbf{T}}+\mathbf{B}_{t} \boldsymbol{\delta} $
where $\overline{\mathbf{F}}$ and $\overline{\mathbf{T}}$ are the average face shape and texture; $\mathbf{B}_{id}$, $\mathbf{B}_{exp}$, and $\mathbf{B}_{t}$ are the PCA bases of identity, expression, and texture respectively; $\boldsymbol{\alpha}, \boldsymbol{\beta},$ and $\boldsymbol{\delta}$ are the corresponding coefficient vectors. In this implementation, $\alpha \in \mathbb{R}^{80}$, $\beta \in \mathbb{R}^{64}$ and $\delta \in \mathbb{R}^{80}$. The parameters to be estimated can be represented by a vector $(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\delta}, \mathbf{p}, \boldsymbol{\gamma}) \in \mathbb{R}^{239}$, where $\gamma \in \mathbb{R}^{9}$ is the Spherical Harmonics coefficient vector for scene illumination. Let $I$ be the input image and $I^{\prime}$ its reconstruction, let $\mathcal{F}$ denote the rendered face region, and let $\|\cdot\|_{2}$ denote the $\ell_{2}$ norm of the residuals on the r, g, b channels. The photometric error that is minimized is $ l_{\text {photo}}=\int_{\mathcal{F}}\left\|I-I^{\prime}(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\delta}, \boldsymbol{\gamma}, \mathbf{p})\right\|_{2} $
The perceptual discrepancy between rendered and real faces is also minimized, as given below.
$ l_{p e r}=1-\frac{<f(I), f\left(I^{\prime}\right)>}{\|f(I)\| \cdot\left\|f\left(I^{\prime}\right)\right\|} $
where $f(\cdot)$ denotes a face recognition network for identity feature extraction. Few other losses such as 2D facial landmark loss and coefficient regularization loss are also applied.
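Both losses are easy to express; the sketch below assumes `face_mask` marks the rendered face region $\mathcal{F}$ and `id_net` is some face recognition feature extractor, so the names are placeholders rather than the paper's code.
import numpy as np

def photometric_loss(I, I_recon, face_mask):
    """Mean per-pixel L2 colour residual inside the rendered face region.
    I, I_recon: (H, W, 3) float arrays; face_mask: (H, W) boolean mask."""
    diff = np.linalg.norm(I - I_recon, axis=-1)   # l2 norm over the r, g, b channels
    return diff[face_mask].mean()

def perceptual_loss(I, I_recon, id_net):
    """1 - cosine similarity between identity features of the input and its reconstruction."""
    f1, f2 = id_net(I), id_net(I_recon)
    return 1.0 - (f1 @ f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))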
The rest of the method deals with reconstructing hair, ear and estimating pose. Since these are not of prime importance in 3D face reconstruction, we skip those details.
Results Results on a celebrity.
Result featuring a person of Asian origin.
Analysis of the results
Even though this is one of the latest and greatest methods of 3D face reconstruction, it is still plagued by the disadvantages of 3DMMs.
- Hair and ears were also extracted. This can find applications in VR and AR technologies.
- The reconstructed models, although good, still leave a lot of room for improvement even for a 2020 paper. The reason for this could be the limits of 3D faces that 3DMMs can capture.
- Relatively better results were obtained for an Asian face, but a lot of improvement is still required for any applications involving bio-metrics, medical procedures, etc.
PRNet¶
It seems that though 3DMMs simplify the process of representing 3D faces, they also come with their own disadvantages. We decided to explore methods that did not use 3DMMs as a prior. PRNet is one such method.
PRN is a method to jointly regress dense alignment and 3D face shape in an end-to-end manner. It uses an encoder-decoder network to generate UV position maps, which have a direct dense correspondence to facial geometry.
This method does not rely on a prior face model such as a 3DMM, which means it is not restricted to a low-dimensional model space. A CNN is trained to regress a UV position map from a single input image. The UV position map records the 3D information of a face and gives a dense correspondence to the semantic meaning of each point on the map.
UV Position Map: The UV Position map is expressed as $\operatorname{Pos}\left(u_{i}, v_{i}\right)=\left(x_{i}, y_{i}, z_{i}\right),$ where $\left(u_{i}, v_{i}\right)$ represents the UV coordinate of $i$ th point in face surface and $\left(x_{i}, y_{i}, z_{i}\right)$ represents the corresponding 3D position of facial structure with $\left(x_{i}, y_{i}\right)$ representing corresponding 2D position of face in the input RGB images and $z_{i}$ representing the depth of this point. In other words, UV position map is a texture map but the r, g, b values are replaced with the x, y, z coordinates.
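To make the representation concrete, here is a small sketch of turning a regressed UV position map back into a point cloud; the 256x256x3 shape and the optional mask are assumptions based on the description above.
import numpy as np

def uv_map_to_point_cloud(pos_map, face_mask=None):
    """pos_map: (256, 256, 3) UV position map with channel order (x, y, z);
    face_mask: optional (256, 256) boolean mask of valid UV positions."""
    points = pos_map.reshape(-1, 3)           # one 3D point per UV pixel
    if face_mask is not None:
        points = points[face_mask.reshape(-1)]
    return points                             # (x, y) align with the image, z is the depth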
Network Architecture: An encoder-decoder structure built from convolutional layers is used to learn the function that transforms input RGB images into UV position maps. The encoder starts with one convolution layer followed by 10 residual blocks which reduce the 256x256x3 input image into 8x8x512 feature maps; the decoder then contains 17 transposed convolution layers to generate the 256x256x3 UV position map. A kernel size of 4 is used for all convolutions, and ReLU is used for activation.
Loss Function: A simple MSE loss function could have been used, but the authors observed that the central region of faces have more discriminative features than other regions and therefore employed a weight mask to form a loss function that gives highest weightage to the 68 facial keypoints and relatively high weightage to nose, ears, eyes, and lips and 0 weightage to the neck region.
The loss function is given by,
$\operatorname{Loss}=\sum\left\|\operatorname{Pos}(u, v)-\widetilde{\operatorname{Pos}}(u, v)\right\| \cdot W(u, v)$
where $\operatorname{Pos}(u, v)$ is the predicted UV position map for each pixel coordinate $(u, v)$, $\widetilde{\operatorname{Pos}}(u, v)$ is the groundtruth UV position map, and $W(u, v)$ is the aforementioned weight mask.
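A numpy sketch of this weighted loss, assuming the weight mask `W` already encodes the keypoint/face/neck weighting described above:
import numpy as np

def weighted_uv_loss(pos_pred, pos_gt, W):
    """pos_pred, pos_gt: (256, 256, 3) predicted and ground-truth UV position maps;
    W: (256, 256) weight mask (e.g. highest at the 68 keypoints, zero on the neck)."""
    per_pixel = np.linalg.norm(pos_pred - pos_gt, axis=-1)   # ||Pos - Pos~|| per (u, v)
    return np.sum(per_pixel * W)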
Results are shown below.
3D models reconstruction on synthetic dataset.
Trying PRNet on Divy's face
Result showing that the reconstructed faces all look the same
Analysis of the results There is a clear difference between the results of this method and the other methods that we examined so far. The faces look very different from what a 3DMM based method produced.
- This method is not based on 3DMMs, hence the faces do not look like the generic faces of the other methods.
- The reconstructed models are not good enough for applications that require high accuracy such as bio-metrics, medical procedures, etc. The poor quality can be observed by looking at the ridges and furrows on all the output faces.
- Poor performance of dense alignment and reconstruction geometry was observed on the author's face. This performed far worse than the traditional 3DMM methods, or the EOS method that used cascaded regression for sparse landmarks.
- The method tended toward an average face. This can be seen as the reconstructed faces look quite similar.
- Training was done on the synthetic 300W-LP dataset, which is based on the Basel Face Model and has no Indian participants.
- Although the method claims to be model-free, in practice training was done using UV position maps generated using 3DMMs. This, however, does not mean that the reconstructed faces are limited to the 3DMM model space.
Reflecting on the results so far¶
So far in our journey we have learnt that there is a lot of improvement to be made in methods of 3D face reconstruction from a single image. This is to be expected, as 3D face reconstruction from a single image is an ill-posed problem and most methods must depend on prior knowledge about faces. One way of providing prior knowledge is using 3DMMs; however, 3DMMs have traditionally not been trained on Indian faces, which is a practical problem.
Next we decided to explore creating 3D faces using stereo images. Stereo images capture more information about the scene and constructing 3D models is a bit easier. If stereo could be used to create 3D faces, we could find groundtruth data for training our own 3DMMs with Indian faces by using stereo images from 3D Indian movies or later by capturing our own data.
3D Face Reconstruction Using Stereo Data¶
Stereo images are captured by stereo cameras, which are essentially 2 cameras separated (generally horizontally) by a distance called the baseline. The two cameras capture photographs at the same time, so two images of the scene are captured from slightly different positions. We can evaluate the disparity between the objects in the two images to infer the distance at which those objects lie from the camera. For example, a subject very close to the camera would be at very different positions in the two stereo images, whereas a subject placed far away from the camera would be at almost the same position in both images. This information can be used to infer the depth of a scene. We decided to explore whether this technique would be able to help us find the depth over a person's face and hence enable us to reconstruct the 3D model of the face.
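The underlying geometry is simple triangulation: for a rectified stereo pair, depth is inversely proportional to disparity, as in the sketch below (a generic formula, not tied to any particular dataset).
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (in pixels) to metric depth for a rectified stereo pair.

    depth = f * B / d; pixels with zero or invalid disparity are mapped to inf.
    """
    depth = np.full_like(disparity, np.inf, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth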
We decided to use the readily available INRIA stereo face dataset to evaluate the current methods of 3D face reconstruction using stereo data and obtain some preliminary results. The dataset contains videos shot with stereo cameras. A still from one of the videos is shown below.
Real-Time Self-Adaptive Deep Stereo¶
We decided to use some of the latest deep neural networks to estimate the depth maps of the faces. However, the problem with deep neural networks trained end-to-end is that they perform very differently when the input differs from the data seen at training time; for example, real vs. synthetic or indoor vs. outdoor data plays a huge role in the performance of these networks. We decided to use the CVPR2019 paper "Real-Time Self-Adaptive Deep Stereo", as it is an unsupervised method that preserves its accuracy independent of the sensed environment.
Results The result on the stereo image pair from the INRIA dataset shown above is given below.
Analysis of the Results The results were not encouraging. The method failed to capture the true disparity of the scene. Moreover, it can be seen that the variation in the estimated disparity over the face of the person on the left is not accurate: the tip of the nose, which should be closest to the camera, is dark, whereas her collar, which is further from the camera, is brighter. The method failed to give useful results on this dataset.
The result from the paper was not very encouraging for faces. We perhaps went too far too fast, and should first establish a baseline using traditional methods for estimating disparity with stereo images, such as SGBM.
Semi-global Block Matching¶
SGBM (Semi-Global Block Matching) is a traditional way (Hirschmuller, 2008) of estimating depth maps from stereo images. We used this technique on the INRIA dataset to establish a baseline result which the newer methods should outperform. A representative invocation is sketched below, and the results of this technique are presented after it.
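For reference, a minimal OpenCV SGBM invocation might look like the following; the file paths and parameter values are illustrative assumptions rather than the exact settings we used.
import cv2
import numpy as np

# Load a rectified stereo pair as grayscale (paths are placeholders).
left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# Illustrative parameters; real values depend on the baseline, resolution and depth range.
block_size = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be divisible by 16
    blockSize=block_size,
    P1=8 * block_size ** 2,      # smoothness penalties, scaled with the block size
    P2=32 * block_size ** 2,
)

disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # SGBM returns fixed-point disparities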
Results
Analysis of Results
The results were far from perfect, but that was expected from an older algorithm. The lady on the right blends in completely with the wall behind her at that distance. There is not enough variation over the faces in the disparity map to extract any useful depth information; the disparity values on the faces are almost constant, meaning that there is in fact not much depth information there. These results do not give much support to the idea of creating 3D face datasets of Indian people from 3D Indian movies, but more work is required to come to a definite conclusion. Perhaps using different baseline distances and higher resolution images could give better results.
PIFuHD¶
Before carrying on our work with stereo, we came across another CVPR2020 paper that was relevant to our work.
PIFuHD is a CVPR2020 paper that lays out a method of reconstructing complete human models from head-to-toe images of people. This method is currently the state-of-the-art on single image human shape reconstruction.
The method uses an end-to-end multi-level framework that uses high resolution (1K) images to reconstruct 3D geometry of clothed humans.
Pixel-Aligned Implicit Function
Instead of estimating the occupancy of each voxel in 3D space explicitly, PIFu models a function $f(\mathbf{X})$ which predicts the binary occupancy value for any given 3D position in continuous camera space $\mathbf{X}=\left(\mathbf{X}_{x}, \mathbf{X}_{y}, \mathbf{X}_{z}\right) \in$ $\mathbb{R}^{3}:$
$f(\mathbf{X}, \mathbf{I})=\left\{\begin{array}{ll} 1 & \text { if } \mathbf{X} \text { is inside mesh surface } \\ 0 & \text { otherwise } \end{array}\right.$
where I is a single $\mathrm{RGB}$ image.
The advantage of modelling a function over explicitly estimating the occupancy of voxels is memory efficiency. Additionally, not discretizing the 3D space into voxels also means that the obtained 3D geometry will be of higher fidelity.
PIFu models the function $f$ via a neural network architecture.
The way $f$ works is that it first extracts an image feature embedding $\Phi(\mathbf{x}, \mathbf{I})$ at the orthographically projected 2D location $\mathbf{x}$. Then the occupancy of the query point $\mathbf{X}$ is estimated, thus
$f(\mathbf{X}, \mathbf{I})=g(\Phi(\mathbf{x}, \mathbf{I}), Z)$
where $Z=\mathbf{X}_{z}$ is the depth along the ray defined by the 2D projection x. Note that all 3D points along the same ray have exactly the same image features $\Phi(\mathbf{x}, I)$ from the same projected location $\mathrm{x},$ and thus the function $g$ should focus on the varying input depth $Z$ to disambiguate the occupancy of 3D points along the ray.
A Convolutional Neural Network (CNN) architecture is used for the 2D feature embedding function $\Phi$ and a Multilayer Perceptron (MLP) for the function $g$, as shown.
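Schematically, evaluating a single query point could look like the sketch below; `feature_map`, `mlp` and `ortho_project` stand in for the CNN $\Phi$, the MLP $g$ and the projection, and bilinear sampling is reduced to nearest-neighbour indexing for brevity.
import numpy as np

def query_occupancy(X, feature_map, mlp, ortho_project):
    """X: (3,) query point in camera space; feature_map: (H, W, C) image feature embedding;
    mlp: callable mapping a (C + 1,) vector to an occupancy in [0, 1];
    ortho_project: callable mapping X to pixel coordinates (u, v)."""
    u, v = ortho_project(X)
    phi = feature_map[int(round(v)), int(round(u))]  # pixel-aligned feature at the 2D projection
    z = X[2]                                         # depth along the ray disambiguates points
    return mlp(np.concatenate([phi, [z]]))           # estimated occupancy for this 3D point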
The network works on 1K resolution images, i.e. 1024x1024. It is divided into two modules: a coarse level which takes as input 512x512 downsampled images and produces backbone image features of 128x128 resolution, and a fine level which adds more subtle details by taking the original 1024x1024 resolution image as input and producing backbone image features of 512x512 resolution. The fine level also takes 3D embedding features from the coarse level instead of the absolute depth value as in the case of PIFu.
The coarse level is the same as PIFu with a little modification in that it takes predicted frontside and backside normal maps. The normal maps are predicted using the pix2pixHD network.
$ f^{L}(\mathbf{X})=g^{L}\left(\Phi^{L}\left(\mathbf{x}_{L}, \mathbf{I}_{L}, \mathbf{F}_{L}, \mathbf{B}_{L}\right), Z\right) $ where $\mathbf{I}_{L}$ is the lower resolution input and $\mathbf{F}_{L}$ and $\mathbf{B}_{L}$ are the predicted normal maps at the same resolution. $\mathbf{x}_{L} \in \mathbb{R}^{2}$ is the projected 2D location of $\mathbf{X}$ in the image space of $\mathbf{I}_{L}$.
The fine level is described as $ f^{H}(\mathbf{X})=g^{H}\left(\Phi^{H}\left(\mathbf{x}_{H}, \mathbf{I}_{H}, \mathbf{F}_{H}, \mathbf{B}_{H}\right), \Omega(\mathbf{X})\right) $ where $\mathbf{I}_{H}, \mathbf{F}_{H}, \mathbf{B}_{H}$ are the input image, frontal normal map, and backside normal map respectively at a resolution of $1024 \times 1024$. $\mathbf{x}_{H} \in \mathbb{R}^{2}$ is the 2D projection location at high resolution.
The function $\Phi^{H}$ encodes the image features from the high-resolution input and has structure similar to the low-resolution feature extractor $\Phi^{L}$. $\Omega(\mathbf{X})$ is a $3 \mathrm{D}$ embedding extracted from the coarse level network, where we take the output features from an intermediate layer of $g^{L}$.
Loss function The loss function is an extended Binary Cross Entropy loss, which uses points sampled from a mixture of uniform volume samples and importance sampling around the surface using Gaussian perturbation around uniformly sampled surface points. This gives additional weightage to the points near the surface of the model, giving sharper results.
$ \begin{aligned} \mathcal{L}_{o} &=\sum_{\mathbf{X} \in \mathcal{S}} \lambda f^{*}(\mathbf{X}) \log f^{\{L, H\}}(\mathbf{X})+(1-\lambda)\left(1-f^{*}(\mathbf{X})\right) \log \left(1-f^{\{L, H\}}(\mathbf{X})\right) \end{aligned} $
where $\mathcal{S}$ denotes the set of samples at which the loss is evaluated, $\lambda$ is the ratio of points outside surface in $\mathcal{S}, f^{*}(\cdot)$ denotes the ground truth occupancy at that location, and $f^{\{L, H\}}(\cdot)$ are each of the pixel-aligned implicit functions from the network.
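In code this is an ordinary weighted binary cross-entropy evaluated at the sampled points; a numpy sketch follows (the epsilon clipping is an assumption for numerical stability, and in practice one would minimize the negative of the paper's log-likelihood expression).
import numpy as np

def occupancy_objective(pred, gt, lam, eps=1e-7):
    """pred, gt: (n_samples,) predicted and ground-truth occupancies at the sampled points;
    lam: ratio of points outside the surface in the sample set."""
    pred = np.clip(pred, eps, 1.0 - eps)
    # Mirrors the expression above; training would minimize the negative of this quantity.
    return np.sum(lam * gt * np.log(pred) + (1.0 - lam) * (1.0 - gt) * np.log(1.0 - pred))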
Analysis of Results The full body reconstruction was not perfect, but the face was extracted properly. It must be noted that the results were from a downscaled image, as that was the maximum that would fit into Colab, so the full capability of the network was not used. Regardless, the results were far better than what we had obtained so far. The reconstructed faces resembled the people. Since this method does not intrinsically rely on 3DMMs, many of the disadvantages that come with them were minimized. We observe that the reconstructed models were able to capture a wide variety of human faces, something that may not have been possible in the restricted face space of 3DMMs.
On closer inspection of Modi's model, it can be seen that Modi's kurta has a cut in the back that is representative of the cut that is seen in western blazers. This again shows how the methods are biased towards foreigners.
This network was trained on a 3D-scanned dataset of people from renderpeople.com. To train our own model with Indian faces, we would need a large number of highly accurate 3D scans of Indian people.