Project by: Akshay Priyadarshi and Pratyush Kumar Das
The project aims to detect the diffuse thermal Sunyaev-Zeldovich (tSZ) effect from the gas filaments between Luminous Red Galaxy (LRG) pairs by stacking the individual frequency maps. Detecting the SZ signal can help in determining and verifying the value of the Hubble constant. In the ILC algorithm, we compare the temperatures of the stacked image generated by us with the source image and try to minimise the temperature difference. We expect that assigning pixel-dependent weights, instead of a single weight per sub-image, will bring us closer to the source image and thereby reduce the error. We will also test whether this method of computation reduces the runtime compared to the previous method.
We will use around 100,000 LRG pairs from the SDSS DR12 catalogue. The selection criteria ensure that the data will have minimal contamination, i.e. they guard against the Galactic CO emission as well as the tSZ signal from clusters of galaxies.
In this process, the authors of the source paper first stacked the Planck channel maps and then performed the Internal Linear Combination method to extract the diffuse \(y_{SZ}\) signal. This approach makes the component separation much easier, as the stacking greatly suppresses the noise and the Cosmic Microwave Background (CMB) contributions. Since the dust foreground becomes spectrally homogeneous across the stacked patch, it further aids the component separation. In this way the CMB component is removed and the remaining foreground becomes simpler even before the component separation algorithm is applied.
Maps of the LRG pairs were taken in six different frequency channels from 70 GHz to 545 GHz. Several sources may emit in this frequency range, so the total emission observed in a given frequency channel map, \(x_i\), can be expressed as a linear combination: \[x_i(p)=\sum_{j=1}^{N}a_{ij}S_j(p) + n_i(p)\] This equation gives the superposition of \(N\) different astrophysical components \(S_j\) and the instrumental noise \(n_i\), where the emission law of each individual source is known. Here, \(p\) denotes the pixel number, and \(a_{ij}\) is the mixing matrix, which gives the weight of each individual source in the map. The equation can be simplified further by assuming that the mixing matrix is uniform throughout the sky: \[x(p)=aS(p) + n(p)\] In this equation, \(x(p)\) is the vector of the N observations, \(S(p)\) is the vector of the N source components, and \(n(p)\) is the vector of the N noise maps. Using the Internal Linear Combination (ILC) algorithm, an astrophysical component can be estimated from the observed maps as \[\hat{S}(p)=\sum_i w_i x_i(p)\] where \(w_i\) are the weights. [To try out this way of obtaining the weights, we plotted it graphically for numbers instead of maps, and it appeared to hold over a wide range of values.] The weights of this linear combination are calculated by minimising the variance of \(\hat{S}\) under the constraint that \(S\) is preserved, i.e. \(\sum_i w_i a_i=1\).
Similarly, a frequency map may contain CMB, dust, other foregrounds, noise, and tSZ components. The tSZ component that we are interested in can be extracted with this ILC method, i.e. \(y_{SZ}(p)=\sum_i w_i x_i(p)\), where \(x_i\) are the maps at the six different frequency channels.
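For reference, the constrained variance minimisation described above has the standard closed-form solution \(w = C^{-1}a / (a^{T}C^{-1}a)\), where \(C\) is the empirical channel-channel covariance of the maps. Below is a minimal numpy sketch on toy data; the array shapes and the achromatic mixing vector are illustrative assumptions, not the actual Planck setup.

```python
import numpy as np

def ilc_weights(maps, a):
    """Minimum-variance ILC weights: w = C^{-1} a / (a^T C^{-1} a),
    where C is the empirical channel-channel covariance of the maps
    and a is the mixing vector of the component to be preserved."""
    C = np.cov(maps)                  # (n_freq, n_freq) covariance across channels
    Cinv_a = np.linalg.solve(C, a)    # C^{-1} a
    return Cinv_a / (a @ Cinv_a)

# Toy example: 6 channels, 10^4 pixels. For the tSZ extraction, `a` would
# be the tSZ spectral response at each channel (assumed known).
rng = np.random.default_rng(0)
maps = rng.normal(size=(6, 10_000))
a = np.ones(6)                        # illustrative mixing vector
w = ilc_weights(maps, a)
s_hat = w @ maps                      # S(p) = sum_i w_i x_i(p)
print(w @ a)                          # preservation constraint: equals 1
```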
The basic change we propose is that, while calculating the weights, we consider individual pixels instead of the whole image. Earlier, the generated weights formed a uniform matrix; now each component of the matrix will depend on the individual pixels of each sub-image.
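In equation form (our reading of the proposal), the scalar weights \(w_i\) of the standard ILC become weight maps \(w_i(p)\): \[y_{SZ}(p)=\sum_i w_i(p)\,x_i(p)\] If the ILC preservation constraint is retained, it would now hold pixel by pixel, \(\sum_i w_i(p)\,a_i=1\) for every \(p\).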
The project timeline:
Week | Initially Proposed Work | Updates/Progress on the go |
---|---|---|
Week 4: | Proposal and Initiation. Searching for appropriate reading material and sources. | Done as planned. |
Week 5: | Literature reading and understanding the science and computation behind the project. | Initiated the plan. We have many technical doubts, which are being resolved along the way. |
Week 6: | Continued from the previous week. | We found the concepts difficult to grasp, so we tried to understand them computationally while applying them. Started the work of weeks 7 and 8. |
Week 7: | Data Selection for the project. Milestone Presentation. | Milestone presentation postponed. Started the code implementation and testing on a small scale. |
Week 8: | Code preparation and Training. | Milestone presentation to be done along with a test run. |
Week 9: | Testing the code. | An ML code using linear regression was prepared. The results were suitable and up to the mark. This led us to complete the planned code implementation. |
Week 10: | Code implementation. | As the proposed task was over, we tried implementing different methods, viz. ANN and SVR. |
Week 11: | Continued from the previous week. | Faced some difficulty in implementing the ANN due to computational limitations, but both the ANN and SVR (linear and RBF) implementations were completed; result analysis is awaited. |
Week 12: | Additional work of implementing the Random Forest algorithm. Comparison of the 5 different models. Result analysis, error estimation, conclusion. Final report presentation. | As per the proposal. |
The changes in the timeline are updated through the course of the project in the updates column.
Step-1: We will select a random number of pairs from the 88,000 samples that we have, and then stack them on the simulated frequency maps and the LIL map. Using this method, we hope to generate 100–1000 random samples, which will be our training dataset.
Step-2: Next, we will train the neural network on this training dataset and obtain the pixel-dependent weight maps such that the difference between the linearly combined map and the stacked LIL map is minimised (see the sketch after this list).
Step-3: Subsequently, we will use other stacked pairs, different from the training set, to test the accuracy of our trained model.
Step-4: If the predictions are satisfactory, we will run the code on all the remaining available data. Then we will compare the outputs with the respective reported results.
Step-5 (New): Five different algorithms were used to generate the model. The results were compared, and the best was chosen for implementation on the rest of the dataset.
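A minimal sketch of Steps 1–3 under assumed array shapes; the random arrays stand in for the real stacked patches. For simplicity it uses one independent linear regression per pixel in place of the neural network mentioned in Step-2; any of the regressors discussed below could be substituted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed shapes: each training sample is one random stack of pairs,
# giving 6 stacked channel patches (features) and one stacked LIL patch (target).
n_samples, n_freq, n_pix = 1000, 6, 64 * 64
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_freq, n_pix))  # stand-in for stacked channel maps
y = rng.normal(size=(n_samples, n_pix))          # stand-in for stacked LIL maps

# Fit one regression per pixel -> pixel-dependent weight maps w_i(p)
weights = np.empty((n_freq, n_pix))
for p in range(n_pix):
    reg = LinearRegression(fit_intercept=False).fit(X[:, :, p], y[:, p])
    weights[:, p] = reg.coef_

# Prediction for a new stacked observation x of shape (n_freq, n_pix):
# y_hat(p) = sum_i w_i(p) x_i(p)
x_new = rng.normal(size=(n_freq, n_pix))
y_hat = np.einsum("ip,ip->p", weights, x_new)
```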
First chapter: Understanding the science behind our project
Second chapter: In this section we choose our training set appropriately, train the neural network, and finally test its accuracy on the test data.
Final chapter: In this section we will implement the upgraded algorithm. Since we have a huge dataset, we will also use MPI to parallelise the code.
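A minimal mpi4py sketch of the parallelisation mentioned above: the list of LRG pairs is split across ranks, each rank stacks its share locally, and the partial sums are reduced on rank 0. The patch loader and array shapes are placeholders, not the project's actual code.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def load_pair_patch(pair_id):
    # Placeholder: the real code would cut out the patch around
    # LRG pair `pair_id` from each of the 6 channel maps.
    rng = np.random.default_rng(pair_id)
    return rng.normal(size=(6, 64 * 64))

n_pairs = 88_000
local_sum = np.zeros((6, 64 * 64))
for pair_id in range(rank, n_pairs, size):   # round-robin split across ranks
    local_sum += load_pair_patch(pair_id)

total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    stacked = total / n_pairs                # final stacked channel patches
    print(stacked.shape)
```

Run with, e.g., `mpiexec -n 4 python stack_pairs.py` (file name illustrative).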
As an initial run, the ML codes were prepared and run on 100 randomly selected samples; 99 samples were used for training and the remaining 1 for testing.
"Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope."
The accuracy of the values found using this method came out to be very high.
"Artificial Neural Networks (ANN) are multi-layer fully-connected neural nets. They consist of an input layer, multiple hidden layers, and an output layer."
"The objective of the Support Vector Machine - Regression (SVR) gives us the flexibility to define how much error is acceptable in our model and will find an appropriate line (or hyperplane in higher dimensions) to fit the data."
"Gaussian RBF (Radial Basis Function) is one of the popular Kernel methods used in SVR models for more. RBF kernel is a function whose value depends on the distance from the origin or from some point." (support vector machine)
The accuracy of the values found with the RBF kernel came out to be very low.
The accuracy with the linear kernel was very low as well.
As none of the SVR results were good enough, we dropped the idea of using SVR as well. However, the results hinted that the linear kernel ran better than the RBF kernel, so we can expect our dataset to work better with other linear models too.
"Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model's prediction."
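For concreteness, a minimal scikit-learn sketch of how the five models in the table below can be compared on a held-out split. The synthetic arrays stand in for the real per-pixel training data, and the hyperparameters are illustrative, so the printed numbers will not reproduce the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 channel values per row -> one target pixel value.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ rng.normal(size=6) + 0.01 * rng.normal(size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "ANN": MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000),
    "SVR 'rbf'": SVR(kernel="rbf"),
    "SVR 'linear'": SVR(kernel="linear"),
    "Random Forest": RandomForestRegressor(n_estimators=100),
}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: R2 = {r2_score(y_te, y_pred):.2f}, "
          f"MSE = {mean_squared_error(y_te, y_pred):.2e}")
```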
S.No. | Algorithm | R\(^2\) (higher is better) | MSE (lower is better) |
---|---|---|---|
0. | Stack-First Approach (source paper) | 0.59 | \(1.75\times10^{-16}\) |
1. | Linear Regression | 0.66 | \(8.10\times10^{-17}\) |
2. | ANN | NA | NA |
3. | SVR 'rbf' | \(-2.86\) | \(1.38\times10^{-15}\) |
4. | SVR 'linear' | 0.01 | \(4.86\times10^{-16}\) |
5. | Random Forest | 0.63 | \(1.0\times10^{-16}\) |
After trying out the five different algorithms for obtaining the weights, we found that the linear regression and random forest algorithms yield the highest accuracy. The linear regression algorithm is also comparatively easier to code and run.
Link to presentation: link.
Link to drive folder containing our codes: link.
We would like to thank Mr. Baibhav Singari and Dr. Tuhin Ghosh for their help in the project selection and for their guidance.
Odd Semester 2020-21
Last updated on 15-12-2020.
~ Akshay Priyadarshi (1611012)a & Pratyush Kumar Das (1611082)a
a School of Physical Sciences, National Institute of Science Education & Research (NISER), Bhubaneswar.